Information Flows Thesis Research

The following is a description of my thesis research. The main topic I wanted to examine was how one’s social network centrality affects their credibility and their ability to spread information across the network. This question is especially meaningful in professional social networks, in which the actors are professionals who know how to assess the quality of the information they receive from their peers.

In this research, I conducted the first in-depth study of the behavior of information flows in the open source software (OSS) world. I carried out the first file-level study of global behavior, taking advantage of the mass of OSS information available online from SourceForge for the years 2005-2013. I created an information flow network based on code copying and reuse across all these years. I then combined this network with the social networks studied in previous research (Fershtman and Gandal 2011; Gandal and Stettner 2014), pairing social network structure and interactions with exact measurement of information flow to gain a novel understanding of the way developers operate in OSS.

Several main questions were asked (and answered) in my research:

Is the centrality of the developer or the project associated with information propagation?
What does the spread of open source code information look like over time, and what is its pace?
Is there a Two Step Flow of Communication (Katz 1957) characteristic of open source code, in which central actors bring information into the network from external sources?
What is the reach of different code files over the network, and how is it affected by the originator of these files?
Do central actors in the open source social network use their connections to gather code from peers that is relevant to their own projects?

The Dataset:

Using the dataset of code similarities, I created an information flow network for each year, based on the quartet File-Project-Developer-Year.
Each node is a code file. A node has outbound connections to other similar code files created up to the given year, and an inbound connection if it is similar to a previously created file. A connection can exist between two nodes if and only if the two files belong to different projects, were created by different developers, and were both created up to and including the given year.
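The connection rule above can be sketched in a few lines of Python. This is only an illustration, not the code used in the thesis: the `CodeFile` record, the `similar_pairs` input and both function names are hypothetical, and I assume here that an edge points from the earlier file to the later one and that same-year pairs are skipped because the year alone does not tell which file came first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeFile:
    # Hypothetical record for one File-Project-Developer-Year quartet.
    file_id: str
    project: str
    developer: str
    year: int  # year the file was created


def can_connect(a, b, year):
    # Different projects, different developers, and both files
    # created up to and including the given year.
    return (
        a.project != b.project
        and a.developer != b.developer
        and a.year <= year
        and b.year <= year
    )


def build_flow_edges(files, similar_pairs, year):
    """Return directed edges (earlier_file_id, later_file_id) for one year."""
    by_id = {f.file_id: f for f in files}
    edges = set()
    for id_a, id_b in similar_pairs:
        a, b = by_id[id_a], by_id[id_b]
        if not can_connect(a, b, year):
            continue
        if a.year == b.year:
            # Assumption of this sketch: direction is unknown, so skip.
            continue
        source, target = (a, b) if a.year < b.year else (b, a)
        edges.add((source.file_id, target.file_id))
    return edges
```

Calling `build_flow_edges` once per year from 2005 to 2013 would give one edge set per yearly network.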

I combined the information flow network with two other networks:
(i) Developer network: Two developers are connected in the network if both were members of the same project in the same year.
(ii) Project network: Two projects are connected in the network if they shared a developer in the same year.
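As a rough sketch of how these two networks follow from the same membership data, assume a hypothetical list of `(developer, project, year)` records. The function name and the input format are my own illustration, not the actual format of the SourceForge data.

```python
from collections import defaultdict
from itertools import combinations


def build_networks(memberships, year):
    """Return (developer_edges, project_edges) for one year.

    memberships: iterable of (developer, project, year) tuples (hypothetical format).
    Edges are undirected and stored as sorted pairs.
    """
    devs_by_project = defaultdict(set)
    projects_by_dev = defaultdict(set)
    for dev, project, member_year in memberships:
        if member_year == year:
            devs_by_project[project].add(dev)
            projects_by_dev[dev].add(project)

    # (i) Developers are connected if they were members of the same project.
    developer_edges = set()
    for devs in devs_by_project.values():
        developer_edges.update(combinations(sorted(devs), 2))

    # (ii) Projects are connected if they shared a developer.
    project_edges = set()
    for projects in projects_by_dev.values():
        project_edges.update(combinations(sorted(projects), 2))

    return developer_edges, project_edges
```

Both networks are two projections of one developer-project membership table, which is why a single pass over the records is enough to build them.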

Spread of information over time:

Checking the reach of code files, developers and projects, I found power law effects: a few developers, projects and code files account for a very large amount of copying and reuse. I then checked the speed of information spread via code reuse. I found evidence consistent with theoretical and empirical work on information flows in social networks: the cumulative spread of information in OSS over time follows an S-shaped curve, or a bell curve if we look at the temporal distribution. The dataset showed an interesting phenomenon: the first years of a code file’s existence are the most important for its spread, and later years exhibit less code reuse, suggesting technological aging.
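The two views of the same spread, the per-year distribution and the cumulative curve, are related by a running sum. A minimal sketch, assuming a hypothetical list of `(created_year, reused_year)` reuse events; the function name and input format are illustrative and not taken from the thesis:

```python
from collections import Counter
from itertools import accumulate


def reuse_by_age(events):
    """Return (yearly, cumulative) reuse counts indexed by code age in years.

    events: iterable of (created_year, reused_year) pairs (hypothetical format).
    """
    ages = Counter(
        reused - created for created, reused in events if reused >= created
    )
    if not ages:
        return [], []
    # Temporal distribution: number of reuse events at each age.
    yearly = [ages.get(age, 0) for age in range(max(ages) + 1)]
    # Cumulative spread: running total of the yearly counts.
    cumulative = list(accumulate(yearly))
    return yearly, cumulative
```

If the yearly counts rise and then fall, as in a bell curve, their running total flattens at both ends, which gives the S shape.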

Central actors behavior:

My main results were ambiguous about how strongly centrality in the social network is associated with being a source and originator of information. The results suggest that what matters more is the activity of the project itself, together with signs that the code is indeed valuable, namely lower modification counts and more years of existence. I also saw that the license agreement of a project has a direct effect on how much its code is reused: more permissive license types were associated with greater code spread.
I found that central developers in SourceForge act more as mavens than as connectors, bringing new information into the network rather than connecting developers and projects.

Two Step Communication Flows:

The online environment is ideal for two step communication flows (Katz 1957), because it is a setting in which actors can bring information from other online sources into their own social network. Using this dataset, however, I was able to refute their existence in SourceForge.

Concluding thoughts:

I have started down the path of understanding and measuring social interactions and information spread in OSS on a large scale. However, there is still much to be done with regard to homophily, aggregating the social characteristics of all peers who reuse a piece of code, understanding the importance of license agreements for code reuse, and more.
