So, what has been happening the past year? Well, for one thing, we have made progress in our open source research, created the data set, published a paper, I have finished my thesis and more is to come. The following blog posts will be focused on our research.
Back in January 2014 I joined Prof. Neil Gandal, Head of the Economics department and Dr. Uriel Stettner from the Business department in Tel Aviv university to work together on a research involving social science within the open source code development sphere.
The goals of our research is to better understand the open source community, what makes open source projects succeed, what type of commercial companies use open source, where do open source developers choose to work and why, what are the interactions between open source developers and projects and more.
My main focus in the research was to develop and investigate a data set of measures of “information spillovers” between projects. In other words, I constructed a network of flow of code between projects, searching for similar code files and code reuse between projects.
The work comprised of downloading all files from SourceForge from year 1996 to 2015. Then, parsing out the text to a uniform text format from all these files and creating a similarity measure between the different code files. I will explain the technical details on how it was all done in a different post.
Why SourceForge, and not, say, GitHub? SourceSorge exists from the 90s, it was the most popular platform long before GitHub. This allows us to run a thorough social science research over time, to check for temporal and trending changes.
Two main topics when looking for code flows:
- Finding similar code files:
We separated code files to programming languages. Within each language, we measured similarity between two files by examining the text of function names, variable names, code fragments and comments within the code. The similarity between documents is based on their joint score in vector space representation. Accordingly, every word in each document is assigned two scores: (1) Term Frequency (TF), which measures the number of times the word appears within the same document compared to all other words in the document and (2) Inverted Document Frequency (IDF), which measures the number of documents in the entire text universe (i.e., all files) in which the particular word appears. Thus, the importance of a word in a document is proportional to its TF score and inversely proportional to its IDF score. For example, in the context of our research, the word “source” is important because it appears many times in this document and does not likely appear in many other economic papers. On the other hand, the word “the” is less important, because it is common in the English language and appears in many other documents. Using a “standard” combined TF-IDF score of each word within a document (file,) we constructed a representation vector of size K, where K is the number of distinct words. Each entry in K is the TF-IDF score of the corresponding word. We then calculated the cosine distance between the vector scores of all pairs of files across projects to determine the similarity between the documents. We chose a minimum cutoff score for similarity, above which, every similarity pair is considered a proper reuse. We checked that score manually over 30 pairs of similar files to make sure that if their score is above the cutoff, they are indeed similar.
- Creating a chronological order of the code files:
We looked at the addition date of a file. File X was considered a reuse of file Y if X was similar to Y (similarity score above the cutoff) and X was added to its’ repository after Y. We then constructed a “reuse” connection network between the projects where project A has a directed connection to project B if there is at least one pair of similarity files belonging to these projects such that the original file belongs to A and the destination file belongs to B. Note that if Project B copied from Project A, and Project C copied the same file from project B, project A gets credit as the source in both cases. In this case, project B is just a facilitator and does not get credit as the source.
At the first stage we focused on code files in Java, later we have also constructed the code flow network for C\C++. These are the dominant languages in SourceForge, comprising of 9M and 11M files, respectively. The next common language is C#, with less than 1M files.
Following is an example of how the Java code flow network looked like in 2005. The blue dots are projects and there is an arrow between two dots if code was transferred from one project to another.