We crawled all projects on SourceForge and saved their complete data for every year since 1998. Two main source code repository types were in use: SVN and CVS. CVS is an older version control system that is less widely used today.
Using the management APIs for SVN and CVS, we first tried writing a program that queries every project for its content for each year of its existence, taking only the files that changed within that year. This proved too time-consuming. We then changed tactics and downloaded the entire repositories, in effect creating a copy of all SourceForge projects in our database. This amounted to about 8 TB of data.
Each project comprises many different file types that hold relevant information: source code files, plain-text documents, Word documents, PDFs, compressed archives that can contain further documents, configuration files and more. To use all this data, we needed to parse every document within a project and save it in a uniform format.
We created a program that traverses the directory tree of every open source project, for every year of its existence, and extracts the important information for each file, with the help of the open source project svn-search. For each file we extracted the text of the file, the author of the file, the last action performed on the file within the corresponding year (edit, delete, add), the time of the last action, the author of the last action, the comments of the last author of the file, the size of the file, the location of the file in the project and more. We had to adapt our program to handle both SVN and CVS repository types, to parse the file formats we encountered in SourceForge, and to save the results in our own XML format. This created a base of 1 TB of standardized data which we could then analyze.
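To make the per-file record concrete, here is a minimal sketch of how such a record could be serialized. The `FileRecord` class, the `record_to_xml` function and all element names are illustrative assumptions; the actual XML schema is not specified here.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class FileRecord:
    """Hypothetical per-file, per-year record; all field names are assumptions."""
    project: str
    year: int
    path: str
    author: str
    last_action: str         # "add", "edit" or "delete"
    last_action_time: str    # timestamp of the last action
    last_action_author: str
    last_comment: str
    size_bytes: int
    text: str


def record_to_xml(record: FileRecord) -> str:
    """Serialize one record to XML, with one child element per field."""
    root = ET.Element("file")
    for name, value in vars(record).items():
        child = ET.SubElement(root, name)
        child.text = str(value)  # ElementTree escapes <, > and & on output
    return ET.tostring(root, encoding="unicode")


example = FileRecord(
    project="demo-project",
    year=2005,
    path="src/Main.java",
    author="alice",
    last_action="edit",
    last_action_time="2005-06-01T12:00:00",
    last_action_author="bob",
    last_comment="Fix parsing of List<String> arguments",
    size_bytes=58,
    text="class Main { java.util.List<String> args; }",
)
xml_text = record_to_xml(example)
```

Keeping one flat element per field makes each record straightforward to map onto the fields of a text index.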
We used a Lucene-based text index to index code files for every programming language, focusing on the programming languages Java and C/C++. Our XML files were designed to work seamlessly with the Lucene index. The Java source code index comprises 9 million files across 130,000 projects. We divided it into 12 sub-indexes (shards) to improve search performance, spread over a cluster of 2 physical servers, each with 16 CPU cores, 128 GB of RAM and 500 GB SSDs. We added Java keywords to the list of stop words in Solr, so that these words are ignored when searching for similar files.
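The effect of treating Java keywords as stop words can be illustrated outside of Solr. The sketch below is not the Solr configuration used here. It is a plain-Python stand-in that tokenizes source text, drops a small and incomplete set of Java keywords, and compares two files with Jaccard similarity, a measure assumed here only for simplicity.

```python
import re

# A partial list of Java keywords, for illustration only.
JAVA_STOP_WORDS = frozenset({
    "public", "private", "protected", "static", "final", "class", "void",
    "int", "return", "new", "if", "else", "for", "while", "import", "package",
})

TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokens(source, stop_words=frozenset()):
    """Return the set of identifier-like tokens, minus any stop words."""
    return {t for t in TOKEN_RE.findall(source) if t not in stop_words}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


file_a = "public class Parser { public static void main() { return; } }"
file_b = "public class Logger { public static void flush() { return; } }"

# Without stop words the two files look similar only because of shared keywords.
raw = jaccard(tokens(file_a), tokens(file_b))
# With Java keywords removed, only project-specific identifiers are compared.
filtered = jaccard(tokens(file_a, JAVA_STOP_WORDS), tokens(file_b, JAVA_STOP_WORDS))
```

In this toy case the unfiltered score is 5/9, driven entirely by keywords, while the filtered score is 0.0 because the two files share no identifiers of their own.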
Clearing Automatically Created Documents
Many code files are created by automated tools. This biases the dataset: two projects can appear to have exchanged knowledge when in fact they simply used the same code generation tool. To identify these files and remove them from our sample, we used Solr again. We examined samples of extremely similar code files and picked out those that had been created by automated tools. These files contained textual signatures that pointed to their automatic creation. We searched for these signatures in our code corpus using Solr and removed the files that matched the query.
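The signature-based filtering can be sketched as a simple substring match. The phrases in `GENERATED_SIGNATURES`, the tool name `ExampleTool` and both function names are hypothetical; the real signatures came from manual inspection and the real search ran as Solr queries rather than in Python.

```python
# Hypothetical signature phrases, standing in for the manually collected list.
GENERATED_SIGNATURES = (
    "generated by",
    "do not edit",
    "auto-generated",
)


def is_generated(text, signatures=GENERATED_SIGNATURES):
    """Return True if any signature occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(signature in lowered for signature in signatures)


def drop_generated(files):
    """Given a mapping of path -> text, keep only files matching no signature."""
    return {path: text for path, text in files.items() if not is_generated(text)}


corpus = {
    "Parser.java": "// Generated by ExampleTool 1.0. DO NOT EDIT.\nclass Parser {}",
    "Main.java": "class Main { void run() {} }",
}
kept = drop_generated(corpus)  # only Main.java remains
```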
Once we had the text index, we created a MySQL metadata index of these files. We wrote a Java program that ran similarity searches for all files in Solr and stored the results in the database. We could then run fast SQL queries over the data and create Python network objects from the results.
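As a rough illustration of such a metadata index, the sketch below stores file-level similarity results in a table and aggregates them into project-level edges. It uses `sqlite3` from the Python standard library in place of MySQL so that it runs standalone. The table layout, column names, sample rows and the 0.9 score threshold are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE similarity (
        source_project TEXT NOT NULL,
        source_file    TEXT NOT NULL,
        target_project TEXT NOT NULL,
        target_file    TEXT NOT NULL,
        score          REAL NOT NULL
    )
    """
)
# Made-up similarity search results, one row per pair of similar files.
rows = [
    ("proj-a", "src/Parser.java", "proj-b", "lib/Parser.java", 0.97),
    ("proj-a", "src/Util.java", "proj-b", "lib/Util.java", 0.95),
    ("proj-a", "src/Main.java", "proj-c", "Main.java", 0.40),
    ("proj-b", "lib/Io.java", "proj-c", "io/Io.java", 0.92),
]
conn.executemany("INSERT INTO similarity VALUES (?, ?, ?, ?, ?)", rows)


def project_edges(connection, min_score):
    """Count strongly similar file pairs for each ordered pair of projects."""
    cursor = connection.execute(
        """
        SELECT source_project, target_project, COUNT(*) AS shared_files
        FROM similarity
        WHERE score >= ? AND source_project <> target_project
        GROUP BY source_project, target_project
        ORDER BY source_project, target_project
        """,
        (min_score,),
    )
    return cursor.fetchall()


edges = project_edges(conn, min_score=0.9)
```

Each returned row (source project, target project, number of shared files) can serve as a weighted edge when building a network object.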
Social Network Analysis
We had past data on the connections between all code developers in SourceForge stored in our MySQL database. We combined this data with the newly constructed code flow dataset in order to extract meaningful social network insights.
We used the NetworkX package for Python to compute network characteristics. For statistical insights we used Pandas (a Python data analysis package), the R programming language and Stata.
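One possible way to combine the two datasets is sketched below, under the assumption that developer connections are available as (developer, project) memberships and that code flow is a list of directed project pairs. Both input shapes, the sample data and the function names are hypothetical, and this is only one of several plausible joins.

```python
from collections import defaultdict

# Hypothetical inputs: developer memberships and directed code flow edges.
memberships = [
    ("alice", "proj-a"),
    ("alice", "proj-b"),
    ("bob", "proj-b"),
    ("carol", "proj-c"),
]
code_flow = [("proj-a", "proj-b"), ("proj-b", "proj-c")]


def developers_by_project(pairs):
    """Group developers by the projects they belong to."""
    result = defaultdict(set)
    for developer, project in pairs:
        result[project].add(developer)
    return result


def annotate_code_flow(flow_edges, pairs):
    """Map each code flow edge to the developers who belong to both projects."""
    devs = developers_by_project(pairs)
    return {(src, dst): devs[src] & devs[dst] for src, dst in flow_edges}


annotated = annotate_code_flow(code_flow, memberships)
```

In this toy data the edge from `proj-a` to `proj-b` is annotated with `alice`, who belongs to both projects, while the edge from `proj-b` to `proj-c` has no shared developer.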