Computer Vision, 3D and the connecting factor - AI

About Me

AI Entrepreneur

My name is Peter. I've been enthusiastic about neural networks since I was a kid. I run the awesome Reddit community:
/r/2D3DAI

Working on rendi.dev - FFmpeg as a Service

Co-Founded getmunch.com

Welcome to the journal about 2d, 3d and the connecting factor - AI



@2019 - All Right Reserved. Peter Naftaliev Abelians

Open Source Research – Technical Work

by Peter February 1, 2018



Crawling

We crawled every project on SourceForge and saved its complete data for every year going back to 1998. The projects used two main types of source code repository: SVN and CVS. CVS is an older version control system that is rarely used today.

Using the management APIs for SVN and CVS, we first tried writing a program that queries every project for its content for every year of its existence, taking only the files that changed within that year. This proved to be time consuming. We then changed tactics and simply downloaded the entire repositories, effectively creating a copy of every SourceForge project in our own database. This amounted to about 8 TB of data.

Parsing

Each project consists of many different file types which hold relevant information: plain-text code files, text documents, Word documents, PDFs, compressed archives which can contain more documents, configuration files and more. In order to use all this data, we needed to parse every document within a project and save it in a uniform format.

We wrote a program that traverses the directory tree of every open source project, for every year of its existence, and extracts the important information from each file, with the help of the open source project svn-search. For each file we extracted the text of the file, the author of the file, the last action performed on the file within the corresponding year (edit, delete, add), the time of the last action, the author of the last action, the comments left by the file's last author, the size of the file, the location of the file in the project and more. We had to adjust the program to handle both SVN and CVS repositories, to parse file types specific to what we saw in SourceForge, and to save the results in our own XML format. This produced a base of 1 TB of standardized data which we could then analyze.
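As a rough illustration of the traversal step, here is a minimal Python sketch that walks a project snapshot on disk and collects one XML record per file. It is not the program described above: the element names (`project`, `file`, `location`, `size`, `text`) and the function names are assumptions made for this example, and it only covers fields readable from the file system. The author, last action and commit comments would have to come from the SVN or CVS metadata, which is not shown here.

```python
import os
import xml.etree.ElementTree as ET


def file_record(root_dir, path):
    """Build one <file> element from what can be read directly off disk."""
    record = ET.Element("file")
    # Location of the file relative to the project root.
    ET.SubElement(record, "location").text = os.path.relpath(path, root_dir)
    ET.SubElement(record, "size").text = str(os.path.getsize(path))
    # Undecodable bytes are replaced so that binary files do not abort the run.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        ET.SubElement(record, "text").text = handle.read()
    return record


def project_to_xml(root_dir, project_name):
    """Walk a project snapshot and collect every file into one XML tree."""
    project = ET.Element("project", name=project_name)
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            project.append(file_record(root_dir, full_path))
    return project
```

Calling `ET.tostring(project_to_xml(path, name))` would then give a single document per project snapshot that an indexer can consume.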

Indexing

We used a Lucene-based text index (Solr) to index code files for every programming language, focusing on an index for Java and C/C++. Our XML files were designed to work seamlessly with the Lucene index. The Java source code index comprises 9 million files across 130,000 projects. We divided it into 12 sub-indexes (shards) to improve search performance, spread over a cluster of 2 physical servers, each with 16 CPU cores, 128 GB of RAM and 500 GB SSDs. We added Java keywords to the list of stop words in Solr, so that these words do not count toward similarity when searching.
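To see why the stop words matter, consider this small Python sketch. It does not reproduce Solr's scoring. It uses plain Jaccard similarity over token sets as a stand-in, and the keyword list is only a subset of Java's reserved words chosen for the example.

```python
import re

# A subset of Java reserved words, treated as stop words (illustrative only).
JAVA_STOP_WORDS = {
    "abstract", "boolean", "break", "case", "catch", "class", "else",
    "extends", "final", "for", "if", "implements", "import", "int", "new",
    "package", "private", "protected", "public", "return", "static",
    "this", "throws", "try", "void", "while",
}


def tokens(source):
    """Return the set of lowercase identifier-like tokens, minus stop words."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return {word.lower() for word in words} - JAVA_STOP_WORDS


def jaccard(a, b):
    """Jaccard similarity of the two token sets (0.0 when both are empty)."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

Two unrelated classes that both start with `public class` share no tokens once the keywords are dropped, so boilerplate common to every Java file no longer inflates their similarity.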

Clearing Automatically Created Documents

Many code files are created by automated tools. This biases the dataset: two projects can appear to have transferred knowledge between them when in fact they just used the same code generation tool. To find these files and remove them from our sample, we used Solr again. We looked at samples of extremely similar code files and picked out those that we could see were created by automated tools. These files contained textual signatures that pointed to their automatic creation. We searched for these signatures across our code corpus using Solr and removed the files that matched the query.
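The filtering idea can be sketched in a few lines of Python. This is a simplified in-memory version rather than the Solr queries we ran, and the signature strings below are hypothetical examples, not the actual list we compiled.

```python
# Hypothetical signature strings; the real list came from manual inspection
# of extremely similar files.
GENERATED_SIGNATURES = (
    "generated by",
    "do not edit",
    "auto-generated",
)


def is_generated(text, signatures=GENERATED_SIGNATURES):
    """Return True if any known signature appears in the file text."""
    lowered = text.lower()
    return any(signature in lowered for signature in signatures)


def drop_generated(files):
    """Keep only the (path, text) pairs that carry no generator signature."""
    return [(path, text) for path, text in files if not is_generated(text)]
```

A plain substring match like this is deliberately blunt: it will also drop a hand-written file that happens to quote one of the signatures, which is a trade-off to keep in mind when choosing them.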

Data Analysis

Once we had the text index, we created a MySQL metadata index of these files. We wrote a Java program which ran similarity searches for all files in Solr and stored the results in the database. We could then run fast SQL queries over the data and create Python network objects from it.
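The kind of query this enables can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs anywhere, and the `similarity` table with its three columns is an assumed schema for illustration, not our actual one.

```python
import sqlite3


def build_db(rows):
    """Load (source_project, target_project, score) rows into an in-memory DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE similarity ("
        " source_project TEXT, target_project TEXT, score REAL)"
    )
    conn.executemany("INSERT INTO similarity VALUES (?, ?, ?)", rows)
    return conn


def project_edges(conn, min_score):
    """Count similar file pairs per project pair at or above a score threshold."""
    query = (
        "SELECT source_project, target_project, COUNT(*) "
        "FROM similarity "
        "WHERE score >= ? AND source_project <> target_project "
        "GROUP BY source_project, target_project "
        "ORDER BY source_project, target_project"
    )
    return conn.execute(query, (min_score,)).fetchall()
```

Each returned row is a (source project, target project, number of similar files) triple, which maps directly onto a weighted edge in a network.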

Social Network Analysis

We had past data on the connections between all code developers in SourceForge saved in our MySQL DB. We combined this data with the newly constructed code flow dataset in order to extract meaningful social network insights.

We used the NetworkX package for Python to compute network characteristics. For statistical insights we used Pandas (a Python data analysis package), the R programming language and Stata.
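A minimal NetworkX sketch of this kind of processing is shown below. It assumes the code flow data is available as (source project, target project, number of similar files) triples. That input shape, the edge direction and the particular measures chosen are assumptions for illustration, not a record of the exact analysis we ran.

```python
import networkx as nx


def build_code_flow_graph(edges):
    """Build a directed, weighted graph from (source, target, count) triples."""
    graph = nx.DiGraph()
    for source, target, shared_files in edges:
        # The direction source -> target is an assumed convention for code flow.
        graph.add_edge(source, target, weight=shared_files)
    return graph


def summarize(graph):
    """Collect a few basic network characteristics into a dictionary."""
    return {
        "projects": graph.number_of_nodes(),
        "links": graph.number_of_edges(),
        "density": nx.density(graph),
        "in_degree": dict(graph.in_degree()),
    }
```

The resulting dictionary can be loaded into a Pandas DataFrame or exported for further analysis in R or Stata.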

Tags: coding, graph, social-network, statistics
