GitHub API Based QuantNet Mining Infrastructure in R

44 Pages. Posted: 7 Mar 2017. Last revised: 9 Mar 2017

Lukas Borke

Humboldt University of Berlin

Wolfgang K. Härdle

Humboldt University of Berlin - Institute for Statistics and Econometrics; Humboldt University of Berlin - Center for Applied Statistics and Economics (CASE)

Date Written: March 6, 2017

Abstract

QuantNet, an online GitHub-based organization, is an integrated environment consisting of different types of statistics-related documents and program code called Quantlets. The QuantNet Style Guide and the yamldebugger package allow a standardized audit and validation of YAML-annotated software repositories within this organization. The behavior of QuantNet users is measured with web metrics from Google Analytics. We show how the search queries obtained from Google’s metrics can be used in test collections to calibrate and evaluate the information retrieval (IR) performance of QuantNet’s search engine, QuantNetXploRer. For that purpose, different text mining (TM) models are examined by means of the new TManalyzer package. Further, we introduce the Validation Pipeline (Vali-PP) and apply it to the YAML data. Vali-PP is a functional multi-stage instrument for clustering analysis, providing multivariate statistical analysis of the co-occurrence distribution of the driving factors of the pipeline. Finally, we briefly present the new package rgithubS, which enables a GitHub-wide search for code and repositories using the GitHub Search API and is an essential element of the QuantNet mining infrastructure.
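To make the GitHub-wide search concrete, the following minimal sketch builds request URLs for the two GitHub Search API endpoints mentioned above (code and repositories). It is written in Python for illustration and is not the rgithubS implementation; the helper name `build_search_url`, the organization name `QuantLet`, and the search terms are assumptions chosen for the example. No request is sent; note that the code search endpoint requires an authenticated request.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def build_search_url(kind, terms, qualifiers=None, per_page=30):
    """Build a GitHub Search API URL (hypothetical helper, not part of rgithubS).

    kind       -- "code" or "repositories" (the two searches named in the abstract)
    terms      -- list of free-text search terms
    qualifiers -- optional dict of search qualifiers, e.g. {"org": "...", "language": "R"}
    """
    if kind not in ("code", "repositories"):
        raise ValueError("kind must be 'code' or 'repositories'")
    parts = list(terms)
    # Qualifiers are appended to the query string as key:value tokens.
    for key, value in (qualifiers or {}).items():
        parts.append(f"{key}:{value}")
    # urlencode percent-encodes the query; spaces become '+', ':' becomes '%3A'.
    query = urlencode({"q": " ".join(parts), "per_page": per_page})
    return f"{API_ROOT}/search/{kind}?{query}"


# Assumed example: repositories mentioning "lsa" within an organization named "QuantLet".
repo_url = build_search_url("repositories", ["lsa"], {"org": "QuantLet"})

# Assumed example: R code containing the term "yaml" within the same organization.
code_url = build_search_url("code", ["yaml"], {"org": "QuantLet", "language": "R"})

print(repo_url)
print(code_url)
```

The URLs can be passed to any HTTP client; the JSON response lists matching items together with a total count.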

The TManalyzer results show that, for all considered single-term queries, the number of true positives is maximal in a latent semantic analysis model configuration (LSA50). The Vali-PP analysis indicates that the combination of LSA50 and hierarchical clustering (HC) is optimal for 70–90% of the cluster sizes under most of the considered quality indices. Further, we can infer that more accurate and comprehensive metadata increases the clustering quality. Subsequently, the findings of our experimental design are implemented in the QuantNetXploRer.
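As an illustration of what counting true positives for a single-term query involves, the sketch below compares the documents retrieved by two models against a set of documents judged relevant in a test collection. It is a generic IR evaluation in Python, not the TManalyzer code; the document identifiers, the model names `model_a` and `model_b`, and all numbers are made up for the example and are not results from the paper.

```python
def confusion_counts(retrieved, relevant, collection_size):
    """Count true/false positives and negatives for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant documents that were retrieved
    fp = len(retrieved - relevant)   # retrieved documents that are not relevant
    fn = len(relevant - retrieved)   # relevant documents that were missed
    tn = collection_size - tp - fp - fn
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def precision_recall(counts):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    retrieved_total = counts["tp"] + counts["fp"]
    relevant_total = counts["tp"] + counts["fn"]
    precision = counts["tp"] / retrieved_total if retrieved_total else 0.0
    recall = counts["tp"] / relevant_total if relevant_total else 0.0
    return precision, recall


# Toy test collection of 10 documents; 4 are judged relevant to the query.
relevant_docs = {"d1", "d2", "d3", "d4"}
results = {
    "model_a": {"d1", "d2", "d9"},        # hypothetical output of a first model
    "model_b": {"d1", "d2", "d3", "d7"},  # hypothetical output of a second model
}

scores = {
    name: confusion_counts(docs, relevant_docs, collection_size=10)
    for name, docs in results.items()
}

# The model with the most true positives for this query.
best = max(scores, key=lambda name: scores[name]["tp"])
print(best, scores[best], precision_recall(scores[best]))
```

Repeating this per query and per model configuration gives the kind of comparison from which a configuration with the maximal number of true positives can be identified.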

Keywords: Code Search, Software Repositories, Text Mining, Information Retrieval, Smart Data, YAML, GitHub Search API, Google Analytics, Web Metrics, LSA, GVSM, Cluster Validation, Quality Indices, Validation Pipeline

JEL Classification: C44, C87, C88, C89, M15, O32

Suggested Citation

Borke, Lukas and Härdle, Wolfgang K., GitHub API Based QuantNet Mining Infrastructure in R (March 6, 2017). Available at SSRN: https://ssrn.com/abstract=2927901 or http://dx.doi.org/10.2139/ssrn.2927901

Lukas Borke

Humboldt University of Berlin

Unter den Linden 6
Berlin, Berlin 10099
Germany

Wolfgang K. Härdle (Contact Author)

Humboldt University of Berlin - Center for Applied Statistics and Economics (CASE)

Unter den Linden 6
Berlin, D-10099
Germany

Humboldt University of Berlin - Institute for Statistics and Econometrics

Unter den Linden 6
Berlin, D-10099
Germany
+49 30 2093 5631 (Phone)
+49 30 2093 5649 (Fax)
