GitHub API Based QuantNet Mining Infrastructure in R

44 Pages. Posted: 7 Mar 2017. Last revised: 9 Mar 2017

Lukas Borke

Humboldt University of Berlin

Wolfgang Karl Härdle

Blockchain Research Center Humboldt-Universität zu Berlin; Charles University; National Yang Ming Chiao Tung University; Asian Competitiveness Institute

Date Written: March 6, 2017

Abstract

QuantNet, an online GitHub-based organization, is an integrated environment consisting of different types of statistics-related documents and program codes called Quantlets. The QuantNet Style Guide and the yamldebugger package allow a standardized audit and validation of YAML-annotated software repositories within this organization. The behavior statistics of QuantNet users are measured with web metrics from Google Analytics. We show how the search queries obtained from Google’s metrics can be used in test collections to calibrate and evaluate the information retrieval (IR) performance of QuantNet’s search engine, QuantNetXploRer. For that purpose, different text mining (TM) models are examined by means of the new TManalyzer package. Further, we introduce the Validation Pipeline (Vali-PP) and apply it to the YAML data. Vali-PP is a functional multi-staged instrument for clustering analysis, providing multivariate statistical analysis of the co-occurrence distribution of driving factors of the pipeline. Finally, we briefly present the new package rgithubS, which enables a GitHub-wide search for code and repositories using the GitHub Search API and is an essential element of the QuantNet Mining infrastructure.
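To make the GitHub-wide search for code and repositories concrete, the following minimal sketch builds request URLs for the two relevant GitHub Search API endpoints (`/search/repositories` and `/search/code`). It is written in Python for illustration only and does not reproduce the interface of rgithubS. The helper name `build_search_url`, the search terms, and the organization name `QuantLet` are assumptions made for this example; no request is sent.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def build_search_url(kind, terms, qualifiers=None, per_page=30):
    """Build a GitHub Search API URL (hypothetical helper, not part of rgithubS).

    kind       -- "code" or "repositories" (the two search types named in the abstract)
    terms      -- free-text search terms
    qualifiers -- dict of search qualifiers such as {"org": "...", "language": "R"}
    """
    if kind not in ("code", "repositories"):
        raise ValueError("kind must be 'code' or 'repositories'")
    # Qualifiers are appended to the query string as key:value tokens.
    parts = [terms] + [f"{k}:{v}" for k, v in (qualifiers or {}).items()]
    query = urlencode({"q": " ".join(parts), "per_page": per_page})
    return f"{GITHUB_API}/search/{kind}?{query}"


# Assumed organization name and search terms, for illustration only.
repo_url = build_search_url("repositories", "yaml", {"org": "QuantLet"})
code_url = build_search_url("code", "Metainfo", {"org": "QuantLet", "language": "R"})
print(repo_url)
print(code_url)
```

The resulting URLs can be passed to any HTTP client. Note that the code search endpoint requires an authenticated request, and both endpoints are rate limited.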

The TManalyzer results show that, for all considered single-term queries, the number of true positives is maximal in a latent semantic analysis model configuration (LSA50). The Vali-PP analysis indicates that the combination of LSA50 and hierarchical clustering (HC) is optimal for 70−90% of the cluster sizes for most of the considered quality indices. Further, we can infer that more accurate and comprehensive metadata increases the clustering quality. The findings of our experimental design are subsequently implemented in the QuantNetXploRer.
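The evaluation idea behind the true-positive counts for single-term queries can be sketched as follows. This is a plain term-frequency vector space model with cosine similarity, not the LSA50 configuration and not the TManalyzer implementation. The toy corpus, the relevance judgments, and all function names are hypothetical and serve only to show how true positives are counted against a test collection.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for Quantlet metadata (not real QuantNet data).
docs = {
    "q1": "kernel density estimation plot",
    "q2": "regression with kernel smoothing",
    "q3": "portfolio risk value at risk",
    "q4": "time series volatility garch",
}

# Hypothetical relevance judgments (test collection) for two single-term queries.
relevant = {"kernel": {"q1", "q2"}, "risk": {"q3", "q4"}}


def tf_vector(text):
    """Term-frequency vector of a text as a Counter."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, threshold=0.0):
    """Return the ids of all documents whose similarity to the query exceeds threshold."""
    q = tf_vector(query)
    return {d for d, text in docs.items() if cosine(q, tf_vector(text)) > threshold}


def true_positives(query):
    """Number of retrieved documents that are judged relevant for the query."""
    return len(retrieve(query) & relevant[query])


for query in relevant:
    print(query, true_positives(query), "of", len(relevant[query]))
```

In this toy setup, exact term matching cannot retrieve `q4` for the query "risk" because the term does not occur in that document. Latent semantic models such as LSA aim to recover this kind of semantically related but lexically different document.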

Keywords: Code Search, Software Repositories, Text Mining, Information Retrieval, Smart Data, YAML, GitHub Search API, Google Analytics, Web Metrics, LSA, GVSM, Cluster Validation, Quality Indices, Validation Pipeline

JEL Classification: C44, C87, C88, C89, M15, O32

Suggested Citation

Borke, Lukas and Härdle, Wolfgang Karl, GitHub API Based QuantNet Mining Infrastructure in R (March 6, 2017). Available at SSRN: https://ssrn.com/abstract=2927901 or http://dx.doi.org/10.2139/ssrn.2927901

Lukas Borke

Humboldt University of Berlin

Unter den Linden 6
Berlin, AK Berlin 10099
Germany

Wolfgang Karl Härdle (Contact Author)

Blockchain Research Center Humboldt-Universität zu Berlin

Unter den Linden 6
Berlin, D-10099
Germany

Charles University

Celetná 13
Dept Math Physics
Praha 1, 116 36
Czech Republic

National Yang Ming Chiao Tung University

No. 1001, Daxue Rd. East Dist.
Hsinchu City 300093
Taiwan

Asian Competitiveness Institute

Singapore
