GitHub API Based QuantNet Mining Infrastructure in R
44 Pages Posted: 7 Mar 2017 Last revised: 9 Mar 2017
Date Written: March 6, 2017
QuantNet being an online GitHub based organization is an integrated environment consisting of different types of statistics-related documents and program codes called Quantlets. The QuantNet Style Guide and the yamldebugger package allow a standardized audit and validation of YAML annotated software repositories within this organization. The behavior statistics of QuantNet users are measured with Web Metrics from Google Analytics. We show how the search queries obtained from Google’s metrics can be used in the test collections in order to calibrate and evaluate the information retrieval (IR) performance of QuantNet’s search engine called QuantNetXploRer. For that purpose, different text mining (TM) models will be examined by means of the new TManalyzer package. Further, we introduce the Validation Pipeline (Vali-PP) and apply it on the YAML data. Vali-PP is a functional multi-staged instrument for clustering analysis, providing multivariate statistical analysis of the co-occurrence distribution of driving factors of the pipeline. The new package rgithubS, which enables a GitHub wide search for code and repositories using the GitHub Search API and which is an essential element of the QuantNet Mining infrastructure, is briefly presented.
The TManalyzer results show that for all considered single term queries the number of true positives is maximal in a latent semantic analysis model configuration (LSA50). The Vali-PP analysis indicates that the optimality of the combination LSA50 and hierarchical clustering (HC) applies to 70−90% of the cluster sizes for most of the considered quality indices. Further, we can infer that more accurate and comprehensive metadata increases the clustering quality. Subsequently, the findings of our experimental design are implemented into the QuantNetXploRer.
Keywords: Code Search, Software Repositories, Text Mining, Information Retrieval, Smart Data, YAML, GitHub Search API, Google Analytics, Web Metrics, LSA, GVSM, Cluster Validation, Quality Indices, Validation Pipeline
JEL Classification: C44, C87, C88, C89, M15, O32
Suggested Citation: Suggested Citation