Q3-D3-LSA

SFB 649 Discussion Paper 2016-049

48 Pages. Posted: 18 Nov 2016

Lukas Borke

Humboldt University of Berlin

Wolfgang K. Härdle

Humboldt University of Berlin - Institute for Statistics and Econometrics; Humboldt University of Berlin - Center for Applied Statistics and Economics (CASE)

Date Written: November 17, 2016

Abstract

QuantNet is an integrated web-based environment comprising different types of statistics-related documents and program code. Its goal is to promote reproducibility and to offer a platform for sharing validated knowledge native to the social web. Increasing information retrieval (IR) efficiency requires incorporating semantic information. Three text mining models are examined: the vector space model (VSM), the generalized VSM (GVSM) and latent semantic analysis (LSA). LSA has been used successfully for IR purposes as a technique for capturing semantic relations between terms and incorporating them into the similarity measure between documents. Our results show that different model configurations allow adapted similarity-based document clustering and knowledge discovery. In particular, different LSA configurations combined with hierarchical clustering yield good results under M3 evaluation. QuantNet and the corresponding Data-Driven Documents (D3) based visualization can be found and applied under quantlet. The driving technology behind it is Q3-D3-LSA, the combination of “GitHub API based QuantNet Mining infrastructure in R”, LSA and a D3 implementation.
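
To make the three models named in the abstract concrete, the following is a minimal sketch of how document similarities could be computed under VSM, GVSM and LSA. It is not the authors' implementation (which the abstract describes as R-based). The toy corpus, the raw term counts, the choice of D Dᵀ as the GVSM term-term correlation matrix and the LSA rank k = 2 are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy corpus; real input would be document metadata or text.
docs = [
    "kernel density estimation",
    "kernel regression smoothing",
    "option pricing volatility",
    "volatility smile option",
]

# Vocabulary and term-by-document matrix D (rows: terms, columns: documents).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
D = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        D[index[w], j] += 1.0  # raw term counts (assumption: no tf-idf weighting)


def cosine_matrix(X):
    """Cosine similarities between the columns of X."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    return Xn.T @ Xn


def gram_to_cosine(G):
    """Normalize a Gram matrix so that its diagonal becomes 1."""
    d = np.sqrt(np.diag(G)).copy()
    d[d == 0] = 1.0
    return G / np.outer(d, d)


# VSM: documents are compared directly in term space.
sim_vsm = cosine_matrix(D)

# GVSM: term-term correlations T enter the inner product, d_i^T T d_j.
# Assumption: T = D D^T (terms are related if they co-occur in documents).
T = D @ D.T
sim_gvsm = gram_to_cosine(D.T @ T @ D)

# LSA: documents are compared in the rank-k space of a truncated SVD.
k = 2  # assumed number of latent dimensions
U, s, Vt = np.linalg.svd(D, full_matrices=False)
docs_k = np.diag(s[:k]) @ Vt[:k, :]  # k-dimensional document coordinates
sim_lsa = cosine_matrix(docs_k)
```

In this toy corpus the first two documents share only the term "kernel", so their VSM similarity is modest. GVSM and LSA raise it because they also account for term relations, while documents with no connecting terms stay at similarity zero. Any of the three similarity matrices could then be passed to a hierarchical clustering routine, as the abstract suggests.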

Keywords: QuantNet, D3, GitHub API, text mining, document clustering, similarity, semantic web, generalized vector space model, LSA, visualization

JEL Classification: C87, C88, G17

Suggested Citation

Borke, Lukas and Härdle, Wolfgang K., Q3-D3-LSA (November 17, 2016). SFB 649 Discussion Paper 2016-049. Available at SSRN: https://ssrn.com/abstract=2871111 or http://dx.doi.org/10.2139/ssrn.2871111

Lukas Borke

Humboldt University of Berlin

Unter den Linden 6
Berlin, Berlin 10099
Germany

Wolfgang K. Härdle (Contact Author)

Humboldt University of Berlin - Center for Applied Statistics and Economics (CASE)

Unter den Linden 6
Berlin, D-10099
Germany

Humboldt University of Berlin - Institute for Statistics and Econometrics

Unter den Linden 6
Berlin, D-10099
Germany
+49 30 2093 5631 (Phone)
+49 30 2093 5649 (Fax)
