Why so Similar?: Identifying Semantic Organizing Processes in Large Textual Corpora

37 Pages Posted: 13 Nov 2013  

Drew Margolin

Cornell University - Department of Communication

Yu-Ru Lin

University of Pittsburgh - School of Information Sciences

David Lazer

Northeastern University - Department of Political Science; Harvard University - Harvard Kennedy School (HKS)

Date Written: November 12, 2013

Abstract

This paper introduces the concept of semantic organizing processes as a means of inferring theoretically meaningful behavior from the observation of raw text. Semantic organizing processes are mechanisms by which a set of authors comes to produce texts that are similar in some observable, quantifiable way. We introduce three broad semantic organizing processes -- authors sharing subject matter, authors sharing goals, and authors sharing sources -- and argue that each of these processes will lead to texts that tend to share n-grams of different lengths: short n-grams for shared subject matter, moderate-length n-grams for shared goals, and long n-grams for shared sources. To test these hypotheses, we develop a novel n-gram extraction technique to capture text similarity based on n-grams of different lengths. We then apply our technique to a corpus where the author attributes are observable: the public statements of the Members of the U.S. Congress. Our results support the hypothesis that these three processes are reflected in distinct kinds of textual similarity. This paper presents the first empirical finding that different social processes are detectable through the structure of overlapping textual features. The finding has important implications for modeling text and understanding underlying social processes.
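The idea of comparing texts by shared n-grams of different lengths can be illustrated with a minimal sketch. This is not the authors' extraction technique, which the abstract does not specify; it assumes simple lowercased whitespace tokenization and Jaccard overlap, and the function names `ngrams` and `jaccard_by_length` and the default lengths 1, 3, and 6 are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams (as tuples) in a token list."""
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_by_length(
    text_a: str, text_b: str, lengths: Sequence[int] = (1, 3, 6)
) -> Dict[int, float]:
    """Jaccard overlap of the two texts' n-gram sets, one score per length.

    Assumption: lowercased whitespace tokenization; a real pipeline would
    likely handle punctuation, stop words, and normalization.
    """
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    scores: Dict[int, float] = {}
    for n in lengths:
        grams_a, grams_b = ngrams(tokens_a, n), ngrams(tokens_b, n)
        union = grams_a | grams_b
        # Define the overlap as 0.0 when neither text has any n-gram of length n.
        scores[n] = len(grams_a & grams_b) / len(union) if union else 0.0
    return scores


if __name__ == "__main__":
    a = "the tax bill helps families"
    b = "the tax bill hurts families"
    print(jaccard_by_length(a, b, lengths=(1, 3)))
```

In this toy pair the two statements share most of their individual words but few three-word sequences, so the overlap score falls as the n-gram length grows. Under the abstract's hypotheses, that pattern would be more consistent with shared subject matter than with a shared source.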

Keywords: semantic networks, framing, vocabulary, textual corpora, semantic organizing processes, isomorphism

Suggested Citation

Margolin, Drew and Lin, Yu-Ru and Lazer, David, Why so Similar?: Identifying Semantic Organizing Processes in Large Textual Corpora (November 12, 2013). Available at SSRN: https://ssrn.com/abstract=2353705 or http://dx.doi.org/10.2139/ssrn.2353705

Drew Margolin (Contact Author)

Cornell University - Department of Communication

Ithaca, NY 14850
United States

Yu-Ru Lin

University of Pittsburgh - School of Information Sciences

United States

HOME PAGE: http://yurulin.com

David Lazer

Northeastern University - Department of Political Science

Boston, MA 02115
United States
617-373-2796 (Phone)
617-373-5311 (Fax)

Harvard University - Harvard Kennedy School (HKS)

79 John F. Kennedy Street
Taubman Center
Cambridge, MA 02138
United States
617-496-0102 (Phone)
617-496-1722 (Fax)

HOME PAGE: http://www.davidlazer.com
