We Don't Know What We Don't Know: When and How the Use of Twitter's Public APIs Biases Scientific Inference
26 Pages. Posted: 4 Dec 2017. Last revised: 13 Apr 2019.
Date Written: November 29, 2017
Though Twitter research has proliferated, no standards for data collection have crystallized. When queried by keyword, the most common data sources (the Search and Streaming APIs) rarely return the full population of tweets, and scholars do not know whether their data constitute a representative sample. This paper seeks to provide the most comprehensive look to date at the biases that may result. Employing data derived from four identical keyword queries to the Firehose (which provides the full population of tweets but is cost-prohibitive), Streaming, and Search APIs, we use Kendall's tau and logit regression analyses to understand how the datasets differ, including which user and content characteristics make a tweet more or less likely to appear in sampled results. We find systematic differences that are likely to bias scholars' findings in almost all of the datasets we examine, and we recommend significant caution in future Twitter research.
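As a rough illustration of the kind of rank comparison the abstract mentions, the sketch below computes Kendall's tau between hashtag frequencies in a full dataset and in a sampled one. It is not the authors' code: the function name `kendall_tau_a`, the hashtag labels, and all counts are hypothetical, and the tau-a variant used here makes no correction for ties, which the paper's analysis may handle differently.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical hashtag counts, invented purely for illustration.
firehose_counts = {"#a": 500, "#b": 300, "#c": 200, "#d": 50}
streaming_counts = {"#a": 120, "#b": 60, "#c": 70, "#d": 10}

# Align the two datasets on the same hashtags before comparing ranks.
tags = sorted(firehose_counts)
tau = kendall_tau_a(
    [firehose_counts[t] for t in tags],
    [streaming_counts[t] for t in tags],
)
print(f"Kendall's tau-a between Firehose and Streaming ranks: {tau:.3f}")
```

A value near 1 would suggest that the sample preserves the ordering of items in the full population, while lower values indicate that the sample reorders them.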
Keywords: twitter, data collection, APIs, bias