ffgrep: Scalable Approximate String Matching

16 Pages. Posted: 25 Feb 2020

Date Written: September 10, 2019

Abstract

Approximate substring searching is a common but computationally demanding task in bioinformatics and text analysis. We present a new approach that recasts string search as a multiple convolution problem, then exploits highly efficient fast Fourier convolution techniques. This approach, which we call ffgrep, computes and caches the spectra of a target corpus, drastically reducing the cost of subsequent searches. Like other approaches, this algorithm is embarrassingly parallelizable; unlike other approaches, it can operate not only on raw strings but also on word embeddings. We apply ffgrep to an original corpus of imperfect automatic transcriptions of campaign speeches from the 2012 U.S. presidential election. We contrast our approach with agrep, an industry-standard meta-algorithm that selects the optimal member from a number of highly optimized approximate string matching algorithms. Searching for approximate recurrences of a manually curated set of candidate catchphrases, we show that ffgrep speeds computation by up to a factor of 60 in typical settings, with increasing gains as alignments grow longer or more complex. Moreover, these computational gains come at little cost in performance. Taking agrep search results as ground truth, over a wide range of agrep parameters, we show that ffgrep recovers highly similar results, with accuracies exceeding 0.94 and F1 scores of 0.84–0.9. Finally, we demonstrate how efficient substring matching enables new substantive research by identifying candidate catchphrases without human supervision.
By rapidly computing and organizing 90 billion pairwise string comparisons, our proposed method automatically learns that the phrases “kick children off of Head Start or eliminate health insurance for the poor” and “kick students are [sic] financial aid or get rid of funding for Planned Parenthood or eliminate health care for millions on Medicaid” — along with 32 other campaign appeals — all map onto a single recurring theme, President Barack Obama’s critique of a proposed Medicare reform.
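
The core idea of recasting string search as convolution, with cached corpus spectra, can be illustrated with a minimal sketch. This is not the paper's implementation: the names `SpectrumCache` and `match_counts` are hypothetical, and the textbook radix-2 `fft` stands in for an optimized library FFT so the example stays self-contained. It also assumes a simple substitution-only similarity (the number of agreeing characters at each alignment), which may differ from the scoring ffgrep actually uses. Each symbol's indicator vector over the corpus is transformed once and cached; a query multiplies those cached spectra by the spectra of the reversed pattern indicators and applies a single inverse transform.

```python
import cmath


def fft(a, invert=False):
    """Textbook recursive radix-2 FFT; len(a) must be a power of two.

    With invert=True the result is the unscaled inverse transform
    (the caller divides by len(a)).
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


class SpectrumCache:
    """Hypothetical helper: caches the spectrum of each symbol's indicator vector."""

    def __init__(self, text, max_pattern_len):
        self.text = text
        size = 1
        # Pad so that linear convolution never wraps around.
        while size < len(text) + max_pattern_len:
            size *= 2
        self.size = size
        self.spectra = {}  # symbol -> FFT of its zero-padded indicator

    def spectrum(self, symbol):
        if symbol not in self.spectra:
            indicator = [1.0 if ch == symbol else 0.0 for ch in self.text]
            indicator += [0.0] * (self.size - len(indicator))
            self.spectra[symbol] = fft(indicator)
        return self.spectra[symbol]


def match_counts(cache, pattern):
    """Number of agreeing characters for every alignment of pattern in the text.

    Assumes len(pattern) <= the max_pattern_len given to the cache.
    """
    m, n = len(pattern), cache.size
    total = [0j] * n
    for symbol in set(pattern):
        # Reversing the pattern turns cross-correlation into convolution.
        rev = [1.0 if ch == symbol else 0.0 for ch in reversed(pattern)]
        rev += [0.0] * (n - m)
        pattern_spec = fft(rev)
        text_spec = cache.spectrum(symbol)
        for i in range(n):
            total[i] += pattern_spec[i] * text_spec[i]
    # By linearity, one inverse transform covers all symbols at once.
    conv = fft(total, invert=True)
    offsets = len(cache.text) - m + 1
    return [round(conv[k + m - 1].real / n) for k in range(offsets)]


cache = SpectrumCache("the quick brown fox", max_pattern_len=8)
counts = match_counts(cache, "quack")
# Alignments where at most one character disagrees.
hits = [k for k, c in enumerate(counts) if c >= len("quack") - 1]  # hits == [4]
```

Because the cache stores only corpus-side spectra, a second query against the same corpus reuses them and pays only for its own pattern transforms and one inverse transform. Handling insertions and deletions, or real-valued word embeddings in place of 0/1 indicators, would require extensions that this sketch does not attempt.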

Suggested Citation

Knox, Dean, ffgrep: Scalable Approximate String Matching (September 10, 2019). Available at SSRN: https://ssrn.com/abstract=3528533 or http://dx.doi.org/10.2139/ssrn.3528533

Dean Knox (Contact Author)

Princeton University

001 Fisher Hall
Princeton, NJ 08544
United States
