Reliance on Science by Inventors: Hybrid Extraction of In-Text Patent-to-Article Citations
35 Pages. Posted: 27 Oct 2020. Last revised: 22 Apr 2022
Date Written: October 2020
We curate and characterize a complete set of citations from patents to scientific articles, including nearly 16 million from the full text of USPTO and EPO patents. Combining heuristics and machine learning, we achieve 25% higher performance than machine learning alone. At 99.4% accuracy we achieve coverage of 87.6%, and coverage exceeds 90% with accuracy above 93%. Performance is evaluated with a set of 5,939 randomly sampled, cross-verified “known good” citations, which the authors have never seen. We compare these “in-text” citations with the “official” citations on the front page of patents. In-text citations are more diverse temporally, geographically, and topically. They are less self-referential and less likely to be recycled from one patent to the next. That said, in-text citations have been overshadowed by front-page citations in the past few decades, dropping from 80% of all patent-to-paper citations to less than 40%. In replicating two published articles that use only citations on the front page of patents, we show that failing to capture citations in the body text leads to understating the relationship between academic science and commercial invention. All patent-to-article citations, as well as the known-good test set, are available at http://relianceonscience.org.