Preprints with The Lancet is part of SSRN's First Look, a place where journals identify content of interest prior to publication. Authors have opted in at submission to The Lancet family of journals to post their preprints on Preprints with The Lancet. The usual SSRN checks and a Lancet-specific check for appropriateness and transparency have been applied. Preprints available here are not Lancet publications or necessarily under review with a Lancet journal. These preprints are early stage research papers that have not been peer-reviewed. The findings should not be used for clinical or public health decision making and should not be presented to a lay audience without highlighting that they are preliminary and have not been peer-reviewed. For more information on this collaboration, see the comments published in The Lancet about the trial period, and our decision to make this a permanent offering, or visit The Lancet's FAQ page.

Trends in COVID-19 Publications: Streamlining Research Using NLP and LDA

24 pages. Posted: 15 Oct 2020


Akash Gupta

University of Cambridge

Shrey Aeron

University of California, Berkeley

Anjali Agrawal

Harmony School of Innovation

Himanshu Gupta

Valley Health System



Research publications related to the novel coronavirus disease COVID-19 are rapidly growing in number. However, current online literature hubs, even those using artificial intelligence, are inadequate for identifying the relative strength of research topics. Hence, we aimed to develop a comprehensive Latent Dirichlet Allocation (LDA) topic model using natural language processing (NLP) techniques, provide visualisations of temporal trends, and apply our methodology to improve existing online literature hubs.

Using the search term "COVID", abstracts published from January to July 2020 were extracted from PubMed® (N = 16,346). An LDA topic model was trained on 81% of the abstracts. Weekly temporal trends were visualised as a heatmap across all abstracts. We then tested our methodology on over 23,000 abstracts gathered from January to September 2020 from LitCovid, a literature hub from the National Center for Biotechnology Information. We used our topic model to subdivide LitCovid's eight categories into corresponding LDA topics.

The optimised LDA topic model, created using PubMed® data, produced 25 comprehensive topics with no significant overlap. Topic prominence changed over time: "Mental Health" and "Socioeconomic Impact" increased, "Genome Sequence" decreased, and "Epidemiology" remained relatively constant. We identified inadequate representation of "Airborne Transmission Protection". Importantly, research on masks and PPE is skewed towards clinical applications, with a lack of population-based epidemiological research. Our methodology, when applied to LitCovid, identified important topics within each LitCovid category. For example, "Case Report" was split into topics such as "Pulmonary" and "Oncology" as well as the under-represented topics "Haematology" and "Gastroenterology".

Our work allows for comprehensive topic identification and intuitive visualisation of temporal trends in COVID-19 research. Implementing the methodology complements existing online literature hubs and identifies under-represented topics, such as population-based studies on masks, that may be of significant public interest.
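
The weekly heatmap mentioned in the abstract needs, at minimum, a table of topic prominence per week. The Python sketch below shows one way such a table might be built. It is an illustration under stated assumptions, not the authors' pipeline: it assumes each abstract has already been assigned a single dominant topic by a trained LDA model, and it measures prominence as the share of that week's abstracts, a definition the abstract does not specify. The function name `weekly_topic_shares` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_topic_shares(records):
    """Return {(iso_year, iso_week): {topic: share}} from (date, topic) pairs.

    Each share is the fraction of that week's abstracts assigned to the topic,
    so the shares within any one week sum to 1.
    """
    counts = defaultdict(Counter)
    for pub_date, topic in records:
        iso = pub_date.isocalendar()
        counts[(iso[0], iso[1])][topic] += 1  # key: (ISO year, ISO week)

    shares = {}
    for week, topic_counts in sorted(counts.items()):
        total = sum(topic_counts.values())
        shares[week] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Hypothetical toy data: (publication date, dominant topic label).
# Real input would come from the topic model's per-abstract assignments.
records = [
    (date(2020, 2, 3), "Genome Sequence"),
    (date(2020, 2, 4), "Genome Sequence"),
    (date(2020, 2, 5), "Epidemiology"),
    (date(2020, 6, 1), "Mental Health"),
    (date(2020, 6, 2), "Mental Health"),
    (date(2020, 6, 3), "Epidemiology"),
    (date(2020, 6, 4), "Socioeconomic Impact"),
]

shares = weekly_topic_shares(records)
for week, row in shares.items():
    print(week, row)
```

Each row of the result corresponds to one column of a week-by-topic heatmap. Normalising by the weekly total keeps the rapid growth in overall publication volume from masking shifts in relative prominence.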

Funding Statement: None to declare.

Declaration of Interests: There are no conflicts of interest.

Keywords: Natural Language Processing, NLP, LDA, COVID-19, topic model, trends, LitCovid, PubMed, Machine Learning, research repository

Suggested Citation

Gupta, Akash and Aeron, Shrey and Agrawal, Anjali and Gupta, Himanshu, Trends in COVID-19 Publications: Streamlining Research Using NLP and LDA. Available at SSRN: or

Akash Gupta

University of Cambridge

Trinity Ln
Cambridge, CB2 1TN
United Kingdom

Shrey Aeron

University of California, Berkeley

310 Barrows Hall
Berkeley, CA 94720
United States

Anjali Agrawal

Harmony School of Innovation

Himanshu Gupta (Contact Author)

Valley Health System

United States
