Epidemiologic Information Discovery from Open-Access COVID-19 Case Reports Via Pretrained Language Model

44 Pages. Posted: 17 Mar 2022. Publication Status: Published.

Zhizheng Wang

Dalian University of Technology

Xiao Fan Liu

City University of Hong Kong

Zhanwei Du

The University of Hong Kong - WHO Collaborating Centre for Infectious Disease Epidemiology and Control

Lin Wang

University of Cambridge - Department of Genetics; The University of Hong Kong - WHO Collaborating Centre for Infectious Disease Epidemiology and Control

Ye Wu

Beijing University of Posts and Telecommunications (BUPT) - School of Economics and Management

Petter Holme

affiliation not provided to SSRN

Michael Lachmann

Santa Fe Institute

Hongfei Lin

Dalian University of Technology

Zoie S. Y. Wong

affiliation not provided to SSRN

Xiao-Ke Xu

Dalian Minzu University - College of Information and Communication Engineering

Yuanyuan Sun

Dalian University of Technology - College of Computer Science and Technology

Abstract

Although open-access data are increasingly common and useful for epidemiological research, curating such datasets is resource-intensive and time-consuming. Regularly disclosed case reports were a major source of COVID-19 data, but they were often written in natural language in an unstructured format. Here we propose a computational framework that automatically extracts epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model built on deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to COVID-19 case reports collected from mainland China, our framework outperforms the other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from gold-standard manual coding, with a matching rate of 80%. To implement our algorithm, we provide an open-access online platform that can accurately estimate epidemiological statistics in real time while substantially reducing the burden of data curation.

Keywords: Epidemiologic information; Artificial Intelligence; COVID-19 case reports; Pretrained language model; Epidemiology; Natural language processing

Suggested Citation

Wang, Zhizheng and Liu, Xiao Fan and Du, Zhanwei and Wang, Lin and Wu, Ye and Holme, Petter and Lachmann, Michael and Lin, Hongfei and Wong, Zoie S. Y. and Xu, Xiao-Ke and Sun, Yuanyuan, Epidemiologic Information Discovery from Open-Access COVID-19 Case Reports Via Pretrained Language Model. Available at SSRN: https://ssrn.com/abstract=4060371 or http://dx.doi.org/10.2139/ssrn.4060371
This version of the paper has not been formally peer reviewed.

Zhizheng Wang

Dalian University of Technology

Huiying Rd
DaLian, LiaoNing, 116024
China

Xiao Fan Liu

City University of Hong Kong

18 Tat Hong Avenue
Kowloon
Hong Kong

Zhanwei Du

The University of Hong Kong - WHO Collaborating Centre for Infectious Disease Epidemiology and Control

Hong Kong
China

Lin Wang

University of Cambridge - Department of Genetics

The University of Hong Kong - WHO Collaborating Centre for Infectious Disease Epidemiology and Control

Ye Wu

Beijing University of Posts and Telecommunications (BUPT) - School of Economics and Management

10 Xi Tu Cheng Rd.
Mailbox 164
Beijing, Beijing 100876
China

Petter Holme

affiliation not provided to SSRN

No Address Available

Michael Lachmann

Santa Fe Institute

1399 Hyde Park Road
Santa Fe, NM 87501
United States

Hongfei Lin

Dalian University of Technology

Huiying Rd
DaLian, LiaoNing, 116024
China

Zoie S. Y. Wong

affiliation not provided to SSRN

No Address Available

Xiao-Ke Xu

Dalian Minzu University - College of Information and Communication Engineering

Dalian, 116600
China

Yuanyuan Sun (Contact Author)

Dalian University of Technology - College of Computer Science and Technology

China
