
The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science

43 Pages · Posted: 26 Oct 2021 · Publication Status: Published

Nicholas Walker

Lawrence Berkeley National Laboratory - Energy Storage & Distributed Resources Division

Amalie Trewartha

Lawrence Berkeley National Laboratory

Haoyan Huo

Lawrence Berkeley National Laboratory

Sanghoon Lee

Lawrence Berkeley National Laboratory

Kevin Cruse

Lawrence Berkeley National Laboratory

John Dagdelen

Lawrence Berkeley National Laboratory

Alexander Dunn

Lawrence Berkeley National Laboratory

Kristin Persson

Lawrence Berkeley National Laboratory

Gerbrand Ceder

Lawrence Berkeley National Laboratory

Anubhav Jain

Lawrence Berkeley National Laboratory

Abstract

The massive increase in materials science publications has resulted in a bottleneck in efficiently connecting new materials discoveries to knowledge from established literature. This problem may be addressed through the use of named entity recognition (NER) to automatically extract structured summary-level data from unstructured text in published materials science articles. We compare the performance of four different NER models on three different annotated materials science datasets. The four models consist of a Bi-directional Long Short-Term Memory (BiLSTM) model enhanced with word embeddings pre-trained on materials science articles, the original BERT model trained on general text, a BERT variant pre-trained on a broad corpus of scientific articles spanning multiple fields (SciBERT), and a domain-specific BERT variant pre-trained exclusively on materials science articles (MatBERT, this work). Each of the annotated datasets consists of a collection of paragraphs sourced from the literature with annotated summary-level information relevant to the topic of the dataset; we explore NER performance on datasets for (1) solid-state materials, their phases, sample descriptions, and applications, among other features of interest, (2) dopant species, host materials, and dopant concentrations, and (3) gold nanoparticle descriptions and morphologies. The MatBERT model achieves the best overall NER performance across the datasets and consistently outperforms the other models on predicting individual entity classes. MatBERT improves over the other two BERT-Base-based NER models by ≈ 1-12% depending on the NER task, which implies that explicit pre-training on materials science text rather than general text provides a measurable advantage. The original BERT model, which was not specifically pre-trained on scientific text, performs worse than both MatBERT and SciBERT by ≈ 3-12% depending on the NER task, which reinforces the importance of selecting an appropriate pre-training corpus. Despite its architectural simplicity compared to BERT, the BiLSTM model consistently outperforms the original BERT model, perhaps due to the use of a tokenizer custom-made for materials science text as well as domain-specific pre-trained word embeddings. Learning curves indicate that the MatBERT and SciBERT models outperform the original BERT model by an even greater margin in the small-data limit. These results suggest that domain-specific pre-training does provide a measurable advantage for NER in materials science, particularly for datasets of fewer than 1,000 annotated examples. The higher-quality predictions offered by MatBERT models can be expected to accelerate the creation of previously infeasible structured datasets from unstructured text.
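
For readers who want to reproduce this kind of comparison, the sketch below shows how a BERT-style encoder can be prepared for token-classification (NER) fine-tuning with the Hugging Face Transformers library. The checkpoint name, label set, helper function, and example sentence are illustrative assumptions rather than the authors' exact pipeline; in practice one would substitute the MatBERT or SciBERT weights and the annotated materials science datasets described above.

# Minimal sketch (assumptions noted): set up a BERT-style encoder for NER fine-tuning.
# "bert-base-cased" stands in for domain-specific (MatBERT/SciBERT) weights, and the
# BIO label set below is a hypothetical subset of a solid-state materials tag scheme.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-MAT", "I-MAT", "B-APL", "I-APL"]   # hypothetical BIO tags
id2label = {i: tag for i, tag in enumerate(labels)}
label2id = {tag: i for i, tag in id2label.items()}

checkpoint = "bert-base-cased"  # swap in materials-science pre-trained weights here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

def tokenize_and_align(words, word_tags):
    """Align word-level BIO tags to WordPiece subwords; continuation subwords
    receive -100 so they are ignored by the cross-entropy loss."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[word_tags[word_id]])
        previous = word_id
    enc["labels"] = aligned
    return enc

# Example: one annotated sentence in the style of the solid-state dataset.
features = tokenize_and_align(
    ["LiFePO4", "is", "a", "cathode", "material", "."],
    ["B-MAT", "O", "O", "B-APL", "I-APL", "O"],
)

Fine-tuning a model prepared this way on each annotated dataset would yield the kind of per-entity and overall NER comparisons reported in the paper.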

Keywords: Natural Language Processing, NLP, Named Entity Recognition, NER, BiLSTM, BERT, Transformers, Language Model, Pre-train, Materials Science, Solid State, Doping, Gold Nanoparticles

Suggested Citation

Walker, Nicholas and Trewartha, Amalie and Huo, Haoyan and Lee, Sanghoon and Cruse, Kevin and Dagdelen, John and Dunn, Alexander and Persson, Kristin and Ceder, Gerbrand and Jain, Anubhav, The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science. Available at SSRN: https://ssrn.com/abstract=3950755 or http://dx.doi.org/10.2139/ssrn.3950755
This version of the paper has not been formally peer reviewed.

Nicholas Walker (Contact Author)

Lawrence Berkeley National Laboratory - Energy Storage & Distributed Resources Division

Berkeley, CA
United States

Amalie Trewartha

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

Haoyan Huo

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

Sanghoon Lee

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

Kevin Cruse

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

John Dagdelen

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

Alexander Dunn

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

Kristin Persson

Lawrence Berkeley National Laboratory

210 Hearst Mining Building
Berkeley, CA 94720-1760
United States

Gerbrand Ceder

Lawrence Berkeley National Laboratory

1 Cyclotron Road
Berkeley, CA 94720
United States

Anubhav Jain

Lawrence Berkeley National Laboratory

Paper statistics

Downloads: 85
Abstract Views: 1,966