The Impact of Domain-Specific Pre-Training on Named Entity Recognition Tasks in Materials Science
43 Pages · Posted: 26 Oct 2021 · Publication Status: Published
Abstract
The massive increase in materials science publications has resulted in a bottleneck in efficiently connecting new materials discoveries to knowledge from the established literature. This problem may be addressed through the use of named entity recognition (NER) to automatically extract structured summary-level data from unstructured text in published materials science articles. We compare the performance of four different NER models on three different annotated materials science datasets. The four models consist of a Bi-directional Long Short-Term Memory (BiLSTM) model enhanced with word embeddings pre-trained on materials science articles, the original BERT model trained on general text, a BERT variant pre-trained on a broad corpus of scientific articles spanning multiple fields (SciBERT), and a domain-specific BERT variant pre-trained exclusively on materials science articles (MatBERT, this work). Each of the annotated datasets consists of a collection of paragraphs sourced from the literature and annotated with summary-level information relevant to the topic of the dataset; we explore NER performance on datasets for (1) solid state materials, their phases, sample descriptions, and applications, among other features of interest, (2) dopant species, host materials, and dopant concentrations, and (3) gold nanoparticle descriptions and morphologies. The MatBERT model achieves the best overall NER performance across the datasets and consistently outperforms the other models on predicting individual entity classes. MatBERT improves over the other two BERT-Base-based NER models by ≈ 1-12% depending on the NER task, which implies that explicit pre-training on materials science text rather than general text provides a measurable advantage. The original BERT model, which was not specifically pre-trained on scientific text, performs worse than both MatBERT and SciBERT by ≈ 3-12% depending on the NER task, reinforcing the importance of selecting an appropriate pre-training corpus. Despite its architectural simplicity compared to BERT, the BiLSTM model consistently outperforms the original BERT model, perhaps due to its use of a tokenizer tailored to materials science text as well as domain-specific pre-trained word embeddings. Learning curves indicate that the MatBERT and SciBERT models outperform the original BERT model by an even greater margin in the small-data limit. These results suggest that domain-specific pre-training provides a measurable advantage for NER in materials science, particularly for datasets of fewer than 1,000 annotated examples. The higher-quality predictions offered by MatBERT models can be expected to accelerate the creation of previously infeasible structured datasets from unstructured text.
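As an illustration of the kind of fine-tuning setup the abstract describes, the minimal sketch below adapts a pre-trained BERT-style encoder for token-level NER with the HuggingFace transformers library. The checkpoint name, label set, and dataset variables are placeholders for illustration only; they are not the MatBERT weights or the annotated datasets released with this work.

```python
# Minimal sketch: fine-tune a pre-trained BERT-style encoder for NER
# (token classification) using HuggingFace transformers.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "bert-base-cased"   # placeholder; swap in a MatBERT/SciBERT checkpoint
LABELS = ["O", "B-MAT", "I-MAT"]  # illustrative tag set, not the paper's full schema

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

# train_dataset / eval_dataset are assumed to be tokenized, label-aligned
# datasets built from annotated paragraphs like those described above.
args = TrainingArguments(
    output_dir="ner-out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed to exist
    eval_dataset=eval_dataset,    # assumed to exist
    tokenizer=tokenizer,
)
trainer.train()
```

The same fine-tuning loop applies to each encoder compared in the paper; only the pre-trained weights (and, for the BiLSTM baseline, the tokenizer and embeddings) change.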
Keywords: Natural Language Processing, NLP, Named Entity Recognition, NER, BiLSTM, BERT, Transformers, Language Model, Pre-train, Materials Science, Solid State, Doping, Gold Nanoparticles