Context Matters: A Strategy to Pre-train Language Model for Science Education

9 Pages Posted: 27 Jan 2023

Zhengliang Liu

University of Georgia

Xinyu He

University of Georgia

Lei Liu

Independent

Tianming Liu

University of Georgia

Xiaoming Zhai

The University of Georgia

Date Written: January 27, 2023

Abstract

This study aims to improve the performance of automatically scoring student responses in science education. BERT-based language models have shown significant superiority over traditional NLP models in various language-related tasks. However, students' science writing, including argumentation and explanation, is domain-specific. In addition, the language used by students differs from the language of the journal articles and Wikipedia text on which BERT and its existing variants were trained. These observations suggest that a domain-specific model pre-trained on science education data may improve model performance. However, the ideal type of data for contextualizing a pre-trained language model and improving its performance in automatically scoring student written responses remains unclear. Therefore, we employ different data in this study to contextualize both BERT and SciBERT models and compare their performance on automatic scoring of assessment tasks for scientific argumentation. We use three datasets to pre-train the models: 1) journal articles in science education, 2) a large dataset of students' written responses (sample size over 50,000), and 3) a small dataset of students' written responses to scientific argumentation tasks. Our experimental results show that in-domain training corpora constructed from science questions and responses improve language model performance on a wide variety of downstream tasks. Our study confirms the effectiveness of continual pre-training on domain-specific data in the education domain and demonstrates a generalizable strategy for automating science education tasks with high accuracy. We plan to release our data and SciEdBERT models for public use and community engagement.
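The continual pre-training strategy described in the abstract can be illustrated with a minimal sketch using the Hugging Face Transformers library. The base checkpoint, corpus file name, and hyperparameters below are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch (not the authors' code): continual masked-language-model
# pre-training of BERT/SciBERT on an in-domain corpus of student responses.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# SciBERT checkpoint as a starting point; "bert-base-uncased" works the same way.
base = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical corpus file: one student written response per line.
dataset = load_dataset("text", data_files={"train": "student_responses.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="sciedbert-continual",   # illustrative output path
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The resulting checkpoint can then be fine-tuned with a classification head for the downstream task of scoring student argumentation responses.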

Keywords: science education, natural language processing, BERT, context, SciEdBERT

Suggested Citation

Liu, Zhengliang and He, Xinyu and Liu, Lei and Liu, Tianming and Zhai, Xiaoming, Context Matters: A Strategy to Pre-train Language Model for Science Education (January 27, 2023). Available at SSRN: https://ssrn.com/abstract=4339205 or http://dx.doi.org/10.2139/ssrn.4339205

Zhengliang Liu

University of Georgia

Athens, GA 30602-6254
United States

Xinyu He

University of Georgia

Lei Liu

Independent

Tianming Liu

University of Georgia

Athens, GA 30602-6254
United States

Xiaoming Zhai (Contact Author)

The University of Georgia

110 Carlton Street
Athens, GA 30602
United States
7065424548 (Phone)
