LLM-Japanese-Dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology

12 Pages. Posted: 7 Jun 2023

Masanori HIRANO

Preferred Networks, Inc.

Masahiro Suzuki

University of Tokyo; Sumitomo Mitsui Trust Bank, Limited - Nikko Asset Management Co., Ltd.

Hiroki Sakaji

The University of Tokyo

Date Written: May 22, 2023

Abstract

This study constructed a Japanese chat dataset of about 8.4 million records for tuning large language models (LLMs). LLMs have recently been developed and are gaining popularity. However, high-performing LLMs are mostly built for English. There are two ways to support languages other than English with LLMs: constructing LLMs from scratch or tuning existing models. Both approaches require datasets. In this study, we focused on supporting Japanese in LLMs and built a dataset for training or tuning LLMs in Japanese.

The dataset we constructed covers various tasks, such as translation and knowledge tasks. In our experiment, we tuned an existing LLM using our dataset and evaluated its performance qualitatively. The results suggest that our dataset may be beneficial for LLMs. However, we also identified some difficulties in constructing LLMs in languages other than English.

Keywords: Large Language Model, Dataset, Japanese, Chat

Suggested Citation

HIRANO, Masanori and Suzuki, Masahiro and Sakaji, Hiroki, LLM-Japanese-Dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology (May 22, 2023). Available at SSRN: https://ssrn.com/abstract=4454626 or http://dx.doi.org/10.2139/ssrn.4454626

Masanori HIRANO (Contact Author)

Preferred Networks, Inc.

Otemachi Bldg., 1-6-1 Otemachi
Chiyoda-ku, Tokyo 1000004
Japan

Masahiro Suzuki

University of Tokyo

Hongo 7-3-1
Bunkyo-ku
Tokyo, Tokyo 113-8657
Japan

Sumitomo Mitsui Trust Bank, Limited - Nikko Asset Management Co., Ltd.

Midtown Tower
9-7-1 Akasaka
Minato-ku, Tokyo 107-6242
Japan

Hiroki Sakaji

The University of Tokyo

7-3-1 Hongo
Bunkyo-ku
Tokyo, 113-0033
Japan
