KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

12 pages. Posted: 2 Apr 2025

Michael James Bommarito

273 Ventures; ALEA Institute; Stanford Center for Legal Informatics; Michigan State College of Law; Bommarito Consulting, LLC

Daniel Martin Katz

Illinois Tech - Chicago Kent College of Law; Bucerius Center for Legal Technology & Data Science; Stanford CodeX - The Center for Legal Informatics; 273 Ventures; ALEA Institute

Jillian Bommarito

273 Ventures; ALEA Institute

Date Written: March 21, 2025

Abstract

We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9% to 17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks such as OCR postprocessing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. Together, the two families help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
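
The efficiency figures in the abstract are relative token counts, and their link to context-window capacity is simple arithmetic. The sketch below is illustrative only: the function names and the token counts (1000 and 870) are hypothetical and not taken from the paper, and the capacity estimate assumes the reduction applies uniformly across a document.

```python
def percent_fewer_tokens(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percentage by which a candidate tokenizer uses fewer tokens than a baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def context_capacity_gain(percent_fewer: float) -> float:
    """Extra text (in percent) that fits a fixed token budget.

    Assumes the token reduction is uniform over the text, which is a
    simplification made for this sketch.
    """
    remaining = 1.0 - percent_fewer / 100.0
    if remaining <= 0.0:
        raise ValueError("percent_fewer must be below 100")
    return 100.0 * (1.0 / remaining - 1.0)


if __name__ == "__main__":
    # Hypothetical counts for the same document under two tokenizers.
    saving = percent_fewer_tokens(baseline_tokens=1000, candidate_tokens=870)
    print(f"fewer tokens: {saving:.1f}%")
    print(f"more text per window: {context_capacity_gain(saving):.1f}%")
```

Under this simplification, the 9% and 17% ends of the reported range would correspond to roughly 10% and 20% more text in a fixed-size context window.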

Suggested Citation

Bommarito, Michael James and Katz, Daniel Martin and Bommarito, Jillian, KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications (March 21, 2025). Available at SSRN: https://ssrn.com/abstract=5188502 or http://dx.doi.org/10.2139/ssrn.5188502

ALEA Institute

HOME PAGE: https://aleainstitute.ai/

Stanford Center for Legal Informatics

559 Nathan Abbott Way
Stanford, CA 94305-8610
United States

Michigan State College of Law

318 Law College Building
East Lansing, MI 48824-1300
United States

Bommarito Consulting, LLC

MI 48098
United States

Daniel Martin Katz (Contact Author)

Illinois Tech - Chicago Kent College of Law

565 W. Adams St.
Chicago, IL 60661-3691
United States

HOME PAGE: http://www.danielmartinkatz.com/

Bucerius Center for Legal Technology & Data Science

Jungiusstr. 6
Hamburg, 20355
Germany

HOME PAGE: http://legaltechcenter.de/

Stanford CodeX - The Center for Legal Informatics

559 Nathan Abbott Way
Stanford, CA 94305-8610
United States

HOME PAGE: http://law.stanford.edu/directory/daniel-katz/

273 Ventures

HOME PAGE: http://273ventures.com

ALEA Institute

HOME PAGE: https://aleainstitute.ai

Jillian Bommarito

273 Ventures

ALEA Institute
