Controlled Self-Evolution for Algorithmic Code Optimization
Abstract
Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.
🎯Research Motivation
• Self-evolution methods have low exploration efficiency under limited budgets, failing to discover solutions with superior time and space complexity despite functional correctness.
• Initialization bias from starting with one/few base-model solutions traps search in poor regions and causes premature convergence.
• Uncontrolled stochastic mutations/crossovers lack feedback guidance, leading to undirected exploration and wasted iterations.
• Insufficient experience reuse—no effective intra-task or cross-task memory—causes repeated failures and prevents leveraging proven optimization strategies.
• Practical deployment constraints (cost/latency) demand methods that achieve efficiency gains early and sustain improvement over generations.
🔧Research Method
CSE combines diversified planning initialization to produce structurally distinct starting solutions with feedback-controlled genetic evolution (soft parent selection, functional decomposition, targeted mutation, and compositional crossover). A hierarchical evolution memory captures and reuses success and failure at intra- and inter-task levels to guide future exploration.
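A minimal sketch of how such a feedback-guided evolution loop could look is given below. Every interface here (the LLM's generate_plans/implement/mutate_targeted/crossover helpers, the profiler, and the memory object) is an illustrative assumption rather than the authors' implementation.

```python
# Hypothetical CSE-style loop; helper objects (llm, memory, profile) are
# assumed interfaces, not the released code. Assumes an even pop_size.
import random

def controlled_self_evolution(task, llm, memory, profile, generations=10, pop_size=8):
    # Diversified Planning Initialization: start from structurally distinct strategies.
    population = [llm.implement(task, plan) for plan in llm.generate_plans(task, n=pop_size)]

    for _ in range(generations):
        # Feedback: correctness plus runtime/memory profiling for each candidate.
        scored = sorted(((c, profile(task, c)) for c in population),
                        key=lambda x: x[1].fitness, reverse=True)

        # Soft parent selection: fitness-weighted sampling rather than hard top-k.
        parents = random.choices([c for c, _ in scored],
                                 weights=[p.fitness for _, p in scored], k=pop_size)

        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            hints = memory.retrieve(task)  # intra-/inter-task experience
            # Targeted mutation: edit only the component the profiler flags as slow.
            children.append(llm.mutate_targeted(a, profile(task, a).bottleneck, hints))
            # Compositional crossover: recombine efficient components of two parents.
            children.append(llm.crossover(a, b, hints))

        memory.update(task, scored)  # record both successes and failures
        population = [c for c, _ in scored[:2]] + children[:pop_size - 2]

    return max(population, key=lambda c: profile(task, c).fitness)
```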
💡Research Ideas
• Meta-Learned Planning Initialization for Efficiency-Oriented Code Evolution: Learn a planner that generates diverse, complexity-aware algorithmic sketches using cross-task signals to improve early-generation efficiency.
• Program-Analysis-Guided Credit Assignment for Targeted Mutation and Crossover: Integrate static/dynamic analysis and profiling to localize inefficient/faulty components, enabling precise component extraction and safer, more effective edits.
• Retrieval-Augmented Hierarchical Evolution Memory Across Tasks and Languages: Build a scalable memory with retrieval and template parameterization to transfer optimization patterns across domains, programming languages, and problem families.
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Abstract
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles and applies a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation module that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking module that autonomously extracts and verifies report statements via web search, even when citations are missing.
🎯Research Motivation
• Evaluating long, multi-source deep research reports is fundamentally different from QA and remains poorly standardized and scalable.
• Existing benchmarks rely on expert-driven task construction that is annotation-intensive and costly, limiting coverage and realism.
• Static, one-size-fits-all evaluation dimensions miss task-specific success criteria, leading to uninformative or inflated scores.
• Fact-checking pipelines typically verify only citation-linked claims, leaving uncited statements unexamined and facts unverifiable when citations are missing.
• Many constructed tasks do not truly require external retrieval or multi-source integration, allowing LLMs to answer from parametric knowledge and diluting benchmark difficulty.
🔧Research Method
DeepResearchEval introduces an automated persona-driven task construction pipeline with Task Qualification and Search Necessity filters to retain only realistic, retrieval-dependent, multi-source research tasks, and an agentic evaluation pipeline combining Adaptive Point-wise Quality Evaluation (task-specific dimensions, criteria, and weights) with Active Fact-Checking that extracts and verifies both cited and uncited statements via web search. Together, these components enable scalable creation of hard tasks and fine-grained, evidence-grounded assessment of deep research systems.
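As a rough illustration of the two-stage filter, the sketch below keeps a generated task only if it is judged realistic and cannot be resolved from parametric knowledge alone; the `llm` and `judge` interfaces and the prompts are assumptions, not the paper's actual pipeline.

```python
# Illustrative persona-driven task filter; all helpers and prompts are hypothetical.
def build_tasks(personas, llm, judge, n_per_persona=5):
    tasks = []
    for persona in personas:
        for draft in llm.generate_tasks(persona, n=n_per_persona):
            # Stage 1: Task Qualification - keep only realistic, well-posed research tasks.
            if not judge(f"Is this a realistic, well-posed deep research task?\n{draft}"):
                continue
            # Stage 2: Search Necessity - discard tasks an LLM already answers from
            # parametric knowledge, so every retained task needs external retrieval
            # and multi-source evidence integration.
            closed_book = llm.answer_without_search(draft)
            if judge(f"Does this closed-book answer fully resolve the task?\n{closed_book}"):
                continue
            tasks.append(draft)
    return tasks
```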
💡Research Ideas
• Cross-Lingual DeepResearchEval: Multilingual Task Construction and Agentic Fact-Checking for Global Deep Research Benchmarks
• Source Reliability-Aware Agentic Fact-Checking: Calibrating Scores by Evidence Quality, Recency, and Consensus
• Human Preference-Calibrated Task-Adaptive Evaluation: Aligning Dynamic Dimensions and Weights with Expert Judgments at Scale
MAXS: Meta-Adaptive Exploration with LLM Agents
Abstract
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose MAXS (meta-adaptive exploration with LLM agents; code at https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework based on LLM agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
🎯Research Motivation
• Locally myopic generation in CoT/ToT without lookahead, leading to poor decisions about whether and how to use tools.
• Trajectory instability in multi-tool reasoning, where small early errors compound into divergent paths.
• High computational cost of global simulation approaches (e.g., MCTS), making it hard to balance effectiveness and efficiency.
• Lack of test-time strategies that integrate tool execution with reasoning planning in a value-aware, stability-conscious manner.
🔧Research Method
MAXS is a meta-adaptive test-time framework that performs limited lookahead rollouts to estimate the advantage of tool usage, then selects actions using step-consistency variance and inter-step trend slopes. A trajectory convergence mechanism halts rollouts once path consistency is reached, balancing accuracy and token cost.
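A compact sketch of the value-aware step selection described above follows; the scoring weights and the agent's rollout/consistency helpers are assumptions made for illustration, not MAXS's actual implementation.

```python
# Illustrative MAXS-style selection: lookahead rollouts, advantage estimation,
# consistency variance, trend slope, and early halting on path consistency.
import statistics

def select_next_step(agent, state, candidate_steps, horizon=3, max_rollouts=4):
    best_step, best_score = None, float("-inf")
    for step in candidate_steps:
        rollouts = []
        for _ in range(max_rollouts):
            rollouts.append(agent.rollout(state, first_step=step, depth=horizon))
            # Trajectory convergence: stop spending compute once paths agree.
            if len(rollouts) > 1 and agent.paths_consistent(rollouts):
                break
        values = [agent.value(t) for t in rollouts]
        advantage = statistics.mean(values) - agent.value_baseline(state)
        variance = statistics.pvariance(values) if len(values) > 1 else 0.0
        slope = agent.value_trend(rollouts[0])  # inter-step value trend within a rollout
        score = advantage - 0.5 * variance + 0.2 * slope  # illustrative weights
        if score > best_score:
            best_step, best_score = step, score
    return best_step
```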
💡Research Ideas
• Learning-to-Lookahead: End-to-End Training of Meta-Adaptive Rollout Policies for LLM Agents: Train a value/model-based policy that learns when and how far to roll out, replacing heuristic lookahead with learned estimators.
• Uncertainty-Aware Advantage Estimation for Tool Use in LLM Agents: Integrate calibrated uncertainty (e.g., Bayesian variance or ensembles) into advantage and consistency scoring to improve robustness under noisy tool outputs.
• Hierarchical MAXS: Multi-Scale Lookahead for Long-Horizon Mathematical Reasoning: Extend MAXS with coarse-to-fine planning layers that apply lookahead at multiple temporal scales for complex, long-horizon tasks.
A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation
Abstract
Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency and stability. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms that underlie human reasoning, which involves activating anchors and attractors, then integrating them into multi-step inference. To address this gap, we propose A^3-Bench (https://a3-bench.github.io), a benchmark designed to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation. First, we annotate 2,198 science reasoning problems across domains using the SAPM process (subject, anchor & attractor, problem, and memory developing). Second, we introduce a dual-scale memory evaluation framework utilizing anchors and attractors, along with the AAUI (Anchor–Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate A^3-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.
🎯Research Motivation
• Existing benchmarks focus on final answers or chain-of-thought coherence and neglect memory activation mechanisms, making it impossible to diagnose whether failures stem from faulty inference or inadequate retrieval/activation of prior knowledge.
• Scientific reasoning requires context-dependent activation of hierarchical memory structures—anchors (foundational units) and attractors (solution schemas)—yet datasets and evaluations rarely represent or test these dual-scale signals.
• There is no quantitative, standardized metric to measure memory utilization during reasoning; current methods do not evaluate how well models activate and use the right knowledge/templates across multi-step inference.
🔧Research Method
A3-Bench introduces a 2,198-problem benchmark annotated with dual-scale memory signals (anchors and attractors) via the SAPM process and evaluates models using AAUI, a metric quantifying activation of expert-labeled memory. The paper instantiates activation with a HybridRAG pipeline (twin-needle dense+graph retrieval plus context fabric composer) to condition LLMs on activated anchors/attractors and measure gains across no/full/gold memory paradigms.
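The sketch below shows one simple way an AAUI-style utilization score could be computed; the paper's exact definition may differ, and plain substring matching is used here only as a stand-in for detecting that an annotated anchor or attractor was activated in the reasoning trace.

```python
# Hedged AAUI-style score: mean fraction of gold anchors and attractors that a
# reasoning trace activates (substring matching is a crude proxy for activation).
def aaui(trace: str, anchors: list[str], attractors: list[str]) -> float:
    def hit_rate(units):
        if not units:
            return 1.0
        return sum(u.lower() in trace.lower() for u in units) / len(units)
    # Dual-scale: average the anchor-level and attractor-level activation rates.
    return 0.5 * (hit_rate(anchors) + hit_rate(attractors))

# Example: 2 of 3 anchors and 1 of 1 attractor activated -> roughly 0.83.
score = aaui("Apply conservation of energy and kinematics, then the work-energy theorem.",
             anchors=["conservation of energy", "kinematics", "momentum"],
             attractors=["work-energy theorem"])
```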
💡Research Ideas
• AAUI-Guided Training for Memory-Driven Reasoning: Integrate AAUI into training objectives to learn policies that reliably co-activate anchors and attractors and improve multi-step accuracy.
• Automatic Anchor–Attractor Induction from Corpora: Discover and validate anchors and attractors via representation learning and causal template extraction to scale memory libraries beyond manual annotation.
• Multimodal A3-Bench: Extending Anchor–Attractor Activation to Vision and Data: Build cross-modal anchors/attractors and assess memory-driven reasoning on multimodal science tasks (plots, diagrams, experiments).
• Causal Evaluation of Memory Activation in LLMs: Use interventional tests and ablations of memory units to establish causal links between activation, reasoning fidelity, and latency.
• Efficient Context Weaving via Compressed Memory Representations: Develop compression and selection strategies for anchors/attractors to reduce token cost and inference time while preserving activation fidelity.
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Abstract
In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
🎯Research Motivation
• Inadequate coverage of the teacher’s sequence-level output distribution during SFT, leading to poor mode coverage and suboptimal knowledge transfer.
• Misalignment between teacher outputs and student learning capacity; SFT can produce misleading gradients that amplify overconfident student errors instead of aligning with teacher preferences.
• Pronounced exposure bias from teacher-forced training versus autoregressive inference, causing distribution shift, error accumulation, and length/trajectory mismatches at test time.
• Practical limitations of logit-based/on-policy distillation (tokenizer mismatch, proprietary logit access), and overreliance on heuristic data filtering without explicit teacher–student interaction.
• Need for data-efficient distillation that achieves strong reasoning with far fewer samples than prevailing large-scale open-source efforts.
🔧Research Method
Distribution-Aligned Sequence Distillation (DASD) combines temperature-scheduled learning (low-to-high temperature sampling to balance learnability and diversity), divergence-aware sampling (prioritize sentences where teacher probability significantly exceeds student’s to avoid misleading gradients), and mixed-policy distillation (student prefixes with teacher continuations) to mitigate exposure bias—achieving SOTA 4B reasoning with just 448K samples.
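The sketch below illustrates how the three ingredients could fit together in a data-construction step; the sentence splitting, margin threshold, and teacher/student interfaces are assumptions for exposition only.

```python
# Illustrative DASD-style batch construction; all model interfaces are hypothetical.
def build_distillation_batch(teacher, student, prompts, margin=0.1, prefix_frac=0.3):
    batch = []
    for prompt in prompts:
        # Temperature schedule: sample at low temperature early in training (easy,
        # high-probability modes) and at higher temperature later (diversity).
        response = teacher.sample(prompt, temperature=teacher.current_temperature())
        for sent in response.split(". "):  # crude sentence split, for illustration
            # Divergence-aware selection: keep sentences where the teacher is clearly
            # more confident than the student, avoiding gradients that reinforce the
            # student's own overconfident errors.
            if teacher.logprob(prompt, sent) - student.logprob(prompt, sent) > margin:
                batch.append((prompt, sent))
        # Mixed-policy distillation: the student writes a prefix and the teacher
        # continues it, so training inputs resemble the student's own test-time
        # distribution and exposure bias is reduced.
        prefix = student.generate(prompt, max_frac=prefix_frac)
        batch.append((prompt, prefix + teacher.continue_from(prompt, prefix)))
    return batch
```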
💡Research Ideas
• Sequence-Probability Reweighting for Tokenizer-Agnostic Distillation: Use teacher sequence-level probabilities to reweight SFT losses and better approximate the target distribution, improving fidelity and data efficiency across heterogeneous tokenizers.
• Adaptive Mixed-Policy Distillation via Error-Aware Prefix Selection: Develop an online policy that detects high-risk student prefixes and dynamically tunes teacher intervention and on-/off-policy mixing to reduce exposure bias and verbosity.
• Tool- and Retrieval-Augmented Distribution-Aligned Distillation: Integrate external knowledge retrieval and tool use into DASD to train compact models that generalize to real-world, multi-step reasoning tasks.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Abstract
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
🎯Research Motivation
• Existing VLA models trained on robot demonstrations excel at routine skills but struggle with long-horizon planning, failure recovery, and adaptation due to limited data coverage.
• Explicit chain-of-thought reasoning improves generalization but incurs substantial inference latency, incompatible with real-time embodied control (1–15 Hz) and potentially unsafe.
• Supervised CoT requires extensive reasoning annotations and may not capture essential spatial-temporal dynamics; RL-based textual CoT (e.g., ThinkAct) remains long and slow.
• Naive acceleration (e.g., reasoning dropout) risks losing critical information, causing inconsistent planning and degraded action quality.
• There is a need to compress multimodal reasoning into compact representations that preserve visual-spatial planning and bridge high-level reasoning to low-level action execution.
🔧Research Method
Fast-ThinkAct distills explicit textual reasoning into compact continuous latent tokens via preference-guided teacher–student alignment with manipulation trajectory (visual) alignment, yielding verbalizable latent thoughts. A reasoning-enhanced policy then conditions on these latents to connect high-level multimodal planning to efficient action execution with drastically reduced inference latency.
💡Research Ideas
• Adaptive Latent CoT for Multi-Agent Embodied Collaboration: Extend verbalizable latent planning to coordinated multi-robot settings with shared trajectory alignment and on-the-fly communication.
• Safety-Aware Verbalizable Latent Planning for Autonomous Driving: Integrate risk-sensitive rewards and formal safety constraints into latent reasoning to meet real-time and safety-critical demands.
• Self-Supervised Latent Reasoning from On-Policy Experience: Reduce teacher dependence by learning and refining latent thoughts directly from interaction data using preference signals and bootstrapped verbalizers.
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
Abstract
General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
🎯Research Motivation
• General LVLMs show diffuse attention, failing to separate subtle lesions from background and thus transmit diagnostic visual information inefficiently.
• Geometric capacity bottleneck in vision encoders (small, static linear layers vs. large LLM backbones) causes capacity collapse for fine-grained dermatology features.
• Conventional exact-match metrics are clinically misaligned, ignoring hierarchical proximity, safety, and therapeutic consistency.
• Supervised fine-tuning struggles with synonymy/open-vocabulary labels, overfits narrow distributions, and is inefficient for learning top-K rankings.
• Scaling parameters is computationally prohibitive and often ineffective; the field needs parameter-efficient methods that maximize recoverable information.
🔧Research Method
SkinFlow combines a Virtual-Width Dynamic Visual Encoder (FDLinear) that ‘unfolds’ lesion manifolds without physical width expansion with a two-stage GRPO-based RL: Stage I aligns explicit features via structured medical captioning, and Stage II decodes implicit textures to optimize top-K diagnosis under a clinically grounded, hierarchy-aware evaluation.
💡Research Ideas
• Interpretable SkinFlow: Clinician-Centered Attribution and Explanation Metrics for Staged RL Dermatology Models: Develop standardized interpretability metrics, richer evidence generation, and lesion-level attribution validated by user studies.
• Multimodal SkinFlow: Integrating Dermoscopy, Histopathology, and Clinical Text via Cross-Modal DVE and Hierarchical RL: Extend DVE and staged RL to fuse multiple medical modalities for stronger generalization and robustness.
• Safety-Aware Reward Shaping for Open-World Dermatological Diagnosis: Formalize safety-critical rewards, uncertainty calibration, and abstention strategies to minimize boundary-crossing (e.g., benign vs. malignant) errors.
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly perform text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.
🎯Research Motivation
• Existing 3D open-vocabulary methods require per-scene training and language-embedding alignment (e.g., CLIP/BERT/DINO), which is time-consuming (>1 hour per scene) and often needs human-annotated description–mask pairs, limiting scalability.
• Embedding-based approaches are biased toward short category tags and struggle with complex referring expressions, nuanced attributes, and arbitrary phrasing, reducing accuracy and practicality for RES.
• Current pipelines optimize high-dimensional per-primitive features via gradient descent, hindering efficiency and view-consistent instance grouping in 3D representations like sparse voxels.
• There is a need for interpretable, training-free, and flexible text-to-text reasoning that can robustly support both OVS and RES while providing fast, scene-wide object grouping and captions.
🔧Research Method
OpenVoxel is a training-free pipeline that groups SVR voxels into object instances using SAM2-driven 2D-to-3D centroid voting across views, then constructs a canonical scene map by mask-conditioned captioning with DAM and MLLMs, and finally answers OVS/RES queries via MLLM text-to-text retrieval over the stored captions and 3D centers.
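One plausible reading of the centroid-voting step is sketched below: per-view instance masks are back-projected to 3D, and each voxel cell is assigned to the instance that accumulates the most votes across views. The data layout and the `backproject` helper are assumptions, not the authors' code.

```python
# Hedged sketch of 2D-to-3D centroid voting; masks_per_view maps a view id to
# {instance id -> binary mask}, and backproject(view, mask) returns (N, 3) points.
import numpy as np

def vote_voxel_groups(masks_per_view, backproject, cell_size=0.05):
    votes = {}  # voxel cell -> {instance id -> vote count}
    for view, masks in masks_per_view.items():
        for inst_id, mask in masks.items():
            pts = backproject(view, mask)
            cell = tuple(np.floor(pts.mean(axis=0) / cell_size).astype(int))
            votes.setdefault(cell, {})
            votes[cell][inst_id] = votes[cell].get(inst_id, 0) + 1
    # Each cell is assigned to the instance with the most votes across views,
    # giving view-consistent object groups without any training.
    return {cell: max(counts, key=counts.get) for cell, counts in votes.items()}
```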
💡Research Ideas
• OpenVoxel-Rel: Structured Scene Graphs for Training-Free Relational Referring Segmentation: Augment the scene map with explicit inter-object relations and graph-based reasoning to improve queries involving spatial and functional relationships (e.g., “left of the apple,” “next to the mug”).
• Dyn-OpenVoxel: Training-Free Open-Vocabulary Understanding in Dynamic 4D Scenes: Extend grouping and captioning to time-varying scenes with motion-aware aggregation and temporal consistency to handle moving objects and long-term RES.
• Mask-LLM: Fine-Tuning Multimodal LLMs for Mask-Conditioned 3D Instance Captioning: Train or adapt MLLMs on masked inputs to enhance canonical caption quality, reduce ambiguity (e.g., replacing “object” with precise nouns), and improve retrieval robustness.
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG
Abstract
Large language models (LLMs) have achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and on the capacity of the LLM's internal information processing mechanisms to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take the relevance of the retrieved information into account during answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm can be flexibly integrated into the post-training of LLMs for any purpose and can incorporate any type of external indicator.
🎯Research Motivation
• RAG answer quality is highly sensitive to the usefulness and relevance of retrieved documents, which often vary and can include noise.
• Current LLM decoding relies solely on internal attention mechanisms and implicitly assumes relevance, causing degraded performance when input context is noisy.
• Workflow-based RAG methods (judge/filter/reason steps) are prompt-sensitive, error-prone, and add latency; fine-tuning approaches still ignore explicit relevance signals and treat documents nearly equally, lacking robustness.
• Existing systems do not explicitly incorporate document quality indicators (e.g., retriever scores) into decoding, missing a direct way to bias generation toward truly relevant content.
🔧Research Method
OpenDecoder injects explicit document-quality indicators (retriever relevance, LLM ranker score, QPP score) into the LLM’s attention computation by normalizing them into a token-level score matrix that modulates attention and reshapes generation probabilities. It further applies robustness training by replacing top-k context with documents of varied relevance and supports aggregating multiple indicators, making decoding more resilient to noisy inputs.
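A minimal sketch of the core idea, indicator scores biasing attention logits toward higher-quality documents, is given below; it assumes one scalar score per retrieved document broadcast over its tokens and is not OpenDecoder's exact formulation.

```python
# Illustrative indicator-modulated attention; shapes and normalization are assumptions.
import torch
import torch.nn.functional as F

def indicator_biased_attention(q, k, v, doc_scores, token_to_doc):
    # q, k, v: (seq, dim) tensors; doc_scores: (num_docs,) float tensor of quality
    # indicators (retriever relevance, ranker score, QPP); token_to_doc: (seq,)
    # long tensor mapping each key position to its document id, -1 for non-doc tokens.
    logits = (q @ k.T) * q.shape[-1] ** -0.5             # (seq, seq)
    bias = torch.zeros(k.shape[0])
    valid = token_to_doc >= 0
    # Normalize document scores and spread them onto the corresponding key positions.
    bias[valid] = torch.log_softmax(doc_scores, dim=0)[token_to_doc[valid]]
    logits = logits + bias.unsqueeze(0)                  # favor high-quality documents
    return F.softmax(logits, dim=-1) @ v
```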
💡Research Ideas
• Indicator-Aware Joint Training for RAG: End-to-end optimization of retriever and decoder to align retrieval scores with attention modulation, improving effectiveness and robustness under noise.
• Adaptive Multi-Indicator Fusion in Decoding: Learnable, query-specific weighting and aggregation of heterogeneous signals (relevance, ranker, QPP, trustworthiness) to optimally guide attention.
• Provable Robustness of Indicator-Modulated Attention: Formal analysis and empirical validation of stability bounds and failure modes against adversarial retrieval corruption and long-context settings.
ExpSeek: Self-Triggered Experience Seeking for Web Agents
Abstract
Experience intervention in web agents has emerged as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking: (1) estimating step-level entropy thresholds to determine intervention timing using the model's intrinsic signals; (2) designing tailored, step-level experience content. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal and reveal that even a small 4B experience model can significantly boost the performance of larger agent models.
🎯Research Motivation
• Passive global experience injection fails to adapt to dynamic, step-level observations during web interactions.
• The open web is noisy and partially observable; smaller, cost-effective LLM agents tend to explore inefficiently or answer prematurely, leading to unreliable outcomes.
• Lack of a principled, low-cost intrinsic signal to trigger when to seek experience; reward-model-based per-step analysis is costly, and static experience content is not tailored to current error patterns or timing.
🔧Research Method
ExpSeek enables proactive, step-level experience seeking by using the model’s own step entropy as a self-trigger, estimating intervention thresholds via logistic regression with bootstrap separately for process and answer steps. It builds a topic-organized experience base of behavior–mistake–guidance triplets from paired success/failure trajectories and uses an experience model to retrieve and generate step-tailored guidance conditioned on the current interaction history.
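A bare-bones version of the entropy self-trigger is sketched below; the threshold values would come from the logistic-regression fitting described above, and the agent/experience-model interfaces are assumptions.

```python
# Illustrative entropy-triggered experience seeking; helper interfaces are assumed.
import math

def step_entropy(token_probs):
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def act_with_expseek(agent, history, next_token_probs, thresholds):
    # Separate thresholds for intermediate (process) steps and final-answer steps.
    kind = "answer" if agent.is_answer_step(history) else "process"
    if step_entropy(next_token_probs) > thresholds[kind]:
        # High uncertainty: retrieve behavior-mistake-guidance triplets and let the
        # experience model tailor guidance to the current interaction history.
        guidance = agent.experience_model.generate(history)
        return agent.act(history, extra_context=guidance)
    return agent.act(history)
```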
💡Research Ideas
• Entropy-Guided Multi-Signal Triggers for Web Agents: Combine entropy with complementary signals (logit margins, calibration scores, tool feedback) to learn robust, adaptive step-level triggering policies.
• Continual Experience Base Evolution with Online Verification: Automatically curate and update triplets from live trajectories using verification, redundancy pruning, and topic reorganization to keep guidance relevant and precise.
• Meta-RL for Learning Step-Level Intervention Thresholds: Use meta-reinforcement learning to optimize task- and agent-specific trigger thresholds that balance exploration and convergence across diverse web domains.
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Abstract
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
🎯Research Motivation
• High-resolution UI screenshots produce thousands of visual tokens, creating extreme token skew that drives up computation/memory and dilutes attention.
• Precise UI grounding is highly position-sensitive; naïve visual token pruning breaks positional continuity (e.g., in M-RoPE), causing sharp accuracy drops.
• Existing pruning methods for natural images are not instruction-aware and overlook UI structure, failing to suppress large homogeneous panes while keeping fine-grained widgets.
• Current GUI grounding models achieve accuracy but lack efficiency-focused mechanisms to retain performance under aggressive visual token reduction.
🔧Research Method
FocusUI selects instruction-relevant visual tokens via a lightweight Query-Guided Saliency Scorer trained with fused supervision that combines instruction-conditioned bbox-overlap and a rule-based UI-graph prior, then applies PosPad to compress each contiguous span of dropped tokens into a single learned marker at the span's last index to preserve positional continuity, enabling accurate yet faster UI grounding.
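The PosPad idea can be illustrated with a toy example: every contiguous run of dropped visual tokens is collapsed into a single marker placed at the run's last index, so the retained tokens keep their original positions. The function below is purely illustrative and not the paper's implementation.

```python
# Toy PosPad-style span compression over a retention mask; illustrative only.
def pospad(keep_mask, pad_marker="<POSPAD>"):
    # keep_mask[i] is True if visual token i is retained.
    output, run_end = [], None
    for i, keep in enumerate(keep_mask):
        if keep:
            if run_end is not None:
                output.append((run_end, pad_marker))  # one marker per dropped span
                run_end = None
            output.append((i, f"tok_{i}"))
        else:
            run_end = i  # extend the current dropped span
    if run_end is not None:
        output.append((run_end, pad_marker))
    return output

# Example: tokens 0-5 with tokens 1-3 dropped ->
# [(0, 'tok_0'), (3, '<POSPAD>'), (4, 'tok_4'), (5, 'tok_5')]
print(pospad([True, False, False, False, True, True]))
```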
💡Research Ideas
• Beyond PosPad: Learnable Position-Aware Compression for Multimodal Sequences: Generalize PosPad by learning compact surrogate tokens that encode both positional continuity and summarized content of dropped spans.
• Adaptive Instruction-Conditioned Token Retention for UI Grounding: Dynamically set per-query retention ratios using uncertainty/complexity estimators to balance accuracy and efficiency on the fly.
• Temporal FocusUI: Position-Preserving Token Selection for Video-based GUI Interactions: Extend FocusUI to multi-step UI sequences, preserving both spatial and temporal continuity for dynamic UI grounding and navigation.
EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines
Abstract
While LLM-based agents have shown promise for deep research, most existing approaches rely on fixed workflows that struggle to adapt to real-world, open-ended queries. Recent work therefore explores self-evolution by allowing agents to rewrite their own code or prompts to improve problem-solving ability, but unconstrained optimization often triggers instability, hallucinations, and instruction drift. We propose EvoFSM, a structured self-evolving framework that achieves both adaptability and control by evolving an explicit Finite State Machine (FSM) instead of relying on free-form rewriting. EvoFSM decouples the optimization space into macroscopic Flow (state-transition logic) and microscopic Skill (state-specific behaviors), enabling targeted improvements under clear behavioral boundaries. Guided by a critic mechanism, EvoFSM refines the FSM through a small set of constrained operations, and further incorporates a self-evolving memory that distills successful trajectories as reusable priors and failure patterns as constraints for future queries. Extensive evaluations on five multi-hop QA benchmarks demonstrate the effectiveness of EvoFSM. In particular, EvoFSM reaches 58.0% accuracy on the DeepSearch benchmark. Additional results on interactive decision-making tasks further validate its generalization.
🎯Research Motivation
• Fixed, static agent workflows cannot adapt to open-ended, real-world research queries that require dynamic, multi-hop reasoning paths.
• Unconstrained self-evolution (free-form prompt/code/tool rewriting) causes instability, hallucinations, instruction drift, and corruption of working modules.
• Lack of explicit structural guardrails: existing methods don’t separate macroscopic workflow planning from microscopic execution skills, making evolution opaque and hard to control.
• Poor experience accumulation: agents rarely distill successful trajectories or constrain against past failure modes, limiting continual improvement.
• Iterative retrieve–reason loops are prone to inefficient looping and missing verification steps without explicit transition logic and validation states.
🔧Research Method
EvoFSM models the research process as an explicit finite state machine that decouples macroscopic Flow (state-transition logic) from microscopic Skill (state-specific behaviors), and evolves only via critic-guided, constrained atomic operations (e.g., ADD_STATE, MODIFY_TRANSITION, REVISE_INSTRUCTION). A self-evolving memory retrieves successful priors and failure constraints to initialize and steer per-query FSM refinement, enabling adaptive yet controlled evolution.
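A small sketch of what a constrained FSM edit interface might look like is shown below; the operation names follow the summary, while the data layout and example states are assumptions.

```python
# Hedged sketch of an FSM with a whitelisted edit interface; illustrative only.
from dataclasses import dataclass, field

@dataclass
class ResearchFSM:
    states: dict = field(default_factory=dict)       # name -> instruction (Skill)
    transitions: dict = field(default_factory=dict)  # (state, condition) -> next state (Flow)

    def apply(self, op: str, **kw):
        # Evolution is restricted to a small set of atomic operations, so a critic
        # can audit every change and free-form rewriting is ruled out.
        if op == "ADD_STATE":
            self.states[kw["name"]] = kw["instruction"]
        elif op == "MODIFY_TRANSITION":
            self.transitions[(kw["src"], kw["condition"])] = kw["dst"]
        elif op == "REVISE_INSTRUCTION":
            self.states[kw["name"]] = kw["instruction"]
        else:
            raise ValueError(f"Operation {op!r} is outside the allowed edit space")

fsm = ResearchFSM()
fsm.apply("ADD_STATE", name="search", instruction="Issue a web query for the open sub-question.")
fsm.apply("ADD_STATE", name="verify", instruction="Cross-check the candidate answer against sources.")
fsm.apply("MODIFY_TRANSITION", src="search", condition="evidence_found", dst="verify")
```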
💡Research Ideas
• Verification-Guided Critics for Reliable Self-Evolution: Integrate external verifiers, tool-grounded checks, and formal consistency tests to make the critic robust and reduce hallucination-driven mis-evolution.
• Scalable Lifelong Memory for Structured Agent Evolution: Develop consolidation, abstraction, and pruning mechanisms for the experience pool to maintain retrieval efficiency and avoid outdated or redundant strategies.
• FSM-Aware Policy Distillation for Compact Research Agents: Distill the flow–skill decomposition and atomic-operation policies into smaller specialized models via supervised/RL training, improving efficiency while preserving controllability.
Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
Abstract
Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's desire to please user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more directive analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled 2 × 2^4 design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting tailored defenses rather than uniform robustness. These findings offer a novel, reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes like RLHF, enabling better trade-offs in the product iteration of LLMs by offering a more nuanced understanding of preference alignment risks and the impact of manipulative prompts.
🎯Research Motivation
• Preference-aligned LLMs can be steered by manipulative prompt styles (PUA) to favor user-pleasing agreement over truth, degrading factuality on benign, verifiable tasks.
• Existing evaluations largely report aggregate scores and lack fine-grained, interpretable diagnostics that attribute behavior shifts to system objectives and specific prompt-style factors.
• There is no controlled factorial methodology jointly parameterizing truth vs appeasement objectives and multidimensional PUA cues to quantify main and interaction effects, hindering tailored defenses and product iteration.
🔧Research Method
A reproducible 2×2^4 factorial prompt framework toggles the system objective (truth-oriented vs appeasement-oriented) and four PUA-style factors (directive control, personal derogation, conditional approval, reality denial), measuring deference (LLM-as-judge compliance with a wrong hint) and factuality (MMLU/CMMLU accuracy). Effects are estimated via logistic factorial regression with contrast coding and item-clustered robust standard errors to yield interpretable main and interaction profiles across models.
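Below is a hedged sketch of how such a factorial analysis could be run with statsmodels; the column names, the contrast coding, and the hypothetical pua_trials.csv file are assumptions, and the paper's exact model specification may differ.

```python
# Hedged sketch of a factorial logistic regression with item-clustered robust SEs.
# Assumed columns: correct (0/1), objective and four PUA factors coded 0/1, item ids.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pua_trials.csv")  # hypothetical per-trial outcome file

# Contrast-code factors (-0.5 / +0.5) so main effects stay interpretable.
for col in ["objective", "directive", "derogation", "approval", "denial"]:
    df[col] = df[col].map({0: -0.5, 1: 0.5})

model = smf.logit(
    "correct ~ objective * (directive + derogation + approval + denial)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["item"]})
print(model.summary())
```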
💡Research Ideas
• Factorial Diagnostics for Open-Ended Assistant Behavior: Extend the framework to open-ended tasks using rubric-based or pairwise judgments to robustly quantify deference and factuality shifts under PUA.
• Learning to Resist Preference-Undermining Attacks via Reward Shaping and Objective Control: Design tailored defenses (e.g., reward models penalizing agreement with wrong hints, objective-switching policies) and evaluate their impact on factor-level susceptibility.
• Adversarial PUA Red Teaming for Robust Alignment: Integrate generative PUA attackers into RLHF/DPO pipelines to adversarially train models against manipulative styles and measure robustness gains.
TranslateGemma Technical Report
Abstract
We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
🎯Research Motivation
• The field lacks strong open machine translation models that support transparency, reproducibility, and community-driven innovation; many existing systems are closed or general-purpose LLMs not optimized for MT.
• Multilingual LLMs like Gemma 3 are potent but not specifically tuned for translation quality and efficiency, leading to suboptimal performance compared to specialized MT systems.
• High-quality parallel data is scarce across many language pairs (especially low-resource), and existing synthetic data pipelines often lack careful curation to maintain quality.
• Training approaches seldom optimize directly for translation quality using learned reward signals; limited use of RL tailored to MT hampers systematic improvements.
• Specialized MT fine-tuning can degrade multimodal capabilities; there is a need to enhance text translation while retaining or improving image translation performance.
🔧Research Method
TranslateGemma uses a two-stage fine-tuning pipeline: supervised fine-tuning on a curated mix of high-quality human-translated and Gemini-generated synthetic parallel data, followed by reinforcement learning that optimizes translations using an ensemble of reward models (e.g., MetricX-QE, AutoMQM). This approach yields consistent gains over Gemma 3 across many language pairs and model sizes while preserving multimodal capabilities.
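As a tiny illustration of the RL stage's reward ensembling, the helper below combines per-metric scores with fixed weights; the metric interfaces and weights are placeholders, since the report's exact aggregation is not described here.

```python
# Placeholder reward-ensemble combiner; reward_models maps a name to a callable
# (source, hypothesis) -> quality score, and weights maps a name to its weight.
def ensemble_reward(source, hypothesis, reward_models, weights):
    return sum(weights[name] * rm(source, hypothesis) for name, rm in reward_models.items())

# e.g. ensemble_reward(src, hyp, {"metricx_qe": metricx, "automqm": automqm},
#                      {"metricx_qe": 0.5, "automqm": 0.5})
```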
💡Research Ideas
• Adaptive Reward Ensemble Tuning for Machine Translation: Dynamically weight and select reward models (MetricX-QE, AutoMQM, etc.) per language/domain to further boost translation quality and training stability.
• TranslateGemma-MM: Unified Multimodal Translation across Text, Image, and Speech: Extend TranslateGemma to jointly learn translation from text, images, and speech, evaluating on Vistra and speech translation benchmarks.
• Data-Centric TranslateGemma: Quality-Aware Synthetic Generation for Low-Resource Languages: Develop generation and filtering strategies for synthetic parallel data tailored to low-resource pairs to maximize coverage while minimizing noise.
Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Abstract
Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step "imagined" trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks.
🎯Research Motivation
• LLM agents exhibit shallow grounding and lack the ability to project long-term consequences, causing irreversible errors during execution.
• Existing world-model usage is limited to single-step or fixed-horizon rollouts, failing to capture long-term dependencies in complex tasks.
• Fixed-depth rollouts waste computation on trivial decisions and can amplify model errors without adaptively allocating foresight to high-stakes actions.
• The standard POMDP framework does not integrate imagined futures with observations, limiting policy learning from prospective consequences.
🔧Research Method
ITP formalizes decision-making as a POIMDP, conditioning actions on current observations and multi-step imagined trajectories produced by an LLM-based world model with an adaptively chosen lookahead horizon. It provides a training-free variant (ITP-I) that reflects on imagined futures at inference and a reinforcement-trained variant (ITP-R) that learns a horizon predictor via pseudo-labeling and jointly optimizes policy and lookahead through warm-up training and online RL.
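A skeletal version of an imagine-then-plan decision step is sketched below; the policy and world-model interfaces, and the way the horizon is chosen, are assumptions meant only to make the control flow concrete.

```python
# Illustrative ITP-style decision step: adaptive lookahead with a learned world model.
def imagine_then_act(policy, world_model, obs, goal, max_horizon=5):
    # Adaptive lookahead: deeper imagination when progress toward the goal is
    # uncertain or high-stakes, shallow (or none) for near-trivial decisions.
    horizon = policy.predict_horizon(obs, goal, max_horizon=max_horizon)
    imagined, state = [], obs
    for _ in range(horizon):
        action = policy.propose(state, goal)
        state = world_model.step(state, action)   # imagined next observation
        imagined.append((action, state))
    # Fuse the real observation with imagined consequences (achieved progress,
    # potential conflicts) before committing to an action in the real environment.
    return policy.act(obs, goal, imagined_trajectory=imagined)
```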
💡Research Ideas
• Uncertainty-Aware Adaptive Lookahead for World Models: Incorporate model uncertainty to adjust imagination depth and mitigate error compounding during long rollouts.
• Hierarchical Imagine-Then-Plan for Long-Horizon Tasks: Combine high-level strategic imagination with low-level tactical planning to scale POIMDPs to complex, multi-stage tasks.
• Safety-Constrained POIMDPs for Risk-Sensitive Agents: Integrate safety critics and constraint satisfaction into imagined trajectories to prevent hazardous actions before real execution.
Geometric Stability: The Missing Axis of Representations
Abstract
Analysis of learned representations has a blind spot: it focuses on similarity, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce geometric stability, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present Shesha, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated (ρ ≈ 0.01) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2× more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability (ρ = 0.89–0.96); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying how reliably systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
🎯Research Motivation
• Representation analysis over-relies on similarity (e.g., RSA, CKA) and ignores robustness, leaving a blind spot in how reliably representational geometry holds under perturbation.
• Similarity metrics can align content yet miss fine-grained structural drift and collapse when top principal components are removed, failing to capture functional geometry.
• Existing tools trigger false alarms from non-functional noise or miss true geometric degradation, undermining safety monitoring and auditing of foundation models.
• There is no metric that predicts controllability (e.g., linear steerability) or disentangles geometry from transferability for informed model selection.
• Cross-domain auditing (ML and biology) needs a measure that generalizes across modalities, predicting coherence in CRISPR perturbations and neural-behavioral coupling.
🔧Research Method
Introduce geometric stability and the Shesha framework, which measures how reliably representational manifold structure is preserved under controlled perturbations, resampling, or context shifts. Shesha quantifies stability independently of similarity, remaining sensitive to fine-grained geometry (even after removing top PCs) and demonstrating utility in safety monitoring, controllability, and model selection.
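One simple way to operationalize "how reliably geometry holds under perturbation" is sketched below: correlate pairwise embedding distances before and after perturbations. Shesha's actual estimator is richer; this sketch assumes the perturbation preserves the number of inputs.

```python
# Hedged stability sketch: rank-correlate the pairwise-distance geometry of
# embeddings before and after perturbation (not Shesha's actual estimator).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def stability_score(embed, inputs, perturb, n_trials=10, seed=0):
    rng = np.random.default_rng(seed)
    base = pdist(embed(inputs))                          # reference geometry
    scores = []
    for _ in range(n_trials):
        rho, _ = spearmanr(base, pdist(embed(perturb(inputs, rng))))
        scores.append(rho)
    return float(np.mean(scores))
```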
💡Research Ideas
• Geometric Stability-Regularized Training for Robust, Steerable Representations: Integrate stability objectives into training to preserve manifold structure, improve drift resilience, and enhance linear steerability.
• Shesha-Bench: A Cross-Domain Benchmark for Geometric Stability: Standardize perturbation protocols and evaluation metrics across vision, language, audio, and biology to compare and advance stability measurement.
• Mitigating the Geometric Tax in Transfer Learning: Stability-Preserving Adaptation Methods: Develop adaptation techniques that maintain representational geometry during fine-tuning and domain shift, reducing the stability loss observed in transfer optimization.
The AI Hippocampus: How Far are We From Human Memory?
Abstract
Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models and Multi-Modal LLMs. As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations, such as textual corpora, dense vectors, and graph-based structures, thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability.
🎯Research Motivation
• The landscape of memory mechanisms in LLMs/MLLMs is fragmented, lacking a cohesive taxonomy that connects implicit (parametric), explicit (retrieval), and agentic (persistent) memory.
• Static parametric memory in transformers constrains continual learning, personalization, and updatable knowledge; current memory editing methods are limited in scalability, locality, and safety.
• Explicit memory systems (text, vector, graph) face challenges in representation quality, training integration, retrieval robustness, and factual consistency, especially under long contexts.
• Agentic memory for autonomous, temporally extended behavior is underdeveloped, with open issues in consolidation, self-consistency, planning, and multi-agent collaboration.
• Multimodal memory integration (vision, language, audio, action) lacks unified architectures and benchmarks; cross-modal coherence and alignment remain difficult.
• System-level constraints—including memory capacity management, interoperability across components, alignment/safety (privacy, hallucination control), and standardized evaluation—are unresolved.
🔧Research Method
The paper is a comprehensive survey that proposes a three-tier taxonomy (implicit, explicit, agentic memory) and synthesizes architectures, representations, training paradigms, and evaluation protocols for memory in LLMs and MLLMs. It consolidates benchmarks, system designs, and open challenges across text and multimodal settings to guide the development of memory-augmented (M)LLMs.
💡Research Ideas
• Unified Memory Benchmark Suite for (M)LLMs: Measuring Capacity, Consistency, and Interoperability: Design standardized tasks and metrics spanning implicit, explicit, and agentic memory across modalities to evaluate recall, updateability, factual consistency, and cross-system interoperability.
• Safe and Auditable Memory Editing at Scale: Localized Updates for Parametric and Externalized Knowledge: Develop algorithms and tooling for precise, scalable, and auditable edits to model parameters and external stores, with guarantees on locality, alignment, and privacy.
• Cross-Modal Memory Alignment for Embodied Agents: Unified Representations from Vision–Language–Audio to Action: Create architectures and training strategies that align multimodal memories for retrieval, reasoning, and planning, validated on long-horizon robotics and interactive tasks.
Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
Abstract
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
🎯Research Motivation
• Diffusion-based video generation is computationally expensive, preventing real-time use in embodied AI and AR/VR.
• Existing camera-controlled methods still generate every frame with neural networks, ignoring video redundancy and 3D scene structure.
• 3D priors are used only internally in prior work; final frames are not rendered from explicit 3D reconstructions, missing speed and geometric consistency benefits.
• Fixed keyframe densities fail to adapt to trajectory complexity, causing either wasted compute or incomplete reconstructions.
• 2D frame interpolation between sparse keyframes yields morphing artifacts and cannot honor large camera viewpoint changes.
• Long trajectories cause drift in diffusion outputs, making single global 3D reconstruction blurry without consistency controls.
🔧Research Method
SRENDER predicts an adaptive keyframe budget from the camera path and input image, generates sparse keyframes via history-guided diffusion with progressive training, reconstructs a 3D Gaussian Splatting scene deterministically (AnySplat), aligns poses, and renders the dense video. Temporal chunking and a two-stage inference scheme ensure long-range consistency and scalability while achieving >20–40× speed-ups over dense diffusion.
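To make the adaptive keyframe budget concrete, here is a minimal sketch; it is a heuristic stand-in for the paper's learned predictor, with made-up constants, that maps camera-trajectory complexity to a keyframe count and picks evenly spaced keyframe indices:

```python
# Sketch only: a heuristic keyframe-budget estimator, not SRENDER's trained model.
import numpy as np

def trajectory_complexity(positions: np.ndarray, yaws: np.ndarray) -> float:
    """Sum of per-step translation and rotation magnitudes along the camera path."""
    trans = np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()
    rot = np.abs(np.diff(yaws)).sum()
    return trans + rot  # hypothetical 1:1 weighting of translation vs. rotation

def keyframe_budget(complexity: float, n_frames: int,
                    min_k: int = 4, frames_per_unit: float = 2.0) -> int:
    """Map complexity to a keyframe count, clamped to [min_k, n_frames]."""
    k = int(min_k + frames_per_unit * complexity)
    return max(min_k, min(k, n_frames))

def keyframe_indices(n_frames: int, k: int) -> list[int]:
    """Evenly spaced keyframes; denser when the budget k is larger."""
    return np.linspace(0, n_frames - 1, k).round().astype(int).tolist()

# Example: a 20-second, 24 fps trajectory (480 frames) with mild motion.
positions = np.cumsum(np.random.default_rng(0).normal(0, 0.02, (480, 3)), axis=0)
yaws = np.linspace(0.0, 0.5, 480)
k = keyframe_budget(trajectory_complexity(positions, yaws), n_frames=480)
print(k, keyframe_indices(480, k)[:5])
```

Only the sparse keyframes would then be produced by the diffusion model; the remaining frames are rendered from the reconstructed 3D scene.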
💡Research Ideas
• Generative 4D Scene Reconstruction from Sparse Keyframes: Extend SRENDER to dynamic scenes by combining motion-aware diffusion with dynamic Gaussian splatting for jointly modeling geometry and motion.
• End-to-End Differentiable Sparse-Keyframe Video Generation: Jointly learn keyframe selection, keyframe diffusion, and 3D reconstruction in a single differentiable pipeline optimized for quality–compute trade-offs.
• Long-Range Consistency in Camera-Controlled Diffusion via 3D-Aware Loop-Closure: Integrate 3D constraints and loop-closure losses into diffusion training to reduce drift and improve multi-view coherence over very long trajectories.
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments
Abstract
Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce 'Flow Equivariant World Models', a framework in which both self-motion and external object motion are unified as one-parameter Lie group 'flows'. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures -- particularly when there are predictable world dynamics outside the agent's current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.
🎯Research Motivation
• Accurately modeling partially observed dynamic environments where an agent’s egomotion and external object motion are intertwined
• Existing world models ignore smooth, time-parameterized symmetries and the structured algebra of flows, repeatedly relearning the same transformations from data
• Instability and drift in latent representations over long horizons, limiting generalization beyond the training rollout and weakening offscreen/occluded reasoning
• Data inefficiency and poor scalability due to the absence of group equivariance with respect to internal and external motion
🔧Research Method
Unifies self-motion and external object motion as one-parameter Lie group flows and enforces group equivariance so latent states transform predictably under these flows. This yields a stable, long-horizon latent world representation that improves performance in partially observed 2D/3D video modeling.
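The core property can be illustrated on a toy translation flow: a circular convolution commutes with the flow, which is the equivariance relation the paper enforces for richer one-parameter motions. This sketch is illustrative only and is not the paper's architecture:

```python
# Toy illustration: a translation "flow" g_t acting on a 1D signal, and a map f
# that is equivariant to it, i.e. f(g_t x) == g_t f(x).
import numpy as np

def flow_shift(x: np.ndarray, t: int) -> np.ndarray:
    """One-parameter translation flow: circularly shift the signal by t samples."""
    return np.roll(x, t)

def conv_map(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular convolution: a translation-equivariant layer."""
    n = len(x)
    return np.array([np.dot(kernel, np.take(x, np.arange(i, i + len(kernel)), mode="wrap"))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=32)
k = np.array([0.25, 0.5, 0.25])

lhs = conv_map(flow_shift(x, 5), k)   # transform, then encode
rhs = flow_shift(conv_map(x, k), 5)   # encode, then transform
print(np.allclose(lhs, rhs))          # True: the map commutes with the flow
```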
💡Research Ideas
• Learning Flow Groups from Data: Unsupervised Discovery of Symmetry in Partially Observed Dynamics: Infer one-parameter flow groups and equivariant representations directly from raw sequences without predefined transformations
• Equivariant World Models for Offscreen Reasoning and Planning: Couple flow-equivariant memory with planning to reason about occluded/offscreen entities and execute long-horizon tasks
• Composing Self-Motion and Interaction Flows in 3D: From Agents to Multi-Object Dynamics: Extend flow equivariance to non-commutative, interacting, and articulated motions with contact and collisions in 3D
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
Abstract
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework built around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.

🎯Research Motivation
• RL and RLHF often reduce output diversity in LLMs, which undermines performance on open-ended creative writing tasks where varied, novel content is essential.
• Existing diversity-enhancement methods largely rely on reward modifications, leaving the rollout process unconstrained and offering limited control over how diverse trajectories are explored.
• Prior branching/forking strategies emphasize sample efficiency or overall performance and typically branch at high-entropy tokens, so exploration is less controllable and is not explicitly optimized for diversity.
🔧Research Method
DPWriter introduces a semi-structured long Chain-of-Thought framework that decomposes generation into global planning, detailed reasoning, and final response. It uses Diverse Planning Branching based on diversity variation and a group-aware diversity reward to steer distinct trajectories and improve diversity without sacrificing quality.
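One plausible form of a group-aware diversity reward, shown below as an assumed sketch rather than the paper's exact formulation, scores each trajectory in a rollout group by its average dissimilarity to the rest of the group:

```python
# Assumed sketch of a group-aware diversity reward: trajectories that diverge
# from their rollout group earn a larger diversity bonus.
from collections import Counter
import math

def _cosine(a: Counter, b: Counter) -> float:
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def group_diversity_rewards(texts: list[str]) -> list[float]:
    """Reward each text by 1 minus its mean similarity to the other group members."""
    bags = [Counter(t.lower().split()) for t in texts]
    rewards = []
    for i, bi in enumerate(bags):
        sims = [_cosine(bi, bj) for j, bj in enumerate(bags) if j != i]
        rewards.append(1.0 - sum(sims) / len(sims))
    return rewards

group = ["the knight rode into the storm",
         "the knight rode into the storm at dawn",
         "a child mapped the city of glass birds"]
print(group_diversity_rewards(group))  # the last, most distinct story scores highest
```

In practice the similarity would come from learned embeddings rather than bag-of-words counts; the bag-of-words version just keeps the sketch self-contained.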
💡Research Ideas
• Adaptive Diversity Control for Semi-Structured RL in Creative Writing: Learn prompt-conditioned policies to dynamically adjust branching timing, depth, and diversity reward weights.
• Human Preference-Guided Diversity Rewards for Open-Ended Generation: Model and integrate user-specific diversity and novelty preferences into RLHF to personalize diverse outputs.
• Cross-Domain DPWriter: Extending Diverse Planning Branching to Dialogue, Co-Creation, and Multimodal Storytelling: Apply and evaluate the framework across conversational, collaborative, and vision-language narrative tasks.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Abstract
Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
🎯Research Motivation
• Existing MLLMs largely rely on text-only reasoning and underuse visual information in intermediate steps.
• Interleaved-modal methods typically follow a single task-specific reasoning pattern, limiting generalization across diverse multimodal tasks.
• Many real-world tasks demand diverse visual reasoning skills (zoom-in, grounding, marking, auxiliary lines, visual prediction) but lack a unified paradigm.
• Functional image generation (e.g., magnified views, annotated boxes, numbered markers) is challenging for current models.
• Step-wise interleaved multimodal annotations are scarce and costly, hindering training of generative multimodal reasoning approaches.
🔧Research Method
Omni-R1 unifies multimodal reasoning by generating intermediate images within a two-stage SFT+RL framework, using a perception alignment loss and perception-calibrated rewards to stabilize functional image generation. Omni-R1-Zero eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning trajectories.
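As a rough illustration of a perception alignment signal, the sketch below uses an assumed cosine-similarity form; the paper defines its own loss and reward on its encoders. The idea is to penalize generated intermediate images whose visual features drift from the reference features of the step they visualize:

```python
# Assumed sketch: a simple feature-alignment penalty, not Omni-R1's actual loss.
import numpy as np

def perception_alignment_loss(gen_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between generated and reference image features."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=-1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(g * r, axis=-1)))

rng = np.random.default_rng(0)
gen, ref = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))
print(perception_alignment_loss(gen, ref))  # lower is better-aligned
```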
💡Research Ideas
• Omni-R1-V: Extending Unified Generative Reasoning to Video and Embodied Interaction: Adapt the generative paradigm to temporal sequences and interactive settings with multi-step visual prediction and planning.
• Self-Supervised Omni-R1-Zero++: Scaling Bootstrapped Interleaved Visualizations without Annotations: Enhance bootstrapping via stronger synthetic visualization generators, curriculum learning, and consistency regularization to scale training.
• Perception-Verified Rewards: Combining Vision Models and Programmatic Checkers for Robust Generative Multimodal Reasoning: Develop richer, task-agnostic reward schemes that jointly verify visual and textual steps to improve RL stability and generalization.
Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models
Abstract
The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
🎯Research Motivation
• I2V models struggle to harmonize high-frequency visual constraints from a reference frame with low-frequency textual guidance, often favoring visual priors and resulting in poor prompt adherence.
• DiT-based I2V systems exhibit Semantic-Weak Layers—intermediate layers with degraded text–visual alignment (e.g., Moran’s I drops from ~0.76 to ~0.19)—that undermine instruction following during denoising.
• Condition Isolation is identified as the root cause: heterogeneous modalities (VAE latents, image tokens, text tokens) are injected independently without fine-grained pre-alignment, making it difficult to ground textual concepts to spatial regions in the initial frame.
• Existing approaches emphasize temporal consistency, aesthetics, weight initialization from T2V, or prompt engineering and generic attention tweaks; they neither diagnose nor directly fix layer-wise semantic collapse, and robust instruction-following evaluation is lacking.
🔧Research Method
Focal Guidance restores controllability by coupling text and visual conditions through Fine-grained Semantic Guidance (CLIP-based keyword selection and visual anchor injection into text values and latent maps) and by propagating semantic signals via an Attention Cache that aggregates similarity maps from semantically responsive layers to guide semantic-weak layers.
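A minimal sketch of the Attention Cache idea follows, with assumed mechanics rather than the released implementation: average text-to-visual attention maps from semantically responsive layers, then blend the cached map into a weak layer's attention to inject an explicit semantic signal:

```python
# Assumed sketch of an attention cache: aggregate maps from responsive layers
# and mix them into semantic-weak layers, then renormalize.
import numpy as np

def build_cache(attn_maps: dict[int, np.ndarray], responsiveness: dict[int, float],
                thresh: float = 0.5) -> np.ndarray:
    """Average the attention maps of layers whose responsiveness exceeds a threshold."""
    strong = [attn_maps[l] for l, r in responsiveness.items() if r >= thresh]
    return np.mean(strong, axis=0)

def guide_weak_layer(weak_attn: np.ndarray, cache: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend the cached semantic map into a weak layer's attention and renormalize rows."""
    mixed = (1 - alpha) * weak_attn + alpha * cache
    return mixed / mixed.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
layers = {l: rng.random((4, 16)) for l in range(6)}          # text queries x visual tokens
layers = {l: a / a.sum(-1, keepdims=True) for l, a in layers.items()}
resp = {0: 0.8, 1: 0.7, 2: 0.2, 3: 0.1, 4: 0.75, 5: 0.15}    # e.g., per-layer Moran's I
cache = build_cache(layers, resp)
guided = guide_weak_layer(layers[2], cache)                  # repair a semantic-weak layer
print(guided.shape, np.allclose(guided.sum(-1), 1.0))
```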
💡Research Ideas
• Adaptive Focal Guidance: Online Detection and Repair of Semantic-Weak Layers in Video Diffusion: Monitor semantic responsiveness (e.g., Moran’s I) during generation and dynamically adjust cache weights and anchor injections per layer and timestep.
• Tri-Modal Pre-Alignment Networks: Unified Text–Image–Latent Embedding for Controllable I2V: Learn a shared, spatially-aware embedding space that pre-aligns VAE latents, image tokens, and text tokens to reduce conditioning isolation before DiT processing.
• Learnable Attention Cache: End-to-End Cross-Layer Semantic Routing for DiT and MMDiT: Make the cache differentiable with trainable gating and routing across layers, enabling end-to-end optimization of semantic signal transfer.
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning
Abstract
Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and providing feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
🎯Research Motivation
• Static or offline critics become stale as the on-policy agent’s failure patterns drift, causing diminishing utility of feedback over training.
• Outcome rewards are sparse and non-diagnostic, lacking actionable guidance to refine agent behavior efficiently in long-horizon tasks.
• Template-based hints are inflexible and miscalibrated, while separately trained critics remain decoupled from policy learning and fail to adapt.
• Misaligned critique granularity (coarse early, subtle later) leads to redundant or misleading feedback.
• Training suffers from plateaus and instability without reward shaping that accounts for the increasing difficulty of further improvement near performance saturation.
🔧Research Method
ECHO co-evolves the policy and critic via synchronized dual-track GRPO, using a cascaded rollout with multi-view, score-aware critiques followed by conditional refinements, and a saturation-aware gain shaping (log-based intrinsic gain) to reward critiques that induce incremental improvements even near high performance. Group-structured trajectories enable relative advantage estimation, keeping critic feedback aligned with the evolving policy.
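The shaping and advantage steps can be sketched as follows, using an assumed log-based form (the paper's exact objective may differ): a log gain rewards improvements more strongly near saturation, and advantages are standardized within the rollout group, GRPO-style:

```python
# Assumed sketch: saturation-aware gain shaping plus group-relative advantages.
import numpy as np

def log_gain(baseline: float, refined: float, eps: float = 1e-6) -> float:
    """Log-based gain: improving 0.90 -> 0.95 counts more than 0.50 -> 0.55."""
    return float(np.log((1.0 - baseline + eps) / (1.0 - refined + eps)))

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: standardize rewards within the rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

baseline = 0.90
refined_scores = np.array([0.92, 0.95, 0.88, 0.97])  # one per critique-conditioned refinement
gains = np.array([log_gain(baseline, r) for r in refined_scores])
print(gains.round(3), group_advantages(gains).round(3))
```

Critiques whose refinements produce higher gains receive larger relative advantages, which is what keeps the critic's feedback aligned with the current policy.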
💡Research Ideas
• Adaptive Granularity Critics via Curriculum Co-Evolution: Learn to dynamically adjust critique granularity from coarse-to-fine based on on-policy performance and failure modes.
• Joint Optimization of Reward, Critic, and Policy in Open-World Agents: Co-train the reward model alongside the critic and policy to reduce reward mis-specification and improve synchronization.
• Multi-Critic Ensembles with Diversity-Aware Coordination: Develop ensembles of complementary critics and mechanisms to select or aggregate critiques that maximize refinement gains across evolving distributions.
Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing
Abstract
Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.
🎯Research Motivation
• Cluster workload allocation requires complex, brittle configurations to express soft affinity/preferences, creating a usability gap for operators and developers.
• Existing schedulers (e.g., Kubernetes) rely on rigid label/weight mechanisms that struggle to capture nuanced, quantitative, or conflicting soft preferences in a human-friendly way.
• Users need accessible, intent-driven scheduling that maps natural language to scheduling decisions without compromising placement quality.
• Baseline parsing engines lack semantic understanding, and synchronous LLM calls introduce latency that limits production readiness.
🔧Research Method
A Kubernetes scheduler extender uses an LLM (via AWS Bedrock) to parse natural-language 'allocation hint' annotations and convert them into soft-affinity scores informed by a cluster state cache, enabling intent-driven placement. The prototype is evaluated for parsing accuracy and scheduling quality across multiple scenarios, outperforming standard configurations and highlighting latency trade-offs.
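A toy version of the scoring step is sketched below, with a hypothetical intent schema and score range; the prototype's scheduler extender and Bedrock integration are more involved. The idea is that once the LLM has parsed an allocation hint into structured soft preferences, per-node soft-affinity scores follow from simple label matching:

```python
# Assumed sketch: convert LLM-parsed soft preferences into per-node scores
# (hypothetical intent schema, illustrative 0-10 score range).
def score_nodes(intent: dict, nodes: list[dict], max_score: int = 10) -> dict[str, int]:
    """intent example:
    {"preferences": [{"label": "zone", "value": "eu-west-1a", "weight": 0.7},
                     {"label": "disk", "value": "ssd", "weight": 0.3}]}"""
    scores = {}
    weight_sum = sum(p["weight"] for p in intent["preferences"]) or 1.0
    for node in nodes:
        matched = sum(p["weight"] for p in intent["preferences"]
                      if node["labels"].get(p["label"]) == p["value"])
        scores[node["name"]] = round(max_score * matched / weight_sum)
    return scores

nodes = [{"name": "node-a", "labels": {"zone": "eu-west-1a", "disk": "ssd"}},
         {"name": "node-b", "labels": {"zone": "eu-west-1b", "disk": "hdd"}}]
intent = {"preferences": [{"label": "zone", "value": "eu-west-1a", "weight": 0.7},
                          {"label": "disk", "value": "ssd", "weight": 0.3}]}
print(score_nodes(intent, nodes))  # {'node-a': 10, 'node-b': 0}
```

Because the preferences are soft, nodes that satisfy only some of them still receive partial scores rather than being filtered out.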
💡Research Ideas
• Asynchronous Intent Parsing for Low-Latency Cluster Scheduling: Decouple LLM inference from the scheduling critical path to reduce latency and enable production-scale throughput.
• Formal Guarantees for Semantic Soft Affinity in Cluster Orchestration: Combine LLM-driven parsing with constraint verification to ensure safety, fairness, and compliance under conflicting preferences.
• Learning Adaptive Soft-Affinity Scoring from Feedback and Telemetry: Use historical outcomes and real-time metrics to train models that fine-tune scoring and resolve preference conflicts.
sui-1: Grounded and Verifiable Long-Form Summarization
Abstract
Large language models frequently generate plausible but unfaithful summaries that users cannot verify against source text, a critical limitation in compliance-sensitive domains such as government and legal analysis. We present sui-1, a 24B parameter model that produces abstractive summaries with inline citations, enabling users to trace each claim to its source sentence. Our synthetic data pipeline combines chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality training examples across five languages from diverse sources including parliamentary documents, web text, and Wikipedia. Evaluation shows sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters. These results demonstrate that task-specific training substantially outperforms scale alone for citation-grounded summarization. Model weights and an interactive demo are publicly available.
🎯Research Motivation
• LLMs often produce plausible but unfaithful summaries with fabricated or misattributed claims, making verification against source text laborious and risky in compliance-sensitive domains (e.g., government, legal).
• Existing summarization datasets lack citation annotations, and manual grounding is prohibitively expensive; coordinating content generation with precise source attribution remains challenging.
• Long-document summarization still suffers from hallucinations despite larger context windows; prior architectures do not yield verifiable citations, and RAG-based methods add complex external infrastructure rather than internal grounding.
🔧Research Method
Train a 24B LLM (sui-1) on synthetically generated, multi-stage verified data to produce abstractive summaries with inline, sentence-level citations, using chain-of-thought teacher prompting and automated citation checks. The model supports long-context processing (100K tokens in a single pass) and iterative chunking to handle documents exceeding 2 million tokens, enabling self-contained, internally grounded generation.
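A small sketch of an automated citation check follows, assuming a hypothetical inline-citation format such as "[12]" that indexes source sentences; the paper's verification stages are richer. It flags citation indices that point nowhere and summary sentences that carry no citation at all:

```python
# Assumed sketch of a citation-grounding check (hypothetical "[n]" citation format).
import re

def check_citations(summary: str, source_sentences: list[str]) -> dict:
    cited, invalid, uncited = set(), set(), []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        ids = [int(m) for m in re.findall(r"\[(\d+)\]", sent)]
        if not ids:
            uncited.append(sent)                 # claim with no supporting citation
        for i in ids:
            (cited if 0 <= i < len(source_sentences) else invalid).add(i)
    return {"cited_sources": sorted(cited),
            "invalid_ids": sorted(invalid),
            "uncited_sentences": uncited}

source = ["The committee met on 3 May.", "It approved the budget.", "Two members abstained."]
summary = ("The budget was approved at the May meeting [0][1]. "
           "Abstentions were recorded [2]. The vote was unanimous.")
print(check_citations(summary, source))  # last sentence is flagged as uncited
```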
💡Research Ideas
• Beyond Inline Citations: Evidence Graphs for Verifiable Long-Form Summarization: Extend citations from sentence-level links to structured evidence graphs that connect claims to multiple passages and document sections for richer, auditable traceability.
• Adaptive Chunking for Ultra-Long Documents in Citation-Grounded Summarization: Develop dynamic chunking and aggregation strategies that optimize coverage, faithfulness, and citation precision when summarizing 2M+ token corpora.
• A Multilingual Benchmark for Verifiable Summarization with Human-Checked Citations: Create a standardized, cross-domain dataset with human-validated citation alignment to evaluate faithfulness, granularity, and multilingual robustness of grounded summarizers.
SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
Abstract
The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
🎯Research Motivation
• Scarcity of clean, high-purity morpheme lexicons for Uralic languages; existing dictionary-derived candidates are noisy, manual curation is impractical, and corpus-based methods (e.g., Morfessor) are ill-suited for low-resource settings.
• Lack of principled guidance on optimal BPE vocabulary size (k) for morphologically rich, agglutinative languages; current practice relies on heuristics or downstream metrics that overlook morphology-specific trade-offs.
• Standard BPE often misaligns with true morpheme boundaries, causing over-splitting or under-segmentation; intrinsic, linguistically grounded evaluation metrics and resources are needed to assess and improve tokenization.
🔧Research Method
SampoNLP introduces a corpus-free IMDP pipeline using MDL-inspired self-referential atomicity scoring with dynamic programming (Best Explanation Power) and Otsu thresholding to distill high-purity morpheme lexicons from noisy candidate lists. These lexicons underpin the Integrated Performance Score (IPS), which balances morpheme coverage and over-splitting to evaluate BPE tokenizers across vocabulary sizes and identify elbow points.
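One way the coverage-versus-over-splitting trade-off could be folded into a single number is sketched below; this is an assumed combination for illustration, and the paper's IPS definition may differ:

```python
# Assumed sketch of an IPS-like score: lexicon-based morpheme coverage divided
# by an over-splitting penalty (mean tokens per word).
def ips(words: list[str], segment, morpheme_lexicon: set[str]) -> float:
    """segment(word) -> list of subword tokens produced by the tokenizer."""
    covered, pieces, total = 0, 0, 0
    for w in words:
        toks = segment(w)
        pieces += len(toks)
        total += 1
        if all(t in morpheme_lexicon for t in toks):
            covered += 1
    coverage = covered / total    # share of words split entirely into known morphemes
    oversplit = pieces / total    # mean tokens per word (lower is better)
    return coverage / oversplit   # assumed combination; higher is better

lexicon = {"talo", "ssa", "kirja", "sto", "i"}
toy_segment = {"talossa": ["talo", "ssa"], "kirjastoissa": ["kirja", "sto", "i", "ssa"]}
print(ips(list(toy_segment), toy_segment.get, lexicon))
```

Plotting such a score across vocabulary sizes (8k-256k) is what produces the elbow points used for the vocabulary-size recommendations.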
💡Research Ideas
• Morphology-Aware Tokenizer Learning with Self-Referential Constraints: Integrate atomicity scores and lexicon constraints into tokenizer training to align merges with morpheme boundaries and surpass standard BPE in agglutinative languages.
• Extending SampoNLP to Low-Resource Agglutinative Families Beyond Uralic: Apply IMDP and IPS to Turkic, Dravidian, and Bantu languages, adapting character sets and whitelists to test generality and robustness.
• Linking IPS to Downstream Model Performance: An Intrinsic–Extrinsic Study: Quantify correlations between IPS (LMC/OSR) and task metrics across NLU/NLG, deriving k-selection policies and training-time trade-offs.