A Framework for Evaluating Word Boundary Detection in Persian Tokenizers
Author(s):
Article Type:
Research/Original Article (no accredited ranking)
Abstract:
Tokenization is a critical stage in text preprocessing and presents numerous challenges in languages like Persian, where word boundaries are not deterministic. These challenges include the identification of multi-function morphemes, separation of punctuation marks, omission of spaces between tokens, and handling of extra spaces inside words. Typically, the evaluation of tokenizers focuses on overall performance, and test data does not necessarily cover all challenging linguistic phenomena. As a result, the strengths and weaknesses of tokenizers in addressing specific challenges are not independently assessed. This paper examines the challenges posed by the Persian script in detecting word boundaries and evaluates the performance of seven tokenizers in handling these issues. A test set of 4091 tokens across 483 sentences was prepared, of which 1010 were considered challenging tokens. The tokenizers were evaluated using this dataset. The results indicate varying performance among tokenizers when dealing with Persian orthography. Some tokenizers performed better in separating compound words, while others excelled in identifying and preserving zero-length joiners (half-space). A detailed comparison reveals that no tokenizer fully addresses all challenges, highlighting the need for improved algorithms and more sophisticated solutions for Persian word boundary detection. By introducing a comprehensive benchmark and identifying the strengths and weaknesses of available tokenizers, this study paves the way for the development of better Persian language processing tools.
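The abstract does not state which metric was used to score the tokenizers. As a rough illustration only, the sketch below assumes a boundary-level precision, recall, and F1 comparison between a gold tokenization and a tokenizer's output. The function names (`boundary_offsets`, `boundary_scores`) and the choice to ignore whitespace and the half-space character (U+200C) when counting characters are assumptions made for this example, not the paper's method.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, typed as the Persian "half-space"


def _strip(text):
    """Drop whitespace and ZWNJ so offsets are comparable across tokenizers."""
    return "".join(ch for ch in text if not ch.isspace() and ch != ZWNJ)


def boundary_offsets(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets = set()
    position = 0
    for token in tokens:
        position += len(_strip(token))
        offsets.add(position)
    return offsets


def boundary_scores(gold_tokens, predicted_tokens):
    """Hypothetical boundary-level precision, recall, and F1 for one sentence."""
    if _strip("".join(gold_tokens)) != _strip("".join(predicted_tokens)):
        raise ValueError("gold and predicted tokens cover different text")
    gold = boundary_offsets(gold_tokens)
    predicted = boundary_offsets(predicted_tokens)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1


# Gold keeps the half-space verb as one token; the prediction splits it in two.
gold = ["من", "به", "مدرسه", "می" + ZWNJ + "روم", "."]
predicted = ["من", "به", "مدرسه", "می", "روم", "."]
precision, recall, f1 = boundary_scores(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In this example the predicted tokenization introduces one spurious boundary inside the half-space word, so every gold boundary is still found (recall 1.0) while precision falls to 5/6. Restricting such a score to the tokens flagged as challenging would be one way to assess each phenomenon separately, as the abstract argues is needed.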
Keywords:
Language:
English
Published:
Journal of Innovations in Computer Science and Engineering, Volume:1 Issue: 1, Winter and Spring 2024
Pages:
62 to 75
https://www.magiran.com/p2829923
Other articles by this author(s)
- Intermediate Fine-Tuning for Robust Persian Emotion Detection in Text
Seyed Morteza Mahdavi Mortazavi *,
Journal of Innovations in Computer Science and Engineering, Winter and Spring 2025
- EmoRecBiGRU: Emotion Recognition in Persian Tweets with a Transformer-based Model, Enhanced by Bidirectional GRU
Faezeh Sarlakifar, Morteza Mahdavi Mortazavi, *
International Journal Information and Communication Technology Research, Summer 2024