Written Language Classification in Multilingual Documents
Author(s):
Abstract:
Optical character recognition (OCR) is an active area of pattern recognition. Every year, papers on the topic are presented at conferences on artificial intelligence, pattern recognition, image processing, machine vision, and related fields. However, because of the inherent complexity of the world's languages, there is still strong interest in identifying text with better results. Researchers have proposed many algorithms for converting text images and other non-editable text into text that can be edited on a computer. Many published methods assume that each written language has its own characteristics, and can therefore only handle documents written in a single language. In practice, many documents contain two or more languages, so document identification systems need to identify several languages at the same time. In this study, we selected a set of common languages and, based on physical characteristics extracted from them, present a text language classification algorithm for multilingual documents. Features shared within each class can then be extracted for character identification. Farsi and Arabic are placed in Class 1; Chinese, Japanese, and Korean in Class 2; and English, Indonesian, and Spanish in Class 3. For each line of the document, the system must identify the class to which the line belongs. The classifier is a decision tree with adaptive threshold levels. The test data are scanned documents. The classification accuracy is 93.3 percent, which supports the effectiveness of the proposed model.
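The line-level classification described in the abstract can be illustrated with a minimal sketch. The abstract does not name the physical features or the tree structure, so everything below is an assumption made for illustration only: the feature names `ink_density` and `connectivity`, the order of the two decision nodes, and the choice of the per-document mean as the adaptive threshold are all hypothetical and are not the authors' method.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical per-line features; the abstract does not say which
# physical characteristics are actually measured.
LineFeatures = Dict[str, float]

CLASS_1 = "class1"  # Farsi, Arabic
CLASS_2 = "class2"  # Chinese, Japanese, Korean
CLASS_3 = "class3"  # English, Indonesian, Spanish


def adaptive_thresholds(lines: List[LineFeatures]) -> Dict[str, float]:
    """Derive one threshold per feature from the document itself.

    Assumption: the threshold is the mean of that feature over all lines,
    so it adapts to scan resolution and font size.
    """
    keys = lines[0].keys()
    return {k: mean(line[k] for line in lines) for k in keys}


def classify_line(f: LineFeatures, t: Dict[str, float]) -> str:
    """Walk a two-node decision tree for a single text line."""
    # Node 1 (assumed): dense, complex glyphs point to class 2.
    if f["ink_density"] > t["ink_density"]:
        return CLASS_2
    # Node 2 (assumed): strongly connected strokes point to class 1.
    if f["connectivity"] > t["connectivity"]:
        return CLASS_1
    # Otherwise fall through to class 3.
    return CLASS_3


def classify_document(lines: List[LineFeatures]) -> List[str]:
    """Assign a class label to every line of a document."""
    if not lines:
        return []
    thresholds = adaptive_thresholds(lines)
    return [classify_line(f, thresholds) for f in lines]
```

Calling `classify_document` with one feature dictionary per scanned text line returns one class label per line, which matches the per-line decision the abstract requires. Deriving the thresholds from the document itself is one simple way to make them adaptive; a real system would more likely learn them from labeled training data.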
Keywords:
Language:
Persian
Published:
Journal of Publishing, Volume: 2, Issue: 5, 2013
Page:
21
https://www.magiran.com/p1363192