Thai Text Segmentation Using Vowel-Centered Rules and Learning

Abstract

The vast majority of text processing algorithms share one assumption: that the input text is a sequence of words. In languages where word boundaries are not always explicit, such as Thai, text segmentation is therefore an issue of interest. This work presents a two-step algorithm for Thai text segmentation. The first step chops the input text into pieces centered around the vowels. The second step defines a set of features that may help determine whether two consecutive pieces from the first step belong together as a unit (word, syllable, etc.), and then uses learning algorithms to build a model from these features. Applying this model to an input text yields a sequence of units, each small (a few syllables) yet useful enough for further processing by other word-based algorithms.
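
The two steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the chopping rule, the feature names in `pair_features`, and the hand-written rule passed to `join_pieces` are all assumptions made for illustration, and the hand-written rule only stands in for a model that would be learned (for example, by a decision tree learner such as C4.5). The input is assumed to contain Thai characters only.

```python
def is_leading_vowel(ch):
    # Thai vowels written before their consonant: U+0E40..U+0E44
    return "\u0e40" <= ch <= "\u0e44"


def is_consonant(ch):
    # Thai consonants: U+0E01..U+0E2E
    return "\u0e01" <= ch <= "\u0e2e"


def is_vowel(ch):
    # Leading vowels, the vowels U+0E30..U+0E33, and the
    # above/below vowel signs U+0E34..U+0E39
    return (is_leading_vowel(ch)
            or ch in "\u0e30\u0e31\u0e32\u0e33"
            or "\u0e34" <= ch <= "\u0e39")


def chop(text):
    """Step 1 (assumed rule): cut the text into small pieces.

    A leading vowel always opens a new piece. A consonant opens a new
    piece unless it directly follows a leading vowel. Every other
    character (vowel signs, tone marks) attaches to the current piece.
    """
    pieces = []
    current = ""
    for ch in text:
        opens_piece = is_leading_vowel(ch) or (
            is_consonant(ch)
            and current
            and not is_leading_vowel(current[-1])
        )
        if opens_piece and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def pair_features(left, right):
    """Step 2 (hypothetical features) for two consecutive pieces."""
    return {
        "left_len": len(left),
        "right_len": len(right),
        "left_has_vowel": any(is_vowel(ch) for ch in left),
        "right_has_vowel": any(is_vowel(ch) for ch in right),
    }


def join_pieces(pieces, belongs_together):
    """Merge consecutive pieces whenever the model says they belong together."""
    units = []
    for i, piece in enumerate(pieces):
        if i > 0 and belongs_together(pair_features(pieces[i - 1], piece)):
            units[-1] += piece
        else:
            units.append(piece)
    return units


def toy_rule(features):
    # Stand-in for a learned model: attach a vowel-less piece to its left neighbor.
    return not features["right_has_vowel"]
```

With these assumptions, `chop("ไปกิน")` gives the three pieces `ไป`, `กิ`, `น`, and `join_pieces(chop("ไปกิน"), toy_rule)` merges the last two into `กิน`. In the approach the abstract describes, the merging decision would come from a model trained on labeled examples rather than from a fixed rule.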

Keywords: Thai, Text, Segmentation, Learning, Decision Trees, C4.5

Corresponding author: E-mail: patrawadee@as.nida.ac.th

How to Cite

Tanawongsuwan, P. (2018). Thai Text Segmentation Using Vowel-Centered Rules and Learning. Current Applied Science and Technology, 305-311.


Author Information

Patrawadee Tanawongsuwan*
