The vast majority of text-processing algorithms share one assumption: that the input text is a sequence of words. In languages where word boundaries are not always explicit, such as Thai, text segmentation is therefore an issue of interest. This work presents a two-step algorithm for Thai text segmentation. The first step chops the input text into pieces centered around the vowels. The second step defines a set of features that might help determine whether two consecutive pieces from the first step belong together as a unit (word, syllable, etc.), and then uses learning algorithms to build a model from these features. Applying this model to an input text yields a sequence of units, each of which is small (a few syllables) yet useful enough for further processing by other word-based algorithms.
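The two steps described above can be illustrated with a minimal Python sketch. The abstract does not list the actual vowel-centered rules or the feature set, so everything here is an assumption made for illustration: the names `chop` and `pair_features` are hypothetical, the chopping rule (start a new piece at a leading vowel, or at a consonant that does not directly follow a leading vowel) is a deliberate simplification, and the features are placeholders rather than those used in the paper. The learning step itself is not shown.

```python
# Hypothetical sketch only: not the rules or features from the paper.

# Thai vowels written before the consonant they belong to (U+0E40..U+0E44).
LEADING_VOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}
# Thai consonants (U+0E01..U+0E2E).
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}
# Vowels written after, above, or below the consonant (U+0E30..U+0E39).
FOLLOWING_VOWELS = {chr(c) for c in range(0x0E30, 0x0E3A)}


def chop(text):
    """Step 1 (simplified): cut the text into small pieces around vowels.

    A new piece starts at a leading vowel, or at a consonant that does not
    directly follow a leading vowel. Every other character (following
    vowels, tone marks, non-Thai characters) attaches to the current piece.
    """
    pieces = []
    prev = ""
    for ch in text:
        starts_piece = (
            not pieces
            or ch in LEADING_VOWELS
            or (ch in CONSONANTS and prev not in LEADING_VOWELS)
        )
        if starts_piece:
            pieces.append(ch)
        else:
            pieces[-1] += ch
        prev = ch
    return pieces


def pair_features(left, right):
    """Step 2 (placeholder features): describe two consecutive pieces.

    A learned model, for example a decision tree, would take such a feature
    dictionary and predict whether the two pieces belong to the same unit.
    """
    vowels = LEADING_VOWELS | FOLLOWING_VOWELS
    return {
        "left_len": len(left),
        "right_len": len(right),
        "left_has_vowel": any(ch in vowels for ch in left),
        "right_has_vowel": any(ch in vowels for ch in right),
        "right_is_bare_consonant": len(right) == 1 and right in CONSONANTS,
    }


def candidate_pairs(text):
    """Return (left, right, features) for every pair of adjacent pieces."""
    pieces = chop(text)
    return [(a, b, pair_features(a, b)) for a, b in zip(pieces, pieces[1:])]
```

Under these simplified rules, `chop("ไปโรงเรียน")` splits the text into six pieces, the third of which is the bare consonant `ง`. A trained model would be expected to merge such a piece with its left neighbour, which is the kind of decision the second step is meant to learn.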
Keywords: Thai, Text, Segmentation, Learning, Decision Trees, C4.5
Corresponding author: E-mail: patrawadee@as.nida.ac.th
Tanawongsuwan, P. (2018). Thai Text Segmentation Using Vowel-Centered Rules and Learning. CURRENT APPLIED SCIENCE AND TECHNOLOGY, 305-311.

https://cast.kmitl.ac.th/articles/147933