Abstract
With the rapid evolution of the smart home environment, the demand for natural language processing (NLP) applications on information appliances is increasing. However, it is not easy to embed NLP-based applications in information appliances because most information appliances have hardware constraints such as small memory, limited battery capacity, and restricted processing power. In this paper, we propose a lightweight morphological analysis model, which provides the first-step NLP module for many languages. To overcome the hardware constraints, the proposed model modifies a well-known left-longest-match-preference (LLMP) model and simplifies a conventional hidden Markov model (HMM). In the experiments, the proposed model exhibited good performance (a response time of 0.0195 sec per sentence, a memory usage of 1.85 MB, a precision of 92%, and a recall rate of 90%) across various evaluation measures. On the basis of these experiments, we conclude that the proposed model is suitable for natural language interfaces of information appliances with many hardware limitations because it requires less memory and consumes less battery power.
1. Introduction
A smart home is a home in which all systems work together to improve residents’ lives and give them more control. In smart homes, household appliances are rapidly evolving into information appliances (e.g., smartphones and personal digital assistants (PDAs)), which can be used for computing, telecommunication, and the reproduction and presentation of encoded information in myriad forms and applications. Information appliances will play important roles in improving quality of life, safety, and security, as well as communication with the outside world [1]. Therefore, future information appliances will interact with residents via social networking services (SNS) such as Twitter (http://www.twitter.com/), Facebook (http://www.facebook.com/), and Line (http://line.me/en/) [2, 3], as shown in Figure 1.

Scenarios of smart home services via social networking.
To implement such interactions via social networking, information appliances need to be connected to a Web server or a gateway. Recent approaches have shown methods to embed Web servers directly in resource-constrained devices [2]. As shown in Figure 1, some information appliances with embedded Web servers will be registered as users’ friends. Then, the registered information appliances will execute various commands received from users via social networking services. To realize such smart homes, information appliances should understand users’ natural language commands, which take the form of short text messages (e.g., tweets and recognized speech inputs) [2]. Natural language processing (NLP) techniques can be used to convert a natural language into a formal language that information appliances can understand [4], as shown in Figure 2.

Example of natural language processing in smart home interactions.
As shown in Figure 2, a morphological analyzer segments an input sentence into a sequence of words and annotates the segmented words with part-of-speech (POS) tags. In inflective languages, the major goal of the segmentation process is to find the roots of words (e.g., hours = hour+s/plural-noun). In noninflective languages such as Chinese, the major goal of the segmentation process is to correctly split a compound word into a sequence of morphemes (e.g., 美人 = 美 (America) + 人 (people)). A named entity recognizer groups some words into meaningful units (e.g., temperature and time). A semantic and speech act analyzer generates a machine-readable semantic form (e.g., set (temperature = 25, time = 14:30)) and identifies the user’s intention implied in the input sentence (e.g., request). As these NLP steps show, the initial step in the development of NLP-based applications is to implement a high-performance morphological analyzer (i.e., a morpheme segmentation and part-of-speech (POS) tagging system). However, this implementation is not easy because many information appliances have limited input and output capabilities, limited bandwidth, limited memory, limited battery capacity, and restricted processing power. These hardware limitations make it difficult to use the well-known morphological analysis models that require complex computations on a large amount of training data. Although many high-performance information appliances have been developed, lightweight morphological analyzers are still needed to efficiently realize high-level NLP applications because high-level linguistic models (e.g., named entity recognition, semantic analysis, and speech act analysis) require a large memory and a high-performance processor. To resolve this problem, we propose a morpheme segmentation and POS tagging model that combines a rule-based method with a statistical method. The current version of the proposed system operates in Korean, but we believe that changing the language will not be a difficult task because the system simply uses a combination of widely used language-independent NLP techniques such as a longest-matching method and a hidden Markov model (HMM).
This paper is organized as follows. In Section 2, we review the previous work on morpheme segmentation and POS tagging systems. In Section 3, we present a hybrid system for morpheme segmentation and POS tagging in information appliances with restricted resources. In Section 4, we report the result of our experiments. Finally, we draw conclusions in Section 5.
2. Related Work
Morpheme segmentation and POS tagging have been widely studied by many researchers [5–8]. Previous morpheme segmentation methods can be classified into two groups: rule-based models [9–12] and tabular parsing models [13]. Since the rule-based models are based on stemming [9, 10] or longest matching [11, 12], they are widely used for analytic languages (i.e., isolating languages) with low morpheme-per-word ratios (e.g., Chinese and English). Although rule-based models are simple and exhibit decent performance, they are not appropriate for synthetic languages (i.e., agglutinative languages) with high morpheme-per-word ratios (e.g., Korean, Japanese, and Turkish) because various linguistic problems occur in separating a word into a sequence of morphemes. Therefore, tabular parsing models are widely used for the Korean language, although they require complex computations to identify all possible morpheme candidates. However, it is impractical to use these tabular parsing models in information appliances, which typically have restricted processing power. To resolve this problem, we propose an efficient morpheme segmentation method based on modified longest-match-preference rules.
The initial approaches to POS tagging were based on rule-based models. Karlsson [14] applied constraint grammars (a grammar formalism specified as a list of linguistic constraints) to POS tagging. Some researchers dealt with POS tagging as a part of syntactic analysis, using rules handcrafted on the basis of morphological knowledge and intuition [15, 16]. Although these rule-based models are simple and clear, they have some drawbacks. First, they require handcrafted linguistic knowledge, which is considerably costly to construct and maintain. Second, they cannot effectively handle unknown word patterns because they use predefined patterns at the lexical level. Approaches designed to resolve these problems are mainly based on statistical models. The HMM is a representative model of statistical POS tagging for many languages [17]. To improve performance, some researchers have tried to apply effective smoothing methods or language-dependent characteristics to a conventional HMM [17, 18]. Because these statistical models can automatically obtain the information necessary for POS tagging, they do not require the construction and maintenance of linguistic knowledge. In addition, they are generally more robust to unknown word patterns than rule-based models. However, on information appliances with a small main memory, it is impractical to use these statistical models because they have large memory requirements. Conditional random fields (CRFs) and maximum entropy Markov models (MEMMs) are good frameworks that use contextual features for building probabilistic models to segment and label sequence data [19]. However, the strength of these discriminative models is inevitably limited on information appliances with restricted processing power because they generally require more complex computations than an HMM for parameter estimation and probability calculation. Kudo et al. [20] proposed a compact CRF-based model for POS tagging of Japanese. Although Kudo’s model showed good performance, it still requires a larger memory capacity than an HMM-based model because it uses additional n-gram features to increase performance. In experiments on automatic word spacing performed on a commercial mobile phone with an XScale PXA270 CPU, 51.26 MB of memory, and Windows Mobile 5.0, a CRF-based model was 2.11 times slower in response speed and used 77.61 times more memory than an HMM-based model. To resolve these problems, we propose a modified hidden Markov model that requires much less memory for loading statistical information.
3. Lightweight Morphological Analysis and POS Tagging
3.1. Modified Left-Longest-Match-Preference Method for Morpheme Segmentation
In English, a word is a spacing unit, but in Korean, the spacing unit is an eojeol, which consists of one or more morphemes. Therefore, for morphological analysis of Korean sentences, eojeol's should first be segmented into morphemes, which can then be recovered into their lemma forms (i.e., lexical roots). In this paper, we refer to an eojeol as a word for convenience because an eojeol plays a similar role to a word in English. To aid the readability of the examples, we Romanize the Korean characters (Hangeul) and insert hyphens between Korean syllables (eumjeol's). To perform these processes in information appliances, we propose a method based on modified left-longest-match-preference (LLMP) rules. The conventional LLMP model scans an input word from left to right and matches it against the keys in a morpheme dictionary. It returns the lemma form of the longest-matched key and then continues scanning the remainder of the word. If a lemma has several possible POSs, the LLMP model assigns the most frequent one. Owing to this longest-matching behavior, the conventional LLMP model cannot find all morpheme candidates in an input word, as shown in Figure 3.
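The conventional LLMP scan can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary contents, the (lemma, POS) pair format, and the hyphen-based syllable handling are assumptions made for readability.

```python
# Minimal sketch of the conventional left-longest-match-preference (LLMP) scan.
# The toy dictionary maps a Romanized surface form to its (lemma, POS) pair.

DICTIONARY = {
    "jip-gwon": ("jip-gwon", "noun"),
    "han": ("han", "noun"),
    "ha": ("ha", "verb_suffix"),
    "n": ("n", "ending"),
}

def llmp_segment(word, dictionary=DICTIONARY):
    """Scan `word` left to right, always preferring the longest dictionary match."""
    morphemes, i = [], 0
    syllables = word.split("-")  # eumjeol's are written hyphen-separated here
    while i < len(syllables):
        for j in range(len(syllables), i, -1):  # try the longest span first
            key = "-".join(syllables[i:j])
            if key in dictionary:
                morphemes.append(dictionary[key])
                i = j
                break
        else:  # no dictionary match: emit the bare syllable as unknown
            morphemes.append((syllables[i], "unknown"))
            i += 1
    return morphemes
```

For "jip-gwon-han," this scan returns "jip-gwon/noun + han/noun," reproducing the wrong longest match of Figure 3.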

Example of the wrong left longest match.
In Figure 3, the correct morpheme sequence of “jip-gwon-han (doing the seizure of power)” in this context is “jip-gwon (seizure of power)/noun + ha (do)/verb_suffix + n (-ing)/ending.” However, the conventional LLMP model returns only “jip-gwon (seizure of power)/noun + han (hate)/noun” because “han (hate)/noun” is a longer morpheme than “ha (do)/verb_suffix” and “n (-ing)/ending.” We refer to short morphemes that are covered by long morphemes as hidden morphemes. To increase the recall rate of morphological analysis by resolving this hidden morpheme problem, we modify the LLMP model by adding supplementary rules for finding hidden morphemes. To construct the supplementary rules, we first implemented a Korean morpheme segmentation system based on the LLMP model. Second, we annotated a large Korean corpus using this segmentation system. By comparing the automatic annotations with correct human annotations, we automatically collected the cases in which a long morpheme should be divided into a sequence of shorter morphemes. Finally, we selected the top-n most frequent cases and represented each case as a symbolic rule, as listed in Table 1. We refer to these symbolic rules as decomposition rules.
Subset of the decomposition rules.
By using the decomposition rules, the modified LLMP model adds hidden morphemes to the results obtained by the initial analysis, performed using a conventional LLMP model. For example, the modified LLMP model matches the longest-match morpheme “han (hate)” against “han (hate)” → “ha (do)/verb_suffix + n (-ing)/ending” and “han (hate)” → “ha (do)/adjective_suffix + n (-ing)/ending” in the decomposition rules. Then, it adds “ha (do)/verb_suffix + n (-ing)/ending” and “ha (do)/adjective_suffix + n (-ing)/ending” to the original morpheme sequence, as shown in Figure 4.
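The decomposition step can be sketched in the same way. The rule table below is a hypothetical fragment modeled on the "han" example in the text; the real rule set is learned from corpus comparison as described above.

```python
# Sketch of the modified LLMP step: decomposition rules rewrite an over-long
# morpheme into the shorter (hidden) morphemes it covers, and every rewriting
# is kept as an additional candidate morpheme sequence.

DECOMPOSITION_RULES = {
    # long morpheme -> list of alternative shorter-morpheme sequences
    ("han", "noun"): [
        [("ha", "verb_suffix"), ("n", "ending")],
        [("ha", "adjective_suffix"), ("n", "ending")],
    ],
}

def expand_candidates(morphemes, rules=DECOMPOSITION_RULES):
    """Return the original sequence plus every sequence obtained by applying
    one decomposition rule to one of its morphemes."""
    candidates = [list(morphemes)]
    for idx, morpheme in enumerate(morphemes):
        for replacement in rules.get(morpheme, []):
            candidates.append(list(morphemes[:idx]) + replacement + list(morphemes[idx + 1:]))
    return candidates
```

Applied to the longest-match result "jip-gwon/noun + han/noun," this yields the three candidate sequences of Figure 4.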

Processing example of the modified LLMP model.
3.2. Simplified HMM for POS Tagging
Let $W = w_1 w_2 \cdots w_n$ be an input sentence of $n$ words and let $T = t_1 t_2 \cdots t_n$ be a corresponding sequence of POS tags. The goal of POS tagging is to find the most probable tag sequence $\hat{T}$ for $W$, as shown in
$$\hat{T} = \operatorname*{argmax}_{T} P(T \mid W). \quad (1)$$
In (1), $P(T \mid W)$ is rewritten using Bayes’ rule, and the denominator $P(W)$, which is constant over all candidate tag sequences, is dropped:
$$\hat{T} = \operatorname*{argmax}_{T} P(W \mid T)\,P(T). \quad (2)$$
Equation (2) is simplified by making two assumptions: the current POS tag is dependent only upon the previous POS tag, and the current word is affected only by its POS tag. Equation (3) is a well-known HMM model for POS tagging:
$$\hat{T} \approx \operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1}). \quad (3)$$
In (3), $P(w_i \mid t_i)$ is the observation probability of the word $w_i$ given its POS tag $t_i$, and $P(t_i \mid t_{i-1})$ is the transition probability from the previous tag $t_{i-1}$ to the current tag $t_i$. Because a Korean word consists of one or more morphemes, the observation probability itself must be computed by an in-word HMM over the morpheme candidates of each word, as shown in Figure 5.
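Equation (3) is conventionally decoded with the Viterbi algorithm. The following is a minimal sketch of such a decoder; the probability tables passed in are toy values for illustration, not the paper's trained parameters.

```python
# Minimal Viterbi decoder for Equation (3): choose the tag sequence that
# maximizes the product of observation and transition probabilities.

def viterbi(words, tags, trans, emit, start):
    """trans[a][b] = P(b | a); emit[t][w] = P(w | t); start[t] = P(t) at sentence start."""
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (start.get(t, 0.0) * emit.get(t, {}).get(words[0], 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            # extend the best predecessor path with tag t
            score, path = max(
                ((best[p][0] * trans.get(p, {}).get(t, 0.0) * emit.get(t, {}).get(w, 0.0),
                  best[p][1]) for p in tags),
                key=lambda sp: sp[0],
            )
            new_best[t] = (score, path + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]
```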

Example of in-word HMMs based on the tabular parsing method.
In Figure 5, the gray rectangles represent the in-word HMMs based on the modified LLMP model. However, these in-word HMMs require more computing power because they increase the complexity of POS tagging. To resolve this problem, we simplify the observation and transition probability calculations based on the assumption that the first POS tag and the last POS tag of a word provide the important clues for syntactically connecting words, as shown in
$$\hat{T} \approx \operatorname*{argmax}_{T} \prod_{i=1}^{n} P_{obs}(word_i)\,P(ft_i \mid lt_{i-1}). \quad (4)$$
In (4), $P_{obs}(word_i)$ is the observation probability of the $i$th word, computed as the maximum probability among its candidate morpheme sequences, and $ft_i$ and $lt_{i-1}$ denote the first POS tag of the $i$th word and the last POS tag of the $(i-1)$th word, respectively.

Example of the simplified HMM based on the modified LLMP model.
As shown in Figure 6, the transition probability between “chong-20-nyeon-eul (for a total of 20 years)” and “jip-gwon-han (doing the seizure of power)” is calculated from the grammatical connectivity between the POS tag “noun” of the first morpheme “jip-gwon” in the current word and the POS tag “postpositional_word” of the last morpheme “eul” in the previous word. The observation probability of the word “jip-gwon-han” is calculated as the maximum score among the probabilities of its three candidate morpheme sequences: “jip-gwon/noun + han/noun,” “jip-gwon/noun + ha/verb_suffix + n/ending,” and “jip-gwon/noun + ha/adjective_suffix + n/ending.”
In the above example, the candidate sequence with the maximum probability is selected as the analysis result of the word.
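The word-level scoring described above can be sketched as follows. The candidate list, the transition table, and the sequence-probability function are illustrative assumptions; the sketch only shows how a word score combines one first-tag/last-tag transition with the best in-word candidate.

```python
# Sketch of the simplified scoring of Section 3.2: the word's observation
# probability is the maximum over its candidate morpheme sequences, and the
# transition only links the previous word's last POS tag to the current
# word's first POS tag. All probability values below are toy assumptions.

def word_score(candidates, prev_last_tag, seq_prob, tag_trans):
    """Pick the candidate morpheme sequence with the best combined score."""
    best_seq, best = None, 0.0
    for seq in candidates:
        first_tag = seq[0][1]  # POS tag of the first morpheme in the word
        score = tag_trans.get((prev_last_tag, first_tag), 0.0) * seq_prob(seq)
        if score > best:
            best_seq, best = seq, score
    return best_seq, best
```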
4. Experiments
4.1. Data Sets and Experimental Settings
To evaluate the proposed model experimentally, we used the 21st Century Sejong Project's POS-tagged corpus [22]. Table 2 describes the Sejong POS-tagged corpus in brief.
Description of Sejong POS-tagged corpus.
We divided the POS-tagged corpus into training and test data at a ratio of nine to one and performed a 10-fold cross-validation using the following evaluation measures: precision, recall rate, and F1-measure. To evaluate the usefulness of the proposed model in a real information appliance environment, we implemented it on a commercial mobile phone with an XScale PXA270 CPU, 51.26 MB of memory, and Windows Mobile 5.0.
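The evaluation measures can be computed as in the following sketch over (morpheme, POS) pairs; the gold and predicted sequences in the test are hypothetical.

```python
# Sketch of the evaluation measures used in this section, computed over
# (morpheme, POS) pairs; duplicates are ignored for simplicity.

def prf(gold, predicted):
    """Return (precision, recall, F1) of predicted pairs against gold pairs."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # true positives
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```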
4.2. Experimental Results
In the first experiment, we evaluated how the performance of the proposed model changes with the number of decomposition rules. Figure 7 shows the average performance of the proposed model at various cutoff points.

F1-measure scores at various cutoff points.
In Figure 7, the more rules the proposed model had, the higher the performance it obtained. However, we believe that the model incorporating the top-40% rules is the most suitable for information appliances because models with more rules require more processing time and larger working memories while delivering only limited performance improvement.
In the second experiment, we compared the performance of the proposed model with those that are representative of previous models, using the same training and testing data, as listed in Table 3.
Comparison of precision and recall rates.
In Table 3, “LLMP” is a morphological analyzer based on conventional LLMP rules. This morphological analyzer needs no additional POS tagging process because it returns one morpheme sequence per word. “Tabular parsing + HMM” is a POS tagger based on an HMM that selects the most probable sequence among all possible morpheme candidates generated by the tabular parsing method; it is one of the state-of-the-art Korean morphological analyzers, with F1-measures of 94~95% [18]. “Modified LLMP + Simplified HMM” is the proposed POS tagger, which selects the most probable sequence among the morpheme candidates generated by the modified LLMP model. As listed in Table 3, “Tabular parsing + HMM” exhibited the best performance on all measures. However, the proposed model significantly outperformed “LLMP,” and the gap between the proposed model and “Tabular parsing + HMM” was much smaller than that between the proposed model and “LLMP.” This result shows that the decomposition rules are very effective.
In the last experiment, we compared the memory usage and response time of the above models, as listed in Table 4.
Comparison of memory usage and response time.
As listed in Table 4, the proposed model used much less memory and required much less processing time than the “Tabular parsing + HMM” model. Let N denote the number of eumjeol's in an eojeol. In the left-to-right scan that matches an eojeol against the keys in a morpheme dictionary, the tabular parsing model must look up all O(N²) possible substrings of the eojeol, whereas the modified LLMP model follows a single longest-match path and performs far fewer lookups.
4.3. Contribution to Distributed Sensor Networks
Smart home technology can be used in the following key areas in which various sensors should interact with each other in order to detect residents’ behaviors and protect against dangerous situations [3]:
safety area: intruder detection, burglar deception, fire detection, video surveillance, and so on;
comfort area: temperature control, light control, window control, and so on.
To realize this smart home environment, sensor network systems should gather the information detected by sensors and transmit it to tablet terminals, gateways, or information appliances. If the sensor network systems adopt NLP techniques (i.e., NLP techniques are embedded in sensors or gateways), they will be able to detect various events more promptly and determine their responses more accurately. For example, if a keyword detector based on NLP techniques is embedded in a motion sensor (or CCTV), the sensor network system can trigger the necessary actions when keywords like “money” and “give me” appear in a conversation between an intruder and a resident, or when a resident shouts “fire” while moving quickly. As a result, the proposed model can contribute to making sensor network systems better at understanding residents’ contexts.
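A keyword detector of the kind described above can be sketched as follows; the keyword lists and event names are illustrative assumptions, not part of the proposed system.

```python
# Hypothetical sketch of a gateway-side keyword detector: scan recognized
# speech for alert keywords and return the set of triggered events.

ALERT_KEYWORDS = {
    "intrusion": {"money", "give me"},
    "fire": {"fire"},
}

def detect_events(utterance):
    """Return the set of alert events whose keywords appear in the utterance."""
    text = utterance.lower()
    return {event for event, keys in ALERT_KEYWORDS.items()
            if any(keyword in text for keyword in keys)}
```

In a full system, the morphological analyzer proposed in this paper would precede such matching, so that inflected forms are reduced to their lemmas before keyword lookup.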
5. Conclusions
We proposed a morpheme segmentation and POS tagging model for information appliances. To reduce the number of morpheme candidates, the proposed model expands the set of candidates generated by longest-match-preference rules with decomposition rules, instead of using the well-known tabular parsing method, which enumerates all possible candidates. To reduce the computational cost and memory usage, the proposed model simplifies the inner HMM that is needed to find the correct sequence of morphemes in a word. In the experiments, the proposed model exhibited good performance in terms of various evaluation measures such as precision, recall rate, memory usage, and response time. On the basis of these experiments, we conclude that the proposed model is suitable for information appliances with many hardware limitations because it requires less memory and consumes less battery power.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by the IT R&D program of MOTIE/MSIP/KEIT [10041678, The Original Technology Development of Interactive Intelligent Personal Assistant Software for the Information Service on multiple domains]. This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2013R1A1A4A01005074).
