Abstract
Named entity recognition (NER), which provides useful information for many high-level NLP applications and semantic web technologies, is a well-studied topic for most languages, and especially for English. However, the modelling of morphologically rich languages (MRLs) for the NER task is still an open research area. Studies on Turkish, a strong representative of MRLs, lagged behind those on well-studied languages for a long while. In recent years, Turkish NER has intrigued researchers due to its scarce data resources and the unavailability of high-performing systems. In particular, the need to semantically enrich the textual data coming with user generated content initiated many studies in this field. This article presents a CRF-based NER system which successfully models the morphologically very rich nature of this language, together with its extensions to expand the covered named entity types and to process the extra-challenging user generated content coming with Web 2.0. The article also introduces the re-annotation of the available datasets and a brand new dataset from Web 2.0. The introduced approach achieves an exact-match F1 score of 92% on a dataset collected from Turkish news articles and ∼65% on different datasets collected from Web 2.0. The proposed model is believed to be easily applicable to similar MRLs with relevant resources.
Introduction
The second generation of the world wide web (Web 2.0), which is also referred as the Social Web, focuses on people and their social interactions by the use of attractive and easy-to-use applications. The volume of user generated content (UGC)2
Moens et al. [28] defines UGC as “any form of content such as blogs, wikis, discussion forums, posts, chats, tweets, podcasts, digital images, video, audio files, advertisements and other forms of media that was created by users of an online system or service, often made available via social media websites”.
Named Entity Recognition (NER) can be basically defined as identifying and categorizing certain types of data (e.g. person, location, organization names, date-time expressions). Besides its value for semantic web technologies, NER is also an important stage for several natural language processing (NLP) tasks including machine translation, sentiment analysis and syntactic parsing. The MUC (Message Understanding Conference [4,45]) and CoNLL (Conference on Computational Natural Language Learning [48,49]) conferences define three basic categories of named entities: 1) ENAMEX (person, location and organization names), 2) TIMEX (date and time entities) and 3) NUMEX (numerical expressions like money and percentages). However, NER research is not limited to only these types; different application areas concentrate on determining alternative entity types such as protein names, medicine names and book titles.
NER research first started in the early 1990s for English. By 1995, with the high interest of the research community, the success rates for English had nearly reached human annotation performance on news texts [45]. Nadeau and Sekine [31] survey the research on English NER between 1991 and 2006. The maturity of the English NER task directed the field to new research areas such as multilingual NER systems [48,49], transliteration [54], coreference [30] of named entities and especially NER on UGC [23–25,29,35,38,39].
The use of Conditional Random Fields (CRFs) [22], which are reported to offer several advantages over hidden Markov models (HMMs), stochastic grammars and maximum entropy Markov models (MEMMs), became very dominant in the literature for the named entity recognition task. CRF-based NER models have been experimented with for various domains and languages: [27] for English and German, [9] for Hindi and Bengali, [3] for Chinese, [42] for biomedical data, [24,35] for Tweets are some studies among many others.
Morphologically rich languages (MRLs) (such as Finnish, Czech, Korean, Hungarian and many others) pose interesting challenges for NLP tasks (e.g., data scarcity, the representation of rich morphological features in different tasks), and NER is no exception. Although some studies report their approaches for particular MRLs, the usage of morphological information for the NER task is still an open research issue. Georgiev et al. [12] add word prefix and word suffix information as new features to their system for Bulgarian. Hasan et al. [15] use the first and last 3 characters of the words as extra features in order to capture prefix and suffix information for Bengali. Konkol and Konopík [16] report that their effort to add morphological features did not yield any performance improvement for Czech, as does Yeniterzi [53], who reports similar findings for Turkish.
Turkish is a free-constituent-order language with complex agglutinative, inflectional and derivational morphology. With its morphologically very rich nature, Turkish is one of the strongest representatives of MRLs and attracts the attention of the NLP community. In particular, the need to semantically enrich the textual data coming with UGC has initiated many studies on Turkish NER in recent years. Nevertheless, the results for Turkish NER still remain far behind the reported accuracies for English. This article introduces a CRF-based Turkish NER model (first introduced in [40] on Turkish well-formed texts for ENAMEX types only) and the enhancements made to extend its coverage to TIMEX and NUMEX entity types and to process UGC3
This manuscript focuses only on the textual content coming with UGC [28].
The article is organized as follows: Section 2 gives brief information about Turkish language characteristics related to NER, Section 3 gives a brief overview of the previous studies for Turkish NER, Section 4 gives information about existing and newly introduced language resources, Section 5 gives the details of the proposed framework, its extensions to TIMEX and NUMEX entities and to Web 2.0 domain, Section 6 provides our experiments and evaluates the results by comparing with related work and Section 7 gives the conclusion.
This section briefly states the characteristics of the Turkish language which are considered to influence the NER task. Turkish is a morphologically rich and highly agglutinative language. In most Turkish NLP studies, lemmas are used instead of word surface forms in order to decrease lexical sparsity. For example, the Turkish verb “gitmek” (to go) may appear in hundreds of different surface forms4
Some surface forms of “gitmek” (only in simple present tense for different person arguments): gidiyorum, gidiyorsun, gidiyor, gidiyoruz, gidiyorsunuz, gidiyorlar.
Although in well-formed text only proper nouns, abbreviations and the initial words of sentences start with a capital letter, this is most of the time not the case in the social media domain. Turkish person (first) names are usually selected from common nouns such as İpek (silk), Kaya (rock), Pembe (pink) and Çiçek (flower). This property of the language makes the recognition of such named entities very hard in the UGC domain, where the appropriate capitalization rules are frequently ignored.
Turkish is a free word order language. As a consequence of this property, the position of the word in a sentence does not provide information about being a named entity or not. All of the three sentences: “Ahmet yarın Mehmet ile konuşmaya gidecek.”, “Yarın Mehmet ile konuşmaya Ahmet gidecek” and “Yarın Ahmet, Mehmet ile konuşmaya gidecek.” are valid Turkish sentences all with the English translation of “Tomorrow, Ahmet will go to talk to Mehmet”.
Çelikkaya et al. [2] make a preliminary investigation of the problems caused by UGC for Turkish NER. The following example from [2] shows the complexity caused by the omission of the above-mentioned rules for proper nouns. In this tweet, which in formal writing should actually be written as “Aydın’lara gidiyoruz.” (“We are going to the Aydıns’.”), “Aydın” is a person name. When written with lowercase letters, the word also has the meaning of a common noun, “enlightened”. This makes it very difficult to differentiate this named entity (a person name) from the common noun.
Another problem in real data is spelling errors, produced either by mistake or on purpose for exaggeration, interjection, or ASCIIfication (removal of accent, cedilla, etc.) of the special Turkish letters (öüçşığİ). In the first line of the original example, the letter “ı” is written with its ASCII counterpart and repeated multiple times to express exclamation. The second line of the same example shows the case where all the letters are capitalized; here again it is very difficult to detect the named entity and resolve the ambiguity caused by the common-noun sense of the proper noun.
And finally, the following example, again from [2], illustrates foreign words inflected with Turkish suffixes while omitting the required apostrophe sign: in the tweet, “Bieber” is used in the accusative case without the required apostrophe (“I don’t like Bieber.”).
Overview of the previously reported Turkish NER results (compiled in [40])
Şeker and Eryiğit [40] compile the performances of all the previous Turkish NER studies and try to make comparisons with them whenever possible, although neither the previous systems nor most of the used datasets were publicly available. The authors interacted with the owners of the previous systems, which made it possible to obtain a performance score either by sending their test data to be tested with the prior system or by obtaining the test dataset used in the prior work.
Table 1 (from [40]) gives an overview of the previously reported Turkish NER performances. The performances listed in Table 1 are organized in decreasing order of credit given to partial matches during evaluation. The outputs of NER systems are generally evaluated in comparison with human annotations. Nadeau and Sekine [31] exemplify the different types of errors which may occur during the automatic recognition of named entities, e.g. a totally missed NE, or an identified entity with a wrong entity type, wrong boundaries or both. Nadeau and Sekine [31] give the details of the three main scoring techniques (ACE,5 MUC and CoNLL).
In the ACE evaluation method, each entity type has a parameterized weight and contributes up to a maximal proportion of the final score (e.g., if each person is worth 1 point and each organization is worth 0.5 point then it takes two organizations to counterbalance one person in the final score).
Similar to most NER studies in the literature, the MUC and CoNLL evaluations were widely used in Turkish NER studies. In the MUC evaluation approach, a system is scored on two axes: its ability to find the correct type (TYPE) and its ability to find the exact text (TEXT). MUC TEXT evaluates only the NE boundaries, without checking whether the correct NE type is assigned; MUC TYPE evaluates the assignment of the correct named entity (NE) type to each word, without taking into account whether the NE boundaries are detected correctly. The final MUC score is the micro-averaged F-measure, the harmonic mean of precision and recall calculated over all entity slots on both axes. The CoNLL scoring technique, on the other hand, uses an exact-match evaluation: it counts an assignment as correct only if both the type and the boundary of an NE are determined correctly. The calculated score is again the micro-averaged F-measure, but this time calculated over the exactly matched named entities. The results (excluding those provided for [21] and [32]) in Table 1 are given as MUC and CoNLL F1 scores. Note that the test sets, evaluation methods (3rd column), working domain (4th column) and entity types (5th column) in focus of each work differ from each other.
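As a concrete illustration, CoNLL-style exact-match precision, recall and F1 can be computed from gold and predicted entity spans as follows (a minimal sketch, not the shared-task evaluation script itself; entities are assumed here to be represented as (type, start, end) tuples):

```python
def conll_f1(gold, pred):
    # exact match: an entity counts as correct only if both its
    # type and its boundaries agree with the gold annotation
    correct = len(set(gold) & set(pred))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    # micro-averaged F-measure over exactly matched entities
    return 2 * precision * recall / (precision + recall)
```

Under the MUC scheme, by contrast, the second prediction below would still earn partial credit on the TEXT axis because its boundaries are correct.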
The first published work on Turkish NER is [5], a language-independent system tested on Romanian, English, Greek, Turkish and Hindi. This system is trained on a small training set and learns from unannotated text using a bootstrapping algorithm. The first NER work specific to Turkish is [52]. That study focuses on three Information Extraction (IE) tasks, namely sentence segmentation, topic segmentation and name tagging. For the name-tagging task, the authors use lexical, morphological and contextual features of the words to build an HMM-based model. They use a training and test set collected from news articles, which will be introduced in the following sections. The authors use the same training data as [40], but a different test set which is not available. Their performance is reported as 91.56%. To allow a rough (although not strictly comparable) comparison, Şeker and Eryiğit [40] also provide their MUC F1 score (94.59%) as well as their CoNLL F1 score of 91.94%.
Bayraktar and Temizel [1] work on financial texts to find only person names. They apply the local grammar based approach of [51] to Turkish. Bayraktar and Temizel [1] initially identified common reporting verbs in Turkish and then used these reporting verbs to generate patterns for locating person names. The study reports a CoNLL F1 score of 81.97% which is not directly comparable with any of the related work given in this section due to the difference in the used datasets.
Yeniterzi [53] uses CRFs and exploits the impact of morphology for Turkish NER. This work is the one which is most similar to ours except for the usage of morphological features and gazetteers. Yeniterzi [53] and Şeker and Eryiğit [40] use the same training and test data. Table 1 gives the reported performance by [53]. In order to be able to make a strict comparison, the results of its replication and evaluation under our settings are provided under Section 6.2.
Özkaya and Diri [32] also use CRFs for NER on email messages, but since they use features specific to the email domain only (such as the from and subject fields), their work may not extend to general texts. They do not state their evaluation metrics or overall results, but overall precision, recall and F-measure values can be calculated as 92.89%, 77.07% and 84.24% respectively using the token counts provided in their paper.
The automatic rule learning system of [47] starts with a set of seeds selected from the training set, and then extracts rules over these examples. The named-entities are generalized by using contextual, lexical, morphological and orthographic features. Although the authors do not explicitly mention that they use the CoNLL evaluation method, the evaluation strategy of looking for the exact match seems compatible with it. Their reported accuracy is 91.08% on ENAMEX and TIMEX types. The relevant F-measure for only ENAMEX types is calculated as 90.63%.
Küçük and Yazıcı [21] use rote-learning [11] in order to extend their rule-based recognizer [19] into a hybrid recognizer, so that it can learn from the available annotated data and extend its knowledge resources. They evaluate their system on general news texts, financial news texts, historical texts and children’s stories. In Table 1 we took the results on the general news texts domain, which is the closest to our domain. Their evaluation strategy gives more credit to partial matches and is similar to neither the CoNLL nor the MUC scoring technique. They work on ENAMEX, TIMEX and NUMEX entity types but do not provide scores for each of these. After measuring this system’s performance on their own dataset, Şeker and Eryiğit [40] report a CoNLL F1 score of 69.78% on ENAMEX types for [21].
Demir and Özgür [6] address NER task for morphologically rich languages by employing a semi-supervised learning approach based on neural networks. They adopt a fast unsupervised method for learning continuous vector representations of words, and use these representations along with language independent features. They test their work on the data set of [40] and report a CoNLL F1 score of 91.85%.
The Turkish NER studies on UGC domain are very recent and limited in number compared to well formed text domain. The work of Çelikkaya et al. [2] is the first study which investigates the NER success on Turkish UGC; they test on 3 different domains, namely on datasets collected from Twitter, a Speech-to-Text Interface and a Hardware Forum. Küçük et al. [17], Küçük and Steinberger [18] and Eken and Tantuğ [8] follow this trend and report their approaches on Twitter datasets. The outputs of our extended CRF-Model are compared with the mentioned studies in Section 6.2.
Some sample annotations from the formal news text dataset
This section firstly gives the features of the existing and freely available Turkish datasets tagged with named entities. Then, it introduces the newly annotated ones within this work.
Available datasets
The most widely used dataset for Turkish NER research is introduced by [52]. This data consists of nearly 500K words collected from newspaper articles and is annotated only for ENAMEX types. Another available dataset from the well-written text genre comes from [47]. This dataset is rather small (∼55K) compared to the previous one and as a result is less preferable for supervised machine learning systems, which mostly need a high volume of human-annotated data. The dataset consists of news articles on terrorism from both online and print news sources in Turkish. The annotated types on this corpus are the ENAMEX and TIMEX categories.
The datasets from the UGC domain are brand new and the available ones are as follows:
Çelikkaya et al. [2] introduce three datasets annotated with ENAMEX, TIMEX and NUMEX types: 1) a 55K-word dataset from a very popular online forum dedicated to hardware product reviews. An important feature of this dataset is that it mostly contains trademarks (generally company names) and their products together with the related model. Although this type of named entity is categorized under more specific classes in extended NE classifications [41], the most relevant MUC-6 category for these is “Organization”. This forum data is full of spelling errors, and capitalization is used improperly or not at all in most cases. 2) a very small corpus (∼1.5K) collected from the speech-to-text interface of a mobile assistant application. The most important characteristic of this dataset is that the produced text messages contain no capitalization or punctuation at all. 3) a 55K-word Twitter corpus which is used for testing purposes in many of the follow-up studies [17,18] and [8]. Unfortunately, the annotations in this new domain were arguable, and this resulted in the simultaneous emergence of re-annotated versions of the same dataset by different groups6
Table 2 from [8], Table 1 from [2] and Table 1 from [18] provide the number of annotated named entities in the different versions of this dataset. The main arguable point in the annotation of this dataset was the tagging of named entities containing apostrophes.
Human annotation of language resources is a costly process, and the creation of benchmark datasets is very valuable for speeding up progress in a research area. As may be noticed from the previous subsections, early Turkish NER studies mostly evaluated their success on their own datasets, which makes it hard to compare the proposed approaches. In this study, we selected the two most widely used datasets from the Turkish NER literature, one from the well-written text domain [52] (which is also the biggest dataset) and one from the UGC domain [2] (limited to Twitter content only), and re-annotated them with the following two main purposes:
To extend the covered named entity types (to also cover TIMEX and NUMEX types), which were previously limited to ENAMEX types only (in [52]).
To improve the consistency, and hence the quality, of the annotations by strictly following a specific guideline (namely the guidelines of MUC-6, the Sixth Message Understanding Conference [13]).
Previous annotations were also carefully investigated during this second round of annotation. In addition to these two datasets, we also annotated a brand new Turkish treebank from the social media domain: the ITU Web Treebank (IWT) [33]. The IWT is specifically selected for NER annotation due to its representativeness of UGC. Its composition (free from duplicates and re-tweets) includes UGC from different Web 2.0 domains (namely news story comments, personal blog comments, customer product reviews, social network posts and discussion forum posts), which we believe eliminates the limitation to Twitter content found in recent work. Two human annotators served during the annotation process; the strength of agreement is considered ‘very good’ according to the Kappa statistic.7
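For reference, inter-annotator agreement over token-level NE labels can be measured with Cohen's kappa, where values above roughly 0.8 are conventionally interpreted as 'very good' agreement (a minimal sketch; the article does not specify the exact computation it used):

```python
def cohens_kappa(labels_a, labels_b):
    # observed agreement corrected for the agreement expected by chance
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)
```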
Confidence intervals were calculated using the GraphPad QuickCalcs Web site:
Table 3 gives the distribution of the named entities for each annotated dataset. One should note that the reported number of named entities may differ significantly from some of the previous studies (e.g. [2,53]), which report the number of tokens (forming a named entity) instead of the actual number of named entities (consisting of one or more tokens) provided here.
Entity distributions in newly introduced datasets
This section introduces a CRF-based NER system which successfully models the morphologically very rich nature of Turkish, the features used for ENAMEX types and the newly added TIMEX and NUMEX types, and its adaptation to UGC.
Proposed framework
Figure 1 shows the architecture of the used framework. The following subsections provide the details of each module.

Proposed Framework.
We tokenized our data so that each word is represented as a single token, except for proper nouns which undergo inflection. Each punctuation character is considered a token. Sentences are separated from each other by an empty line. The tokenization of a sample sentence can be seen in Table 4.
Since in an MRL the inflections that may be attached to a proper noun are more diverse (e.g., case markers, copulas rendered as suffixes) and more common than in English, the tokenization of proper nouns has an important role in the success of the NER system. The apostrophe, which in English is generally used for the possessive clitic (’s) on common nouns or the plural suffix on proper nouns, may have a long token appended to it in MRLs (consisting of multiple inflectional features), e.g. Ankara’dakiler (‘those which/who are in Ankara’). Since the suffixes separated by an apostrophe are not part of the named entities (NEs) according to the MUC-6 guidelines, we partitioned such proper nouns into two tokens (the tokens before and after the apostrophe), which is shown to perform better than basic word-level tokenization. In the literature on NER for MRLs, there are few studies reporting experiments related to tokenization. To our knowledge, only Yeniterzi [53] experiments with a morpheme-level tokenization, reporting no improvement with respect to basic word-level tokenization.
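The apostrophe-based split described above can be sketched as follows (an illustrative sketch, not the article's actual tokenizer; whether the apostrophe stays with the suffix token is our assumption here):

```python
def tokenize_proper_noun(token):
    # split an inflected proper noun at the apostrophe, per MUC-6:
    # suffixes after the apostrophe are not part of the named entity
    if "'" in token:
        head, _, suffix = token.partition("'")
        # keep the apostrophe with the suffix token (assumption)
        return [head, "'" + suffix]
    return [token]
```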
IOB2 tagging vs RAW tagging
We used a two-level morphological analyzer [10] to produce the possible analyses for each word. We then feed the output to a morphological disambiguator [10] in order to get the most probable analysis in the given context. For example, the analyzer produces three different analyses for the word “Teknik” (Technical), corresponding to an adjective, a noun and a proper noun respectively; the disambiguator selects the most probable analysis within the given context:
Teknik teknik+Adj
Teknik teknik+Noun+A3sg+Pnon+Nom
Teknik teknik+Noun+Prop+A3sg+Pnon+Nom
The output of the analyzer includes both the stem of the word and the morphological features8
The abbreviations after the plus sign stand for: +Adj: Adjective, +Noun: Noun, +A3sg: 3sg number-person agreement, +Pnon: Pronoun (no overt possessive agreement), +Nom: Nominative case, +Prop: Proper noun.
Our preliminary work [40] introduced two kinds of gazetteers, called base and generator gazetteers, which have been compiled from different sources without taking the test corpora into consideration. The base gazetteers are composed of large lists of person and location names (∼261K tokens). The collected person names have been split into first-name and surname gazetteers, both to anonymize our gazetteers and to be able to detect different combinations of these. The location gazetteer has been collected so that it includes all location names in the Turkish postal code system,9
Mostly collected from wikipedia.com.
# of distinct tokens in gazetteers
At this stage, we use the information coming from the raw data, the gazetteers and the morphological processing to prepare the feature vectors for our training/test instances. For the class labels at the training stage, we use “raw tags”: labels such as “PERSON”, “ORGANIZATION”, “LOCATION” and “O” (other, for words which do not belong to an NE) without any position information (that is, without any prefix). In our preliminary experiments [40], we experimented with different training data formats (IOB, IOB2, raw labels and the fictitious boundary model of [52]) and reported that the highest performance is obtained by using the raw labels, whereas using the IOB formats reduced the performance by 0.4% and the fictitious boundary format by 2%. Thus, in this article we follow the same approach and use the raw tags during the training stage. Table 4 gives tagging examples with both IOB2 and raw tags.
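Since CoNLL-style evaluation expects IOB2 output, the raw labels produced by the tagger have to be converted back; a minimal sketch of such a conversion (note that RAW tagging cannot mark a boundary between two adjacent entities of the same type, a known trade-off of this format):

```python
def raw_to_iob2(raw_tags):
    # convert position-free RAW tags (e.g. PERSON, O) to IOB2
    # (B- opens an entity, I- continues it)
    iob = []
    prev = "O"
    for tag in raw_tags:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append("I-" + tag)
        else:
            iob.append("B-" + tag)
        prev = tag
    return iob
```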
Conditional random fields
Conditional random fields (CRFs) [22] are a framework for building probabilistic models to segment and label sequence data. A CRF is a discriminative model better suited to including rich, overlapping features, focusing solely on the conditional distribution p(y|x). In a linear-chain CRF, this distribution is modeled as p(y|x) = (1/Z(x)) exp(Σ_t Σ_k λ_k f_k(y_{t−1}, y_t, x, t)), where Z(x) is a normalization factor. For the named entity task, each state y_t corresponds to the named entity label of the token at position t. The features f_k are indicator functions over the current label, the previous label and the observation sequence, weighted by the parameters λ_k learned during training.
In some studies, it is shown that useful feature conjunctions may be determined incrementally and provided to the system automatically [26]. In this study, however, we used the approach proposed in [43] and selected useful features manually for our initial explorations. Although this approach generally results in a huge number of features, we did not encounter any memory problems when using the combinations.
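As a toy illustration of how a linear-chain CRF combines such weighted feature functions into the (unnormalized) log-score of a label sequence (a didactic sketch only, not the CRF++ implementation used in this work; feature names and signatures are our own):

```python
def crf_score(weights, features, labels):
    # unnormalized log-score: sum over positions of
    # weight_k * f_k(previous_label, current_label, position)
    total = 0.0
    prev = "<s>"  # sentence-start pseudo-label
    for t, label in enumerate(labels):
        for name, f in features.items():
            total += weights.get(name, 0.0) * f(prev, label, t)
        prev = label
    return total
```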
We provided our atomic features within a fixed window around the current token, expressed as CRF++ feature templates. For example, U15 is the template using the 2nd feature (part-of-speech tag) of the second previous word. U50 is the template using the conjunction of the existence of the current word in the location name gazetteer (LG) (column 10) and its case feature (column 6), e.g., exists in LG written in lowercase; exists in LG and the first letter capitalized.
We use the bigram option of CRF++ in order to automatically generate the edge features over the previous and current labels.
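Concretely, the templates above might look as follows in a CRF++ template file (a hypothetical fragment reconstructed from the description, not the article's actual template file; in CRF++, %x[row,col] addresses a feature column relative to the current token):

```
# U15: POS tag (column 2) of the second previous word
U15:%x[-2,2]
# U50: conjunction of the location-gazetteer lookup (column 10)
# and the case feature (column 6) of the current word
U50:%x[0,10]/%x[0,6]
# B generates bigram (edge) features over the previous and current labels
B
```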
Features used for ENAMEX types
In our model, three groups of features are used for ENAMEX types, detailed below: morphological features, lexical features and gazetteer lookup features.
Morphological features
The morphological features are extracted from the analysis produced after the automatic morphological processing of each word.
The stem information. For the inflected proper nouns where the inflections after the apostrophe are treated as a separate token, the surface form after the apostrophe is itself assigned as the stem of the token representing the inflections.
The final part of speech category for each word. In Turkish, with the use of derivations, words may change their part of speech categories within a single surface form. The final form of the word determines its syntactic role within a sentence. Therefore, we use the final POS form of each word. We assigned a special POS tag (“APOST”) to the tokens separated by an apostrophe from the proper nouns.
The case argument. This feature is 0 for non-nominal tokens and one of the following values for nominals: Nominative (NOM), Accusative/Objective (ACC), Dative (DAT), Ablative (ABL), Locative (LOC), Genitive (GEN), Instrumental (INS), Equative (EQU). Ex: the value will be NOM for the word “Teknik” with the morphological analysis “teknik+Noun+Prop+A3sg+Pnon+Nom”.
A binary feature indicating whether the “+Prop” tag exists (1) in the selected morphological analysis or not (0). Ex: the value will be 1 for the word “Teknik” given above. It is useful to mention that the morphological pipeline tags all unknown words as proper nouns.
All inflectional tags after the POS category. If a derivation exists then the inflectional tags after the last derived POS category is used. Ex: the value will be “Prop+A3sg+Pnon+Nom” for the word “Teknik” with the above morphological analysis.
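Extracting these features from a disambiguated analysis string can be sketched as follows (a simplified illustration assuming analyses without derivational boundaries; the helper name and dictionary keys are our own, not from the article):

```python
def morph_features(analysis):
    # parse an analysis such as "teknik+Noun+Prop+A3sg+Pnon+Nom"
    parts = analysis.split("+")
    stem, tags = parts[0], parts[1:]
    pos = tags[0]                      # final POS (no derivation assumed)
    prop = 1 if "Prop" in tags else 0  # proper-noun flag
    cases = {"Nom", "Acc", "Dat", "Abl", "Loc", "Gen", "Ins", "Equ"}
    case = next((t for t in tags if t in cases), "0")  # 0 for non-nominals
    inf = "+".join(tags[1:])           # inflectional tags after the POS
    return {"STEM": stem, "POS": pos, "PROP": prop, "CASE": case, "INF": inf}
```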
As stated in the introduction, the usage of morphological information in NER modeling for MRLs is still an open research issue. In the literature, some studies use the first and last n characters of a word as extra CRF features in order to include prefix and suffix information. However, Turkish morphology is so rich that the possible suffix combinations cannot be limited to a predefined length. Since affixes in Turkish appear almost always as suffixes (except in some very rare foreign words), no extra feature is needed to represent prefixes in this language. A uniform representation (the +INF feature) for suffixes, free from variances due to vowel harmony, is considered more appropriate for Turkish. As a result, we model the morphological information with the above features, where the +NCS and +PROP features are atomic units extracted from the +INF feature.
Lexical features
The information about lowercase and uppercase letters used in the current token. This feature takes 4 different values: lowercase (0), UPPERCASE (1), Proper Name Case (2) and miXEd CaSe (3).
A binary feature indicating whether the current token is at the beginning of a sentence (1) or not (0).
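The four-valued case feature can be sketched as follows (an illustrative sketch with our own function name; real tokenizer output may need extra handling for digits and punctuation):

```python
def case_feature(token):
    # 0: lowercase, 1: UPPERCASE, 2: Proper Name Case, 3: miXEd CaSe
    if token.islower():
        return 0
    if token.isupper():
        return 1
    if token[0].isupper() and token[1:].islower():
        return 2
    return 3
```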
Gazetteer lookup features
Eight different features are used for each of the eight gazetteers introduced in Section 5.1.3.
Extra features for TIMEX and NUMEX types
The numeric class argument. This feature is 0 for non-numeric tokens, 1 for integer tokens in [1–12], 2 for integer tokens in [13–31], 3 for integer tokens in [32–2020], 4 for other integer tokens and 5 for all other numeric values.
A binary feature indicating whether the token is a percentage sign (%) or the word “yüzde” (percent).
A binary feature indicating whether the token is the word “saat” (o’clock).
A binary feature indicating whether the token includes the character “:”.
A binary feature indicating whether the token is included in the months gazetteer.
A binary feature indicating whether the token is included in the currency units gazetteer.
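The numeric class feature can be sketched as follows (an illustrative sketch; the ranges printed in the text overlap at 31, so day-of-month is assumed to take precedence here):

```python
def numeric_class(token):
    # 0: non-numeric, 1: [1-12], 2: [13-31], 3: [32-2020],
    # 4: other integers, 5: all other numeric values
    try:
        value = int(token)
    except ValueError:
        try:
            float(token.replace(",", "."))  # Turkish decimal comma
            return 5                        # other numeric value
        except ValueError:
            return 0                        # non-numeric token
    if 1 <= value <= 12:
        return 1  # possible month or hour
    if 13 <= value <= 31:
        return 2  # possible day of month
    if 32 <= value <= 2020:
        return 3  # possible year
    return 4      # other integer
```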
Adaptation for UGC
A widely used approach when adapting NER systems to the UGC domain is to apply text normalization prior to NE identification. In the literature, there exist very few studies on the text normalization of MRLs, which is a much more complicated problem than the normalization of English texts. In this work, the first approach tried, which could not produce good results, was to use a Turkish text normalizer [50] specifically developed for the Web 2.0 domain. As a result, instead of using such a comprehensive normalizer as a pre-processor, different error-tolerant gazetteer lookup scenarios were investigated, the highest performing of which is used in the final model. Similar to our findings, Çelikkaya et al. [2] and Eken and Tantuğ [8] report unsuccessful trials with their minimum-edit-distance based approaches. As exemplified in Section 2, it is very hard to detect proper names with a common-noun meaning when written in lowercase letters. Although this remains a challenging issue for Turkish NER studies, in this work we manually selected the names from our gazetteers with very little chance of being used as a common noun in Turkish texts. We then add a new binary CRF feature (CAP) indicating whether the current token exists in this auto-capitalization gazetteer. Finally, a binary feature indicates whether the given token conforms to a specific pattern (the Twitter mention tags).
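One plausible error-tolerant lookup scenario, folding the Turkish-specific letters to their ASCII counterparts before comparison, can be sketched as follows (an illustrative sketch of the general idea, not the article's exact method; the gazetteer entries are invented examples):

```python
# map the special Turkish letters to their ASCII counterparts
TR_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def asciify(token):
    return token.translate(TR_ASCII)

def tolerant_lookup(token, gazetteer):
    # compare ASCIIfied, lowercased forms so that e.g.
    # "ISTANBUL" and "istanbul" both match "İstanbul"
    return asciify(token).lower() in gazetteer

# gazetteer entries are stored in the same folded form
gazetteer = {asciify(name).lower() for name in ["İstanbul", "Aydın", "İpek"]}
```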
Experimental results
In recent years, the CoNLL evaluation method has become the de facto standard for the evaluation of NER systems. In this article, we follow this trend and use this method in all of our evaluations. Throughout the presentation of our experimental results, performances are provided as CoNLL F1 scores (micro-averaged F-measure calculated on exactly matched named entities, discussed in detail in Section 3). The output of the testing stage, produced with RAW labels, is automatically converted to IOB-2 style and then evaluated with the evaluation script of the CoNLL 2000 shared task.13
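The RAW-to-IOB-2 conversion can be sketched as below. We assume here that under RAW tagging each token of an entity carries the bare entity type (and “O” otherwise), so a run of identical adjacent tags is treated as a single entity; the function name is ours.

```python
def raw_to_iob2(labels):
    """Convert RAW tags (entity type per token, 'O' otherwise) to IOB-2.

    The first token of each run of identical types becomes B-TYPE,
    subsequent tokens of the run become I-TYPE.
    """
    iob = []
    prev = "O"
    for tag in labels:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append("I-" + tag)
        else:
            iob.append("B-" + tag)
        prev = tag
    return iob
```

The converted sequence can then be fed directly to the CoNLL evaluation script.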
Following the previous work [40,53], in all of the provided experiments for the well-formed text domain, we used the 445K tokens of news articles [52] (Table 3) as the training set (to be referred to as
In the following sections, the training and test set couples are provided for each experiment separated with a slash sign and between parentheses such as (
Following the work of [40], our first experiment investigates the impact of each selected feature on the identification of ENAMEXs. Table 6 shows the impact of each feature on the best model by leaving out one feature at a time. Each row of the table states the performance obtained by excluding the feature given in the first column: e.g., the -SS row gives the performance of the best model (given in the first row) when the SS feature is excluded during both the training and testing stages. The results show that even the SS feature (which appeared to have only a slight impact under the incremental feature-addition approach of [40]) has an important impact on the overall system, causing a 2.11% decrease in its absence.
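The leave-one-out protocol of Table 6 can be sketched as the following loop. `train_and_eval` is a hypothetical stand-in for CRF training plus CoNLL F1 evaluation; the feature names mirror those discussed in the text.

```python
# Illustrative feature groups; the exact set is given in Section 5.2.
FEATURES = ["STEM", "POS", "NCS", "PROP", "INF", "CS", "SS"]

def ablation(train_and_eval, features=FEATURES):
    """Return, per feature, the F1 drop caused by excluding that feature.

    train_and_eval: callable taking a feature list and returning an F1 score
    (hypothetical; stands in for the full CRF train/test cycle).
    """
    baseline = train_and_eval(features)
    return {f: baseline - train_and_eval([g for g in features if g != f])
            for f in features}
```

A large positive drop for a feature indicates that the model relies on it, as observed for SS above.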
Contribution of each feature for ENAMEX types (train3/wfs3)
The impact of the inflectional features (INF) is also not surprising for such an agglutinative language, since these features often carry information that would be expressed by individual words in a morphologically poorer language. All of the added morphological features have an important impact on performance. One should keep in mind that the +NCS and +PROP features are atomic units extracted from the +INF feature. Since Turkish is an agglutinative language, the number of possible values for the +INF feature is very high. For this reason, many recent studies do not prefer using the inflectional features as a block (many atomic features concatenated to each other). Nevertheless, we observe from Table 6 that the INF feature has an important impact.
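The relation between the full inflectional block and its atomic sub-features can be sketched as follows. The analysis format (stem+POS+tag+tag+…) and the tag inventory below are assumptions for illustration, not the exact output of the morphological analyzer used in this work.

```python
def morphological_features(analysis):
    """Extract STEM, POS, NCS and the full inflectional block (INF)
    from an assumed 'stem+POS+Tag1+Tag2+...' analysis string."""
    parts = analysis.split("+")
    stem = parts[0]
    pos = parts[1] if len(parts) > 1 else "Unknown"
    inflections = parts[2:]  # INF: all inflectional tags kept as one block
    # NCS: the noun case tag, if present among the inflectional tags.
    case_tags = {"Nom", "Acc", "Dat", "Loc", "Abl", "Gen", "Ins"}
    ncs = next((t for t in inflections if t in case_tags), "None")
    return {"STEM": stem, "POS": pos, "NCS": ncs,
            "INF": "+".join(inflections) or "None"}
```

For example, for “evlerinde” (in their houses) an analysis like `ev+Noun+A3pl+P3sg+Loc` yields the atomic NCS value `Loc` alongside the full INF block.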
Table 7 gives the evaluation results of our second set of experiments conducted on the extended news dataset (
Extension to 7 NE types
The most striking result in Table 7 is that the base model’s average success on ENAMEX types (92.33%) is better than that of the system trained only on ENAMEX types (91.94%). Our investigations show that the reason is the alleviation of misclassifications of some named entities once these are annotated in TIMEX categories: e.g., “Eylül” (September) and “Ekim” (October) are month names but at the same time very common female names in Turkish. The new annotations prevent the classifier’s tendency to annotate these as person names, as was the case when trained on
Table 8 evaluates the impact of the newly added TIMEX and NUMEX features, similarly to our initial experiments. The -OT (o’clock term) and -PS (percentage) lines in Table 8 give exactly the same performances since they affect the same instances in the test data. When these two features are excluded at the same time, the performance drop on NUMEX categories is almost 11 percentage points.
Contribution of each feature for NUMEX & TIMEX types (
Contribution of each feature for UGC adaptation (
The next set of experiments evaluates the UGC adaptation introduced in Section 5.4. When we evaluate the system extended to 7 entity types (without any UGC adaptation) on
Performance on
We also evaluate the final system on
As stated in the previous sections and in many studies in the literature, CRFs are proven to perform well on the NER task. However, their modeling for MRLs, in other words handling the rich morphology and the sparse data problem caused by the high number of possible word surface forms in such languages, is still an active research area. In addition, the normalization needed for the textual content of the Social Web is also complicated for MRLs [50], and it is not clear how normalization and the higher-level tasks aiming to extract structured data from such content should be orchestrated. For the NER task, there exist studies (given in previous sections) reporting negative results when a sophisticated text normalization is applied prior to NER. The reason may be the mutual dependency of the two tasks: each needs the outputs of the other to produce better results.
This article, which introduces a NER model for Turkish, a morphologically very rich language, reports improvements over previous attempts at modeling it for both well-formed texts and UGC. Although the results are very promising, we believe this area still needs more in-depth investigation in order to improve the performance, especially on UGC.
This section compares our system with prior work on the morphological modeling of Turkish for the NER task and on its adaptation to the UGC domain. To this end, the work of Yeniterzi [53], which explores the impact of morphology on Turkish NER, is selected as the baseline comparative study for our morphological modeling. The work is replicated for a reliable comparison and discussed in more detail in the remainder of this section. For the UGC adaptation, the proposed model is compared with two CRF-based models ([2] and [8]) and two rule-based models ([17] and [18]). The work of Çelikkaya et al. [2], the pioneering study for Turkish, is selected as the baseline comparative study for UGC adaptation, and the performance of its reimplementation under our experimental settings is also provided below.
Yeniterzi [53] includes morphological information in the CRF model through a new tokenization approach instead of word-based tokenization: each atomic morphological feature is provided to the system as a separate token and labeled individually. Yeniterzi [53] states no significant improvement of this tokenization over the word-based one. As explained in the previous sections, our approach to including morphology consists of adding two atomic features (the part-of-speech tag POS and the noun case information NCS) extracted from a word’s morphological analysis, together with the full analysis (INF), all of which are shown to have a positive impact on the overall performance (Table 6). In both works, the stem information extracted from the morphological analysis is also added to the feature model in order to reduce data sparsity.
Comparison with related work on UGC
Another difference from [53] is the usage of letter case information. While our case feature (CS) takes 4 possible values (lower-case(0), UPPERCASE(1), Proper Name Case(2) and miXEd CaSe(3)), it takes only 2 values (lowercase and uppercase) in [53]. Yeniterzi [53] reports the impact of this feature as 1.53 percentage points, whereas in our experiments we obtain an impact of 3.37 percentage points (Table 6). In this section, we replicate the model of [53] with our settings ((
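The four-valued CS feature can be sketched as below; the function name is ours and tokens without any letters are lumped into the mixed class for simplicity.

```python
def case_feature(token):
    """CS feature: 0 lower-case, 1 UPPERCASE, 2 Proper Name Case, 3 miXEd CaSe."""
    if token.islower():
        return 0
    if token.isupper():
        return 1
    if token[:1].isupper() and token[1:].islower():
        return 2
    return 3  # mixed case (also the fallback for letterless tokens)
```

Under the two-valued scheme of [53], the proper-name and mixed classes would collapse into the uppercase class, which is consistent with the smaller impact reported there.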
Çelikkaya et al. [2] follow the work of Şeker and Eryiğit [40] and try to adapt a similar CRF-based NER model to UGC domains. The authors test different feature models which are reduced versions14
In these reduced feature models, some of the lexical features presented in Section 5.2.2 were considered useless for the UGC domain and removed from the feature set.
As given previously, our baseline score on the reannotated version (
Table 11 also presents the comparison with the other related works on UGC domain. The first set of the table provides the results on
Eken and Tantuğ [8] also use CRFs but with a different feature model, which basically consists of the surface form and the first and last 4 characters of the words, instead of morphological features, lexical features and gazetteer lookups. Their reported accuracy on
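Such character affix features can be sketched as below: a cheap proxy for stem and suffix information in an agglutinative language, since Turkish suffixes concentrate at the word end. The function and key names are our illustration of the idea, not the authors' code.

```python
def affix_features(token, n=4):
    """First and last n characters of a token, lowercased.

    In agglutinative words the prefix roughly approximates the stem and
    the suffix the final inflections, without any morphological analyzer.
    """
    return {"prefix": token[:n].lower(), "suffix": token[-n:].lower()}
```

For example, for “evlerinden” (from their houses) the prefix captures the stem “ev(le)” while the suffix captures the ablative ending.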
The dataset
Küçük et al. [17] apply a rule-based multilingual NER system [34] to Turkish tweets. The system mostly employs language-independent rules that refer to language-specific dictionary lists to recognize ENAMEX types, and it considers only those candidate tokens whose initial letters are capitalized. The system can be adapted to a new language by providing separate word lists for that language. Küçük et al. [17] tailor it for Turkish by equipping it with the required lists for Turkish information extraction, including lists of common person, location and organization names as well as organization endings in Turkish. The work focuses only on ENAMEX types. In order to make a comparison with this work, Table 11 provides an extra set of scores on
Küçük and Steinberger [18] adapt the rule-based system of [20] to better fit the Twitter language by relaxing its capitalization constraint and by a diacritics-based expansion of its lexical resources. They employ a simplistic normalization scheme on tweets to observe its effects on the overall named entity recognition performance on Turkish tweets. Table 11 provides the comparisons on
In order to reach the ideal of a multilingual semantic web, the semantic enrichment of UGC in different languages plays a key role. However, the modelling of morphologically rich languages for the named entity recognition task still remains an open research question, and despite the very high results reported for morphologically less complex languages, NER success rates have still not reached human performance in the case of MRLs.
This article presents a CRF-based NER system which successfully models the morphologically very rich nature of Turkish and which, we believe, may serve as a model for similar languages. The article describes the lexical and morphological feature representations and the preprocessing stages used to improve the performance both on well-formed texts and on user generated content. The re-annotation of the available datasets (from the well-formed text domain) to extend the covered named entity types (ENAMEX, TIMEX and NUMEX), as well as two newly annotated datasets from Web 2.0, are introduced. The compiled gazetteers, datasets and feature templates are made available for future research from
The introduced approach reveals an exact match F1 score of 92% on a dataset collected from Turkish news articles and ∼65% on different datasets collected from Web 2.0. Although the results obtained on well-formed texts are now at acceptable levels, the field still needs new research in order to improve the results on non-canonical social media content. Especially, the detection of proper nouns that also have a common noun meaning, when written in lowercase letters, needs special focus as future work. The impact of normalization also needs further investigation. In this new UGC domain, named entity recognition and normalization become two NLP layers which are hard to orchestrate, each needing the outputs of the other to produce better results. As a result, joint systems combining these two layers deserve investigation in future research.
Acknowledgements
We would like to acknowledge that this work is part of a research project entitled “Parsing Web 2.0 Sentences” subsidized by the TUBITAK (Turkish Scientific and Technological Research Council) 1001 program (grant number 112E276) and part of the ICT COST Action IC1207. We want to thank the following people without whom it would be impossible to produce this work: Reyyan Yeniterzi and Ilyas Çiçekli for providing their datasets, Gökhan Tür for the helpful discussions, Dilek Küçük and Adnan Yazıcı for processing the test data with their NER tool and Memduh Gokirmak for helping during the annotation process. Finally, we want to thank our three reviewers for insightful comments and suggestions that helped us improve the final version of the article.
