Abstract
Named entity recognition (NER), which provides useful information for many high-level NLP applications and semantic web technologies, is a well-studied topic for most languages, and especially for English. However, the modelling of morphologically rich languages (MRLs) for the NER task is still an open research area. Studies on Turkish, a strong representative of MRLs, lagged behind those on well-studied languages for a long while. In recent years, Turkish NER has intrigued researchers due to its scarce data resources and the unavailability of high-performing systems. In particular, the need to semantically enrich the textual data coming with user generated content initiated many studies in this field. This article presents a CRF-based NER system which successfully models the morphologically very rich nature of this language, together with its extensions to expand the covered named entity types and to process the extra-challenging user generated content coming with Web 2.0. The article also introduces the re-annotation of the available datasets and a brand new dataset from Web 2.0. The introduced approach achieves an exact-match F1 score of 92% on a dataset collected from Turkish news articles and ∼65% on different datasets collected from Web 2.0. The proposed model is believed to be easily applicable to similar MRLs with relevant resources.
Introduction
The second generation of the world wide web (Web 2.0), which is also referred as the Social Web, focuses on people and their social interactions by the use of attractive and easy-to-use applications. The volume of user generated content (UGC)2
Moens et al. [28] defines UGC as “any form of content such as blogs, wikis, discussion forums, posts, chats, tweets, podcasts, digital images, video, audio files, advertisements and other forms of media that was created by users of an online system or service, often made available via social media websites”.
Named Entity Recognition (NER) can be basically defined as identifying and categorizing certain types of data (e.g. person, location, organization names, date-time expressions). Besides its value for semantic web technologies, NER is also an important stage for several natural language processing (NLP) tasks including machine translation, sentiment analysis and syntactic parsing. The MUC (Message Understanding Conference [4,45]) and CoNLL (Conference on Computational Natural Language Learning [48,49]) conferences define three basic categories of named entities: 1) ENAMEX (person, location and organization names), 2) TIMEX (date and time entities) and 3) NUMEX (numerical expressions like money and percentages). However, NER research is not limited to only these types; different application areas concentrate on determining alternative entity types such as protein names, medicine names and book titles.
NER research first started in the early 1990s for English. By 1995, with the high interest of the research community, the success rates for English had nearly reached human annotation performance on news texts [45]. Nadeau and Sekine [31] survey the research on English NER between 1991 and 2006. The maturity of the English NER task directed the field to new research areas such as multilingual NER systems [48,49], transliteration [54], coreference [30] of named entities and especially NER on UGC [23–25,29,35,38,39].
The use of Conditional Random Fields (CRFs) [22], which are reported to offer several advantages over hidden Markov models (HMMs), stochastic grammars and maximum entropy Markov models (MEMMs), became very dominant in the literature for the named entity recognition task. CRF-based NER models have been experimented with for various domains and languages: [27] for English and German, [9] for Hindi and Bengali, [3] for Chinese, [42] for biomedical data, [24,35] for Tweets are some studies among many others.
Morphologically rich languages (MRLs) (such as Finnish, Czech, Korean, Hungarian and many others) pose interesting challenges for NLP tasks (e.g., data scarcity, the representation of rich morphological features in different tasks), and NER is no exception. Although some studies report their approaches for particular MRLs, the usage of morphological information for the NER task is still an open research issue. Georgiev et al. [12] add word prefix and word suffix information as new features to their system for Bulgarian. Hasan et al. [15] use the first and last 3 characters of the words as extra features in order to capture prefix and suffix information for Bengali. Konkol and Konopík [16] report that their effort to add morphological features did not yield any performance improvement for Czech, as does Yeniterzi [53], who reports similar findings for Turkish.
Turkish is a free-constituent-order language with complex agglutinative, inflectional and derivational morphology. With its morphologically very rich nature, Turkish is one of the strongest representatives of MRLs and attracts the attention of the NLP community. In particular, the need to semantically enrich the textual data coming with UGC has initiated many studies on Turkish NER in recent years. Nevertheless, the results for Turkish NER still remain far behind the reported accuracies for English. This article introduces a CRF-based Turkish NER model (first introduced in [40] on Turkish well-formed texts for ENAMEX types only) and the enhancements made to extend its coverage to TIMEX and NUMEX entity types and to process UGC3
This manuscript focuses only on the textual content coming with UGC [28].
The article is organized as follows: Section 2 gives brief information about Turkish language characteristics related to NER, Section 3 gives a brief overview of the previous studies for Turkish NER, Section 4 gives information about existing and newly introduced language resources, Section 5 gives the details of the proposed framework, its extensions to TIMEX and NUMEX entities and to Web 2.0 domain, Section 6 provides our experiments and evaluates the results by comparing with related work and Section 7 gives the conclusion.
This section briefly states the characteristics of the Turkish language which are considered to influence the NER task. Turkish is a morphologically rich and highly agglutinative language. In most Turkish NLP studies, lemmas are used instead of word surface forms in order to decrease lexical sparsity. For example, the Turkish verb “gitmek” (to go) may appear in hundreds of different surface forms4
Some surface forms of “gitmek” (only in simple present tense for different person arguments): gidiyorum, gidiyorsun, gidiyor, gidiyoruz, gidiyorsunuz, gidiyorlar.
Although in well-formed text only proper nouns, abbreviations and the initial words of sentences start with a capital letter, this is most of the time not the case in the social media domain. Turkish person (first) names are usually selected from common nouns such as İpek (silk), Kaya (rock), Pembe (pink) and Çiçek (flower). This property of the language makes the recognition of such named entities very hard in the UGC domain, where the appropriate capitalization rules are frequently ignored.
Turkish is a free word order language. As a consequence of this property, the position of the word in a sentence does not provide information about being a named entity or not. All of the three sentences: “Ahmet yarın Mehmet ile konuşmaya gidecek.”, “Yarın Mehmet ile konuşmaya Ahmet gidecek” and “Yarın Ahmet, Mehmet ile konuşmaya gidecek.” are valid Turkish sentences all with the English translation of “Tomorrow, Ahmet will go to talk to Mehmet”.
Çelikkaya et al. [2] make a preliminary investigation of the problems caused by UGC for Turkish NER. The following example from [2] shows the complexity caused by the omission of the above-mentioned rules for proper nouns. In this tweet, which in formal writing should actually be written as “Aydın’lara gidiyoruz.” (“We are going to the Aydıns’.”), “Aydın” is a person name. When written with lowercase letters, the word also has the meaning of a common noun, “enlightened”. This makes it very difficult to differentiate this named entity (a person name) from the common noun.
Another problem in real data is spelling errors, produced either by mistake or on purpose for exaggeration, interjection, or ASCIIfication (removal of accent, cedilla, etc.) of the special Turkish letters (öüçşığİ). In the first line of the original example, the letter “ı” is written with its ASCII counterpart and repeated multiple times to express exclamation. The second line of the same example shows the case where all the letters are capitalized; here again it is very difficult to detect the named entity and resolve the ambiguity caused by the common-noun sense of the proper noun.
And finally, the following example, again from [2], illustrates foreign words inflected with Turkish suffixes while omitting the required apostrophe sign: in the tweet, “Bieber” is used in the accusative case without the required apostrophe (“I don’t like Bieber.”).
Overview of the previously reported Turkish NER results (compiled in [40])
Şeker and Eryiğit [40] compile the performances of all the previous Turkish NER studies and try to make comparisons with them whenever possible, although neither the previous systems nor most of the used datasets were publicly available. The authors interacted with the owners of the previous systems, which made it possible to obtain a performance score either by sending their test data to be tested with the prior system or by obtaining the test dataset used in the prior work.
Table 1 (from [40]) gives an overview of the previously reported Turkish NER performances. The performances listed in Table 1 are organized in decreasing order of credit given to partial matches during evaluation. The outputs of NER systems are generally evaluated in comparison with human annotations. Nadeau and Sekine [31] exemplify the different types of errors which may occur during the automatic recognition of named entities, e.g. a totally missed NE, or an identified entity with a wrong entity type, wrong boundaries or both. Nadeau and Sekine [31] give the details of the three main scoring techniques (ACE,5 MUC and CoNLL).
In the ACE evaluation method, each entity type has a parameterized weight and contributes up to a maximal proportion of the final score (e.g., if each person is worth 1 point and each organization is worth 0.5 point then it takes two organizations to counterbalance one person in the final score).
Similar to most NER studies in the literature, the MUC and CoNLL evaluations were widely used in Turkish NER studies. In the MUC evaluation approach, a system is scored on two axes: its ability to find the correct type (TYPE) and its ability to find the exact text (TEXT). MUC TEXT evaluates only the NE boundaries, without checking whether the correct NE type is assigned; MUC TYPE evaluates the assignment of the correct named entity (NE) type to each word, without taking into account whether the NE boundaries are detected correctly. The final MUC score is the micro-averaged F-measure, the harmonic mean of precision and recall calculated over all entity slots on both axes. The CoNLL scoring technique, on the other hand, uses an exact-match evaluation: it counts an assignment as correct only if both the type and the boundary of an NE are determined correctly. The calculated score is again the micro-averaged F-measure, but this time calculated over the exactly matched named entities. The results (excluding those provided for [21] and [32]) in Table 1 are given as MUC and CoNLL F1 scores. Note that the test sets, evaluation methods (3rd column), working domain (4th column) and entity types (5th column) in focus of each work differ from each other.
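As a concrete illustration, CoNLL-style exact-match precision, recall and F1 can be computed from gold and predicted entity spans as follows (a minimal sketch, not the shared-task evaluation script itself; entities are assumed here to be represented as (type, start, end) tuples):

```python
def conll_f1(gold, pred):
    # exact match: an entity counts as correct only if both its
    # type and its boundaries agree with the gold annotation
    correct = len(set(gold) & set(pred))
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    # micro-averaged F-measure over exactly matched entities
    return 2 * precision * recall / (precision + recall)
```

Under the MUC scheme, by contrast, the second prediction below would still earn partial credit on the TEXT axis because its boundaries are correct.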
The first published work on Turkish NER is [5], a language-independent system tested on Romanian, English, Greek, Turkish and Hindi. This system is trained on a small training set and learns from unannotated text using a bootstrapping algorithm. The first NER work specific to Turkish is [52]. That study focuses on three Information Extraction (IE) tasks, namely sentence segmentation, topic segmentation and name tagging. For the name-tagging task, the authors use lexical, morphological and contextual features of the words to build an HMM-based model. They use a training and test set collected from news articles, which will be introduced in the following sections. The authors use the same training data as [40], but a different test set which is not available. Their performance is reported as 91.56%. To allow a rough (although not strictly comparable) comparison, Şeker and Eryiğit [40] also provide their MUC F1 score (94.59%) as well as their CoNLL F1 score of 91.94%.
Bayraktar and Temizel [1] work on financial texts to find only person names. They apply the local grammar based approach of [51] to Turkish. Bayraktar and Temizel [1] initially identified common reporting verbs in Turkish and then used these reporting verbs to generate patterns for locating person names. The study reports a CoNLL F1 score of 81.97% which is not directly comparable with any of the related work given in this section due to the difference in the used datasets.
Yeniterzi [53] uses CRFs and exploits the impact of morphology for Turkish NER. This work is the one which is most similar to ours except for the usage of morphological features and gazetteers. Yeniterzi [53] and Şeker and Eryiğit [40] use the same training and test data. Table 1 gives the reported performance by [53]. In order to be able to make a strict comparison, the results of its replication and evaluation under our settings are provided under Section 6.2.
Özkaya and Diri [32] also use CRFs for NER on email messages, but since they use features specific to the email domain only (such as the from and subject fields), their work may not extend to general texts. They do not state their evaluation metrics or overall results, but overall precision, recall and F-measure values can be calculated as 92.89%, 77.07% and 84.24% respectively using the token counts provided in their paper.
The automatic rule learning system of [47] starts with a set of seeds selected from the training set, and then extracts rules over these examples. The named-entities are generalized by using contextual, lexical, morphological and orthographic features. Although the authors do not explicitly mention that they use the CoNLL evaluation method, the evaluation strategy of looking for the exact match seems compatible with it. Their reported accuracy is 91.08% on ENAMEX and TIMEX types. The relevant F-measure for only ENAMEX types is calculated as 90.63%.
Küçük and Yazıcı [21] use rote-learning [11] in order to extend their rule-based recognizer [19] into a hybrid recognizer, so that it can learn from the available annotated data and extend its knowledge resources. They evaluate their system on general news texts, financial news texts, historical texts and children’s stories. In Table 1 we took the results on the general news texts domain, which is the closest to our domain. Their evaluation strategy gives more credit to partial matches and is similar to neither the CoNLL nor the MUC scoring technique. They work on ENAMEX, TIMEX and NUMEX entity types but do not provide scores for each of these. After measuring this system’s performance on their own dataset, Şeker and Eryiğit [40] report a CoNLL F1 score of 69.78% on ENAMEX types for [21].
Demir and Özgür [6] address NER task for morphologically rich languages by employing a semi-supervised learning approach based on neural networks. They adopt a fast unsupervised method for learning continuous vector representations of words, and use these representations along with language independent features. They test their work on the data set of [40] and report a CoNLL F1 score of 91.85%.
The Turkish NER studies on UGC domain are very recent and limited in number compared to well formed text domain. The work of Çelikkaya et al. [2] is the first study which investigates the NER success on Turkish UGC; they test on 3 different domains, namely on datasets collected from Twitter, a Speech-to-Text Interface and a Hardware Forum. Küçük et al. [17], Küçük and Steinberger [18] and Eken and Tantuğ [8] follow this trend and report their approaches on Twitter datasets. The outputs of our extended CRF-Model are compared with the mentioned studies in Section 6.2.
Some sample annotations from the formal news text dataset
This section firstly gives the features of the existing and freely available Turkish datasets tagged with named entities. Then, it introduces the newly annotated ones within this work.
Available datasets
The most widely used dataset for Turkish NER research is introduced by [52]. This data consists of nearly 500K words collected from newspaper articles and is annotated only for ENAMEX types. Another available dataset from the well-written text genre comes from [47]. This dataset is rather small (∼55K) compared to the previous one and as a result is less preferable for supervised machine learning systems, which mostly need a high volume of human-annotated data. The dataset consists of news articles on terrorism from both online and print news sources in Turkish. The annotated types on this corpus are the ENAMEX and TIMEX categories.
The datasets from the UGC domain are brand new and the available ones are as follows:
Çelikkaya et al. [2] introduce three datasets annotated with ENAMEX, TIMEX and NUMEX types: 1) a 55K-word dataset from a very popular online forum dedicated to hardware product reviews. An important feature of this dataset is that it mostly contains trademarks (generally company names) and their products together with the related model. Although this type of named entity is categorized under more specific classes in extended NE classifications [41], the most relevant MUC-6 category for these is “Organization”. This forum data is full of spelling errors, and capitalization is used improperly or not at all in most cases. 2) a very small corpus (∼1.5K) collected from the speech-to-text interface of a mobile assistant application. The most important characteristic of this dataset is that the produced text messages contain no capitalization or punctuation at all. 3) a 55K-word Twitter corpus which is used for testing purposes in many of the follow-up studies [17,18] and [8]. Unfortunately, the annotations in this new domain were arguable, and this resulted in the simultaneous emergence of re-annotated versions of the same dataset by different groups6
Table 2 from [8], Table 1 from [2] and Table 1 from [18] provide the number of annotated named entities in the different versions of this dataset. The main arguable point in the annotation of this dataset was the tagging of named entities containing apostrophes.
Human annotation of language resources is a costly process, and the creation of benchmark datasets is very valuable for speeding up progress in a research area. As may be noticed from the previous subsections, early Turkish NER studies mostly evaluated their success on their own datasets, which makes it hard to compare the proposed approaches. In this study, we selected the two most widely used datasets from the Turkish NER literature, one from the well-written text domain [52] (which is also the biggest dataset) and one from the UGC domain [2] (limited to Twitter content only), and re-annotated them with the following two main purposes:
To extend the covered named entity types (to also cover TIMEX and NUMEX types), which were previously limited to ENAMEX types only (in [52]).
To improve the consistency, and hence the quality, of the annotations by strictly following a specific guideline (namely the guidelines of MUC-6, the Sixth Message Understanding Conference [13]).
Previous annotations were also carefully investigated during this second round of annotation. In addition to these two datasets, we also annotated a brand new Turkish treebank from the social media domain: the ITU Web Treebank (IWT) [33]. The IWT is specifically selected for NER annotation due to its representativeness of UGC. Its composition (free from duplicates and re-tweets) includes UGC from different Web 2.0 domains (namely news story comments, personal blog comments, customer product reviews, social network posts and discussion forum posts), which we believe eliminates the limitation to Twitter content found in recent work. Two human annotators served during the annotation process; the strength of agreement is considered ‘very good’ according to the Kappa statistic.7
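For reference, inter-annotator agreement over token-level NE labels can be measured with Cohen's kappa, where values above roughly 0.8 are conventionally interpreted as 'very good' agreement (a minimal sketch; the article does not specify the exact computation it used):

```python
def cohens_kappa(labels_a, labels_b):
    # observed agreement corrected for the agreement expected by chance
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)
```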
Confidence intervals were calculated using the GraphPad QuickCalcs Web site:
Table 3 gives the distribution of the named entities for each annotated dataset. One should note that the reported number of named entities may differ significantly from some of the previous studies (e.g. [2,53]), which report the number of tokens (forming a named entity) instead of the actual number of named entities (consisting of one or more tokens) provided here.
Entity distributions in newly introduced datasets
This section introduces a CRF-based NER system which successfully models the morphologically very rich nature of Turkish, the features used for ENAMEX types and the newly added TIMEX and NUMEX types, and its adaptation to UGC.
Proposed framework
Figure 1 shows the architecture of the used framework. The following subsections provide the details of each module.

Proposed Framework.
We tokenized our data so that each word is represented as a single token, except for proper nouns which undergo inflection. Each punctuation character is considered a token. Sentences are separated from each other by an empty line. The tokenization of a sample sentence can be seen in Table 4.
Since in an MRL the inflections that may be attached to a proper noun are more diverse (e.g., case markers, copulas rendered as suffixes) and more common than in English, the tokenization of proper nouns has an important role in the success of the NER system. The apostrophe, which in English is generally used for the possessive clitic (’s) on common nouns or the plural suffix on proper nouns, may have a long token appended to it in MRLs (consisting of multiple inflectional features), e.g. Ankara’dakiler (‘those which/who are in Ankara’). Since the suffixes separated by an apostrophe are not part of the named entities (NEs) according to the MUC-6 guidelines, we partitioned such proper nouns into two tokens (the tokens before and after the apostrophe), which is shown to perform better than basic word-level tokenization. In the literature on NER for MRLs, there are few studies reporting experiments related to tokenization. To our knowledge, only Yeniterzi [53] experiments with a morpheme-level tokenization, reporting no improvement with respect to basic word-level tokenization.
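The apostrophe-based split described above can be sketched as follows (an illustrative sketch, not the article's actual tokenizer; whether the apostrophe stays with the suffix token is our assumption here):

```python
def tokenize_proper_noun(token):
    # split an inflected proper noun at the apostrophe, per MUC-6:
    # suffixes after the apostrophe are not part of the named entity
    if "'" in token:
        head, _, suffix = token.partition("'")
        # keep the apostrophe with the suffix token (assumption)
        return [head, "'" + suffix]
    return [token]
```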
IOB2 tagging vs RAW tagging
We used a two-level morphological analyzer [10] to produce the possible analyses for each word. We then feed the output to a morphological disambiguator [10] in order to get the most probable analysis in the given context. For example, the analyzer produces three different analyses for the word “Teknik” (Technical), corresponding to an adjective, a noun and a proper noun respectively; the disambiguator selects the most probable analysis within the given context:
Teknik teknik+Adj
Teknik teknik+Noun+A3sg+Pnon+Nom
Teknik teknik+Noun+Prop+A3sg+Pnon+Nom
The output of the analyzer includes both the stem of the word and the morphological features8
The abbreviations after the plus sign stand for: +Adj: Adjective, +Noun: Noun, +A3sg: 3sg number-person agreement, +Pnon: Pronoun (no overt possessive agreement), +Nom: Nominative case, +Prop: Proper noun.
Our preliminary work [40] introduced two kinds of gazetteers, called base and generator gazetteers, which have been compiled from different sources without taking the test corpora into consideration. The base gazetteers are composed of large lists of person and location names (∼261K tokens). The collected person names have been split into first-name and surname gazetteers, both to anonymize our gazetteers and to be able to detect different combinations of these. The location gazetteer has been collected so that it includes all location names in the Turkish postal code system,9
Mostly collected from wikipedia.com.
# of distinct tokens in gazetteers
At this stage, we use the information coming from the raw data, the gazetteers and the morphological processing to prepare the feature vectors for our training/test instances. For the class labels at the training stage, we use “raw tags”: labels such as “PERSON”, “ORGANIZATION”, “LOCATION” and “O” (other, for words which do not belong to an NE) without any position information (that is, without any prefix). In our preliminary experiments [40], we experimented with different training data formats (IOB, IOB2, raw labels and the fictitious boundary model of [52]) and reported that the highest performance is obtained by using the raw labels, whereas using the IOB formats reduced the performance by 0.4% and the fictitious boundary format by 2%. Thus, in this article we follow the same approach and use the raw tags during the training stage. Table 4 gives tagging examples with both IOB2 and raw tags.
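Since CoNLL-style evaluation expects IOB2 output, the raw labels produced by the tagger have to be converted back; a minimal sketch of such a conversion (note that RAW tagging cannot mark a boundary between two adjacent entities of the same type, a known trade-off of this format):

```python
def raw_to_iob2(raw_tags):
    # convert position-free RAW tags (e.g. PERSON, O) to IOB2
    # (B- opens an entity, I- continues it)
    iob = []
    prev = "O"
    for tag in raw_tags:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append("I-" + tag)
        else:
            iob.append("B-" + tag)
        prev = tag
    return iob
```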
Conditional random fields
Conditional random fields (CRFs) [22] are a framework for building probabilistic models to segment and label sequence data. A CRF is a discriminative model better suited to including rich, overlapping features, focusing solely on the conditional distribution p(y|x). In a linear-chain CRF, this distribution is modeled as p(y|x) = (1/Z(x)) exp(Σ_t Σ_k λ_k f_k(y_{t−1}, y_t, x, t)), where Z(x) is a normalization factor. For the named entity task, each state y_t corresponds to the named entity label of the token at position t. The features f_k are indicator functions over the current label, the previous label and the observation sequence, weighted by the parameters λ_k learned during training.
In some studies, it is shown that useful feature conjunctions may be determined incrementally and provided to the system automatically [26]. In this study, however, we used the approach proposed in [43] and selected useful features manually for our initial explorations. Although this approach generally results in a huge number of features, we did not encounter any memory problems when using the combinations.
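As a toy illustration of how a linear-chain CRF combines such weighted feature functions into the (unnormalized) log-score of a label sequence (a didactic sketch only, not the CRF++ implementation used in this work; feature names and signatures are our own):

```python
def crf_score(weights, features, labels):
    # unnormalized log-score: sum over positions of
    # weight_k * f_k(previous_label, current_label, position)
    total = 0.0
    prev = "<s>"  # sentence-start pseudo-label
    for t, label in enumerate(labels):
        for name, f in features.items():
            total += weights.get(name, 0.0) * f(prev, label, t)
        prev = label
    return total
```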
We provided our atomic features within a fixed window around the current token, expressed as CRF++ feature templates. For example, U15 is the template using the 2nd feature (part-of-speech tag) of the second previous word. U50 is the template using the conjunction of the existence of the current word in the location name gazetteer (LG) (column 10) and its case feature (column 6), e.g., exists in LG written in lowercase; exists in LG and the first letter capitalized.
We use the bigram option of CRF++ in order to automatically generate the edge features over the previous and current labels.
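Concretely, the templates above might look as follows in a CRF++ template file (a hypothetical fragment reconstructed from the description, not the article's actual template file; in CRF++, %x[row,col] addresses a feature column relative to the current token):

```
# U15: POS tag (column 2) of the second previous word
U15:%x[-2,2]
# U50: conjunction of the location-gazetteer lookup (column 10)
# and the case feature (column 6) of the current word
U50:%x[0,10]/%x[0,6]
# B generates bigram (edge) features over the previous and current labels
B
```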
Features used for ENAMEX types
In our model, three groups of features are used for ENAMEX types, detailed below: morphological features, lexical features and gazetteer lookup features.
Morphological features
The morphological features are extracted from the analysis produced after the automatic morphological processing of each word.
The stem information. For the inflected proper nouns where the inflections after the apostrophe are treated as a separate token, the surface form after the apostrophe is itself assigned as the stem of the token representing the inflections.
The final part of speech category for each word. In Turkish, with the use of derivations, words may change their part of speech categories within a single surface form. The final form of the word determines its syntactic role within a sentence. Therefore, we use the final POS form of each word. We assigned a special POS tag (“APOST”) to the tokens separated by an apostrophe from the proper nouns.
The case argument. This feature is 0 for non-nominal tokens and one of the following values for nominals: Nominative (NOM), Accusative/Objective (ACC), Dative (DAT), Ablative (ABL), Locative (LOC), Genitive (GEN), Instrumental (INS), Equative (EQU). Ex: the value will be NOM for the word “Teknik” with the morphological analysis “teknik+Noun+Prop+A3sg+Pnon+Nom”.
A binary feature indicating whether the “+Prop” tag exists (1) in the selected morphological analysis or not (0). Ex: the value will be 1 for the word “Teknik” given above. It is useful to mention that the morphological pipeline tags all unknown words as proper nouns.
All inflectional tags after the POS category. If a derivation exists then the inflectional tags after the last derived POS category is used. Ex: the value will be “Prop+A3sg+Pnon+Nom” for the word “Teknik” with the above morphological analysis.
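Extracting these features from a disambiguated analysis string can be sketched as follows (a simplified illustration assuming analyses without derivational boundaries; the helper name and dictionary keys are our own, not from the article):

```python
def morph_features(analysis):
    # parse an analysis such as "teknik+Noun+Prop+A3sg+Pnon+Nom"
    parts = analysis.split("+")
    stem, tags = parts[0], parts[1:]
    pos = tags[0]                      # final POS (no derivation assumed)
    prop = 1 if "Prop" in tags else 0  # proper-noun flag
    cases = {"Nom", "Acc", "Dat", "Abl", "Loc", "Gen", "Ins", "Equ"}
    case = next((t for t in tags if t in cases), "0")  # 0 for non-nominals
    inf = "+".join(tags[1:])           # inflectional tags after the POS
    return {"STEM": stem, "POS": pos, "PROP": prop, "CASE": case, "INF": inf}
```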
As stated in the introduction, the usage of morphological information in NER modeling for MRLs is still an open research issue. In the literature, some studies use the first and last n characters of a word as extra CRF features in order to include prefix and suffix information. However, Turkish morphology is so rich that the possible suffix combinations cannot be limited to a predefined length. Since affixes in Turkish appear almost always as suffixes (except in some very rare foreign words), no extra feature is needed to represent prefixes in this language. A uniform representation (the +INF feature) for suffixes, free from variances due to vowel harmony, is considered more appropriate for Turkish. As a result, we model the morphological information with the above features, where the +NCS and +PROP features are atomic units extracted from the +INF feature.
Lexical features
The information about lowercase and uppercase letters used in the current token. This feature takes 4 different values: lowercase (0), UPPERCASE (1), Proper Name Case (2) and miXEd CaSe (3).
A binary feature indicating whether the current token is at the beginning of a sentence (1) or not (0).
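The four-valued case feature can be sketched as follows (an illustrative sketch with our own function name; real tokenizer output may need extra handling for digits and punctuation):

```python
def case_feature(token):
    # 0: lowercase, 1: UPPERCASE, 2: Proper Name Case, 3: miXEd CaSe
    if token.islower():
        return 0
    if token.isupper():
        return 1
    if token[0].isupper() and token[1:].islower():
        return 2
    return 3
```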
Gazetteer lookup features
Eight different features are used for each of the eight gazetteers introduced in Section 5.1.3.
Extra features for TIMEX and NUMEX types
The numeric class argument. This feature is 0 for non-numeric tokens, 1 for integer tokens in [1–12], 2 for integer tokens in [13–31], 3 for integer tokens in [32–2020], 4 for other integer tokens and 5 for all other numeric values.
A binary feature indicating whether the token is a percentage sign (%) or the word “yüzde” (percent).
A binary feature indicating whether the token is the word “saat” (o’clock).
A binary feature indicating whether the token includes the character “:”.
A binary feature indicating whether the token is included in the months gazetteer.
A binary feature indicating whether the token is included in the currency units gazetteer.
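The numeric class feature can be sketched as follows (an illustrative sketch; the ranges printed in the text overlap at 31, so day-of-month is assumed to take precedence here):

```python
def numeric_class(token):
    # 0: non-numeric, 1: [1-12], 2: [13-31], 3: [32-2020],
    # 4: other integers, 5: all other numeric values
    try:
        value = int(token)
    except ValueError:
        try:
            float(token.replace(",", "."))  # Turkish decimal comma
            return 5                        # other numeric value
        except ValueError:
            return 0                        # non-numeric token
    if 1 <= value <= 12:
        return 1  # possible month or hour
    if 13 <= value <= 31:
        return 2  # possible day of month
    if 32 <= value <= 2020:
        return 3  # possible year
    return 4      # other integer
```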
Adaptation for UGC
A widely used approach when adapting NER systems to the UGC domain is to apply text normalization prior to NE identification. In the literature, there exist very few studies on the text normalization of MRLs, which is a much more complicated problem than the normalization of English texts. In this work, the first approach tried, which could not produce good results, was to use a Turkish text normalizer [50] specifically developed for the Web 2.0 domain. As a result, instead of using such a comprehensive normalizer as a pre-processor, different error-tolerant gazetteer lookup scenarios were investigated, the highest performing of which is used in the final model. Similar to our findings, Çelikkaya et al. [2] and Eken and Tantuğ [8] report unsuccessful trials with their minimum-edit-distance based approaches. As exemplified in Section 2, it is very hard to detect proper names with a common-noun meaning when written in lowercase letters. Although this remains a challenging issue for Turkish NER studies, in this work we manually selected the names from our gazetteers with very little chance of being used as a common noun in Turkish texts. We then add a new binary CRF feature (CAP) indicating whether the current token exists in this auto-capitalization gazetteer. Finally, a binary feature indicates whether the given token conforms to a specific pattern (the Twitter mention tags).
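One plausible error-tolerant lookup scenario, folding the Turkish-specific letters to their ASCII counterparts before comparison, can be sketched as follows (an illustrative sketch of the general idea, not the article's exact method; the gazetteer entries are invented examples):

```python
# map the special Turkish letters to their ASCII counterparts
TR_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def asciify(token):
    return token.translate(TR_ASCII)

def tolerant_lookup(token, gazetteer):
    # compare ASCIIfied, lowercased forms so that e.g.
    # "ISTANBUL" and "istanbul" both match "İstanbul"
    return asciify(token).lower() in gazetteer

# gazetteer entries are stored in the same folded form
gazetteer = {asciify(name).lower() for name in ["İstanbul", "Aydın", "İpek"]}
```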
Experimental results
In recent years, the CoNLL evaluation method has become the de facto standard for the evaluation of NER systems. In this article, we follow this trend and use this method in all of our evaluations. Throughout the presentation of our experimental results, performances are provided as CoNLL F1 scores (micro-averaged F-measure calculated on exactly matched named entities, discussed in detail in Section 3). The output of the testing stage, produced with RAW labels, is automatically converted to IOB-2 style and then evaluated with the evaluation script of the CoNLL 2000 shared task.13
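The RAW-to-IOB-2 conversion can be sketched as below. We assume here that under RAW tagging each token of an entity carries the bare entity type (and “O” otherwise), so a run of identical adjacent tags is treated as a single entity; the function name is ours.

```python
def raw_to_iob2(labels):
    """Convert RAW tags (entity type per token, 'O' otherwise) to IOB-2.

    The first token of each run of identical types becomes B-TYPE,
    subsequent tokens of the run become I-TYPE.
    """
    iob = []
    prev = "O"
    for tag in labels:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append("I-" + tag)
        else:
            iob.append("B-" + tag)
        prev = tag
    return iob
```

The converted sequence can then be fed directly to the CoNLL evaluation script.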
Following the previous work [40,53], in all of the provided experiments for the well-formed text domain, we used the 445K tokens of news articles [52] (Table 3) as the training set (to be referred to as
In the following sections, the training and test set couples are provided for each experiment separated with a slash sign and between parentheses such as (
Following the work of [40], our first experiment investigates the impact of each selected feature on the identification of ENAMEXs. Table 6 shows the impact of each feature on the best model by leaving out one feature at a time. Each row of the table states the performance obtained by excluding the feature given in the first column: e.g., the -SS row gives the performance of the best model (given in the first row) when the SS feature is excluded during both the training and testing stages. The results show that even the SS feature (which appeared to have only a slight impact under the incremental feature-addition approach of [40]) has an important impact on the overall system, causing a 2.11% decrease in its absence.
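The leave-one-out protocol of Table 6 can be sketched as the following loop. `train_and_eval` is a hypothetical stand-in for CRF training plus CoNLL F1 evaluation; the feature names mirror those discussed in the text.

```python
# Illustrative feature groups; the exact set is given in Section 5.2.
FEATURES = ["STEM", "POS", "NCS", "PROP", "INF", "CS", "SS"]

def ablation(train_and_eval, features=FEATURES):
    """Return, per feature, the F1 drop caused by excluding that feature.

    train_and_eval: callable taking a feature list and returning an F1 score
    (hypothetical; stands in for the full CRF train/test cycle).
    """
    baseline = train_and_eval(features)
    return {f: baseline - train_and_eval([g for g in features if g != f])
            for f in features}
```

A large positive drop for a feature indicates that the model relies on it, as observed for SS above.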
Contribution of each feature for ENAMEX types (train3/wfs3)
The impact of the inflectional features (INF) is also not surprising for such an agglutinative language, since these features often carry information that would be expressed by individual words in a morphologically poorer language. All of the added morphological features have an important impact on performance. One should keep in mind that the +NCS and +PROP features are atomic units extracted from the +INF feature. Since Turkish is an agglutinative language, the number of possible values for the +INF feature is very high. For this reason, many recent studies do not prefer using the inflectional features as a block (many atomic features concatenated to each other). Nevertheless, we observe from Table 6 that the INF feature has an important impact.
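The relation between the full inflectional block and its atomic sub-features can be sketched as follows. The analysis format (stem+POS+tag+tag+…) and the tag inventory below are assumptions for illustration, not the exact output of the morphological analyzer used in this work.

```python
def morphological_features(analysis):
    """Extract STEM, POS, NCS and the full inflectional block (INF)
    from an assumed 'stem+POS+Tag1+Tag2+...' analysis string."""
    parts = analysis.split("+")
    stem = parts[0]
    pos = parts[1] if len(parts) > 1 else "Unknown"
    inflections = parts[2:]  # INF: all inflectional tags kept as one block
    # NCS: the noun case tag, if present among the inflectional tags.
    case_tags = {"Nom", "Acc", "Dat", "Loc", "Abl", "Gen", "Ins"}
    ncs = next((t for t in inflections if t in case_tags), "None")
    return {"STEM": stem, "POS": pos, "NCS": ncs,
            "INF": "+".join(inflections) or "None"}
```

For example, for “evlerinde” (in their houses) an analysis like `ev+Noun+A3pl+P3sg+Loc` yields the atomic NCS value `Loc` alongside the full INF block.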
Table 7 gives the evaluation results of our second set of experiments conducted on the extended news dataset (
Extension to 7 NE types
The most striking result in Table 7 is that the base model’s average success on ENAMEX types (92.33%) is better than that of the system trained only on ENAMEX types (91.94%). Our investigations show that the reason is the alleviation of misclassifications of some named entities once these are annotated in TIMEX categories: e.g., “Eylül” (September) and “Ekim” (October) are month names but at the same time very common female names in Turkish. The new annotations prevent the classifier’s tendency to annotate these as person names, as was the case when trained on
Table 8 evaluates the impact of the newly added TIMEX and NUMEX features, similarly to our initial experiments. The -OT (o’clock term) and -PS (percentage) lines in Table 8 give exactly the same performances since they affect the same instances in the test data. When these two features are excluded at the same time, the performance drop on NUMEX categories is almost 11 percentage points.
Contribution of each feature for NUMEX & TIMEX types (
Contribution of each feature for UGC adaptation (
The next set of experiments evaluates the UGC adaptation introduced in Section 5.4. When we evaluate the system extended to 7 entity types (without any UGC adaptation) on
Performance on
We also evaluate the final system on
As stated in the previous sections and in many studies in the literature, CRFs are proven to perform well on the NER task. However, their modeling for MRLs, in other words handling the rich morphology and the sparse data problem caused by the high number of possible word surface forms in such languages, is still an active research area. In addition, the normalization needed for the textual content of the Social Web is also complicated for MRLs [50], and it is not clear how normalization and the higher-level tasks aiming to extract structured data from such content should be orchestrated. For the NER task, there exist studies (given in previous sections) reporting negative results when a sophisticated text normalization is applied prior to NER. The reason may be the mutual dependency of the two tasks: each needs the outputs of the other to produce better results.
This article, which introduces a NER model for Turkish, a morphologically very rich language, reports improvements over previous attempts at modeling it for both well-formed texts and UGC. Although the results are very promising, we believe this area still needs more in-depth investigation in order to improve the performance, especially on UGC.
This section compares our system with prior work on the morphological modeling of Turkish for the NER task and on its adaptation to the UGC domain. To this end, the work of Yeniterzi [53], which explores the impact of morphology on Turkish NER, is selected as the baseline comparative study for our morphological modeling. The work is replicated for a reliable comparison and discussed in more detail in the remainder of this section. For the UGC adaptation, the proposed model is compared with two CRF-based models ([2] and [8]) and two rule-based models ([17] and [18]). The work of Çelikkaya et al. [2], the pioneering study for Turkish, is selected as the baseline comparative study for UGC adaptation, and the performance of its reimplementation under our experimental settings is also provided below.
Yeniterzi [53] includes morphological information in the CRF model through a new tokenization approach instead of word-based tokenization: each atomic morphological feature is provided to the system as a separate token and labeled individually. Yeniterzi [53] states no significant improvement of this tokenization over the word-based one. As explained in the previous sections, our approach to including morphology consists of adding two atomic features (the part-of-speech tag POS and the noun case information NCS) extracted from a word’s morphological analysis, together with the full analysis (INF), all of which are shown to have a positive impact on the overall performance (Table 6). In both works, the stem information extracted from the morphological analysis is also added to the feature model in order to reduce data sparsity.
Comparison with related work on UGC
Another difference from [53] is the usage of letter case information. While our case feature (CS) takes 4 possible values (lower-case(0), UPPERCASE(1), Proper Name Case(2) and miXEd CaSe(3)), it takes only 2 values (lowercase and uppercase) in [53]. Yeniterzi [53] reports the impact of this feature as 1.53 percentage points, whereas in our experiments we obtain an impact of 3.37 percentage points (Table 6). In this section, we replicate the model of [53] with our settings ((
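The four-valued CS feature can be sketched as below; the function name is ours and tokens without any letters are lumped into the mixed class for simplicity.

```python
def case_feature(token):
    """CS feature: 0 lower-case, 1 UPPERCASE, 2 Proper Name Case, 3 miXEd CaSe."""
    if token.islower():
        return 0
    if token.isupper():
        return 1
    if token[:1].isupper() and token[1:].islower():
        return 2
    return 3  # mixed case (also the fallback for letterless tokens)
```

Under the two-valued scheme of [53], the proper-name and mixed classes would collapse into the uppercase class, which is consistent with the smaller impact reported there.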
Çelikkaya et al. [2] follow the work of Şeker and Eryiğit [40] and try to adapt a similar CRF-based NER model to UGC domains. The authors test different feature models which are reduced versions14
In these reduced feature models, some of the lexical features presented in Section 5.2.2 were considered useless for the UGC domain and removed from the feature set.
As given previously, our baseline score on the reannotated version (
Table 11 also presents the comparison with the other related works on UGC domain. The first set of the table provides the results on
Eken and Tantuğ [8] also use CRFs but with a different feature model, which basically consists of the surface form and the first and last 4 characters of the words, instead of morphological features, lexical features and gazetteer lookups. Their reported accuracy on
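Such character affix features can be sketched as below: a cheap proxy for stem and suffix information in an agglutinative language, since Turkish suffixes concentrate at the word end. The function and key names are our illustration of the idea, not the authors' code.

```python
def affix_features(token, n=4):
    """First and last n characters of a token, lowercased.

    In agglutinative words the prefix roughly approximates the stem and
    the suffix the final inflections, without any morphological analyzer.
    """
    return {"prefix": token[:n].lower(), "suffix": token[-n:].lower()}
```

For example, for “evlerinden” (from their houses) the prefix captures the stem “ev(le)” while the suffix captures the ablative ending.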
The dataset
Küçük et al. [17] apply a rule-based multilingual NER system [34] to Turkish tweets. The system mostly employs language-independent rules that refer to language-specific dictionary lists to recognize ENAMEX types, and it considers only those candidate tokens whose initial letters are capitalized. The system can be adapted to a new language by providing separate word lists for that language. Küçük et al. [17] tailor it for Turkish by equipping it with the required lists for Turkish information extraction, including lists of common person, location and organization names as well as organization endings in Turkish. The work focuses only on ENAMEX types. In order to make a comparison with this work, Table 11 provides an extra set of scores on
Küçük and Steinberger [18] adapt the rule-based system of [20] to better fit the Twitter language by relaxing its capitalization constraint and by a diacritics-based expansion of its lexical resources. They employ a simplistic normalization scheme on tweets to observe its effects on the overall named entity recognition performance on Turkish tweets. Table 11 provides the comparisons on
In order to reach the ideal of a multilingual semantic web, the semantic enrichment of UGC in different languages plays a key role. However, the modelling of morphologically rich languages for the named entity recognition task still remains an open research question, and despite the very high results reported for morphologically less complex languages, NER success rates have still not reached human performance in the case of MRLs.
This article presents a CRF-based NER system which successfully models the morphologically very rich nature of Turkish and which, we believe, may serve as a model for similar languages. The article describes the lexical and morphological feature representations and the preprocessing stages used to improve the performance both on well-formed texts and on user generated content. The re-annotation of the available datasets (from the well-formed text domain) to extend the covered named entity types (ENAMEX, TIMEX and NUMEX), as well as two newly annotated datasets from Web 2.0, are introduced. The compiled gazetteers, datasets and feature templates are made available for future research from
The introduced approach reveals an exact match F1 score of 92% on a dataset collected from Turkish news articles and ∼65% on different datasets collected from Web 2.0. Although the results obtained on well-formed texts are now at acceptable levels, the field still needs new research in order to improve the results on non-canonical social media content. Especially, the detection of proper nouns that also have a common noun meaning, when written in lowercase letters, needs special focus as future work. The impact of normalization also needs further investigation. In this new UGC domain, named entity recognition and normalization become two NLP layers which are hard to orchestrate, each needing the outputs of the other to produce better results. As a result, joint systems combining these two layers deserve investigation in future research.
Acknowledgements
We would like to acknowledge that this work is part of a research project entitled “Parsing Web 2.0 Sentences” subsidized by the TUBITAK (Turkish Scientific and Technological Research Council) 1001 program (grant number 112E276) and part of the ICT COST Action IC1207. We want to thank the following people without whom it would be impossible to produce this work: Reyyan Yeniterzi and Ilyas Çiçekli for providing their datasets, Gökhan Tür for the helpful discussions, Dilek Küçük and Adnan Yazıcı for processing the test data with their NER tool and Memduh Gokirmak for helping during the annotation process. Finally, we want to thank our three reviewers for insightful comments and suggestions that helped us improve the final version of the article.
