Towards building an Urdu Language Corpus using Common Crawl

Abstract

Urdu, the most popular language in Pakistan, is spoken by millions of people across the globe. While English is considered the dominant language of web content, the characteristics of Urdu web content are still unknown. In this paper, we study the World Wide Web (WWW) by focusing on content written in the Perso-Arabic script. Leveraging the Common Crawl Corpus, the largest publicly available collection of web content, comprising 2.87 billion documents for December 2016, we examine different aspects of Urdu web content. We use the Compact Language Detector (CLD2) for language detection. We find that Urdu web content accounts for 0.04% of the global WWW with respect to document frequency. Among the top-level domains hosting Urdu content, .com, .org, and .info together account for 70.9%, and urdulughat is the most dominant second-level domain. 40% of the domains are hosted in the United States, while only 0.33% are hosted within Pakistan. Moreover, 25.68% of web pages have Urdu as the primary language, and only 11.78% of web pages are exclusively in Urdu. Our Urdu corpus consists of 1.25 billion total tokens and 18.14 million unique tokens. Furthermore, the corpus follows Zipf's law. This Urdu corpus can be used for text summarization, text classification, and cross-lingual information retrieval.

Keywords

Urdu web corpus, Perso-Arabic script, web content analysis, Common Crawl corpus
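
The corpus statistics mentioned in the abstract (total tokens, unique tokens, and agreement with Zipf's law) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the whitespace tokenization, the function names `token_stats` and `zipf_slope`, and the toy sample text are all assumptions made for illustration. Under Zipf's law, the frequency of a token is roughly inversely proportional to its rank, so a least-squares fit of log-frequency against log-rank should give a slope close to -1.

```python
import math
from collections import Counter


def token_stats(text):
    """Return (total tokens, unique tokens, counts).

    Assumption: tokens are separated by whitespace. A real Urdu
    tokenizer would need to handle punctuation and word boundaries.
    """
    tokens = text.split()
    counts = Counter(tokens)
    return len(tokens), len(counts), counts


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy sample (hypothetical): frequencies 12, 6, 4, 3 are proportional to 1/rank.
sample = " ".join(["یہ"] * 12 + ["ہے"] * 6 + ["کتاب"] * 4 + ["اچھی"] * 3)

total, unique, counts = token_stats(sample)
slope = zipf_slope(counts)
print(total, unique, round(slope, 3))
```

On a real corpus the fitted slope will only approximate -1, and the fit is usually inspected on a log-log plot rather than summarized by a single number.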

