Arabic vs. English: Comparative Statistical Study

Alotaiby, Fahad; Foda, Salah; Alkharashi, Ibrahim

doi:10.1007/s13369-013-0665-3

Research Article - Computer Engineering and Computer Science
Published: 06 September 2013

Volume 39, pages 809–820, (2014)
Arabian Journal for Science and Engineering

Fahad Alotaiby¹,
Salah Foda¹ &
Ibrahim Alkharashi²

Abstract

Important research areas such as automatic speech recognition, optical character recognition, and information retrieval depend heavily on a good statistical representation of the language being processed; a more precise representation leads to more accurate systems. Arabic, however, is a richer and more complex language than English. In particular, clitics are pervasive in Arabic: they can be attached to a stem or to each other without an orthographic mark such as an apostrophe. This motivates a large-scale study of the key statistics of Arabic and of the statistical differences between Arabic and English. To that end, two large corpora of newswire text, one Arabic and one English, of 600 million words each, are used. The distributions of word length, paragraph length, punctuation marks, unigrams, bigrams, and trigrams are presented, along with the distribution of clitics in Arabic and their statistical effect. The results show that Arabic has 76% more word types than English, and that the Arabic lexicon size could be reduced by 24.54% by applying clitic tokenization.
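The quantities named in the abstract (word types, n-gram counts, and the relative lexicon reduction after clitic tokenization) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `TOY_SEGMENTS` lookup table, and the five-token sample (written in a Buckwalter-style transliteration) are all hypothetical and chosen only to make the arithmetic visible. A real study would use a proper Arabic tokenizer rather than a hand-written table.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams (as tuples) over a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def word_types(tokens):
    """Return the set of distinct tokens (the lexicon)."""
    return set(tokens)


def split_clitics(tokens, segments):
    """Replace each token by its segments if listed in the lookup table.

    `segments` is a hypothetical hand-written table, standing in for a
    real clitic tokenizer.
    """
    out = []
    for tok in tokens:
        out.extend(segments.get(tok, [tok]))
    return out


def lexicon_reduction(before, after):
    """Relative reduction in lexicon size, in percent."""
    return 100.0 * (before - after) / before


# Toy data (assumption): proclitics marked with a trailing "+".
TOY_SEGMENTS = {
    "wktAb": ["w+", "ktAb"],
    "bktAb": ["b+", "ktAb"],
    "wqlm": ["w+", "qlm"],
}
tokens = ["ktAb", "wktAb", "bktAb", "qlm", "wqlm"]

segmented = split_clitics(tokens, TOY_SEGMENTS)
types_before = len(word_types(tokens))
types_after = len(word_types(segmented))
reduction = lexicon_reduction(types_before, types_after)
bigrams = ngram_counts(segmented, 2)

print(types_before, types_after, reduction)
```

On this toy sample, splitting the clitics merges three surface forms of one stem into a single type, which is the kind of effect behind the lexicon reduction reported in the abstract; the actual figures there come from the 600-million-word corpora, not from anything like this sample.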

Author information

Authors and Affiliations

Department of Electrical Engineering, College of Engineering, King Saud University, Riyadh, Saudi Arabia
Fahad Alotaiby & Salah Foda
Computer Research Institute, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia
Ibrahim Alkharashi

Corresponding author

Correspondence to Fahad Alotaiby.

About this article

Cite this article

Alotaiby, F., Foda, S. & Alkharashi, I. Arabic vs. English: Comparative Statistical Study. Arab J Sci Eng 39, 809–820 (2014). https://doi.org/10.1007/s13369-013-0665-3

Received: 17 December 2011
Accepted: 04 July 2013
Published: 06 September 2013
Issue Date: February 2014
DOI: https://doi.org/10.1007/s13369-013-0665-3
