Abstract
Important research areas such as automatic speech recognition, optical character recognition, and information retrieval depend heavily on a good statistical representation of the language being processed; a more precise representation leads to more accurate systems. Arabic is a richer and more complex language than English. In particular, clitics are pervasive in Arabic: they can be attached to a stem or to each other without orthographic marks such as an apostrophe. This raises the need to study key statistics of the Arabic language, and the statistical differences between Arabic and English, on a large scale. To this end, two large corpora of Arabic and English newswire text, each consisting of 600 million words, are used. The distributions of word length, paragraph length, punctuation marks, unigrams, bigrams and trigrams are presented, along with the distribution of clitics in Arabic and their statistical effect. The results show that the number of Arabic word-types is 76 % greater than the number of English word-types, and that the Arabic lexicon size could be reduced by 24.54 % by applying clitic tokenization.
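The kinds of statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy English sentence, the Buckwalter-style transliterated tokens, and the naive rule that splits a leading "w" only when the remainder is itself a known word are all assumptions made for illustration, and the numbers it produces are unrelated to the reported results.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def word_length_distribution(tokens):
    """Map each word length (in characters) to the number of tokens with that length."""
    return Counter(len(t) for t in tokens)


def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


def split_proclitic(token, vocabulary, proclitic="w"):
    """Hypothetical toy rule: detach a leading proclitic only if the rest is a known word."""
    rest = token[len(proclitic):]
    if token.startswith(proclitic) and rest in vocabulary:
        return [proclitic + "+", rest]
    return [token]


# Toy English text (made up for the example).
tokens = "the cat sat on the mat and the cat slept".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
word_types = len(unigrams)  # number of distinct words (lexicon size)
lengths = word_length_distribution(tokens)

# Toy transliterated Arabic tokens (made up): "w" stands for the conjunction clitic.
arabic_tokens = ["wktAb", "ktAb", "wqlm", "qlm", "wld"]
vocabulary = set(arabic_tokens)
tokenized = [part for t in arabic_tokens for part in split_proclitic(t, vocabulary)]

types_before = len(set(arabic_tokens))
types_after = len(set(tokenized))
reduction = relative_reduction(types_before, types_after)

print(word_types, bigrams[("the", "cat")], lengths[3])
print(types_before, types_after, reduction)
```

In this toy setting, detaching the clitic merges "wktAb" with "ktAb" and "wqlm" with "qlm", which is the mechanism by which clitic tokenization shrinks a lexicon; a real system would rely on a morphological analyzer rather than a prefix check.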
Cite this article
Alotaiby, F., Foda, S. & Alkharashi, I. Arabic vs. English: Comparative Statistical Study. Arab J Sci Eng 39, 809–820 (2014). https://doi.org/10.1007/s13369-013-0665-3
DOI: https://doi.org/10.1007/s13369-013-0665-3