Language is an essential manifestation of human cognition and a complex, evolved construct for communication. The challenges in making a computer “understand and respond” to such a uniquely human construct are innumerable. Moreover, given the extreme diversity of languages and their specific phonetic, grammatical, and cultural features, the “portability” of techniques developed for English to other languages is limited. Natural language processing is therefore needed for many different languages across the globe, not just for English.
The Arabic language, one of the six official languages of the United Nations and spoken by more than 420 million people, has three variants: Classical Arabic, Modern Standard Arabic (MSA), and Dialectal Arabic.
Many people around the world learn Classical Arabic for personal or professional reasons, mainly to read religious books and Arabic literature. Over time, Classical Arabic has undergone many changes, for example the addition of dots to distinguish between similar letters and of diacritics (Tashkeel or Harakat) to dispel ambiguity for readers. Modern Standard Arabic (MSA) is used in newspapers, books, and official documents. Apart from these two, different dialects, sometimes several within the same country, such as the Gulf, Syro-Palestinian, Egyptian, and Maghrebi dialects, are spoken and used on social media.
Diacritics are ubiquitous in Classical Arabic and are very prevalent in poetry, legal texts, and educational books. They define the proper pronunciation of a word. The main challenge in modern Arabic is that diacritics are usually omitted for brevity. Thus, an undiacritized word can have several meanings, depending on the diacritics it could assume. For example, the word “كتب ktb” could mean “books” (“kutub كُتُب”, a noun), “he wrote” (“kataba كَتَبَ”, active past tense), or “it has been written” (“kutiba كُتِبَ”, passive past tense).
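The ambiguity described above can be made concrete with a minimal Python sketch. It assumes that the diacritics in question are the eight Unicode marks U+064B to U+0652 (an assumption of this sketch, not something stated in the article), and the helper name strip_diacritics is hypothetical. The three readings of كتب are spelled with escape sequences so the exact code points are visible.

```python
# Assumption: "diacritics" means the Unicode marks U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


# The three diacritized readings from the example above.
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books"
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it has been written"

# Dropping the marks collapses all three readings into one written form.
bare_forms = {strip_diacritics(word) for word in (kutub, kataba, kutiba)}
print(bare_forms)
```

All three words reduce to the same three-letter string, which is why a reader, or a diacritization system, has to rely on the surrounding context to recover the intended reading.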
Imagine each letter carrying any of eight possible diacritics, each representing a different pronunciation, and the complexity of the situation unfolds! Building diacritization systems is thus imperative for many Arabic NLP applications, such as Arabic text-to-speech (TTS). Another compelling use case is helping non-native speakers and beginners read and pronounce Arabic texts correctly. The key to all of these, however, is context-based processing.
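To get a rough feel for that complexity, the sketch below enumerates every way of attaching exactly one of eight marks to each letter of a three-letter word. It is a deliberately naive upper bound under stated assumptions: the eight marks are taken to be U+064B to U+0652, every letter is forced to carry exactly one mark, and no orthographic rules are applied. Real Arabic text also allows unmarked letters and combined marks, and restricts where some marks may occur. The function name candidate_diacritizations is hypothetical.

```python
from itertools import product

# Assumption: the eight marks are U+064B..U+0652.
DIACRITICS = [chr(cp) for cp in range(0x064B, 0x0653)]


def candidate_diacritizations(word: str):
    """Yield every string that attaches exactly one mark to each letter."""
    for marks in product(DIACRITICS, repeat=len(word)):
        yield "".join(letter + mark for letter, mark in zip(word, marks))


bare = "\u0643\u062A\u0628"  # the undiacritized three-letter word ktb
candidates = list(candidate_diacritizations(bare))

# Eight choices for each of three letters gives 8 ** 3 candidates.
print(len(candidates))
```

Even under these simplified assumptions the candidate space grows exponentially with word length, so brute-force enumeration is not practical and a diacritization system needs context to select the right candidate.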
With a new generation of deep learning approaches, the NLP domain has seen advanced techniques for representing the semantic meaning of text, including word2vec, InferSent, Universal Sentence Encoder (USE), and ELMo. Some attempts have been made to train word vectors to improve performance on Arabic NLP tasks, but “sentence-level” representation is still hugely underexplored. Other core Arabic NLP modules, such as named entity recognition, part-of-speech tagging, syntactic parsing, and coreference resolution, can benefit significantly from these techniques.
Arabic NLP applications have huge market potential in the Gulf Cooperation Council (GCC) countries. Arabic chatbots could enable interaction between users and organizations in the healthcare, public services, and education domains. Arabic machine translation could greatly facilitate communication between Arabic and non-Arabic speakers. At saal.ai we employ cutting-edge NLP technologies and develop context-aware algorithms. Why not engage with us and find out more?