The ABCs of natural language processing

Natural Language Processing (NLP) is a focus area of Artificial Intelligence (AI) that enables machines to analyze text or speech and communicate with people in their own language. Like many topics whose names end in “ing,” Natural Language Processing is simultaneously a problem, a set of solutions that work well on that problem, and the field that studies the problem and its solution methods.

Any AI-powered product that interacts with human end-users or works with data containing natural language text and speech most likely incorporates NLP in some manner: web search, conversational virtual assistants, automatic delivery of curated daily news articles, autocorrect and auto-suggest in instant messaging and email, and automatic translation of foreign languages all rely on it. This post gives a broad overview of NLP, along with a few quirks of the field.

Artificial Intelligence means different things to different people, but it is primarily concerned with getting computers to perform tasks that require human intelligence. Natural language is an ability that has proven difficult to automate, requiring complex and sophisticated reasoning processes and knowledge. Natural Language Processing is an interdisciplinary topic at the intersection of computer science and linguistics, also referred to as computational linguistics. NLP encompasses the automatic processing of natural language (as opposed to programming languages such as Python and Java) for communication.

While the early years consisted of studying small prototypes and computational models of various kinds of linguistic phenomena in academia, the focus has shifted over the past four decades to robust, practical learning and processing systems in industry, applied to large corpora such as the web. As a matter of trivia, the abbreviation “NLP” is also associated with a distinct field of study in psychotherapy and personal development called Neuro-Linguistic Programming. Hence the Twitter-verse refers to Natural Language Processing with the hashtag #NLProc.

NLP can be divided into Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU takes as input some spoken or typed text, works out what it means and how it is organized, and represents it in a machine-readable format such as a logical form or a structured knowledge base. NLG approaches the problem from the opposite end, taking the machine-readable representation as input and working out a way to express it in natural language to communicate with human end-users.

Traditionally, work in language processing has decomposed NLU into a number of stages, mirroring the distinctions theoretical linguistics draws between phonetics, syntax, semantics, and pragmatics:

– Speech Recognition: The raw speech signal in the form of a frequency spectrogram is analyzed using knowledge of the frequencies of sounds in a language, and the sequence of words spoken is obtained. If the input is in text format, then this step is skipped.

– Syntactic Analysis: The sequence of words is analyzed using knowledge of the grammar of the language, and the structure of the sentence is obtained. For example, in English the subject is followed by a verb, which is followed by the object of the sentence. In NLP, this is achieved via part-of-speech tagging (classifying words into nouns, verbs, adjectives, etc.) and parsing (organizing the words of a sentence into an ordered tree-like structure). To deal with real language data, syntactic analysis is preceded by sentence segmentation, tokenization, and lexical analysis (all illustrated in the sketch after this list):
Sentence Segmentation is the process of splitting a paragraph into distinct sentences, i.e. knowing where one sentence ends and the next begins, using cues such as end-of-sentence markers like periods.
Tokenization is the process of breaking a sentence into distinct terms such as words. This is particularly useful for English contractions such as “isn’t” and for languages like Chinese and Japanese, where spaces do not separate individual words.
Lexical Analysis is a deeper study of the structure of words (morphology), so that the machine knows that words like “swims,” “swam,” and “swimming” share the same root word or lemma, “swim.” Some NLP pipelines save processing time and space by removing the most frequently occurring words, known as stop words.

– Semantic Analysis: Using information about the sentence structure and the meanings of individual words (sourced from dictionaries and from grouping words with similar meanings together [word clusters]), the literal meaning of the sentence is obtained. Word Sense Disambiguation is a popular NLP component that decides, for instance, whether the word “bank” refers to the bank of a river or a financial bank.

– Pragmatic Analysis: The meaning obtained via semantic analysis is partial; it is completed by taking into account the wider context of the sentence or discourse relations, i.e. when and where the sentence was uttered and who was speaking to whom. Discourse relations are also used to find relationships between sentences, as well as to classify speech acts (e.g. yes-no question, content question, statement, etc.). Practically speaking, NLP systems sometimes combine semantic and pragmatic analysis into a unified process.
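To make the early stages concrete, here is a minimal sketch of sentence segmentation, tokenization, lexical analysis (lemmas and stop words), part-of-speech tagging, and parsing using the open-source spaCy library. It assumes spaCy and its small English model en_core_web_sm are installed; this is one illustrative toolchain among many, not the only way to implement these stages.

```python
import spacy

# Load spaCy's small English pipeline (assumes:
#   pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "She swims every morning. The banks of the river flooded."
doc = nlp(text)

# Sentence segmentation: split the paragraph into distinct sentences.
for sent in doc.sents:
    print("Sentence:", sent.text)

# Tokenization, lexical analysis (lemma, stop-word flag),
# part-of-speech tags, and dependency-parse labels per token.
for token in doc:
    print(f"{token.text:10} lemma={token.lemma_:10} "
          f"pos={token.pos_:6} dep={token.dep_:10} stop={token.is_stop}")
```

Note how the lemma column maps “swims” back to “swim,” exactly the lexical analysis described above.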

Similarly, NLG is split into stages in order to map a machine’s internal representation of a sentence to surface text or speech (a toy sketch follows the list):

– Text Planning: Identifying the goals of the utterance, i.e. what to say, and planning how to say it.
– Sentence Realization: Producing the actual text, i.e. realizing a surface form that is grammatically correct and satisfies the goals of text planning.
– Speech Synthesis: If speech output is required, the words in the sentence are transformed into a speech signal.
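As a toy illustration of text planning followed by sentence realization, the hypothetical sketch below realizes a small machine-readable record as an English sentence via a template. All names in it are invented, and real NLG systems are far more sophisticated; treat this purely as the shape of the idea.

```python
# Toy NLG sketch: a machine-readable representation is realized
# as surface text via a template. All names here are hypothetical.

def realize(record: dict) -> str:
    """Sentence realization: map a structured record to surface text."""
    sentence = "{subject} {verb} {object} on {day}.".format(**record)
    # Capitalize only the first character, leaving the rest intact.
    return sentence[0].upper() + sentence[1:]

# Text planning output: what to say, in machine-readable form.
plan = {"subject": "the team", "verb": "ships",
        "object": "the report", "day": "Friday"}

print(realize(plan))  # -> "The team ships the report on Friday."
```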

Today there exist numerous NLP applications and language processing tasks, implemented using one or more of the aforementioned NLU and NLG stages as AI software modules. To name a few:
– Named Entity Recognition: Identifying keywords in a sentence, such as names of people, places, organizations, and products, by looking at part-of-speech patterns and sentence syntax.
– Sentiment Analysis: Identifying the mood and attitude of an utterance as positive, neutral, or negative via semantic analysis of the sentence (see the sketch after this list).
– Topic Identification: Identifying themes and topics in a document or collection of documents, so as to group them based on the distribution of their words.
– Automatic Text Summarization: Generating a synopsis of a large collection of text, or just a snippet of a document when time is short, by understanding and identifying the important information and generating a summary.
– Machine Translation: Translating text or speech from one language to another by understanding the source language and generating the same meaning in the target language.
– Automated Question Answering: Answering a question by understanding its meaning, searching for the answer in a database using linguistic cues, and generating the appropriate answer.
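As a taste of how one of these tasks looks in practice, here is a minimal sentiment analysis sketch using NLTK’s off-the-shelf VADER analyzer, one of many possible tools. It assumes NLTK is installed and the vader_lexicon resource can be downloaded.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time resource download (assumes internet access).
nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()

for utterance in ["I love this product!", "The delivery was painfully slow."]:
    # polarity_scores returns neg/neu/pos components and a compound
    # score in [-1, 1]; its sign gives the overall sentiment.
    scores = sia.polarity_scores(utterance)
    compound = scores["compound"]
    label = "positive" if compound > 0 else "negative" if compound < 0 else "neutral"
    print(f"{utterance!r}: {label} ({compound:+.2f})")
```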

With a number of NLP systems available as free web services or off-the-shelf tools, there is a tendency to believe that an NLP solution can be easily integrated into an AI product. This is not always the case. In order to do any NLP, you first need to understand the language involved. Language differs across genres and domains: newspaper articles, technical papers, novels, blogs, and Twitter posts all have different writing styles. Hence the first step involves some degree of data exploration and analytics, to get a feel for what the text is trying to say and how a human reasoning system would process it. Then an NLP model can be developed by encoding this knowledge in the form of rules. These rules can either be hand-written by linguists observing how people perform the NLP task (rule-based) or, more popularly, generated automatically by statistically inducing knowledge from large volumes of solved examples of the task (data-driven, statistical, machine learning, deep learning; a sketch of the latter follows). Each approach has its shortcomings, and oftentimes a hybrid system of machine learning and linguistic knowledge rules is employed. For example, a machine translation system can learn to translate from Arabic to English from billions of examples of text written in both Arabic and English, along with a few linguistic rules for handling sentence segmentation and word morphology.
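To contrast the two approaches, here is a minimal sketch of the data-driven route: a scikit-learn pipeline that statistically induces a text classifier from a handful of labeled examples. The tiny dataset is invented for illustration; a real system would train on far more data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy examples of a solved NLP task (spam detection).
texts = [
    "win a free prize now", "claim your free reward",
    "meeting moved to 3pm", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# The model induces its "rules" statistically from the examples,
# rather than having a linguist write them by hand.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize inside", "see the report before the meeting"]))
```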

The same utterance can be interpreted in different ways by different people. This implies that language processing is not deterministic: the same language (input) can have more than one interpretation (output). Something that is humorous for one person might not be for another. If human beings cannot agree 100% on a language processing task, a machine based on an NLP model cannot perform the task without some degree of error. It is this non-deterministic nature of language understanding that makes it an AI-complete problem. In AI, the most difficult computational problems, those equivalent to solving the central artificial intelligence problem of making computers as intelligent as people, are known as AI-complete or AI-hard (similar to NP-complete and NP-hard problems in computational complexity theory).
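This disagreement can be quantified. As an illustration, agreement between human annotators on a labeling task is commonly measured with Cohen’s kappa, sketched here with scikit-learn and two invented annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Invented labels from two human annotators judging whether
# ten utterances are humorous (1) or not (0).
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Kappa corrects raw agreement for agreement expected by chance:
# 1.0 is perfect agreement, 0.0 is chance-level.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # below 1.0: even humans disagree
```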

In summary, Natural Language Processing is already deeply ingrained in the everyday applications we use. With petabytes of data available and rapid advances in both NLP techniques and computational power, the future is bright for Natural Language Processing and Artificial Intelligence.