Staff Writer January 2, 2019
Natural Language Processing (NLP), Machine Learning (ML), and Deep Learning (DL) fall under the broad umbrella of Artificial Intelligence (AI). A major challenge in NLP is processing and transforming noisy, unstructured textual data into a structured format that an ML algorithm can understand.
Feature engineering in ML is the process of generating or deriving features from raw data or a corpus, and it is considered one of the most crucial steps in building Machine Learning models. It draws on principles of statistics, mathematics, and optimization, but here we discuss feature engineering strategies from a linguistic point of view. Raw natural language cannot be interpreted directly by a computer: algorithms cannot accept unprocessed text and produce the output for an NLP application. Hence, features derived from the linguistic aspects of natural language play an important role when developing NLP applications with ML.
Features are representatives of the corpus that ML algorithms can understand. A feature is an attribute or property shared by the units on which analysis or prediction is to be performed. The quality and number of features greatly influence the quality of the model.
It is essential to generate many candidate features and then select a relevant subset for use in model construction. This simplifies the model, shortens training time, mitigates the curse of dimensionality, and reduces overfitting.
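As a minimal sketch of this idea, the following pure-Python example (using a hypothetical toy corpus) builds bag-of-words count features and performs a crude feature selection by discarding tokens that appear in too few documents:

```python
from collections import Counter

def bag_of_words_features(docs, min_df=2):
    """Build a bag-of-words vocabulary, keeping only tokens that appear
    in at least `min_df` documents -- a simple feature-selection step
    that drops rare, noisy features and shrinks dimensionality."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_df)
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok in doc.lower().split():
            if tok in index:
                vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors

docs = ["the movie was great", "the movie was boring", "a great plot"]
vocab, X = bag_of_words_features(docs, min_df=2)
print(vocab)  # ['great', 'movie', 'the', 'was'] -- rare tokens dropped
```

In practice, libraries such as scikit-learn provide richer vectorizers and selection criteria, but the principle is the same: fewer, more relevant features yield simpler and faster models.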
NLP features include:
(i) Part-of-Speech (POS) tagging: the process of tagging each word in the corpus with its part of speech. The POS of a word, also called grammatical tagging or word-category disambiguation, also depends on its neighboring words. POS tag sequences help the machine understand various sentence structures and are very useful when building a chatbot with ML algorithms or performing sentiment analysis.
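To make this concrete, here is a deliberately simple lookup-based tagger trained on a tiny hypothetical hand-tagged corpus (Penn-Treebank-style tags). Real taggers use neighboring-word context and learned models rather than a per-word lookup with a naive noun backoff:

```python
from collections import Counter, defaultdict

# Tiny hand-labeled corpus (hypothetical examples for illustration).
tagged_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

def train_unigram_tagger(corpus):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, model, default="NN"):
    """Tag known words from the lookup table; back off to `default`
    for unseen words (real taggers use context instead)."""
    return [(w, model.get(w, default)) for w in sentence]

model = train_unigram_tagger(tagged_corpus)
print(tag(["the", "cat", "barks"], model))
# [('the', 'DT'), ('cat', 'NN'), ('barks', 'VBZ')]
```

The resulting tag sequence (e.g. DT NN VBZ) is exactly the kind of structural feature an ML model can consume.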
(ii) Parsing: parsing, also called syntax analysis or syntactic analysis, is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, according to the rules of a formal grammar. It helps generate features such as noun phrases, the POS tags within a noun phrase, head words, and dependency relations.
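A full parser is beyond a short example, but shallow chunking gives a flavor of the noun-phrase features mentioned above. The sketch below (an assumption of how one might do this by hand; libraries like NLTK or spaCy provide real chunkers and parsers) groups optional determiners and adjectives followed by nouns into flat noun-phrase chunks:

```python
def noun_phrase_chunks(tagged):
    """Group runs of (determiner/adjective)* noun+ over a POS-tagged
    sentence into flat noun-phrase chunks -- a shallow stand-in for
    full syntactic parsing."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DT", "JJ") and not any(t.startswith("NN") for _, t in current):
            current.append((word, tag))      # NP not yet closed by a noun
        elif tag.startswith("NN"):
            current.append((word, tag))      # nouns extend the chunk
        else:
            if any(t.startswith("NN") for _, t in current):
                chunks.append([w for w, _ in current])  # emit finished NP
            current = [(word, tag)] if tag in ("DT", "JJ") else []
    if any(t.startswith("NN") for _, t in current):
        chunks.append([w for w, _ in current])
    return chunks

sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
            ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"),
            ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(noun_phrase_chunks(sentence))
# [['the', 'quick', 'brown', 'fox'], ['the', 'lazy', 'dog']]
```

Each extracted chunk, its internal tag pattern, and its head noun can all serve as features.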
(iii) Named-entity recognition (NER): the process of identifying named-entity mentions in text and classifying them into predefined categories such as person names, organizations, locations, time expressions, and monetary values. NER tags help NLP systems understand the role of a noun phrase in a sentence; this is particularly crucial when building a question-answering system, where entities must be extracted from the sentence based on its syntax and discourse.
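As a sketch, a gazetteer (dictionary) lookup illustrates the tagging scheme; the entity lists here are hypothetical, and production NER systems learn entity boundaries and types from annotated data rather than fixed word lists:

```python
# Hypothetical gazetteers; real NER models learn these from labeled corpora.
GAZETTEERS = {
    "PERSON": {"alice", "bob"},
    "ORG": {"acme", "google"},
    "LOC": {"paris", "london"},
}

def tag_entities(tokens):
    """Label each token with an entity type from the gazetteers,
    or 'O' (outside) if it matches none."""
    labels = []
    for tok in tokens:
        label = "O"
        for ent_type, names in GAZETTEERS.items():
            if tok.lower() in names:
                label = ent_type
                break
        labels.append((tok, label))
    return labels

print(tag_entities(["Alice", "works", "at", "Google", "in", "Paris"]))
# [('Alice', 'PERSON'), ('works', 'O'), ('at', 'O'),
#  ('Google', 'ORG'), ('in', 'O'), ('Paris', 'LOC')]
```

For a question-answering system, these labels tell the model which tokens are candidate answers for "who", "where", or "which company" questions.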
(iv) WordNet: WordNet is a semantic lexicon for the English language, conceived as a data-processing resource that covers lexical-semantic relations such as synonymy and antonymy. It is used in many knowledge-based applications as a way to capture the relations between words.
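The sketch below mimics the kind of lookup WordNet supports, using a tiny hand-built lexicon with hypothetical entries (the real WordNet, accessible for example via NLTK's `nltk.corpus.wordnet` interface, organizes words into synsets and many more relation types):

```python
# Miniature WordNet-style lexicon (hypothetical entries) mapping words
# to synonym and antonym sets.
LEXICON = {
    "happy": {"synonyms": {"glad", "joyful"}, "antonyms": {"sad"}},
    "fast":  {"synonyms": {"quick", "rapid"}, "antonyms": {"slow"}},
}

def related(word, relation):
    """Look up synonyms or antonyms of a word, returning an empty set
    for out-of-vocabulary words."""
    return LEXICON.get(word, {}).get(relation, set())

print(sorted(related("happy", "synonyms")))  # ['glad', 'joyful']
print(sorted(related("fast", "antonyms")))   # ['slow']
```

Such lookups let a model treat "happy" and "joyful" as carrying the same feature, which is valuable for sentiment analysis and other semantic tasks.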
Thus, linguistic aspects used as part of feature engineering help interpret the data better than the black-box approach adopted by some ML algorithms.