April 24, 2019
In the field of Natural Language Processing (NLP), we map words to numeric vectors so that neural networks and other machine learning algorithms can learn from them. Word embeddings transform words with similar meanings into similar numeric representations. Because they are the actual input to the neural network, word embeddings greatly influence the performance of NLP models.
Initially, words were represented as one-hot vectors. A one-hot vector for a word has length equal to the vocabulary size of the corpus; the element corresponding to the word is 1 and all others are 0. The significant issues of this approach are that the vectors are extremely high-dimensional and sparse, and that every pair of distinct words is equally distant, so the representation captures no similarity in meaning.
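The one-hot scheme can be sketched in a few lines of Python over a hypothetical toy vocabulary; note how the vector length grows with the vocabulary and contains a single 1:

```python
# Minimal sketch of one-hot encoding over a toy, hypothetical vocabulary.
vocab = ["cat", "dog", "fish"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary length with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0]
```

With a realistic vocabulary of tens of thousands of words, each vector would have tens of thousands of entries, all but one of them zero.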
To address these issues, predictive word embedding techniques were developed, which produce a dense, low-dimensional vector for each word that also carries the word's meaning. Popular examples include Word2Vec, GloVe, ELMo, and BERT.
Word2Vec: The Word2Vec model trains by predicting either a target word from its context (Continuous Bag of Words, or CBOW) or the context words from the target (skip-gram). CBOW is faster since it treats the entire context as one training example, whereas skip-gram creates a separate training pair for each context word. However, skip-gram represents uncommon words better because of how it treats the context.
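The difference in how the two variants treat the context can be illustrated by generating training examples from a single sentence. This is a simplified sketch (window size and sentence are arbitrary choices), not the actual Word2Vec implementation:

```python
# Sketch: training examples from one sentence with a context window of 2,
# contrasting how CBOW and skip-gram slice up the same context.
def cbow_examples(tokens, window=2):
    # CBOW: (all context words -> target word), one example per position.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((tuple(context), target))
    return examples

def skipgram_examples(tokens, window=2):
    # Skip-gram: (target -> one context word), one pair per context word.
    examples = []
    for i, target in enumerate(tokens):
        for c in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            examples.append((target, c))
    return examples

sentence = "the quick brown fox".split()
print(len(cbow_examples(sentence)))      # 4 examples
print(len(skipgram_examples(sentence)))  # 10 examples
```

Because skip-gram emits one pair per context word, it produces more training examples per sentence, which is one reason it is slower but gives rare words more training signal.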
GloVe: The GloVe model builds its embeddings from a co-occurrence matrix, where each row represents a word, each column represents a context the word can appear in, and each value is the frequency with which the word occurs in that context. Dimensionality reduction is then applied to this matrix, yielding an embedding matrix in which each row is the embedding vector for the corresponding word.
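The pipeline above can be sketched with numpy on a toy corpus. Note that GloVe itself fits a weighted least-squares objective over the co-occurrence statistics; plain SVD is used here only to illustrate the "count, then reduce" idea:

```python
import numpy as np

# Sketch: build a word-by-word co-occurrence matrix (window = 1) from a
# toy corpus, then reduce each row to a low-dimensional embedding vector.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    toks = line.split()
    for i, w in enumerate(toks):
        for c in toks[max(0, i - 1):i] + toks[i + 1:i + 2]:
            cooc[idx[w], idx[c]] += 1  # count each neighbor occurrence

# Dimensionality reduction via SVD: keep the top 2 components per word.
U, S, Vt = np.linalg.svd(cooc)
embeddings = U[:, :2] * S[:2]
print(embeddings.shape)  # (vocabulary size, 2)
```

Each row of `embeddings` is now a dense 2-dimensional vector for the corresponding vocabulary word, in place of a sparse count row.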
ELMo: ELMo uses a deep BiLSTM architecture, which lets its higher layers learn context-dependent aspects of word meaning while its lower layers capture aspects of syntax.
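One detail worth sketching is how ELMo turns those stacked layers into a single word vector: the final representation is a learned, softmax-weighted sum of all layer outputs, scaled by a scalar. The layer values and weights below are random stand-ins, not a trained model:

```python
import numpy as np

# Sketch of ELMo's layer combination: a softmax-weighted sum of each
# layer's hidden state for a word, scaled by a learned scalar gamma.
rng = np.random.default_rng(0)
num_layers, dim = 3, 4                       # e.g. embedding layer + 2 BiLSTM layers
layers = rng.normal(size=(num_layers, dim))  # stand-in hidden states for one word

raw_weights = np.array([0.1, 0.5, 1.0])      # per-layer scalars (learned in practice)
s = np.exp(raw_weights) / np.exp(raw_weights).sum()  # softmax-normalize to sum to 1
gamma = 1.0                                  # learned scaling factor

elmo_vector = gamma * (s[:, None] * layers).sum(axis=0)
print(elmo_vector.shape)  # (4,)
```

A downstream task can learn its own weights `s`, emphasizing lower (syntactic) or higher (semantic) layers as suits the task.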
BERT: BERT builds on the bidirectional idea of ELMo but uses the relatively new, state-of-the-art Transformer architecture (an attention-based model with positional encodings to represent word positions) to compute embeddings.
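The positional encodings mentioned above can be sketched directly. The original Transformer paper uses fixed sinusoids: even dimensions get a sine, odd dimensions a cosine, with wavelengths forming a geometric progression so each position receives a unique pattern:

```python
import numpy as np

# Sketch of sinusoidal positional encodings from the Transformer:
# PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...).
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # positions as a column
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)            # even dims: sine
    pe[:, 1::2] = np.cos(angle)            # odd dims: cosine
    return pe

pe = positional_encoding(max_len=10, d_model=8)
print(pe.shape)  # (10, 8)
```

These vectors are added to the input word embeddings so that the otherwise order-agnostic attention layers can distinguish word positions. (BERT itself uses learned position embeddings rather than fixed sinusoids, but the purpose is the same.)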
While attractive, every newly published embedding technique requires understanding its architecture, and a quick implementation may not be feasible. The labeled data for a specific use case may also be too small to learn word representations from its vocabulary, which can be addressed with pre-trained models and transfer learning. For example, TensorFlow Hub is an online repository of reusable machine learning models; pre-trained ELMo and BERT models can be obtained from TensorFlow Hub and fine-tuned effectively for the specific use case at hand.
These word embeddings have made NLP solutions more adaptable and enabled widespread real-life usage. For example, some significant use-cases of Word2Vec are semantic search, recommendation systems, and measuring similarity between words or documents.
Word representations have come a long way, and they are expected to keep improving, for instance by capturing more diverse contexts or by yielding a single embedding model that can generate word vectors for multiple languages.