In Natural Language Processing (NLP), we map words to numeric vectors so that neural networks and other machine learning algorithms can learn from them. Word embeddings map words with similar meanings to similar numeric representations. Because embeddings are the actual input to the network, they strongly influence the performance of NLP models.
Initially, words were represented as one-hot vectors. The one-hot vector of a word has length equal to the number of unique words in the corpus vocabulary; the element corresponding to the word is 1 and all others are 0 (a minimal sketch follows the list below). The significant issues with this approach are:
- Too sparse given a large corpus ⇒ computationally very expensive
- No contextual/semantic information embedded in one-hot vectors ⇒ not readily suitable for tasks like POS tagging, named-entity recognition, etc.
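As a minimal sketch (using a toy vocabulary and NumPy, which are not part of the original text), this is what one-hot vectors look like and why they carry no semantic signal:

```python
import numpy as np

# Toy vocabulary; a real corpus easily has hundreds of thousands of unique words.
vocab = ["cat", "dog", "phone", "prison", "cell"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of length |vocab| with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cell"))                   # [0. 0. 0. 0. 1.]
# Every pair of distinct words is orthogonal, so no similarity is captured.
print(one_hot("cat") @ one_hot("dog"))   # 0.0
```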
To address these issues, word embedding techniques moved to predictive models that produce a dense, low-dimensional vector for each word, one that also carries the word's meaning. Popular examples are Word2Vec, GloVe, ELMo, and BERT.
Word2Vec: The Word2Vec model trains by predicting a target word from its context (Continuous Bag of Words, CBOW) or the context words from the target (skip-gram). CBOW is faster since it treats the entire context as one training example, whereas skip-gram creates a separate training pair for each context word. However, skip-gram represents uncommon words better precisely because each (target, context) pair is trained on individually.
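As a sketch of the two training modes (assuming the gensim library, version 4.x, which the original text does not mention), CBOW and skip-gram differ only in the `sg` flag:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be millions of sentences.
sentences = [
    ["he", "is", "in", "a", "prison", "cell"],
    ["i", "got", "a", "new", "cell", "phone"],
    ["the", "phone", "rang", "in", "the", "cell"],
]

# sg=0 -> CBOW (predict target from context), sg=1 -> skip-gram (predict context from target).
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# One static vector per word, regardless of the sentence it appears in.
print(skipgram_model.wv["cell"].shape)   # (50,)
print(skipgram_model.wv.most_similar("cell", topn=2))
```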
GloVe: The GloVe model builds its embeddings from a co-occurrence matrix, where each row represents a word, each column represents a context word, and each value counts how often the word appears in that context. Dimensionality reduction is then applied to this matrix (GloVe does this by fitting word vectors whose dot products approximate the log co-occurrence counts), yielding an embedding matrix in which each row is the embedding vector of the corresponding word.
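The count-then-reduce idea can be sketched as follows (a minimal NumPy illustration; the real GloVe objective is a weighted least-squares fit of the log co-occurrence counts rather than a plain SVD):

```python
import numpy as np

sentences = [
    ["he", "is", "in", "a", "prison", "cell"],
    ["i", "got", "a", "new", "cell", "phone"],
]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[idx[word], idx[sent[j]]] += 1.0

# Reduce the (|V| x |V|) count matrix to dense d-dimensional vectors via truncated SVD.
d = 4
U, S, _ = np.linalg.svd(np.log1p(cooc), full_matrices=False)
embeddings = U[:, :d] * S[:d]          # one d-dimensional row per vocabulary word
print(embeddings[idx["cell"]])
```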
ELMo: ELMo uses a deep bidirectional LSTM (BiLSTM) architecture; its higher layers capture context-dependent aspects of word meaning, while its lower layers capture syntactic aspects.
BERT: It builds on the bidirectional idea of ELMo but uses the state-of-the-art Transformer architecture (an attention-based model with positional encodings to represent word positions) to compute embeddings.
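For intuition about the parenthetical above, here is the sinusoidal positional encoding from the original Transformer paper as a minimal NumPy sketch (note that BERT itself learns its position embeddings rather than using this fixed formula):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sin/cos position encodings as introduced with the Transformer."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16): one position vector added to each token embedding
```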
Major Comparisons
- Word2Vec and GloVe embeddings are context independent, while ELMo and BERT embeddings are context dependent. For example, the word ‘cell’ in the two sentences “He is in a prison cell” and “I got a new cell phone” yields the same single vector in Word2Vec and GloVe. In ELMo and BERT, however, two different vectors are generated because the contexts differ (see the sketch after this list).
- Word2Vec, GloVe, and ELMo are trained as word-based models, i.e. they take words as input and output word embeddings that are either context sensitive (ELMo) or context independent (GloVe, Word2Vec). BERT represents its input as subwords and learns embeddings for subwords, striking a balance between character-based and word-based representations. As a result, BERT handles out-of-vocabulary (OOV) words far better than the other three; the sketch after this list also shows a subword split.
- Though ELMo and BERT have delivered state-of-the-art results on NLP tasks, they can be computationally prohibitive in production, since they demand far more resources than Word2Vec or GloVe.
- For embeddings like Word2Vec and GloVe, only the vectors produced by training are needed at inference time, not the model itself. ELMo-like and BERT-like embeddings, however, also require the trained model: at inference time the model is run to generate a vector for each word that is specific to its context.
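A concrete sketch of the ‘cell’ example and the subword behaviour described above, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is named in the original text):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def cell_vector(sentence):
    """Return the contextual BERT vector of the token 'cell' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("cell")]

v1 = cell_vector("He is in a prison cell")
v2 = cell_vector("I got a new cell phone")
# Different contexts -> different vectors (cosine similarity well below 1.0).
print(torch.cosine_similarity(v1, v2, dim=0).item())

# Subword handling: a word missing from the vocabulary is split into known word pieces.
print(tokenizer.tokenize("embeddings"))   # e.g. ['em', '##bed', '##ding', '##s']
```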
While attractive, every newly published embedding technique requires understanding its architecture, and implementing it quickly may not be feasible. Moreover, the labeled data for a specific use case may be too small to learn word representations from its vocabulary; this can be addressed with pre-trained models and transfer learning. For example, TensorFlow Hub is an online repository of reusable machine learning models. Pre-trained ELMo and BERT models can be obtained from TensorFlow Hub and fine-tuned effectively for the specific use case at hand.
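A minimal sketch of that workflow, assuming TensorFlow 2 with the tensorflow_hub package and a BERT preprocessor/encoder pair from TensorFlow Hub (the module URLs below are illustrative assumptions and may have changed), fine-tuned for a binary text-classification use case:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Illustrative TF Hub handles for a matching BERT preprocessor and encoder.
preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True,   # allow the pre-trained weights to be fine-tuned
)

# Minimal fine-tuning head: raw strings in, a single sigmoid score out.
text_in = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
encoder_out = encoder(preprocess(text_in))
pooled = encoder_out["pooled_output"]              # sentence-level embedding
output = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)

model = tf.keras.Model(text_in, output)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5), loss="binary_crossentropy")
# model.fit(train_texts, train_labels, epochs=3)   # fine-tune on the use-case's labeled data
```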
These word embeddings have made NLP solutions more adaptable and enabled widespread use in real life. For example, some significant use cases of Word2Vec are:
- Music recommendations at Spotify and Anghami
- Listing recommendations at Airbnb
- Product recommendations in Yahoo Mail
- Matching advertisements to search queries on Yahoo Search
Word representations have come a long way, and future work is expected to capture even more diverse contexts, or to produce a single embedding model that can generate word vectors for multiple languages.