In this post I will discuss A Structured Self-Attentive Sentence Embedding (published at ICLR 2017), an interesting paper that introduces a new sentence representation different from the conventional sentence embedding vector. To follow this post and the paper, readers need a basic understanding of latent-space representations of words (word embeddings), recurrent architectures like LSTMs and BiLSTMs, and non-linear functions like softmax and tanh. I highly recommend reading the original paper after this post to learn about the proposed model in more detail.

Introduction

This paper proposes a new model for extracting a sentence embedding by using self-attention. Instead of using a traditional 1-D vector, the authors propose to use a 2-D matrix to represent the sentence, with each row of the matrix attending to a different part of the sentence. They also propose a self-attention mechanism and a special regularization term...
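To make the attention step concrete, here is a minimal PyTorch sketch of the paper's formulation: A = softmax(W_s2 tanh(W_s1 H^T)) applied to the BiLSTM hidden-state matrix H, followed by M = AH. The class name, the batch shapes, and the example values of d_a and r below are my own illustrative choices, not code from the paper.

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Computes A = softmax(W_s2 tanh(W_s1 H^T)) and M = A H.

    H holds the BiLSTM hidden states (one row per token), d_a is the
    attention hidden size, and r is the number of attention rows, so
    M is an r x hidden_dim matrix embedding of the sentence.
    """
    def __init__(self, hidden_dim, d_a, r):
        super().__init__()
        self.W_s1 = nn.Linear(hidden_dim, d_a, bias=False)
        self.W_s2 = nn.Linear(d_a, r, bias=False)

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim), e.g. concatenated BiLSTM states
        scores = self.W_s2(torch.tanh(self.W_s1(H)))  # (batch, seq_len, r)
        A = torch.softmax(scores, dim=1)  # each attention row sums to 1 over tokens
        A = A.transpose(1, 2)             # (batch, r, seq_len)
        M = A @ H                         # (batch, r, hidden_dim): the 2-D embedding
        return M, A

# Illustrative usage with assumed dimensions:
H = torch.randn(4, 20, 600)  # 4 sentences, 20 tokens each, hidden size 600
attn = StructuredSelfAttention(hidden_dim=600, d_a=350, r=30)
M, A = attn(H)               # M has shape (4, 30, 600)
```

The regularization term mentioned above penalizes redundancy between the r attention rows; in the paper it is the Frobenius-norm penalty ||AA^T - I||_F^2.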
In this post we will look at a simple Python library called Morfessor, which uses unsupervised training to split a given word into its constituent morphemes. Morfessor is a family of probabilistic machine learning methods that find morphological segmentations for words of a natural language based solely on raw text data. So what are morphemes and why are they useful? Morphemes are the smallest meaningful elements of a word. For example, chair, dog, bird, table and compute are all morphemes: they express a direct meaning and cannot be separated into smaller parts. A single word can also carry several morphemes; consider the word 'unsegmented', which consists of three morphemes: 'un', 'segment' and 'ed'. Morphemes are used in a variety of linguistic tasks, as they help in understanding word structure and word formation. In Natural Language Processing, morphology is used in text preprocessing tasks (word stemming and lemmatization) and genera...
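As a quick sketch of how Morfessor can be used in practice (this assumes the morfessor 2.0 package is installed, e.g. via pip install morfessor, and training_corpus.txt is a placeholder for your own raw-text corpus):

```python
import morfessor

# Read a plain-text training corpus.
io = morfessor.MorfessorIO()
train_data = list(io.read_corpus_file('training_corpus.txt'))  # placeholder path

# Train an unsupervised Morfessor Baseline model on the raw text.
model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Segment a word into its most probable morphs.
segments, cost = model.viterbi_segment('unsegmented')
print(segments)  # with enough English data, ideally ['un', 'segment', 'ed']
```

The exact segmentation depends entirely on the training corpus; small corpora often produce coarser or noisier splits than the 'un' + 'segment' + 'ed' example above.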