Morphological segmentation of words

In this post we will look at a simple Python library called Morfessor, which uses unsupervised training to split a given word into its constituent morphemes. Morfessor is a family of probabilistic machine learning methods that find morphological segmentations for the words of a natural language, based solely on raw text data.

So what are morphemes and why are they useful?

Morphemes are the smallest meaningful units of a word. For example, chair, dog, bird, table and compute are all morphemes. They express a direct meaning and cannot be separated into smaller parts. A single word can also carry several morphemes. Consider the word 'unsegmented': it consists of three morphemes - 'un', 'segment' and 'ed'.

Morphemes are used in a variety of linguistic tasks. They help in understanding word structure and word formation. In Natural Language Processing, morphology is used in text preprocessing tasks such as word stemming and lemmatization, and in generating vector-space representations of words.
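
As a quick illustration of how morphology shows up in preprocessing, here is a minimal sketch using NLTK's stemmer and lemmatizer (assuming NLTK is installed as below and the 'wordnet' data has been downloaded; the example words are arbitrary):

import nltk
nltk.download("wordnet")

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming heuristically strips affixes; lemmatization maps to a dictionary form
print(stemmer.stem("unsegmented"))     # e.g. 'unsegment'
print(lemmatizer.lemmatize("tables"))  # 'table'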

Why use Morfessor?

  1. Morfessor is cool - it is trained unsupervised, yet in many cases it performs competitively with rule-based natural language models and supervised machine learning models.
  2. Trains very fast - the entire pipeline takes about 15 minutes to run, and once the model is built, it segments a given input word in less than a second.
  3. Less code - just about 30 lines of straightforward code to build and run the model.

Installing python dependencies

pip install nltk
pip install morfessor

Generating training data

import nltk
from nltk.corpus import words

# download the nltk word corpus if it is not already available
nltk.download("words")

# using the nltk word corpus as training data, one word per line
word_list = words.words()
with open("words", "w") as outfile:
    for word in word_list:
        outfile.write(word + "\n")

Building the model

import math
import morfessor

# count modifier: dampens the raw frequency of each compound with a log function
def log_func(x):
    return int(round(math.log(x + 1, 2)))

infile = "words"
io = morfessor.MorfessorIO()

# read the training corpus (one word per line)
train_data = list(io.read_corpus_file(infile))

model = morfessor.BaselineModel()
model.load_data(train_data, count_modifier=log_func)
model.train_batch()

# save the trained model for later use
io.write_binary_model_file("model.bin", model)

The intuition behind using a log function as the count modifier is to dampen the frequency of occurrence of each compound. The effect of frequency on the model is essentially sublinear, so the idea is to use a log function to approximate it.
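
To get a feel for the dampening, here is a quick check of log_func on a few raw counts (the values follow directly from the function defined above):

# how log_func dampens raw word counts
for count in [1, 10, 100, 1000, 10000]:
    print(count, "->", log_func(count))

# 1 -> 1
# 10 -> 3      (log2(11)  ~= 3.46, rounded to 3)
# 100 -> 7     (log2(101) ~= 6.66, rounded to 7)
# 1000 -> 10
# 10000 -> 13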

Testing the model

import morfessor

model_file = "model.bin"
io = morfessor.MorfessorIO()
model = io.read_binary_model_file(model_file)

word = input("Input word > ")
# for segmenting new (unseen) words we use the viterbi_segment(compound) method;
# it returns a (segmentation, score) tuple, so we take the first element
print(model.viterbi_segment(word)[0])
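
To segment several words at once, a small loop works just as well. This is a minimal sketch, assuming model.bin was written by the training script above; the test words are the same inputs used in the results section below:

# segment a batch of test words with the trained model
test_words = ["unsegmented", "imbalance", "handkerchief", "questionwise", "myocardiogram"]
for w in test_words:
    segments, score = model.viterbi_segment(w)
    print(w, "->", segments)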

Conclusion

That's it! That's all we need to split words into their constituent morphemes.

Results -

  • unsegmented -> ['un', 'segment', 'ed']
  • imbalance -> ['im', 'balance']
  • handkerchief -> ['hand', 'kerchief']
  • questionwise -> ['question', 'wise']
  • myocardiogram -> ['myo', 'cardio', 'gram']

As you can see, the model works pretty well even for complex and long words.

I hope this post was helpful. Thank you for reading it!
