N-gram language models in Python
In this article, I will walk through all the steps needed to train an n-gram language model and use it from a Python program.
Preliminaries
You need to have Python 2.7 and Boost installed (for example, follow these instructions).
Set up a virtual environment:
mkdir klm
virtualenv klm
source klm/bin/activate
Install the dependencies:
pip install nltk # required for tokenization
git clone https://github.com/vchahun/kenlm.git
cd kenlm
./bjam # compile LM estimation code
python setup.py install # install Python module
cd -
Getting some data
Let's download the Bible:
wget https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2
We will use NLTK to tokenize it.
Create a script named process.py
containing:
import sys
import nltk

for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line):
        print(' '.join(nltk.word_tokenize(sentence)).lower())
and run:
bzcat bible.en.txt.bz2 | python process.py | wc
to confirm that it works.
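To see the shape of the output this preprocessing step produces, here is a rough stand-in that uses a naive regex tokenizer instead of NLTK (the regex is an assumption for illustration only; NLTK's tokenizers handle many cases this one does not):

```python
import re

def naive_tokenize(line):
    # Crude stand-in for nltk.word_tokenize: split into runs of word
    # characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)

line = "In the beginning God created the heaven and the earth."
tokens = [t.lower() for t in naive_tokenize(line)]
print(' '.join(tokens))
# in the beginning god created the heaven and the earth .
```

Each output line is one sentence: lowercase, space-separated tokens, with punctuation split off, which is the format the training step below expects.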
Training a model
We can use KenLM to train a trigram language model with Kneser-Ney smoothing using the following commands:
bzcat bible.en.txt.bz2 |\
python process.py |\
./kenlm/bin/lmplz -o 3 > bible.arpa
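To get an intuition for what lmplz estimates, here is a toy sketch of a trigram model using plain maximum-likelihood counts. This is a deliberate simplification: KenLM actually uses modified Kneser-Ney smoothing, which redistributes probability mass to unseen n-grams, while this version assigns zero probability to anything it has not seen.

```python
import math
from collections import defaultdict

def train_trigram_mle(sentences):
    # Count trigrams and their bigram contexts over padded sentences.
    tri = defaultdict(int)
    bi = defaultdict(int)
    for sent in sentences:
        toks = ['<s>', '<s>'] + sent.split() + ['</s>']
        for i in range(2, len(toks)):
            ctx = (toks[i - 2], toks[i - 1])
            tri[ctx + (toks[i],)] += 1
            bi[ctx] += 1
    return tri, bi

def logprob(tri, bi, sentence):
    # Sum of log10 maximum-likelihood trigram probabilities.
    toks = ['<s>', '<s>'] + sentence.split() + ['</s>']
    total = 0.0
    for i in range(2, len(toks)):
        ctx = (toks[i - 2], toks[i - 1])
        total += math.log10(tri[ctx + (toks[i],)] / float(bi[ctx]))
    return total

tri, bi = train_trigram_mle(['in the beginning was the word'])
print(logprob(tri, bi, 'in the beginning was the word'))
# 0.0 -- every trigram was seen exactly once in its context
```

Smoothing is what makes the real model useful: without it, any sentence containing a single unseen trigram would score negative infinity.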
Then you can compile the model into a binary format with build_binary
to optimize loading time:
./kenlm/bin/build_binary bible.arpa bible.klm
Finally, you can load your language model and use it to score sentences:
import kenlm
model = kenlm.LanguageModel('bible.klm')
model.score('in the beginning was the word')
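model.score returns the total log10 probability of the sentence. A common derived quantity is per-word perplexity, which can be computed from that score; the sketch below shows the arithmetic (the score value used is an assumed placeholder, not the real output for bible.klm):

```python
import math

def perplexity(log10_prob, num_tokens):
    # Perplexity from a total log10 probability, the quantity
    # kenlm's model.score returns for a sentence.
    return 10.0 ** (-log10_prob / num_tokens)

# Hypothetical score for a 6-word sentence; num_tokens is 7 because
# the scorer also predicts the end-of-sentence token </s>.
score = -24.0  # assumed value for illustration
print(perplexity(score, 7))
```

Lower perplexity means the model finds the sentence less surprising; comparing scores across sentences of different lengths is only meaningful after this kind of normalization.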
Notes
To get the NLTK sentence tokenizer, run nltk.download('punkt') once in a Python interpreter (or call nltk.download() and select the punkt package in the dialog).
For a detailed introduction to n-gram language models, read Querying and Serving N-gram Language Models with Python.