N-gram language models in Python

In this article, I will go through all the steps necessary to create a language model that you can use in a Python program.


You need to have Python 2.7 and the Boost libraries installed (Boost is required to compile KenLM).

Set up a virtual environment:

mkdir klm
virtualenv klm
source klm/bin/activate

Install the dependencies:

pip install nltk  # required for tokenization
git clone https://github.com/kpu/kenlm
cd kenlm
./bjam  # compile the LM estimation code
python setup.py install  # install the Python module
cd -

Getting some data

Let's download the Bible:


We will use NLTK to tokenize it.

Create a script (here called tokenizer.py) containing:

import sys
import nltk

# Read raw text from stdin; print one tokenized, lowercased sentence per line.
for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line):
        print(' '.join(nltk.word_tokenize(sentence)).lower())

and run bzcat bible.en.txt.bz2 | python tokenizer.py | wc to confirm that it works.
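If you just want to see the expected output format before installing the punkt model, a rough approximation with a regular expression looks like this (a sketch only; NLTK's punkt tokenizer handles abbreviations and sentence boundaries far better):

```python
import re

def rough_tokenize(text):
    # Crude stand-in for nltk.word_tokenize: split punctuation off as
    # separate tokens and lowercase everything.
    return [tok.lower() for tok in re.findall(r"\w+|[^\w\s]", text)]

print(' '.join(rough_tokenize("In the beginning, God created the heaven and the earth.")))
# -> in the beginning , god created the heaven and the earth .
```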

Training a model

We can use KenLM to train a trigram language model with Kneser-Ney smoothing using the following commands:

bzcat bible.en.txt.bz2 |\
python tokenizer.py |\
./kenlm/bin/lmplz -o 3 > bible.arpa
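To make what lmplz estimates concrete, here is a toy sketch of maximum-likelihood trigram estimation in plain Python. This is a simplification: real Kneser-Ney smoothing redistributes probability mass to unseen n-grams, which is what lmplz adds on top of counts like these.

```python
import math
from collections import Counter

def trigram_logprobs(sentences):
    # Count trigrams and their bigram contexts over padded sentences,
    # then compute log10 maximum-likelihood probabilities.
    trigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ['<s>', '<s>'] + words + ['</s>']
        for i in range(len(padded) - 2):
            trigrams[tuple(padded[i:i+3])] += 1
            bigrams[tuple(padded[i:i+2])] += 1
    return {t: math.log10(float(c) / bigrams[t[:2]]) for t, c in trigrams.items()}

probs = trigram_logprobs([['in', 'the', 'beginning'], ['in', 'the', 'end']])
# P(beginning | in the) = 1/2, so its log10 probability is about -0.301
print(probs[('in', 'the', 'beginning')])
```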

Then you can compile the model into a binary format with build_binary to optimize loading time:

./kenlm/bin/build_binary bible.arpa bible.klm

Finally, you can load your language model and use it to score sentences:

import kenlm
model = kenlm.LanguageModel('bible.klm')
model.score('in the beginning was the word')  # total log10 probability
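The score is a total log10 probability (by default including the begin- and end-of-sentence markers). A common derived metric is perplexity; given a score and a token count, it can be computed like this (a sketch; recent versions of the kenlm module also expose a perplexity() method directly):

```python
import math

def perplexity(log10_prob, num_tokens):
    # Perplexity is the inverse probability normalized by the number of
    # scored tokens (the words plus the end-of-sentence marker).
    return 10.0 ** (-log10_prob / num_tokens)

# e.g. a hypothetical 6-word sentence scored at -20.5 (7 scored tokens with </s>):
print(perplexity(-20.5, 7))
```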


To get the NLTK sentence tokenizer, execute nltk.download() in a Python interpreter and select the punkt package.

For a detailed introduction to n-gram language models, read Querying and Serving N-gram Language Models with Python.