N-gram language models in Python
In this article, I will walk through all the steps needed to train an n-gram language model and use it from a Python program.
Preliminaries
You need to have Python 2.7 and Boost installed (for example, follow these instructions).
Set up a virtual environment:
mkdir klm
virtualenv klm
source klm/bin/activate
Install the dependencies:
pip install nltk # required for tokenization
git clone https://github.com/vchahun/kenlm.git
cd kenlm
./bjam # compile LM estimation code
python setup.py install # install Python module
cd -
Getting some data
Let's download the Bible:
wget https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2
We will use NLTK to tokenize it.
Create a script named process.py
containing:
import sys
import nltk

for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line):
        print(' '.join(nltk.word_tokenize(sentence)).lower())
and run:
bzcat bible.en.txt.bz2 | python process.py | wc
to confirm that it works.
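To see the shape of the output this preprocessing step produces, here is a rough stand-in that uses a naive regex tokenizer instead of NLTK (the regex is an assumption for illustration only; NLTK's tokenizers handle many cases this one does not):

```python
import re

def naive_tokenize(line):
    # Crude stand-in for nltk.word_tokenize: split into runs of word
    # characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)

line = "In the beginning God created the heaven and the earth."
tokens = [t.lower() for t in naive_tokenize(line)]
print(' '.join(tokens))
# in the beginning god created the heaven and the earth .
```

Each output line is one sentence: lowercase, space-separated tokens, with punctuation split off, which is the format the training step below expects.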
Training a model
We can use KenLM to train a trigram language model with Kneser-Ney smoothing using the following commands:
bzcat bible.en.txt.bz2 |\
python process.py |\
./kenlm/bin/lmplz -o 3 > bible.arpa
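To get an intuition for what lmplz estimates, here is a toy sketch of a trigram model using plain maximum-likelihood counts. This is a deliberate simplification: KenLM actually uses modified Kneser-Ney smoothing, which redistributes probability mass to unseen n-grams, while this version assigns zero probability to anything it has not seen.

```python
import math
from collections import defaultdict

def train_trigram_mle(sentences):
    # Count trigrams and their bigram contexts over padded sentences.
    tri = defaultdict(int)
    bi = defaultdict(int)
    for sent in sentences:
        toks = ['<s>', '<s>'] + sent.split() + ['</s>']
        for i in range(2, len(toks)):
            ctx = (toks[i - 2], toks[i - 1])
            tri[ctx + (toks[i],)] += 1
            bi[ctx] += 1
    return tri, bi

def logprob(tri, bi, sentence):
    # Sum of log10 maximum-likelihood trigram probabilities.
    toks = ['<s>', '<s>'] + sentence.split() + ['</s>']
    total = 0.0
    for i in range(2, len(toks)):
        ctx = (toks[i - 2], toks[i - 1])
        total += math.log10(tri[ctx + (toks[i],)] / float(bi[ctx]))
    return total

tri, bi = train_trigram_mle(['in the beginning was the word'])
print(logprob(tri, bi, 'in the beginning was the word'))
# 0.0 -- every trigram was seen exactly once in its context
```

Smoothing is what makes the real model useful: without it, any sentence containing a single unseen trigram would score negative infinity.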
Then you can compile the model into a binary format with build_binary
to optimize loading time:
./kenlm/bin/build_binary bible.arpa bible.klm
Finally, you can load your language model and use it to score sentences:
import kenlm
model = kenlm.LanguageModel('bible.klm')
model.score('in the beginning was the word')
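model.score returns the total log10 probability of the sentence. A common derived quantity is per-word perplexity, which can be computed from that score; the sketch below shows the arithmetic (the score value used is an assumed placeholder, not the real output for bible.klm):

```python
import math

def perplexity(log10_prob, num_tokens):
    # Perplexity from a total log10 probability, the quantity
    # kenlm's model.score returns for a sentence.
    return 10.0 ** (-log10_prob / num_tokens)

# Hypothetical score for a 6-word sentence; num_tokens is 7 because
# the scorer also predicts the end-of-sentence token </s>.
score = -24.0  # assumed value for illustration
print(perplexity(score, 7))
```

Lower perplexity means the model finds the sentence less surprising; comparing scores across sentences of different lengths is only meaningful after this kind of normalization.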
Notes
To get the NLTK sentence tokenizer, run nltk.download('punkt') once in a Python interpreter (or call nltk.download() and select the punkt package in the dialog).
For a detailed introduction to n-gram language models, read Querying and Serving N-gram Language Models with Python.