In this article, I will go through all the steps necessary to create a language model that you can use in a Python program.
Set up a virtual environment:
```shell
mkdir klm
virtualenv klm
source klm/bin/activate
```
Install the dependencies:
```shell
pip install nltk  # required for tokenization
git clone https://github.com/vchahun/kenlm.git
cd kenlm
./bjam  # compile LM estimation code
python setup.py install  # install Python module
cd -
```
Getting some data
Let's download the Bible:
We will use NLTK to tokenize it.
Create a script named process.py:

```python
import sys
import nltk

for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line):
        print(' '.join(nltk.word_tokenize(sentence)).lower())
```
Run

```shell
bzcat bible.en.txt.bz2 | python process.py | wc
```

to confirm that it works.
Training a model
We can use KenLM to train a trigram language model with Kneser-Ney smoothing using the following command:
```shell
bzcat bible.en.txt.bz2 |\
python process.py |\
./kenlm/bin/lmplz -o 3 > bible.arpa
```
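To build intuition for what lmplz estimates, here is a toy sketch of trigram counting in plain Python. It uses unsmoothed maximum-likelihood estimates on a one-sentence corpus, not the modified Kneser-Ney smoothing KenLM actually applies, and the corpus and helper name are illustrative:

```python
from collections import Counter

# Toy corpus; lmplz would consume the full tokenized text instead.
corpus = [
    'in the beginning god created the heaven and the earth'.split(),
]

trigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    # Pad with sentence-boundary markers, as n-gram estimators do.
    tokens = ['<s>', '<s>'] + sentence + ['</s>']
    for i in range(2, len(tokens)):
        trigrams[tuple(tokens[i - 2:i + 1])] += 1
        bigrams[tuple(tokens[i - 2:i])] += 1

def mle(w1, w2, w3):
    # P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(mle('in', 'the', 'beginning'))  # 1.0 in this one-sentence corpus
```

Real models smooth these counts so that unseen trigrams still get nonzero probability, which is exactly what Kneser-Ney discounting provides.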
Then you can compile the model into a binary format with build_binary to optimize loading time:
./kenlm/bin/build_binary bible.arpa bible.klm
Finally, you can load your language model and use it to score sentences:
```python
import kenlm

model = kenlm.LanguageModel('bible.klm')
model.score('in the beginning was the word')
```
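The score returned is a base-10 log probability. As a quick sketch of how such a score relates to perplexity, here is a standalone conversion; the -12.0 score below is a placeholder for illustration, not an actual output of this model:

```python
# Perplexity from a base-10 log probability:
# ppl = 10 ** (-log10_prob / num_tokens),
# where num_tokens counts the words plus the end-of-sentence marker.
def perplexity(log10_prob, num_tokens):
    return 10.0 ** (-log10_prob / num_tokens)

sentence = 'in the beginning was the word'
log10_prob = -12.0  # placeholder; use model.score(sentence) in practice
n = len(sentence.split()) + 1  # + 1 for the </s> token
print(round(perplexity(log10_prob, n), 1))
```

Lower perplexity means the model finds the sentence more predictable, which is a handy sanity check when comparing models trained on different corpora.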
To get the NLTK sentence tokenizer, you need to execute nltk.download() in a Python interpreter and select the punkt package.
For a detailed introduction to n-gram language models, read Querying and Serving N-gram Language Models with Python.