VariKN
A toolkit for producing n-gram language models. Its highlights are implementations of the Kneser-Ney growing and revised Kneser pruning methods.
Introduction
The VariKN language modeling toolkit provides tools for training n-gram
language models. The supported methods include:
- Absolute discounting
- Kneser-Ney smoothing
- Revised Kneser pruning
- Kneser-Ney growing
The package provides accurate pruning for Kneser-Ney smoothed
models. It is also possible
to train very high-order n-gram models with the growing
algorithm. The models can be written in the ARPA LM format, which is
compatible with most other common tools in the field.
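As a rough illustration of the ARPA LM format mentioned above, the following Python sketch reads the n-gram counts from the header of an ARPA file. It is not part of VariKN: the function name read_arpa_counts and the tiny sample model are made up for this example, and the sketch assumes the usual layout (a \data\ header with "ngram N=count" lines, followed by \N-grams: sections and a closing \end\ marker).

```python
import io


def read_arpa_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file.

    Hypothetical helper for illustration only; not a VariKN API.
    """
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            # Header lines look like "ngram 2=1": order, then number of n-grams.
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            # The first "\N-grams:" section ends the header.
            break
    return counts


# A made-up toy model: log10 probability, n-gram, optional backoff weight.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-0.4771\t<s>\t-0.3010
-0.4771\ta\t-0.3010
-0.4771\t</s>

\\2-grams:
-0.3010\t<s> a

\\end\\
"""

counts = read_arpa_counts(io.StringIO(SAMPLE))
print(counts)
```

The same function can be pointed at an open file object for a model written by any tool that emits this format, which gives a quick check of how many n-grams of each order a grown or pruned model contains.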
Installation
See the file install.
Commands and Interfaces
The provided commands and interfaces are described in commands.html.
Scientific publications
- Description of algorithms, especially Revised Kneser pruning and Kneser-Ney growing:
Vesa Siivola, Teemu Hirsimäki and Sami Virpioja,
"On Growing and Pruning Kneser-Ney Smoothed N-Gram Models",
IEEE Transactions on Audio, Speech and Language Processing,
15(5):1617-1624, 2007.
- Guidelines on typical training parameters:
Vesa Siivola, Mathias Creutz and Mikko Kurimo,
"Morfessor and VariKN machine learning tools for speech and language technology",
Proceedings of the 8th International Conference on Speech Communication and Technology
(INTERSPEECH'07), 2007.
Links to other interesting GitHub projects
- Aalto ASR speech recognition system handles long n-gram contexts gracefully
- Morfessor provides an unsupervised method for producing morpheme-like sub-word units