VariKN

A toolkit for producing n-gram language models. Its highlights are implementations of the Kneser-Ney growing and revised Kneser pruning methods.


Introduction

The VariKN language modeling toolkit provides tools for training n-gram language models. The supported methods include:

Absolute discounting
Kneser-Ney smoothing
Revised Kneser pruning
Kneser-Ney growing
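
As a rough illustration of what Kneser-Ney smoothing with an absolute discount computes, here is a minimal Python sketch of a textbook interpolated Kneser-Ney bigram model. It is not VariKN's implementation: the function name `train_kn_bigram`, the single fixed `discount` of 0.75, and the `<s>`/`</s>` sentence markers are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_kn_bigram(sentences, discount=0.75):
    """Return prob(word, prev) for a textbook interpolated Kneser-Ney bigram model.

    Illustrative sketch only; the fixed discount and sentence markers are assumptions.
    """
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        bigram_counts.update(zip(tokens, tokens[1:]))

    context_total = Counter()         # c(prev): total count of bigrams starting with prev
    followers = defaultdict(set)      # distinct words seen after prev
    predecessors = defaultdict(set)   # distinct words seen before word
    for (prev, word), count in bigram_counts.items():
        context_total[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    num_bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does `word` appear?
        continuation = len(predecessors.get(word, ())) / num_bigram_types
        total = context_total[prev]
        if total == 0:
            # Unseen context: fall back to the continuation distribution alone.
            return continuation
        # Absolute discounting: subtract a fixed amount from every seen count.
        discounted = max(bigram_counts[(prev, word)] - discount, 0.0) / total
        # The mass removed by discounting is redistributed via the lower-order model.
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * continuation

    return prob
```

Calling `prob = train_kn_bigram(["the cat sat", "the dog sat"])` and then `prob("cat", "the")` returns the smoothed estimate for "cat" following "the". For any seen context, the probabilities over all words that occur in the training data sum to one.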

The package provides accurate pruning for Kneser-Ney smoothed models. The growing algorithm also makes it possible to train very high-order n-gram models. The models can be written in the ARPA LM format, which is compatible with most other common tools in the field.
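
To show what consuming an ARPA-format model can look like, the following Python sketch reads the commonly used ARPA text layout (a `\data\` header, `\N-grams:` sections, and a closing `\end\`). The function name `read_arpa` is hypothetical, and the sketch assumes log10 probabilities with an optional trailing back-off weight on each line; it is not part of VariKN.

```python
def read_arpa(lines):
    """Parse ARPA LM text into {order: {ngram_tuple: (log10_prob, log10_backoff)}}.

    Sketch assuming the common ARPA layout; a missing back-off weight is read as 0.0.
    """
    models = {}
    order = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank lines and the count header carry no n-gram entries
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section marker such as "\2-grams:" gives the order of the entries below.
            order = int(line[1:line.index("-")])
            models[order] = {}
            continue
        if order is None:
            continue  # free-form text before the first section
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        models[order][words] = (log_prob, backoff)
    return models
```

A file can be parsed with `with open(path, encoding="utf-8") as f: models = read_arpa(f)`, where `path` points to a model file; `models[2][("the", "cat")]` then holds the log10 probability and back-off weight of that bigram.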

Installation

See the file install.

Commands and Interfaces

The provided commands and interfaces are described in commands.html.

Scientific publications

Description of the algorithms, especially revised Kneser pruning and Kneser-Ney growing: Vesa Siivola, Teemu Hirsimäki and Sami Virpioja, "On Growing and Pruning Kneser-Ney Smoothed N-Gram Models", IEEE Transactions on Audio, Speech, and Language Processing, 15(5):1617-1624, 2007.
Guidelines on typical training parameters: Vesa Siivola, Mathias Creutz and Mikko Kurimo, "Morfessor and VariKN machine learning tools for speech and language technology", Proceedings of the 8th International Conference on Speech Communication and Technology (INTERSPEECH'07), 2007.

Links to other interesting GitHub projects

Aalto ASR speech recognition system handles long n-gram contexts gracefully
Morfessor provides an unsupervised method for producing morpheme-like sub-word units