Format of Perplexity Results

The perplexity task log file contains the following information:

Perplexity score: The average branching factor of the language in the text.
Total word count: The number of words in the sample text file.
OOV (out-of-vocabulary) rate: The number of words in the sample text that are not included in the language model vocabulary, as a percentage of the total word count.
Unique OOV rate: The number of unique OOV words as a percentage of the total number of unique words in the sample text.
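
The definitions above can be sketched in a few lines of Python. This is an illustration only, not the implementation used by the perplexity task: the function names are hypothetical, the sample is assumed to be already split into words, and the perplexity helper assumes natural-log word probabilities supplied by a language model.

```python
import math
from collections import Counter


def oov_stats(words, vocabulary):
    """Return (oov_rate, unique_oov_rate, oov_counts) for a tokenized sample.

    Both rates are percentages. oov_counts maps each OOV word to the
    number of times it occurs in the sample.
    """
    vocab = set(vocabulary)
    oov_counts = Counter(w for w in words if w not in vocab)
    total = len(words)
    unique_total = len(set(words))
    # OOV rate: OOV instances as a percentage of all words.
    oov_rate = 100.0 * sum(oov_counts.values()) / total if total else 0.0
    # Unique OOV rate: distinct OOV words as a percentage of distinct words.
    unique_oov_rate = (
        100.0 * len(oov_counts) / unique_total if unique_total else 0.0
    )
    return oov_rate, unique_oov_rate, oov_counts


def perplexity_from_logprobs(log_probs):
    """Perplexity as exp of the negative mean natural-log probability.

    Assumption: log_probs holds one natural-log probability per scored
    (in-vocabulary) word.
    """
    return math.exp(-sum(log_probs) / len(log_probs))
```

If every word were predicted with probability 1/4, the second function would return a perplexity of about 4, which matches the reading of perplexity as an average branching factor.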

The log file also lists all the OOV words, sorted by number of occurrences in the text and then alphabetically. Each word is listed alongside its number of occurrences.

For example:

Perplexity is 142.567 over 492 words, ignoring 24 OOV words.
Total word count is: 516  Count without <s> is: 492
OOV rate: 24 OOV / 492 words = 4.878%
Unique OOV: 20 OOV / 246 words = 8.130%
OOV WORDS: (20 unique words, 24 instances in text)
3 But
2 That
2 Well
1 Airbase
1 All
1 And
1 A
1 Beginning
1 Dramatic
1 He's
1 Interestingly
1 Of
1 She's
1 So
1 Tell
1 There
1 They
1 This
1 What
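
A log in this layout can be read back programmatically, for example to track OOV rates across runs. The following Python sketch assumes the log lines look exactly like the example above; the function name and the dictionary keys are hypothetical, and the patterns may need adjusting if the real log differs.

```python
import re

# Patterns modeled on the example log lines; assumed, not guaranteed.
PERPLEXITY_RE = re.compile(r"Perplexity is ([\d.]+) over (\d+) words")
OOV_RATE_RE = re.compile(r"OOV rate: (\d+) OOV / (\d+) words")
UNIQUE_OOV_RE = re.compile(r"Unique OOV: (\d+) OOV / (\d+) words")
# An OOV entry is a count followed by a single word, alone on a line.
OOV_ENTRY_RE = re.compile(r"^(\d+)\s+(\S+)\s*$", re.MULTILINE)


def parse_perplexity_log(text):
    """Extract the summary figures and the OOV word list from a log."""
    result = {}
    match = PERPLEXITY_RE.search(text)
    if match:
        result["perplexity"] = float(match.group(1))
        result["scored_words"] = int(match.group(2))
    match = OOV_RATE_RE.search(text)
    if match:
        oov, words = int(match.group(1)), int(match.group(2))
        # Recompute the percentage from the raw counts.
        result["oov_rate_pct"] = 100.0 * oov / words
    match = UNIQUE_OOV_RE.search(text)
    if match:
        oov, words = int(match.group(1)), int(match.group(2))
        result["unique_oov_rate_pct"] = 100.0 * oov / words
    result["oov_words"] = [
        (word, int(count)) for count, word in OOV_ENTRY_RE.findall(text)
    ]
    return result
```

Recomputing the percentages from the raw counts, rather than parsing the printed percentage, keeps the sketch independent of how the log rounds its figures.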

Perplexity values around or below 100 are typical and acceptable for call-center-style conversations. Aim for this value when you process telephone data (8 kHz sampling rate).

Perplexity values around or below 250 are typical and acceptable for broad-coverage content, such as news. Aim for this value when you process this kind of audio data (16 kHz sampling rate).
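
These guideline values can be turned into a simple automated check. The Python sketch below is one possible approach: the function name is hypothetical, and the 10% tolerance used to interpret "around" is an assumption rather than a documented threshold.

```python
# Guideline perplexity values from the text, keyed by sampling rate in kHz.
GUIDELINE_PERPLEXITY = {8: 100.0, 16: 250.0}


def within_guideline(perplexity, sample_rate_khz, tolerance=0.1):
    """Return True if perplexity is around or below the guideline value.

    tolerance is an assumed allowance for "around": 0.1 accepts values
    up to 10% above the guideline.
    """
    target = GUIDELINE_PERPLEXITY[sample_rate_khz]
    return perplexity <= target * (1.0 + tolerance)
```

Under these assumptions, a score of 142.567 passes the check for 16 kHz broad-coverage content but not for 8 kHz telephone data.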

