.. Copyright (C) 2001-2020 NLTK Project
.. For license information, see LICENSE.TXT

.. -*- coding: utf-8 -*-

Regression Tests
================

Issue 167
---------
https://github.com/nltk/nltk/issues/167

>>> from nltk.corpus import brown
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> ngram_order = 3
>>> train_data, vocab_data = padded_everygram_pipeline(
...     ngram_order,
...     brown.sents(categories="news")
... )
>>> from nltk.lm import WittenBellInterpolated
>>> lm = WittenBellInterpolated(ngram_order)
>>> lm.fit(train_data, vocab_data)

A sentence containing an unseen word should result in infinite entropy,
because Witten-Bell is ultimately based on MLE, which cannot handle unseen
ngrams. Crucially, it shouldn't raise any exceptions for unseen words.

>>> from nltk.util import ngrams
>>> sent = ngrams("This is a sentence with the word aaddvark".split(), 3)
>>> lm.entropy(sent)
inf

If we remove all unseen ngrams from the sentence, we get a finite value for
the entropy.

>>> sent = ngrams("This is a sentence".split(), 3)
>>> lm.entropy(sent)
17.41365588455936

Issue 367
---------
https://github.com/nltk/nltk/issues/367

Reproducing Dan Blanchard's example:
https://github.com/nltk/nltk/issues/367#issuecomment-14646110

>>> from nltk.lm import Lidstone, Vocabulary
>>> word_seq = list('aaaababaaccbacb')
>>> ngram_order = 2
>>> from nltk.util import everygrams
>>> train_data = [everygrams(word_seq, max_len=ngram_order)]
>>> V = Vocabulary(['a', 'b', 'c', ''])
>>> lm = Lidstone(0.2, ngram_order, vocabulary=V)
>>> lm.fit(train_data)

For the doctest output to be deterministic we have to sort the vocabulary keys.

>>> V_keys = sorted(V)
>>> round(sum(lm.score(w, ("b",)) for w in V_keys), 6)
1.0
>>> round(sum(lm.score(w, ("a",)) for w in V_keys), 6)
1.0
>>> [lm.score(w, ("b",)) for w in V_keys]
[0.05, 0.05, 0.8, 0.05, 0.05]
>>> [round(lm.score(w, ("a",)), 4) for w in V_keys]
[0.0222, 0.0222, 0.4667, 0.2444, 0.2444]

Reproducing @afourney's comment:
https://github.com/nltk/nltk/issues/367#issuecomment-15686289

>>> sent = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, [sent])
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)

The vocabulary includes the "<UNK>" symbol as well as two padding symbols.

>>> len(lm.vocab)
6
>>> word = "foo"
>>> context = ("bar", "baz")

The raw counts.

>>> lm.context_counts(context)[word]
0
>>> lm.context_counts(context).N()
1

Counts with Lidstone smoothing.

>>> lm.context_counts(context)[word] + lm.gamma
0.2
>>> lm.context_counts(context).N() + len(lm.vocab) * lm.gamma
2.2

Without any backoff, just using Lidstone smoothing, P("foo" | "bar", "baz") should be:
0.2 / 2.2 ~= 0.090909

>>> round(lm.score(word, context), 6)
0.090909

Issue 380
---------
https://github.com/nltk/nltk/issues/380

Reproducing a setup akin to this comment:
https://github.com/nltk/nltk/issues/380#issue-12879030

For speed, take only the first 100 sentences of Reuters; this shouldn't
affect the test.

>>> from nltk.corpus import reuters
>>> sents = reuters.sents()[:100]
>>> ngram_order = 3
>>> from nltk.lm.preprocessing import padded_everygram_pipeline
>>> train_data, vocab_data = padded_everygram_pipeline(ngram_order, sents)
>>> from nltk.lm import Lidstone
>>> lm = Lidstone(0.2, ngram_order)
>>> lm.fit(train_data, vocab_data)
>>> lm.score("said", ("<s>",)) < 1
True
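
The score in the check above can also be reproduced by hand from the context
counts and gamma, mirroring the Issue 367 walkthrough; this is just an
illustration of the same computation, using Vocabulary.lookup to mask words
the same way score does internally.

>>> word, context = "said", ("<s>",)
>>> counts = lm.context_counts(lm.vocab.lookup(context))
>>> by_hand = (counts[lm.vocab.lookup(word)] + lm.gamma) / (
...     counts.N() + len(lm.vocab) * lm.gamma
... )
>>> by_hand == lm.score(word, context)
True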