.. Copyright (C) 2001-2020 NLTK Project
.. For license information, see LICENSE.TXT

Crubadan Corpus Reader
======================

Crubadan is an NLTK corpus reader for ngram files provided
by the Crubadan project. It supports several languages.

    >>> from nltk.corpus import crubadan
    >>> crubadan.langs() # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ['abk', 'abn',..., 'zpa', 'zul']

----------------------------------------
Language code mapping and helper methods
----------------------------------------

The web crawler that generates the 3-gram frequencies works at the
level of "writing systems" rather than languages. Writing systems
are assigned internal 2-3 letter codes that require mapping to the
standard ISO 639-3 codes. For more information, please refer to
the README in the nltk_data/crubadan folder after installing the corpus.
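
The mapping described above can be pictured as a two-way lookup table.
The following is only a minimal sketch, not the reader's actual
implementation: the table holds just the two pairs used in this file, and
the names CRUBADAN_TO_ISO, ISO_TO_CRUBADAN, to_crubadan and to_iso are
hypothetical.

```python
# Hypothetical excerpt of a code table; the real table ships with the
# corpus data and covers many more writing systems.
CRUBADAN_TO_ISO = {"en": "eng", "fr": "fra"}

# Invert the table once so lookups work in both directions.
ISO_TO_CRUBADAN = {iso: code for code, iso in CRUBADAN_TO_ISO.items()}


def to_crubadan(iso_code):
    """Return the internal code for an ISO 639-3 code, or None if unmapped."""
    return ISO_TO_CRUBADAN.get(iso_code)


def to_iso(crubadan_code):
    """Return the ISO 639-3 code for an internal code, or None if unmapped."""
    return CRUBADAN_TO_ISO.get(crubadan_code)
```

Returning None for an unmapped code, rather than raising, mirrors the
behaviour the doctests in this section show for unknown codes.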

To translate an ISO 639-3 code to its Crubadan code (an unmapped code returns None, so nothing is printed):

    >>> crubadan.iso_to_crubadan('eng')
    'en'
    >>> crubadan.iso_to_crubadan('fra')
    'fr'
    >>> crubadan.iso_to_crubadan('aaa')

In the reverse direction, to get the ISO 639-3 code from a Crubadan code:

    >>> crubadan.crubadan_to_iso('en')
    'eng'
    >>> crubadan.crubadan_to_iso('fr')
    'fra'
    >>> crubadan.crubadan_to_iso('aa')

---------------------------
Accessing ngram frequencies
---------------------------

On initialization the reader will create a dictionary of every
language supported by the Crubadan project, mapping the ISO 639-3
language code to its corresponding ngram frequency distribution (a FreqDist).
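
The dictionary described above can be pictured as follows. This is a
rough sketch only: it uses collections.Counter (which NLTK's FreqDist
subclasses) so that it runs without the corpus installed, the
"count ngram" line format is an assumption, and the sample counts and the
names parse_ngram_lines and all_lang_freq are made up for illustration.

```python
from collections import Counter


def parse_ngram_lines(lines):
    """Build a Counter from lines of the assumed form '<count> <ngram>'."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        freq, ngram = line.split(" ", 1)
        counts[ngram] = int(freq)
    return counts


# Made-up sample data standing in for the per-language ngram files.
sample_files = {
    "eng": ["120 the\n", "45 ing\n"],
    "fra": ["98 les\n"],
}

# ISO 639-3 code -> frequency distribution of that language's ngrams.
all_lang_freq = {
    iso: parse_ngram_lines(lines) for iso, lines in sample_files.items()
}
```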

You can access an individual language's FreqDist and the ngrams within it as follows:

    >>> english_fd = crubadan.lang_freq('eng')
    >>> english_fd['the']
    728135

The example above retrieves the FreqDist for English and returns the frequency of the ngram 'the'.
An ngram that isn't found within the language will return 0:

    >>> english_fd['sometest']
    0

A language that isn't supported will raise an exception:

    >>> crubadan.lang_freq('elvish')
    Traceback (most recent call last):
    ...
    RuntimeError: Unsupported language.
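
Code that receives language codes from outside may prefer to catch this
exception instead of letting it propagate. The sketch below shows one
possible pattern under stated assumptions: lang_freq_sketch is a
hypothetical stand-in that imitates the behaviour shown above (it is not
the reader's method), and SUPPORTED holds made-up counts.

```python
from collections import Counter

# Made-up data standing in for the loaded frequency distributions.
SUPPORTED = {"eng": Counter({"the": 3})}


def lang_freq_sketch(lang):
    """Hypothetical stand-in: raise RuntimeError for an unknown language."""
    if lang not in SUPPORTED:
        raise RuntimeError("Unsupported language.")
    return SUPPORTED[lang]


def safe_ngram_count(lang, ngram):
    """Return the ngram count, or None when the language is unsupported."""
    try:
        return lang_freq_sketch(lang)[ngram]
    except RuntimeError:
        return None
```

This keeps the two "missing" cases distinct: an unseen ngram in a
supported language yields 0, while an unsupported language yields None.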