.. Copyright (C) 2001-2020 NLTK Project
.. For license information, see LICENSE.TXT

Crubadan Corpus Reader
======================

Crubadan is an NLTK corpus reader for ngram files provided
by the Crubadan project. It supports several languages.

>>> from nltk.corpus import crubadan
>>> crubadan.langs() # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
['abk', 'abn',..., 'zpa', 'zul']

----------------------------------------
Language code mapping and helper methods
----------------------------------------

The web crawler that generates the 3-gram frequencies works at the
level of "writing systems" rather than languages. Each writing system
is assigned an internal 2-3 letter code, which must be mapped to the
standard ISO 639-3 codes. For more information, refer to the README
in the nltk_data/crubadan folder after installing the corpus.

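As a rough illustration of what a 3-gram frequency table contains, the sketch below counts overlapping character trigrams in a short string. The helper name char_trigrams is hypothetical, and the plain sliding window with no padding or normalization is an assumption; the Crubadan crawler may tokenize its input differently.

```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping 3-character windows in text (hypothetical helper)."""
    # Slide a window of width 3 across the string.
    # Strings shorter than 3 characters produce an empty Counter.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


counts = char_trigrams("the theme")
print(counts["the"])          # 2: the window "the" occurs twice
print(sum(counts.values()))   # 7: a 9-character string has 7 windows
```

The result has the same shape as the tables the reader exposes: a mapping from a 3-character string to its count.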
To translate an ISO 639-3 code to its Crubadan code (an unmapped code
returns None, so the last call prints nothing):

>>> crubadan.iso_to_crubadan('eng')
'en'
>>> crubadan.iso_to_crubadan('fra')
'fr'
>>> crubadan.iso_to_crubadan('aaa')

Conversely, to look up the ISO 639-3 code for a given Crubadan code:

>>> crubadan.crubadan_to_iso('en')
'eng'
>>> crubadan.crubadan_to_iso('fr')
'fra'
>>> crubadan.crubadan_to_iso('aa')

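Because the calls with 'aaa' and 'aa' above print nothing, an unmapped code evidently yields None, and callers may want to guard the result before using it. The following is a minimal sketch: translate_code and the stand-in table ISO_TO_CRUBADAN are hypothetical and exist only so the snippet runs without the corpus installed.

```python
# Stand-in for the real mapping; it holds only the two pairs shown above.
ISO_TO_CRUBADAN = {"eng": "en", "fra": "fr"}


def translate_code(lookup, code, default=None):
    """Apply a code-mapping callable; substitute default when it returns None."""
    result = lookup(code)
    return default if result is None else result


print(translate_code(ISO_TO_CRUBADAN.get, "eng"))
print(translate_code(ISO_TO_CRUBADAN.get, "aaa", default="unknown"))
```

With the corpus data available, crubadan.iso_to_crubadan or crubadan.crubadan_to_iso could presumably be passed as lookup in place of the dictionary method.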
---------------------------
Accessing ngram frequencies
---------------------------

On initialization the reader creates a dictionary covering every
language supported by the Crubadan project, mapping each ISO 639-3
language code to its corresponding ngram frequency distribution.

You can access the FreqDist for an individual language, and the ngram counts within it, as follows:

>>> english_fd = crubadan.lang_freq('eng')
>>> english_fd['the']
728135

The example above retrieves the FreqDist for English and looks up the frequency of the ngram 'the'.
An ngram that isn't found within the language has a frequency of 0:

>>> english_fd['sometest']
0

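The zero for a missing ngram follows from FreqDist being a subclass of collections.Counter, which reports a count of 0 for absent keys without raising KeyError and without adding the key. The sketch below shows that behavior with a plain Counter and made-up counts, so it runs without NLTK or the corpus.

```python
from collections import Counter

# Made-up counts for illustration; these are not Crubadan data.
fd = Counter({"the": 3, "he ": 2})

missing = fd["sometest"]   # absent key: no KeyError is raised
print(missing)
print("sometest" in fd)    # the lookup did not insert the key
print(len(fd))             # still only the two original entries
```

One consequence is that a count of 0 cannot distinguish "never observed" from "not a valid ngram"; use a membership test when that difference matters.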
Requesting a language that isn't supported raises a RuntimeError:

>>> crubadan.lang_freq('elvish')
Traceback (most recent call last):
...
RuntimeError: Unsupported language.

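Code that accepts arbitrary language codes may prefer to catch this error instead of letting it propagate. A minimal sketch follows, assuming only that lang_freq raises RuntimeError for an unsupported language as shown above. The names safe_lang_freq and _StubReader are hypothetical, and the stub stands in for the crubadan reader so the snippet runs without the corpus.

```python
from collections import Counter


class _StubReader:
    """Hypothetical stand-in for the crubadan reader, with made-up counts."""

    _data = {"eng": Counter({"the": 3})}

    def lang_freq(self, lang):
        # Mirror the documented behavior: unknown languages raise RuntimeError.
        if lang not in self._data:
            raise RuntimeError("Unsupported language.")
        return self._data[lang]


def safe_lang_freq(reader, lang):
    """Return the reader's frequency table, or an empty Counter if unsupported."""
    try:
        return reader.lang_freq(lang)
    except RuntimeError:
        return Counter()


reader = _StubReader()
print(safe_lang_freq(reader, "eng")["the"])
print(safe_lang_freq(reader, "elvish")["the"])
```

Returning an empty Counter keeps later lookups uniform, since every ngram then has a count of 0; a caller that needs to tell the two cases apart should catch the exception directly.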