.. Copyright (C) 2001-2020 NLTK Project .. For license information, see LICENSE.TXT Crubadan Corpus Reader ====================== Crubadan is an NLTK corpus reader for ngram files provided by the Crubadan project. It supports several languages. >>> from nltk.corpus import crubadan >>> crubadan.langs() # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE ['abk', 'abn',..., 'zpa', 'zul'] ---------------------------------------- Language code mapping and helper methods ---------------------------------------- The web crawler that generates the 3-gram frequencies works at the level of "writing systems" rather than languages. Writing systems are assigned internal 2-3 letter codes that require mapping to the standard ISO 639-3 codes. For more information, please refer to the README in nltk_data/crubadan folder after installing it. To translate ISO 639-3 codes to "Crubadan Code": >>> crubadan.iso_to_crubadan('eng') 'en' >>> crubadan.iso_to_crubadan('fra') 'fr' >>> crubadan.iso_to_crubadan('aaa') In reverse, print ISO 639-3 code if we have the Crubadan Code: >>> crubadan.crubadan_to_iso('en') 'eng' >>> crubadan.crubadan_to_iso('fr') 'fra' >>> crubadan.crubadan_to_iso('aa') --------------------------- Accessing ngram frequencies --------------------------- On initialization the reader will create a dictionary of every language supported by the Crubadan project, mapping the ISO 639-3 language code to its corresponding ngram frequency. You can access individual language FreqDist and the ngrams within them as follows: >>> english_fd = crubadan.lang_freq('eng') >>> english_fd['the'] 728135 Above accesses the FreqDist of English and returns the frequency of the ngram 'the'. A ngram that isn't found within the language will return 0: >>> english_fd['sometest'] 0 A language that isn't supported will raise an exception: >>> crubadan.lang_freq('elvish') Traceback (most recent call last): ... RuntimeError: Unsupported language.