You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
380 lines
14 KiB
Plaintext
380 lines
14 KiB
Plaintext
5 years ago
|
.. Copyright (C) 2001-2019 NLTK Project
|
||
|
.. For license information, see LICENSE.TXT
|
||
|
|
||
|
=========================================
|
||
|
Loading Resources From the Data Package
|
||
|
=========================================
|
||
|
|
||
|
>>> import nltk.data
|
||
|
|
||
|
Overview
|
||
|
~~~~~~~~
|
||
|
The `nltk.data` module contains functions that can be used to load
|
||
|
NLTK resource files, such as corpora, grammars, and saved processing
|
||
|
objects.
|
||
|
|
||
|
Loading Data Files
|
||
|
~~~~~~~~~~~~~~~~~~
|
||
|
Resources are loaded using the function `nltk.data.load()`, which
|
||
|
takes as its first argument a URL specifying what file should be
|
||
|
loaded. The ``nltk:`` protocol loads files from the NLTK data
|
||
|
distribution:
|
||
|
|
||
|
>>> from __future__ import print_function
|
||
|
>>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
|
||
|
>>> tokenizer.tokenize('Hello. This is a test. It works!')
|
||
|
['Hello.', 'This is a test.', 'It works!']
|
||
|
|
||
|
It is important to note that there should be no space following the
|
||
|
colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will
|
||
|
not work!
|
||
|
|
||
|
The ``nltk:`` protocol is used by default if no protocol is specified:
|
||
|
|
||
|
>>> nltk.data.load('tokenizers/punkt/english.pickle') # doctest: +ELLIPSIS
|
||
|
<nltk.tokenize.punkt.PunktSentenceTokenizer object at ...>
|
||
|
|
||
|
But it is also possible to load resources from ``http:``, ``ftp:``,
|
||
|
and ``file:`` URLs, e.g. ``cfg = nltk.data.load('http://example.com/path/to/toy.cfg')``
|
||
|
|
||
|
>>> # Load a grammar using an absolute path.
|
||
|
>>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg')
|
||
|
>>> url.replace('\\', '/') # doctest: +ELLIPSIS
|
||
|
'file:...toy.cfg'
|
||
|
>>> print(nltk.data.load(url)) # doctest: +ELLIPSIS
|
||
|
Grammar with 14 productions (start state = S)
|
||
|
S -> NP VP
|
||
|
PP -> P NP
|
||
|
...
|
||
|
P -> 'on'
|
||
|
P -> 'in'
|
||
|
|
||
|
The second argument to the `nltk.data.load()` function specifies the
|
||
|
file format, which determines how the file's contents are processed
|
||
|
before they are returned by ``load()``. The formats that are
|
||
|
currently supported by the data module are described by the dictionary
|
||
|
`nltk.data.FORMATS`:
|
||
|
|
||
|
>>> for format, descr in sorted(nltk.data.FORMATS.items()):
|
||
|
... print('{0:<7} {1:}'.format(format, descr)) # doctest: +NORMALIZE_WHITESPACE
|
||
|
cfg A context free grammar.
|
||
|
fcfg A feature CFG.
|
||
|
fol A list of first order logic expressions, parsed with
|
||
|
nltk.sem.logic.Expression.fromstring.
|
||
|
json A serialized python object, stored using the json module.
|
||
|
logic A list of first order logic expressions, parsed with
|
||
|
nltk.sem.logic.LogicParser. Requires an additional logic_parser
|
||
|
parameter
|
||
|
pcfg A probabilistic CFG.
|
||
|
pickle A serialized python object, stored using the pickle
|
||
|
module.
|
||
|
raw The raw (byte string) contents of a file.
|
||
|
text The raw (unicode string) contents of a file.
|
||
|
val A semantic valuation, parsed by
|
||
|
nltk.sem.Valuation.fromstring.
|
||
|
yaml A serialized python object, stored using the yaml module.
|
||
|
|
||
|
`nltk.data.load()` will raise a ValueError if a bad format name is
|
||
|
specified:
|
||
|
|
||
|
>>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar')
|
||
|
Traceback (most recent call last):
|
||
|
. . .
|
||
|
ValueError: Unknown format type!
|
||
|
|
||
|
By default, the ``"auto"`` format is used, which chooses a format
|
||
|
based on the filename's extension. The mapping from file extensions
|
||
|
to format names is specified by `nltk.data.AUTO_FORMATS`:
|
||
|
|
||
|
>>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()):
|
||
|
... print('.%-7s -> %s' % (ext, format))
|
||
|
.cfg -> cfg
|
||
|
.fcfg -> fcfg
|
||
|
.fol -> fol
|
||
|
.json -> json
|
||
|
.logic -> logic
|
||
|
.pcfg -> pcfg
|
||
|
.pickle -> pickle
|
||
|
.text -> text
|
||
|
.txt -> text
|
||
|
.val -> val
|
||
|
.yaml -> yaml
|
||
|
|
||
|
If `nltk.data.load()` is unable to determine the format based on the
|
||
|
filename's extension, it will raise a ValueError:
|
||
|
|
||
|
>>> nltk.data.load('foo.bar')
|
||
|
Traceback (most recent call last):
|
||
|
. . .
|
||
|
ValueError: Could not determine format for foo.bar based on its file
|
||
|
extension; use the "format" argument to specify the format explicitly.
|
||
|
|
||
|
Note that by explicitly specifying the ``format`` argument, you can
|
||
|
override the load method's default processing behavior. For example,
|
||
|
to get the raw contents of any file, simply use ``format="raw"``:
|
||
|
|
||
|
>>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text')
|
||
|
>>> print(s) # doctest: +ELLIPSIS
|
||
|
S -> NP VP
|
||
|
PP -> P NP
|
||
|
NP -> Det N | NP PP
|
||
|
VP -> V NP | VP PP
|
||
|
...
|
||
|
|
||
|
Making Local Copies
|
||
|
~~~~~~~~~~~~~~~~~~~
|
||
|
.. This will not be visible in the html output: create a tempdir to
|
||
|
play in.
|
||
|
>>> import tempfile, os
|
||
|
>>> tempdir = tempfile.mkdtemp()
|
||
|
>>> old_dir = os.path.abspath('.')
|
||
|
>>> os.chdir(tempdir)
|
||
|
|
||
|
The function `nltk.data.retrieve()` copies a given resource to a local
|
||
|
file. This can be useful, for example, if you want to edit one of the
|
||
|
sample grammars.
|
||
|
|
||
|
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg')
|
||
|
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg'
|
||
|
|
||
|
>>> # Simulate editing the grammar.
|
||
|
>>> with open('toy.cfg') as inp:
|
||
|
... s = inp.read().replace('NP', 'DP')
|
||
|
>>> with open('toy.cfg', 'w') as out:
|
||
|
... _bytes_written = out.write(s)
|
||
|
|
||
|
>>> # Load the edited grammar, & display it.
|
||
|
>>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg'))
|
||
|
>>> print(cfg) # doctest: +ELLIPSIS
|
||
|
Grammar with 14 productions (start state = S)
|
||
|
S -> DP VP
|
||
|
PP -> P DP
|
||
|
...
|
||
|
P -> 'on'
|
||
|
P -> 'in'
|
||
|
|
||
|
The second argument to `nltk.data.retrieve()` specifies the filename
|
||
|
for the new copy of the file. By default, the source file's filename
|
||
|
is used.
|
||
|
|
||
|
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg')
|
||
|
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg'
|
||
|
>>> os.path.isfile('./mytoy.cfg')
|
||
|
True
|
||
|
>>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg')
|
||
|
Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg'
|
||
|
>>> os.path.isfile('./np.fcfg')
|
||
|
True
|
||
|
|
||
|
If a file with the specified (or default) filename already exists in
|
||
|
the current directory, then `nltk.data.retrieve()` will raise a
|
||
|
ValueError exception. It will *not* overwrite the file:
|
||
|
|
||
|
>>> os.path.isfile('./toy.cfg')
|
||
|
True
|
||
|
>>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') # doctest: +ELLIPSIS
|
||
|
Traceback (most recent call last):
|
||
|
. . .
|
||
|
ValueError: File '...toy.cfg' already exists!
|
||
|
|
||
|
.. This will not be visible in the html output: clean up the tempdir.
|
||
|
>>> os.chdir(old_dir)
|
||
|
>>> for f in os.listdir(tempdir):
|
||
|
... os.remove(os.path.join(tempdir, f))
|
||
|
>>> os.rmdir(tempdir)
|
||
|
|
||
|
Finding Files in the NLTK Data Package
|
||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
The `nltk.data.find()` function searches the NLTK data package for a
|
||
|
given file, and returns a pointer to that file. This pointer can
|
||
|
either be a `FileSystemPathPointer` (whose `path` attribute gives the
|
||
|
absolute path of the file); or a `ZipFilePathPointer`, specifying a
|
||
|
zipfile and the name of an entry within that zipfile. Both pointer
|
||
|
types define the `open()` method, which can be used to read the string
|
||
|
contents of the file.
|
||
|
|
||
|
>>> path = nltk.data.find('corpora/abc/rural.txt')
|
||
|
>>> str(path) # doctest: +ELLIPSIS
|
||
|
'...rural.txt'
|
||
|
>>> print(path.open().read(60).decode())
|
||
|
PM denies knowledge of AWB kickbacks
|
||
|
The Prime Minister has
|
||
|
|
||
|
Alternatively, the `nltk.data.load()` function can be used with the
|
||
|
keyword argument ``format="raw"``:
|
||
|
|
||
|
>>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60]
|
||
|
>>> print(s.decode())
|
||
|
PM denies knowledge of AWB kickbacks
|
||
|
The Prime Minister has
|
||
|
|
||
|
Alternatively, you can use the keyword argument ``format="text"``:
|
||
|
|
||
|
>>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60]
|
||
|
>>> print(s)
|
||
|
PM denies knowledge of AWB kickbacks
|
||
|
The Prime Minister has
|
||
|
|
||
|
Resource Caching
|
||
|
~~~~~~~~~~~~~~~~
|
||
|
|
||
|
NLTK uses a weakref dictionary to maintain a cache of resources that
|
||
|
have been loaded. If you load a resource that is already stored in
|
||
|
the cache, then the cached copy will be returned. This behavior can
|
||
|
be seen by the trace output generated when verbose=True:
|
||
|
|
||
|
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
|
||
|
<<Loading nltk:grammars/book_grammars/feat0.fcfg>>
|
||
|
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True)
|
||
|
<<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>
|
||
|
|
||
|
If you wish to load a resource from its source, bypassing the cache,
|
||
|
use the ``cache=False`` argument to `nltk.data.load()`. This can be
|
||
|
useful, for example, if the resource is loaded from a local file, and
|
||
|
you are actively editing that file:
|
||
|
|
||
|
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',cache=False,verbose=True)
|
||
|
<<Loading nltk:grammars/book_grammars/feat0.fcfg>>
|
||
|
|
||
|
The cache *no longer* uses weak references. A resource will not be
|
||
|
automatically expunged from the cache when no more objects are using
|
||
|
it. In the following example, when we clear the variable ``feat0``,
|
||
|
the reference count for the feature grammar object drops to zero.
|
||
|
However, the object remains cached:
|
||
|
|
||
|
>>> del feat0
|
||
|
>>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',
|
||
|
... verbose=True)
|
||
|
<<Using cached copy of nltk:grammars/book_grammars/feat0.fcfg>>
|
||
|
|
||
|
You can clear the entire contents of the cache, using
|
||
|
`nltk.data.clear_cache()`:
|
||
|
|
||
|
>>> nltk.data.clear_cache()
|
||
|
|
||
|
Retrieving other Data Sources
|
||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
>>> formulas = nltk.data.load('grammars/book_grammars/background.fol')
|
||
|
>>> for f in formulas: print(str(f))
|
||
|
all x.(boxerdog(x) -> dog(x))
|
||
|
all x.(boxer(x) -> person(x))
|
||
|
all x.-(dog(x) & person(x))
|
||
|
all x.(married(x) <-> exists y.marry(x,y))
|
||
|
all x.(bark(x) -> dog(x))
|
||
|
all x y.(marry(x,y) -> (person(x) & person(y)))
|
||
|
-(Vincent = Mia)
|
||
|
-(Vincent = Fido)
|
||
|
-(Mia = Fido)
|
||
|
|
||
|
Regression Tests
|
||
|
~~~~~~~~~~~~~~~~
|
||
|
Create a temp dir for tests that write files:
|
||
|
|
||
|
>>> import tempfile, os
|
||
|
>>> tempdir = tempfile.mkdtemp()
|
||
|
>>> old_dir = os.path.abspath('.')
|
||
|
>>> os.chdir(tempdir)
|
||
|
|
||
|
The `retrieve()` function accepts all url types:
|
||
|
|
||
|
>>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg',
|
||
|
... 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'),
|
||
|
... 'nltk:grammars/sample_grammars/toy.cfg',
|
||
|
... 'grammars/sample_grammars/toy.cfg']
|
||
|
>>> for i, url in enumerate(urls):
|
||
|
... nltk.data.retrieve(url, 'toy-%d.cfg' % i) # doctest: +ELLIPSIS
|
||
|
Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg'
|
||
|
Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg'
|
||
|
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg'
|
||
|
Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg'
|
||
|
|
||
|
Clean up the temp dir:
|
||
|
|
||
|
>>> os.chdir(old_dir)
|
||
|
>>> for f in os.listdir(tempdir):
|
||
|
... os.remove(os.path.join(tempdir, f))
|
||
|
>>> os.rmdir(tempdir)
|
||
|
|
||
|
Lazy Loader
|
||
|
-----------
|
||
|
A lazy loader is a wrapper object that defers loading a resource until
|
||
|
it is accessed or used in any way. This is mainly intended for
|
||
|
internal use by NLTK's corpus readers.
|
||
|
|
||
|
>>> # Create a lazy loader for toy.cfg.
|
||
|
>>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
|
||
|
|
||
|
>>> # Show that it's not loaded yet:
|
||
|
>>> object.__repr__(ll) # doctest: +ELLIPSIS
|
||
|
'<nltk.data.LazyLoader object at ...>'
|
||
|
|
||
|
>>> # printing it is enough to cause it to be loaded:
|
||
|
>>> print(ll)
|
||
|
<Grammar with 14 productions>
|
||
|
|
||
|
>>> # Show that it's now been loaded:
|
||
|
>>> object.__repr__(ll) # doctest: +ELLIPSIS
|
||
|
'<nltk.grammar.CFG object at ...>'
|
||
|
|
||
|
|
||
|
>>> # Test that accessing an attribute also loads it:
|
||
|
>>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg')
|
||
|
>>> ll.start()
|
||
|
S
|
||
|
>>> object.__repr__(ll) # doctest: +ELLIPSIS
|
||
|
'<nltk.grammar.CFG object at ...>'
|
||
|
|
||
|
Buffered Gzip Reading and Writing
|
||
|
---------------------------------
|
||
|
Write performance to gzip-compressed is extremely poor when the files become large.
|
||
|
File creation can become a bottleneck in those cases.
|
||
|
|
||
|
Read performance from large gzipped pickle files was improved in data.py by
|
||
|
buffering the reads. A similar fix can be applied to writes by buffering
|
||
|
the writes to a StringIO object first.
|
||
|
|
||
|
This is mainly intended for internal use. The test simply tests that reading
|
||
|
and writing work as intended and does not test how much improvement buffering
|
||
|
provides.
|
||
|
|
||
|
>>> from nltk.compat import StringIO
|
||
|
>>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10)
|
||
|
>>> ans = []
|
||
|
>>> for i in range(10000):
|
||
|
... ans.append(str(i).encode('ascii'))
|
||
|
... test.write(str(i).encode('ascii'))
|
||
|
>>> test.close()
|
||
|
>>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb')
|
||
|
>>> test.read() == b''.join(ans)
|
||
|
True
|
||
|
>>> test.close()
|
||
|
>>> import os
|
||
|
>>> os.unlink('testbuf.gz')
|
||
|
|
||
|
JSON Encoding and Decoding
|
||
|
--------------------------
|
||
|
JSON serialization is used instead of pickle for some classes.
|
||
|
|
||
|
>>> from nltk import jsontags
|
||
|
>>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag
|
||
|
>>> @jsontags.register_tag
|
||
|
... class JSONSerializable:
|
||
|
... json_tag = 'JSONSerializable'
|
||
|
...
|
||
|
... def __init__(self, n):
|
||
|
... self.n = n
|
||
|
...
|
||
|
... def encode_json_obj(self):
|
||
|
... return self.n
|
||
|
...
|
||
|
... @classmethod
|
||
|
... def decode_json_obj(cls, obj):
|
||
|
... n = obj
|
||
|
... return cls(n)
|
||
|
...
|
||
|
>>> JSONTaggedEncoder().encode(JSONSerializable(1))
|
||
|
'{"!JSONSerializable": 1}'
|
||
|
>>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n
|
||
|
1
|
||
|
|