# Natural Language Toolkit: Chunkers
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Steven Bird <stevenbird1@gmail.com>
#         Edward Loper <edloper@gmail.com>
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
#
"""
Classes and interfaces for identifying non-overlapping linguistic
groups (such as base noun phrases) in unrestricted text. This task is
called "chunk parsing" or "chunking", and the identified groups are
called "chunks". The chunked text is represented using a shallow
tree called a "chunk structure." A chunk structure is a tree
containing tokens and chunks, where each chunk is a subtree containing
only tokens. For example, the chunk structure for base noun phrase
chunks in the sentence "I saw the big dog on the hill" is::

    (SENTENCE:
      (NP: <I>)
      <saw>
      (NP: <the> <big> <dog>)
      <on>
      (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the
chunk structure's ``leaves()`` method.
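As an illustrative sketch (not part of this module), the chunk
structure above can be built by hand with ``nltk.tree.Tree``, assuming
each token is a ``(word, tag)`` pair. The part-of-speech tags are
hypothetical choices for this sentence, not tagger output:

```python
from nltk.tree import Tree

# Each leaf is a (word, tag) token; each NP subtree contains only tokens.
# The tags below are illustrative assumptions, not output from a tagger.
chunk_structure = Tree('S', [
    Tree('NP', [('I', 'PRP')]),
    ('saw', 'VBD'),
    Tree('NP', [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN')]),
    ('on', 'IN'),
    Tree('NP', [('the', 'DT'), ('hill', 'NN')]),
])

# leaves() flattens the tree back into the original list of tokens.
tokens = chunk_structure.leaves()
words = [word for word, tag in tokens]

# The chunks are the subtrees labelled with the chunk type.
noun_phrases = [subtree.leaves()
                for subtree in chunk_structure.subtrees()
                if subtree.label() == 'NP']
```

Here ``words`` recovers the sentence in order, and ``noun_phrases``
holds one token list per NP chunk.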
This module defines ``ChunkParserI``, a standard interface for
chunking texts; and ``RegexpChunkParser``, a regular-expression based
implementation of that interface. It also defines ``ChunkScore``, a
utility class for scoring chunk parsers.

RegexpChunkParser
=================

``RegexpChunkParser`` is an implementation of the chunk parser interface
that uses regular expressions over tags to chunk a text. Its
``parse()`` method first constructs a ``ChunkString``, which encodes a
particular chunking of the input text. Initially, nothing is
chunked. ``RegexpChunkParser`` then applies a sequence of
``RegexpChunkRule`` rules to the ``ChunkString``, each of which modifies
the chunking that it encodes. Finally, the ``ChunkString`` is
transformed back into a chunk structure, which is returned.
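The three steps can be sketched with the standard ``re`` module alone.
This is a simplified stand-in, not the actual ``ChunkString`` or
``RegexpChunkRule`` code: it assumes tags are encoded as ``<TAG>``
and chunks are marked with braces, and the input sentence, the rule,
and all variable names are hypothetical:

```python
import re

# Hypothetical tagged input; the tags are illustrative.
tagged = [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN'), ('barked', 'VBD')]

# Step 1: encode the tag sequence as a string in which nothing is chunked.
encoded = ''.join('<' + tag + '>' for word, tag in tagged)

# Step 2: apply one rule that wraps each match in braces to mark a chunk.
noun_phrase = re.compile('(?:<DT>)?(?:<JJ>)*(?:<NN>)+')
encoded = noun_phrase.sub(lambda match: '{' + match.group(0) + '}', encoded)

# Step 3: walk the encoded string and rebuild a nested structure.
structure, open_chunk = [], None
remaining = iter(tagged)
for piece in re.findall('[{}]|<[^<>]+>', encoded):
    if piece == '{':
        open_chunk = []                          # a chunk starts here
    elif piece == '}':
        structure.append(('NP', open_chunk))     # the chunk is complete
        open_chunk = None
    elif open_chunk is None:
        structure.append(next(remaining))        # token outside any chunk
    else:
        open_chunk.append(next(remaining))       # token inside the chunk
```

After step 2 the encoded string is ``{<DT><JJ><NN>}<VBD>``; step 3
turns it into one NP group followed by the unchunked verb.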

``RegexpChunkParser`` can only be used to chunk a single kind of phrase.
For example, you can use a ``RegexpChunkParser`` to chunk the noun
phrases in a text, or the verb phrases in a text; but you cannot
use it to simultaneously chunk both noun phrases and verb phrases in
the same text. (This is a limitation of ``RegexpChunkParser``, not of
chunk parsers in general.)

RegexpChunkRules
----------------

A ``RegexpChunkRule`` is a transformational rule that updates the
chunking of a text by modifying its ``ChunkString``. Each
``RegexpChunkRule`` defines the ``apply()`` method, which modifies
the chunking encoded by a ``ChunkString``. The
``RegexpChunkRule`` class itself can be used to implement any
transformational rule based on regular expressions. There are
also a number of subclasses, which can be used to implement
simpler types of rules:

- ``ChunkRule`` chunks anything that matches a given regular
  expression.
- ``ChinkRule`` chinks (removes from an existing chunk) anything
  that matches a given regular expression.
- ``UnChunkRule`` will un-chunk any chunk that matches a given
  regular expression.
- ``MergeRule`` can be used to merge two contiguous chunks.
- ``SplitRule`` can be used to split a single chunk into two
  smaller chunks.
- ``ExpandLeftRule`` will expand a chunk to incorporate new
  unchunked material on the left.
- ``ExpandRightRule`` will expand a chunk to incorporate new
  unchunked material on the right.

Tag Patterns
~~~~~~~~~~~~

A ``RegexpChunkRule`` uses a modified version of regular
expression patterns, called "tag patterns". Tag patterns are
used to match sequences of tags. Examples of tag patterns are::

    r'(<DT>|<JJ>|<NN>)+'
    r'<NN>+'
    r'<NN.*>'

The differences between regular expression patterns and tag
patterns are:

- In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so
  ``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not
  ``'<NN'`` followed by one or more repetitions of ``'>'``.
- Whitespace in tag patterns is ignored. So
  ``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.
- In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so
  ``'<NN.*>'`` matches any single tag starting with ``'NN'``.

The function ``tag_pattern2re_pattern`` can be used to transform
a tag pattern to an equivalent regular expression pattern.
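The three differences listed above can be sketched as plain string
rewrites. The helper below is hypothetical and deliberately minimal;
it is not the implementation of ``tag_pattern2re_pattern``, which
also validates its input and may produce a different (but equivalent)
regular expression:

```python
import re

def sketch_tag_pattern_to_regex(tag_pattern):
    # Hypothetical helper for illustration only.
    # Whitespace in tag patterns is ignored.
    pattern = ''.join(tag_pattern.split())
    # '<' and '>' act as parentheses, so a quantifier that follows '>'
    # applies to the whole tag rather than to the '>' character.
    pattern = pattern.replace('<', '(?:<(?:').replace('>', ')>)')
    # '.' stands for any character other than braces or angle brackets,
    # so a wildcard can never run past the end of a tag.
    pattern = pattern.replace('.', '[^{}<>]')
    return pattern

one_or_more_nouns = sketch_tag_pattern_to_regex('<NN>+')
any_noun_tag = sketch_tag_pattern_to_regex('<NN.*>')

# A sequence of tags is encoded as '<TAG><TAG>...' before matching.
matches_two_nouns = re.fullmatch(one_or_more_nouns, '<NN><NN>') is not None
matches_plural_tag = re.fullmatch(any_noun_tag, '<NNS>') is not None
spans_two_tags = re.fullmatch(any_noun_tag, '<NN><VB>') is not None
```

Under these assumptions ``'<NN>+'`` becomes ``'(?:<(?:NN)>)+'``, and
``'<NN.*>'`` matches ``<NNS>`` but not the two-tag sequence
``<NN><VB>``.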

Efficiency
----------

Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a
rate of about 300 tokens/second, with a moderately complex rule set.
There may be problems if ``RegexpChunkParser`` is used with more than
5,000 tokens at a time. In particular, evaluation of some regular
expressions may cause the Python regular expression engine to
exceed its maximum recursion depth. We have attempted to minimize
these problems, but it is impossible to avoid them completely. We
therefore recommend that you apply the chunk parser to a single
sentence at a time.
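A minimal sketch of that recommendation, assuming the text has
already been split into tagged sentences. The grammar and the
sentences are made up for the example:

```python
from nltk.chunk import RegexpParser

# Hypothetical noun phrase grammar: optional determiner, adjectives, nouns.
parser = RegexpParser('NP: {<DT>?<JJ>*<NN>+}')

# Hypothetical input, already split into sentences and tagged.
tagged_sentences = [
    [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('a', 'DT'), ('small', 'JJ'), ('cat', 'NN'), ('slept', 'VBD')],
]

# Parse one sentence per call rather than one long token list, so each
# regular expression only ever sees a short encoded string.
trees = [parser.parse(sentence) for sentence in tagged_sentences]
```

Each element of ``trees`` is the chunk structure for one sentence.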

Emacs Tip
---------

If you evaluate the following elisp expression in Emacs, it will
colorize a ``ChunkString`` when you use an interactive Python shell
with Emacs or XEmacs ("C-c !")::

    (let ()
      (defconst comint-mode-font-lock-keywords
        '(("<[^>]+>" 0 'font-lock-reference-face)
          ("[{}]" 0 'font-lock-function-name-face)))
      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer,
placing the cursor after the last close parenthesis, and typing
"``C-x C-e``". You should evaluate it before running the interactive
session. The change will last until you close Emacs.

Unresolved Issues
-----------------

If we use the ``re`` module for regular expressions, Python's
regular expression engine generates "maximum recursion depth
exceeded" errors when processing very large texts, even for
regular expressions that should not require any recursion. We
therefore use the ``pre`` module instead. But note that ``pre``
does not include Unicode support, so this module will not work
with unicode strings. Note also that ``pre`` regular expressions
are not quite as advanced as ``re`` ones (e.g., no leftward
zero-length assertions).

:type CHUNK_TAG_PATTERN: regexp
:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag
    pattern is valid.
"""
from nltk.data import load
from nltk.chunk.api import ChunkParserI
from nltk.chunk.util import (
    ChunkScore,
    accuracy,
    tagstr2tree,
    conllstr2tree,
    conlltags2tree,
    tree2conlltags,
    tree2conllstr,
    ieerstr2tree,
)
from nltk.chunk.regexp import RegexpChunkParser, RegexpParser
# Pickled models for the recommended named entity chunkers
_BINARY_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
_MULTICLASS_NE_CHUNKER = 'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'


def ne_chunk(tagged_tokens, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to
    chunk the given list of tagged tokens.
    """
    if binary:
        chunker_pickle = _BINARY_NE_CHUNKER
    else:
        chunker_pickle = _MULTICLASS_NE_CHUNKER
    chunker = load(chunker_pickle)
    return chunker.parse(tagged_tokens)


def ne_chunk_sents(tagged_sentences, binary=False):
    """
    Use NLTK's currently recommended named entity chunker to chunk the
    given list of tagged sentences, each consisting of a list of tagged tokens.
    """
    if binary:
        chunker_pickle = _BINARY_NE_CHUNKER
    else:
        chunker_pickle = _MULTICLASS_NE_CHUNKER
    chunker = load(chunker_pickle)
    return chunker.parse_sents(tagged_sentences)