.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT
========
PropBank
========
The PropBank Corpus provides predicate-argument annotation for the
entire Penn Treebank. Each verb in the treebank is annotated by a single
instance in PropBank, containing information about the location of
the verb, and the location and identity of its arguments:
>>> from nltk.corpus import propbank
>>> pb_instances = propbank.instances()
>>> print(pb_instances) # doctest: +NORMALIZE_WHITESPACE
[<PropbankInstance: wsj_0001.mrg, sent 0, word 8>,
<PropbankInstance: wsj_0001.mrg, sent 1, word 10>, ...]
Each propbank instance defines the following member variables:
- Location information: `fileid`, `sentnum`, `wordnum`
- Annotator information: `tagger`
- Inflection information: `inflection`
- Roleset identifier: `roleset`
- Verb (aka predicate) location: `predicate`
- Argument locations and types: `arguments`
The following examples show the types of these arguments:
>>> inst = pb_instances[103]
>>> (inst.fileid, inst.sentnum, inst.wordnum)
('wsj_0004.mrg', 8, 16)
>>> inst.tagger
'gold'
>>> inst.inflection
<PropbankInflection: vp--a>
>>> infl = inst.inflection
>>> infl.form, infl.tense, infl.aspect, infl.person, infl.voice
('v', 'p', '-', '-', 'a')
>>> inst.roleset
'rise.01'
>>> inst.predicate
PropbankTreePointer(16, 0)
>>> inst.arguments # doctest: +NORMALIZE_WHITESPACE
((PropbankTreePointer(0, 2), 'ARG1'),
(PropbankTreePointer(13, 1), 'ARGM-DIS'),
(PropbankTreePointer(17, 1), 'ARG4-to'),
(PropbankTreePointer(20, 1), 'ARG3-from'))
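The compact string fields above can be unpacked with plain Python. The following is a minimal sketch, not part of NLTK: the helper names `split_inflection`, `split_roleset` and `split_label` are hypothetical, and the label layout (a base label followed by an optional hyphen-separated suffix) is an assumption read off the values shown above.

```python
from collections import namedtuple

# Field names follow the attribute names shown in the example above.
Inflection = namedtuple("Inflection", "form tense aspect person voice")


def split_inflection(code):
    """Split a five-character inflection code into named fields (hypothetical helper)."""
    if len(code) != 5:
        raise ValueError("expected a 5-character code, got %r" % (code,))
    return Inflection(*code)


def split_roleset(roleset_id):
    """Split an identifier such as 'rise.01' into (lemma, sense number)."""
    lemma, _, sense = roleset_id.rpartition(".")
    return lemma, sense


def split_label(argid):
    """Split a label such as 'ARG4-to' into (base label, suffix or None).

    Assumption: the suffix is whatever follows the first hyphen.
    """
    base, _, suffix = argid.partition("-")
    return base, suffix or None


print(split_inflection("vp--a"))
print(split_roleset("rise.01"))
print(split_label("ARG4-to"))
```

Keeping the sense number as a string preserves the leading zero, so the lemma and sense can be joined back into the original identifier.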
The locations of the predicate and of the arguments are encoded using
`PropbankTreePointer` objects, as well as `PropbankChainTreePointer`
objects and `PropbankSplitTreePointer` objects. A
`PropbankTreePointer` consists of a `wordnum` and a `height`:
>>> print(inst.predicate.wordnum, inst.predicate.height)
16 0
This identifies the tree constituent that is headed by the word that
is the `wordnum`\ 'th token in the sentence, and whose span is found
by going `height` nodes up in the tree. This type of pointer is only
useful if we also have the corresponding tree structure, since it
includes empty elements such as traces in the word number count. The
trees for 10% of the standard PropBank Corpus are contained in the
`treebank` corpus:
>>> tree = inst.tree
>>> from nltk.corpus import treebank
>>> assert tree == treebank.parsed_sents(inst.fileid)[inst.sentnum]
>>> inst.predicate.select(tree)
Tree('VBD', ['rose'])
>>> for (argloc, argid) in inst.arguments:
... print('%-10s %s' % (argid, argloc.select(tree).pformat(500)[:50]))
ARG1 (NP-SBJ (NP (DT The) (NN yield)) (PP (IN on) (NP (
ARGM-DIS (PP (IN for) (NP (NN example)))
ARG4-to (PP-DIR (TO to) (NP (CD 8.04) (NN %)))
ARG3-from (PP-DIR (IN from) (NP (CD 7.90) (NN %)))
Propbank tree pointers can be converted to standard tree locations,
which are usually easier to work with, using the `treepos()` method:
>>> treepos = inst.predicate.treepos(tree)
>>> print(treepos, tree[treepos])
(4, 0) (VBD rose)
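The relationship between a `(wordnum, height)` pointer and a tree position can be sketched without NLTK. The example below is an illustration under stated assumptions: the tree is a toy nested tuple rather than an NLTK `Tree`, it contains no traces, and `leaf_positions`, `pointer_to_treepos` and `subtree_at` are hypothetical helper names. In the real corpus, empty elements such as traces also count as tokens.

```python
# Toy tree: each node is (label, child, ...); leaves are plain strings.
# This is an illustrative stand-in, not the NLTK Tree class.
TOY_TREE = (
    "S",
    ("NP-SBJ", ("DT", "The"), ("NN", "yield")),
    ("VP",
     ("VBD", "rose"),
     ("PP-DIR", ("TO", "to"), ("NP", ("CD", "8.04"), ("NN", "%")))),
)


def leaf_positions(node, prefix=()):
    """Yield the tree position of every leaf, left to right."""
    for index, child in enumerate(node[1:]):
        position = prefix + (index,)
        if isinstance(child, str):
            yield position
        else:
            yield from leaf_positions(child, position)


def pointer_to_treepos(tree, wordnum, height):
    """Convert a (wordnum, height) pointer to a tree position."""
    leaf = list(leaf_positions(tree))[wordnum]
    # Dropping the last index moves from the leaf to its part-of-speech
    # node (height 0); each further unit of height drops one more index.
    return leaf[: len(leaf) - height - 1]


def subtree_at(tree, position):
    """Follow a tree position down from the root."""
    node = tree
    for index in position:
        node = node[index + 1]  # +1 skips the label stored at index 0
    return node


verb_pos = pointer_to_treepos(TOY_TREE, 2, 0)
print(verb_pos, subtree_at(TOY_TREE, verb_pos))  # (1, 0) ('VBD', 'rose')
```

With a height of 1, the same lookup for token 3 ("to") returns `(1, 1)`, the position of the whole `PP-DIR` constituent.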
In some cases, argument locations will be encoded using
`PropbankChainTreePointer`\ s (for trace chains) or
`PropbankSplitTreePointer`\ s (for discontinuous constituents). Both
of these objects contain a single member variable, `pieces`,
containing a list of the constituent pieces. They also define the
method `select()`, which will return a tree containing all the
elements of the argument. (A new head node is created, labeled
"*CHAIN*" or "*SPLIT*", since the argument is not a single constituent
in the original tree.) Instance #6 contains an example of an argument
that is both discontinuous and contains a chain:
>>> inst = pb_instances[6]
>>> inst.roleset
'expose.01'
>>> argloc, argid = inst.arguments[2]
>>> argloc
<PropbankChainTreePointer: 22:1,24:0,25:1*27:0>
>>> argloc.pieces
[<PropbankSplitTreePointer: 22:1,24:0,25:1>, PropbankTreePointer(27, 0)]
>>> argloc.pieces[0].pieces
... # doctest: +NORMALIZE_WHITESPACE
[PropbankTreePointer(22, 1), PropbankTreePointer(24, 0),
PropbankTreePointer(25, 1)]
>>> print(argloc.select(inst.tree))
(*CHAIN*
(*SPLIT* (NP (DT a) (NN group)) (IN of) (NP (NNS workers)))
(-NONE- *))
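The nesting of chain and split pointers can be read directly from the compact notation in the output above. The sketch below assumes that notation (`*` between chain pieces, `,` between split pieces, `wordnum:height` for a plain pointer) and returns nested tuples. The function `parse_pointer` is a hypothetical stand-in and does not build NLTK pointer objects.

```python
def parse_pointer(text):
    """Parse pointer notation into nested tuples (illustrative only)."""
    # A chain binds loosest, so split on '*' first.
    if "*" in text:
        return ("chain", [parse_pointer(part) for part in text.split("*")])
    # Within one chain piece, ',' separates the parts of a split argument.
    if "," in text:
        return ("split", [parse_pointer(part) for part in text.split(",")])
    wordnum, height = text.split(":")
    return ("tree", int(wordnum), int(height))


parsed = parse_pointer("22:1,24:0,25:1*27:0")
kind, pieces = parsed
print(kind)       # chain
print(pieces[0])  # ('split', [('tree', 22, 1), ('tree', 24, 0), ('tree', 25, 1)])
print(pieces[1])  # ('tree', 27, 0)
```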
The PropBank Corpus also provides access to the frameset files, which
define the argument labels used by the annotations, on a per-verb
basis. Each frameset file contains one or more predicates, such as
'turn' or 'turn_on', each of which is divided into coarse-grained word
senses called rolesets. For each roleset, the frameset file provides
descriptions of the argument roles, along with examples.
>>> expose_01 = propbank.roleset('expose.01')
>>> turn_01 = propbank.roleset('turn.01')
>>> print(turn_01) # doctest: +ELLIPSIS
<Element 'roleset' at ...>
>>> for role in turn_01.findall("roles/role"):
... print(role.attrib['n'], role.attrib['descr'])
0 turner
1 thing turning
m direction, location
>>> from xml.etree import ElementTree
>>> print(ElementTree.tostring(turn_01.find('example')).decode('utf8').strip())
<example name="transitive agentive">
<text>
John turned the key in the lock.
</text>
<arg n="0">John</arg>
<rel>turned</rel>
<arg n="1">the key</arg>
<arg f="LOC" n="m">in the lock</arg>
</example>
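Because a roleset is an ordinary `ElementTree` element, its role descriptions can be joined with the arguments of an example. The sketch below uses an inline XML string that copies the roles and example shown above. The `id` attribute on the root element and the helper name `describe_example` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for one roleset; the real data lives in the frameset files.
ROLESET_XML = """
<roleset id="turn.01">
  <roles>
    <role n="0" descr="turner"/>
    <role n="1" descr="thing turning"/>
    <role n="m" descr="direction, location"/>
  </roles>
  <example name="transitive agentive">
    <text>John turned the key in the lock.</text>
    <arg n="0">John</arg>
    <rel>turned</rel>
    <arg n="1">the key</arg>
    <arg f="LOC" n="m">in the lock</arg>
  </example>
</roleset>
"""


def describe_example(roleset):
    """Pair each argument of the first example with its role description."""
    descriptions = {
        role.attrib["n"]: role.attrib["descr"]
        for role in roleset.findall("roles/role")
    }
    example = roleset.find("example")
    return [
        (arg.text, descriptions[arg.attrib["n"]])
        for arg in example.findall("arg")
    ]


roleset = ET.fromstring(ROLESET_XML)
labelled = describe_example(roleset)
for text, description in labelled:
    print("%-12s %s" % (text, description))
```

The same function should work on the element returned by `propbank.roleset('turn.01')`, provided every `arg` in the example has an `n` value that appears among the roles.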
Note that the standard corpus distribution contains only 10% of the
treebank, so parse trees are not available for instances from index
9353 onward:
>>> inst = pb_instances[9352]
>>> inst.fileid
'wsj_0199.mrg'
>>> print(inst.tree) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
(S (NP-SBJ (NNP Trinity)) (VP (VBD said) (SBAR (-NONE- 0) ...))
>>> print(inst.predicate.select(inst.tree))
(VB begin)
>>> inst = pb_instances[9353]
>>> inst.fileid
'wsj_0200.mrg'
>>> print(inst.tree)
None
>>> print(inst.predicate.select(inst.tree))
Traceback (most recent call last):
. . .
ValueError: Parse tree not avaialable
However, if you supply your own version of the treebank corpus (by
putting it before the nltk-provided version on `nltk.data.path`, or
by creating a `ptb` directory and using the `propbank_ptb` corpus
reader instead of `propbank`), you can access the trees for all
instances.
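When only the standard distribution is installed, code that walks over all instances needs to guard against a missing tree. The following minimal sketch relies only on the `tree` attribute being `None` in that case, as shown above. The helper `split_by_tree_availability` is hypothetical, and the objects it is run on here are stand-ins, not real PropBank instances.

```python
from types import SimpleNamespace


def split_by_tree_availability(instances):
    """Separate instances that have a parse tree from those that do not."""
    with_tree, without_tree = [], []
    for inst in instances:
        (without_tree if inst.tree is None else with_tree).append(inst)
    return with_tree, without_tree


# Stand-in objects for illustration only; real instances come from
# propbank.instances().
fake_instances = [
    SimpleNamespace(fileid="wsj_0199.mrg", tree="(S ...)"),
    SimpleNamespace(fileid="wsj_0200.mrg", tree=None),
]
usable, skipped = split_by_tree_availability(fake_instances)
print([inst.fileid for inst in usable])   # ['wsj_0199.mrg']
print([inst.fileid for inst in skipped])  # ['wsj_0200.mrg']
```

Checking `inst.tree` before calling `select()` avoids the `ValueError` shown above.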
A list of the verb lemmas contained in PropBank is returned by the
`propbank.verbs()` method:
>>> propbank.verbs()
['abandon', 'abate', 'abdicate', 'abet', 'abide', ...]