You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
424 lines
31 KiB
HTML
424 lines
31 KiB
HTML
5 years ago
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||
|
<html>
|
||
|
<head>
|
||
|
<title>pattern-search</title>
|
||
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
||
|
<link type="text/css" rel="stylesheet" href="../clips.css" />
|
||
|
<style>
|
||
|
/* Small fixes because we omit the online layout.css. */
|
||
|
h3 { line-height: 1.3em; }
|
||
|
#page { margin-left: auto; margin-right: auto; }
|
||
|
#header, #header-inner { height: 175px; }
|
||
|
#header { border-bottom: 1px solid #C6D4DD; }
|
||
|
table { border-collapse: collapse; }
|
||
|
#checksum { display: none; }
|
||
|
</style>
|
||
|
<link href="../js/shCore.css" rel="stylesheet" type="text/css" />
|
||
|
<link href="../js/shThemeDefault.css" rel="stylesheet" type="text/css" />
|
||
|
<script language="javascript" src="../js/shCore.js"></script>
|
||
|
<script language="javascript" src="../js/shBrushXml.js"></script>
|
||
|
<script language="javascript" src="../js/shBrushJScript.js"></script>
|
||
|
<script language="javascript" src="../js/shBrushPython.js"></script>
|
||
|
</head>
|
||
|
<body class="node-type-page one-sidebar sidebar-right section-pages">
|
||
|
<div id="page">
|
||
|
<div id="page-inner">
|
||
|
<div id="header"><div id="header-inner"></div></div>
|
||
|
<div id="content">
|
||
|
<div id="content-inner">
|
||
|
<div class="node node-type-page"
|
||
|
<div class="node-inner">
|
||
|
<div class="breadcrumb">View online at: <a href="http://www.clips.ua.ac.be/pages/pattern-search" class="noexternal" target="_blank">http://www.clips.ua.ac.be/pages/pattern-search</a></div>
|
||
|
<h1>pattern.search</h1>
|
||
|
<!-- Parsed from the online documentation. -->
|
||
|
<div id="node-1357" class="node node-type-page"><div class="node-inner">
|
||
|
<div class="content">
|
||
|
<p class="big">The pattern.search module has a pattern matching system similar to regular expressions, that can be used to search a string by syntax (word function) or by semantics (word meaning).<span class="blue"> </span></p>
|
||
|
<p>It can be used by itself or with other <a href="pattern.html">pattern</a> modules: <a href="pattern-web.html">web</a> | <a href="pattern-db.html">db</a> | <a href="pattern-en.html">en</a> | search <span class="blue"> </span>| <a href="pattern-vector.html">vector</a> | <a href="pattern-graph.html">graph</a>.</p>
|
||
|
<p><img src="../g/pattern_schema.gif" alt="" width="620" height="180" /></p>
|
||
|
<hr />
|
||
|
<h2>Documentation</h2>
|
||
|
<ul>
|
||
|
<li><a href="#introduction">Searching + matching in a nutshell</a></li>
|
||
|
<li><a href="#pattern">Pattern</a></li>
|
||
|
<li><a href="#constraint">Constraint</a></li>
|
||
|
<li><a href="#match">Match</a></li>
|
||
|
<li><a href="#taxonomy">Taxonomy</a></li>
|
||
|
<li><a href="#utility">Utility functions</a></li>
|
||
|
</ul>
|
||
|
<p> </p>
|
||
|
<hr />
|
||
|
<h2><a name="introduction"></a>Searching + matching in a nutshell</h2>
|
||
|
<p>The <span class="inline_code">search()</span> function takes a string (e.g., a word or a sequence of words) and returns a list of non-overlapping matches in the given sentence. The <span class="inline_code">match()</span> function returns the first match, or <span class="inline_code">None</span>. Both functions call <span class="inline_code">compile()</span>, which takes a string and returns a <span class="inline_code">Pattern</span> object.</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">search(pattern, sentence)</pre><pre class="brush:python; gutter:false; light:true;">match(pattern, sentence)</pre><pre class="brush:python; gutter:false; light:true;">compile(pattern)</pre><div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import search
|
||
|
>>> print search('rabbit', 'big white rabbit')
|
||
|
|
||
|
[Match(words=[Word('rabbit')])]</pre></div>
|
||
|
<p>Search strings can contain a wildcard character at the <span class="inline_code">*start</span>, at the <span class="inline_code">end*</span>, at <span class="inline_code">*both*</span> ends or <span class="inline_code">in*between</span>:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> print search('rabbit*', 'big white rabbit')
|
||
|
>>> print search('rabbit*', 'big white rabbits')
|
||
|
|
||
|
[Match(words=[Word('rabbit')])]
|
||
|
[Match(words=[Word('rabbits')])]
|
||
|
</pre></div>
|
||
|
<p>Search strings can contain multiple options separated by a vertical dash:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> print search('rabbit|cony|bunny', 'big black bunny')
|
||
|
|
||
|
[Match(words=[Word('bunny')])]</pre></div>
|
||
|
<h3>Syntactical pattern matching</h3>
|
||
|
<p>The examples above can also be resolved with (faster) regular expressions. The pattern.search module is more useful with <em>parsed</em> sentences. The pattern.en module has a <a class="link-maintenance" href="pattern-en.html#parser">parser</a> that takes a string and assigns a part-of-speech tag to each word (e.g., <span class="postag">NN</span> = noun, <span class="postag">VB</span> = verb, <span class="postag">JJ</span> = adjective). The parser also groups words into chunks (e.g., <span class="postag">JJ</span> + <span class="postag">NN</span> = <span class="postag">NP</span> = noun phrase) and finds word lemmata (was → be).</p>
|
||
|
<p>A parsed <span class="inline_code">Sentence</span> or <span class="inline_code">Text</span> can be searched by part-of-speech tags:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import search
|
||
|
>>> from pattern.en import parsetree
|
||
|
>>>
|
||
|
>>> t = parsetree('big white rabbit')
|
||
|
>>> print t
|
||
|
>>> print
|
||
|
>>> print search('JJ', t) # all adjectives
|
||
|
>>> print search('NN', t) # all nouns
|
||
|
>>> print search('NP', t) # all noun phrases
|
||
|
|
||
|
[Sentence('big/JJ/B-NP/O white/JJ/I-NP/O rabbit/NN/I-NP/O')]
|
||
|
|
||
|
[Match(words=[Word(u'big/JJ')]), Match(words=[Word(u'white/JJ')])]
|
||
|
[Match(words=[Word(u'rabbit/NN')])]
|
||
|
[Match(words=[Word(u'big/JJ'), Word(u'white/JJ'), Word(u'rabbit/NN')])]</pre></div>
|
||
|
<h3>Semantical pattern matching</h3>
|
||
|
<p>A <span class="inline_code">Taxonomy</span> can be used to define semantical categories of words. Say we want to extract flower names from a text. The search pattern is rather clumsy: <span class="inline_code">"rose|lily|daisy|daffodil|begonia"</span>. A more robust approach is to work with a taxonomy:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import search, taxonomy
|
||
|
>>> from pattern.en import parsetree
|
||
|
>>>
|
||
|
>>> for f in ('rose', 'lily', 'daisy', 'daffodil', 'begonia'):
|
||
|
>>> taxonomy.append(f, type='flower')
|
||
|
>>>
|
||
|
>>> t = parsetree('A field of white daffodils.', lemmata=True)
|
||
|
>>> print t
|
||
|
>>> print
|
||
|
>>> print search('FLOWER', t)
|
||
|
|
||
|
[Sentence('A/DT/B-NP/O/a field/NN/I-NP/O/field of/IN/B-PP/B-PNP/of'
|
||
|
'white/JJ/B-NP/I-PNP/white daffodils/NNS/I-NP/I-PNP/daffodil ././O/O/.')]
|
||
|
|
||
|
[Match(words=[Word(u'white/JJ'), Word(u'daffodils/NNS')])]
|
||
|
</pre></div>
|
||
|
<p>Note how the search pattern has <span class="inline_code">"FLOWER"</span> in uppercase. Since <span class="inline_code">search()</span> is case-insensitive, uppercase words are recognized as taxonomy terms (i.e., <span class="postag">FLOWER</span> = rose + lily + daisy + daffodil + begonia). Furthermore, since lemmata were parsed, <em>daffodils</em> is recognized as the plural form of <em>daffodil</em> (the lemma), and as such also part of <span class="postag">FLOWER</span>.</p>
|
||
|
<p>Note that the returned match is <em>white daffodils</em>. Since <span class="inline_code">search()</span> is (by default) greedy, the whole <span class="postag">NP</span> chunk is matched. In other words, <em>white daffodils</em> is regarded as a more specific instance of <em>daffodil</em>.</p>
|
||
|
<p> </p>
|
||
|
<hr />
|
||
|
<h2><a name="pattern"></a>Pattern</h2>
|
||
|
<p>A <span class="inline_code">Pattern</span> is a sequence of constraints that matches certain phrases in a (parsed) sentence. Each constraint can match a word in the sentence. If a number of successive words corresponds to the entire sequence of constraints, the phrase is a match. The search is case-insensitive.</p>
|
||
|
<p>Constraints can be constructed for syntax (e.g., find all adjectives) and semantics (e.g., find all product names). For example, <span class="inline_code">Pattern.fromstring("NP</span> <span class="inline_code">be</span> <span class="inline_code">*</span> <span class="inline_code">than</span> <span class="inline_code">NP")</span> matches phrases such as <em><span style="text-decoration: underline;">the cat</span> was faster than <span style="text-decoration: underline;">the mouse</span></em>, and <em><span style="text-decoration: underline;">Chuck Norris</span> is cooler than <span style="text-decoration: underline;">Dolph Lundgren</span></em>, since <span class="postag">NP</span> matches any noun phrase.<em> </em>With <span class="inline_code">TAXONOMY</span>, the global <span class="inline_code">taxonomy</span> is used to categorize words.</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">pattern = Pattern(sequence=[])</pre><pre class="brush:python; gutter:false; light:true;">pattern = Pattern.fromstring(string, taxonomy=TAXONOMY)</pre><pre class="brush:python; gutter:false; light:true;">pattern.sequence # List of Constraint objects.
|
||
|
pattern.groups # List of groups, each a list of Constraint objects.
|
||
|
pattern.strict # Disable greedy matching?
|
||
|
</pre><pre class="brush:python; gutter:false; light:true;">pattern.scan(string)
|
||
|
pattern.search(sentence)
|
||
|
pattern.match(sentence, start=0)</pre><ul>
|
||
|
<li><span class="inline_code">Pattern.scan()</span> returns <span class="inline_code">True</span> if <span class="inline_code">Sentence(string)</span> <em>may</em> yield matches.<br />It can be faster to scan a tagged string, before casting it to a <span class="inline_code">Sentence</span> or <span class="inline_code">Text</span> and searching it. </li>
|
||
|
<li><span class="inline_code">Pattern.search()</span> returns a list of <span class="inline_code">Match</span> objects from the given sentence.</li>
|
||
|
<li><span class="inline_code">Pattern.match()</span> returns the first <span class="inline_code">Match</span> found in the given sentence, or <span class="inline_code">None</span>.</li>
|
||
|
</ul>
|
||
|
<div>For example:</div>
|
||
|
<div class="example">
|
||
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.search import Pattern
|
||
|
>>> from pattern.en import parsetree
|
||
|
>>>
|
||
|
>>> t = parsetree('Chuck Norris is cooler than Dolph Lundgren.', lemmata=True)
|
||
|
>>> p = Pattern.fromstring('{NP} be * than {NP}')
|
||
|
>>> m = p.match(t)
|
||
|
>>> print m.group(1)
|
||
|
>>> print m.group(2)
|
||
|
|
||
|
[Word(u'Chuck/NNP'), Word(u'Norris/NNP')]
|
||
|
[Word(u'Dolph/NNP'), Word(u'Lundgren/NNP')]</pre></div>
|
||
|
<p> </p>
|
||
|
<hr />
|
||
|
<h2><a name="constraint"></a>Constraint</h2>
|
||
|
<p>A <span class="inline_code">Constraint</span> matches a set of (tagged) words and taxonomy terms. For example:</p>
|
||
|
<ul>
|
||
|
<li><span class="inline_code">Constraint.fromstring('with|of')</span> matches either <em>with</em> or <em>of</em>.</li>
|
||
|
<li><span class="inline_code">Constraint.fromstring('JJ?')</span> matches any adjective tagged <span class="postag">JJ</span>, but it is optional.</li>
|
||
|
<li><span class="inline_code">Constraint.fromstring('NP|SBJ')</span> matches subject noun phrases.</li>
|
||
|
<li><span class="inline_code">Constraint.fromstring('QUANTITY')</span> matches siblings of <span class="postag">QUANTITY</span> in the taxonomy.</li>
|
||
|
</ul>
|
||
|
<pre class="brush:python; gutter:false; light:true;">constraint = Constraint(
|
||
|
words = [],
|
||
|
tags = [],
|
||
|
chunks = [],
|
||
|
roles = [],
|
||
|
taxa = [],
|
||
|
optional = False,
|
||
|
multiple = False,
|
||
|
first = False,
|
||
|
taxonomy = TAXONOMY,
|
||
|
exclude = None,
|
||
|
custom = None )</pre><pre class="brush:python; gutter:false; light:true;">constraint = Constraint.fromstring(string, **kwargs)</pre><pre class="brush:python; gutter:false; light:true;">constraint.index
|
||
|
constraint.string
|
||
|
constraint.words # List of allowed words/lemmata (of, with, ...)
|
||
|
constraint.tags # List of allowed parts-of-speech (NN, JJ, ...)
|
||
|
constraint.chunks # List of allowed chunk types (NP, VP, ...)
|
||
|
constraint.roles # List of allowed chunk roles (SBJ, OBJ, ...)
|
||
|
constraint.taxa # List of allowed taxonomy terms.
|
||
|
constraint.taxonomy # Taxonomy used for lookup.
|
||
|
constraint.optional # True => matches zero or one word.
|
||
|
constraint.multiple # True => matches one or more words.
|
||
|
constraint.first # True => can only match first word.
|
||
|
constraint.exclude # None, or Constraint of disallowed options.
|
||
|
constraint.custom # function(word) returns True if match. </pre><pre class="brush:python; gutter:false; light:true;">constraint.match(word)</pre><h3>Constraint string syntax</h3>
|
||
|
<p><span class="inline_code">Constraint.fromstring()</span> returns a new <span class="inline_code">Constraint</span> from the given string. It takes the same optional parameters as the constructor. Uppercase words in the given string indicate a <a class="link-maintenance" href="MBSP-tags.html">part-of-speech tag</a> (e.g., <span class="postag">NN</span>, <span class="postag">JJ</span>, <span class="postag">VP</span>) or a taxonomy term (e.g. <span class="postag">PRODUCT</span>, <span class="postag">PERSON</span>).</p>
|
||
|
<p>Some characters like <span class="inline_code">|</span> or <span class="inline_code">?</span> are special. They affect how the constraint is interpreted:</p>
|
||
|
<table class="border">
|
||
|
<tbody>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="smallcaps">Character</span></td>
|
||
|
<td><span class="smallcaps">Example</span></td>
|
||
|
<td><span class="smallcaps">Description</span></td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">(</span></td>
|
||
|
<td><span class="inline_code">(JJ)</span></td>
|
||
|
<td>Wrapper for an optional constraint (deprecated).</td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">[</span></td>
|
||
|
<td><span class="inline_code">[Mac OS X | Windows Vista]</span></td>
|
||
|
<td>Wrapper for a constraint that has spaces.<span class="inline_code"> </span></td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">{</span></td>
|
||
|
<td><span class="inline_code">DT {JJ?} NN</span></td>
|
||
|
<td>Wrapper for match groups, e.g., <span class="inline_code">Match.group(1)</span>.</td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">_</span></td>
|
||
|
<td><span class="inline_code">Windows_Vista</span></td>
|
||
|
<td>Converted to a space.<span class="inline_code"> </span></td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">|</span></td>
|
||
|
<td><span class="inline_code">ADJP|ADVP</span></td>
|
||
|
<td>Separator for different options.</td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">*</span></td>
|
||
|
<td><span class="inline_code">JJ*</span></td>
|
||
|
<td>Used as a wildcard character. <span class="inline_code"> </span></td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">!</span></td>
|
||
|
<td><span class="inline_code">!be|VB*</span></td>
|
||
|
<td>Used in front of words/tags that are <span style="text-decoration: underline;">not</span> allowed.</td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">?</span></td>
|
||
|
<td><span class="inline_code">JJ?</span></td>
|
||
|
<td>Used as a suffix, constraint is optional.</td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">+</span></td>
|
||
|
<td><span class="inline_code">RB|JJ+</span> or <span class="inline_code">JJ?+</span> or <span class="inline_code">*+</span></td>
|
||
|
<td>Used as a suffix, constraint can span multiple words.<span class="inline_code"> </span></td>
|
||
|
</tr>
|
||
|
<tr>
|
||
|
<td style="text-align: center;"><span class="inline_code">^</span></td>
|
||
|
<td><span class="inline_code">^hello</span></td>
|
||
|
<td>Used as a prefix, constraint can only match first word.</td>
|
||
|
</tr>
|
||
|
</tbody>
|
||
|
</table>
|
||
|
<p>The characters listed in the table must be escaped if used as content (e.g., <span class="inline_code">"\?"</span>). You can use the module's <span class="inline_code">escape()</span> function. For example, <span class="inline_code">escape("hello?")</span> returns <span class="inline_code">"hello\?"</span>.</p>
|
||
|
<h3>Constraint matching</h3>
|
||
|
<p><span class="inline_code">Constraint.match()</span> returns <span class="inline_code">True</span> if the given string or <span class="inline_code">Word</span> is part of the constraint:</p>
|
||
|
<ul>
|
||
|
<li>the word (or its lemma) occurs in <span class="inline_code">Constraint.words</span>, OR,</li>
|
||
|
<li>the word (or its lemma) occurs in <span class="inline_code">Constraint.taxa</span> taxonomy tree, AND</li>
|
||
|
<li>the word tags and/or chunk tags match those defined in the constraint.</li>
|
||
|
</ul>
|
||
|
<p>It is case-insensitive. Individual terms in <span class="inline_code">Constraint.words</span> can contain wildcards (<span class="inline_code">*</span>). Some part-of-speech-tags can also contain wildcards: <span class="postag">NN*</span>, <span class="postag">VB*</span>, <span class="postag">JJ*</span>, <span class="postag">RB*</span>, <span class="postag">PR*</span>, <span class="postag">WP*</span>.</p>
|
||
|
<p>The following example demonstrates the use of optional and multiple constraints:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import search
|
||
|
>>> from pattern.en import parsetree
|
||
|
>>>
|
||
|
>>> t = parsetree('tasty cat food')
|
||
|
>>> print t
|
||
|
>>> print
|
||
|
>>> print search('DT? RB? JJ? NN+', t)
|
||
|
|
||
|
[Sentence('tasty/JJ/B-NP/O cat/NN/I-NP/O food/NN/I-NP/O')]
|
||
|
|
||
|
[Match(words=[Word(u'tasty/JJ'), Word(u'cat/NN')]), Word(u'food/NN')])]</pre></div>
|
||
|
<p>The pattern matches successive nouns (<span class="postag">NN</span>), optionally preceded by a determiner (<span class="postag">DT</span>), adverb (<span class="postag">RB</span>) and/or adjective (<span class="postag">JJ</span>). It matches anything from <em>food</em> to <em>cat food</em>, <em>tasty cat food</em>, <em>the tasty cat food</em>, etc.</p>
|
||
|
<h3>Constraint = greedy</h3>
|
||
|
<p>The pattern.en parser groups words that belong together into chunks. For example, <em>the black cat</em> is one chunk, tagged <span class="postag">NP</span> (i.e., a noun phrase). The head of the chunk is <em>cat</em>. By default, when a constraint matches the chunk head, it will greedily match the entire chunk. This means that if we search for <em>cat</em> and the sentence has <em>a big black cat</em>, the entire chunk will be returned.</p>
|
||
|
<p>This behavior can be disabled by passing a <span class="inline_code">STRICT</span> flag to <span class="inline_code">Pattern</span>, <span class="inline_code">compile()</span>, <span class="inline_code">search()</span> or <span class="inline_code">match()</span>:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import search, STRICT
|
||
|
>>> from pattern.en import parsetree
|
||
|
>>>
|
||
|
>>> t = parsetree('The black cat is lurking in the tree.')
|
||
|
>>> print search('cat', t)
|
||
|
|
||
|
[Match(words=[Word(u'The/DT'), Word(u'black/JJ'), Word(u'cat/NN')])]
|
||
|
</pre></div>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> print search('cat', t, STRICT)
|
||
|
|
||
|
[Match(words=[Word(u'cat/NN')])]
|
||
|
</pre></div>
|
||
|
<p> </p>
|
||
|
<hr />
|
||
|
<h2><a name="match"></a>Match</h2>
|
||
|
<p><span class="inline_code">Pattern.search()</span> returns a list of <span class="inline_code">Match</span> objects, where each match is a list of successive <span class="inline_code">Word</span> objects.</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">match = Match(pattern, words=[])</pre><pre class="brush:python; gutter:false; light:true;">match.pattern # Pattern source.
|
||
|
match.words # List of Word objects.
|
||
|
match.string # String of words separated with a space.
|
||
|
match.start # Index of first word in sentence.
|
||
|
match.stop # Index of last word in sentence + 1.</pre><pre class="brush:python; gutter:false; light:true;">match.group(index, chunked=False)
|
||
|
match.constraint(word)
|
||
|
match.constraints(chunk)
|
||
|
match.constituents(constraint=None)</pre><ul>
|
||
|
<li><span class="inline_code">Match.group()</span> returns a list of <span class="inline_code">Word</span> objects matching the constraints in a <span class="inline_code">{</span> <span class="inline_code">}</span> group.</li>
|
||
|
<li><span class="inline_code">Match.constraint()</span> returns the <span class="inline_code">Constraint</span> that matched the given <span class="inline_code">Word</span>, or <span class="inline_code">None</span>.</li>
|
||
|
<li><span class="inline_code">Match.constraints()</span> returns the list of constraints that matched the given <span class="inline_code">Chunk</span>.</li>
|
||
|
<li><span class="inline_code">Match.constituents()</span> returns a list of <span class="inline_code">Word</span> and <span class="inline_code">Chunk</span> objects, with successive words grouped into chunks whenever possible. Optionally, returns only chunks/words that matched the given <span class="inline_code">Constraint</span> (or list of constraints). Chunks are only available if a <span class="inline_code">Sentence</span> or <span class="inline_code">Text</span> was given (i.e., not for plain strings).</li>
|
||
|
</ul>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import match
|
||
|
>>> from pattern.en import parsetree
|
||
|
>>>
|
||
|
>>> t = parsetree('The turtle was faster than the hare.', lemmata=True)
|
||
|
>>> m = match('NP be ADJP|ADVP than NP', t)
|
||
|
>>>
|
||
|
>>> for w in m.words:
|
||
|
>>> print w, '\t =>', m.constraint(w)
|
||
|
|
||
|
Word(u'The/DT') => Constraint(chunks=['NP'])
|
||
|
Word(u'turtle/NN') => Constraint(chunks=['NP'])
|
||
|
Word(u'was/VBD') => Constraint(words=['be'])
|
||
|
Word(u'faster/RBR') => Constraint(chunks=['ADJP', 'ADVP'])
|
||
|
Word(u'than/IN') => Constraint(words=['than'])
|
||
|
Word(u'the/DT') => Constraint(chunks=['NP'])
|
||
|
Word(u'hare/NN') => Constraint(chunks=['NP'])
|
||
|
</pre></div>
|
||
|
<h3>Match groups</h3>
|
||
|
<p>Match groups in the search pattern can be used to quickly retrieve what you need from a <span class="inline_code">Match</span>:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> t = parsetree('the big black dog')
|
||
|
>>> m = match('DT {JJ?+ NN}', t)
|
||
|
>>> print m.group(0) # full pattern
|
||
|
>>> print m.group(1) # {JJ?+ NN}
|
||
|
>>> print m.group(1).string
|
||
|
|
||
|
[Word(u'the/DT'), Word(u'big/JJ'), Word(u'black/JJ'), Word(u'dog/NN')]
|
||
|
[Word(u'big/JJ'), Word(u'black/JJ'), Word(u'dog/NN')]
|
||
|
'big black dog'</pre></div>
|
||
|
<h3>Match words</h3>
|
||
|
<p>Each <span class="inline_code">Word</span> in a <span class="inline_code">Match</span> or <span class="inline_code">Match.group()</span> has the following attributes:</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">word = Word(sentence, string, tag=None, index=0)</pre><pre class="brush:python; gutter:false; light:true;">word.string
|
||
|
word.tag # Part-of-speech tag (e.g. NN, JJ).
|
||
|
word.sentence # Sentence (a list of successive Words).
|
||
|
word.index # Sentence index.
|
||
|
</pre><p>When <span class="inline_code">search()</span> or <span class="inline_code">match()</span> is given a string, <span class="inline_code">Word</span> objects are created when the <span class="inline_code">Match</span> is returned. When given a parsed <span class="inline_code">Sentence</span>, <span class="inline_code">Word</span> objects are linked from the sentence. These have extra attributes. For an overview of <span class="inline_code">Sentence</span>, <span class="inline_code">Chunk</span> and <span class="inline_code">Word</span>, see the <a class="link-maintenance" href="pattern-en.html#tree">parse tree</a> documentation.</p>
|
||
|
<p> </p>
|
||
|
<hr />
|
||
|
<h2><a name="taxonomy"></a>Taxonomy</h2>
|
||
|
<p>A taxonomy is a hierarchical tree of words classified by semantic type. For example: a <em>begonia</em> is a <em>flower</em>, and a <em>flower</em> is a <em>plant</em>. Taxonomy terms can be used as constraints. For example, <span class="inline_code">"FLOWER"</span> will match <em>flower</em> as well as <em>begonia</em>, or any other flower that has been defined in the taxonomy. By default, constraints will retrieve terms from the global <span class="inline_code">taxonomy</span>.</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">taxonomy = Taxonomy()</pre><pre class="brush:python; gutter:false; light:true;">taxonomy.case_sensitive # False by default.
|
||
|
taxonomy.classifiers # List of Classifier objects.</pre><pre class="brush:python; gutter:false; light:true;">taxonomy.append(term, type=None)
|
||
|
taxonomy.remove(term)</pre><pre class="brush:python; gutter:false; light:true;">taxonomy.classify(term)
|
||
|
taxonomy.parents(term, recursive=False)
|
||
|
taxonomy.children(term, recursive=False)
|
||
|
</pre><ul>
|
||
|
<li><span class="inline_code">Taxonomy.classify()</span> returns the (most recent) semantic type for a given term.<br />If the term is not in the taxonomy, it will try <span class="inline_code">Taxonomy.classifiers</span> (see further).</li>
|
||
|
<li><span class="inline_code">Taxonomy.parents()</span> returns a list of all semantic types for the given term.</li>
|
||
|
<li><span class="inline_code">Taxonomy.children()</span> returns a list of all terms for the given semantic type.<br />With <span class="inline_code">recursive=True</span>, traverses the entire branch.</li>
|
||
|
</ul>
|
||
|
<p>For example:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import taxonomy, search
|
||
|
>>>
|
||
|
>>> taxonomy.append('chicken', type='food')
|
||
|
>>> taxonomy.append('chicken', type='bird')
|
||
|
>>> taxonomy.append('penguin', type='bird')
|
||
|
>>> taxonomy.append('bird', type='animal')
|
||
|
>>>
|
||
|
>>> print taxonomy.parents('chicken')
|
||
|
>>> print taxonomy.children('animal', recursive=True)
|
||
|
>>> print
|
||
|
>>> print search('FOOD', "I'm eating chicken.")
|
||
|
|
||
|
['bird', 'food']
|
||
|
['bird', 'penguin', 'chicken']
|
||
|
|
||
|
[Match(words=[Word('chicken')])]</pre></div>
|
||
|
<h3>Taxonomy classifier</h3>
|
||
|
<p>A <span class="inline_code">Classifier</span> offers a rule-based approach to enrich the taxonomy. If a term is not in the taxonomy, it will iterate its list of classifiers. Each classifier is a set of functions that can be customized to yield the semantic type of a term.</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">classifier = Classifier(
|
||
|
parents = lambda term: [],
|
||
|
children = lambda term: [])</pre><pre class="brush:python; gutter:false; light:true;">classifier.parents(term) # Returns a list of parents for a term.
|
||
|
classifier.children(term) # Returns a list of children for a term.
|
||
|
</pre><p>This is useful because taxonomy terms can't include wildcards (i.e., the <span class="inline_code">*</span> character is taken literally).</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import taxonomy, search
|
||
|
>>> from pattern.search import Classifier
|
||
|
>>>
|
||
|
>>> def parents(term):
|
||
|
>>> return ['quality'] if term.endswith('ness') else []
|
||
|
>>>
|
||
|
>>> taxonomy.classifiers.append(Classifier(parents))
|
||
|
>>> taxonomy.append('cat', type='animal')
|
||
|
>>>
|
||
|
>>> print search('QUALITY of a|an|the ANIMAL', 'the litheness of a cat')
|
||
|
|
||
|
[Match(words=[Word('litheness'), Word('of'), Word('a'), Word('cat')])]</pre></div>
|
||
|
<p>This example creates a classifier that tags words ending in <em>-ness</em> as <span class="postag">quality</span> (e.g., sharpness, greediness). This is more concise than manually adding all words ending in <em>-ness</em> to the taxonomy. The <span class="postag">quality</span> term is then used as a constraint. Remember to always define <span class="inline_code">Classifier.parents()</span>. For performance, <span class="inline_code">Classifier.children()</span> is not called in <span class="inline_code">Constraint.match()</span>.</p>
|
||
|
<h3 class="example">Taxonomy classifier from WordNet</h3>
|
||
|
<p class="example">The following example creates a rule-based taxonomy from the <span class="inline_code">pattern.en.wordnet</span> module:</p>
|
||
|
<div class="example">
|
||
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.search import taxonomy, WordNetClassifier
|
||
|
>>>
|
||
|
>>> taxonomy.classifiers.append(WordNetClassifier())
|
||
|
>>>
|
||
|
>>> print taxonomy.parents('cat', pos='NN')
|
||
|
>>> print taxonomy.parents('cat', pos='VB')
|
||
|
|
||
|
['feline']
|
||
|
['flog']</pre></div>
|
||
|
<table class="border">
|
||
|
<tbody>
|
||
|
<tr>
|
||
|
<td style="text-align: center;">
|
||
|
<p><br /><img src="../g/pattern-search-taxonomy.jpg" alt="" width="300" height="163" /></p>
|
||
|
<p><span style="display: inline !important;"><br /><span class="smallcaps">wordnet taxonomy example</span></span></p>
|
||
|
</td>
|
||
|
</tr>
|
||
|
</tbody>
|
||
|
</table>
|
||
|
<p> </p>
|
||
|
<hr />
|
||
|
<h2><a name="utility"></a>Utility functions</h2>
|
||
|
<p>The pattern.search module has a number of useful list functions:</p>
|
||
|
<pre class="brush:python; gutter:false; light:true;">unique(iterable) # Returns a new list with unique items.</pre><pre class="brush:python; gutter:false; light:true;">find(function, iterable) # Returns first item for which function(item) is True.</pre><pre class="brush:python; gutter:false; light:true;">product(iterable, repeat=1) # Returns a generator of all combinations of length n.</pre><pre class="brush:python; gutter:false; light:true;">variations(iterable, optional=lambda item: False)</pre><pre class="brush:python; gutter:false; light:true;">odict(items=[])</pre><ul>
|
||
|
<li><span class="inline_code">product()</span> returns a generator of all permutations, with replacement. <br />For example: <span class="inline_code">product([1,2,3),</span> <span class="inline_code">repeat=2)</span> yields:<br /><span class="inline_code">[1,1],</span> <span class="inline_code">[1,2],</span> <span class="inline_code">[1,3],</span> <span class="inline_code">[2,1],</span> <span class="inline_code">[2,2],</span> <span class="inline_code">[2,3],</span> <span class="inline_code">[3,1],</span> <span class="inline_code">[3,2],</span> <span class="inline_code">[3,3]</span></li>
|
||
|
<li><span class="inline_code">variations()</span> returns all variations of a sequence with optional items (in-order).</li>
|
||
|
<li><span class="inline_code">odict()</span> is a dictionary with ordered keys (e.g., like a stack).<br />The most recent keys will be returned first when traversing the dictionary.<br /><span class="inline_code">odict.push()</span> takes a <span class="inline_code">(key,</span> <span class="inline_code">value)</span>-tuple and sets the given key to the given value. If the key exists, it pushes the updated item to the top of the stack.</li>
|
||
|
</ul>
|
||
|
</div>
|
||
|
</div></div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
<script>
|
||
|
SyntaxHighlighter.all();
|
||
|
</script>
|
||
|
</body>
|
||
|
</html>
|