You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

733 lines
52 KiB
HTML

This file contains invisible Unicode characters!

This file contains invisible Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>pattern-en</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link type="text/css" rel="stylesheet" href="../clips.css" />
<style>
/* Small fixes because we omit the online layout.css. */
h3 { line-height: 1.3em; }
#page { margin-left: auto; margin-right: auto; }
#header, #header-inner { height: 175px; }
#header { border-bottom: 1px solid #C6D4DD; }
table { border-collapse: collapse; }
#checksum { display: none; }
</style>
<link href="../js/shCore.css" rel="stylesheet" type="text/css" />
<link href="../js/shThemeDefault.css" rel="stylesheet" type="text/css" />
<script language="javascript" src="../js/shCore.js"></script>
<script language="javascript" src="../js/shBrushXml.js"></script>
<script language="javascript" src="../js/shBrushJScript.js"></script>
<script language="javascript" src="../js/shBrushPython.js"></script>
</head>
<body class="node-type-page one-sidebar sidebar-right section-pages">
<div id="page">
<div id="page-inner">
<div id="header"><div id="header-inner"></div></div>
<div id="content">
<div id="content-inner">
<div class="node node-type-page"
<div class="node-inner">
<div class="breadcrumb">View online at: <a href="http://www.clips.ua.ac.be/pages/pattern-en" class="noexternal" target="_blank">http://www.clips.ua.ac.be/pages/pattern-en</a></div>
<h1>pattern.en</h1>
<!-- Parsed from the online documentation. -->
<div id="node-1383" class="node node-type-page"><div class="node-inner">
<div class="content">
<p class="big">The pattern.en module contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment analysis, tools for English verb conjugation and noun singularization &amp; pluralization, and a WordNet interface.</p>
<p>It can be used by itself or with other <a href="pattern.html">pattern</a> modules: <a href="pattern-web.html">web</a> | <a href="pattern-db.html">db</a>&nbsp;| en | <a href="pattern-search.html">search</a> | <a href="pattern-vector.html">vector</a> | <a href="pattern-graph.html">graph</a>.</p>
<p><img src="../g/pattern_schema.gif" alt="" width="620" height="180" /></p>
<hr />
<h2>Documentation</h2>
<ul>
<li><a href="#article">Indefinite article</a></li>
<li><a href="#pluralization">Pluralization + singularization</a></li>
<li><a href="#comparative">Comparative + superlative</a></li>
<li><a href="#conjugation">Verb conjugation</a></li>
<li><a href="#quantify">Quantification</a></li>
<li><a href="#spelling">Spelling</a></li>
<li><a href="#ngram">n-grams</a></li>
<li><a href="#parser">Parser</a>&nbsp;<span class="smallcaps link-maintenance">(tokenizer, tagger, chunker)</span></li>
<li><a href="#tree">Parse trees</a></li>
<li><a href="#sentiment">Sentiment</a></li>
<li><a href="#modality">Mood &amp; modality</a></li>
<li><a href="#wordnet">WordNet</a></li>
<li><a href="#wordlist">Wordlists</a></li>
</ul>
<p>&nbsp;</p>
<hr />
<h2><a name="article"></a>Indefinite article</h2>
<p>The article is the most common determiner (<span class="postag">DT</span>) in English. It defines whether the successive noun is definite (<em><span style="text-decoration: underline;">the</span> cat</em>) or indefinite (<em><span style="text-decoration: underline;">a</span> cat</em>). The definite article is always <em>the</em>. The indefinite article can be&nbsp;<em>a</em> or <em>an</em>&nbsp;depending on how the successive noun is pronounced.</p>
<pre class="brush:python; gutter:false; light:true;">article(word, function=INDEFINITE) # DEFINITE | INDEFINITE</pre><pre class="brush:python; gutter:false; light:true;">referenced(word, article=INDEFINITE) # Returns article + word.
</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import referenced
&gt;&gt;&gt;
&gt;&gt;&gt; print referenced('university')
&gt;&gt;&gt; print referenced('hour')
a university
an hour</pre></div>
<p><span class="small"><span style="text-decoration: underline;">Reference</span>: Granger, M. (2006). <em>Ruby Linguistics Framework</em>, </span><span class="small">http://deveiate.org/projects/Linguistics</span></p>
<p>&nbsp;</p>
<hr />
<h2><a name="pluralization"></a>Pluralization + singularization</h2>
<p>The <span class="inline_code">pluralize()</span> function returns the plural form of a singular noun. The <span class="inline_code">singularize()</span> function returns the singular form of a plural noun. The <span class="inline_code">pos</span> parameter (part-of-speech) can be set to <span class="inline_code">NOUN</span> or <span class="inline_code">ADJECTIVE</span>, but only a small number of possessive adjectives inflect (e.g. <em>my</em><em>our</em>). The <span class="inline_code">custom</span> dictionary is for user-defined replacements. Accuracy of the algorithms is 96%.</p>
<pre class="brush:python; gutter:false; light:true;">pluralize(word, pos=NOUN, custom={}, classical=True)</pre><pre class="brush:python; gutter:false; light:true;">singularize(word, pos=NOUN, custom={})</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import pluralize, singularize
&gt;&gt;&gt;
&gt;&gt;&gt; print pluralize('child')
&gt;&gt;&gt; print singularize('wolves')
children
wolf
</pre></div>
<p><span class="small"><span style="text-decoration: underline;">Reference</span>: <br />Conway, D. (1998). An Algorithmic Approach to English Pluralization. <em>Proceedings of the 2nd Perl conference</em>.<br />Ferrer, B. (2005). <em>Inflector for Python</em>, http://www.bermi.org/projects/inflector</span></p>
<p>&nbsp;</p>
<hr />
<h2><a name="comparative"></a>Comparative + superlative</h2>
<p>The <span class="inline_code">comparative()</span> and <span class="inline_code">superlative()</span> functions give the comparative or superlative form of an adjective. Words with three or more syllables (e.g., <em>fantastic</em>) are simply preceded by <em>more</em> or <em>most</em>.</p>
<pre class="brush:python; gutter:false; light:true;">comparative(adjective) # big =&gt; bigger</pre><pre class="brush:python; gutter:false; light:true;">superlative(adjective) # big =&gt; biggest</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import comparative, superlative
&gt;&gt;&gt;
&gt;&gt;&gt; print comparative('bad')
&gt;&gt;&gt; print superlative('bad')
worse
worst
</pre></div>
<p>&nbsp;</p>
<hr />
<h2><a name="conjugation"></a>Verb conjugation</h2>
<p>The pattern.en module has a lexicon of 8,500 common English verbs and their conjugated forms (infinitive, 3rd singular present, present participle, past and past participle verbs such as <em>be</em>&nbsp;may have more forms). Some verbs can also be negated, including&nbsp;<em>be</em>, <em>can</em>, <em>do</em>, <em>will</em>, <em>must</em>, <em>have</em>, <em>may</em>, <em>need</em>, <em>dare</em>, <em>ought</em>.</p>
<pre class="brush:python; gutter:false; light:true;">conjugate(verb,
tense = PRESENT, # INFINITIVE, PRESENT, PAST, FUTURE
person = 3, # 1, 2, 3 or None
number = SINGULAR, # SG, PL
mood = INDICATIVE, # INDICATIVE, IMPERATIVE, CONDITIONAL, SUBJUNCTIVE
aspect = IMPERFECTIVE, # IMPERFECTIVE, PERFECTIVE, PROGRESSIVE
negated = False, # True or False
parse = True) </pre><pre class="brush:python; gutter:false; light:true;">lemma(verb) # Base form, e.g., are =&gt; be.</pre><pre class="brush:python; gutter:false; light:true;">lexeme(verb) # List of possible forms: be =&gt; is, was, ...</pre><pre class="brush:python; gutter:false; light:true;">tenses(verb) # List of possible tenses of the given form.
</pre><p>The&nbsp;<span class="inline_code">conjugate()</span> function takes the following optional parameters:</p>
<table class="border">
<tbody>
<tr>
<td style="text-align: left;"><span class="smallcaps">Tense</span></td>
<td style="text-align: left;"><span class="smallcaps">Person</span></td>
<td style="text-align: left;"><span class="smallcaps">Number</span></td>
<td style="text-align: left;"><span class="smallcaps">Mood</span></td>
<td style="text-align: left;"><span class="smallcaps">Aspect</span></td>
<td style="text-align: left;"><span class="smallcaps">Alias</span></td>
<td style="text-align: center;"><span class="smallcaps">Tag</span></td>
<td style="text-align: left;"><span class="smallcaps">Example</span></td>
</tr>
<tr>
<td><span class="inline_code">INFINITIVE</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">"inf"</span></td>
<td style="text-align: center;"><span class="postag">VB</span></td>
<td><em>be</em></td>
</tr>
<tr>
<td><span class="inline_code">PRESENT</span></td>
<td><span class="inline_code">1</span></td>
<td><span class="inline_code">SG</span></td>
<td><span class="inline_code">INDICATIVE</span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"1sg"</span></td>
<td style="text-align: center;"><span class="postag">VBP</span></td>
<td><em>I <span style="text-decoration: underline;">am</span></em></td>
</tr>
<tr>
<td><span class="inline_code">PRESENT</span></td>
<td><span class="inline_code">2</span></td>
<td><span class="inline_code">SG</span></td>
<td><span class="inline_code">INDICATIVE</span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"2sg"</span></td>
<td style="text-align: center;">&nbsp;·</td>
<td><em>you <span style="text-decoration: underline;">are</span></em></td>
</tr>
<tr>
<td><span class="inline_code">PRESENT</span></td>
<td><span class="inline_code">3</span></td>
<td><span class="inline_code">SG</span></td>
<td><span class="inline_code">INDICATIVE</span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"3sg"</span></td>
<td style="text-align: center;"><span class="postag">VBZ</span></td>
<td><em>he <span style="text-decoration: underline;">is</span></em></td>
</tr>
<tr>
<td><span class="inline_code">PRESENT</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">PL</span></td>
<td><span class="inline_code">INDICATIVE</span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"pl"</span></td>
<td style="text-align: center;">&nbsp;·</td>
<td><em>are</em></td>
</tr>
<tr>
<td><span class="inline_code">PRESENT</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">INDICATIVE</span></td>
<td><span class="inline_code">PROGRESSIVE</span></td>
<td><span class="inline_code">"part"</span></td>
<td style="text-align: center;"><span class="postag">VBG</span></td>
<td><em>being</em></td>
</tr>
<tr>
<td style="border-left: 0; border-right: 0; padding: 0;">&nbsp;</td>
</tr>
<tr>
<td><span class="inline_code">PAST</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">None</span></td>
<td><span class="inline_code">"p"</span></td>
<td style="text-align: center;"><span class="postag">VBD</span></td>
<td><em>were</em></td>
</tr>
<tr>
<td><span class="inline_code">PAST</span></td>
<td><span class="inline_code"><span>1</span></span></td>
<td><span class="inline_code"><span>PL</span></span></td>
<td><span class="inline_code">INDICATIVE</span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"1sgp"</span></td>
<td style="text-align: center;">&nbsp;·</td>
<td><em>I <span style="text-decoration: underline;">was</span></em></td>
</tr>
<tr>
<td><span class="inline_code">PAST</span></td>
<td><span class="inline_code"><span>2</span></span></td>
<td><span class="inline_code"><span>PL</span></span></td>
<td><span class="inline_code"><span>INDICATIVE</span></span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"2sgp"</span></td>
<td style="text-align: center;">&nbsp;·</td>
<td><em>you <span style="text-decoration: underline;">were</span></em></td>
</tr>
<tr>
<td><span class="inline_code">PAST</span></td>
<td><span class="inline_code"><span>3</span></span></td>
<td><span class="inline_code"><span>PL</span></span></td>
<td><span class="inline_code"><span>INDICATIVE</span></span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"3gp"</span></td>
<td style="text-align: center;">&nbsp;·</td>
<td><em>he <span style="text-decoration: underline;">was</span></em></td>
</tr>
<tr>
<td><span class="inline_code">PAST</span></td>
<td><span class="inline_code"><span>None</span></span></td>
<td><span class="inline_code"><span>PL</span></span></td>
<td><span class="inline_code"><span>INDICATIVE</span></span></td>
<td><span class="inline_code">IMPERFECTIVE</span></td>
<td><span class="inline_code">"ppl"</span></td>
<td style="text-align: center;">&nbsp;·</td>
<td><em>were</em></td>
</tr>
<tr>
<td style="text-align: left;"><span class="inline_code">PAST</span></td>
<td style="text-align: left;"><span><span>None</span></span></td>
<td style="text-align: left;"><span class="inline_code">None</span></td>
<td style="text-align: left;"><span class="inline_code">INDICATIVE</span></td>
<td style="text-align: left;"><span class="inline_code"><span>PROGRESSIVE</span></span></td>
<td style="text-align: left;"><span class="inline_code">"ppart"</span></td>
<td style="text-align: center;"><span class="postag">VBN</span></td>
<td style="text-align: left;"><em>been</em></td>
</tr>
</tbody>
</table>
<p>Instead of optional parameters, a single short alias, the part-of-speech tag, or&nbsp;<span class="inline_code">PARTICIPLE</span>&nbsp;or <span class="inline_code">PAST+PARTICIPLE</span> can also be given. With no parameters, the infinitive form of the verb is returned.</p>
<p>For example:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import conjugate, lemma, lexeme
&gt;&gt;&gt;
&gt;&gt;&gt; print lexeme('purr')
&gt;&gt;&gt; print lemma('purring')
&gt;&gt;&gt; print conjugate('purred', '3sg') # he / she / it
['purr', 'purrs', 'purring', 'purred']
purr
purrs
</pre></div>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import tenses, PAST, PL
&gt;&gt;&gt;
&gt;&gt;&gt; print 'p' in tenses('purred') # By alias.
&gt;&gt;&gt; print PAST in tenses('purred')
&gt;&gt;&gt; print (PAST, 1, PL) in tenses('purred')
True
True
True </pre></div>
<p><span class="small"><span style="text-decoration: underline;">Reference</span>: <em>XTAG English morphology</em> (1999), University of Pennsylvania, http://www.cis.upenn.edu/~xtag</span></p>
<p>&nbsp;<br /><span class="smallcaps">Rule-based conjugation</span></p>
<p>All verb functions have an optional <span class="inline_code">parse</span>&nbsp;parameter (<span class="inline_code">True</span> by default) that enables a rule-based parser for unknown verbs. This will not work for irregular verbs, and it is fragile for verbs ending in -e in the past tense, or the present participle. The overall accuracy of the algorithm is 91%.</p>
<p>With <span class="inline_code">parse=False</span>,&nbsp;<span class="inline_code">conjugate()</span>&nbsp;and&nbsp;<span class="inline_code">lemma()</span>&nbsp;yield&nbsp;<span class="inline_code">None</span>:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import verbs, conjugate, PARTICIPLE
&gt;&gt;&gt;
&gt;&gt;&gt; print 'google' in verbs.infinitives
&gt;&gt;&gt; print 'googled' in verbs.inflections
&gt;&gt;&gt;
&gt;&gt;&gt; print conjugate('googled', tense=PARTICIPLE, parse=False)
&gt;&gt;&gt; print conjugate('googled', tense=PARTICIPLE, parse=True)
False
False
None
googling
</pre></div>
<p>&nbsp;</p>
<hr />
<h2><a name="quantify"></a>Quantification</h2>
<p>The <span class="inline_code">number()</span> function returns a <span class="inline_code">float</span> or <span class="inline_code">int</span> parsed from the given (numeric) string. If no number can be parsed from the string, it returns <span class="inline_code">0</span>.</p>
<p>The <span class="inline_code">numerals()</span> function returns the given <span class="inline_code">int</span> or <span class="inline_code">float</span> as a string of numerals. By default, the fraction is rounded to two decimals.</p>
<p>The <span class="inline_code">quantify()</span> function returns a word count approximation. Two similar words are a <em>pair</em>, three to eight <em>several</em>, and so on. Words can be given as a list, a word → count dictionary, or as a single word + amount.</p>
<p>The <span class="inline_code">reflect()</span> function quantifies Python objects see the examples bundled with the module.</p>
<pre class="brush:python; gutter:false; light:true;">number(string) # "seventy-five point two" =&gt; 75.2</pre><pre class="brush:python; gutter:false; light:true;">numerals(n, round=2) # 2.245 =&gt; "two point twenty-five"</pre><pre class="brush:python; gutter:false; light:true;">quantify([word1, word2, ...], plural={})</pre><pre class="brush:python; gutter:false; light:true;">reflect(object, quantify=True, replace=[])
</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import quantify
&gt;&gt;&gt;
&gt;&gt;&gt; print quantify(['goose', 'goose', 'duck', 'chicken', 'chicken', 'chicken'])
&gt;&gt;&gt; print quantify({'carrot': 100, 'parrot': 20})
&gt;&gt;&gt; print quantify('carrot', amount=1000)
several chickens, a pair of geese and a duck
dozens of carrots and a score of parrots
hundreds of carrots
</pre></div>
<p>&nbsp;</p>
<hr />
<h2><a name="spelling"></a>Spelling</h2>
<p>The <span class="inline_code">suggest()</span> function returns a list of spelling suggestions for a given word. Each suggestion is a <span class="inline_code">(word,</span> <span class="inline_code">confidence)</span>-tuple. It is about 70% accurate.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">suggest(string)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.en import suggest
&gt;&gt;&gt; print suggest("parot")
[("part", 0.99), ("parrot", 0.01)]</pre></div>
<p><span class="small"><span style="text-decoration: underline;">Reference</span>: Norvig, P. (2007). <em>How to Write a Spelling Corrector</em>. http://norvig.com/spell-correct.html</span>&nbsp;</p>
<p>&nbsp;</p>
<hr />
<h2><em><a name="ngram"></a>n</em>-grams</h2>
<p>The <span class="inline_code">ngrams()</span> function returns&nbsp;a list of <em>n</em>-grams (i.e., tuples of <em>n</em> successive words) from the given string.&nbsp;Alternatively, you can supply a <span class="inline_code">Text</span> or <span class="inline_code">Sentence</span> object (see further). Punctuation marks are stripped from words, and&nbsp;<em>n</em>-grams will not run over sentence delimiters (i.e., .!?), unless <span class="inline_code">continuous</span> is <span class="inline_code">True</span>.</p>
<pre class="brush:python; gutter:false; light:true;">ngrams(string, n=3, punctuation=".,;:!?()[]{}`''\"@#$^&amp;*+-|=~_", continuous=False)</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import ngrams
&gt;&gt;&gt; print ngrams("I am eating pizza.", n=2) # bigrams
[('I', 'am'), ('am', 'eating'), ('eating', 'pizza')] </pre></div>
<p>&nbsp;</p>
<hr />
<h2><a name="parser"></a>Parser</h2>
<p>A parser identifies sentences, words and word types in a string of text. This involves tokenization (distinguishing between abbreviations and sentence breaks), part-of-speech tagging (annotating words with their type, e.g., is <em>can</em> a <span class="postag">noun</span> or a <span class="postag">verb</span>?) and chunking (grouping consecutive words that belong together). Parsing can be used to answer questions such as <em>who did what and why</em> and is useful in a wide range of text mining applications.&nbsp;The pattern.en parser uses a lexicon of a 100,000 known words and their part-of-speech <a class="link-maintenance" href="MBSP-tags.html" target="_blank">tag</a>, along with rules for unknown words based on word suffix (e.g., <em>-ly</em> = <span class="postag">ADVERB</span>) and context (surrounding words). This approach is fast but not always accurate, since many words are ambiguous and hard to capture with simple rules. The overall accuracy is about 95% (95.8% on WSJ portions 22-24). It is lower for informal language use (e.g., chat language).</p>
<p>The <span class="inline_code">parse()</span> function takes a string of text and returns a part-of-speech tagged Unicode string. Sentences in the output are separated by newline characters.</p>
<pre class="brush:python; gutter:false; light:true;">parse(string,
tokenize = True, # Split punctuation marks from words?
tags = True, # Parse part-of-speech tags? (NN, JJ, ...)
chunks = True, # Parse chunks? (NP, VP, PNP, ...)
relations = False, # Parse chunk relations? (-SBJ, -OBJ, ...)
lemmata = False, # Parse lemmata? (ate =&gt; eat)
encoding = 'utf-8' # Input string encoding.
tagset = None) # Penn Treebank II (default) or UNIVERSAL.
</pre><p>For example:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import parse
&gt;&gt;&gt; print parse('I eat pizza with a fork.')
I/PRP/B-NP/O eat/VBD/B-VP/O pizza/NN/B-NP/O with/IN/B-PP/B-PNP a/DT/B-NP/I-PNP
fork/NN/I-NP/I-PNP ././O/O
</pre></div>
<ul>
<li>With&nbsp;<span class="inline_code">tags</span><span class="inline_code">=True</span> each word is annotated with a part-of-speech tag.&nbsp;</li>
<li>With <span class="inline_code">chunks=True</span>&nbsp;each word is annotated with a chunk tag and a&nbsp;<span class="postag">PNP</span> tag (prepositional noun phrase, <span class="postag">PP</span> + <span class="postag">NP</span>). The <span class="inline_code postag">O</span> tag (= outside) means that the word is not part of a chunk.</li>
<li>With <span class="inline_code">relations=True</span>&nbsp;each word is annotated with a role tag (e.g., <span class="postag">-SBJ</span>&nbsp;for subject or -<span class="postag">OBJ</span>&nbsp;for).</li>
<li>With <span class="inline_code">lemmata=True</span> each word is annotated with its base form.&nbsp;</li>
<li>With <span class="inline_code">tokenize=False</span>, punctuation marks will not be separated from words. <br />The input string is expected to be tokenized beforehand, or sentence delimiters are not discovered.</li>
</ul>
<p><span class="small"><span style="text-decoration: underline;">Reference</span>: Brill, E. (1992). <em>A simple rule-based part of speech tagger.</em> ANLC '92 Proceedings.</span></p>
<h3>Parser tags</h3>
<p>Let's examine the word <em>fork</em> and the tags assigned by the parser in the example above:</p>
<table class="border">
<tbody>
<tr>
<td class="smallcaps" style="text-align: center;" align="center">word</td>
<td class="smallcaps" style="text-align: center;" align="center">part-of-speech</td>
<td class="smallcaps" style="text-align: center;" align="center">chunk</td>
<td class="smallcaps" style="text-align: center;" align="center">pnp</td>
</tr>
<tr>
<td align="center">fork</td>
<td align="center"><span class="postag">NN </span></td>
<td align="center"><span class="postag">I-NP</span></td>
<td align="center"><span class="postag">I-PNP</span></td>
</tr>
</tbody>
</table>
<p>The word's part-of-speech tag is <span class="postag">NN</span>, which means that it is a noun. The word occurs in a <span class="postag">NP</span> chunk, a noun phrase (i.e., <em>a fork</em>). It is also part of a prepositional noun phrase (i.e., <em><span style="text-decoration: underline;">with</span> a fork</em>).</p>
<p>Common part-of-speech tags are&nbsp;<span class="postag">NN</span> (noun), <span class="postag">VB</span> (verb),&nbsp;<span class="postag">JJ</span> (adjective), <span class="postag">RB</span> (adverb)&nbsp;and&nbsp;<span class="postag">IN</span> (preposition).<br />Common chunk tags are&nbsp;<span class="postag">NP</span> (noun phrase) and <span class="postag">VP</span> (verb phrase).<br />Common chunk relations are <span class="postag">NP-SBJ</span> (subject) and <span class="postag">NP-OBJ</span> (object).</p>
<p>The <a class="link-maintenance" href="MBSP-tags.html" target="_blank">Penn Treebank II tagset</a>&nbsp;gives an overview of all the possible tags generated by the parser.</p>
<h3>Parser tagger &amp; tokenizer</h3>
<p>The <span class="inline_code">tokenize()</span> function returns a list of sentences, with punctuation marks split from words. It takes an optional&nbsp;<span class="inline_code">replace</span>&nbsp;dictionary, by default used to split contractions, i.e.,&nbsp;<span class="inline_code">{"'ve":</span>&nbsp;<span class="inline_code">"&nbsp;</span><span class="inline_code">'ve"</span><span class="inline_code">,</span> <span class="inline_code">...}</span>.</p>
<p>The <span class="inline_code">tag()</span> function simply annotates words with their part-of-speech tag and returns a list of <span class="inline_code">(word,</span> <span class="inline_code">tag)</span>-tuples:</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">tokenize(string, punctuation=".,;:!?()[]{}`''\"@#$^&amp;*+-|=~_", replace={})</pre><pre class="brush:python; gutter:false; light:true;">tag(string, tokenize=True, encoding='utf-8')</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import tag
&gt;&gt;&gt;
&gt;&gt;&gt; for word, pos in tag('I feel *happy*!')
&gt;&gt;&gt; if pos == "JJ": # Retrieve all adjectives.
&gt;&gt;&gt; print word
happy</pre></div>
<h3>Parser output</h3>
<p>The output of&nbsp;<span class="inline_code">parse()</span>&nbsp;is a string of sentences in which each word has been annotated with the requested tags. The <span class="inline_code">pprint()</span> function gives a human-readable breakdown of the tags (the extra <em>p-</em> is for <em>pretty</em>).</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import parse
&gt;&gt;&gt; from pattern.en import pprint
&gt;&gt;&gt;
&gt;&gt;&gt; pprint(parse('I ate pizza.', relations=True, lemmata=True))
WORD TAG CHUNK ROLE ID PNP LEMMA
I PRP NP SBJ 1 - i
ate VBP VP - 1 - eat
pizza NN NP OBJ 1 - pizza
. . - - - - . </pre></div>
<p>The output of <span class="inline_code">parse()</span> is a subclass of <span class="inline_code">unicode</span> called&nbsp;<span class="inline_code">TaggedString</span>&nbsp;whose&nbsp;<span class="inline_code">TaggedString.split()</span> method by default yields a list of sentences, where each sentence is a list of tokens, where each token is a list of the word + its tags.</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import parse
&gt;&gt;&gt; print parse('I ate pizza.').split()
[[[u'I', u'PRP', u'B-NP', u'O'],
[u'ate', u'VBD', u'B-VP', u'O'],
[u'pizza', u'NN', u'B-NP', u'O'],
[u'.', u'.', u'O', u'O']]] </pre></div>
<p>The most convenient way to analyze and mine the output is to construct&nbsp;a <a href="#tree" target="_self">parse tree</a>.</p>
<p>&nbsp;</p>
<hr />
<h2><a name="tree"></a>Parse trees</h2>
<p>A parse tree stores a tagged string as a tree of nested objects that can be traversed to analyze the constituents in the text. The <span class="inline_code">parsetree()</span> function takes the same parameters as <span class="inline_code">parse()</span> and returns a <span class="inline_code">Text</span> object.&nbsp;A&nbsp;<span class="inline_code">Text</span> is a list of <span class="inline_code">Sentence</span> objects. Each <span class="inline_code">Sentence</span> is a list of <span class="inline_code">Word</span> objects. <span class="inline_code">Word</span> objects can be grouped in <span class="inline_code">Chunk</span> objects, which are related to other <span class="inline_code">Chunk</span> objects.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">parsetree(string,
tokenize = True, # Split punctuation marks from words?
tags = True, # Parse part-of-speech tags? (NN, JJ, ...)
chunks = True, # Parse chunks? (NP, VP, PNP, ...)
relations = False, # Parse chunk relations? (-SBJ, -OBJ, ...)
lemmata = False, # Parse lemmata? (ate =&gt; eat)
encoding = 'utf-8' # Input string encoding.
tagset = None) # Penn Treebank II (default) or UNIVERSAL.
</pre><p>The following example shows the parse tree for the sentence "<em>The cat sat on the mat.</em>":</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import parsetree
&gt;&gt;&gt;
&gt;&gt;&gt; s = parsetree('The cat sat on the mat.', relations=True, lemmata=True)
&gt;&gt;&gt; print repr(s)
[Sentence(
u'The/DT/B-NP/O/NP-SBJ-1/the
cat/NN/I-NP/O/NP-SBJ-1/cat
sat/VBD/B-VP/O/VP-1/sit
on/IN/B-PP/B-PNP/O/on
the/DT/B-NP/I-PNP/O/the
mat/NN/I-NP/I-PNP/O/mat
././O/O/O/O/.')]</pre><pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; for sentence in s:
&gt;&gt;&gt; for chunk in sentence.chunks:
&gt;&gt;&gt; print chunk.type, [(w.string, w.type) for w in chunk.words]
NP [(u'the', u'DT'), (u'cat', u'NN')]
VP [(u'sat', u'VBD')]
PP [(u'on', u'IN')]
NP [(u'the', 'DT), (u'mat', u'NN')]
</pre></div>
<p>A common approach is to store output from <span class="inline_code">parse()</span>&nbsp;in a .txt file, with a tagged sentence on each line.&nbsp;The <span class="inline_code">tree()</span> function can be used to load it as a <span class="inline_code">Text</span> object. It has an optional <span class="inline_code">token</span> parameter that defines the format of the tokens (tagged words).&nbsp;So&nbsp;<span class="inline_code">parsetree(s)</span>&nbsp;is the same as&nbsp;<span class="inline_code">tree(parse(s)</span><span class="inline_code">)</span>.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">tree(taggedstring, token=[WORD, POS, CHUNK, PNP, REL, LEMMA])</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.en import tree
&gt;&gt;&gt;
&gt;&gt;&gt; for sentence in tree(open('tagged.txt'), token=[WORD, POS, CHUNK])
&gt;&gt;&gt; print sentence</pre></div>
<h3>Text</h3>
<p>A <span class="inline_code">Text</span> is a list of <span class="inline_code">Sentence</span> objects (i.e., it can be iterated with&nbsp;<span class="inline_code">for</span> <span class="inline_code">sentence</span> <span class="inline_code">in</span> <span class="inline_code">text:</span>).</p>
<pre class="brush:python; gutter:false; light:true;">text = Text(taggedstring, token=[WORD, POS, CHUNK, PNP, REL, LEMMA])</pre><pre class="brush:python; gutter:false; light:true;">text = Text.from_xml(xml) # Reads an XML string generated with Text.xml.
</pre><pre class="brush:python; gutter:false; light:true;">text.string # 'The cat sat on the mat .'
text.sentences # [Sentence('The cat sat on the mat .')]
text.copy()
text.xml</pre><h3>Sentence</h3>
<p>A <span class="inline_code">Sentence</span> is a list of <span class="inline_code">Word</span> objects, with attributes and methods that group words in <span class="inline_code">Chunk</span> objects.</p>
<pre class="brush:python; gutter:false; light:true;">sentence = Sentence(taggedstring, token=[WORD, POS, CHUNK, PNP, REL, LEMMA])</pre><pre class="brush:python; gutter:false; light:true;">sentence = Sentence.from_xml(xml)
</pre><pre class="brush:python; gutter:false; light:true;">sentence.parent # Sentence parent, or None.
sentence.id # Unique id for each sentence.
sentence.start # 0
sentence.stop # len(Sentence).
</pre><pre class="brush:python; gutter:false; light:true;">sentence.string # Tokenized string, without tags.
sentence.words # List of Word objects.
sentence.lemmata # List of word lemmata.
sentence.chunks # List of Chunk objects.
sentence.subjects # List of NP-SBJ chunks.
sentence.objects # List of NP-OBJ chunks.
sentence.verbs # List of VP chunks.
sentence.relations # {'SBJ': {1: Chunk('the cat/NP-SBJ-1')},
# 'VP': {1: Chunk('sat/VP-1')},
# 'OBJ': {}}
sentence.pnp # List of PNPChunks: [Chunk('on the mat/PNP')]
</pre><pre class="brush:python; gutter:false; light:true;">sentence.constituents(pnp=False)</pre><pre class="brush:python; gutter:false; light:true;">sentence.slice(start, stop)
sentence.copy()
sentence.xml
</pre><ul>
<li><span class="inline_code">Sentence.constituents()</span> returns a mixed, in-order list of <span class="inline_code">Word</span> and <span class="inline_code">Chunk</span> objects.<br />With <span class="inline_code">pnp=True</span>, it will yield&nbsp;<span class="inline_code">PNPChunk</span> objects whenever possible.</li>
<li><span class="inline_code">Sentence.slice()</span>&nbsp;returns a <span class="inline_code">Slice</span> (= a subclass of <span class="inline_code">Sentence</span>) starting with the word at index <span class="inline_code">start</span> and containing all words up to (not including) index <span class="inline_code">stop</span>.</li>
</ul>
<h3>Sentence words</h3>
<p>A <span class="inline_code">Sentence</span> is made up of <span class="inline_code">Word</span> objects, which are also grouped in <span class="inline_code">Chunk</span> objects:</p>
<pre class="brush:python; gutter:false; light:true;">word = Word(sentence, string, lemma=None, type=None, index=0)</pre><pre class="brush:python; gutter:false; light:true;">word.sentence # Sentence parent.
word.index # Sentence index of word.
word.string # String (Unicode).
word.lemma # String lemma, e.g. 'sat' =&gt; 'sit',
word.type # Part-of-speech tag (NN, JJ, VBD, ...)
word.chunk # Chunk parent, or None.
word.pnp # PNPChunk parent, or None.</pre><h3>Sentence chunks</h3>
<p>A <span class="inline_code">Chunk</span> is a list of <span class="inline_code">Word</span> objects that belong together. <br />Multiple chunks can be part of a <span class="inline_code">PNPChunk</span>, which start with a <span class="postag">PP</span> chunk followed by <span class="postag">NP</span> chunks.</p>
<pre class="brush:python; gutter:false; light:true;">chunk = Chunk(sentence, words=[], type=None, role=None, relation=None)</pre><pre class="brush:python; gutter:false; light:true;">chunk.sentence # Sentence parent.
chunk.start # Sentence index of first word.
chunk.stop # Sentence index of last word + 1.
chunk.string # String of words (Unicode).
chunk.words # List of Word objects.
chunk.lemmata # List of word lemmata.
chunk.head # Primary Word in the chunk.
chunk.type # Chunk tag (NP, VP, PP, ...)
chunk.role # Role tag (SBJ, OBJ, ...)
chunk.relation # Relation id, e.g. NP-SBJ-1 =&gt; 1.
chunk.relations # List of (id, role)-tuples.
chunk.related # List of Chunks with same relation id.
chunk.subject # NP-SBJ chunk with same id.
chunk.object # NP-OBJ chunk with same id.
chunk.verb # VP chunk with same id.
chunk.modifiers # []
chunk.conjunctions # []
chunk.pnp # PNPChunk parent, or None.
</pre><pre class="brush:python; gutter:false; light:true;">chunk.previous(type=None)
chunk.next(type=None)
chunk.nearest(type='VP')</pre><ul>
<li><span class="inline_code">Chunk.head</span> yields the primary&nbsp;<span class="inline_code">Word</span> in the chunk: <em>the big cat</em><em>cat</em>.</li>
<li><span class="inline_code">Chunk.relations</span>&nbsp;contains all relations the chunk is part of. <br />Some chunks have multiple relations, e.g., <span class="postag">SBJ</span> as well as&nbsp;<span class="postag">OBJ</span>, or&nbsp;<span class="postag">OBJ</span> of multiple <span class="postag">VP</span>'s.</li>
<li>For <span class="postag">VP</span> chunks, <span class="inline_code">Chunk.modifiers</span> is a list of nearby adjectives and adverbs that have no relations. <br />For example, in <em>the cat purred happily</em>, modifier of&nbsp;<em>purred</em>&nbsp;<em>happily</em>.</li>
<li><span class="inline_code">Chunk.conjunctions</span> is a list of chunks linked by <em>and</em>&nbsp;and&nbsp;<em>or</em> to this chunk. <br />For example in <em>up and down</em>: the <em>up</em> chunk has conjunctions: <span class="inline_code">[(Chunk('down'),</span> <span class="inline_code">AND)]</span>.</li>
</ul>
<h3>Prepositional noun phrases</h3>
<p>A <span class="inline_code">PNPChunk</span>&nbsp;or prepositional noun phrase is a subclass of <span class="inline_code">Chunk</span>.&nbsp;It groups <span class="postag">PP</span> + <span class="postag">NP</span> chunks (= <span class="postag">PNP</span>).</p>
<pre class="brush:python; gutter:false; light:true;">pnp = PNPChunk(sentence, words=[], type=None, role=None, relation=None)</pre><pre class="brush:python; gutter:false; light:true;">pnp.string # String of words (Unicode).
pnp.chunks # List of Chunk objects.
pnp.preposition # First PP chunk in the PNP.
</pre><p>Words and chunks that are part of a <span class="postag">PNP</span> will have their <span class="inline_code">Word.pnp</span> and <span class="inline_code">Chunk.pnp</span> attribute set.&nbsp;All prepositional noun phrases in a sentence can be retrieved with <span class="inline_code">Sentence.pnp</span>.</p>
<p>&nbsp;</p>
<hr />
<h2><a name="sentiment"></a>Sentiment</h2>
<p>Written text can be broadly categorized into two types: facts and opinions. Opinions carry people's sentiments, appraisals and feelings toward the world. The pattern.en module bundles a lexicon of adjectives (e.g., <em>good</em>, <em>bad</em>, <em>amazing</em>, <em>irritating</em>, ...) that occur frequently in product reviews, annotated with scores for sentiment polarity (positive ↔&nbsp;negative) and subjectivity (objective ↔ subjective).&nbsp;</p>
<p>The <span class="inline_code">sentiment()</span> function returns a <span class="inline_code">(polarity,</span> <span class="inline_code">subjectivity)</span>-tuple for the given sentence, based on the adjectives it contains,&nbsp;where polarity is a value between <span class="inline_code">-1.0</span> and +<span class="inline_code">1.0</span> and subjectivity between <span class="inline_code">0.0</span> and <span class="inline_code">1.0</span>.&nbsp;The sentence can be a string, <span class="inline_code">Text</span>, <span class="inline_code">Sentence</span>, <span class="inline_code">Chunk</span>,&nbsp;<span class="inline_code">Word</span> or a&nbsp;<span class="inline_code">Synset</span> (see below).&nbsp;</p>
<p>The <span class="inline_code">positive()</span> function returns <span class="inline_code">True</span> if the given sentence's polarity is above the threshold. The threshold can be lowered or raised, but overall <span class="inline_code">+0.1</span> gives the best results for product reviews. Accuracy is about 75% for movie reviews.</p>
<pre class="brush:python; gutter:false; light:true;">sentiment(sentence) # Returns a (polarity, subjectivity)-tuple.</pre><pre class="brush:python; gutter:false; light:true;">positive(s, threshold=0.1) # Returns True if polarity &gt;= threshold.</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import sentiment
&gt;&gt;&gt;
&gt;&gt;&gt; print sentiment(
&gt;&gt;&gt; "The movie attempts to be surreal by incorporating various time paradoxes,"
&gt;&gt;&gt; "but it's presented in such a ridiculous way it's seriously boring.")
(-0.34, 1.0) </pre></div>
<p>In the example above,&nbsp;<span class="inline_code">-0.34</span> is the average of&nbsp;<em>surreal</em>, <em>various</em>, <em>ridiculous</em> and <em>seriously boring</em>.&nbsp;To retrieve the scores for individual words, use the special <span class="inline_code">assessments</span> property, which yields a list of <span class="inline_code">(words,</span> <span class="inline_code">polarity,</span> <span class="inline_code">subjectivity,</span> <span class="inline_code">label)</span>-tuples.</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; print sentiment('Wonderfully awful! :-)').assessments
[(['wonderfully', 'awful', '!'], -1.0, 1.0, None),
([':-)'], 0.5, 1.0, 'mood')]
</pre></div>
<p>&nbsp;&nbsp;</p>
<hr />
<h2><a name="modality"></a>Mood &amp; modality</h2>
<p>Grammatical mood refers to the use of auxiliary verbs (e.g., <em>could</em>, <em>would</em>) and adverbs (e.g., <em>definitely</em>,<em> maybe</em>) to express uncertainty.&nbsp;</p>
<p>The <span class="inline_code">mood()</span> function returns either&nbsp;<span class="inline_code">INDICATIVE</span>, <span class="inline_code">IMPERATIVE</span>, <span class="inline_code">CONDITIONAL</span>&nbsp;or <span class="inline_code">SUBJUNCTIVE</span>&nbsp;for a given parsed&nbsp;<span class="inline_code">Sentence</span>. See the table below for an overview of moods.</p>
<p>The <span class="inline_code">modality()</span> function returns the degree of certainty as a value between <span class="inline_code">-1.0</span> and <span class="inline_code">+1.0</span>, where values <span class="inline_code">&gt;</span> <span class="inline_code">+0.5</span> represent facts. For example, "<em>I wish it would stop raining"</em> scores <span class="inline_code">-0.35</span>, whereas "<em>It will stop raining"</em> scores <span class="inline_code">+0.75</span>. Accuracy is about 68% for Wikipedia texts.</p>
<pre class="brush:python; gutter:false; light:true;">mood(sentence) # Returns INDICATIVE | IMPERATIVE | CONDITIONAL | SUBJUNCTIVE</pre><pre class="brush:python; gutter:false; light:true;">modality(sentence) # Returns -1.0 =&gt; +1.0.</pre><table class="border">
<tbody>
<tr>
<td><span class="smallcaps">Mood</span></td>
<td><span class="smallcaps">Form</span></td>
<td><span class="smallcaps">Use</span></td>
<td><span class="smallcaps">Example</span></td>
</tr>
<tr>
<td><span class="inline_code">INDICATIVE</span></td>
<td>none of the below&nbsp;</td>
<td>fact, belief</td>
<td><em>It rains.</em></td>
</tr>
<tr>
<td><span class="inline_code">IMPERATIVE</span></td>
<td>infinitive without <em>to</em></td>
<td>command, warning</td>
<td><em><span style="text-decoration: underline;">Do</span>n't rain!</em></td>
</tr>
<tr>
<td><span class="inline_code">CONDITIONAL</span></td>
<td><em>would</em>, <em>could</em>, <em>should</em>, <em>may</em>, or <em>will</em>,&nbsp;<em>can</em> + <em>if</em></td>
<td>conjecture</td>
<td><em>It <span style="text-decoration: underline;">might</span> rain.</em></td>
</tr>
<tr>
<td><span class="inline_code">SUBJUNCTIVE</span></td>
<td><em>wish</em>, <em>were</em>, or&nbsp;<em>it is</em> + infinitive</td>
<td>wish, opinion</td>
<td><em>I <span style="text-decoration: underline;">hope</span> it rains.</em></td>
</tr>
</tbody>
</table>
<p>For example:</p>
<div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.en import parse, Sentence, parse
&gt;&gt;&gt; from pattern.en import modality
&gt;&gt;&gt;
&gt;&gt;&gt; s = "Some amino acids tend to be acidic while others may be basic." # weaseling
&gt;&gt;&gt; s = parse(s, lemmata=True)
&gt;&gt;&gt; s = Sentence(s)
&gt;&gt;&gt;
&gt;&gt;&gt; print modality(s)
0.11</pre></div>
<p>&nbsp;</p>
<hr />
<h2><a name="wordnet"></a>WordNet</h2>
<p>The pattern.en.wordnet module includes WordNet 3.0 and Oliver Steele's PyWordNet module. <a href="http://wordnet.princeton.edu/" target="_blank">WordNet</a> is a lexical database that groups related words into <span class="inline_code">Synset</span> objects (= sets of synonyms). Each synset provides a short definition and semantic relations to other synsets.</p>
<p>The <span class="inline_code">synsets()</span> function returns a list of <span class="inline_code">Synset</span> objects for a given word, where each set corresponds to a word sense (e.g., <em>tree</em> in the sense of plant, <em>tree</em> in the sense of diagram, etc.)</p>
<pre class="brush:python; gutter:false; light:true;">synset = wordnet.synsets(word, pos=NOUN)[i]</pre><pre class="brush:python; gutter:false; light:true;">synset.pos # Part-of-speech: NOUN | VERB | ADJECTIVE | ADVERB.
synset.synonyms # List of word forms (i.e., synonyms).
synset.gloss # Definition string.
synset.lexname # Category string, or None.
synset.ic # Information Content (float).
</pre><pre class="brush:python; gutter:false; light:true;">synset.antonym # Synset (semantic opposite).
synset.hypernym # Synset (semantic parent).</pre><pre class="brush:python; gutter:false; light:true;">synset.hypernyms(recursive=False, depth=None)
synset.hyponyms(recursive=False, depth=None)
synset.meronyms() # List of synsets (members/parts).
synset.holonyms() # List of synsets (of which this is a member).
synset.similar() # List of synsets (similar adjectives/verbs).</pre><ul>
<li><span class="inline_code">Synset.hypernyms()</span> returns a list of <em>&nbsp;</em>parent synsets (i.e., more general).</li>
<li><span class="inline_code">Synset.hyponyms()</span> returns a list child synsets (i.e., more specific).<br />With <span class="inline_code">recursive=True</span>, returns parents of parents or children of children.<br />Optionally, returns parents or children recursively up to the given <span class="inline_code">depth</span>.</li>
</ul>
<p>For example:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import wordnet
&gt;&gt;&gt;
&gt;&gt;&gt; s = wordnet.synsets('bird')[0]
&gt;&gt;&gt;
&gt;&gt;&gt; print 'Definition:', s.gloss
&gt;&gt;&gt; print ' Synonyms:', s.synonyms
&gt;&gt;&gt; print ' Hypernyms:', s.hypernyms()
&gt;&gt;&gt; print ' Hyponyms:', s.hyponyms()
&gt;&gt;&gt; print ' Holonyms:', s.holonyms()
&gt;&gt;&gt; print ' Meronyms:', s.meronyms()
Definition: u'warm-blooded egg-laying vertebrates characterized '
'by feathers and forelimbs modified as wings'
Synonyms: [u'bird']
Hypernyms: [Synset(u'vertebrate')]
Hyponyms: [Synset(u'cock'), Synset(u'hen'), ...]
Holonyms: [Synset(u'Aves'), Synset(u'flock')]
Meronyms: [Synset(u'beak'), Synset(u'feather'), ...]</pre></div>
<div class="example"><span class="small"><span style="text-decoration: underline;">Reference</span>: Fellbaum, C. (1998). </span><em class="small">WordNet: An Electronic Lexical Database</em><span class="small">. Cambridge, MIT Press.</span></div>
<h3>Synset similarity</h3>
<p>The <span class="inline_code">ancestor()</span> function returns the common ancestor&nbsp;of two synsets.&nbsp;The <span class="inline_code">similarity()</span> function returns the semantic similarity of two synsets as a value between <span class="inline_code">0.0</span><span class="inline_code">1.0</span>.</p>
<pre class="brush:python; gutter:false; light:true;">wordnet.ancestor(synset1, synset2)</pre><pre class="brush:python; gutter:false; light:true;">wordnet.similarity(synset1, synset2)
</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import wordnet
&gt;&gt;&gt;
&gt;&gt;&gt; a = wordnet.synsets('cat')[0]
&gt;&gt;&gt; b = wordnet.synsets('dog')[0]
&gt;&gt;&gt; c = wordnet.synsets('box')[0]
&gt;&gt;&gt;
&gt;&gt;&gt; print wordnet.ancestor(a, b)
&gt;&gt;&gt;
&gt;&gt;&gt; print wordnet.similarity(a, a)
&gt;&gt;&gt; print wordnet.similarity(a, b)
&gt;&gt;&gt; print wordnet.similarity(a, c)
Synset('carnivore')
1.0
0.86
0.17 </pre></div>
<p>Similarity is calculated using Lin's formula and Resnik's Information Content (IC). IC values for each synset are derived from the word count in Brown corpus.</p>
<p><span class="inline_code">lin</span> <span class="inline_code">=</span> <span class="inline_code">2.0</span> <span class="inline_code">*</span> <span class="inline_code">log(ancestor(synset1,</span> <span class="inline_code">synset2).ic)</span> <span class="inline_code">/</span> <span class="inline_code">log(synset1.ic</span> <span class="inline_code">*</span> <span class="inline_code">synset2.ic)</span></p>
<h3>Synset sentiment</h3>
<p><a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet</a> is a lexical resource for opinion mining, with polarity and subjectivity scores for all WordNet synsets. SentiWordNet is free for non-commercial research purposes. To use SentiWordNet, request a download from the authors and put&nbsp;<span class="inline_code">SentiWordNet*.txt</span> in&nbsp;<span class="inline_code">pattern/en/wordnet/</span>.&nbsp;You can then use&nbsp;<span class="inline_code">Synset.weight()</span> in your script:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en import wordnet
&gt;&gt;&gt; from pattern.en import ADJECTIVE
&gt;&gt;&gt;
&gt;&gt;&gt; print wordnet.synsets('happy', ADJECTIVE)[0].weight
&gt;&gt;&gt; print wordnet.synsets('sad', ADJECTIVE)[0].weight
(0.375, 0.875)
(-0.625, 0.875)
</pre></div>
<p>&nbsp;</p>
<hr />
<h2><a name="wordlist"></a>Wordlists</h2>
<p>The patten.en module includes a number of general-purpose word lists:</p>
<table class="border">
<tbody>
<tr>
<td><span class="smallcaps">List</span></td>
<td><span class="smallcaps">Description</span></td>
<td style="text-align: center;"><span class="smallcaps">Size</span></td>
<td><span class="smallcaps">Example</span></td>
</tr>
<tr>
<td><span class="inline_code">ACADEMIC</span></td>
<td>English academic words</td>
<td style="text-align: center;">500</td>
<td><em>criterion</em>, <em>proportionally</em>, <em>research</em></td>
</tr>
<tr>
<td><span class="inline_code">BASIC</span></td>
<td>English basic words</td>
<td style="text-align: center;">1,000</td>
<td><em>chicken</em>, <em>pain</em>, <em>road</em></td>
</tr>
<tr>
<td><span class="inline_code">PROFANITY</span></td>
<td>English swear words</td>
<td style="text-align: center;">350</td>
<td>&nbsp;</td>
</tr>
<tr>
<td><span class="inline_code">TIME</span></td>
<td>English time &amp; date words</td>
<td style="text-align: center;">100</td>
<td><em>Christmas</em>, <em>past</em>, <em>saturday</em></td>
</tr>
</tbody>
</table>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.en.wordlist import ACADEMIC
&gt;&gt;&gt;
&gt;&gt;&gt; words = open('paper.txt').read().split()
&gt;&gt;&gt; words = [w for w in words if w not in ACADEMIC] </pre></div>
<p>&nbsp;</p>
<hr />
<h2>See also</h2>
<ul>
<li><a href="http://www.clips.ua.ac.be/pages/MBSP" target="_blank">MBSP</a> (GPL): r<span>obust parser using a memory-based learning approach, in Python.</span></li>
<li><span><a href="http://www.nltk.org/" target="_blank">NLTK</a> (Apache): f</span><span>ull natural language processing toolkit for Python.</span></li>
</ul>
</div>
</div></div>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
SyntaxHighlighter.all();
</script>
</body>
</html>