<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>pattern-nl</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link type="text/css" rel="stylesheet" href="../clips.css" />
<style>
/* Small fixes because we omit the online layout.css. */
h3 { line-height: 1.3em; }
#page { margin-left: auto; margin-right: auto; }
#header, #header-inner { height: 175px; }
#header { border-bottom: 1px solid #C6D4DD; }
table { border-collapse: collapse; }
#checksum { display: none; }
</style>
<link href="../js/shCore.css" rel="stylesheet" type="text/css" />
<link href="../js/shThemeDefault.css" rel="stylesheet" type="text/css" />
<script language="javascript" src="../js/shCore.js"></script>
<script language="javascript" src="../js/shBrushXml.js"></script>
<script language="javascript" src="../js/shBrushJScript.js"></script>
<script language="javascript" src="../js/shBrushPython.js"></script>
</head>
<body class="node-type-page one-sidebar sidebar-right section-pages">
<div id="page">
<div id="page-inner">
<div id="header"><div id="header-inner"></div></div>
<div id="content">
<div id="content-inner">
<div class="node node-type-page">
<div class="node-inner">
<div class="breadcrumb">View online at: <a href="http://www.clips.ua.ac.be/pages/pattern-nl" class="noexternal" target="_blank">http://www.clips.ua.ac.be/pages/pattern-nl</a></div>
<h1>pattern.nl</h1>
<!-- Parsed from the online documentation. -->
<div id="node-1418" class="node node-type-page"><div class="node-inner">
<div class="content">
<p><span class="big">The pattern.nl module contains a fast part-of-speech tagger for Dutch (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment analysis, and tools for Dutch verb conjugation and noun singularization &amp; pluralization.</span></p>
<p>It can be used by itself or with other <a href="pattern.html">pattern</a> modules: <a href="pattern-web.html">web</a> | <a href="pattern-db.html">db</a> | <a href="pattern-en.html">en</a> | <a href="pattern-search.html">search</a> | <a href="pattern-vector.html">vector</a> | <a href="pattern-graph.html">graph</a>.</p>
<p><img src="../g/pattern_schema_nl.gif" alt="" width="620" height="180" /></p>
<hr />
<h2>Documentation</h2>
<p>The functions in this module take the same parameters and return the same values as their counterparts in <a href="pattern-en.html">pattern.en</a>. Refer to the documentation there for more details.</p>
<h3>Noun singularization &amp; pluralization</h3>
<p>For Dutch nouns there is <span class="inline_code">singularize()</span> and <span class="inline_code">pluralize()</span>. The implementation is slightly less robust than the English version (accuracy 91% for singularization and 80% for pluralization).</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.nl import singularize, pluralize
>>>
>>> print(singularize('katten'))
>>> print(pluralize('kat'))

kat
katten</pre></div>
<h3>Verb conjugation</h3>
<p>For Dutch verbs there is <span class="inline_code">conjugate()</span>, <span class="inline_code">lemma()</span>, <span class="inline_code">lexeme()</span> and <span class="inline_code">tenses()</span>. The lexicon for verb conjugation contains about 4,000 common Dutch verbs. For unknown verbs it will fall back to a rule-based approach with an accuracy of about 81%.</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.nl import conjugate
>>> from pattern.nl import INFINITIVE, PRESENT, SG
>>>
>>> print(conjugate('ben', INFINITIVE))
>>> print(conjugate('ben', PRESENT, 2, SG))

zijn
bent</pre></div>
<h3>Attributive &amp; predicative adjectives</h3>
<p>Dutch adjectives followed by a noun inflect with an <span class="inline_code">-e</span> suffix (e.g., <em>braaf</em> → <em>brave kat</em>). You can get the base form with the <span class="inline_code">predicative()</span> function, or vice versa with <span class="inline_code">attributive()</span>. Accuracy is 99%.</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.nl import attributive, predicative
>>>
>>> print(predicative('brave'))
>>> print(attributive('braaf'))

braaf
brave</pre></div>
<h3 class="example">Sentiment analysis</h3>
<p class="example">For opinion mining there is <span class="inline_code">sentiment()</span>, which returns a (<span class="inline_code">polarity</span>, <span class="inline_code">subjectivity</span>)-tuple, based on a lexicon of adjectives. Polarity is a value between <span class="inline_code">-1.0</span> and <span class="inline_code">+1.0</span>, subjectivity between <span class="inline_code">0.0</span> and <span class="inline_code">1.0</span>. The accuracy is around 82% (P 0.79, R 0.86) for book reviews:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.nl import sentiment
>>> print(sentiment('Een onwijs spannend goed boek!'))

(0.69, 0.90)</pre></div>
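<p>To make the idea of a lexicon-based (polarity, subjectivity) score concrete, here is a minimal sketch that runs without pattern installed. It assumes plain averaging over known adjectives. The names <span class="inline_code">TOY_LEXICON</span> and <span class="inline_code">toy_sentiment()</span> and all scores are hypothetical, and the algorithm and lexicon actually used by <span class="inline_code">sentiment()</span> may well differ (for example in how intensifiers such as <em>onwijs</em> are handled).</p>

```python
# Hypothetical toy lexicon: adjective -> (polarity, subjectivity).
# The words are Dutch, but the scores are invented for illustration
# and are NOT the values shipped with pattern.nl.
TOY_LEXICON = {
    "spannend": (0.6, 0.8),   # "exciting"
    "goed": (0.8, 1.0),       # "good"
    "saai": (-0.5, 0.9),      # "boring"
}


def toy_sentiment(text):
    """Average the scores of all known adjectives in the text.

    Returns a (polarity, subjectivity) tuple, or (0.0, 0.0) when no
    word of the text is in the lexicon.
    """
    # Naive tokenization: split on whitespace, strip basic punctuation.
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return (0.0, 0.0)
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return (polarity, subjectivity)


polarity, subjectivity = toy_sentiment("Een spannend goed boek!")
print(round(polarity, 2), round(subjectivity, 2))
```

<p>Because every lexicon entry lies in the documented ranges, the averaged polarity stays between -1.0 and +1.0 and the averaged subjectivity between 0.0 and 1.0.</p>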
<h3>Parser</h3>
<p>For parsing there is <span class="inline_code">parse()</span>, <span class="inline_code">parsetree()</span> and <span class="inline_code">split()</span>. The <span class="inline_code">parse()</span> function annotates words in the given string with their part-of-speech <a class="link-maintenance" href="mbsp-tags.html">tags</a> (e.g., <span class="postag">NN</span> for nouns and <span class="postag">VB</span> for verbs). The <span class="inline_code">parsetree()</span> function takes a string and returns a tree of nested objects (<span class="inline_code">Text</span> → <span class="inline_code">Sentence</span> → <span class="inline_code">Chunk</span> → <span class="inline_code">Word</span>). The <span class="inline_code">split()</span> function takes the output of <span class="inline_code">parse()</span> and returns a <span class="inline_code">Text</span>. See the pattern.en documentation (<a class="link-maintenance" href="pattern-en.html#tree">here</a>) for how to manipulate <span class="inline_code">Text</span> objects.</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.nl import parse, split
>>>
>>> s = parse('De kat zit op de mat.')
>>> for sentence in split(s):
>>>     print(sentence)

Sentence('De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP'
         'de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O')</pre></div>
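<p>The slash-separated annotations in the output above can also be read with plain string operations. The sketch below runs without pattern installed and assumes that the four fields of each token are word, part-of-speech tag, chunk tag and PNP tag, in that order. The helper <span class="inline_code">read_tokens()</span> is a hypothetical name, not part of pattern.nl.</p>

```python
# Tagged sentence in the format shown in the parser example output.
tagged = ("De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP "
          "de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O")


def read_tokens(tagged_sentence):
    """Split a tagged sentence into (word, tag, chunk, pnp) tuples.

    Assumes tokens are separated by single spaces and that every token
    has exactly four slash-separated fields.
    """
    tokens = []
    for token in tagged_sentence.split(" "):
        word, tag, chunk, pnp = token.split("/")
        tokens.append((word, tag, chunk, pnp))
    return tokens


tokens = read_tokens(tagged)

# Collect all words tagged as nouns (tags starting with NN).
nouns = [word for word, tag, _, _ in tokens if tag.startswith("NN")]
print(nouns)
```

<p>For anything beyond quick inspection, the <span class="inline_code">Text</span> and <span class="inline_code">Sentence</span> objects returned by <span class="inline_code">split()</span> and <span class="inline_code">parsetree()</span> are the more convenient interface.</p>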
<p>The parser is built on Jeroen Geertzen's <a href="http://cosmion.net/jeroen/software/brill_pos/" target="_blank">Dutch language model</a>. The accuracy is around 91%. The original <a href="http://lands.let.ru.nl/literature/hvh.1999.2.ps" target="_blank">WOTAN</a> tagset is mapped to <a href="mbsp-tags.html">Penn Treebank</a>. If you need to work with the original tags you can also use <span class="inline_code">parse()</span> with an optional parameter <span class="inline_code">tagset="WOTAN"</span>.</p>
<p class="small"><span style="text-decoration: underline;">Reference</span>: Geertzen, J. (2010). <em>Brill-NL</em>. Retrieved from: <a class="noexternal" style="color: inherit;" href="http://cosmion.net/jeroen/software/brill_pos/" target="_blank">http://cosmion.net/jeroen/software/brill_pos/</a>.</p>
</div>
</div></div>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
SyntaxHighlighter.all();
</script>
</body>
</html>