<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>pattern-metrics</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link type="text/css" rel="stylesheet" href="../clips.css" />
<style>
/* Small fixes because we omit the online layout.css. */
h3 { line-height: 1.3em; }
#page { margin-left: auto; margin-right: auto; }
#header, #header-inner { height: 175px; }
#header { border-bottom: 1px solid #C6D4DD; }
table { border-collapse: collapse; }
#checksum { display: none; }
</style>
<link href="../js/shCore.css" rel="stylesheet" type="text/css" />
<link href="../js/shThemeDefault.css" rel="stylesheet" type="text/css" />
<script type="text/javascript" src="../js/shCore.js"></script>
<script type="text/javascript" src="../js/shBrushXml.js"></script>
<script type="text/javascript" src="../js/shBrushJScript.js"></script>
<script type="text/javascript" src="../js/shBrushPython.js"></script>
</head>
<body class="node-type-page one-sidebar sidebar-right section-pages">
<div id="page">
<div id="page-inner">
<div id="header"><div id="header-inner"></div></div>
<div id="content">
<div id="content-inner">
<div class="node node-type-page">
<div class="node-inner">
<div class="breadcrumb">View online at: <a href="http://www.clips.ua.ac.be/pages/pattern-metrics" class="noexternal" target="_blank">http://www.clips.ua.ac.be/pages/pattern-metrics</a></div>
<h1>pattern.metrics</h1>
<!-- Parsed from the online documentation. -->
<div id="node-1405" class="node node-type-page"><div class="node-inner">
<div class="content">
<p style="text-align: left;"><span class="big">The pattern.metrics module is a loose collection of performance, accuracy, similarity and significance tests, including code profiling, precision &amp; recall, inter-rater agreement, text metrics (similarity, readability, intertextuality, cooccurrence) and statistics (variance, chi-squared, goodness of fit).</span></p>
<p>It can be used by itself or with other <a href="pattern.html">pattern</a> modules: <a href="pattern-web.html">web</a> | <a href="pattern-db.html">db</a> | <a href="pattern-en.html">en</a> | <a href="pattern-search.html">search</a> <span class="blue"> </span>| <a href="pattern-vector.html">vector</a> | <a href="pattern-graph.html">graph</a>.</p>
<p><img src="../g/pattern_schema.gif" alt="" width="620" height="180" /></p>
<hr />
<h2>Documentation</h2>
<ul>
<li><a href="#profile">Profiler</a></li>
<li><a href="#accuracy">Accuracy, precision and recall</a></li>
<li><a href="#agreement">Inter-rater agreement</a> <span class="small link-maintenance">(Fleiss)</span></li>
</ul>
<div class="smallcaps">Text metrics</div>
<ul>
<li><a href="#similarity">Similarity</a> <span class="small link-maintenance">(Levenshtein, Dice)</span></li>
<li><a href="#readability">Readability</a> <span class="small link-maintenance">(Flesch)</span></li>
<li><a href="#ttr">Type-token ratio</a></li>
<li><a href="#intertextuality">Intertextuality</a></li>
<li><a href="#cooccurrence">Cooccurrence</a></li>
</ul>
<div class="smallcaps">Statistics</div>
<ul>
<li><a href="#mean">Mean, variance, standard deviation</a></li>
<li><a href="#gauss">Normal distribution</a></li>
<li><a href="#histogram">Histogram</a></li>
<li><a href="#moment">Moment</a></li>
<li><a href="#quantile">Quantile &amp; box plot</a></li>
</ul>
<ul>
<li><a href="#fisher">Fisher's exact test</a></li>
<li><a href="#chi2">Pearson's chi-squared test</a></li>
<li><a href="#ks2">Kolmogorov-Smirnov test</a></li>
</ul>
<p>&nbsp;</p>
<hr />
<h2><a name="profile"></a>Profiler</h2>
<p>Python is optimized with fast C extensions (e.g., <span class="inline_code">dict</span> traversal, regular expressions). Pattern is optimized with caching mechanisms. The <span class="inline_code">profile()</span> function can be used to test the performance (speed) of your own code. It returns a string with a breakdown of function calls + running time. You can then test the <span class="inline_code">duration()</span> of individual functions and refactor their source code to make them faster.</p>
<pre class="brush:python; gutter:false; light:true;">profile(function, *args, **kwargs) # Returns a string (report).</pre><pre class="brush:python; gutter:false; light:true;">duration(function, *args, **kwargs) # Returns a float (seconds).</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.metrics import profile
&gt;&gt;&gt; from pattern.en import parsetree
&gt;&gt;&gt;
&gt;&gt;&gt; def main(n=10):
&gt;&gt;&gt;     for i in range(n):
&gt;&gt;&gt;         parsetree('The cat sat on the mat.')
&gt;&gt;&gt;
&gt;&gt;&gt; print profile(main, n=100)</pre></div>
<table class="border">
<tbody>
<tr>
<td class="smallcaps" style="text-align: center;">ncalls</td>
<td class="smallcaps" style="text-align: center;">tottime</td>
<td class="smallcaps" style="text-align: center;">percall</td>
<td class="smallcaps" style="text-align: center;">cumtime</td>
<td class="smallcaps" style="text-align: center;">percall</td>
<td class="smallcaps">filename:lineno(function)</td>
</tr>
<tr>
<td style="text-align: center;">1</td>
<td style="text-align: center;">0.082</td>
<td style="text-align: center;">0.082</td>
<td style="text-align: center;">1.171</td>
<td style="text-align: center;">1.171</td>
<td>text/__init__.py:229(load)</td>
</tr>
<tr>
<td style="text-align: center;">94,127</td>
<td style="text-align: center;">0.147</td>
<td style="text-align: center;">0.000</td>
<td style="text-align: center;">1.089</td>
<td style="text-align: center;">0.000</td>
<td>text/__init__.py:231(&lt;genexpr&gt;)</td>
</tr>
<tr>
<td style="text-align: center;">94,774</td>
<td style="text-align: center;">0.233</td>
<td style="text-align: center;">0.000</td>
<td style="text-align: center;">0.861</td>
<td style="text-align: center;">0.000</td>
<td>text/__init__.py:195(_read)</td>
</tr>
<tr>
<td style="text-align: center;">95,391</td>
<td style="text-align: center;">0.321</td>
<td style="text-align: center;">0.000</td>
<td style="text-align: center;">0.541</td>
<td style="text-align: center;">0.000</td>
<td>text/__init__.py:33(decode_string)</td>
</tr>
<tr>
<td style="text-align: center;">95,991</td>
<td style="text-align: center;">0.073</td>
<td style="text-align: center;">0.000</td>
<td style="text-align: center;">0.182</td>
<td style="text-align: center;">0.000</td>
<td>{_codecs.utf_8_decode}</td>
</tr>
</tbody>
</table>
<p>In this example, the pattern.en parser spends most of its time loading data files and decoding Unicode.</p>
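<p>The column layout above (<span class="inline_code">ncalls</span>, <span class="inline_code">tottime</span>, <span class="inline_code">cumtime</span>) is that of Python's built-in <span class="inline_code">cProfile</span> module, so a comparable report can be produced directly from the standard library. A minimal sketch (not pattern's exact implementation):</p>

```python
import cProfile
import io
import pstats

def profile(function, *args, **kwargs):
    # Run the function under cProfile and return the report as a string,
    # sorted by cumulative time, like the table above.
    profiler = cProfile.Profile()
    profiler.enable()
    function(*args, **kwargs)
    profiler.disable()
    s = io.StringIO()
    pstats.Stats(profiler, stream=s).sort_stats('cumulative').print_stats(10)
    return s.getvalue()
```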
<p>&nbsp;</p>
<hr />
<h2><a name="accuracy"></a>Accuracy, precision and recall</h2>
<p>Precision and recall can be used to test the performance (accuracy) of a binary classifier. A well-known classification task is spam detection, for example an <span class="inline_code">is_spam()</span> function that yields <span class="inline_code">True</span> or <span class="inline_code">False</span> (binary). Ideally, the function yields <span class="inline_code">True</span> for each spam message (= true positives, "hits"). Occasionally, the function might also return <span class="inline_code">True</span> for messages that are not spam (= false positives, "errors"), or <span class="inline_code">False</span> for messages that <em>are</em> spam (= false negatives, "misses").</p>
<p><strong>Precision</strong> is a measure of hits vs. errors. <strong>Recall</strong> is a measure of hits vs. misses. High precision means that actual e-mail does not end up in the junk folder. High recall means that no spam ends up in the inbox.</p>
<p>The <span class="inline_code">confusion_matrix()</span> function takes a function that returns <span class="inline_code">True</span> or <span class="inline_code">False</span> for a given document (e.g., a string), and a list of <span class="inline_code">(document,</span> <span class="inline_code">bool)</span>-tuples for testing. It returns a <span class="inline_code">(TP,</span> <span class="inline_code">TN,</span> <span class="inline_code">FP,</span> <span class="inline_code">FN)</span>-tuple.</p>
<p>The <span class="inline_code">test()</span> function takes a function and a list of <span class="inline_code">(document,</span> <span class="inline_code">bool)</span>-tuples. It returns a tuple with <span class="inline_code">(accuracy,</span> <span class="inline_code">precision,</span> <span class="inline_code">recall,</span> <span class="inline_code">F1-score)</span>. The optional <span class="inline_code">average</span> can be <span class="inline_code">MACRO</span> or <span class="inline_code">None</span>.</p>
<pre class="brush:python; gutter:false; light:true;">confusion_matrix(match=lambda document: False, documents=[(None, False)])</pre><pre class="brush:python; gutter:false; light:true;">test(match=lambda document: False, documents=[], average=None)</pre><table class="border">
<tbody>
<tr>
<td><span class="smallcaps">Metric</span></td>
<td><span class="smallcaps">Formula</span></td>
<td><span class="smallcaps">Description</span></td>
</tr>
<tr>
<td>Accuracy</td>
<td><span class="inline_code">(TP</span> <span class="inline_code">+</span> <span class="inline_code">TN)</span> <span class="inline_code">/</span> <span class="inline_code">(TP</span> <span class="inline_code">+</span> <span class="inline_code">TN</span> <span class="inline_code">+</span> <span class="inline_code">FP</span> <span class="inline_code">+</span> <span class="inline_code">FN)</span></td>
<td>percentage of correct classifications</td>
</tr>
<tr>
<td>Precision</td>
<td><span class="inline_code">TP</span> <span class="inline_code">/</span> <span class="inline_code">(TP</span> <span class="inline_code">+</span> <span class="inline_code">FP)</span></td>
<td>percentage of correct positive classifications</td>
</tr>
<tr>
<td>Recall</td>
<td><span class="inline_code">TP</span> <span class="inline_code">/</span> <span class="inline_code">(TP</span> <span class="inline_code">+</span> <span class="inline_code">FN)</span></td>
<td>percentage of positive cases correctly classified as positive</td>
</tr>
<tr>
<td>F1-score</td>
<td><span class="inline_code">2</span> <span class="inline_code">x</span> <span class="inline_code">P</span> <span class="inline_code">x</span> <span class="inline_code">R</span> <span class="inline_code">/</span> <span class="inline_code">(P</span> <span class="inline_code">+</span> <span class="inline_code">R)</span></td>
<td>harmonic mean of precision and recall</td>
</tr>
</tbody>
</table>
<p>For example:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.metrics import confusion_matrix, test
&gt;&gt;&gt;
&gt;&gt;&gt; def is_spam(s):
&gt;&gt;&gt;     s = (w.strip(',.?!\'"') for w in s.lower().split())
&gt;&gt;&gt;     return any(w in ('viagra', 'lottery') for w in s)
&gt;&gt;&gt;
&gt;&gt;&gt; data = [
&gt;&gt;&gt;     ('In attachment is the final report.', False),
&gt;&gt;&gt;     ('Here is that link we talked about.', False),
&gt;&gt;&gt;     ("Don't forget to buy more cat food!", False),
&gt;&gt;&gt;     ("Shouldn't is_spam() flag 'viagra'?", False),
&gt;&gt;&gt;     ('You are the winner in our lottery!', True),
&gt;&gt;&gt;     ('VIAGRA PROFESSIONAL as low as 1.4$', True)
&gt;&gt;&gt; ]
&gt;&gt;&gt; print confusion_matrix(is_spam, data)
&gt;&gt;&gt; print test(is_spam, data)
(2, 3, 1, 0)
(0.83, 0.67, 1.00, 0.80) </pre></div>
<p>In this example, <span class="inline_code">is_spam()</span> correctly classifies 5 out of 6 messages (83% accuracy). It identifies all spam messages (100% recall). However, it also flags a message that is not spam (67% precision).</p>
<p>&nbsp;</p>
<hr />
<h2><a name="agreement"></a>Inter-rater agreement</h2>
<p>Inter-rater agreement (Fleiss' kappa) can be used to test the consensus among different raters. For example, say we have an <span class="inline_code">is_spam()</span> function that predicts whether a given e-mail message is spam or not. It uses a list of words, each annotated with a "junk score" between <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>. To avoid bias, each score is the average of the ratings of three different annotators. The annotators agree on obvious words such as <em>viagra</em> (everyone says <span class="inline_code">1.0</span>), but their ratings diverge on ambiguous words. So how <em>reliable</em> is the list?</p>
<p>The <span class="inline_code">agreement()</span> function returns the reliability as a number between <span class="inline_code">-1.0</span> and <span class="inline_code">+1.0</span> (where <span class="inline_code">+0.7</span> is reliable). The given <span class="inline_code">matrix</span> is a list in which each row represents a task. Each task is a list with the number of votes per rating. Each column represents a possible rating.</p>
<pre class="brush:python; gutter:false; light:true;">agreement(matrix)</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.metrics import agreement
&gt;&gt;&gt;
&gt;&gt;&gt; m = [ # 0.0 0.5 1.0 JUNK?
&gt;&gt;&gt;     [ 0, 0, 3 ], # viagra
&gt;&gt;&gt;     [ 0, 1, 2 ], # lottery
&gt;&gt;&gt;     [ 1, 2, 0 ], # buy
&gt;&gt;&gt;     [ 3, 0, 0 ], # cat
&gt;&gt;&gt; ]
&gt;&gt;&gt; print agreement(m)
0.49</pre></div>
<p>Although the annotators disagree on ambiguous words such as <em>buy</em> (one says <span class="inline_code">0.0</span>, the others say <span class="inline_code">0.5</span>), the list is quite reliable (<span class="inline_code">+0.49</span> agreement). The averaged score for <em>buy</em> will be <span class="inline_code">0.33</span>.</p>
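<p>Fleiss' kappa can be computed from its definition in a few lines. A sketch (the name <span class="inline_code">fleiss_kappa</span> is hypothetical; an equal number of raters per task is assumed):</p>

```python
def fleiss_kappa(matrix):
    # One row per task, one column per rating category; cells count votes.
    N = len(matrix)     # number of tasks
    n = sum(matrix[0])  # raters per task (assumed constant across tasks)
    k = len(matrix[0])  # number of rating categories
    # Observed agreement, averaged over tasks.
    P = sum((sum(v * v for v in row) - n) / float(n * (n - 1))
            for row in matrix) / N
    # Agreement expected by chance, from the overall category proportions.
    p = [sum(row[j] for row in matrix) / float(N * n) for j in range(k)]
    Pe = sum(pj * pj for pj in p)
    return (P - Pe) / (1 - Pe)
```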
<p>&nbsp;</p>
<hr />
<h2>Text metrics</h2>
<h3><a name="similarity"></a>Similarity</h3>
<p>The <span class="inline_code">similarity()</span> function can be used to test the similarity between two strings. It returns a value between <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>. The optional <span class="inline_code">metric</span> can be <span class="inline_code">LEVENSHTEIN</span> or <span class="inline_code">DICE</span>. Levenshtein edit distance measures the similarity between two strings as the number of operations (insert, delete, replace) needed to transform one string into the other (e.g., <em>cat</em> &rarr; <em>hat</em> &rarr; <em>what</em>). Dice coefficient measures the similarity as the number of shared bigrams (e.g., <em>nap</em> and <em>trap</em> share one bigram <em>ap</em>).</p>
<pre class="brush:python; gutter:false; light:true;">similarity(string1, string2, metric=LEVENSHTEIN)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import similarity, levenshtein
&gt;&gt;&gt;
&gt;&gt;&gt; print similarity('cat', 'what')
&gt;&gt;&gt; print levenshtein('cat', 'what')
0.5
2</pre></div>
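<p>The edit distance is a classic dynamic programming exercise. A from-scratch sketch that reproduces the values above (normalizing by the length of the longer string, an assumption about how pattern maps distance to <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>):</p>

```python
def levenshtein(a, b):
    # Row-by-row dynamic programming over the edit distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(s1, s2):
    # 1.0 = identical; edit distance normalized by the longer string.
    return 1.0 - levenshtein(s1, s2) / float(max(len(s1), len(s2)) or 1)
```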
<h3><a name="readability"></a>Readability</h3>
<p>The <span class="inline_code">readability()</span> function can be used to test the readability of a text. It returns a value between <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>, based on Flesch Reading Ease, which combines average sentence length and word length (= number of syllables per word).</p>
<pre class="brush:python; gutter:false; light:true;">readability(string)</pre><table class="border">
<tbody>
<tr>
<td><span class="smallcaps">Readability</span></td>
<td><span class="smallcaps">Description</span></td>
</tr>
<tr>
<td><span class="inline_code">0.9-1.0</span></td>
<td>easily understandable by 11-year olds</td>
</tr>
<tr>
<td><span class="inline_code">0.6-0.7</span></td>
<td>easily understandable by 13 to 15-year olds</td>
</tr>
<tr>
<td><span class="inline_code">0.3-0.5</span></td>
<td>best understood by university graduates</td>
</tr>
</tbody>
</table>
<div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import readability
&gt;&gt;&gt;
&gt;&gt;&gt; dr_seuss = "\n".join((
&gt;&gt;&gt; "'I know some good games we could play,' said the cat.",
&gt;&gt;&gt; "'I know some new tricks,' said the cat in the hat.",
&gt;&gt;&gt; "'A lot of good tricks. I will show them to you.'",
&gt;&gt;&gt; "'Your mother will not mind at all if I do.'"
&gt;&gt;&gt; ))
&gt;&gt;&gt; print readability(dr_seuss)
0.908 </pre></div>
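<p>The underlying Flesch Reading Ease formula can be sketched from scratch. The naive vowel-group syllable counter below is an assumption (pattern uses its own counter), so scores will differ slightly from <span class="inline_code">readability()</span>:</p>

```python
import re

def syllables(word):
    # Naive estimate: count groups of consecutive vowels (an assumption;
    # a real syllable counter handles silent e's, diphthongs, etc.).
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def readability(text):
    # Flesch Reading Ease, rescaled from 0-100 to 0.0-1.0 and clamped.
    sentences = [s for s in re.split(r'[.!?]+', text) if re.search(r'[A-Za-z]', s)]
    words = re.findall(r"[a-zA-Z']+", text)
    asl = len(words) / float(len(sentences))                    # words per sentence
    asw = sum(syllables(w) for w in words) / float(len(words))  # syllables per word
    flesch = 206.835 - 1.015 * asl - 84.6 * asw
    return max(0.0, min(1.0, flesch / 100))
```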
<h3><a name="ttr"></a>Type-token ratio</h3>
<p>The <span class="inline_code">ttr()</span> function can be used to test the lexical diversity of a text. It returns a value between <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>, which is the average percentage of unique words (types) for each <span class="inline_code">n</span> successive words (tokens) in the text.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">ttr(string, n=100, punctuation='.,;:!?()[]{}`''\"@#$^&amp;*+-|=~_')</pre><table class="border">
<tbody>
<tr>
<td class="smallcaps">Author</td>
<td class="smallcaps">Text</td>
<td class="smallcaps" style="text-align: center;">Year</td>
<td class="smallcaps">TTR</td>
</tr>
<tr>
<td>Dr. Seuss</td>
<td>The Cat In The Hat</td>
<td style="text-align: center;">1957</td>
<td class="inline_code">0.588</td>
</tr>
<tr>
<td>Lewis Carroll</td>
<td>Alice In Wonderland</td>
<td style="text-align: center;">1865</td>
<td class="inline_code">0.728</td>
</tr>
<tr>
<td>George Washington</td>
<td>First Inaugural Address</td>
<td style="text-align: center;">1789</td>
<td class="inline_code">0.722</td>
</tr>
<tr>
<td>George W. Bush</td>
<td>First Inaugural Address</td>
<td style="text-align: center;">2001</td>
<td class="inline_code">0.704</td>
</tr>
<tr>
<td>Barack Obama</td>
<td>First Inaugural Address</td>
<td style="text-align: center;">2009</td>
<td class="inline_code">0.717</td>
</tr>
</tbody>
</table>
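<p>One straightforward reading of the metric splits the text into spans of <span class="inline_code">n</span> tokens and averages the type/token fraction per span. A sketch under that assumption (pattern's exact windowing and tokenization may differ):</p>

```python
def ttr(text, n=100):
    # Average fraction of unique words (types) per span of n tokens.
    words = [w.strip('.,;:!?()[]{}"\'').lower() for w in text.split()]
    words = [w for w in words if w]  # assumes a non-empty text
    spans = [words[i:i + n] for i in range(0, len(words), n)]
    return sum(len(set(s)) / float(len(s)) for s in spans) / len(spans)
```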
<h3><a name="intertextuality"></a>Intertextuality</h3>
<p>The <span class="inline_code">intertextuality()</span> function can be used to test the overlap between texts (e.g., plagiarism detection). It takes a list of strings and returns a <span class="inline_code">dict</span> with <span class="inline_code">(i,</span> <span class="inline_code">j)</span>-tuples as keys and <span class="inline_code">float</span> values between <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>. For indices <span class="inline_code">i</span> and <span class="inline_code">j</span> in the given list, the corresponding <span class="inline_code">float</span> is the percentage of text <span class="inline_code">i</span> that is also in text <span class="inline_code">j</span>. Overlap is measured by <a class="link-maintenance" href="pattern-en.html#ngram"><em>n</em>-grams</a> (by default <span class="inline_code">n=5</span> or five successive words). An optional <span class="inline_code">weight</span> function can be used to supply a weight for each <em>n</em>-gram (e.g., <a class="link-maintenance" href="pattern-vector.html#tf-idf">tf-idf</a>).</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">intertextuality(texts=[], n=5, weight=lambda ngram: 1.0)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import intertextuality
&gt;&gt;&gt; from glob import glob
&gt;&gt;&gt;
&gt;&gt;&gt; index = {}
&gt;&gt;&gt; texts = []
&gt;&gt;&gt; for i, f in enumerate(glob('data/*.txt')):
&gt;&gt;&gt;     index[i] = f
&gt;&gt;&gt;     texts.append(open(f).read())
&gt;&gt;&gt;
&gt;&gt;&gt; for (i, j), weight in intertextuality(texts, n=3).items():
&gt;&gt;&gt; if weight &gt; 0.1:
&gt;&gt;&gt; print index[i]
&gt;&gt;&gt; print index[j]
&gt;&gt;&gt; print weight
&gt;&gt;&gt; print weight.assessments # Set of overlapping n-grams.
&gt;&gt;&gt; print </pre></div>
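<p>Without the <span class="inline_code">weight</span> option, the measure reduces to <em>n</em>-gram set overlap. A simplified sketch that returns plain floats (pattern returns weight objects with an <span class="inline_code">assessments</span> set, not reproduced here):</p>

```python
def ngrams(words, n=5):
    # Set of all n-grams (tuples of n successive words).
    return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def intertextuality(texts, n=5):
    # d[(i, j)] = fraction of text i's n-grams that also occur in text j.
    grams = [ngrams(t.lower().split(), n) for t in texts]
    d = {}
    for i in range(len(texts)):
        for j in range(len(texts)):
            if i != j and grams[i]:
                d[(i, j)] = len(grams[i] & grams[j]) / float(len(grams[i]))
    return d
```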
<h3><a name="cooccurrence"></a>Cooccurrence</h3>
<p>The <span class="inline_code">cooccurrence()</span> function can be used to test how often words occur alongside each other. It takes an iterable, string, file or list of files, and returns a <span class="inline_code">{word1:</span> <span class="inline_code">{word2:</span> <span class="inline_code">count,</span> <span class="inline_code">word3:</span> <span class="inline_code">count,</span> ...<span class="inline_code">}}</span> dictionary.</p>
<p>A well-known application is distributional semantics. For example, if <em><span style="text-decoration: underline;">cat</span> meows</em> and <em><span style="text-decoration: underline;">cat</span> purrs</em> occur often, <em>meow</em> and <em>purr</em> are probably related to <em>cat</em>, and to each other. This requires a large text corpus (e.g., 10+ million words). For performance, it should be given as an <span class="inline_code">open(path)</span> iterator instead of an <span class="inline_code">open(path).read()</span> string.</p>
<p>The <span class="inline_code">window</span> parameter defines the size of the cooccurrence window, e.g., <span class="inline_code">(-1,</span> <span class="inline_code">-1)</span> means the word to the left of the anchor. The <span class="inline_code">term1</span> function defines which words are anchors (e.g., <em>cat</em>). By default, all words are anchors but this may raise a <span class="inline_code">MemoryError</span>. The <span class="inline_code">term2</span> function defines which co-occurring words to count. The optional <span class="inline_code">normalize</span> function can be used to transform words (e.g., strip punctuation).</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">cooccurrence(iterable, window=(-1, -1),
             term1 = lambda w: True,
             term2 = lambda w: True,
             normalize = lambda w: w)</pre><p>What adjectives occur frequently in front of which nouns?</p>
<div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import cooccurrence
&gt;&gt;&gt;
&gt;&gt;&gt; f = open('pattern/test/corpora/tagged-en-oanc.txt')
&gt;&gt;&gt; m = cooccurrence(f,
&gt;&gt;&gt;     window = (-2, -1),
&gt;&gt;&gt;     term1 = lambda w: w[1] == 'NN',
&gt;&gt;&gt;     term2 = lambda w: w[1] == 'JJ',
&gt;&gt;&gt;     normalize = lambda w: tuple(w.split('/')) # cat/NN =&gt; ('cat', 'NN')
&gt;&gt;&gt; )
&gt;&gt;&gt; for noun in m:
&gt;&gt;&gt;     for adjective, count in m[noun].items():
&gt;&gt;&gt;         print adjective, noun, count
('last', 'JJ') ('year', 'NN') 31
('next', 'JJ') ('year', 'NN') 10
('past', 'JJ') ('year', 'NN') 7
... </pre></div>
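<p>The core counting loop can be sketched over a plain token list (pattern also accepts strings and files, and applies <span class="inline_code">normalize</span>, omitted here for brevity):</p>

```python
def cooccurrence(words, window=(-1, -1),
                 term1=lambda w: True,
                 term2=lambda w: True):
    # For each anchor word (term1), count the words (term2) that occur
    # at the window offsets around it.
    m = {}
    for i, w in enumerate(words):
        if term1(w):
            for j in range(i + window[0], i + window[1] + 1):
                if 0 <= j < len(words) and j != i and term2(words[j]):
                    m.setdefault(w, {})
                    m[w][words[j]] = m[w].get(words[j], 0) + 1
    return m
```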
<p>&nbsp;</p>
<hr />
<h2><a name="statistics"></a>Statistics</h2>
<h3><a name="mean"></a>Mean, median, variance, standard deviation</h3>
<p>An <strong>average</strong> is a measure of the "center" of a data set (= a list of values). It can be measured in different ways, for example by mean, median or mode. Usually, a data set is a smaller <em>sample</em> of a <em>population</em>. For example, <span class="inline_code">[1</span><span class="inline_code">,2</span><span class="inline_code">,4</span><span class="inline_code">]</span> is a sample of powers of two. The mean is the sum of values divided by the sample size: <span class="inline_code">(1 + 2 + 4) / 3</span> = <span class="inline_code">2.33</span>. The median is the middle value in the sorted list of values: <span class="inline_code">2</span>.</p>
<p>Variance measures how a data set is spread out. The square root of variance is called the standard deviation. A low standard deviation indicates that the values are clustered closely around the mean. A high standard deviation indicates that the values are spread out over a large range. The standard deviation can be used to test the reliability of a data set.</p>
<p>For example, for two equally competent sports teams, in which each player has a score, the team with the lower standard deviation is more reliable, since all players perform equally well on average. The team with the higher standard deviation may have very good players and very bad players (e.g., strong offense, weak defense), making their games more unpredictable.</p>
<p>The <span class="inline_code">avg()</span> or <span class="inline_code">mean()</span> function returns the mean. The <span class="inline_code">stdev()</span> function returns the standard deviation:</p>
<pre class="brush:python; gutter:false; light:true;">mean(iterable) # [1, 2, 4] =&gt; 2.33</pre><pre class="brush:python; gutter:false; light:true;">median(iterable) # [1, 2, 4] =&gt; 2</pre><pre class="brush:python; gutter:false; light:true;">variance(iterable, sample=False) # [1, 2, 4] =&gt; 1.56</pre><pre class="brush:python; gutter:false; light:true;">stdev(iterable, sample=False) # [1, 2, 4] =&gt; 1.25</pre><table class="border">
<tbody>
<tr>
<td class="smallcaps">Metric</td>
<td class="smallcaps">Formula</td>
</tr>
<tr>
<td>Mean</td>
<td><span class="inline_code">sum(list)</span> <span class="inline_code">/</span> <span class="inline_code">len(list)</span></td>
</tr>
<tr>
<td>Variance</td>
<td><span class="inline_code">sum((v</span> <span class="inline_code">-</span> <span class="inline_code">mean(list))</span> <span class="inline_code">**</span> <span class="inline_code">2</span> <span class="inline_code">for</span> <span class="inline_code">v</span> <span class="inline_code">in</span> <span class="inline_code">list)</span> <span class="inline_code">/</span> <span class="inline_code">len(list)</span></td>
</tr>
<tr>
<td>Standard deviation</td>
<td class="inline_code">sqrt(variance(list))</td>
</tr>
</tbody>
</table>
<p>To compute the sample variance with <a href="http://en.wikipedia.org/wiki/Bessel%27s_correction" target="_blank">bias correction</a>, i.e., <span class="inline_code">len(list)</span> <span class="inline_code">-</span> <span class="inline_code">1</span>, use <span class="inline_code">sample=True</span>.</p>
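<p>The formulas in the table translate directly to Python. A from-scratch sketch, with <span class="inline_code">sample=True</span> applying Bessel's correction (note that the population standard deviation of <span class="inline_code">[1, 2, 4]</span> is 1.25 and the sample value 1.53):</p>

```python
from math import sqrt

def mean(a):
    a = list(a)
    return sum(a) / float(len(a))

def variance(a, sample=False):
    # Population variance divides by n; sample variance by n-1 (Bessel).
    a = list(a)
    m = mean(a)
    return sum((v - m) ** 2 for v in a) / float(len(a) - int(sample))

def stdev(a, sample=False):
    # Standard deviation = square root of the variance.
    return sqrt(variance(a, sample))
```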
<p>We can use the <span class="inline_code">mean()</span> function to implement a generator for the simple moving average (SMA):</p>
<div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import mean
&gt;&gt;&gt;
&gt;&gt;&gt; def sma(iterable, k=10):
&gt;&gt;&gt;     a = list(iterable)
&gt;&gt;&gt;     for m in xrange(len(a)):
&gt;&gt;&gt;         i = m - k
&gt;&gt;&gt;         j = m + k + 1
&gt;&gt;&gt;         yield mean(a[max(0,i):j])</pre></div>
<h3><a name="gauss"></a>Normal distribution</h3>
<p>The normal (or Gaussian) distribution is a very common distribution of values. When graphed, it produces a bell-shaped curve. An <em>even</em> or uniform distribution on the other hand produces a straight horizontal line. For example, human intelligence is normally distributed. If we performed an IQ test among 750 individuals, about 2/3 or 500 of the IQ scores would range between IQ 85-115, or within one standard deviation (15) of the mean IQ 100. This means that few individuals have an exceptionally low or high IQ.</p>
<table class="border">
<tbody>
<tr>
<td style="text-align: center;">
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="../g/pattern-metrics-bell.jpg" alt="" width="398" height="180" /></p>
<p><span class="smallcaps">distribution of iq scores</span></p></td>
</tr>
</tbody>
</table>
<p>The <span class="inline_code">norm()</span> function returns a list of <span class="inline_code">n</span> random samples from the normal distribution.</p>
<p>The <span class="inline_code">pdf()</span> or probability density function returns the chance (<span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>) that a given value occurs in a normal distribution with the given <span class="inline_code">mean</span> and <span class="inline_code">stdev</span>.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">norm(n, mean=0.0, stdev=1.0) </pre><pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">pdf(x, mean=0.0, stdev=1.0)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import pdf
&gt;&gt;&gt; print sum(pdf(iq, mean=100, stdev=15) for iq in range(85, 115))
0.6825 </pre></div>
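<p>The density function is the standard Gaussian formula; a from-scratch version reproduces the 2/3 figure from the IQ example:</p>

```python
from math import exp, pi, sqrt

def pdf(x, mean=0.0, stdev=1.0):
    # Gaussian probability density with the given mean and standard deviation.
    return exp(-(x - mean) ** 2 / (2.0 * stdev ** 2)) / (stdev * sqrt(2 * pi))
```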
<h3><a name="histogram"></a>Histogram</h3>
<p>The <span class="inline_code">histogram()</span> function returns a dictionary <span class="inline_code">{(start,</span> <span class="inline_code">stop):</span> <span class="inline_code">[v1,</span> <span class="inline_code">v2,</span> ...<span class="inline_code">]}</span> with the values from the given list grouped into <em>k</em> equal intervals. It is an estimate of the distribution of the data set (e.g., which intervals have the most values).</p>
<pre class="brush:python; gutter:false; light:true;">histogram(iterable, k=10)</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.metrics import histogram
&gt;&gt;&gt;
&gt;&gt;&gt; s = [70, 85, 85, 100, 100, 100, 115, 115, 130]
&gt;&gt;&gt; for (i, j), values in sorted(histogram(s, k=5).items()):
&gt;&gt;&gt;     m = i + (j - i) / 2 # midpoint
&gt;&gt;&gt;     print i, j, m, values
70.0 82.0 76.0 [70]
82.0 94.0 88.0 [85, 85]
94.0 106.0 100.0 [100, 100, 100]
106.0 118.0 112.0 [115, 115]
118.0 130.0 124.0 [130] </pre></div>
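<p>Grouping into <em>k</em> equal-width bins is a short loop. A sketch that reproduces the intervals above (the maximum value is placed in the last bin, an edge-case assumption):</p>

```python
def histogram(values, k=10):
    # Group values into k equal-width intervals between min and max.
    lo, hi = min(values), max(values)
    w = (hi - lo) / float(k) or 1.0  # guard against all-equal values
    h = {(lo + i * w, lo + (i + 1) * w): [] for i in range(k)}
    keys = sorted(h)
    for v in values:
        i = min(int((v - lo) / w), k - 1)  # the max value falls in the last bin
        h[keys[i]].append(v)
    return h
```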
<h3><a name="moment"></a>Moment</h3>
<p>The <span class="inline_code">moment()</span> function returns the <em>n</em>-th central moment about the mean, where <span class="inline_code">n=2</span> is variance, <span class="inline_code">n=3</span> skewness and <span class="inline_code">n=4</span> kurtosis. Variance measures how <em>wide</em> the data is spread out. Skewness measures how <em>evenly</em> the data is spread out: <span class="inline_code">&gt;</span> <span class="inline_code">0</span> indicates fewer high values, <span class="inline_code">&lt;</span> <span class="inline_code">0</span> fewer low values. Kurtosis measures how tight the data is near the mean: <span class="inline_code">&gt;</span> <span class="inline_code">0</span> indicates fewer values near the mean (= more extreme values), <span class="inline_code">&lt;</span> <span class="inline_code">0</span> more values near the mean.</p>
<pre class="brush:python; gutter:false; light:true;">moment(iterable, n=2) # n=2 variance | 3 skewness | 4 kurtosis</pre><pre class="brush:python; gutter:false; light:true;">skewness(iterable) # &gt; 0 =&gt; fewer values over mean</pre><pre class="brush:python; gutter:false; light:true;">kurtosis(iterable) # &gt; 0 =&gt; fewer values near mean</pre><p>Skewness and kurtosis are <span class="inline_code">0.0</span> for the normal distribution:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.metrics import skewness
&gt;&gt;&gt; from random import gauss
&gt;&gt;&gt;
&gt;&gt;&gt; print skewness([gauss(100, 15) for i in xrange(100000)])
0.001 </pre></div>
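<p>The central moments and their standardized forms can be sketched from the definitions (skewness as the third moment over the second to the power 1.5, kurtosis as excess kurtosis, so that the normal distribution scores 0):</p>

```python
def moment(a, n=2):
    # n-th central moment about the mean.
    a = list(a)
    m = sum(a) / float(len(a))
    return sum((v - m) ** n for v in a) / len(a)

def skewness(a):
    # Standardized third moment: 0 for symmetric data.
    return moment(a, 3) / moment(a, 2) ** 1.5

def kurtosis(a):
    # Excess kurtosis: 0 for the normal distribution.
    return moment(a, 4) / moment(a, 2) ** 2 - 3
```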
<h3><a name="quantile"></a>Quantile &amp; box plot</h3>
<p>The <span class="inline_code">quantile()</span> function returns the interpolated value at point p (<span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>) in a sorted list of values. With <span class="inline_code">p=0.5</span> it returns the median. The parameters <span class="inline_code">a</span>, <span class="inline_code">b</span>, <span class="inline_code">c</span>, <span class="inline_code">d</span> refer to the algorithm by Hyndman and Fan <a href="http://stat.ethz.ch/R-manual/R-patched/library/stats/html/quantile.html" target="_blank">[1]</a>.</p>
<p>The <span class="inline_code">boxplot()</span> function returns a <span class="inline_code">(min,</span> <span class="inline_code">q1,</span> <span class="inline_code">q2,</span> <span class="inline_code">q3,</span> <span class="inline_code">max)</span>-tuple for a given list of values, where <span class="inline_code">q2</span> is the median, <span class="inline_code">q1</span>&nbsp;the quantile with <span class="inline_code">p=0.25</span> and <span class="inline_code">q3</span> the quantile with <span class="inline_code">p=0.75</span>, i.e., the 25-75% range around the median.&nbsp;This can be used to identify outliers. For example, if a sample of temperatures in your house comprises you (37°C), the cat (38°C), the refrigerator (5°C) and the oven (220°C), then the average temperature is 75°C. This average is of course misleading, since the oven is an outlier: it lies well outside the 25-75% range.</p>
<pre class="brush:python; gutter:false; light:true;">quantile(iterable, p=0.5, sort=True, a=1, b=-1, c=0, d=1)</pre><pre class="brush:python; gutter:false; light:true;">boxplot(iterable)</pre><div class="example">
<pre class="brush:python; gutter:false; light:true;">&gt;&gt;&gt; from pattern.metrics import boxplot
&gt;&gt;&gt; print boxplot([5, 37, 38, 220])
(5.0, 29.0, 37.5, 83.5, 220.0)</pre></div>
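<p>A plain-Python sketch of the interpolated quantile, using the Hyndman &amp; Fan parameterization described above. The defaults <span class="inline_code">a=1</span>, <span class="inline_code">b=-1</span>, <span class="inline_code">c=0</span>, <span class="inline_code">d=1</span> correspond to their type 7, the default in R. This is an illustration, not the library's code; the parameter name <span class="inline_code">values</span> is ours:</p>

```python
from math import modf

def quantile(values, p=0.5, a=1.0, b=-1.0, c=0.0, d=1.0):
    # interpolated quantile at point p, Hyndman & Fan parameterization;
    # the defaults are their type 7 (the R default)
    s = sorted(values)
    n = len(s)
    g, j = modf(a + (n + b) * p - 1)  # integer part j, fractional part g
    j = int(j)
    if j < 0:
        return float(s[0])
    if j >= n - 1:
        return float(s[-1])
    return s[j] + (s[j + 1] - s[j]) * (c + d * g)

def boxplot(values):
    # (min, q1, q2, q3, max) with q1-q3 the 25-50-75% quantiles
    s = sorted(values)
    q1, q2, q3 = (quantile(s, p) for p in (0.25, 0.5, 0.75))
    return (float(s[0]), q1, q2, q3, float(s[-1]))

print(boxplot([5, 37, 38, 220]))  # (5.0, 29.0, 37.5, 83.5, 220.0)
```

<p>This reproduces the tuple from the example above: the oven (220°C) lies far above <span class="inline_code">q3</span> <span class="inline_code">=</span> <span class="inline_code">83.5</span>.</p>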
<table class="border">
<tbody>
<tr>
<td style="text-align: center;"><img style="display: block; margin-left: auto; margin-right: auto;" src="../g/pattern-metrics-boxplot.jpg" alt="" width="398" height="137" /><span class="smallcaps">you, the cat, the fridge and the oven</span></td>
</tr>
</tbody>
</table>
<p><span class="small"><span style="text-decoration: underline;">Reference</span>: Adorio E. (2008) http://adorio-research.org/wordpress/?p=125</span></p>
<p>&nbsp;</p>
<hr />
<h2>Statistical tests</h2>
<h3><a name="fisher"></a>Fisher's exact test</h3>
<p>The <span class="inline_code">fisher()</span> function or <a href="http://en.wikipedia.org/wiki/Fisher's_exact_test">Fisher's exact test</a>&nbsp;can be used to test the contingency of a 2 x 2 classification. It returns probability <span class="inline_code">p</span> between <span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>, where <span class="inline_code">p</span> <span class="inline_code">&lt;</span> <span class="inline_code">0.05</span> is significant and <span class="inline_code">p</span> <span class="inline_code">&lt;</span> <span class="inline_code">0.01</span> is very significant.</p>
<p>Say that 96 pet owners were asked about their pet, and 29 of 46 men reported owning a dog and 30 of 50 women reported owning a cat. We have a 2 x 2 classification (cat or dog&nbsp;↔ man or woman)&nbsp;that we assume to be evenly distributed, i.e., we assume that men and women are equally fond of cats and dogs. This is the <em>null hypothesis</em>. But Fisher's exact test yields <span class="inline_code">p</span> <span class="inline_code">0.027</span> <span class="inline_code">&lt;</span> <span class="inline_code">0.05</span>, so we must <em>reject</em> the null hypothesis. There is a significant correlation between gender and pet ownership (women are more fond of cats).</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">fisher(a, b, c, d)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import fisher
&gt;&gt;&gt; print fisher(a=17, b=30, c=29, d=20)
0.027</pre></div>
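<p>Under the hood, Fisher's exact test sums hypergeometric probabilities over all 2 x 2 tables that have the same row and column totals as the observed one. A self-contained sketch of the two-tailed test (an illustration, not the library's implementation; it needs Python 3.8+ for <span class="inline_code">math.comb</span>):</p>

```python
from math import comb  # Python 3.8+

def fisher(a, b, c, d):
    # two-tailed Fisher's exact test on the 2 x 2 table [[a, b], [c, d]]
    r1, c1, n = a + b, a + c, a + b + c + d
    def p_table(k):
        # hypergeometric probability of the table with top-left cell k
        return comb(c1, k) * comb(n - c1, r1 - k) / comb(n, r1)
    p_obs = p_table(a)
    # sum every table with the same margins that is at most as likely
    # as the observed one (small tolerance for float rounding)
    return sum(p_table(k) for k in range(0, min(r1, c1) + 1)
               if p_table(k) <= p_obs * (1 + 1e-9))

p = fisher(17, 30, 29, 20)
print(p < 0.05)  # True: the correlation is significant
```

<p>Summing every table at most as likely as the observed one gives the two-tailed <span class="inline_code">p</span>; a one-tailed variant would sum only the tables at least as extreme in one direction.</p>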
<table class="border">
<tbody>
<tr>
<td>&nbsp;</td>
<td class="smallcaps" style="text-align: center;">men</td>
<td class="smallcaps" style="text-align: center;">women</td>
</tr>
<tr>
<td class="smallcaps" style="text-align: right;">cat owner</td>
<td style="text-align: center;"><span class="inline_code">17</span><span class="small"> (a)</span></td>
<td style="text-align: center;"><span class="inline_code">30</span><span class="small"> (b)</span></td>
</tr>
<tr>
<td class="smallcaps" style="text-align: right;">dog owner</td>
<td style="text-align: center;"><span class="inline_code">29</span><span class="small"> (c)</span></td>
<td style="text-align: center;"><span class="inline_code">20</span><span class="small"> (d)</span></td>
</tr>
</tbody>
</table>
<p class="small"><span style="text-decoration: underline;">Reference</span>: Edelson, J. &amp; Lester D. (1983). Personality and pet ownership: a preliminary study. <em>Psychological Reports</em>.</p>
<h3><a name="chi2"></a>Chi-squared test</h3>
<p>The <span class="inline_code">chi2()</span> function or <a href="http://en.wikipedia.org/wiki/Pearson's_chi-squared_test">Pearson's chi-squared test</a> can be used to test the contingency of an n x m classification. It returns an <span class="inline_code">(x2,</span>&nbsp;<span class="inline_code">p)</span>-tuple, where probability&nbsp;<span class="inline_code">p</span> <span class="inline_code">&lt;</span> <span class="inline_code">0.05</span> is significant and <span class="inline_code">p</span> <span class="inline_code">&lt;</span> <span class="inline_code">0.01</span> is very significant.&nbsp;The <span class="inline_code">observed</span> matrix is a list of lists of <span class="inline_code">int</span> values (i.e., absolute frequencies).&nbsp;By default, the <span class="inline_code">expected</span> matrix is evenly distributed over all classes, and&nbsp;<span class="inline_code">df</span> is&nbsp;<span class="inline_code">(n-1)</span> <span class="inline_code">*</span> <span class="inline_code">(m-1)</span>&nbsp;degrees of freedom.&nbsp;</p>
<p>Say that 255 pet owners aged 25–34, 35–44, 45–54 or 55+ were asked whether they owned a cat or a dog. We have an n x m classification (cat or dog&nbsp;↔ age group) that we assume to be evenly distributed, i.e., we assume that pet preference&nbsp;is unrelated to age.&nbsp;This is the <em>null hypothesis</em>. The chi-squared test for the data below yields <span class="inline_code">p</span> <span class="inline_code">0.89</span>&nbsp;<span class="inline_code">&gt;</span> <span class="inline_code">0.05</span>, so the null hypothesis cannot be rejected: the data shows no significant relation between age and pet preference.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">chi2(observed=[], expected=None, df=None)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import chi2
&gt;&gt;&gt; print chi2(observed=[[15, 22, 27, 21], [37, 40, 52, 41]])
(0.63, 0.89)</pre></div>
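<p>The statistic itself is straightforward to compute. In the sketch below (an illustration, not the library's code) the expected counts are derived from the row and column totals, the standard test of independence, which reproduces the numbers in the example; the p-value is the chi-squared survival function, evaluated here with the power series of the regularized incomplete gamma function:</p>

```python
from math import exp, log, lgamma

def chi2(observed):
    # Pearson's chi-squared statistic, with expected counts derived
    # from the row and column totals (test of independence)
    rows = [float(sum(r)) for r in observed]
    cols = [float(sum(c)) for c in zip(*observed)]
    n = sum(rows)
    x2 = 0.0
    for r, row in zip(rows, observed):
        for c, o in zip(cols, row):
            e = r * c / n  # expected count for this cell
            x2 += (o - e) ** 2 / e
    df = (len(rows) - 1) * (len(cols) - 1)
    return x2, chi2_p(x2, df)

def chi2_p(x2, df):
    # survival function P(X >= x2) of the chi-squared distribution,
    # via the power series of the lower regularized incomplete gamma
    a, x = df / 2.0, x2 / 2.0
    if x == 0.0:
        return 1.0
    term, total = 1.0 / a, 0.0
    for k in range(200):
        total += term
        term *= x / (a + k + 1)
    return 1.0 - total * exp(a * log(x) - x - lgamma(a))

print(chi2([[15, 22, 27, 21], [37, 40, 52, 41]]))  # ~ (0.63, 0.89)
```

<p>With two rows and four columns, <span class="inline_code">df</span> <span class="inline_code">=</span> <span class="inline_code">(2-1)</span> <span class="inline_code">*</span> <span class="inline_code">(4-1)</span> <span class="inline_code">=</span> <span class="inline_code">3</span> degrees of freedom.</p>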
<table class="border">
<tbody>
<tr>
<td>&nbsp;</td>
<td style="text-align: center;">25–34</td>
<td style="text-align: center;">35–44</td>
<td style="text-align: center;">45–54</td>
<td style="text-align: center;">55+</td>
</tr>
<tr>
<td class="smallcaps" style="text-align: right;">cat owner</td>
<td class="inline_code" style="text-align: center;">15</td>
<td class="inline_code" style="text-align: center;">22</td>
<td class="inline_code" style="text-align: center;">27</td>
<td class="inline_code" style="text-align: center;">21</td>
</tr>
<tr>
<td class="smallcaps" style="text-align: right;">dog owner</td>
<td class="inline_code" style="text-align: center;">37</td>
<td class="inline_code" style="text-align: center;">40</td>
<td class="inline_code" style="text-align: center;">52</td>
<td class="inline_code" style="text-align: center;">41</td>
</tr>
</tbody>
</table>
<h3><a name="ks2"></a>Kolmogorov-Smirnov test</h3>
<p>The <span class="inline_code">ks2()</span> function or <a href="http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">two-sample Kolmogorov-Smirnov test</a> can be used to test if two samples are drawn from the same distribution. It returns a <span class="inline_code">(d,</span> <span class="inline_code">p)</span>-tuple with maximum distance <span class="inline_code">d</span> and probability <span class="inline_code">p</span>&nbsp;(<span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>). By default, the second sample <span class="inline_code">a2</span> is <span class="inline_code">NORMAL</span>, i.e., a list with <span class="inline_code">n</span> values from&nbsp;<span class="inline_code">gauss(mean(a1),</span> <span class="inline_code">stdev(a1))</span>.</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">ks2(a1, a2=NORMAL, n=1000)</pre><div class="example">
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">&gt;&gt;&gt; from pattern.metrics import ks2
&gt;&gt;&gt; ks2([70, 85, 85, 100, 100, 100, 115, 115, 130], n=10000)
(0.17, 0.94)&nbsp;</pre></div>
<p>The values in the given list appear to be normally distributed (bell-shaped): <span class="inline_code">d</span> is small and <span class="inline_code">p</span> is high, so there is no evidence that the sample differs from a normal distribution.</p>
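<p>The statistic <span class="inline_code">d</span> is the maximum vertical distance between the two empirical distribution functions. A sketch of the two-sample test (an illustration, not the library's code; the p-value uses the asymptotic approximation given in <em>Numerical Recipes</em>):</p>

```python
from math import exp, sqrt

def ks2(a1, a2):
    # d = maximum distance between the two empirical CDFs,
    # found by merging the two sorted samples
    a1, a2 = sorted(a1), sorted(a2)
    n1, n2 = len(a1), len(a2)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x1, x2 = a1[i], a2[j]
        if x1 <= x2:
            i += 1
        if x2 <= x1:
            j += 1
        d = max(d, abs(i / float(n1) - j / float(n2)))
    # asymptotic p-value (Numerical Recipes approximation)
    ne = n1 * n2 / float(n1 + n2)
    z = (sqrt(ne) + 0.12 + 0.11 / sqrt(ne)) * d
    if z < 1e-9:
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * z * z)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

print(ks2([1, 2, 3, 4], [3, 4, 5, 6])[0])  # 0.5
```

<p>For two clearly shifted samples <span class="inline_code">d</span> is large; two identical samples give <span class="inline_code">d</span> <span class="inline_code">=</span> <span class="inline_code">0.0</span>. The asymptotic p-value is only a rough guide for very small samples.</p>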
<p>&nbsp;</p>
<hr />
<h2>See also</h2>
<ul>
<li><a href="http://www.scipy.org/">Scipy</a> (BSD): scientific computing for Python.</li>
</ul>
</div>
</div></div>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
SyntaxHighlighter.all();
</script>
</body>
</html>