|
|
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
|
|
<html>
|
|
|
<head>
|
|
|
<title>pattern-web</title>
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
|
|
<link type="text/css" rel="stylesheet" href="../clips.css" />
|
|
|
<style>
|
|
|
/* Small fixes because we omit the online layout.css. */
|
|
|
h3 { line-height: 1.3em; }
|
|
|
#page { margin-left: auto; margin-right: auto; }
|
|
|
#header, #header-inner { height: 175px; }
|
|
|
#header { border-bottom: 1px solid #C6D4DD; }
|
|
|
table { border-collapse: collapse; }
|
|
|
#checksum { display: none; }
|
|
|
</style>
|
|
|
<link href="../js/shCore.css" rel="stylesheet" type="text/css" />
|
|
|
<link href="../js/shThemeDefault.css" rel="stylesheet" type="text/css" />
|
|
|
<script language="javascript" src="../js/shCore.js"></script>
|
|
|
<script language="javascript" src="../js/shBrushXml.js"></script>
|
|
|
<script language="javascript" src="../js/shBrushJScript.js"></script>
|
|
|
<script language="javascript" src="../js/shBrushPython.js"></script>
|
|
|
</head>
|
|
|
<body class="node-type-page one-sidebar sidebar-right section-pages">
|
|
|
<div id="page">
|
|
|
<div id="page-inner">
|
|
|
<div id="header"><div id="header-inner"></div></div>
|
|
|
<div id="content">
|
|
|
<div id="content-inner">
|
|
|
<div class="node node-type-page"
|
|
|
<div class="node-inner">
|
|
|
<div class="breadcrumb">View online at: <a href="http://www.clips.ua.ac.be/pages/pattern-web" class="noexternal" target="_blank">http://www.clips.ua.ac.be/pages/pattern-web</a></div>
|
|
|
<h1>pattern.web</h1>
|
|
|
<!-- Parsed from the online documentation. -->
|
|
|
<div id="node-1355" class="node node-type-page"><div class="node-inner">
|
|
|
<div class="content">
|
|
|
<p class="big">The pattern.web module has tools for online data mining: asynchronous requests, a uniform API for web services (Google, Bing, Twitter, Facebook, Wikipedia, Wiktionary, Flickr, RSS), a HTML DOM parser, HTML tag stripping functions, a web crawler, webmail, caching, Unicode support.</p>
|
|
|
<p>It can be used by itself or with other <a href="pattern.html">pattern</a> modules: web | <a href="pattern-db.html">db</a> | <a href="pattern-en.html">en</a> | <a href="pattern-search.html">search</a> | <a href="pattern-vector.html">vector</a> | <a href="pattern-graph.html">graph</a>.</p>
|
|
|
<p><img src="../g/pattern_schema.gif" alt="" width="620" height="180" /></p>
|
|
|
<hr />
|
|
|
<h2>Documentation</h2>
|
|
|
<ul>
|
|
|
<li><a href="#URL">URLs</a></li>
|
|
|
<li><a href="#asynchronous">Asynchronous requests</a></li>
|
|
|
<li><a href="#services">Search engine + web services</a> <span class="smallcaps link-maintenance">(<a href="#google">google</a>, <a href="#google">bing</a>, <a href="#twitter">twitter</a>, <a href="#facebook">facebook</a>, <a href="#wikipedia">wikipedia</a>, flickr)</span></li>
|
|
|
<li><a href="#sort">Web sort</a></li>
|
|
|
<li><a href="#plaintext">HTML to plaintext</a></li>
|
|
|
<li><a href="#DOM">HTML DOM parser</a></li>
|
|
|
<li><a href="#pdf">PDF parser</a></li>
|
|
|
<li><a href="#crawler">Crawler</a></li>
|
|
|
<li><a href="#mail">E-mail</a></li>
|
|
|
<li><a href="#locale">Locale</a></li>
|
|
|
<li><a href="#cache">Cache</a></li>
|
|
|
</ul>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="URL"></a>URLs</h2>
|
|
|
<p>The <span class="inline_code">URL</span> object is a subclass of Python's <span class="inline_code">urllib2.Request</span> that can be used to connect to a web address. The <span class="inline_code">URL.download()</span> method can be used to retrieve the content (e.g., HTML source code). The constructor's <span class="inline_code">method</span> parameter defines how <span class="inline_code">query</span> data is encoded:</p>
|
|
|
<ul>
|
|
|
<li><span class="inline_code">GET</span>: query data is encoded in the URL string (usually for retrieving data).</li>
|
|
|
<li><span class="inline_code">POST</span>: query data is encoded in the message body (for posting data).</li>
|
|
|
</ul>
|
|
|
<pre class="brush:python; gutter:false; light:true;">url = URL(string='', method=GET, query={})
|
|
|
</pre><pre class="brush:python; gutter:false; light:true;">url.string # u'http://user:pw@domain.com:30/path/page?p=1#anchor'
|
|
|
url.parts # Dictionary of attributes:</pre><pre class="brush:python; gutter:false; light:true;">url.protocol # u'http'
|
|
|
url.username # u'user'
|
|
|
url.password # u'pw'
|
|
|
url.domain # u'domain.com'
url.port # 30
url.path # [u'path']
url.page # u'page'
|
|
|
url.query # {u'p': 1}
|
|
|
url.querystring # u'p=1'
|
|
|
url.anchor # u'anchor'</pre><pre class="brush:python; gutter:false; light:true;">url.exists # False if URL.open() raises an HTTP404NotFound.
|
|
|
url.redirect # Actual URL after redirection, or None.
|
|
|
url.headers # Dictionary of HTTP response headers.
|
|
|
url.mimetype # Document MIME-type.</pre><pre class="brush:python; gutter:false; light:true;">url.open(timeout=10, proxy=None)
|
|
|
url.download(timeout=10, cached=True, throttle=0, proxy=None, unicode=False)
|
|
|
url.copy() </pre><ul>
|
|
|
<li><span class="inline_code">URL()</span> expects a string that starts with a valid protocol (e.g. <span class="inline_code">http://</span>).<span class="inline_code"> </span></li>
|
|
|
<li><span class="inline_code">URL.open()</span> returns a connection from which data can be retrieved with <span class="inline_code">connection.read()</span>.</li>
|
|
|
<li><span class="inline_code">URL.download()</span> caches and returns the retrieved data. <br />It raises a <span class="inline_code">URLTimeout</span> if the download time exceeds the given <span class="inline_code">timeout</span>.<br />It sleeps for <span class="inline_code">throttle</span> seconds after the download is complete.<br />A proxy server can be given as a <span class="inline_code">(host,</span> <span class="inline_code">protocol)</span>-tuple, e.g., <span class="inline_code">('proxy.com',</span> <span class="inline_code">'https')</span>.<br />With <span class="inline_code">unicode=True</span>, returns the data as a Unicode string. By default it is <span class="inline_code">False</span> because the data can be binary (e.g., JPEG, ZIP) but <span class="inline_code">unicode=True</span> is advised for HTML.</li>
|
|
|
</ul>
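<p>For example, a minimal sketch of a <span class="inline_code">POST</span> request (the target URL and form field are illustrative):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, POST
>>>
>>> url = URL('http://www.clips.ua.ac.be/', method=POST, query={'q': 'cats'})
>>> html = url.download(unicode=True) # Query data is sent in the message body.</pre></div>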
|
|
|
<p>The example below downloads an image. <br />The <span class="inline_code">extension()</span> helper function parses the file extension from a file name:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, extension
|
|
|
>>>
|
|
|
>>> url = URL('http://www.clips.ua.ac.be/media/pattern_schema.gif')
|
|
|
>>> f = open('test' + extension(url.page), 'wb') # save as test.gif
|
|
|
>>> f.write(url.download())
|
|
|
>>> f.close()</pre></div>
|
|
|
<h3>URL downloads</h3>
|
|
|
<p>The <span class="inline_code">download()</span> function takes a URL string, calls <span class="inline_code">URL.download()</span> and returns the retrieved data. It takes the same optional parameters as <span class="inline_code">URL.download()</span>. This saves you a line of code.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import download
|
|
|
>>> html = download('http://www.clips.ua.ac.be/', unicode=True)</pre></div>
|
|
|
<h3>URL mime-type</h3>
|
|
|
<p>The <span class="inline_code">URL.mimetype</span> can be used to check the type of document at the given URL. This is more reliable than sniffing the filename extension (which may be omitted).</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern import URL, MIMETYPE_IMAGE
|
|
|
>>>
|
|
|
>>> url = URL('http://www.clips.ua.ac.be/media/pattern_schema.gif')
|
|
|
>>> print url.mimetype in MIMETYPE_IMAGE
|
|
|
|
|
|
True</pre></div>
|
|
|
<table class="border">
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<td><span class="smallcaps">Global</span></td>
|
|
|
<td><span class="smallcaps">Value</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_WEBPAGE</span></td>
|
|
|
<td><span class="inline_code">['text/html']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_STYLESHEET</span></td>
|
|
|
<td><span class="inline_code">['text/css']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_PLAINTEXT</span></td>
|
|
|
<td><span class="inline_code">['text/plain']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_PDF</span></td>
|
|
|
<td><span class="inline_code">['application/pdf']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_NEWSFEED</span></td>
|
|
|
<td><span class="inline_code">['application/rss+xml', 'application/atom+xml']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_IMAGE</span></td>
|
|
|
<td><span class="inline_code">['image/gif', 'image/jpeg', 'image/png']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_AUDIO</span></td>
|
|
|
<td><span class="inline_code">['audio/mpeg', 'audio/mp4', 'audio/x-wav']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_VIDEO</span></td>
|
|
|
<td><span class="inline_code">['video/mpeg', 'video/mp4', 'video/avi', 'video/quicktime']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_ARCHIVE</span></td>
|
|
|
<td><span class="inline_code">['application/x-tar', 'application/zip']</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">MIMETYPE_SCRIPT</span></td>
|
|
|
<td><span class="inline_code">['application/javascript']</span></td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
<h3>URL exceptions</h3>
|
|
|
<p>The <span class="inline_code">URL.open()</span> and <span class="inline_code">URL.download()</span> methods raise a <span class="inline_code">URLError</span> if an error occurs (e.g., no internet connection, server is down). <span class="inline_code">URLError</span> has a number of subclasses:</p>
|
|
|
<table class="border">
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<td><span class="smallcaps">Exception</span></td>
|
|
|
<td><span class="smallcaps">Description</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">URLError</span></td>
|
|
|
<td>URL has errors (e.g. a missing <span class="inline_code">t</span> in <span class="inline_code">htp://</span>)</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">URLTimeout</span></td>
|
|
|
<td>URL takes too long to load.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTPError</span></td>
|
|
|
<td>URL causes an error on the contacted server.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTP301Redirect</span></td>
|
|
|
<td>URL causes too many redirects.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTP400BadRequest</span></td>
|
|
|
<td>URL contains an invalid request.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTP401Authentication</span></td>
|
|
|
<td>URL requires a login and a password.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTP403Forbidden</span></td>
|
|
|
<td>URL is not accessible (check user-agent).</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTP404NotFound</span></td>
|
|
|
<td>URL doesn't exist.</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">HTTP500InternalServerError</span></td>
|
|
|
<td>URL causes an error (bug?) on the server.</td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
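<p>Each of these can be caught separately, with <span class="inline_code">URLError</span> as a catch-all. A minimal sketch (the URL is illustrative):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, URLError, HTTP404NotFound
>>>
>>> try:
>>>     html = URL('http://www.clips.ua.ac.be/no-such-page').download()
>>> except HTTP404NotFound:
>>>     html = '' # Page does not exist.
>>> except URLError:
>>>     html = '' # Timeout, no connection, server error, ...</pre></div>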
|
|
|
<h3>User-agent and referrer</h3>
|
|
|
<p>The <span class="inline_code">URL.open()</span> and <span class="inline_code">URL.download()</span> methods have two optional parameters <span class="inline_code">user_agent</span> and <span class="inline_code">referrer</span>, which can be used to identify the application accessing the web. Some websites include code to block out any application except browsers. By setting a <span class="inline_code">user_agent</span> you can make the application appear as a browser. This is called <em>spoofing</em> and it is not encouraged, but sometimes necessary.</p>
|
|
|
<p>For example, to pose as a Firefox browser:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> URL('http://www.clips.ua.ac.be').download(user_agent='Mozilla/5.0')
|
|
|
</pre></div>
|
|
|
<h3>Find URLs</h3>
|
|
|
<p>The <span class="inline_code">find_urls()</span> function can be used to parse URLs from a text string. It will retrieve a list of links starting with <span class="inline_code">http://</span>, <span class="inline_code">https://</span>, <span class="inline_code">www.</span> and domain names ending with <span class="inline_code">.com</span>, <span class="inline_code">.org</span>. <span class="inline_code">.net</span>. It will detect and strip leading punctuation (open parens) and trailing punctuation (period, comma, close parens). Similarly, the <span class="inline_code">find_email()</span> function can be used to parse e-mail addresses from a string.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import find_urls
|
|
|
>>> print find_urls('Visit our website (www.clips.ua.ac.be)', unique=True)
|
|
|
|
|
|
['www.clips.ua.ac.be']
|
|
|
</pre></div>
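<p>A similar sketch for e-mail addresses (the address and output are illustrative):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import find_email
>>> print find_email('Contact: info@clips.ua.ac.be')

['info@clips.ua.ac.be']</pre></div>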
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="asynchronous"></a>Asynchronous requests</h2>
|
|
|
<p>The <span class="inline_code">asynchronous()</span> function can be used to execute a function "in the background" (i.e., threaded). It takes the function, its arguments and optional keyword arguments. It returns an <span class="inline_code">AsynchronousRequest</span> object that contains the function's return value (when done). The main program does not halt in the meantime.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">request = asynchronous(function, *args, **kwargs)</pre><pre class="brush:python; gutter:false; light:true;">request.done # True when the function is done.
|
|
|
request.elapsed # Running time, in seconds.
|
|
|
request.value # Function return value when done (or None).
|
|
|
request.error # Function Exception (or None).
|
|
|
</pre><pre class="brush:python; gutter:false; light:true;">request.now() # Waits for function and returns its value.
|
|
|
</pre><p>The example below executes a Google query without halting the main program. Instead, it displays a "busy" message (e.g., a progress bar updated in the application's event loop) until <span class="inline_code">request.done</span>.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import asynchronous, time, Google
|
|
|
>>>
|
|
|
>>> request = asynchronous(Google().search, 'holy grail', timeout=4)
|
|
|
>>> while not request.done:
|
|
|
>>> time.sleep(0.1)
|
|
|
>>> print 'busy...'
|
|
|
>>> print request.value
|
|
|
</pre></div>
|
|
|
<p>There is no way to stop a thread. You are responsible for ensuring that the given function doesn't hang.</p>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="services"></a>Search engine + web services</h2>
|
|
|
<p>The <span class="inline_code">SearchEngine</span> object has a number of subclasses that can be used to query different web services (e.g., Google, Wikipedia). <span class="inline_code">SearchEngine.search()</span> returns a list of <span class="inline_code">Result</span> objects for a given query string – similar to a search field and a results page in a browser.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">engine = SearchEngine(license=None, throttle=1.0, language=None)</pre><pre class="brush:python; gutter:false; light:true;">engine.license # Service license key.
|
|
|
engine.throttle # Time between requests (being nice to server).
|
|
|
engine.language # Restriction for Result.language (e.g., 'en').</pre><pre class="brush:python; gutter:false; light:true;">engine.search(query,
|
|
|
type = SEARCH, # SEARCH | IMAGE | NEWS
|
|
|
start = 1, # Starting page.
|
|
|
count = 10, # Results per page.
|
|
|
size = None, # Image size: TINY | SMALL | MEDIUM | LARGE
|
|
|
cached = True) # Cache locally?</pre><p><span class="small"><span style="text-decoration: underline;">Note</span>: <span class="inline_code">SearchEngine.search()</span> takes the same optional parameters as <span class="inline_code">URL.download()</span>.</span></p>
|
|
|
<h3>Google, Bing, Twitter, Facebook, Wikipedia, Flickr</h3>
|
|
|
<p><span class="inline_code">SearchEngine</span> is subclassed by <span class="inline_code">Google</span>, <span class="inline_code">Yahoo</span>, <span class="inline_code">Bing</span>, <span class="inline_code">DuckDuckGo</span>, <span class="inline_code">Twitter</span>, <span class="inline_code">Facebook</span>, <span class="inline_code">Wikipedia</span>, <span class="inline_code">Wiktionary</span>, <span class="inline_code">Wikia</span>, <span class="inline_code">DBPedia</span>, <span class="inline_code">Flickr</span> and <span class="inline_code">Newsfeed</span>. The constructors take the same parameters:</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">engine = Google(license=None, throttle=0.5, language=None)</pre><pre class="brush:python; gutter:false; light:true;">engine = Bing(license=None, throttle=0.5, language=None)</pre><pre class="brush:python; gutter:false; light:true;">engine = Twitter(license=None, throttle=0.5, language=None)</pre><pre class="brush:python; gutter:false; light:true;">engine = Facebook(license=None, throttle=1.0, language='en')</pre><pre class="brush:python; gutter:false; light:true;">engine = Wikipedia(license=None, throttle=5.0, language=None)</pre><pre class="brush:python; gutter:false; light:true;">engine = Flickr(license=None, throttle=5.0, language=None)</pre><p>Each search engine has different settings for the <span class="inline_code">search()</span> method. For example, <span class="inline_code">Twitter.search()</span> returns up to 3000 results for a given query (30 queries with 100 results each, or 300 queries with 10 results each). It has a limit of 150 queries per 15 minutes. Each call to <span class="inline_code">search()</span> counts as one query.</p>
|
|
|
<table class="border">
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<td><span class="smallcaps">Engine</span></td>
|
|
|
<td><span class="smallcaps">type</span></td>
|
|
|
<td><span class="smallcaps">start</span></td>
|
|
|
<td><span class="smallcaps">count</span></td>
|
|
|
<td><span class="smallcaps">limit</span></td>
|
|
|
<td><span class="smallcaps">throttle</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Google</span></td>
|
|
|
<td><span class="inline_code">SEARCH<sup>1</sup></span></td>
|
|
|
<td>1-100/<span class="inline_code">count</span></td>
|
|
|
<td>1-10</td>
|
|
|
<td><span class="smallcaps">paid</span></td>
|
|
|
<td>0.5</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Bing</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span> <span class="inline_code">|</span> <span class="inline_code">NEWS</span> <span class="inline_code">|</span> <span class="inline_code">IMAGE</span><sup>12</sup></td>
|
|
|
<td>1-1000/<span class="inline_code">count</span></td>
|
|
|
<td>1-50</td>
|
|
|
<td class="smallcaps">paid</td>
|
|
|
<td>0.5</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Yahoo</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span> <span class="inline_code">|</span> <span class="inline_code">NEWS</span> <span class="inline_code">|</span> <span class="inline_code">IMAGE</span><sup>13</sup></td>
|
|
|
<td>1-1000/<span class="inline_code">count</span></td>
|
|
|
<td>1-50</td>
|
|
|
<td class="smallcaps">paid</td>
|
|
|
<td>0.5</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">DuckDuckGo</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span></td>
|
|
|
<td>1</td>
|
|
|
<td>-</td>
|
|
|
<td class="smallcaps">-</td>
|
|
|
<td>0.5</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Twitter</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span></td>
|
|
|
<td>1-3000/<span class="inline_code">count</span></td>
|
|
|
<td>1-100</td>
|
|
|
<td>600/hour</td>
|
|
|
<td>0.5</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Facebook</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span> <span class="inline_code">|</span> <span class="inline_code">NEWS</span></td>
|
|
|
<td>1</td>
|
|
|
<td>1-100</td>
|
|
|
<td>500/hour</td>
|
|
|
<td>1.0</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Wikipedia</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span></td>
|
|
|
<td>1</td>
|
|
|
<td>1</td>
|
|
|
<td>-</td>
|
|
|
<td>5.0</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Wiktionary</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span></td>
|
|
|
<td>1</td>
|
|
|
<td>1</td>
|
|
|
<td>-</td>
|
|
|
<td>5.0</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Wikia</span></td>
|
|
|
<td><span class="inline_code">SEARCH</span></td>
|
|
|
<td>1</td>
|
|
|
<td>1</td>
|
|
|
<td>-</td>
|
|
|
<td>5.0</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">DBPedia</span></td>
|
|
|
<td><span class="inline_code">SPARQL</span></td>
|
|
|
<td>1+</td>
|
|
|
<td>1-1000</td>
|
|
|
<td>10/sec</td>
|
|
|
<td>1.0</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Flickr<br /></span></td>
|
|
|
<td><span class="inline_code">IMAGE</span></td>
|
|
|
<td>1+</td>
|
|
|
<td>1-500</td>
|
|
|
<td>-</td>
|
|
|
<td>5.0</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td><span class="inline_code">Newsfeed</span></td>
|
|
|
<td><span class="inline_code">NEWS</span></td>
|
|
|
<td>1</td>
|
|
|
<td>1+</td>
|
|
|
<td>?</td>
|
|
|
<td>1.0</td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
<p><span class="small"><sup>1 </sup><span class="inline_code">Google</span>, <span class="inline_code">Bing</span> and <span class="inline_code">Yahoo</span> are paid services – see further how to obtain a license key.<br /></span> <span class="small"><sup>2 </sup><span class="inline_code">Bing.search(type=NEWS)</span> has a <span class="inline_code">count</span> of 1-15.<br /></span> <span class="small"><sup>3 </sup><span class="inline_code">Yahoo.search(type=IMAGES)</span> has a <span class="inline_code">count</span> of 1-35.</span><br /> <span class="smallcaps"><br /><a name="license"></a>Web service license key</span></p>
|
|
|
<p>Some services require a license key. They may work without one, but this implies that you share a public license key (and query limit) with other users of the pattern.web module. If the query limit is exceeded, <span class="inline_code">SearchEngine.search()</span> raises a <span class="inline_code">SearchEngineLimitError</span>.</p>
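<p>The limit error can be caught like any other exception; a minimal sketch:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Twitter, SearchEngineLimitError
>>>
>>> try:
>>>     results = Twitter().search('pattern')
>>> except SearchEngineLimitError:
>>>     results = [] # Wait a while before trying again.</pre></div>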
|
|
|
<ul>
|
|
|
<li><span class="inline_code">Google</span> is a paid service ($1 for 200 queries), with a 100 free queries per day. When you obtain a license key (follow the link below), activate "Custom Search API" and "Translate API" under "Services" and look up the key under "API Access".</li>
|
|
|
<li><span class="inline_code">Bing</span> is a paid service ($1 for 500 queries), with a 5,000 free queries per month.</li>
|
|
|
<li><span class="inline_code">Yahoo</span> is a paid service ($1 for 1250 queries) that requires an OAuth key + secret, which can be passed as a tuple: <span class="inline_code">Yahoo(license=(key,</span> <span class="inline_code">secret))</span>.</li>
|
|
|
</ul>
|
|
|
<p>Obtain a license key: <a href="https://code.google.com/apis/console/" target="_blank">Google</a>, <a href="https://datamarket.azure.com/dataset/5BA839F1-12CE-4CCE-BF57-A49D98D29A44" target="_blank">Bing</a>, <a href="http://developer.yahoo.com/search/boss/" target="_blank">Yahoo</a>, <a href="https://apps.twitter.com/app/new" target="_blank">Twitter</a>, <a href="/pattern-facebook" target="_blank">Facebook</a>, <a href="http://www.flickr.com/services/api/keys/" target="_blank">Flickr</a>.<br /><span class="smallcaps"><br />Web service request throttle</span></p>
|
|
|
<p>A <span class="inline_code">SearchEngine.search()</span> request takes a minimum amount of time to complete, as outlined in the table above. This is intended as etiquette towards the server providing the service. Raise the <span class="inline_code">throttle</span> value if you plan to run multiple queries in batch. Wikipedia requests are especially intensive. If you plan to mine a lot of data from Wikipedia, download the <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">Wikipedia database</a> instead.</p>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2>Search engine results</h2>
|
|
|
<p><span class="inline_code">SearchEngine.search()</span> returns a list of <span class="inline_code">Result</span> objects. It has an additional <span class="inline_code">total</span> property, which is the total number of results available for the given query. Each <span class="inline_code">Result</span> is a dictionary with extra properties:</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">result = Result(url)</pre><pre class="brush:python; gutter:false; light:true;">result.url # URL of content associated with the given query.
|
|
|
result.title # Content title.
|
|
|
result.text # Content summary.
|
|
|
result.language # Content language.
|
|
|
result.author # For news items and images.
|
|
|
result.date # For news items.</pre><pre class="brush:python; gutter:false; light:true;">result.download(timeout=10, cached=True, proxy=None)
|
|
|
</pre><ul>
|
|
|
<li><span class="inline_code">Result.download()</span> takes the same optional parameters as <span class="inline_code">URL.download()</span>.</li>
|
|
|
<li>The attributes (e.g., <span class="inline_code">result.text</span>) are Unicode strings.</li>
|
|
|
</ul>
|
|
|
<p><a name="google"></a>For example:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Bing, SEARCH, plaintext
|
|
|
>>>
|
|
|
>>> engine = Bing(license=None) # Enter your license key.
|
|
|
>>> for i in range(1,5):
|
|
|
>>> for result in engine.search('holy handgrenade', type=SEARCH, start=i):
|
|
|
>>> print repr(plaintext(result.text))
|
|
|
>>> print
|
|
|
|
|
|
u"The Holy Hand Grenade of Antioch is a fictional weapon from ..."
|
|
|
u'Once the number three, being the third number, be reached, then ...'
|
|
|
</pre></div>
|
|
|
<p>Since <span class="inline_code">SearchEngine.search()</span> takes the same optional parameters as <span class="inline_code">URL.download()</span> it is easy to disable local caching, set a proxy server, a throttle (minimum time) or a timeout (maximum time).</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import Google
|
|
|
>>>
|
|
|
>>> engine = Google(license=None) # Enter your license key.
|
|
|
>>> for result in engine.search('tim', cached=False, proxy=('proxy.com', 'https')):
|
|
|
>>> print result.url
|
|
|
>>> print result.text</pre></div>
|
|
|
<p><span class="smallcaps"><br />Image search</span></p>
|
|
|
<p>For <span class="inline_code">Flickr</span>, <span class="inline_code">Bing</span> and <span class="inline_code">Yahoo</span>, image URLs retrieved with <span class="inline_code">search(type=IMAGE)</span> can be filtered by setting the <span class="inline_code">size</span> to <span class="inline_code">TINY</span>, <span class="inline_code">SMALL</span>, <span class="inline_code">MEDIUM</span>, <span class="inline_code">LARGE</span> or <span class="inline_code">None</span> (any size). Images may be subject to copyright.</p>
|
|
|
<p>For <span class="inline_code">Flickr</span>, use <span class="inline_code">search(copyright=False)</span> to retrieve results with no copyright restrictions (either public domain or Creative Commons <a href="http://creativecommons.org/licenses/by-sa/2.0/">by-sa</a>).</p>
|
|
|
<p>For <span class="inline_code">Twitter</span>, each result has a <span class="inline_code">Result.profile</span> property with the URL to the user's profile picture.</p>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2>Google translate</h2>
|
|
|
<p><span class="inline_code">Google.translate()</span> returns the translated string in the given language.<br /><span class="inline_code">Google.identify()</span> returns a <span class="inline_code">(language</span> <span class="inline_code">code,</span> <span class="inline_code">confidence)</span>-tuple for a given string.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Google
|
|
|
>>>
|
|
|
>>> s = "C'est un lapin, lapin de bois. Quoi? Un cadeau."
|
|
|
>>> g = Google()
|
|
|
>>> print g.translate(s, input='fr', output='en', cached=False)
|
|
|
>>> print g.identify(s)
|
|
|
|
|
|
u"It's a rabbit, wood. What? A gift."
|
|
|
(u'fr', 0.76) </pre></div>
|
|
|
<p>Remember to activate the Translate API in the <a href="https://code.google.com/apis/console" target="_blank">Google API Console</a>. Max. 1,000 characters per request.</p>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="twitter"></a>Twitter search</h2>
|
|
|
<p>The <span class="inline_code">start</span> parameter of <span class="inline_code">Twitter.search()</span> takes an <span class="inline_code">int</span> (= the starting page, cfr. other search engines) or a <span class="inline_code">tweet.id</span>. If you create two <span class="inline_code">Twitter</span> objects, their result pages for a given query may not correspond, since new tweets become available more quickly than we can query pages. The best way is to pass the last seen tweet id:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import Twitter
|
|
|
>>>
|
|
|
>>> t = Twitter()
|
|
|
>>> i = None
|
|
|
>>> for j in range(3):
|
|
|
>>> for tweet in t.search('win', start=i, count=10):
|
|
|
>>> print tweet.text
|
|
|
>>> print
|
|
|
>>> i = tweet.id</pre></div>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2>Twitter streams</h2>
|
|
|
<p><span class="inline_code">Twitter.stream()</span> returns an endless, live stream of <span class="inline_code">Result</span> objects. A <span class="inline_code">Stream</span> is a Python list that accumulates each time <span class="inline_code">Stream.update()</span> is called:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import Twitter
|
|
|
>>>
|
|
|
>>> s = Twitter().stream('#fail')
|
|
|
>>> for i in range(10):
|
|
|
>>> time.sleep(1)
|
|
|
>>> s.update(bytes=1024)
|
|
|
>>> print s[-1].text if s else ''</pre></div>
|
|
|
<p>To clear the accumulated list, call <span class="inline_code">Stream.clear()</span>.</p>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2>Twitter trends</h2>
|
|
|
<p><span class="inline_code">Twitter.trends()</span> returns a list of 10 "trending topics":</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Twitter
|
|
|
>>> print Twitter().trends(cached=False)
|
|
|
|
|
|
[u'#neverunderstood', u'Not Top 10', ...]</pre></div>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="wikipedia"></a>Wikipedia articles</h2>
|
|
|
<p><span class="inline_code">Wikipedia.search()</span> returns a single <span class="inline_code">WikipediaArticle</span> for the given (case-sensitive) query, which is the title of an article. <span class="inline_code">Wikipedia.index()</span> returns an iterator over all article titles on Wikipedia. The <span class="inline_code">language</span> parameter of the <span class="inline_code">Wikipedia()</span>defines the language of the returned articles (by default it is <span class="inline_code">"en"</span>, which corresponds to <a href="http://en.wikipedia.org/" target="_blank">en.wikipedia.org</a>).</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">article = WikipediaArticle(title='', source='', links=[])</pre><pre class="brush:python; gutter:false; light:true;">article.source # Article HTML source.
|
|
|
article.string # Article plaintext unicode string.</pre><pre class="brush:python; gutter:false; light:true;">article.title # Article title.
|
|
|
article.sections # Article sections.
|
|
|
article.links # List of titles of linked articles.
|
|
|
article.external # List of external links.
|
|
|
article.categories # List of categories.
|
|
|
article.media # List of linked media (images, sounds, ...)
|
|
|
article.languages # Dictionary of (language, article)-items.
|
|
|
article.language # Article language (i.e., 'en').
|
|
|
article.disambiguation # True if it is a disambiguation page.</pre><pre class="brush:python; gutter:false; light:true;">article.plaintext(**kwargs) # See plaintext() for parameters overview.
|
|
|
article.download(media, **kwargs)
|
|
|
</pre><p><span class="inline_code">WikipediaArticle.plaintext()</span> is similar to <span class="inline_code">plaintext()</span>, with special attention for MediaWiki markup. It strips metadata, infoboxes, table of contents, annotations, thumbnails and disambiguation links.</p>
|
|
|
<h3>Wikipedia article sections</h3>
|
|
|
<p><span class="inline_code">WikipediaArticle.sections</span> is a list of <span class="inline_code">WikipediaSection</span> objects. Each section has a title and a number of paragraphs that belong together.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">section = WikipediaSection(article, title='', start=0, stop=0, level=1)</pre><pre class="brush:python; gutter:false; light:true;">section.article # WikipediaArticle parent.
|
|
|
section.parent # WikipediaSection this section is part of.
|
|
|
section.children # WikipediaSections belonging to this section.</pre><pre class="brush:python; gutter:false; light:true;">section.title # Section title.
|
|
|
section.source # Section HTML source.
|
|
|
section.string # Section plaintext unicode string.
|
|
|
section.content # Section string minus title.
|
|
|
section.level # Section nested depth (from 0).
|
|
|
section.links # List of titles of linked articles.
|
|
|
section.tables # List of WikipediaTable objects.</pre><p>The following example downloads a Wikipedia article and prints the title of each section, indented according to the section level:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Wikipedia
|
|
|
>>>
|
|
|
>>> article = Wikipedia().search('cat')
|
|
|
>>> for section in article.sections:
|
|
|
>>> print repr(' ' * section.level + section.title)
|
|
|
|
|
|
u'Cat'
|
|
|
u' Nomenclature and etymology'
|
|
|
u' Taxonomy and evolution'
|
|
|
u' Genetics'
|
|
|
u' Anatomy'
|
|
|
u' Behavior'
|
|
|
u' Sociability'
|
|
|
u' Grooming'
|
|
|
u' Fighting'
|
|
|
... </pre></div>
|
|
|
<h3>Wikipedia article tables</h3>
|
|
|
<p><span class="inline_code">WikipediaSection.tables</span> is a list of <span class="inline_code">WikipediaTable</span> objects. Each table has a title, headers and rows.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">table = WikipediaTable(section, title='', headers=[], rows=[], source='')</pre><pre class="brush:python; gutter:false; light:true;">table.section # WikipediaSection parent.
|
|
|
table.source # Table HTML source.
|
|
|
table.title # Table title.
|
|
|
table.headers # List of table column headers.
|
|
|
table.rows # List of table rows, each a list of column values.</pre><p> </p>
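<p>For example, a sketch that prints each table's title and column headers (the article choice is illustrative):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Wikipedia
>>>
>>> article = Wikipedia().search('cat')
>>> for section in article.sections:
>>>     for table in section.tables:
>>>         print repr(table.title)
>>>         print table.headers</pre></div>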
|
|
|
<hr />
|
|
|
<h2><a name="wikia"></a>Wikia</h2>
|
|
|
<p><a href="http://www.wikia.com/" target="_blank">Wikia</a> is a free hosting service for thousands of wikis. <span class="inline_code">Wikipedia</span>, <span class="inline_code">Wiktionary</span> and <span class="inline_code">Wikia</span> all inherit the <span class="inline_code">MediaWiki</span> base class, so <span class="inline_code">Wikia</span> has the same methods and properties as <span class="inline_code">Wikipedia</span>. Its constructor takes the name of a domain on Wikia. Note the use of <span class="inline_code">Wikia.index()</span>, which returns an iterator over all available article titles:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import Wikia
|
|
|
>>>
|
|
|
>>> w = Wikia(domain='montypython')
|
|
|
>>> for i, title in enumerate(w.index(start='a', throttle=1.0, cached=True)):
|
|
|
>>> if i >= 3:
|
|
|
>>> break
|
|
|
>>> article = w.search(title)
|
|
|
>>> print repr(article.title)
|
|
|
|
|
|
u'Albatross'
|
|
|
u'Always Look on the Bright Side of Life'
|
|
|
u'And Now for Something Completely Different'</pre></div>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="dbpedia"></a>DBPedia</h2>
|
|
|
<p><a href="http://dbpedia.org/About" target="_blank">DBPedia</a> is a database of structured information mined from Wikipedia and stored as (subject, predicate, object)-triples (e.g., <em>cat</em> <span class="postag">is-a</span> <em>animal</em>). DBPedia can be queried with <a href="http://www.w3.org/TR/rdf-sparql-query/" target="_blank">SPARQL</a>, where subject, predicate and/or object can be given as <span class="inline_code">?variables</span>. The <span class="inline_code">Result</span> objects in the list returned from <span class="inline_code">DBPedia.search()</span> have the variables as additional properties:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import DBPedia
|
|
|
>>>
|
|
|
>>> sparql = '\n'.join((
|
|
|
>>> 'prefix dbo: <http://dbpedia.org/ontology/>',
|
|
|
>>> 'select ?person ?place where {',
|
|
|
>>> ' ?person a dbo:President.',
|
|
|
>>> ' ?person dbo:birthPlace ?place.',
|
|
|
>>> '}'
|
|
|
>>> ))
|
|
|
>>> for r in DBPedia().search(sparql, start=1, count=10):
|
|
|
>>> print '%s (%s)' % (r.person.name, r.place.name)
|
|
|
|
|
|
Álvaro Arzú (Guatemala City)
|
|
|
Árpád Göncz (Budapest)
|
|
|
...</pre></div>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="facebook"></a>Facebook posts, comments & likes</h2>
|
|
|
<p><span class="inline_code">Facebook.search(query,</span> <span class="inline_code">type=SEARCH)</span> returns a list of <span class="inline_code">Result</span> objects, where each result is a (publicly available) post that contains (or which comments contain) the given query.</p>
|
|
|
<p><span class="inline_code">Facebook.search(id,</span> <span class="inline_code">type=NEWS)</span> returns posts from a given user profile. You need to supply a personal license key. You can get a key when you <a href="/pattern-facebook" target="_blank">authorize Pattern</a> to search Facebook in your name.</p>
|
|
|
<p><span class="inline_code">Facebook.search(id,</span> <span class="inline_code">type=COMMENTS)</span> retrieves comments for a given post's <span class="inline_code">Result.id</span>. You can also pass the id of a post or a comment to <span class="inline_code">Facebook.search(id, type=LIKES)</span> to retrieve users that liked it.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import Facebook, NEWS, COMMENTS, LIKES
|
|
|
>>>
|
|
|
>>> fb = Facebook(license='your key')
|
|
|
>>> me = fb.profile(id=None) # (id, name, date, gender, locale, likes)-tuple
|
|
|
>>>
|
|
|
>>> for post in fb.search(me[0], type=NEWS, count=100):
|
|
|
>>> print repr(post.id)
|
|
|
>>> print repr(post.text)
|
|
|
>>> print repr(post.url)
|
|
|
>>> if post.comments > 0:
|
|
|
>>> print '%i comments' % post.comments
|
|
|
>>> print [(r.text, r.author) for r in fb.search(post.id, type=COMMENTS)]
|
|
|
>>> if post.likes > 0:
|
|
|
>>> print '%i likes' % post.likes
|
|
|
>>> print [r.author for r in fb.search(post.id, type=LIKES)]
|
|
|
|
|
|
u'530415277_10151455896030278'
|
|
|
u'Tom De Smedt likes CLiPS Research Center'
|
|
|
u'http://www.facebook.com/CLiPS.UA'
|
|
|
1 likes
|
|
|
[(u'485942414773810', u'CLiPS Research Center')]
|
|
|
.... </pre></div>
|
|
|
<p>The maximum <span class="inline_code">count</span> for <span class="inline_code">COMMENTS</span> and <span class="inline_code">LIKES</span> is 1000 (by default, 10). </p>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2>RSS + Atom newsfeeds</h2>
|
|
|
<p>The <span class="inline_code">Newsfeed</span> object is a wrapper for Mark Pilgrim's <a href="http://www.feedparser.org/" target="_blank">Universal Feed Parser</a>. <span class="inline_code">Newsfeed.search()</span> takes the URL of an RSS or Atom news feed and returns a list of <span class="inline_code">Result</span> objects.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Newsfeed
|
|
|
>>>
|
|
|
>>> NATURE = 'http://www.nature.com/nature/current_issue/rss/index.html'
|
|
|
>>> for result in Newsfeed().search(NATURE)[:5]:
|
|
|
>>> print repr(result.title)
|
|
|
|
|
|
u'Biopiracy rules should not block biological control'
|
|
|
u'Animal behaviour: Same-shaped shoals'
|
|
|
u'Genetics: Fast disease factor'
|
|
|
u'Biomimetics: Material monitors mugginess'
|
|
|
u'Cell biology: Lung lipid hurts breathing'
|
|
|
</pre></div>
|
|
|
<p><span class="inline_code">Newsfeed.search()</span> has an optional parameter <span class="inline_code">tags</span>, which is a list of custom tags to parse:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> for result in Newsfeed().search(NATURE, tags=['dc:identifier']):
|
|
|
>>> print result.dc_identifier</pre></div>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="sort"></a>Web sort</h2>
|
|
|
<p>The return value of <span class="inline_code">SearchEngine.search()</span> has a <span class="inline_code">total</span> property which can be used to sort queries by "crowdvoting". The <span class="inline_code">sort()</span> function sorts a given list of terms according to their total result count, and returns a list of <span class="inline_code">(percentage,</span> <span class="inline_code">term)</span>-tuples.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">sort(
|
|
|
terms = [], # List of search terms.
|
|
|
context = '', # Term used for sorting.
|
|
|
service = GOOGLE, # GOOGLE | BING | YAHOO | FLICKR
|
|
|
license = None, # Service license key.
|
|
|
strict = True, # Wrap query in quotes?
|
|
|
prefix = False, # context + term or term + context?
|
|
|
cached = True)</pre><p>When a <span class="inline_code">context</span> is defined, the function sorts by relevance to the context, e.g., <span class="inline_code">sort(["black",</span> <span class="inline_code">"white"],</span> <span class="inline_code">context="Darth</span> <span class="inline_code">Vader")</span> yields <em>black</em> as the best candidate, because <span class="inline_code">"black</span> <span class="inline_code">Darth</span> <span class="inline_code">Vader"</span> is more common in search results.</p>
|
|
|
<p>Now let's see who is more dangerous:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import sort
|
|
|
>>>
|
|
|
>>> results = sort(terms=[
|
|
|
>>> 'arnold schwarzenegger',
|
|
|
>>> 'chuck norris',
|
|
|
>>> 'dolph lundgren',
|
|
|
>>> 'steven seagal',
|
|
|
>>> 'sylvester stallone',
|
|
|
>>> 'mickey mouse'], context='dangerous', prefix=True)
|
|
|
>>>
|
|
|
>>> for weight, term in results:
|
|
|
>>> print "%.2f" % (weight * 100) + '%', term
|
|
|
|
|
|
84.34% 'dangerous mickey mouse'
|
|
|
9.24% 'dangerous chuck norris'
|
|
|
2.41% 'dangerous sylvester stallone'
|
|
|
2.01% 'dangerous arnold schwarzenegger'
|
|
|
1.61% 'dangerous steven seagal'
|
|
|
0.40% 'dangerous dolph lundgren'
|
|
|
</pre></div>
|
|
|
<p> </p>
|
|
|
<hr />
|
|
|
<h2><a name="plaintext"></a>HTML to plaintext</h2>
|
|
|
<p>The HTML source code of a web page can be retrieved with <span class="inline_code">URL.download()</span>. HTML is a markup language that uses <em>tags</em> to define text formatting. For example, <span class="inline_code"><b>hello</b></span> displays <strong>hello</strong> in bold. For many tasks we may want to strip the formatting so we can analyze (e.g., <a href="pattern-en.html#parser">parse</a> or <a href="pattern-vector.html#wordcount">count</a>) the plain text.</p>
|
|
|
<p>The <span class="inline_code">plaintext()</span> function removes HTML formatting from a string.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">plaintext(html, keep=[], replace=blocks, linebreaks=2, indentation=False)</pre><p>It performs the following steps to clean up the given string:</p>
|
|
|
<ul>
|
|
|
<li><strong>Strip javascript:</strong> remove all <span class="inline_code"><script></span> elements.</li>
|
|
|
<li><strong>Strip CSS: </strong>remove all <span class="inline_code"><style></span> elements.</li>
|
|
|
<li><strong>Strip comments:</strong> remove all <span class="inline_code"><!-- --></span> elements.</li>
|
|
|
<li><strong>Strip forms: </strong>remove all <span class="inline_code"><form></span> elements.</li>
|
|
|
<li><strong>Strip tags: </strong>remove all HTML tags.</li>
|
|
|
<li><strong>Decode entities:</strong> replace <span class="inline_code">&lt;</span> with <span class="inline_code"><</span> (for example).</li>
|
|
|
<li><strong>Collapse spaces:</strong> replace consecutive spaces with a single space.</li>
|
|
|
<li><strong>Collapse linebreaks:</strong> replace consecutive linebreaks with a single linebreak.</li>
|
|
|
<li><strong>Collapse tabs:</strong> replace consecutive tabs with a single space; optionally, indentation (i.e., tabs at the start of a line) can be preserved.</li>
|
|
|
</ul>
|
|
|
<p><span class="smallcaps">plaintext parameters</span></p>
|
|
|
<p>The <span class="inline_code">keep</span> parameter is a list of tags to retain. By default, attributes are stripped, e.g., <span class="inline_code"><table border="0"></span> becomes <span class="inline_code"><table></span>. To preserve specific attributes, a dictionary can be given: <span class="inline_code">{"a":</span> <span class="inline_code">["href"]}</span>.</p>
|
|
|
<p>The <span class="inline_code">replace</span> parameter defines how HTML elements are replaced with other characters to improve plain text layout. It is a dictionary of <span class="inline_code">tag</span> → <span class="inline_code">(before,</span> <span class="inline_code">after)</span> items. By default, it replaces block elements (i.e., <span class="inline_code"><h1></span>, <span class="inline_code"> </span><span class="inline_code"><h2></span>, <span class="inline_code"> </span><span class="inline_code"><p></span>, <span class="inline_code"> </span><span class="inline_code"><div></span>, <span class="inline_code"> </span><span class="inline_code"><table></span>, ...) with two linebreaks, <span class="inline_code"><th></span> and <span class="inline_code"><tr></span> with one linebreak, <span class="inline_code"><td></span> with one tab, and <span class="inline_code"><li></span> with an asterisk (<span class="inline_code">*</span>) before and a linebreak after.</p>
|
|
|
<p>The <span class="inline_code">linebreaks</span> parameter defines the maximum number of consecutive linebreaks to retain.</p>
|
|
|
<p>The <span class="inline_code">indentation</span> parameter defines whether or not to retain tab indentation.</p>
|
|
|
<p>The following example downloads an HTML document and keeps a minimal amount of formatting (headings, bold, links).</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, plaintext
|
|
|
>>>
|
|
|
>>> s = URL('http://www.clips.ua.ac.be').download()
|
|
|
>>> s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
|
|
|
>>> print s
|
|
|
</pre></div>
|
|
|
<p style="margin-top: 1.3em;"><span class="smallcaps">plaintext = strip + decode + collapse</span></p>
|
|
|
<p>The different steps in <span class="inline_code">plaintext()</span> are available as separate functions:</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">decode_utf8(string) # Byte string to Unicode string.</pre><pre class="brush:python; gutter:false; light:true;">encode_utf8(string) # Unicode string to byte string.
|
|
|
</pre><pre class="brush:python; gutter:false; light:true;">strip_tags(html, keep=[], replace=blocks) # Non-trivial, using SGML parser.
|
|
|
</pre><pre class="brush:python; gutter:false; light:true;">strip_between(a, b, string) # Remove anything between (and including) a and b.
|
|
|
</pre><pre class="brush:python; gutter:false; light:true;">strip_javascript(html) # Strips between '<script*>' and '</script'.</pre><pre class="brush:python; gutter:false; light:true;">strip_inline_css(html) # Strips between '<style*>' and '</style>'.</pre><pre class="brush:python; gutter:false; light:true;">strip_comments(html) # Strips between '<!--' and '-->'.</pre><pre class="brush:python; gutter:false; light:true;">strip_forms(html) # Strips between '<form*>' and '</form>'.</pre><pre class="brush:python; gutter:false; light:true;">decode_entities(string) # '&lt;' => '<'</pre><pre class="brush:python; gutter:false; light:true;">encode_entities(string) # '<' => '&lt;' </pre><pre class="brush:python; gutter:false; light:true;">decode_url(string) # 'and%2For' => 'and/or'</pre><pre class="brush:python; gutter:false; light:true;">encode_url(string) # 'and/or' => 'and%2For' </pre><pre class="brush:python; gutter:false; light:true;">collapse_spaces(string, indentation=False, replace=' ')</pre><pre class="brush:python; gutter:false; light:true;">collapse_tabs(string, indentation=False, replace=' ')</pre><pre class="brush:python; gutter:false; light:true;">collapse_linebreaks(string, threshold=1)</pre><p> </p>
|
|
|
<hr />
|
|
|
<h2 class="example"><a name="DOM"></a>HTML DOM parser</h2>
|
|
|
<p>The Document Object Model (DOM) is a language-independent convention for representing HTML, XHTML and XML documents. The pattern.web module includes an HTML DOM parser (based on Leonard Richardson's <a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a>) that can be used to traverse an HTML document as a tree of linked Python objects. This is useful for extracting specific portions from an HTML string retrieved with <span class="inline_code">URL.download()</span>.</p>
|
|
|
<h3>Node</h3>
|
|
|
<p>The DOM consists of a <span class="inline_code">DOM</span> object that contains <span class="inline_code">Text</span>, <span class="inline_code">Comment</span> and <span class="inline_code">Element</span> objects.<br />All of these are subclasses of <span class="inline_code">Node</span>.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">node = Node(html, type=NODE)</pre><pre class="brush:python; gutter:false; light:true;">node.type # NODE | TEXT | COMMENT | ELEMENT | DOCUMENT
|
|
|
node.source # HTML source.
|
|
|
node.parent # Parent node.
|
|
|
node.children # List of child nodes.
|
|
|
node.next # Next child in node.parent (or None).
|
|
|
node.previous # Previous child in node.parent (or None).</pre><pre class="brush:python; gutter:false; light:true;">node.traverse(visit=lambda node: None)</pre><h3>Element</h3>
|
|
|
<p><span class="inline_code">Text</span>, <span class="inline_code">Comment</span> and <span class="inline_code">Element</span> are subclasses of <span class="inline_code">Node</span>. For example, <span class="inline_code">'the</span> <span class="inline_code"><b>cat</b>'</span> is parsed to <span class="inline_code">Text('the')</span> + <span class="inline_code">Element('cat',</span> <span class="inline_code">tag='b')</span>. The <span class="inline_code">Element</span> object has a number of additional properties:</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">element = Element(html)</pre><pre class="brush:python; gutter:false; light:true;">element.tag # Tag name.
|
|
|
element.attrs # Dictionary of attributes, e.g. {'class':'comment'}.
|
|
|
element.id # Value for id attribute (or None).</pre><pre class="brush:python; gutter:false; light:true;">element.source # HTML source.
|
|
|
element.content # HTML source minus open and close tag.</pre><pre class="brush:python; gutter:false; light:true;">element.by_id(str) # First nested Element with given id.
|
|
|
element.by_tag(str) # List of nested Elements with given tag name.
|
|
|
element.by_class(str) # List of nested Elements with given class.
|
|
|
element.by_attr(**kwargs) # List of nested Elements with given attribute.
|
|
|
element(selector) # List of nested Elements matching a CSS selector.
|
|
|
</pre><ul>
|
|
|
<li><span class="inline_code">Element.by_tag()</span> can include a class (e.g., <span class="inline_code">"div.header"</span>) or an id (e.g., <span class="inline_code">"div#content"</span>). <br />A wildcard can be used to match any tag. (e.g. <span class="inline_code">"*.even"</span>).<br />The element is searched recursively (children in children, etc.)</li>
|
|
|
<li><span class="inline_code">Element.by_attr()</span> takes one or more keyword arguments (e.g., <span class="inline_code">name="keywords"</span>).</li>
|
|
|
<li><span class="inline_code">Element(selector)</span> returns a list of nested elements that match the given <a href="http://www.w3.org/TR/CSS2/selector.html" target="_blank">CSS selector</a>:</li>
|
|
|
</ul>
|
|
|
<p>Overview of CSS selectors:</p>
|
|
|
<div>
|
|
|
<table class="border">
|
|
|
<tbody>
|
|
|
<tr>
|
|
|
<td class="smallcaps">CSS Selector</td>
|
|
|
<td class="smallcaps">Description</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('*')</td>
|
|
|
<td>all nested elements</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('*#x')</td>
|
|
|
<td>all nested elements with <span class="inline_code">id="x"</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div#x')</td>
|
|
|
<td>all nested <span class="inline_code"><div></span> elements with <span class="inline_code">id="x"</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div.x')</td>
|
|
|
<td>all nested <span class="inline_code"><div></span> elements with <span class="inline_code">class="x"</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div[class="x"]')</td>
|
|
|
<td>all nested<span class="inline_code"> <div></span> elements with attribute <span class="inline_code">"class"</span> = <span class="inline_code">"x"</span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div:first-child')</td>
|
|
|
<td>the first child in a <span class="inline_code"><div></span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div a')</td>
|
|
|
<td>all nested <span class="inline_code"><a></span>'s inside a nested <span class="inline_code"><div></span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div, a')</td>
|
|
|
<td>all nested <span class="inline_code"><a></span>'s and <span class="inline_code"><div></span> elements</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div + a')</td>
|
|
|
<td>all nested <span class="inline_code"><a></span>'s directly preceded by a <span class="inline_code"><div></span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div > a')</td>
|
|
|
<td>all nested <span class="inline_code"><a></span>'s directly inside a nested <span class="inline_code"><div></span></td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
<td class="inline_code">element('div < a')</td>
|
|
|
<td>all nested <span class="inline_code"><div></span>'s directly containing an <span class="inline_code"><a></span></td>
|
|
|
</tr>
|
|
|
</tbody>
|
|
|
</table>
|
|
|
</div>
|
|
|
<div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import Element
|
|
|
>>>
|
|
|
>>> div = Element('<div> <a>1st</a> <a>2nd</a> </div>')
|
|
|
>>> print div('a:first-child')
|
|
|
>>> print div('a:first-child')[0].source
|
|
|
|
|
|
[Element(tag='a')]
|
|
|
<a>1st</a> </pre></div>
|
|
|
<h3>DOM</h3>
|
|
|
<p>The top-level element in the Document Object Model.</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">dom = DOM(html)</pre><pre class="brush:python; gutter:false; light:true;">dom.declaration # <!doctype> TEXT Node.
|
|
|
dom.head # <head> Element.
|
|
|
dom.body # <body> Element.</pre><p>The following example retrieves the most recent <a href="http://www.reddit.com/" target="_blank">reddit</a> entries. The pattern.web module does not include a reddit search engine, but we can parse entries directly from the HTML source. This is called <em>screen scraping</em>, and many websites will strongly dislike it.</p>
<p>The following example retrieves the most recent <a href="http://www.reddit.com/" target="_blank">reddit</a> entries. The pattern.web module does not include a reddit search engine, but we can parse entries directly from the HTML source. This is called <em>screen scraping</em>, and many websites disapprove of it.</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, DOM, plaintext
>>>
>>> url = URL('http://www.reddit.com/top/')
>>> dom = DOM(url.download(cached=True))
>>> for e in dom('div.entry')[:3]: # Top 3 reddit entries.
>>>     for a in e('a.title')[:1]: # First <a class="title">.
>>>         print repr(plaintext(a.content))

u'Invisible Kitty'
u'Naturally, he said yes.'
u"I'd just like to remind everyone that /r/minecraft exists and not everyone wants to have 10 Minecraft posts a day on their front page."</pre></div>
<p><span class="smallcaps"><br />Absolute URLs</span></p>
|
|
|
<p>Links parsed from the <span class="inline_code">DOM</span> can be relative (e.g., starting with <span class="inline_code">"../"</span> instead of <span class="inline_code">"http://"</span>).<br />To get the absolute URL, you can use the <span class="inline_code">abs()</span> function in combination with <span class="inline_code">URL.redirect</span>:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, DOM, abs
|
|
|
>>>
|
|
|
>>> url = URL('http://www.clips.ua.ac.be')
|
|
|
>>> dom = DOM(url.download())
|
|
|
>>> for link in dom('a'):
|
|
|
>>> print abs(link.attributes.get('href',''), base=url.redirect or url.string) </pre></div>
|
|
|
<p> </p>
<hr />
<h2><a name="pdf"></a>PDF Parser</h2>
|
|
|
<p style="margin-top: 0.2em; margin-right: 0px; margin-bottom: 0.5em; margin-left: 0px;">Portable Document Format (PDF) is a popular open standard, where text, fonts, images and layout are contained in a single document that displays the same across systems. However, extracting the source text from a PDF can be difficult.</p>
|
|
|
<p style="margin-top: 0.2em; margin-right: 0px; margin-bottom: 0.5em; margin-left: 0px;">The <span class="inline_code">PDF</span> object (based on <a href="http://www.unixuser.org/~euske/python/pdfminer/" target="_self">PDFMiner</a>) parses the source text from a PDF file.</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, PDF
|
|
|
>>>
|
|
|
>>> url = URL('http://www.clips.ua.ac.be/sites/default/files/ctrs-002_0.pdf')
|
|
|
>>> pdf = PDF(url.download())
|
|
|
>>> print pdf.string
|
|
|
|
|
|
CLiPS Technical Report series 002 September 7, 2010
|
|
|
Tom De Smedt, Vincent Van Asch, Walter Daelemans
|
|
|
Computational Linguistics & Psycholinguistics Research Center
|
|
|
... </pre></div>
|
|
|
<p style="margin-top: 0.2em; margin-right: 0px; margin-bottom: 0.5em; margin-left: 0px;">URLs linking to a PDF document can be identified with: <span class="inline_code">URL.mimetype</span> <span class="inline_code">in</span> <span class="inline_code">MIMETYPE_PDF</span>.</p>
|
|
|
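<p>For example (a minimal sketch, reusing the report URL from above):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL, MIMETYPE_PDF
>>>
>>> url = URL('http://www.clips.ua.ac.be/sites/default/files/ctrs-002_0.pdf')
>>> print url.mimetype in MIMETYPE_PDF

True</pre></div>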
<p> </p>
<hr />
<h2><a name="crawler"></a>Crawler</h2>
|
|
|
<p>A web crawler or web spider can be used to traverse the web automatically. The <span class="inline_code">Crawler</span> object takes a list of URLs. These are then visited by the crawler. If they lead to a web page, the HTML content is parsed for new links. These are added to the list of links scheduled for a visit.</p>
|
|
|
<p>The given <span class="inline_code">domains</span> is a list of allowed domain names. An empty list means the crawler can visit the entire web. The given <span class="inline_code">delay</span> defines the number of seconds to wait before revisiting the same (sub)domain – continually hammering one server with a robot disrupts requests from the website's regular visitors (this is called a <em>denial-of-service attack</em>).</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">crawler = Crawler(links=[], domains=[], delay=20.0, sort=FIFO)</pre><pre class="brush:python; gutter:false; light:true;">crawler.domains # Domains allowed to visit (e.g., ['clips.ua.ac.be']).
|
|
|
crawler.delay # Delay between visits to the same (sub)domain.
|
|
|
crawler.history # Dictionary of (domain, time last visited)-items.
|
|
|
crawler.visited # Dictionary of URLs visited.
|
|
|
crawler.sort # FIFO | LIFO (how new links are queued).
|
|
|
crawler.done # True when all links have been visited.</pre><pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">crawler.push(link, priority=1.0, sort=LIFO)
|
|
|
crawler.pop(remove=True)
|
|
|
crawler.next # Yields the next scheduled link = Crawler.pop(False)</pre><pre class="brush:python; gutter:false; light:true;">crawler.crawl(method=DEPTH) # DEPTH | BREADTH | None.</pre><pre class="brush:python; gutter:false; light:true;">crawler.priority(link, method=DEPTH)
|
|
|
crawler.follow(link)
|
|
|
crawler.visit(link, source=None)
|
|
|
crawler.fail(link)</pre><h3>Crawling process</h3>
|
|
|
<ul>
<li><span class="inline_code">Crawler.crawl()</span> is meant to be called continuously in a loop. It selects a link to visit and parses the HTML content for new links. The <span class="inline_code">method</span> parameter defines whether the crawler prefers internal links (<span class="inline_code">DEPTH</span>) or external links to other domains (<span class="inline_code">BREADTH</span>). If the link leads to a recently visited domain (i.e., elapsed time < <span class="inline_code">Crawler.delay</span>) it is temporarily skipped. To disable this behaviour, use an optional <span class="inline_code">throttle</span> parameter >= <span class="inline_code">Crawler.delay</span>.</li>
<li><span class="inline_code">Crawler.priority()</span> is called from <span class="inline_code">Crawler.crawl()</span> to determine the priority (<span class="inline_code">0.0</span>-<span class="inline_code">1.0</span>) of a new <span class="inline_code">Link</span>, where links with the highest priority are visited first. It can be overridden in a subclass.</li>
<li><span class="inline_code">Crawler.follow()</span> is called from <span class="inline_code">Crawler.crawl()</span> to determine if it should schedule the given <span class="inline_code">Link</span> for a visit. By default it returns <span class="inline_code">True</span>. It can be overridden to disallow selected links, as shown in the sketch after this list.</li>
<li><span class="inline_code">Crawler.visit()</span> is called from <span class="inline_code">Crawler.crawl()</span> when a <span class="inline_code">Link</span> is visited. The given <span class="inline_code">source</span> is an HTML string with the page content. By default, this method does nothing (it should be overridden).</li>
<li><span class="inline_code">Crawler.fail()</span> is called from <span class="inline_code">Crawler.crawl()</span> for links whose MIME type could not be determined, or which raise a <span class="inline_code">URLError</span> while downloading.</li>
</ul>
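<p>A minimal sketch of such a subclass (the <span class="inline_code">'news'</span> keyword and the ads URL prefix are hypothetical examples):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Crawler, DEPTH
>>>
>>> class Picky(Crawler):
>>>     def priority(self, link, method=DEPTH):
>>>         # Visit links whose URL mentions 'news' before anything else.
>>>         return 1.0 if 'news' in link.url else 0.5
>>>     def follow(self, link):
>>>         # Never schedule links to the (hypothetical) ads subdomain.
>>>         return not link.url.startswith('http://ads.')</pre></div>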
<p>The crawler uses <span class="inline_code">Link</span> objects internally, which contain additional information besides the URL string:</p>
<pre class="brush:python; gutter:false; light:true;">link = Link(url, text='', relation='')</pre><pre class="brush:python; gutter:false; light:true;">link.url      # Parsed from <a href=''> attribute.
link.text     # Parsed from <a title=''> attribute.
link.relation # Parsed from <a rel=''> attribute.
link.referrer # Parent web page URL.</pre>
<p>The following example shows a subclass of <span class="inline_code">Crawler</span> that prints each link it visits. Since it uses <span class="inline_code">DEPTH</span> for crawling, it will prefer internal links.</p>
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Crawler
|
|
|
>>>
|
|
|
>>> class Polly(Crawler):
|
|
|
>>> def visit(self, link, source=None):
|
|
|
>>> print 'visited:', repr(link.url), 'from:', link.referrer
|
|
|
>>> def fail(self, link):
|
|
|
>>> print 'failed:', repr(link.url)
|
|
|
>>>
|
|
|
>>> p = Polly(links=['http://www.clips.ua.ac.be/'], delay=3)
|
|
|
>>> while not p.done:
|
|
|
>>> p.crawl(method=DEPTH, cached=False, throttle=3)
|
|
|
|
|
|
visited: u'http://www.clips.ua.ac.be/'
|
|
|
visited: u'http://www.clips.ua.ac.be/#navigation'
|
|
|
visited: u'http://www.clips.ua.ac.be/colloquia'
|
|
|
visited: u'http://www.clips.ua.ac.be/computational-linguistics'
|
|
|
visited: u'http://www.clips.ua.ac.be/contact'
|
|
|
</pre></div>
|
|
|
<p><span class="small"><span style="text-decoration: underline;">Note</span>: <span class="inline_code">Crawler.crawl()</span> takes the same parameters as <span class="inline_code">URL.download()</span>, e.g., </span><span class="small"><span class="inline_code">cached=False</span> or <span class="inline_code">throttle=10</span>.<br /></span></p>
|
|
|
<h3>Crawl function</h3>
<p>The <span class="inline_code">crawl()</span> function returns an iterator that yields <span class="inline_code">(Link,</span> <span class="inline_code">source)</span>-tuples. When it is <em>idle</em> (e.g., waiting for the <span class="inline_code">delay</span> on a domain) it yields (<span class="inline_code">None</span>, <span class="inline_code">None</span>).</p>
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">crawl(
|
|
|
links = [],
|
|
|
domains = [],
|
|
|
delay = 20.0,
|
|
|
sort = FIFO,
|
|
|
method = DEPTH, **kwargs)</pre><div class="example">
|
|
|
<pre class="brush: python;gutter: false; light: true; fontsize: 100; first-line: 1; ">>>> from pattern.web import crawl
|
|
|
>>>
|
|
|
>>> for link, source in crawl('http://www.clips.ua.ac.be/', delay=3, throttle=3):
|
|
|
>>> print link
|
|
|
|
|
|
Link(url=u'http://www.clips.ua.ac.be/')
|
|
|
Link(url=u'http://www.clips.ua.ac.be/#navigation')
|
|
|
Link(url=u'http://www.clips.ua.ac.be/computational-linguistics')
|
|
|
...</pre></div>
|
|
|
<p> </p>
<hr />
<h2><a name="mail"></a>E-mail</h2>
|
|
|
<p>The <span class="inline_code">Mail</span> object can be used to retrieve e-mail messages from Gmail, provided that IMAP is <a href="http://mail.google.com/support/bin/answer.py?answer=77695">enabled</a>. It may also work with other services, by passing the server address to the <span class="inline_code">service</span> parameter (e.g., <span class="inline_code">service="imap.gmail.com"</span>). With <span class="inline_code">secure=False</span> (no SSL) the default <span class="inline_code">port</span> is 143.</p>
|
|
|
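<p>For example, connecting to a generic IMAP account without SSL (a sketch; the server address and credentials are placeholders):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Mail
>>>
>>> mail = Mail('me@example.com', 'secret', service='imap.example.com', port=143, secure=False)</pre></div>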
<pre class="brush:python; gutter:false; light:true;">mail = Mail(username, password, service=GMAIL, port=993, secure=True)</pre><pre class="brush:python; gutter:false; light:true;">mail.folders # Dictionary of (name, MailFolder)-items.
|
|
|
mail.[folder] # E.g., Mail.inbox.read(id)
|
|
|
mail.[folder].count # Number of messages in folder.
|
|
|
</pre><pre class="brush:python; gutter:false; light:true;">mail.[folder].search(query, field=FROM) # FROM | SUBJECT | DATE
|
|
|
mail.[folder].read(id, attachments=False, cached=True)</pre><ul>
|
|
|
<li><span class="inline_code">Mail.folders</span> is a <span class="inline_code">name</span> → <span class="inline_code">MailFolder</span> dictionary. Common names include <span class="inline_code">inbox</span>, <span class="inline_code">spam</span> and <span class="inline_code">trash</span>.</li>
|
|
|
<li><span class="inline_code">MailFolder.search()</span> returns a list of e-mail id's, most recent first.</li>
|
|
|
<li><span class="inline_code">MailFolder.read()</span> retrieves the e-mail with given id as a <span class="inline_code">Message</span>.</li>
|
|
|
</ul>
|
|
|
<div><span style="line-height: 18px;">A <span class="inline_code">Message</span> has the following properties:</span></div>
|
|
|
<pre class="brush:python; gutter:false; light:true;">message = Mail.[folder].read(i)</pre><pre class="brush:python; gutter:false; light:true;">message.author # Unicode string, sender name + e-mail address.
|
|
|
message.email_address # Unicode string, sender e-mail address.
|
|
|
message.date # Unicode string, date received.
|
|
|
message.subject # Unicode string, message subject.
|
|
|
message.body # Unicode string, message body.
|
|
|
message.attachments # List of (MIME-type, str)-tuples.
|
|
|
</pre><p>The following example retrieves spam e-mails containing the word "wish":</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Mail, GMAIL, SUBJECT
|
|
|
>>>
|
|
|
>>> gmail = Mail(username='...', password='...', service=GMAIL)
|
|
|
>>> print gmail.folders.keys()
|
|
|
|
|
|
['drafts', 'spam', 'personal', 'work', 'inbox', 'mail', 'starred', 'trash']</pre></div>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> i = gmail.spam.search('wish', field=SUBJECT)[0] # What riches await...
|
|
|
>>> m = gmail.spam.read(i)
|
|
|
>>> print ' From:', m.author
|
|
|
>>> print 'Subject:', m.subject
|
|
|
>>> print 'Message:'
|
|
|
>>> print m.body
|
|
|
|
|
|
From: u'Vegas VIP Clib <amllhbmjb@acciongeoda.org>'
|
|
|
Subject: u'Your wish has been granted'
|
|
|
Message: u'No one has claimed our jackpot! This is your chance to try!'
|
|
|
</pre></div>
|
|
|
<p> </p>
<hr />
<h2><a name="locale"></a>Locale</h2>
|
|
|
<p>The pattern.web.locale module contains functions for region and language codes, based on the ISO-639 language code (e.g., <span class="inline_code">en</span>), the ISO-3166 region code (e.g., <span class="inline_code">US</span>) and the IETF BCP 47 language-region specification (<span class="inline_code">en-US</span>):</p>
|
|
|
<pre class="brush:python; gutter:false; light:true;">encode_language(name) # 'English' => 'en'</pre><pre class="brush:python; gutter:false; light:true;">decode_language(code) # 'en' => 'English'</pre><pre class="brush:python; gutter:false; light:true;">encode_region(name) # 'United States' => 'US'</pre><pre class="brush:python; gutter:false; light:true;">decode_region(code) # 'US' => 'United States'</pre><pre class="brush:python; gutter:false; light:true;">languages(region) # 'US' => ['en']</pre><pre class="brush:python; gutter:false; light:true;">regions(language) # 'en' => ['AU', 'BZ', 'CA', ...]</pre><pre class="brush:python; gutter:false; light:true;">regionalize(language) # 'en' => ['en-US', 'en-AU', ...]</pre><pre class="brush:python; gutter:false; light:true;">market(language) # 'en' => 'en-US'</pre><p>The <span class="inline_code">geocode()</span> function recognizes a number of world capital cities and returns a tuple (<span class="inline_code">latitude</span>, <span class="inline_code">longitude</span>, <span class="inline_code">ISO-639</span>, <span class="inline_code">region</span>).</p>
|
|
|
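<p>For example (a minimal sketch using the functions above):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web.locale import encode_language, decode_language, market
>>>
>>> print encode_language('English')
>>> print decode_language('en')
>>> print market('en')

en
English
en-US</pre></div>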
<pre class="brush:python; gutter:false; light:true;">geocode(location) # 'Brussels' => (50.83, 4.33, u'nl', u'Belgium')</pre><p>This is useful in combination with the <span class="inline_code">geo</span> parameter for <span class="inline_code">Twitter.search()</span> to obtain regional tweets:</p>
|
|
|
<div class="example">
|
|
|
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import Twitter
|
|
|
>>> from pattern.web.locale import geocode
|
|
|
>>>
|
|
|
>>> twitter = Twitter(language='en')
|
|
|
>>> for tweet in twitter.search('restaurant', geo=geocode('Brussels')[:2]):
|
|
|
>>> print tweet.text
|
|
|
|
|
|
u'Did you know: every McDonalds restaurant has free internet in Belgium...'</pre></div>
|
|
|
<p> </p>
<hr />
<h2><a name="cache"></a>Cache</h2>
|
|
|
<p>By, default, <span class="inline_code">URL.download()</span> and <span class="inline_code">SearchEngine.search()</span> will cache results locally. Once the results of a query have been cached, there is no need to connect to the internet (i.e., the query runs faster). Over time the cache can grow quite large, filling up with whatever was downloaded – from tweets to zip archives.</p>
|
|
|
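<p>For example, the second call below returns the local copy without reconnecting (a sketch; any URL will do):</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import URL
>>>
>>> url = URL('http://www.clips.ua.ac.be')
>>> html = url.download(cached=True) # Downloads and stores a local copy.
>>> html = url.download(cached=True) # Retrieves the local copy.</pre></div>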
<p>To empty the cache:</p>
<div class="example">
<pre class="brush:python; gutter:false; light:true;">>>> from pattern.web import cache
>>> cache.clear()</pre></div>
<p> </p>
<hr />
<h2>See also</h2>
<ul>
<li><a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a> (BSD): robust HTML parser for Python.</li>
<li><a href="http://scrapy.org/" target="_blank">Scrapy</a> (BSD): screen scraping and web crawling with Python.</li>
</ul>
</div>
</div></div>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
SyntaxHighlighter.all();
</script>
</body>
</html>