adding derive notebook

master
manetta 4 years ago
parent 7d43cb8aab
commit 046ade44b5

@@ -0,0 +1,464 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MediaWiki API (part 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook:\n",
"\n",
"* continues with exploring the connections between `Hypertext` & `Dérive`\n",
"* saves (parts of) wiki pages as html files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib\n",
"import json\n",
"from IPython.display import JSON # iPython JSON renderer\n",
"import sys"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use another wiki this time: the English Wikipedia.\n",
"\n",
"You can pick any page, i took the Hypertext page for this notebook as an example: https://en.wikipedia.org/wiki/Hypertext"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# parse the wiki page Hypertext\n",
"request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&format=json'\n",
"response = urllib.request.urlopen(request).read()\n",
"data = json.loads(response)\n",
"JSON(data)"
]
},
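{
"cell_type": "markdown",
"metadata": {},
"source": [
"`JSON(data)` renders the response as a collapsible tree. If you prefer, you can also inspect the structure directly, for example by listing the keys of the `parse` object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a quick look at the structure of the response\n",
"print(data['parse'].keys())"
]
},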
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wiki links dérive\n",
"\n",
"Select the wiki links from the `data` response:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"links = data['parse']['links']\n",
"JSON(links)"
]
},
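{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before printing everything, it can help to peek at just the first few links with a slice:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# peek at the first two links only\n",
"links[:2]"
]
},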
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's save the links as a list of pagenames, to make it look like this:\n",
"\n",
"`['hyperdocuments', 'hyperwords', 'hyperworld']`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# How is \"links\" structured now?\n",
"print(links)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It helps to copy paste a small part of the output first:\n",
"\n",
"`[{'ns': 0, 'exists': '', '*': 'Metatext'}, {'ns': 0, '*': 'De man met de hoed'}]`\n",
"\n",
"and to write it differently with indentation:\n",
"\n",
"```\n",
"links = [\n",
" { \n",
" 'ns' : 0,\n",
" 'exists' : '',\n",
" '*', 'Metatext'\n",
" }, \n",
" {\n",
" 'ns' : 0,\n",
" 'exists' : '',\n",
" '*' : 'De man met de hoed'\n",
" } \n",
"]\n",
"```\n",
"\n",
"We can now loop through \"links\" and add all the pagenames to a new list called \"wikilinks\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"wikilinks = []\n",
"\n",
"for link in links:\n",
" \n",
" print('link:', link)\n",
" \n",
" for key, value in link.items():\n",
" print('----- key:', key)\n",
" print('----- value:', value)\n",
" print('-----')\n",
" \n",
" pagename = link['*']\n",
" print('===== pagename:', pagename)\n",
" \n",
" wikilinks.append(pagename)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"wikilinks"
]
},
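{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: the same loop (without the prints) can be written as a one-line *list comprehension*, a common Python idiom for building one list out of another:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the same as the loop above, as a list comprehension\n",
"wikilinks = [link['*'] for link in links]\n",
"wikilinks"
]
},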
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving the links in a HTML page\n",
"\n",
"Let's convert the list of pagenames into HTML link elements (`<a href=\"\"></a>`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"html = ''\n",
"\n",
"for wikilink in wikilinks:\n",
" print(wikilink)\n",
" \n",
" # let's use the \"safe\" pagenames for the filenames \n",
" # by replacing the ' ' with '_'\n",
" filename = wikilink.replace(' ', '_')\n",
" \n",
" a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
" html += a\n",
" html += '\\n'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"print(html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's save this page in a separate folder, i called it \"mediawiki-api-dérive\"\n",
"# We can make this folder here using a terminal command, but you can also do it in the interface on the left\n",
"! mkdir mediawiki-api-dérive"
]
},
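{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note: `! mkdir` runs a shell command, and it complains if the folder already exists. If you prefer to stay in Python, `os.makedirs` with `exist_ok=True` does the same without the complaint:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# make the folder in Python; exist_ok=True means: no error if it is already there\n",
"os.makedirs('mediawiki-api-dérive', exist_ok=True)"
]
},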
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output = open('mediawiki-api-dérive/Hypertext.html', 'w')\n",
"output.write(html)\n",
"output.close()"
]
},
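{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small aside: a `with open(...)` block closes the file automatically, and an explicit `encoding='utf-8'` avoids surprises with accented pagenames on some systems. The same write could look like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the same write, with a context manager and an explicit encoding\n",
"with open('mediawiki-api-dérive/Hypertext.html', 'w', encoding='utf-8') as output:\n",
"    output.write(html)"
]
},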
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Recursive parsing\n",
"\n",
"We can now repeat the steps for each wikilink that we collected!\n",
"\n",
"We can make an API request for each wikilink, \\\n",
"ask for all the links on the page \\\n",
"and save it as an HTML page."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First we save the Hypertext page again:\n",
"\n",
"startpage = 'Hypertext'\n",
"\n",
"# parse the first wiki page\n",
"request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ startpage }&format=json'\n",
"response = urllib.request.urlopen(request).read()\n",
"data = json.loads(response)\n",
"JSON(data)\n",
"\n",
"# select the links\n",
"links = data['parse']['links']\n",
"\n",
"# turn it into a list of pagenames\n",
"wikilinks = []\n",
"for link in links:\n",
" pagename = link['*']\n",
" wikilinks.append(pagename)\n",
"\n",
"# turn the wikilinks into a set of <a href=\"\"></a> links\n",
"html = ''\n",
"for wikilink in wikilinks:\n",
" filename = wikilink.replace(' ', '_')\n",
" a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
" html += a\n",
" html += '\\n'\n",
"\n",
"# save it as a HTML page\n",
"startpage = startpage.replace(' ', '_') # let's again stay safe on the filename side\n",
"output = open(f'mediawiki-api-dérive/{ startpage }.html', 'w')\n",
"output.write(html)\n",
"output.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Then we loop through the list of wikilinks\n",
"# and repeat the steps for each page\n",
" \n",
"for wikilink in wikilinks:\n",
" \n",
" # let's copy the current wikilink pagename, to avoid confusion later\n",
" currentwikilink = wikilink \n",
" print('Now requesting:', currentwikilink)\n",
" \n",
" # parse this wiki page\n",
" wikilink = wikilink.replace(' ', '_')\n",
" request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ wikilink }&format=json'\n",
" \n",
" # --> we insert a \"try and error\" condition, \n",
" # to catch errors in case a page does not exist \n",
" try: \n",
" \n",
" # continue the parse request\n",
" response = urllib.request.urlopen(request).read()\n",
" data = json.loads(response)\n",
" JSON(data)\n",
"\n",
" # select the links\n",
" links = data['parse']['links']\n",
"\n",
" # turn it into a list of pagenames\n",
" wikilinks = []\n",
" for link in links:\n",
" pagename = link['*']\n",
" wikilinks.append(pagename)\n",
"\n",
" # turn the wikilinks into a set of <a href=\"\"></a> links\n",
" html = ''\n",
" for wikilink in wikilinks:\n",
" filename = wikilink.replace(' ', '_')\n",
" a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
" html += a\n",
" html += '\\n'\n",
"\n",
" # save it as a HTML page\n",
" currentwikilink = currentwikilink.replace(' ', '_') # let's again stay safe on the filename side\n",
" output = open(f'mediawiki-api-dérive/{ currentwikilink }.html', 'w')\n",
" output.write(html)\n",
" output.close()\n",
" \n",
" except:\n",
" error = sys.exc_info()[0]\n",
" print('Skipped:', wikilink)\n",
" print('With the error:', error)"
]
},
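{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same steps appear three times by now, so as a sketch: you could wrap them in a small helper function and call it for each pagename. (The function name `save_links_page` is made up for this example.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a sketch: the request + save steps wrapped in a (hypothetical) helper function\n",
"def save_links_page(pagename):\n",
"    safename = pagename.replace(' ', '_')\n",
"    request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ safename }&format=json'\n",
"    response = urllib.request.urlopen(request).read()\n",
"    data = json.loads(response)\n",
"    pagenames = [link['*'] for link in data['parse']['links']]\n",
"    html = ''\n",
"    for name in pagenames:\n",
"        filename = name.replace(' ', '_')\n",
"        html += f'<a href=\"{ filename }.html\">{ name }</a>\\n'\n",
"    output = open(f'mediawiki-api-dérive/{ safename }.html', 'w')\n",
"    output.write(html)\n",
"    output.close()\n",
"    return pagenames\n",
"\n",
"# for example (without error handling):\n",
"# for pagename in save_links_page('Hypertext'):\n",
"#     save_links_page(pagename)"
]
},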
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"?\n",
"\n",
"You could add more loops to the recursive parsing, adding more layers ...\n",
"\n",
"You could request all images of a page (instead of links) ...\n",
"\n",
"or something else the API offers ... (contributors, text, etc)\n",
"\n",
"or ..."
]
},
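{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, the `parse` action also accepts `prop=images`, which returns the image filenames of a page instead of its links. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a sketch: ask the parse API for the images of a page (instead of the links)\n",
"request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&prop=images&format=json'\n",
"response = urllib.request.urlopen(request).read()\n",
"data = json.loads(response)\n",
"data['parse']['images']"
]
},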
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}