diff --git a/mediawiki-api-dérive.ipynb b/mediawiki-api-dérive.ipynb
new file mode 100644
index 0000000..5aab04f
--- /dev/null
+++ b/mediawiki-api-dérive.ipynb
@@ -0,0 +1,464 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# MediaWiki API (part 2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook:\n",
+    "\n",
+    "* continues exploring the connections between `Hypertext` & `Dérive`\n",
+    "* saves (parts of) wiki pages as HTML files"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib\n",
+    "import json\n",
+    "from IPython.display import JSON # IPython JSON renderer\n",
+    "import sys"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Parse"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's use another wiki this time: the English Wikipedia.\n",
+    "\n",
+    "You can pick any page; I took the Hypertext page as an example for this notebook: https://en.wikipedia.org/wiki/Hypertext"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# parse the wiki page Hypertext\n",
+    "request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&format=json'\n",
+    "response = urllib.request.urlopen(request).read()\n",
+    "data = json.loads(response)\n",
+    "JSON(data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Wiki links dérive\n",
+    "\n",
+    "Select the wiki links from the `data` response:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "links = data['parse']['links']\n",
+    "JSON(links)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's save the links as a list of pagenames, so that it looks like this:\n",
+    "\n",
+    "`['hyperdocuments', 'hyperwords', 'hyperworld']`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "# How is \"links\" structured now?\n",
+    "print(links)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It helps to copy-paste a small part of the output first:\n",
+    "\n",
+    "`[{'ns': 0, 'exists': '', '*': 'Metatext'}, {'ns': 0, '*': 'De man met de hoed'}]`\n",
+    "\n",
+    "and to write it out with indentation:\n",
+    "\n",
+    "```\n",
+    "links = [\n",
+    "    {\n",
+    "        'ns' : 0,\n",
+    "        'exists' : '',\n",
+    "        '*' : 'Metatext'\n",
+    "    },\n",
+    "    {\n",
+    "        'ns' : 0,\n",
+    "        '*' : 'De man met de hoed'\n",
+    "    }\n",
+    "]\n",
+    "```\n",
+    "\n",
+    "We can now loop through \"links\" and add all the pagenames to a new list called \"wikilinks\".\n",
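+    "\n",
+    "The cell below does this step by step with a for loop, printing every key and value along the way. As a side note, a list comprehension gives the same result in one line; a small sketch, assuming each link dictionary has a `'*'` key holding the pagename (as in the output above):\n",
+    "\n",
+    "```\n",
+    "wikilinks = [link['*'] for link in links]\n",
+    "print(wikilinks)\n",
+    "```"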
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "wikilinks = []\n",
+    "\n",
+    "for link in links:\n",
+    "\n",
+    "    print('link:', link)\n",
+    "\n",
+    "    for key, value in link.items():\n",
+    "        print('----- key:', key)\n",
+    "        print('----- value:', value)\n",
+    "        print('-----')\n",
+    "\n",
+    "    pagename = link['*']\n",
+    "    print('===== pagename:', pagename)\n",
+    "\n",
+    "    wikilinks.append(pagename)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "wikilinks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Saving the links in an HTML page\n",
+    "\n",
+    "Let's convert the list of pagenames into HTML link elements (`<a href=\"...\">...</a>`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "html = ''\n",
+    "\n",
+    "for wikilink in wikilinks:\n",
+    "    print(wikilink)\n",
+    "\n",
+    "    # let's use the \"safe\" pagenames for the filenames\n",
+    "    # by replacing the ' ' with '_'\n",
+    "    filename = wikilink.replace(' ', '_')\n",
+    "\n",
+    "    a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
+    "    html += a\n",
+    "    html += '\\n'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "print(html)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Let's save this page in a separate folder, I called it \"mediawiki-api-dérive\"\n",
+    "# We can make this folder here using a terminal command, but you can also do it in the interface on the left\n",
+    "! mkdir mediawiki-api-dérive"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "output = open('mediawiki-api-dérive/Hypertext.html', 'w')\n",
+    "output.write(html)\n",
+    "output.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Recursive parsing\n",
+    "\n",
+    "We can now repeat the steps for each wikilink that we collected!\n",
+    "\n",
+    "We can make an API request for each wikilink, \\\n",
+    "ask for all the links on the page \\\n",
+    "and save it as an HTML page.\n",
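+    "\n",
+    "One thing to keep in mind: pagenames can contain spaces and other characters that are not URL-safe. Below we simply replace spaces with underscores before building the request, which works for most pages. A more careful sketch would percent-encode the pagename with `urllib.parse.quote` before inserting it into the URL, for example:\n",
+    "\n",
+    "```\n",
+    "from urllib.parse import quote\n",
+    "\n",
+    "pagename = 'De man met de hoed'\n",
+    "request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ quote(pagename) }&format=json'\n",
+    "print(request)\n",
+    "```"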
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# First we save the Hypertext page again:\n",
+    "\n",
+    "startpage = 'Hypertext'\n",
+    "\n",
+    "# parse the first wiki page\n",
+    "request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ startpage }&format=json'\n",
+    "response = urllib.request.urlopen(request).read()\n",
+    "data = json.loads(response)\n",
+    "JSON(data)\n",
+    "\n",
+    "# select the links\n",
+    "links = data['parse']['links']\n",
+    "\n",
+    "# turn it into a list of pagenames\n",
+    "wikilinks = []\n",
+    "for link in links:\n",
+    "    pagename = link['*']\n",
+    "    wikilinks.append(pagename)\n",
+    "\n",
+    "# turn the wikilinks into HTML links\n",
+    "html = ''\n",
+    "for wikilink in wikilinks:\n",
+    "    filename = wikilink.replace(' ', '_')\n",
+    "    a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
+    "    html += a\n",
+    "    html += '\\n'\n",
+    "\n",
+    "# save it as an HTML page\n",
+    "startpage = startpage.replace(' ', '_') # let's again stay safe on the filename side\n",
+    "output = open(f'mediawiki-api-dérive/{ startpage }.html', 'w')\n",
+    "output.write(html)\n",
+    "output.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Then we loop through the list of wikilinks\n",
+    "# and repeat the steps for each page\n",
+    "\n",
+    "for wikilink in wikilinks:\n",
+    "\n",
+    "    # let's copy the current wikilink pagename, to avoid confusion later\n",
+    "    currentwikilink = wikilink\n",
+    "    print('Now requesting:', currentwikilink)\n",
+    "\n",
+    "    # parse this wiki page\n",
+    "    wikilink = wikilink.replace(' ', '_')\n",
+    "    request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ wikilink }&format=json'\n",
+    "\n",
+    "    # --> we insert a \"try/except\" condition,\n",
+    "    # to catch errors in case a page does not exist\n",
+    "    try:\n",
+    "\n",
+    "        # continue the parse request\n",
+    "        response = urllib.request.urlopen(request).read()\n",
+    "        data = json.loads(response)\n",
+    "        JSON(data)\n",
+    "\n",
+    "        # select the links\n",
+    "        links = data['parse']['links']\n",
+    "\n",
+    "        # turn it into a list of pagenames\n",
+    "        wikilinks = []\n",
+    "        for link in links:\n",
+    "            pagename = link['*']\n",
+    "            wikilinks.append(pagename)\n",
+    "\n",
+    "        # turn the wikilinks into HTML links\n",
+    "        html = ''\n",
+    "        for wikilink in wikilinks:\n",
+    "            filename = wikilink.replace(' ', '_')\n",
+    "            a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
+    "            html += a\n",
+    "            html += '\\n'\n",
+    "\n",
+    "        # save it as an HTML page\n",
+    "        currentwikilink = currentwikilink.replace(' ', '_') # let's again stay safe on the filename side\n",
+    "        output = open(f'mediawiki-api-dérive/{ currentwikilink }.html', 'w')\n",
+    "        output.write(html)\n",
+    "        output.close()\n",
+    "\n",
+    "    except:\n",
+    "        error = sys.exc_info()[0]\n",
+    "        print('Skipped:', wikilink)\n",
+    "        print('With the error:', error)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What's next?\n",
+    "\n",
+    "You could add more loops to the recursive parsing, adding more layers ...\n",
+    "\n",
+    "You could request all images of a page (instead of links) ...\n",
+    "\n",
+    "or something else the API offers ... (contributors, text, etc.)\n",
+    "\n",
+    "or ...\n",
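+    "\n",
+    "For example, the images: the parse response also lists the image files used on a page. A small sketch of what that could look like, reusing the request from the start of this notebook and assuming the filenames sit under `data['parse']['images']`:\n",
+    "\n",
+    "```\n",
+    "# parse the wiki page Hypertext and list the images that are used on it\n",
+    "request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&format=json'\n",
+    "response = urllib.request.urlopen(request).read()\n",
+    "data = json.loads(response)\n",
+    "\n",
+    "images = data['parse']['images']\n",
+    "print(images)\n",
+    "```"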
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}