Merge branch 'master' of ssh://git.xpub.nl:2501/XPUB/prototyping-times

manetta · 3 years ago · commit 75b3b84380 (branch: master)

@ -0,0 +1,464 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MediaWiki API (Dérive)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook:\n",
"\n",
"* continues with exploring the connections between `Hypertext` & `Dérive`\n",
"* saves (parts of) wiki pages as html files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib\n",
"import json\n",
"from IPython.display import JSON # iPython JSON renderer\n",
"import sys"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use another wiki this time: the English Wikipedia.\n",
"\n",
"You can pick any page, i took the Hypertext page for this notebook as an example: https://en.wikipedia.org/wiki/Hypertext"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# parse the wiki page Hypertext\n",
"request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&format=json'\n",
"response = urllib.request.urlopen(request).read()\n",
"data = json.loads(response)\n",
"JSON(data)"
]
},
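{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small side note: pagenames can contain spaces or special characters. A minimal sketch (using Python's `urllib.parse.quote`) of how a pagename could be made URL-safe before inserting it into a request:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import urllib.parse\n",
"\n",
"# sketch: URL-encode a pagename before using it in a request URL\n",
"pagename = 'De man met de hoed'\n",
"safe = urllib.parse.quote(pagename.replace(' ', '_'))\n",
"print(f'https://en.wikipedia.org/w/api.php?action=parse&page={ safe }&format=json')"
]
},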
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wiki links dérive\n",
"\n",
"Select the wiki links from the `data` response:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"links = data['parse']['links']\n",
"JSON(links)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's save the links as a list of pagenames, to make it look like this:\n",
"\n",
"`['hyperdocuments', 'hyperwords', 'hyperworld']`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# How is \"links\" structured now?\n",
"print(links)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It helps to copy paste a small part of the output first:\n",
"\n",
"`[{'ns': 0, 'exists': '', '*': 'Metatext'}, {'ns': 0, '*': 'De man met de hoed'}]`\n",
"\n",
"and to write it differently with indentation:\n",
"\n",
"```\n",
"links = [\n",
" { \n",
" 'ns' : 0,\n",
" 'exists' : '',\n",
" '*', 'Metatext'\n",
" }, \n",
" {\n",
" 'ns' : 0,\n",
" 'exists' : '',\n",
" '*' : 'De man met de hoed'\n",
" } \n",
"]\n",
"```\n",
"\n",
"We can now loop through \"links\" and add all the pagenames to a new list called \"wikilinks\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"wikilinks = []\n",
"\n",
"for link in links:\n",
" \n",
" print('link:', link)\n",
" \n",
" for key, value in link.items():\n",
" print('----- key:', key)\n",
" print('----- value:', value)\n",
" print('-----')\n",
" \n",
" pagename = link['*']\n",
" print('===== pagename:', pagename)\n",
" \n",
" wikilinks.append(pagename)"
]
},
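{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same list can also be written in one line, using a list comprehension (equivalent to the loop above, just without the prints):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"wikilinks = [link['*'] for link in links]\n",
"wikilinks"
]
},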
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"wikilinks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving the links in a HTML page\n",
"\n",
"Let's convert the list of pagenames into HTML link elements (`<a href=\"\"></a>`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"html = ''\n",
"\n",
"for wikilink in wikilinks:\n",
" print(wikilink)\n",
" \n",
" # let's use the \"safe\" pagenames for the filenames \n",
" # by replacing the ' ' with '_'\n",
" filename = wikilink.replace(' ', '_')\n",
" \n",
" a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
" html += a\n",
" html += '\\n'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"print(html)"
]
},
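{
"cell_type": "markdown",
"metadata": {},
"source": [
"Browsers will render these bare `<a>` elements, but if you prefer a complete HTML document, a minimal sketch of a page wrapper could look like this (the `<title>` and `charset` are just suggestions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: wrap the links in a minimal HTML page\n",
"page = '<!DOCTYPE html>\\n<html>\\n<head><meta charset=\"utf-8\"><title>Hypertext</title></head>\\n<body>\\n'\n",
"page += html\n",
"page += '</body>\\n</html>\\n'\n",
"print(page)"
]
},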
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's save this page in a separate folder, i called it \"mediawiki-api-dérive\"\n",
"# We can make this folder here using a terminal command, but you can also do it in the interface on the left\n",
"! mkdir mediawiki-api-dérive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output = open('mediawiki-api-dérive/Hypertext.html', 'w')\n",
"output.write(html)\n",
"output.close()"
]
},
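{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small Python side note: the `with` statement closes the file automatically, so the same cell could also be written like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('mediawiki-api-dérive/Hypertext.html', 'w') as output:\n",
"    output.write(html)"
]
},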
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Recursive parsing\n",
"\n",
"We can now repeat the steps for each wikilink that we collected!\n",
"\n",
"We can make an API request for each wikilink, \\\n",
"ask for all the links on the page \\\n",
"and save it as an HTML page."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First we save the Hypertext page again:\n",
"\n",
"startpage = 'Hypertext'\n",
"\n",
"# parse the first wiki page\n",
"request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ startpage }&format=json'\n",
"response = urllib.request.urlopen(request).read()\n",
"data = json.loads(response)\n",
"JSON(data)\n",
"\n",
"# select the links\n",
"links = data['parse']['links']\n",
"\n",
"# turn it into a list of pagenames\n",
"wikilinks = []\n",
"for link in links:\n",
" pagename = link['*']\n",
" wikilinks.append(pagename)\n",
"\n",
"# turn the wikilinks into a set of <a href=\"\"></a> links\n",
"html = ''\n",
"for wikilink in wikilinks:\n",
" filename = wikilink.replace(' ', '_')\n",
" a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
" html += a\n",
" html += '\\n'\n",
"\n",
"# save it as a HTML page\n",
"startpage = startpage.replace(' ', '_') # let's again stay safe on the filename side\n",
"output = open(f'mediawiki-api-dérive/{ startpage }.html', 'w')\n",
"output.write(html)\n",
"output.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Then we loop through the list of wikilinks\n",
"# and repeat the steps for each page\n",
" \n",
"for wikilink in wikilinks:\n",
" \n",
" # let's copy the current wikilink pagename, to avoid confusion later\n",
" currentwikilink = wikilink \n",
" print('Now requesting:', currentwikilink)\n",
" \n",
" # parse this wiki page\n",
" wikilink = wikilink.replace(' ', '_')\n",
" request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ wikilink }&format=json'\n",
" \n",
" # --> we insert a \"try and error\" condition, \n",
" # to catch errors in case a page does not exist \n",
" try: \n",
" \n",
" # continue the parse request\n",
" response = urllib.request.urlopen(request).read()\n",
" data = json.loads(response)\n",
" JSON(data)\n",
"\n",
" # select the links\n",
" links = data['parse']['links']\n",
"\n",
" # turn it into a list of pagenames\n",
" wikilinks = []\n",
" for link in links:\n",
" pagename = link['*']\n",
" wikilinks.append(pagename)\n",
"\n",
" # turn the wikilinks into a set of <a href=\"\"></a> links\n",
" html = ''\n",
" for wikilink in wikilinks:\n",
" filename = wikilink.replace(' ', '_')\n",
" a = f'<a href=\"{ filename }.html\">{ wikilink }</a>'\n",
" html += a\n",
" html += '\\n'\n",
"\n",
" # save it as a HTML page\n",
" currentwikilink = currentwikilink.replace(' ', '_') # let's again stay safe on the filename side\n",
" output = open(f'mediawiki-api-dérive/{ currentwikilink }.html', 'w')\n",
" output.write(html)\n",
" output.close()\n",
" \n",
" except:\n",
" error = sys.exc_info()[0]\n",
" print('Skipped:', wikilink)\n",
" print('With the error:', error)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"?\n",
"\n",
"You could add more loops to the recursive parsing, adding more layers ...\n",
"\n",
"You could request all images of a page (instead of links) ...\n",
"\n",
"or something else the API offers ... (contributors, text, etc)\n",
"\n",
"or ..."
]
},
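{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a small sketch of the \"images instead of links\" idea, using the `prop=images` parameter of the `parse` action (it returns the filenames of the images that are used on a page):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: request all image filenames of a page\n",
"request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&prop=images&format=json'\n",
"response = urllib.request.urlopen(request).read()\n",
"data = json.loads(response)\n",
"data['parse']['images']"
]
},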
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

@ -13,7 +13,6 @@
"source": [
"This notebook:\n",
"\n",
"* continues with exploring the connections between `Hypertext` & `Dérive`\n",
"* uses the `query` & `parse` actions of the `MediaWiki API`, which we can use to work with wiki pages as (versioned and hypertextual) technotexts\n",
"\n",
"## Epicpedia\n",
