{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MediaWiki API (part 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's request the https://pzwiki.wdka.nl/mediadesign/Dérive page.\n",
"\n",
"(I made a copy of last week's page, to make the URL a bit simpler :).)\n",
"\n",
"And let's try to save different versions of it as .html pages, using the API. "
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [],
"source": [
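"import urllib.request # importing the submodule explicitly, a plain \"import urllib\" does not guarantee that urllib.request is available\n",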
"import json\n",
"from IPython.display import JSON"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make an API request"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is how we did an API request to the PZI MediaWiki last week:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"url = 'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=parse&page=D%C3%A9rive&format=json' # urllib doesn't like the \"é\", so we write it percent-encoded: %C3%A9\n",
"request = urllib.request.urlopen(url).read()\n",
"data = json.loads(request)"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"application/json": {
"parse": {
"categories": [],
"displaytitle": "Dérive",
"externallinks": [
"https://sites.google.com/a/cougars.csusm.edu/20-poetry/my-20-project/theexpansionalookintohypertext"
],
"images": [
"Debo_009_05_01.jpg",
"Sex-majik-2004.gif"
],
"iwlinks": [],
"langlinks": [],
"links": [
{
"*": "Hi",
"exists": "",
"ns": 0
},
{
"*": "Hyper Poetry",
"exists": "",
"ns": 0
},
{
"*": "Hypertext",
"exists": "",
"ns": 0
},
{
"*": "Wiki Tutorial",
"exists": "",
"ns": 0
}
],
"pageid": 33524,
"parsewarnings": [],
"properties": [],
"revid": 188107,
"sections": [],
"templates": [
{
"*": "Template:Youtube",
"exists": "",
"ns": 10
}
],
"text": {
"*": "<div class=\"mw-parser-output\"><p><a href=\"/mediadesign/Hi\" title=\"Hi\">hi</a>\n</p><p>Refresh your memory here&#160;:<a href=\"/mediadesign/Wiki_Tutorial\" title=\"Wiki Tutorial\">Wiki_Tutorial</a>\n</p><p><a href=\"/mediadesign/Hypertext\" title=\"Hypertext\">Hypertext</a>\nHypertext: An Educational Experiment in English and Computer Science at Brown University\n</p><p><iframe width=\"420\" height=\"315\" src=\"https://www.youtube.com/embed/wUTaNQWjNy8\" frameborder=\"0\" allowfullscreen></iframe>\n</p><p><a href=\"/mediadesign/Hyper_Poetry\" title=\"Hyper Poetry\">Hyper Poetry</a>\nWhat it is? <a target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external autonumber\" href=\"https://sites.google.com/a/cougars.csusm.edu/20-poetry/my-20-project/theexpansionalookintohypertext\">[1]</a>\n</p>\n<div class=\"thumb tright\"><div class=\"thumbinner\" style=\"width:302px;\"><a href=\"/mediadesign/File:Debo_009_05_01.jpg\" class=\"image\"><img alt=\"\" src=\"/mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/300px-Debo_009_05_01.jpg\" decoding=\"async\" width=\"300\" height=\"208\" class=\"thumbimage\" srcset=\"/mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/450px-Debo_009_05_01.jpg 1.5x, /mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/600px-Debo_009_05_01.jpg 2x\" /></a> <div class=\"thumbcaption\"><div class=\"magnify\"><a href=\"/mediadesign/File:Debo_009_05_01.jpg\" class=\"internal\" title=\"Enlarge\"></a></div>Derive</div></div></div>\n<div class=\"thumb tright\"><div class=\"thumbinner\" style=\"width:302px;\"><a href=\"/mediadesign/File:Sex-majik-2004.gif\" class=\"image\"><img alt=\"\" src=\"/mw-mediadesign/images/thumb/3/39/Sex-majik-2004.gif/300px-Sex-majik-2004.gif\" decoding=\"async\" width=\"300\" height=\"196\" class=\"thumbimage\" srcset=\"/mw-mediadesign/images/thumb/3/39/Sex-majik-2004.gif/450px-Sex-majik-2004.gif 1.5x, /mw-mediadesign/images/thumb/3/39/Sex-majik-2004.gif/600px-Sex-majik-2004.gif 2x\" /></a> <div class=\"thumbcaption\"><div class=\"magnify\"><a href=\"/mediadesign/File:Sex-majik-2004.gif\" class=\"internal\" title=\"Enlarge\"></a></div>Derive</div></div></div>\n<!-- \nNewPP limit report\nCached time: 20210121160437\nCache expiry: 86400\nDynamic content: false\nCPU time usage: 0.017 seconds\nReal time usage: 0.020 seconds\nPreprocessor visited node count: 7/1000000\nPreprocessor generated node count: 38/1000000\nPostexpand include size: 80/2097152 bytes\nTemplate argument size: 11/2097152 bytes\nHighest expansion depth: 3/40\nExpensive parser function count: 0/100\nUnstrip recursion depth: 0/20\nUnstrip postexpand size: 0/5000000 bytes\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00% 7.082 1 Template:Youtube\n100.00% 7.082 1 -total\n-->\n\n<!-- Saved in parser cache with key wdka_mw_mediadesign-mw_:pcache:idhash:33524-0!canonical and timestamp 20210121160437 and revision id 188107\n -->\n</div>"
},
"title": "Dérive"
}
},
"text/plain": [
"<IPython.core.display.JSON object>"
]
},
"execution_count": 113,
"metadata": {
"application/json": {
"expanded": false,
"root": "root"
}
},
"output_type": "execute_result"
}
],
"source": [
"# To inspect the JSON formatted output of that request: \n",
"JSON(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Select from the API request's output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can navigate the output in Python, by using the \"key/value\" structure of the JSON output:"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [],
"source": [
"title = data['parse']['title']\n",
"html = data['parse']['text']['*']\n",
"links = data['parse']['links']\n",
"externallinks = data['parse']['externallinks']\n",
"images = data['parse']['images']"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dérive\n"
]
}
],
"source": [
"print(title)"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<div class=\"mw-parser-output\"><p><a href=\"/mediadesign/Hi\" title=\"Hi\">hi</a>\n",
"</p><p>Refresh your memory here&#160;:<a href=\"/mediadesign/Wiki_Tutorial\" title=\"Wiki Tutorial\">Wiki_Tutorial</a>\n",
"</p><p><a href=\"/mediadesign/Hypertext\" title=\"Hypertext\">Hypertext</a>\n",
"Hypertext: An Educational Experiment in English and Computer Science at Brown University\n",
"</p><p><iframe width=\"420\" height=\"315\" src=\"https://www.youtube.com/embed/wUTaNQWjNy8\" frameborder=\"0\" allowfullscreen></iframe>\n",
"</p><p><a href=\"/mediadesign/Hyper_Poetry\" title=\"Hyper Poetry\">Hyper Poetry</a>\n",
"What it is? <a target=\"_blank\" rel=\"nofollow noreferrer noopener\" class=\"external autonumber\" href=\"https://sites.google.com/a/cougars.csusm.edu/20-poetry/my-20-project/theexpansionalookintohypertext\">[1]</a>\n",
"</p>\n",
"<div class=\"thumb tright\"><div class=\"thumbinner\" style=\"width:302px;\"><a href=\"/mediadesign/File:Debo_009_05_01.jpg\" class=\"image\"><img alt=\"\" src=\"/mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/300px-Debo_009_05_01.jpg\" decoding=\"async\" width=\"300\" height=\"208\" class=\"thumbimage\" srcset=\"/mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/450px-Debo_009_05_01.jpg 1.5x, /mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/600px-Debo_009_05_01.jpg 2x\" /></a> <div class=\"thumbcaption\"><div class=\"magnify\"><a href=\"/mediadesign/File:Debo_009_05_01.jpg\" class=\"internal\" title=\"Enlarge\"></a></div>Derive</div></div></div>\n",
"<div class=\"thumb tright\"><div class=\"thumbinner\" style=\"width:302px;\"><a href=\"/mediadesign/File:Sex-majik-2004.gif\" class=\"image\"><img alt=\"\" src=\"/mw-mediadesign/images/thumb/3/39/Sex-majik-2004.gif/300px-Sex-majik-2004.gif\" decoding=\"async\" width=\"300\" height=\"196\" class=\"thumbimage\" srcset=\"/mw-mediadesign/images/thumb/3/39/Sex-majik-2004.gif/450px-Sex-majik-2004.gif 1.5x, /mw-mediadesign/images/thumb/3/39/Sex-majik-2004.gif/600px-Sex-majik-2004.gif 2x\" /></a> <div class=\"thumbcaption\"><div class=\"magnify\"><a href=\"/mediadesign/File:Sex-majik-2004.gif\" class=\"internal\" title=\"Enlarge\"></a></div>Derive</div></div></div>\n",
"<!-- \n",
"NewPP limit report\n",
"Cached time: 20210121160437\n",
"Cache expiry: 86400\n",
"Dynamic content: false\n",
"CPU time usage: 0.017 seconds\n",
"Real time usage: 0.020 seconds\n",
"Preprocessor visited node count: 7/1000000\n",
"Preprocessor generated node count: 38/1000000\n",
"Postexpand include size: 80/2097152 bytes\n",
"Template argument size: 11/2097152 bytes\n",
"Highest expansion depth: 3/40\n",
"Expensive parser function count: 0/100\n",
"Unstrip recursion depth: 0/20\n",
"Unstrip postexpand size: 0/5000000 bytes\n",
"-->\n",
"<!--\n",
"Transclusion expansion time report (%,ms,calls,template)\n",
"100.00% 7.082 1 Template:Youtube\n",
"100.00% 7.082 1 -total\n",
"-->\n",
"\n",
"<!-- Saved in parser cache with key wdka_mw_mediadesign-mw_:pcache:idhash:33524-0!canonical and timestamp 20210121160437 and revision id 188107\n",
" -->\n",
"</div>\n"
]
}
],
"source": [
"print(html)"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'ns': 0, 'exists': '', '*': 'Hi'}, {'ns': 0, 'exists': '', '*': 'Hyper Poetry'}, {'ns': 0, 'exists': '', '*': 'Hypertext'}, {'ns': 0, 'exists': '', '*': 'Wiki Tutorial'}]\n"
]
}
],
"source": [
"print(links)"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['https://sites.google.com/a/cougars.csusm.edu/20-poetry/my-20-project/theexpansionalookintohypertext']\n"
]
}
],
"source": [
"print(externallinks)"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Debo_009_05_01.jpg', 'Sex-majik-2004.gif']\n"
]
}
],
"source": [
"print(images)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download images"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(all the images that are used on this page)\n",
"\n",
"For this, we will use the \"query\" request, with the parameter \"list=allimages\". \n",
"\n",
"This is one way to retrieve the full URL of an image using the MediaWiki API (a \"prop=imageinfo\" query with \"iiprop=url\" is another). "
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [],
"source": [
"# Let's first test it with one image.\n",
"# For example: File:Debo 009 05 01.jpg\n",
"\n",
"filename = 'Debo 009 05 01.jpg'\n",
"filename = filename.replace(' ', '_') # let's replace spaces again with _\n",
"filename = filename.replace('.jpg', '') # and let's remove the file extension"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [],
"source": [
"# Here we find the image by listing all files alphabetically, starting from our filename, using the \"aifrom=\" parameter.\n",
"# Note: ai=allimages\n",
"url = f'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=query&list=allimages&aifrom={ filename }&format=json'\n",
"response = urllib.request.urlopen(url).read()\n",
"data = json.loads(response)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"application/json": {
"batchcomplete": "",
"continue": {
"aicontinue": "Deck_6.jpg",
"continue": "-||"
},
"query": {
"allimages": [
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33518",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Debo_009_05_01.jpg",
"name": "Debo_009_05_01.jpg",
"ns": 6,
"timestamp": "2021-01-21T14:54:44Z",
"title": "File:Debo 009 05 01.jpg",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=14589",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Debord-societysml.gif",
"name": "Debord-societysml.gif",
"ns": 6,
"timestamp": "2014-11-30T00:19:20Z",
"title": "File:Debord-societysml.gif",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/b/ba/Debord-societysml.gif"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=4462",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Dec_6_AWU.pdf",
"name": "Dec_6_AWU.pdf",
"ns": 6,
"timestamp": "2011-12-06T15:23:11Z",
"title": "File:Dec 6 AWU.pdf",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/7/70/Dec_6_AWU.pdf"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=4463",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Dec_6_AWUII.pdf",
"name": "Dec_6_AWUII.pdf",
"ns": 6,
"timestamp": "2011-12-06T16:34:43Z",
"title": "File:Dec 6 AWUII.pdf",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/f/fd/Dec_6_AWUII.pdf"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=2090",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:December.gif",
"name": "December.gif",
"ns": 6,
"timestamp": "2010-12-14T21:07:54Z",
"title": "File:December.gif",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/3/3f/December.gif"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33093",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Deck_1.jpg",
"name": "Deck_1.jpg",
"ns": 6,
"timestamp": "2020-11-23T14:31:00Z",
"title": "File:Deck 1.jpg",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/7/74/Deck_1.jpg"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33095",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Deck_2.jpg",
"name": "Deck_2.jpg",
"ns": 6,
"timestamp": "2020-11-23T14:31:00Z",
"title": "File:Deck 2.jpg",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/0/08/Deck_2.jpg"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33084",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Deck_3.jpg",
"name": "Deck_3.jpg",
"ns": 6,
"timestamp": "2020-11-23T14:30:52Z",
"title": "File:Deck 3.jpg",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/f/f5/Deck_3.jpg"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33088",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Deck_4.jpg",
"name": "Deck_4.jpg",
"ns": 6,
"timestamp": "2020-11-23T14:30:52Z",
"title": "File:Deck 4.jpg",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/2/24/Deck_4.jpg"
},
{
"descriptionshorturl": "https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33085",
"descriptionurl": "https://pzwiki.wdka.nl/mediadesign/File:Deck_5.jpg",
"name": "Deck_5.jpg",
"ns": 6,
"timestamp": "2020-11-23T14:30:52Z",
"title": "File:Deck 5.jpg",
"url": "https://pzwiki.wdka.nl/mw-mediadesign/images/9/93/Deck_5.jpg"
}
]
}
},
"text/plain": [
"<IPython.core.display.JSON object>"
]
},
"execution_count": 122,
"metadata": {
"application/json": {
"expanded": false,
"root": "root"
}
},
"output_type": "execute_result"
}
],
"source": [
"JSON(data)"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"# Select the first result [0], let's assume that that is always the right image that we need :)\n",
"image = data['query']['allimages'][0]"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'name': 'Debo_009_05_01.jpg', 'timestamp': '2021-01-21T14:54:44Z', 'url': 'https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg', 'descriptionurl': 'https://pzwiki.wdka.nl/mediadesign/File:Debo_009_05_01.jpg', 'descriptionshorturl': 'https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33518', 'ns': 6, 'title': 'File:Debo 009 05 01.jpg'}\n"
]
}
],
"source": [
"print(image)"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg\n"
]
}
],
"source": [
"print(image['url'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use this URL to download the image!"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"image_url = image['url']\n",
"image_filename = image['name']\n",
"image_response = urllib.request.urlopen(image_url).read() # We use urllib for this again, this is basically our tool to download things from the web !"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'\\xff\\xd8\\xff\\xe0\\x00\\x10JFIF\\x00\\x01\\x01\\x01\\x00d\\x00d\\x00\\x00\\xff\\xdb\\x00C\\x00 ... (raw JPEG bytes; output truncated)\n"
]
}
],
"source": [
"print(image_response)"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [],
"source": [
"with open(image_filename, 'wb') as out: # 'wb' stands for 'write binary': we ask this file to accept data in byte format\n",
"    out.write(image_response) # the file is closed automatically at the end of the with-block"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download all the images of our page"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Debo_009_05_01.jpg', 'Sex-majik-2004.gif']\n"
]
}
],
"source": [
"# We have our variable \"images\"\n",
"print(images)"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading: Debo_009_05_01.jpg\n",
"Downloading: Sex-majik-2004.gif\n"
]
}
],
"source": [
"# Let's loop through this list and download each image!\n",
"for filename in images:\n",
" print('Downloading:', filename)\n",
" \n",
" filename = filename.replace(' ', '_') # let's replace spaces again with _\n",
" filename = filename.rsplit('.', 1)[0] # and let's remove the file extension (whatever it is) by cutting at the last dot\n",
" \n",
" # first we search for the full URL of the image\n",
" url = f'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=query&list=allimages&aifrom={ filename }&format=json'\n",
" response = urllib.request.urlopen(url).read()\n",
" data = json.loads(response)\n",
" image = data['query']['allimages'][0]\n",
" \n",
" # then we download the image\n",
" image_url = image['url']\n",
" image_filename = image['name']\n",
" image_response = urllib.request.urlopen(image_url).read()\n",
" \n",
" # and we save it as a file\n",
" out = open(image_filename, 'wb') \n",
" out.write(image_response)\n",
" out.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fix the links :)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(of image src links + page links)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [],
"source": [
"html = html.replace('/mediadesign/', './')"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [],
"source": [
"html = html.replace('File:', '')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
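"import re # str.replace() only matches literal text, so we need a regular expression here\n",
"html = re.sub(r'/mw-mediadesign/images/thumb/\\w+/\\w+/[^/]+/\\d+px-', '', html) # strip the thumbnail path, so that src points to the downloaded image file"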
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the text/html to a file"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [],
"source": [
"# Let's use _ in the filenames, before we open the file\n",
"title = title.replace(' ', '_')"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [],
"source": [
"with open(f'{ title }.html', 'w', encoding='utf-8') as out: # the page contains non-ASCII characters (like \"é\"), so we set the encoding explicitly\n",
"    out.write(html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save all the linked wiki pages to files "
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"link: {'ns': 0, 'exists': '', '*': 'Hi'}\n",
"Saving: Hi\n",
"---\n",
"link: {'ns': 0, 'exists': '', '*': 'Hyper Poetry'}\n",
"Saving: Hyper_Poetry\n",
"---\n",
"link: {'ns': 0, 'exists': '', '*': 'Hypertext'}\n",
"Saving: Hypertext\n",
"---\n",
"link: {'ns': 0, 'exists': '', '*': 'Wiki Tutorial'}\n",
"Saving: Wiki_Tutorial\n",
"---\n"
]
}
],
"source": [
"for link in links:\n",
" \n",
" print('link:', link)\n",
" \n",
" pagename = link['*']\n",
" pagename = pagename.replace(' ', '_')\n",
" print('Saving:', pagename)\n",
" \n",
" url = f'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=parse&page={ pagename }&format=json'\n",
" request = urllib.request.urlopen(url).read()\n",
" data = json.loads(request)\n",
" \n",
" html = data['parse']['text']['*']\n",
" html = html.replace('/mediadesign/', './')\n",
" html = html.replace('File:', '')\n",
"\n",
" out = open(f'{ pagename }.html', 'w', encoding='utf-8')\n",
" out.write(html)\n",
" out.close() \n",
"\n",
" print('---')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Etc. Now you could also download all the images from all the pages that are linked :)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}