# MediaWiki API (part 2)

Let's request the https://pzwiki.wdka.nl/mediadesign/Dérive page.

(I made a copy of last week's page, to make the URL a bit simpler :).)

And let's try to save different versions of it as .html pages, using the API. 

In [111]:
import urllib.request # we import the "request" part explicitly, a plain "import urllib" does not always load it
import json
from IPython.display import JSON

## Make an API request

This is how we did an API request to the PZI MediaWiki last week:

In [112]:
url = 'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=parse&page=D%C3%A9rive&format=json' # urllib doesn't like the "é", so we write it percent-encoded (%C3%A9)
response = urllib.request.urlopen(url).read()
data = json.loads(response)
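
Instead of typing %C3%A9 by hand, we can let Python do the percent-encoding. This is only a small sketch: `build_api_url` is a name made up for this example, it is not part of urllib or MediaWiki.

```python
import urllib.parse

API = 'https://pzwiki.wdka.nl/mw-mediadesign/api.php'

def build_api_url(**params):
    # build_api_url is a hypothetical helper, written for this notebook only
    # urlencode() percent-encodes every value, so "é" becomes %C3%A9
    return API + '?' + urllib.parse.urlencode(params)

url = build_api_url(action='parse', page='Dérive', format='json')
print(url)
# https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=parse&page=D%C3%A9rive&format=json
```

The resulting URL is the same as the one written by hand above, so it can be passed to `urllib.request.urlopen()` in the same way.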

In [113]:
# To inspect the JSON formatted output of that request: 
JSON(data)

<IPython.core.display.JSON object>

## Select from the API request's output

We can navigate the output in Python, by using the "key/value" structure of the JSON output:

In [114]:
title = data['parse']['title']
html = data['parse']['text']['*']
links = data['parse']['links']
externallinks = data['parse']['externallinks']
images = data['parse']['images']
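
The selection above assumes that the request worked. When something goes wrong (for example, when the page does not exist), the API answers with an "error" key instead of "parse". A minimal sketch of a more careful selection, where `select_parse_fields` and the sample data are made up for this example:

```python
def select_parse_fields(data):
    # hypothetical helper: collect the parts of a "parse" response used in this notebook
    if 'error' in data:
        # the API reports problems (e.g. a missing page) under an "error" key
        raise ValueError(data['error'].get('info', 'unknown API error'))
    parse = data['parse']
    return {
        'title': parse['title'],
        'html': parse['text']['*'],
        'links': parse.get('links', []),
        'externallinks': parse.get('externallinks', []),
        'images': parse.get('images', []),
    }

# a hand-written sample in the same shape as the real response
sample = {'parse': {'title': 'Dérive', 'text': {'*': '<p>hi</p>'}, 'images': ['Debo_009_05_01.jpg']}}
fields = select_parse_fields(sample)
print(fields['title'], fields['images'])
```

Using `.get(..., [])` means that a page without links or images gives an empty list instead of a KeyError.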

In [115]:
print(title)

Dérive


In [116]:
print(html)

<div class="mw-parser-output"><p><a href="/mediadesign/Hi" title="Hi">hi</a>
</p><p>Refresh your memory here&#160;:<a href="/mediadesign/Wiki_Tutorial" title="Wiki Tutorial">Wiki_Tutorial</a>
</p><p><a href="/mediadesign/Hypertext" title="Hypertext">Hypertext</a>
Hypertext: An Educational Experiment in English and Computer Science at Brown University
</p><p><iframe width="420" height="315" src="https://www.youtube.com/embed/wUTaNQWjNy8" frameborder="0" allowfullscreen></iframe>
</p><p><a href="/mediadesign/Hyper_Poetry" title="Hyper Poetry">Hyper Poetry</a>
What it is? <a target="_blank" rel="nofollow noreferrer noopener" class="external autonumber" href="https://sites.google.com/a/cougars.csusm.edu/20-poetry/my-20-project/theexpansionalookintohypertext">[1]</a>
</p>
<div class="thumb tright"><div class="thumbinner" style="width:302px;"><a href="/mediadesign/File:Debo_009_05_01.jpg" class="image"><img alt="" src="/mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/300px-Debo_009_05_01

In [117]:
print(links)

[{'ns': 0, 'exists': '', '*': 'Hi'}, {'ns': 0, 'exists': '', '*': 'Hyper Poetry'}, {'ns': 0, 'exists': '', '*': 'Hypertext'}, {'ns': 0, 'exists': '', '*': 'Wiki Tutorial'}]
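
Every link in this list has an 'exists' key. As far as I know, a link to a page that has not been written yet (a "red link") comes without that key, so we can use it to keep only the pages that are worth requesting. A small sketch under that assumption; the last entry in `sample_links` is invented for the example:

```python
# the same entries as printed above, plus one made-up link without "exists"
sample_links = [
    {'ns': 0, 'exists': '', '*': 'Hi'},
    {'ns': 0, 'exists': '', '*': 'Hyper Poetry'},
    {'ns': 0, 'exists': '', '*': 'Hypertext'},
    {'ns': 0, 'exists': '', '*': 'Wiki Tutorial'},
    {'ns': 0, '*': 'A Page That Does Not Exist'},  # hypothetical red link
]

# keep only the names of the pages that exist on the wiki
existing = [link['*'] for link in sample_links if 'exists' in link]
print(existing)
```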


In [118]:
print(externallinks)

['https://sites.google.com/a/cougars.csusm.edu/20-poetry/my-20-project/theexpansionalookintohypertext']


In [119]:
print(images)

['Debo_009_05_01.jpg', 'Sex-majik-2004.gif']


## Download images

(all the images that are used on this page)

For this, we will use the "query" request, with the parameter "list=allimages". 

This is one way to retrieve the full URL of an image using the MediaWiki API (a "query" request with "prop=imageinfo" is another).

In [120]:
# Let's first test it with one image.
# For example: File:Debo 009 05 01.jpg

filename = 'Debo 009 05 01.jpg'
filename = filename.replace(' ', '_') # let's replace spaces again with _
filename = filename.replace('.jpg', '') # and let's remove the file extension

In [121]:
# We search the list of all images on the wiki, using the "aifrom=" parameter to say where the list should start.
# Note: ai=allimages
url = f'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=query&list=allimages&aifrom={ filename }&format=json'
response = urllib.request.urlopen(url).read()
data = json.loads(response)

In [122]:
JSON(data)

<IPython.core.display.JSON object>

In [123]:
# Select the first result [0], let's assume that that is always the right image that we need :)
image = data['query']['allimages'][0]

In [124]:
print(image)

{'name': 'Debo_009_05_01.jpg', 'timestamp': '2021-01-21T14:54:44Z', 'url': 'https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg', 'descriptionurl': 'https://pzwiki.wdka.nl/mediadesign/File:Debo_009_05_01.jpg', 'descriptionshorturl': 'https://pzwiki.wdka.nl/mw-mediadesign/index.php?curid=33518', 'ns': 6, 'title': 'File:Debo 009 05 01.jpg'}


In [125]:
print(image['url'])

https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg
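
As a side note: the MediaWiki API also has a "prop=imageinfo" query, which asks for the URL of one specific file by its title. A minimal sketch of that route, where the helper names are made up and the sample response is written by hand in the shape this query normally returns:

```python
import urllib.parse

def imageinfo_url(filename):
    # hypothetical helper: build a prop=imageinfo request for one file
    params = {
        'action': 'query',
        'titles': f'File:{filename}',
        'prop': 'imageinfo',
        'iiprop': 'url',  # ii=imageinfo, we only ask for the URLs
        'format': 'json',
    }
    return 'https://pzwiki.wdka.nl/mw-mediadesign/api.php?' + urllib.parse.urlencode(params)

def first_image_url(data):
    # the results are grouped per page id, so we loop over the pages
    for page in data['query']['pages'].values():
        for info in page.get('imageinfo', []):
            return info['url']
    return None  # nothing found, e.g. the file does not exist

# hand-written sample response (not fetched from the wiki)
sample = {'query': {'pages': {'33518': {
    'title': 'File:Debo 009 05 01.jpg',
    'imageinfo': [{'url': 'https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg'}],
}}}}

print(imageinfo_url('Debo_009_05_01.jpg'))
print(first_image_url(sample))
```

With this route there is no need to assume that the first search result is the right image.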


Now we can use this URL to download the image!

In [126]:
image_url = image['url']
image_filename = image['name']
image_response = urllib.request.urlopen(image_url).read() # We use urllib for this again, this is basically our tool to download things from the web !

In [127]:
print(image_response)

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00d\x00d\x00\x00\xff\xdb\x00C\x00\x08\x05\x06\x07\x06\x05\x08\x07\x06\x07\t\x08\x08\t\x0c\x13\x0c\x0c\x0b\x0b\x0c\x18\x11\x12\x0e\x13\x1c\x18\x1d\x1d\x1b\x18\x1b\x1a\x1f#,%\x1f!*!\x1a\x1b&4\'*./121\x1e%6:60:,010\xff\xdb\x00C\x01\x08\t\t\x0c\n\x0c\x17\x0c\x0c\x170 \x1b 00000000000000000000000000000000000000000000000000\xff\xc2\x00\x11\x08\x01\xb5\x02v\x03\x01"\x00\x02\x11\x01\x03\x11\x01\xff\xc4\x00\x1a\x00\x01\x00\x02\x03\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x04\x02\x03\x05\x06\xff\xc4\x00\x19\x01\x01\x00\x03\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x05\xff\xda\x00\x0c\x03\x01\x00\x02\x10\x03\x10\x00\x00\x01\xf7\x08\xca\x96$\x88\x90\x00\x02D$B@\t\x89\x00\x00\x00\x00\x84\x88H\x84\x88H\x84\x88H\x84\x88H\x80D\xb9q~\xa0\x9aBD$A$\x00\x00\x00D\x88H\x84\x8cYA\t\x11\x19A\x8b \xcb\x1c\x80\t\x11 $\x89\x00\x00\x00\x00\x04\x90\x91\t\x10\x91\t\x10\x91\t\x10\x98\x04\x90\x98\x00\x00\x08\xe2_\xad\x97WXk\x

In [128]:
with open(image_filename, 'wb') as out: # 'wb' stands for 'write bytes', we basically ask this file to accept data in byte format
    out.write(image_response) # the file is closed automatically at the end of the "with" block
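
The bytes that we printed above start with `\xff\xd8\xff`, which is the signature of a JPEG file. Looking at the first bytes is a quick way to check that we really downloaded an image (and not, for example, an error page). A small sketch, with `guess_image_type` as a made-up helper that only knows three formats:

```python
def guess_image_type(data):
    # hypothetical helper: look at the first bytes ("magic numbers") of the file
    if data.startswith(b'\xff\xd8\xff'):
        return 'jpeg'
    if data.startswith(b'GIF87a') or data.startswith(b'GIF89a'):
        return 'gif'
    if data.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    return 'unknown'

# the first bytes of the image we downloaded above
print(guess_image_type(b'\xff\xd8\xff\xe0\x00\x10JFIF'))
# jpeg
```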

## Download all the images of our page

In [129]:
# We have our variable "images"
print(images)

['Debo_009_05_01.jpg', 'Sex-majik-2004.gif']


In [130]:
# Let's loop through this list and download each image!
for filename in images:
    print('Downloading:', filename)
    
    filename = filename.replace(' ', '_') # let's replace spaces again with _
    filename = filename.rsplit('.', 1)[0] # and let's remove the file extension, whatever it is
    
    # first we search for the full URL of the image
    url = f'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=query&list=allimages&aifrom={ filename }&format=json'
    response = urllib.request.urlopen(url).read()
    data = json.loads(response)
    image = data['query']['allimages'][0]
    
    # then we download the image
    image_url = image['url']
    image_filename = image['name']
    image_response = urllib.request.urlopen(image_url).read()
    
    # and we save it as a file
    with open(image_filename, 'wb') as out:
        out.write(image_response)

Downloading: Debo_009_05_01.jpg
Downloading: Sex-majik-2004.gif
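
The loop above trusts that the first search result is the image we asked for. Since "aifrom=" only says where the list starts, another file could come first. A minimal sketch of a stricter selection, where `pick_image` and the sample results are made up for this example:

```python
def pick_image(results, wanted):
    # hypothetical helper: return the result whose name matches exactly, or None
    wanted = wanted.replace(' ', '_')  # the API writes names with underscores
    for image in results:
        if image['name'] == wanted:
            return image
    return None

# hand-written sample in the shape of data['query']['allimages']
sample_results = [
    {'name': 'Debo_009_05_00.jpg', 'url': 'https://example.org/wrong.jpg'},  # invented neighbour
    {'name': 'Debo_009_05_01.jpg', 'url': 'https://pzwiki.wdka.nl/mw-mediadesign/images/c/c8/Debo_009_05_01.jpg'},
]

image = pick_image(sample_results, 'Debo 009 05 01.jpg')
print(image['url'])
```

In the loop, `image = pick_image(data['query']['allimages'], filename)` could take the place of the `[0]` selection, as long as `filename` still includes its extension and a result of `None` is skipped.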


## Fix the links :)

(of image src links + page links)

In [131]:
html = html.replace('/mediadesign/', './')

In [132]:
html = html.replace('File:', '')

In [None]:
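import re # str.replace() cannot match patterns, so we need a regular expression here
html = re.sub(r'/mw-mediadesign/images/thumb/\w+/\w+/[^/"]+/\d+px-', '', html) # strip the thumbnail path, keep the filename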
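
To see what a regular expression does with the thumbnail links, we can try it on a small piece of HTML. The snippet below is written by hand, assuming that a thumbnail URL ends with the width and then the filename again; the pattern is one possible choice, not the only one.

```python
import re

# hand-written sample of an image tag with a thumbnail URL
snippet = '<img alt="" src="/mw-mediadesign/images/thumb/c/c8/Debo_009_05_01.jpg/300px-Debo_009_05_01.jpg">'

# everything from /mw-mediadesign/ up to and including "300px-" (or any other width)
pattern = r'/mw-mediadesign/images/thumb/\w+/\w+/[^/"]+/\d+px-'

fixed = re.sub(pattern, '', snippet)
print(fixed)
# <img alt="" src="Debo_009_05_01.jpg">
```

The src now points to a file with the same name as the image we saved next to the .html page.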

## Save the text/html to a file

In [133]:
# Let's use _ in the filenames, before we open the file
title = title.replace(' ', '_')

In [134]:
with open(f'{ title }.html', 'w', encoding='utf-8') as out: # utf-8, so that characters like "é" are saved correctly
    out.write(html)
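
The HTML that the API gives us is only a fragment (it starts with a `<div>`), so a browser has to guess the character encoding of our file. One option is to wrap the fragment in a complete HTML page first. A minimal sketch, where `wrap_html` is a made-up helper:

```python
from html import escape  # imported by name, so that our own "html" variable is not overwritten

def wrap_html(title, body):
    # hypothetical helper: put the fragment from the API inside a complete HTML page
    return f'''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>{escape(title)}</title>
</head>
<body>
{body}
</body>
</html>
'''

page = wrap_html('Dérive', '<div class="mw-parser-output"><p>hi</p></div>')
print(page)
```

The `page` string can then be written to a file instead of the bare fragment.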

## Save all the linked wiki pages to files 

In [135]:
for link in links:
    
    print('link:', link)
    
    pagename = link['*']
    pagename = pagename.replace(' ', '_')
    print('Saving:', pagename)
    
    url = f'https://pzwiki.wdka.nl/mw-mediadesign/api.php?action=parse&page={ pagename }&format=json'
    response = urllib.request.urlopen(url).read()
    data = json.loads(response)
    
    html = data['parse']['text']['*']
    html = html.replace('/mediadesign/', './')
    html = html.replace('File:', '')

    with open(f'{ pagename }.html', 'w', encoding='utf-8') as out:
        out.write(html)

    print('---')

link: {'ns': 0, 'exists': '', '*': 'Hi'}
Saving: Hi
---
link: {'ns': 0, 'exists': '', '*': 'Hyper Poetry'}
Saving: Hyper_Poetry
---
link: {'ns': 0, 'exists': '', '*': 'Hypertext'}
Saving: Hypertext
---
link: {'ns': 0, 'exists': '', '*': 'Wiki Tutorial'}
Saving: Wiki_Tutorial
---


## Etc. Now you could also download all the images from all the pages that are linked :)
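
A possible starting point for that: first collect the names of all the images that are used on the linked pages. This is only a sketch. The function names are made up, and `fake_fetch` with its page contents is invented, so that we can try the code out without touching the wiki.

```python
import json
import urllib.parse
import urllib.request

API = 'https://pzwiki.wdka.nl/mw-mediadesign/api.php'

def fetch_json(params):
    # ask the API and decode the JSON answer
    url = API + '?' + urllib.parse.urlencode(params)
    return json.loads(urllib.request.urlopen(url).read())

def images_on_pages(pagenames, fetch=fetch_json):
    # collect the unique image names that are used on a list of pages
    found = set()
    for pagename in pagenames:
        data = fetch({'action': 'parse', 'page': pagename, 'format': 'json'})
        if 'parse' not in data:
            continue  # skip pages that the API could not parse
        found.update(data['parse'].get('images', []))
    return sorted(found)

def fake_fetch(params):
    # stand-in for the API, with invented page contents
    pages = {
        'Hi': ['Debo_009_05_01.jpg'],
        'Hypertext': ['Sex-majik-2004.gif', 'Debo_009_05_01.jpg'],
    }
    if params['page'] not in pages:
        return {'error': {'code': 'missingtitle'}}
    return {'parse': {'images': pages[params['page']]}}

print(images_on_pages(['Hi', 'Hypertext', 'Missing'], fetch=fake_fetch))
```

Calling `images_on_pages()` without the `fetch` argument sends the requests to the wiki, and the resulting list can go through the same download loop as before.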