MediaWiki API (Dérive)
This notebook:
- continues exploring the connections between Hypertext & Dérive
- saves (parts of) wiki pages as HTML files
import urllib.request
import json
from IPython.display import JSON  # iPython JSON renderer
import sys
Parse
Let's use another wiki this time: the English Wikipedia.
You can pick any page; I took the Hypertext page as an example for this notebook: https://en.wikipedia.org/wiki/Hypertext
# parse the wiki page Hypertext
request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&format=json'
response = urllib.request.urlopen(request).read()
data = json.loads(response)
JSON(data)
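The parse response should hold more than just links. A quick way to get a feel for what came back is to print the keys under 'parse'; a small sketch, reusing the data variable from the cell above:

# the interesting material sits under the 'parse' key;
# its keys show what else the API returned besides the links we use below
print(data['parse'].keys())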
Wiki links dérive
Select the wiki links from the data response:
links = data['parse']['links']
JSON(links)
Let's save the links as a list of pagenames, to make it look like this:
['hyperdocuments', 'hyperwords', 'hyperworld']
# How is "links" structured now?
print(links)
It helps to copy-paste a small part of the output first:
[{'ns': 0, 'exists': '', '*': 'Metatext'}, {'ns': 0, '*': 'De man met de hoed'}]
and to write it differently with indentation:
links = [
    {
        'ns' : 0,
        'exists' : '',
        '*' : 'Metatext'
    },
    {
        'ns' : 0,
        '*' : 'De man met de hoed'
    }
]
We can now loop through "links" and add all the pagenames to a new list called "wikilinks".
wikilinks = []

for link in links:
    print('link:', link)

    for key, value in link.items():
        print('----- key:', key)
        print('----- value:', value)
        print('-----')

    pagename = link['*']
    print('===== pagename:', pagename)
    wikilinks.append(pagename)
wikilinks
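As a side note, the same list can be written as a list comprehension, and the 'ns' (namespace) field we saw in the output above could be used to keep only article links. This is only a sketch; the rest of the notebook keeps using the wikilinks list built above.

# the same list in one line: collect every '*' value
wikilinks = [link['*'] for link in links]

# optional: keep only links in the main article namespace (ns == 0),
# using the 'ns' field shown in the output above
article_links = [link['*'] for link in links if link.get('ns') == 0]

print(len(wikilinks), len(article_links))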
Saving the links in an HTML page
Let's convert the list of pagenames into HTML link elements (<a href=""></a>):
html = ''

for wikilink in wikilinks:
    print(wikilink)

    # let's use the "safe" pagenames for the filenames
    # by replacing the ' ' with '_'
    filename = wikilink.replace(' ', '_')
    a = f'<a href="{ filename }.html">{ wikilink }</a>'
    html += a
    html += '\n'
print(html)
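Right now html is just a bare list of <a> elements. If you want the saved file to be a complete HTML document, you could wrap it in a minimal skeleton first; a sketch, with a made-up title:

# wrap the links in a minimal HTML skeleton (the title is just an example)
page = f'''<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Hypertext dérive</title>
</head>
<body>
{ html }
</body>
</html>'''

print(page)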
# Let's save this page in a separate folder, I called it "mediawiki-api-dérive"
# We can make this folder here using a terminal command, but you can also do it in the interface on the left
! mkdir mediawiki-api-dérive
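If you prefer to stay inside Python instead of using a terminal command, the standard library can create the folder too; a small sketch:

import os

# create the folder from Python; exist_ok=True means it is fine if it already exists
os.makedirs('mediawiki-api-dérive', exist_ok=True)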
output = open('mediawiki-api-dérive/Hypertext.html', 'w')
output.write(html)
output.close()
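A slightly more idiomatic variant of the same write uses a with block, so the file is closed automatically:

# the same write, with a context manager that closes the file for us
with open('mediawiki-api-dérive/Hypertext.html', 'w') as output:
    output.write(html)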
Recursive parsing
We can now repeat the steps for each wikilink that we collected!
We can make an API request for each wikilink, ask for all the links on the page, and save it as an HTML page.
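Since we will repeat the same request-collect-save steps for every page, it can help to read them as one reusable function first. This is only a sketch (the name parse_and_save is mine, not part of the API); the cells below do the same thing step by step.

def parse_and_save(pagename):
    """Request one wiki page, collect its links and save them as an HTML file."""
    safe_name = pagename.replace(' ', '_')

    # parse the wiki page
    request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ safe_name }&format=json'
    response = urllib.request.urlopen(request).read()
    data = json.loads(response)

    # collect the pagenames of all links on this page
    pagenames = [link['*'] for link in data['parse']['links']]

    # turn them into <a href=""></a> elements
    html = ''
    for name in pagenames:
        filename = name.replace(' ', '_')
        html += f'<a href="{ filename }.html">{ name }</a>\n'

    # save the result next to the other pages
    with open(f'mediawiki-api-dérive/{ safe_name }.html', 'w') as output:
        output.write(html)

    return pagenames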
# First we save the Hypertext page again:
startpage = 'Hypertext'

# parse the first wiki page
request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ startpage }&format=json'
response = urllib.request.urlopen(request).read()
data = json.loads(response)
JSON(data)

# select the links
links = data['parse']['links']

# turn it into a list of pagenames
wikilinks = []
for link in links:
    pagename = link['*']
    wikilinks.append(pagename)

# turn the wikilinks into a set of <a href=""></a> links
html = ''
for wikilink in wikilinks:
    filename = wikilink.replace(' ', '_')
    a = f'<a href="{ filename }.html">{ wikilink }</a>'
    html += a
    html += '\n'

# save it as a HTML page
startpage = startpage.replace(' ', '_')  # let's again stay safe on the filename side
output = open(f'mediawiki-api-dérive/{ startpage }.html', 'w')
output.write(html)
output.close()
# Then we loop through the list of wikilinks
# and repeat the steps for each page

for wikilink in wikilinks:

    # let's copy the current wikilink pagename, to avoid confusion later
    currentwikilink = wikilink
    print('Now requesting:', currentwikilink)

    # parse this wiki page
    wikilink = wikilink.replace(' ', '_')
    request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ wikilink }&format=json'

    # --> we insert a try/except condition,
    # to catch errors in case a page does not exist
    try:
        # continue the parse request
        response = urllib.request.urlopen(request).read()
        data = json.loads(response)

        # select the links
        links = data['parse']['links']

        # turn it into a list of pagenames
        pagenames = []
        for link in links:
            pagenames.append(link['*'])

        # turn the pagenames into a set of <a href=""></a> links
        html = ''
        for pagename in pagenames:
            filename = pagename.replace(' ', '_')
            a = f'<a href="{ filename }.html">{ pagename }</a>'
            html += a
            html += '\n'

        # save it as a HTML page
        currentwikilink = currentwikilink.replace(' ', '_')  # let's again stay safe on the filename side
        output = open(f'mediawiki-api-dérive/{ currentwikilink }.html', 'w')
        output.write(html)
        output.close()

    except:
        error = sys.exc_info()[0]
        print('Skipped:', wikilink)
        print('With the error:', error)
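When requesting many pages in a loop, it is polite to slow down and to identify your script; Wikimedia asks API clients to send a descriptive User-Agent. A hedged sketch of how a single request could look, with a placeholder User-Agent and a one-second pause (the name polite_request is mine):

import time
import urllib.request
import json

# a descriptive User-Agent (placeholder: fill in your own project name and contact)
headers = { 'User-Agent': 'mediawiki-api-derive-notebook (your-email@example.org)' }

def polite_request(pagename):
    url = f'https://en.wikipedia.org/w/api.php?action=parse&page={ pagename }&format=json'
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request).read()
    time.sleep(1)  # pause between requests to keep the load on the server low
    return json.loads(response)

data = polite_request('Hypertext')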
What's next?
You could add more loops to the recursive parsing, adding more layers ...
You could request all images of a page (instead of links), as sketched below ...
or something else the API offers ... (contributors, text, etc)
or ...
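For example, the parse API can also return the images used on a page via prop=images; a minimal sketch, assuming the same Hypertext start page and the imports from the top of the notebook:

# ask the parse API for the images on a page instead of its links
request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&prop=images&format=json'
response = urllib.request.urlopen(request).read()
data = json.loads(response)

# 'images' should come back as a list of image filenames
images = data['parse']['images']
print(images)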