
MediaWiki API (Dérive)

This notebook:

  • continues exploring the connections between Hypertext & Dérive
  • saves (parts of) wiki pages as HTML files
In [ ]:
import urllib.request
import json
from IPython.display import JSON # IPython JSON renderer
import sys

Parse

Let's use another wiki this time: the English Wikipedia.

You can pick any page; I took the Hypertext page as an example for this notebook: https://en.wikipedia.org/wiki/Hypertext
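
The request itself is just a URL with a few query parameters (action, page, format). If you would rather not glue that string together by hand, urllib.parse.urlencode can build and escape the query part for you. A small optional sketch (the cell below builds the URL manually, which works just as well):

import urllib.parse

# build the same request URL from a dictionary of parameters;
# urlencode also takes care of escaping spaces and special characters
params = urllib.parse.urlencode({
    'action': 'parse',
    'page': 'Hypertext',
    'format': 'json',
})
request = 'https://en.wikipedia.org/w/api.php?' + params
print(request)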

In [ ]:
# parse the wiki page Hypertext
request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&format=json'
response = urllib.request.urlopen(request).read()
data = json.loads(response)
JSON(data)
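
The JSON renderer gives a collapsible view of the response. If you prefer plain text, you could also peek at the keys first; a quick sketch:

# what does the response contain, and what is inside 'parse'?
print(data.keys())
print(data['parse'].keys())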
In [ ]:
 
In [ ]:
 

Select the wiki links from the data response:

In [ ]:
links = data['parse']['links']
JSON(links)

Let's save the links as a list of pagenames, to make it look like this:

['hyperdocuments', 'hyperwords', 'hyperworld']

In [ ]:
# How is "links" structured now?
print(links)

It helps to copy-paste a small part of the output first:

[{'ns': 0, 'exists': '', '*': 'Metatext'}, {'ns': 0, '*': 'De man met de hoed'}]

and to write it differently with indentation:

links = [
    {
        'ns' : 0,
        'exists' : '',
        '*' : 'Metatext'
    },
    {
        'ns' : 0,
        '*' : 'De man met de hoed'
    }
]
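
Instead of re-indenting by hand, you could also let json.dumps pretty-print the structure for you; a small aside:

# pretty-print the first two link dictionaries with indentation
print(json.dumps(links[:2], indent=4))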

We can now loop through "links" and add all the pagenames to a new list called "wikilinks".

In [ ]:
wikilinks = []

for link in links:
    
    print('link:', link)
    
    for key, value in link.items():
        print('----- key:', key)
        print('----- value:', value)
        print('-----')
        
    pagename = link['*']
    print('===== pagename:', pagename)
    
    wikilinks.append(pagename)
In [ ]:
wikilinks
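
As a side note, the same list can be built in one line with a list comprehension; this sketch is equivalent to the loop above:

# equivalent one-liner: take the '*' value of every link dictionary
wikilinks = [link['*'] for link in links]
print(wikilinks[:10])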
In [ ]:
 

Let's convert the list of pagenames into HTML link elements (<a href=""></a>):

In [ ]:
html = ''

for wikilink in wikilinks:
    print(wikilink)
    
    # let's use the "safe" pagenames for the filenames 
    # by replacing the ' ' with '_'
    filename = wikilink.replace(' ', '_')
    
    a = f'<a href="{ filename }.html">{ wikilink }</a>'
    html += a
    html += '\n'
In [ ]:
print(html)
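
One caveat: page names can contain characters besides spaces that are awkward in filenames (slashes, for example). If that ever becomes a problem, urllib.parse.quote can percent-encode the names; a sketch that is not used in the cells below:

from urllib.parse import quote

for wikilink in wikilinks[:5]:
    # quote() percent-encodes characters such as '/' and '?';
    # safe='' makes sure slashes are encoded as well
    filename = quote(wikilink.replace(' ', '_'), safe='')
    print(f'<a href="{ filename }.html">{ wikilink }</a>')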
In [ ]:
 
In [ ]:
# Let's save this page in a separate folder. I called it "mediawiki-api-dérive".
# We can make this folder here with a terminal command, but you can also do it in the interface on the left.
! mkdir mediawiki-api-dérive
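
The same can be done from Python itself; os.makedirs with exist_ok=True also avoids the error you would otherwise get when the folder already exists (for example when re-running the notebook):

import os

# create the output folder if it is not there yet
os.makedirs('mediawiki-api-dérive', exist_ok=True)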
In [ ]:
output = open('mediawiki-api-dérive/Hypertext.html', 'w')
output.write(html)
output.close()
In [ ]:
 
In [ ]:
 

Recursive parsing

We can now repeat the steps for each wikilink that we collected!

We can make an API request for each wikilink, ask for all the links on that page, and save them as an HTML page.
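
Since the same steps come back for every page, they could also be wrapped in a small function first. The sketch below uses a hypothetical name, parse_and_save; the cells that follow keep the steps inline instead, so this is just an optional refactoring:

def parse_and_save(pagename, folder='mediawiki-api-dérive'):
    # hypothetical helper: request one page, collect its links,
    # save them as <a href=""></a> elements and return the pagenames
    safe = pagename.replace(' ', '_')
    request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ safe }&format=json'
    data = json.loads(urllib.request.urlopen(request).read())
    pagenames = [link['*'] for link in data['parse']['links']]
    html = ''
    for name in pagenames:
        filename = name.replace(' ', '_')
        html += f'<a href="{ filename }.html">{ name }</a>\n'
    with open(f'{ folder }/{ safe }.html', 'w') as output:
        output.write(html)
    return pagenames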

In [ ]:
# First we save the Hypertext page again:

startpage = 'Hypertext'

# parse the first wiki page
request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ startpage }&format=json'
response = urllib.request.urlopen(request).read()
data = json.loads(response)
JSON(data)

# select the links
links = data['parse']['links']

# turn it into a list of pagenames
wikilinks = []
for link in links:
    pagename = link['*']
    wikilinks.append(pagename)

# turn the wikilinks into a set of <a href=""></a> links
html = ''
for wikilink in wikilinks:
    filename = wikilink.replace(' ', '_')
    a = f'<a href="{ filename }.html">{ wikilink }</a>'
    html += a
    html += '\n'

# save it as an HTML page
startpage = startpage.replace(' ', '_') # let's again stay safe on the filename side
output = open(f'mediawiki-api-dérive/{ startpage }.html', 'w')
output.write(html)
output.close()
In [ ]:
# Then we loop through the list of wikilinks
# and repeat the steps for each page
    
for wikilink in wikilinks:
    
    # let's copy the current wikilink pagename, to avoid confusion later
    currentwikilink = wikilink 
    print('Now requesting:', currentwikilink)
    
    # parse this wiki page
    wikilink = wikilink.replace(' ', '_')
    request = f'https://en.wikipedia.org/w/api.php?action=parse&page={ wikilink }&format=json'
    
    # --> we insert a try/except block,
    # to catch errors in case a page does not exist
    try: 
        
        # continue the parse request
        response = urllib.request.urlopen(request).read()
        data = json.loads(response)
        JSON(data)

        # select the links
        links = data['parse']['links']

        # turn it into a list of pagenames
        wikilinks = []
        for link in links:
            pagename = link['*']
            wikilinks.append(pagename)

        # turn the wikilinks into a set of <a href=""></a> links
        html = ''
        for wikilink in wikilinks:
            filename = wikilink.replace(' ', '_')
            a = f'<a href="{ filename }.html">{ wikilink }</a>'
            html += a
            html += '\n'

        # save it as an HTML page
        currentwikilink = currentwikilink.replace(' ', '_') # let's again stay safe on the filename side
        output = open(f'mediawiki-api-dérive/{ currentwikilink }.html', 'w')
        output.write(html)
        output.close()
            
    except Exception:
        error = sys.exc_info()[0]
        print('Skipped:', wikilink)
        print('With the error:', error)
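
A practical aside: the loop above fires one request per page in quick succession. If you extend it, it is friendly to pause briefly between requests, for example with time.sleep. A minimal sketch, not part of the original loop:

import time

for wikilink in wikilinks[:3]:
    print('Would now request:', wikilink)
    # ... the same request-and-save steps as above would go here ...
    time.sleep(1)  # pause for a second between API calls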
In [ ]:
 

What's next?

You could add more loops to the recursive parsing, adding more layers ...

You could request all images of a page (instead of links), as sketched below ...

or something else the API offers ... (contributors, text, etc)

or ...
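
For example, asking for the images of a page instead of its links is only a small change to the request: prop=images limits the parse output to the list of image filenames. A sketch:

# ask for the images on the Hypertext page instead of its links
request = 'https://en.wikipedia.org/w/api.php?action=parse&page=Hypertext&prop=images&format=json'
response = urllib.request.urlopen(request).read()
data = json.loads(response)

# 'images' should be a list of image filenames
for image in data['parse']['images']:
    print(image)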
