master
Michael Murtaugh 4 years ago
commit c02db565d0

Binary file not shown.

@ -0,0 +1,10 @@
all: botswaller.html botswaller.slides.html
%.slides.html: %.ipynb
	jupyter nbconvert $< --to slides
%.html: %.ipynb
	jupyter nbconvert $< --to html

@ -0,0 +1,971 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Revisiting Botopera"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In 2015, as part of Constant's celebration of [Public Domain Day](https://constantvzw.org/site/-Public-Domain-Day,178-.html), I worked on a collaborative project that came to be known as [Botopera](https://constantvzw.org/site/The-Death-of-the-Authors-1943.html). Each year Constant has celebrated with a series of works (ironically) titled *The Death of the Authors*, a somewhat macabre reflection of European copyright law's stipulation that works remain under the legal restrictions of copyright for 70 years after the deaths of their respective authors. This particular year the works of authors who had died in 1943 were to be considered; specifically the works of: Henri La Fontaine, Sergei Rachmaninoff, Beatrix Potter, Nikola Tesla, and Fats Waller.\n",
"\n",
"> Thomas Wright \"Fats\" Waller (May 21, 1904 – December 15, 1943) was an American jazz pianist, organist, composer, violinist, singer, and comedic entertainer. His innovations in the Harlem stride style laid the groundwork for modern jazz piano. [From the English Wikipedia entry on Waller](https://en.wikipedia.org/wiki/Fats_Waller)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![](800px-Fats_Waller_edit.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"As Waller was tragically young when he died, just 39, his incredibly rich oeuvre represents some of the most (relatively) contemporary music available for study and use in the public domain.\n",
"\n",
"Inspired by a workshop members of the group had participated in during the Relearn \"summer school\", we decided to use IRC as the \"theatre\" to perform in, with each author (and their public domain works) represented by a [chatbot](https://en.wikipedia.org/wiki/Chatbot). Thus the final billing for the project became: BotsWaller, NICKola tesla, Beatrix Plotter, Rachmanibot, henrIRC lafontaine & their plotters."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"An interesting discovery for me was how using IRC suggested a novel way of working as a group. Somewhat stifled by the usual complexities of beginning and planning a collective creative work (with two members working remotely from our \"base\" in Brussels), we decided early on to simply try to meet weekly in a chatroom and force ourselves to practice by performing the work, in whatever rough form it was. In a sense it felt like jam sessions for a nascent band of performers with rather diverse instruments and skill sets."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"As a programmer, what was novel to experience was the particular ways the chatbots worked, not just in a strictly technical sense, but in a social one. Chatbots are just programs that can be run from any computer. When I start a script on my local laptop, the script is programmed with the name of the chat server to connect to, the name of the \"channel\" or room to join, and the \"nickname\" the bot should use (in our case BOTSwaller). After a few seconds, the program's presence is seen by all in the room: the entrance of the bot is announced and its nickname subsequently appears alongside those of the human participants. While running, the script is given access to any messages that appear in the chat, allowing the program to \"read\" what others say and eventually to respond in kind with a text. When the bot program is stopped (by virtue of \"ending naturally\", being \"cancelled\" by the person operating it, or \"dying\" due to an error in the code), the bot \"disappears\" from the chatroom, and the other participants see a message along the lines of \"BOTSwaller has left the room\"."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Each person could run their own scripts from their own laptops, starting and stopping them at will. Most of us were using the Python language; Antonio, with his background in music production and D/VJing, was using Pure Data. As chatbots receive messages from all other participants, and may subsequently introduce their own messages in response, interactions start to occur quickly: between bots and bots, bots and people, in addition to the usual social dynamics between the human participants. In fact, very quickly in designing the bots, the skill becomes limiting the bots' activity, to prevent a cascade of never-ending messages from flooding the chat."
]
},
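{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"One simple way to limit a bot's activity is a cooldown: the bot only answers if enough time has passed since its last reply. The sketch below is only an illustration of that idea, not the approach used in the Botopera bots; the class name `Cooldown` and the five-second interval are hypothetical.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"\n",
"class Cooldown:\n",
"    # Hypothetical helper: allow at most one reply every `seconds` seconds\n",
"\n",
"    def __init__(self, seconds, clock=time.monotonic):\n",
"        self.seconds = seconds\n",
"        self.clock = clock\n",
"        self.last = None  # time of the last permitted reply\n",
"\n",
"    def ready(self):\n",
"        # True (and the timer restarts) if the bot is allowed to reply now\n",
"        now = self.clock()\n",
"        if self.last is None or now - self.last >= self.seconds:\n",
"            self.last = now\n",
"            return True\n",
"        return False\n",
"\n",
"\n",
"cooldown = Cooldown(seconds=5)\n",
"for message in ['hello', 'are you there?', 'tell me more']:\n",
"    if cooldown.ready():\n",
"        print('reply to:', message)\n",
"    else:\n",
"        print('stay quiet:', message)\n",
"```\n",
"\n",
"In a bot, a check like `cooldown.ready()` could guard the point where a reply is sent, so that two bots answering each other cannot escalate into a flood."
]
},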
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"For someone with technical experience making websites with such technologies as CGI and other kinds of \"server-side programming\", the *promiscuity* of a chatroom is extremely rich: social interactions mix with multiple sources of code that the participants themselves can start and stop at will, and which then weave into the fabric of the interaction. And this from a *pre-web technology*. Classic server-side programming requires that code runs on the server, which means having access to that server and the ability to transfer and run the code there. In addition, any interactions between the code and visitors to the website (or other code-based processes) must all be very explicitly designed and coordinated. In short, it's rare that any kind of social interaction happens *accidentally*. In contrast, in the chatroom, *happy accidents* would occur regularly, often suggesting things that we might later want to code explicitly into the bot, just as in a jam session the iterative process of improvisation develops a practice that may (or may not) be codified in more \"repeatable\" forms (such as musical notation)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Steps towards a first bot\n",
"\n",
"> Waller's innovations in the Harlem stride style laid the groundwork for modern jazz piano.\n",
"\n",
"It was in this early \"rehearsal\" mode that we started \"simulating\" what eventual bots might be by performing the bots ourselves. I simply logged into the chat a second time as \"BOTSwaller\" and started to cut and paste sentences I found on Wikipedia, changing third-person references to the first person: \"Waller\" and \"he\" became \"I\", and \"Waller's\" and \"his\" became \"my\":\n",
"\n",
"> My innovations in the Harlem stride style laid the groundwork for modern jazz piano."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Consider Historical Precedents\n",
"\n",
"I had to think of classic chatter bots, like Joseph Weizenbaum's **Eliza** and Kevin Lenzo's classic early IRC **Infobot**. In the case of Eliza, the bot made clever use of grammatical substitutions to mirror responses.\n",
"\n",
"> Human: I am very motherly. \n",
"> Bot: Is it because you are very motherly that you came to see me?\n",
"\n",
"In the case of infobot, the bot made use of an extensible \"fact pack\" database storing collections of responses in a kind of question / answer index."
]
},
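{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The mirroring in the Eliza exchange above can be approximated with a small table of patterns and response templates. This is only a minimal sketch: Weizenbaum's program used considerably richer scripts, and the `RULES` table and `respond` function here are made up for illustration.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical rule table: (pattern, response template)\n",
"RULES = [\n",
"    (re.compile(r'\\bI am (.+?)[.!]?$', re.IGNORECASE),\n",
"     'Is it because you are {0} that you came to see me?'),\n",
"    (re.compile(r'\\bI feel (.+?)[.!]?$', re.IGNORECASE),\n",
"     'Why do you feel {0}?'),\n",
"]\n",
"\n",
"\n",
"def respond(message):\n",
"    # Use the first rule that matches, otherwise fall back to a stock phrase\n",
"    for pattern, template in RULES:\n",
"        match = pattern.search(message.strip())\n",
"        if match:\n",
"            return template.format(match.group(1))\n",
"    return 'Please go on.'\n",
"\n",
"\n",
"print(respond('I am very motherly.'))\n",
"```\n",
"\n",
"The captured phrase is dropped into the template unchanged; a fuller version would also swap pronouns inside it (\"my\" to \"your\", and so on)."
]
},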
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Using Wikipedia articles as a source\n",
"\n",
"The biographical style of the Fats Waller Wikipedia entry seemed like a good starting point for the bot's knowledge. The documentation of the MediaWiki API describes the various ways to [get the contents of a page](https://www.mediawiki.org/wiki/API:Get_the_contents_of_a_page). I chose to use the parse action. It \"parses\" the contents of a Wikipedia article and returns text with HTML markup. An alternative approach would be to directly *scrape* the contents of the Wikipedia article, but the API is freely available[^mediawikiAPI] and avoids having to separate the navigational elements of Wikipedia from the main content of the article.\n",
"\n",
"[^mediawikiAPI]: Unlike commercial services like Twitter or Instagram, wikis based on MediaWiki (like Wikipedia) offer an API that can be used without making a legal agreement limiting your use in exchange for an API key. Instead, following the spirit of a community-driven free software project, wikis stipulate that you follow an API etiquette for using the site responsibly, not overburdening the servers or producing spam."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'<div class=\"mw-parser-output\"><div class=\"shortdescription nomobile noexcerpt noprint searchaux\" style=\"display:none\">American jazz pianist and composer</div>\\n<p class=\"mw-empty-elt\">\\n</p>\\n<table class=\"infobox vcard plainlist\" style=\"width:22em\"><tbody><tr><th colspan=\"2\" style=\"text-align:center;font-size:125%;font-weight:bold;background-color: #f0e68c\"><div style=\"display:inline;\" class=\"fn\">Fats Waller</div></th></tr><tr><td colspan=\"2\" style=\"text-align:center\"><a href=\"/wiki/File:Fats_Waller_edit.jpg\" class=\"image\" title=\"Waller in 1938\"><img alt=\"Waller in 1938\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/1/1c/Fats_Waller_edit.jpg/220px-Fats_Waller_edit.jpg\" decoding=\"async\" width=\"220\" height=\"274\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/1/1c/Fats_Waller_edit.jpg/330px-Fats_Waller_edit.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/1c/Fats_Waller_edit.jpg/440px-Fats_Waller_edit.jpg 2x\" data-file-width=\"1996\" data-file-height=\"2485\" /></a><div>'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from urllib.request import urlopen\n",
"import json\n",
"\n",
"url = \"https://en.wikipedia.org/w/api.php?action=parse&page=Fats_Waller&format=json&formatversion=2\"\n",
"data = json.load(urlopen(url))\n",
"\n",
"# print (data['parse']['text'][:1000])\n",
"data['parse']['text'][:1000]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"For the purposes of an IRC bot, we would like unformatted \"plain\" text. [html5lib](https://html5lib.readthedocs.io/en/latest/) is a Python library that parses HTML, translating the textual source to a structure called an [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html). This can then be [rendered as plain text](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.tostring), effectively stripping away the HTML markup."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"American jazz pianist and composer\n",
"\n",
"\n",
"Fats WallerWaller in 1938Background informationBirth nameThomas Wright WallerBorn(1904-05-21)May 21, 1904New York City, New York, U.S.DiedDecember 15, 1943(1943-12-15) (aged 39)Kansas City, Missouri, U.S.GenresDixieland, jazz, swing, stride, ragtimeOccupation(s)Musician, composerInstrumentsPiano, vocals, organYears active19181943\n",
"Thomas Wright \"Fats\" Waller (May 21, 1904 December 15, 1943) was an American jazz pianist, organist, composer, violinist, singer, and comedic entertainer.[1] His innovations in the Harlem stride style laid the groundwork for modern jazz piano. His best-known compositions, \"Ain't Misbehavin'\" and \"Honeysuckle Rose\", were inducted into the Grammy Hall of Fame in 1984 and 1999.[2] Waller copyrighted over 400 songs, many of them co-written with his closest collaborator, Andy Razaf. Razaf described his partner as \"the soul of melody... a man who made the piano sing... both big in body and in mind... known for his generosity..\n"
]
}
],
"source": [
"import html5lib\n",
"t = html5lib.parse(data['parse']['text'])\n",
"\n",
"from xml.etree import ElementTree as ET\n",
"text = ET.tostring(t, method=\"text\", encoding=\"unicode\")\n",
"\n",
"print (text[:1000])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Eventually, this process could make better use of the HTML to avoid certain kinds of non-sentence content (such as figures and tables). In this case, however, I decided simply to do the cleaning by hand, after the next step. First, I use the [nltk](http://nltk.org/) library's sentence tokenizer to do some of the work."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/murtaugh/nltk_data...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"270 sentences\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"import nltk\n",
"\n",
"nltk.download(\"punkt\")\n",
"sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')\n",
"sentences = sent_tokenizer.tokenize(text)\n",
"\n",
"print (f\"{len(sentences)} sentences\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"American jazz pianist and composer\n",
"\n",
"\n",
"Fats WallerWaller in 1938Background informationBirth nameThomas Wright WallerBorn(1904-05-21)May 21, 1904New York City, New York, U.S.DiedDecember 15, 1943(1943-12-15) (aged 39)Kansas City, Missouri, U.S.GenresDixieland, jazz, swing, stride, ragtimeOccupation(s)Musician, composerInstrumentsPiano, vocals, organYears active19181943\n",
"Thomas Wright \"Fats\" Waller (May 21, 1904 December 15, 1943) was an American jazz pianist, organist, composer, violinist, singer, and comedic entertainer.\n",
"---\n",
"[1] His innovations in the Harlem stride style laid the groundwork for modern jazz piano.\n",
"---\n",
"His best-known compositions, \"Ain't Misbehavin'\" and \"Honeysuckle Rose\", were inducted into the Grammy Hall of Fame in 1984 and 1999.\n",
"---\n",
"[2] Waller copyrighted over 400 songs, many of them co-written with his closest collaborator, Andy Razaf.\n",
"---\n",
"Razaf described his partner as \"the soul of melody... a man who made the piano sing... both big in body and in mind... known for his generosity... a bubbling bundle of joy\".\n",
"---\n",
"It's possible he composed many more popular songs and sold them to other performers when times were tough.\n",
"---\n",
"Waller started playing the piano at the age of six, and became a professional organist aged 15.\n",
"---\n",
"By the age of 18 he was a recording artist.\n",
"---\n",
"Waller's first recordings, \"Muscle Shoals Blues\" and \"Birmingham Blues\", were made in October 1922 for Okeh Records.\n",
"---\n",
"[3] That year, he also made his first player piano roll, \"Got to Cool My Doggies Now\".\n",
"---\n"
]
}
],
"source": [
"for s in sentences[:10]:\n",
"    print (s)\n",
"    print (\"---\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"I then save the sentences, one sentence per line, to a file."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"with open (\"waller_sentences.txt\", \"w\") as f:\n",
"    for s in sentences:\n",
"        print (s, file=f)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"I then edited this file by hand, removing non-sentences (text from tables of information and things like headers and images). I then saved the resulting file \"[waller_sentences_edited.txt](waller_sentences_edited.txt)\" so as not to lose my edits."
]
},
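{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Part of this hand-cleaning could be automated by skipping certain elements while extracting the text. The sketch below is not the code used for the bot: it swaps in the standard library's `html.parser` for html5lib, and the `SKIP` set and the `SentenceText` class are assumptions about which markup to leave out.\n",
"\n",
"```python\n",
"from html.parser import HTMLParser\n",
"\n",
"# Assumption: text inside these elements is not sentence content\n",
"SKIP = {'table', 'figure', 'style', 'sup'}\n",
"\n",
"\n",
"class SentenceText(HTMLParser):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        self.depth = 0   # greater than 0 while inside a skipped element\n",
"        self.parts = []  # collected text fragments\n",
"\n",
"    def handle_starttag(self, tag, attrs):\n",
"        if tag in SKIP:\n",
"            self.depth += 1\n",
"\n",
"    def handle_endtag(self, tag):\n",
"        if tag in SKIP and self.depth > 0:\n",
"            self.depth -= 1\n",
"\n",
"    def handle_data(self, data):\n",
"        # Keep text only when we are outside every skipped element\n",
"        if self.depth == 0:\n",
"            self.parts.append(data)\n",
"\n",
"\n",
"sample = '<table><tr><td>Born 1904</td></tr></table><p>Waller was a pianist.<sup>[1]</sup></p>'\n",
"parser = SentenceText()\n",
"parser.feed(sample)\n",
"parser.close()\n",
"print(''.join(parser.parts))\n",
"```\n",
"\n",
"Skipping `sup` also drops footnote markers such as `[1]`, which otherwise end up at the start of the following sentence."
]
},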
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now, to do the rewriting. When performing the bot myself, I replaced certain words that spoke of Waller in the *third person* with the equivalent in the *first person*. The file contains sentences of the form:\n",
"> Waller was an American jazz pianist, organist, composer, violinist, singer, and comedic entertainer.\n",
">\n",
"> His innovations in the Harlem stride style laid the groundwork for modern jazz piano.\n",
">\n",
"> His best-known compositions, \"Ain't Misbehavin'\" and \"Honeysuckle Rose\", were inducted into the Grammy Hall of Fame in 1984 and 1999.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"One option is to just use Python's `str.replace` method:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Waller was an American jazz pianist, organist, composer, violinist, singer, and comedic entertainer.\n",
"His innovations in tI Harlem stride style laid tI groundwork for modern jazz piano.\n",
"His best-known compositions, \"Ain't Misbehavin'\" and \"Honeysuckle Rose\", were inducted into tI Grammy Hall of Fame in 1984 and 1999.\n",
"Waller copyrighted over 400 songs, many of tIm co-written with his closest collaborator, Andy Razaf.\n",
"It's possible I composed many more popular songs and sold tIm to otIr performers wIn times were tough.\n",
"Waller started playing tI piano at tI age of six, and became a professional organist aged 15.\n",
"By tI age of 18 I was a recording artist.\n",
"Waller's first recordings, \"Muscle Shoals Blues\" and \"Birmingham Blues\", were made in October 1922 for Okeh Records.\n",
"That year, I also made his first player piano roll, \"Got to Cool My Doggies Now\".\n",
"Waller's first publisId composition, \"Squeeze Me\", was publisId in 1924.\n",
"He became one of tI most popular performers of his era, touring intern\n"
]
}
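],
"source": [
"text = open(\"waller_sentences_edited.txt\").read()\n",
"# Naive replacement: \"he\" also matches inside words such as \"the\" and \"them\"\n",
"text = text.replace(\"he\", \"I\")\n",
"# text = text.replace(\"Waller\", \"I\")\n",
"# text = text.replace(\"His\", \"My\")\n",
"# text = text.replace(\"his\", \"my\")\n",
"# text = text.replace(\" he \", \" I \")  # variant that includes the spaces\n",
"print (text[:1000])"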
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"With this simple approach, you see glitches appearing, with searches for \"he\" matching *inside* words like \"the\" and \"them\", which become \"tI\" and \"tIm\". You could strategically change the order of substitution and/or think of including spaces in the search and replace. But regular expressions offer another solution, with the use of \"word boundary\" anchors (`\\b`)."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I was an American jazz pianist, organist, composer, violinist, singer, and comedic entertainer.\n",
"My innovations in the Harlem stride style laid the groundwork for modern jazz piano.\n",
"My best-known compositions, \"Ain't Misbehavin'\" and \"Honeysuckle Rose\", were inducted into the Grammy Hall of Fame in 1984 and 1999.\n",
"I copyrighted over 400 songs, many of them co-written with my closest collaborator, Andy Razaf.\n",
"It's possible I composed many more popular songs and sold them to other performers when times were tough.\n",
"I started playing the piano at the age of six, and became a professional organist aged 15.\n",
"By the age of 18 I was a recording artist.\n",
"My first recordings, \"Muscle Shoals Blues\" and \"Birmingham Blues\", were made in October 1922 for Okeh Records.\n",
"That year, I also made my first player piano roll, \"Got to Cool My Doggies Now\".\n",
"My first published composition, \"Squeeze Me\", was published in 1924.\n",
"I became one of the most popular performers of my era, touring internationally and achieving critical and commercial success in the United States and Europe.\n",
"I died from pneumonia, aged 39.\n",
"I was the seventh child of 11 (five of whom survived childhood) born to Adeline Locket I, a musician, and Reverend Edward Martin I, a trucker and pastor in New York City.\n",
"I started playing the piano when I was six and graduated to playing the organ at my father's church four years later.\n",
"My mother instructed me in my youth, and I attended other music lessons, paying for them by working in a grocery store.\n",
"I attended DeWitt Clinton High School for one semester, but left school at 15 to work as an organist at the Lincoln Theater in Harlem, where I earned $32 a week.\n",
"Within 12 months I had composed my first rag.\n",
"I was the prize pupil and later the friend and colleague of the stride pianist James P. Johnson.\n",
"My mother died on November 10, 1920 from a stroke due to diabetes.\n",
"My first recordings, \"Muscle Shoals Blues\" and \"Birmingham Blues\", were made in October 1922 for Okeh Records.\n",
"That ye\n"
]
}
],
"source": [
"import re\n",
"\n",
"text = open(\"waller_sentences_edited.txt\").read()\n",
"text = re.sub(r\"\\bWaller's\\b\", \"My\", text)\n",
"text = re.sub(r\"\\bWaller\\b\", \"I\", text)\n",
"text = re.sub(r\"\\bHis\\b\", \"My\", text)\n",
"text = re.sub(r\"\\bhis\\b\", \"my\", text)\n",
"text = re.sub(r\"\\bhe\\b\", \"I\", text)\n",
"text = re.sub(r\"\\bHe\\b\", \"I\", text)\n",
"text = re.sub(r\"\\bhim\\b\", \"me\", text)\n",
"\n",
"print (text[:2000])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"There are still some glitches related to grammar that go beyond the limits of simple word replacement (such as when to use me or I). But it's good enough to begin, so I save the output in [another file](waller_sentences_first_person.txt)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"with open(\"waller_sentences_first_person.txt\", \"w\") as f:\n",
"    print (text, file=f)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Responding like a search engine -- Whoosh to the rescue, Step 1: Create an index\n",
"\n",
"Now, to make a chat bot based on these sentences! I could roll my own matching algorithm, attempting to find useful overlapping terms between a chat message and the sentences. In many ways an \"infobot\"-style bot is a precursor of the kind of search engine represented by Altavista, Ask Jeeves, and finally Google.\n",
"\n",
"Rather than roll my own, I chose to make use of [whoosh](https://pypi.org/project/Whoosh/), a pure Python library designed specifically to support search-engine-style applications. It also provides some more refined abstractions that reflect best practices from information retrieval and indexing. For instance, I make use of the \"StemmingAnalyzer\" to compare the \"roots\" of words (rather than their exact forms) and a \"stop word\" filter to help avoid matches based on common question words that might occur in chat messages, but which we don't want to use in searching for a matching response."
]
},
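{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"For comparison, the roll-your-own route mentioned above might look something like the following: split the message and each sentence into words, drop stop words, and pick the sentence with the largest overlap. This is a hypothetical sketch with a tiny stop list and no stemming, which is exactly the kind of detail whoosh takes care of.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical, very short stop list\n",
"STOP = {'a', 'about', 'and', 'at', 'i', 'me', 'my', 'of', 'tell', 'the', 'to', 'was', 'your'}\n",
"\n",
"\n",
"def terms(text):\n",
"    # Lowercased words minus stop words (no stemming)\n",
"    return set(re.findall(r'[a-z]+', text.lower())) - STOP\n",
"\n",
"\n",
"def best_match(message, sentences):\n",
"    # Return the sentence sharing the most terms with the message, or None\n",
"    wanted = terms(message)\n",
"    scored = [(len(wanted & terms(s)), s) for s in sentences]\n",
"    score, sentence = max(scored, key=lambda pair: pair[0])\n",
"    return sentence if score > 0 else None\n",
"\n",
"\n",
"sentences = [\n",
"    'By the age of 18 I was a recording artist.',\n",
"    'My mother died on November 10, 1920 from a stroke due to diabetes.',\n",
"]\n",
"print(best_match('Tell me about your mother', sentences))\n",
"```\n",
"\n",
"Without stemming, a question about \"recordings\" would not match a sentence containing \"recording\", which is one reason to reach for an analyzer instead."
]
},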
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"\n",
"from whoosh.index import create_in, open_dir\n",
"from whoosh.fields import Schema, TEXT, KEYWORD, ID\n",
"from whoosh.analysis import StemmingAnalyzer, StopFilter\n",
"from whoosh import qparser\n",
"from whoosh.highlight import WholeFragmenter, UppercaseFormatter\n",
"\n",
"\n",
"indexdir = \"index\"\n",
"# Extend whoosh's default stop words with common chat and question words\n",
"s = StopFilter()\n",
"stop_words = set(s.stops) | set([\"more\", \"which\", \"get\", \"did\", \"each\", \"that\", \"were\", \"about\", \"tell\", \"my\", \"his\", \"her\", \"after\", \"been\", \"me\", \"i\", \"wa\", \"you\", \"have\", \"there\", \"where\", \"what\", \"why\", \"how\"])\n",
"custom_ana = StemmingAnalyzer(stoplist=stop_words)\n",
"schema = Schema(\n",
"    text=TEXT(stored=True, analyzer=custom_ana),\n",
"    years=KEYWORD(stored=True),\n",
"    source=ID(stored=True)\n",
")\n",
"\n",
"os.makedirs(indexdir, exist_ok=True)\n",
"ix = create_in(indexdir, schema)\n",
"writer = ix.writer()\n",
"\n",
"# Index one document per non-empty line (sentence)\n",
"with open(\"waller_sentences_first_person.txt\") as f:\n",
"    for line in f:\n",
"        line = line.strip()\n",
"        if line:\n",
"            # extract years\n",
"            years = \" \".join(re.findall(r\"\\b\\d{4}\\b\", line))\n",
"            writer.add_document(text=line, years=years, source=line)\n",
"writer.commit()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Step 2: Query the index\n",
"\n",
"Another useful ready-made feature of whoosh is the ability to parse a free-text query, which we then use to search our index. To make the logic of the search more visible, we use a \"highlighter\" to show which words were matched in (IRC-friendly) uppercase."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
" father\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"I started playing the piano when I was six and graduated to playing the organ at my FATHER's church four years later.\n"
]
}
],
"source": [
"from random import choice\n",
"ix = open_dir(indexdir)\n",
"parser = qparser.QueryParser(\"text\", schema=ix.schema, group=qparser.OrGroup)\n",
"with ix.searcher() as searcher:\n",
"    line = input()\n",
"    # strip a trailing question mark before parsing the query\n",
"    line = line.rstrip().rstrip(\"?\")\n",
"    query = parser.parse(line)\n",
"    results = searcher.search(query, terms=True)\n",
"    results.fragmenter = WholeFragmenter()\n",
"    results.formatter = UppercaseFormatter()\n",
"    # could eventually use results[x].score\n",
"    if len(results) > 0:\n",
"        results = list(results)\n",
"        r = choice(results)\n",
"        print (r.highlights(\"text\"))\n",
"\n",
"        # print (r.get(\"text\"))\n",
"        # print (r.matched_terms())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"> Q: Tell me about your father.\n",
">\n",
"> A: I started playing the piano when I was six and graduated to playing the organ at my FATHER's church four years later.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Now in IRC Bot form\n",
"\n",
"Now we place the above code in the body of an IRC bot. This code uses the [irc](https://pypi.org/project/irc/) module, and specifically extends the class [SingleServerIRCBot](https://python-irc.readthedocs.io/en/latest/irc.html#irc.bot.SingleServerIRCBot). NB: This code should be saved in its [own file](botswaller.py); it is pasted here for convenience, but the use of argparse in a notebook produces an error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"import irc.bot\n",
"from random import choice\n",
"import whoosh\n",
"from whoosh import qparser\n",
"import whoosh.index\n",
"import whoosh.highlight\n",
"\n",
"\n",
"class BotsWaller (irc.bot.SingleServerIRCBot):\n",
"    def __init__(self, indexdir, channel, nickname, server, port=6667):\n",
"        irc.bot.SingleServerIRCBot.__init__(self, [(server, port)], nickname, nickname)\n",
"        self.channel = channel\n",
"        self.indexdir = indexdir\n",
"        self.ix = whoosh.index.open_dir(self.indexdir)\n",
"        self.parser = whoosh.qparser.QueryParser(\"text\", schema=self.ix.schema, group=qparser.OrGroup)\n",
"\n",
"    def on_welcome(self, c, e):\n",
"        # join the channel once the server has accepted the connection\n",
"        c.join(self.channel)\n",
"        print (\"join\")\n",
"\n",
"    def on_privmsg(self, c, e):\n",
"        # private messages are ignored\n",
"        pass\n",
"\n",
"    def on_pubmsg(self, c, e):\n",
"        # e.arguments[0] holds the text of the public message\n",
"        msg = e.arguments[0]\n",
"        with self.ix.searcher() as searcher:\n",
"            query = self.parser.parse(msg)\n",
"            results = searcher.search(query, terms=True)\n",
"            results.fragmenter = whoosh.highlight.WholeFragmenter()\n",
"            results.formatter = whoosh.highlight.UppercaseFormatter()\n",
"            # could eventually use results[x].score as \"confidence\" to respond\n",
"            if len(results) > 0:\n",
"                results = list(results)\n",
"                r = choice(results)\n",
"                c.privmsg(self.channel, r.highlights(\"text\"))\n",
"\n",
"if __name__ == \"__main__\":\n",
"    import argparse\n",
"\n",
"    parser = argparse.ArgumentParser(description='Fats Waller Wikipedia Bot')\n",
"    parser.add_argument('--index', default='index', help='path to whoosh index')\n",
"    parser.add_argument('--server', default='irc.freenode.net', help='server hostname')\n",
"    parser.add_argument('--port', default=6667, type=int, help='server port')\n",
"    parser.add_argument('--channel', default='#botopera', help='channel to join')\n",
"    parser.add_argument('--nickname', default='BOTSwaller', help='bot nickname')\n",
"\n",
"    args = parser.parse_args()\n",
"    bot = BotsWaller(args.index, args.channel, args.nickname, args.server, args.port)\n",
"    bot.start()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Steps to developing a custom bot\n",
"\n",
"* **Perform the bot(s) speculatively**: Open a new IRC channel, invite some participants and play the role of your *bots* yourselves. You can eventually open multiple windows to play both \"human\" roles and the roles of your bots.\n",
"* **Make use of an existing bot**: For instance Kevin Lenzo's classic [InfoBot](http://www.infobot.org/) implements a sort of mini-language for creating bots. It might be worth experimenting with what results you can get using an already coded bot such as this one.\n",
"* **Explore the histories of algorithms, tools, and techniques**\n",
"* **Translate your speculations and experiences from the previous steps into your own bot**: Make use of IRC libraries like [irc](https://pypi.org/project/irc/) for Python."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Perform the bot(s) speculatively\n",
"\n",
"Open a new IRC channel, invite some participants and play the role of your *bots* yourselves. You can eventually open multiple windows to play both \"human\" roles and the roles of your bots.\n",
"\n",
"Consider exploring artistic traditions of creating \"rule-based\" games, programs that in effect can be implemented by people following a fixed set of rules."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Example: [Oulipo: N+7](https://poets.org/text/brief-guide-oulipo)\n",
"> One of the most popular OULIPO formulas is \"N+7,\" in which the writer takes a poem already in existence and substitutes each of the poem's substantive nouns with the noun appearing seven nouns away in the dictionary. Care is taken to ensure that the substitution is not just a compound derivative of the original, or shares a similar root, but a wholly different word. Results can vary widely depending on the version of the dictionary one uses."
]
},
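{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a rule that can be followed by hand, N+7 also translates readily into a few lines of code. The sketch below is a toy version under loud assumptions: the short `NOUNS` list stands in for a real dictionary, and a word counts as a noun simply because it appears in that list.\n",
"\n",
"```python\n",
"# Hypothetical, alphabetically sorted noun list standing in for a dictionary\n",
"NOUNS = ['age', 'artist', 'church', 'father', 'groundwork', 'organ', 'piano',\n",
"         'pupil', 'rag', 'school', 'semester', 'store', 'stroke', 'style',\n",
"         'theater', 'week', 'year', 'youth']\n",
"\n",
"\n",
"def n_plus(word, n=7):\n",
"    # Replace a known noun by the noun n places further on, wrapping around\n",
"    if word not in NOUNS:\n",
"        return word\n",
"    return NOUNS[(NOUNS.index(word) + n) % len(NOUNS)]\n",
"\n",
"\n",
"line = 'I started playing the piano at the age of six'\n",
"print(' '.join(n_plus(w) for w in line.split()))\n",
"```\n",
"\n",
"As the quotation notes, the results depend heavily on the dictionary: swapping in a different noun list changes every substitution."
]
},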
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Examples: Fluxus George Brecht: [Water yam](https://en.wikipedia.org/wiki/Water_Yam_(artist%27s_book))\n",
"![](fluxus_brecht_water_yam.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Think about sources\n",
"\n",
"Often rules, and algorithms, work by transforming existing data. In the case of the BOTSwaller bot, the source was the sentences of the biographical Wikipedia entry."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"Artist and de Kooning instructor [Robin van 't Haar](https://www.cbkrotterdam.nl/2019/03/29/in-memoriam-robin-van-t-haar-1974-2019/) used the city as input, exploring through a photographic practice the ways that the [city \"scripts\"](https://web.archive.org/web/20200805155927/https://cityscripts.com/) its users.\n",
"\n",
"![](vanthaar_zebraanimatie1.gif) ![](vanthaar_camera.jpg) ![](vanthaar_publicaties.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"![](vanthaar_aldi1.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"![](vanthaar_aldi2.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"![](vanthaar_publication_files/easycity1.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"![](vanthaar_publication_files/easycity2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"![](vanthaar_publication_files/easycity3.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Robin van 't Haar: Cityscripts\n",
"![](vanthaar_publication_files/easycity4.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Think about algorithms, tools, techniques and their histories\n",
"\n",
"In addition to the artistic traditions and their techniques, isolate and explore other algorithms, tools, and techniques that may be useful to you. These may come from any number of disciplines, scientific or otherwise. Avoid thinking of these tools as *universal* and *timeless*; instead, explore their histories and the relationship between algorithms as ideas and as implementations.\n",
"\n",
"In the case of BOTSwaller, useful tools were [sentence tokenization](https://www.researchgate.net/publication/220355311_Unsupervised_Multilingual_Sentence_Boundary_Detection) and *part-of-speech tagging* as implemented in [nltk](http://nltk.org/), and word stemming, search indexing, and querying as implemented in [whoosh](https://www.youtube.com/watch?v=gRvZbYtwTeo)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Kiss & Strunk (punkt)\n",
"\n",
"![](kiss_strunk.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![](kiss_strunk_corpora2.png)\n",
"\n",
"![](kiss_strunk_corpora.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Whoosh: Inverted Index for Help Systems\n",
"\n",
"![](whoosh_inverted_index.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![](whoosh_searching.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## TODO\n",
"* Develop a \"researchbot\" to aid in your research\n",
"* Install / set up a local IRC server on the sandbox, with...\n",
"* A custom kiwi install; kiwi has [download packages](https://kiwiirc.com/downloads/) to install all the necessary files for a Kiwi client on your own server.\n",
"* Run jupyter notebook locally on your laptop -- and try *this notebook* interactively"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -0,0 +1,50 @@
import argparse
from random import choice

import irc.bot
import whoosh.highlight  # imported explicitly; used for the fragmenter and formatter below
import whoosh.index
from whoosh import qparser


class BotsWaller(irc.bot.SingleServerIRCBot):
    def __init__(self, indexdir, channel, nickname, server, port=6667):
        irc.bot.SingleServerIRCBot.__init__(self, [(server, port)], nickname, nickname)
        self.channel = channel
        self.indexdir = indexdir
        # Open an existing whoosh index; it must have been built beforehand
        self.ix = whoosh.index.open_dir(self.indexdir)
        # OrGroup: a document matches if it contains ANY word of the query
        self.parser = qparser.QueryParser("text", schema=self.ix.schema, group=qparser.OrGroup)

    def on_welcome(self, c, e):
        # Join the channel once the server has accepted the connection
        c.join(self.channel)
        print("join")

    def on_privmsg(self, c, e):
        # Private messages are ignored
        pass

    def on_pubmsg(self, c, e):
        # print(e.arguments, e.target, e.source, e.type)
        msg = e.arguments[0]
        with self.ix.searcher() as searcher:
            # Use the channel message itself as the search query
            query = self.parser.parse(msg)
            results = searcher.search(query, terms=True)
            # Return the whole stored text, with matched terms in UPPERCASE
            results.fragmenter = whoosh.highlight.WholeFragmenter()
            results.formatter = whoosh.highlight.UppercaseFormatter()
            # could eventually use results[x].score as "confidence" to respond
            if len(results) > 0:
                results = list(results)
                r = choice(results)
                c.privmsg(self.channel, r.highlights("text"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Fats Waller Wikipedia Bot')
    parser.add_argument('--index', default='index', help='path to whoosh index')
    parser.add_argument('--server', default='irc.freenode.net', help='server hostname')
    parser.add_argument('--port', default=6667, type=int, help='server port')
    parser.add_argument('--channel', default='#botopera', help='channel to join')
    parser.add_argument('--nickname', default='BOTSwaller', help='bot nickname')
    args = parser.parse_args()
    bot = BotsWaller(args.index, args.channel, args.nickname, args.server, args.port)
    bot.start()
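
The bot above hands the actual lookup to whoosh. As a rough illustration of the underlying idea only, the following plain-Python sketch builds a tiny inverted index over sentences and answers a message with a random sentence that shares at least one word with it, which is roughly what an OR-grouped query does. Everything here is an assumption for illustration: the `SENTENCES` list and the functions `tokenize`, `build_index`, `search_any`, and `respond` are hypothetical names, not part of whoosh or of the original bot, and the sketch does no stemming, scoring, or highlighting.

```python
import random
import re
from collections import defaultdict

# Hypothetical stand-in sentences; the real bot searched an index built
# from the sentences of the Wikipedia entry.
SENTENCES = [
    "Waller was an American jazz pianist.",
    "He played the organ and the piano.",
    "His innovations laid the groundwork for modern jazz piano.",
]


def tokenize(text):
    """Lowercase the text and split it into words (no stemming)."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(sentences):
    """Inverted index: map each word to the set of sentence numbers containing it."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for word in tokenize(sentence):
            index[word].add(number)
    return index


def search_any(index, message):
    """OR query: sorted numbers of sentences sharing at least one word with the message."""
    hits = set()
    for word in tokenize(message):
        hits |= index.get(word, set())
    return sorted(hits)


def respond(index, sentences, message, rng=random):
    """Pick a random matching sentence, or None when nothing matches."""
    hits = search_any(index, message)
    if not hits:
        return None
    return sentences[rng.choice(hits)]
```

Calling `respond(build_index(SENTENCES), SENTENCES, "do you like jazz?")` would return one of the two sentences that mention "jazz". A real index such as the one whoosh maintains also stores positions and frequencies, which is what makes scoring and highlighting possible.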
