python-irc-bots/parody-bot.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \"Beatrix Botter\"\n",
    "\n",
    "[An's original code](https://gitlab.constantvzw.org/death-of-the-authors/1943/-/blob/master/bots/beatrixbotter_parody.py) used the pattern library, but it's possible to implement the same technique using just nltk. It relies on two key functions from nltk: word_tokenize and pos_tag.\n",
    "\n",
    "### The \"parody algorithm\"\n",
    "\n",
    "The essence of the \"parody algorithm\" is to transform an input text by replacing each of its words with a randomly chosen word from a \"source\" text that has the *same part of speech* according to nltk's pos_tag function. For example, consider the first two lines of Peter Rabbit as a source:\n",
    "\n",
    "    Once upon a time there were four little Rabbits, and their names were--\n",
    "    Flopsy, Mopsy, Cotton-tail, and Peter.\n",
    "\n",
    "    They lived with their Mother in a sand-bank, underneath the root of a\n",
    "    very big fir-tree.\n",
    "\n",
    "And then consider the input text to transform:\n",
    "\n",
    "    The blue pen is in the top drawer.\n",
    "\n",
    "Applying word tokenization and part of speech tagging to both texts:\n",
    "\n",
    "    Once upon a  time there were four little Rabbits, and their names were--\n",
    "    RB   IN   DT NN   EX    VBD  CD   JJ     NNP    , CC  PRP$  NNS   VBD :\n",
    "    \n",
    "    Flopsy, Mopsy, Cotton-tail, and Peter.\n",
    "    NNP   , NNP  , NNP        , CC  NNP  .\n",
    "\n",
    "    They lived with their Mother in a  sand-bank, underneath the root of a\n",
    "    PRP  VBD   IN   PRP$  NN     IN DT JJ       , IN         DT  NN   IN DT\n",
    "\n",
    "    very big fir-tree.\n",
    "    RB   JJ  NN      .\n",
    " \n",
    " and\n",
    " \n",
    "    The blue pen is  in the top drawer.\n",
    "    DT  JJ   NN  VBZ IN DT  JJ  NN    .\n",
    "\n",
    "Here's an overview of the [parts of speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) that NLTK is using.\n",
    "\n",
    "To transform the input text, we consider each word in turn, look in the source for words with the same part of speech, and replace the word with one of them. For instance, starting with \"The\", the part of speech is \"DT\" (determiner). In the source text the following words are also tagged DT: a, a, the, a. So we pick one at random: a. Next consider the word \"blue\": we search the source for all words tagged \"JJ\" (adjective): little, sand-bank, big. We pick \"little\". When we get to \"is\" (tagged VBZ), there's no match in the source, so we just keep the original word. Following these rules, we can produce a new text such as:\n",
    "\n",
    "    a little time is  upon the sand-bank Mother.\n",
    "    DT JJ     NN   VBZ IN   DT  JJ        NN    .\n",
    "   \n",
    "\n",
    "\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Doing parts of speech tagging on a text\n",
    "See: [Chapter 5: Categorizing and Tagging Words](http://www.nltk.org/book_1ed/ch05.html) in the NLTK book"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = \"\"\"The blue pen is in the top drawer.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "tt = nltk.word_tokenize(t)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "tagged = nltk.pos_tag(tt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('The', 'DT'), ('blue', 'JJ'), ('pen', 'NN'), ('is', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('top', 'JJ'), ('drawer', 'NN'), ('.', '.')]\n"
     ]
    }
   ],
   "source": [
    "print (tagged)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Counting words\n",
    "\n",
    "Recall the following code for counting words in a text. The code creates an empty dictionary called *counts* to store the count of each word. The text is stripped and split to make a list of words. The for loop then iterates over this list, assigning each item to the variable *word*. The if statement checks whether the word is in the dictionary and, when it's *not* already there, initializes its count to 0. Finally, counts[word] is incremented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"\"\"\n",
    "this is a simple sentence . and this is another sentence .\n",
    "\"\"\"\n",
    "counts = {}\n",
    "for word in text.strip().split():\n",
    "    if word not in counts:\n",
    "        counts[word] = 0\n",
    "    counts[word] += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'this': 2, 'is': 2, 'a': 1, 'simple': 1, 'sentence': 2, '.': 2, 'and': 1, 'another': 1}\n"
     ]
    }
   ],
   "source": [
    "print (counts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Create the index\n",
    "A variation on the word counting code: rather than counting each word, use the parts of speech as the keys of the dictionary, and append each word to the list stored under its tag. *NB: The code assumes there is a file named 1_peter_rabbit.txt containing your source text.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/murtaugh/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     /home/murtaugh/nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "\n",
    "nltk.download('punkt')\n",
    "nltk.download('averaged_perceptron_tagger')\n",
    "\n",
    "# read the source text, closing the file when done\n",
    "with open(\"1_peter_rabbit.txt\") as f:\n",
    "    source = f.read()\n",
    "tokens = nltk.word_tokenize(source)\n",
    "pos = nltk.pos_tag(tokens)\n",
    "index = {}\n",
    "for word, tag in pos:\n",
    "    # print (word, \"is\", tag)\n",
    "    if tag not in index:\n",
    "        index[tag] = []\n",
    "    index[tag].append(word)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['time', 'Mother', 'root', 'fir-tree', 'morning', \"'you\", 'lane', 'garden', 'accident', 'pie', 'run', 'mischief', 'basket', 'umbrella', 'wood', 'baker', 'loaf', 'bread', 'buns', 'lane', 'garden', 'gate', 'parsley', 'end', 'cucumber', 'frame', 'rake', 'thief', 'garden', 'way', 'gate', 'shoe', 'net', 'jacket', 'jacket', 'brass', 'sobs', 'excitement', 'sieve', 'top', 'time', 'jacket', 'thing', 'water', 'flower-pot', 'time', 'foot', 'window', 'window', 'work', 'breath', 'fright', 'idea', 'way', 'time', 'lippity', 'lippity', 'round', 'door', 'wall', 'room', 'rabbit', 'underneath', 'mouse', 'stone', 'doorstep', 'family', 'wood', 'way', 'gate', 'pea', 'mouth', 'head', 'way', 'garden', 'pond', 'cat', 'tip', 'tail', 'cousin', 'noise', 'hoe', 'scratch', 'scratch', 'scritch', 'nothing', 'wheelbarrow', 'thing', 'back', 'gate', 'wheelbarrow', 'walk', 'sight', 'corner', 'gate', 'wood', 'garden', 'jacket', 'fir-tree', 'sand', 'floor', 'mother', 'cooking', 'jacket', 'pair', 'fortnight', 'evening', 'mother', 'bed', 'tea', 'dose', 'milk', 'supper', 'END']\n"
     ]
    }
   ],
   "source": [
    "print (index['NN'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Transform some input using the index\n",
    "Use a *new* list to assemble the new sentence. Words whose tag is missing from the index are kept unchanged. Use str.join to produce the final text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      " Today is the first day of the rest of your life\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Today is the first day of the rest of your life\n",
      "END is a sorry thief of the sobs upon your run\n"
     ]
    }
   ],
   "source": [
    "from random import choice\n",
    "\n",
    "i = input()\n",
    "tokens = nltk.word_tokenize(i)\n",
    "pos = nltk.pos_tag(tokens)\n",
    "new = []\n",
    "for word, tag in pos:\n",
    "    # print (word,tag)\n",
    "    # replace word with a random choice from the \"hat\" of words for the tag\n",
    "    if tag not in index:\n",
    "        # print (\"no replacement\")\n",
    "        new.append(word)\n",
    "    else:\n",
    "        newword = choice(index[tag])\n",
    "        new.append(newword)\n",
    "        # print (\"replace with\", newword)\n",
    "print (i)\n",
    "print (' '.join(new))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}