{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# \"Beatrix Botter\"\n",
"\n",
"[An's original code](https://gitlab.constantvzw.org/death-of-the-authors/1943/-/blob/master/bots/beatrixbotter_parody.py) used the pattern library.. but it's possible to implement the same technique using just nltk. It relies on two key functions from nltk: word_tokenize and pos_tag.\n",
"\n",
"### The \"parody algorithm\"\n",
"\n",
"The essence of the \"parody algorithm\" is to translate an input text by replacing its words with randomly chosen words from a \"source\" text -- but which have the *same part of speech* according to nltk's pos_tag function. For example consider the first two lines of Peter Rabbit as a source:\n",
"\n",
" Once upon a time there were four little Rabbits, and their names were--\n",
" Flopsy, Mopsy, Cotton-tail, and Peter.\n",
"\n",
" They lived with their Mother in a sand-bank, underneath the root of a\n",
" very big fir-tree.\n",
"\n",
"And then consider the input text to transform:\n",
"\n",
" The blue pen is in the top drawer.\n",
"\n",
"Applying word tokenization and part of speech tagging to both texts:\n",
"\n",
" Once upon a time there were four little Rabbits, and their names were--\n",
" RB IN DT NN EX VBD CD JJ NNP , CC PRP$ NNS VBD :\n",
" \n",
" Flopsy, Mopsy, Cotton-tail, and Peter.\n",
" NNP , NNP , NNP , CC NNP .\n",
"\n",
" They lived with their Mother in a sand-bank, underneath the root of a\n",
" PRP VBD IN PRP$ NN IN DT JJ , IN DT NN IN DT\n",
"\n",
" very big fir-tree.\n",
" RB JJ NN .\n",
" \n",
" and\n",
" \n",
" The blue pen is in the top drawer.\n",
" DT JJ NN VBZ IN DT JJ NN .\n",
"\n",
"Here's an overview of the [parts of speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) that NLTK is using.\n",
"\n",
"To transform the input text, we consider each word, looking in the source for another word with the same part of speech and replace it. For instance starting with \"The\", the part of speech is \"DT\" (determiner) ... looking in the source text there are the following words also tagged DT: a, a, the, a, The, the. So we pick one at random: a. Next consider the word \"blue\", we search the input for all words tagged \"JJ\" (adjective): little, sand-bank, big. We pick \"little\". When we get to \"is\" (tagged: VBZ), there's no match in the source, so we just keep the original word. Following these rules, we can producing the new text:\n",
"\n",
" a little time is upon the sand-bank Mother.\n",
" DT JJ NN VBZ IN DT JJ NN .\n",
" \n",
"\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Doing parts of speech tagging on a text\n",
"See: [Chapter 5: Categorizing and Tagging Words](http://www.nltk.org/book_1ed/ch05.html) in the NLTK book"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import nltk"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"t = \"\"\"The blue pen is in the top drawer.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"tt = nltk.word_tokenize(t)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"tagged = nltk.pos_tag(tt)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('The', 'DT'), ('blue', 'JJ'), ('pen', 'NN'), ('is', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('top', 'JJ'), ('drawer', 'NN'), ('.', '.')]\n"
]
}
],
"source": [
"print (tagged)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Counting words\n",
"\n",
"Recall the following code for counting words in a text. The code creates an empty dictionary called *counts* to store the count of each word. The text is stripped and split to make a list. The for loop then loops over this list assigning each to the variable *word*. The if checks if the word is in the dictionary, and when it's *not* already there, initializes the count to 0. Finally count[word] is incremented."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"\n",
"this is a simple sentence . and this is another sentence .\n",
"\"\"\"\n",
"counts = {}\n",
"for word in text.strip().split():\n",
" if word not in counts:\n",
" counts[word] = 0\n",
" counts[word] += 1"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'this': 2, 'is': 2, 'a': 1, 'simple': 1, 'sentence': 2, '.': 2, 'and': 1, 'another': 1}\n"
]
}
],
"source": [
"print (counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Create the index\n",
"A variation on the word counting code, rather than counting each word, use the parts of speech as the key values of the dictionary, and append each word tagged with that tag on a list. *NB: The code assumes there is a file named source.txt with the your source text.*"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/murtaugh/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n",
"[nltk_data] Downloading package averaged_perceptron_tagger to\n",
"[nltk_data] /home/murtaugh/nltk_data...\n",
"[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
"[nltk_data] date!\n"
]
}
],
"source": [
"import nltk\n",
"\n",
"nltk.download('punkt')\n",
"nltk.download('averaged_perceptron_tagger')\n",
"\n",
"source = open(\"1_peter_rabbit.txt\").read()\n",
"tokens = nltk.word_tokenize(source)\n",
"pos = nltk.pos_tag(tokens)\n",
"index = {}\n",
"for word, tag in pos:\n",
" # print (word, \"is\", tag)\n",
" if tag not in index:\n",
" index[tag] = []\n",
" index[tag].append(word)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['time', 'Mother', 'root', 'fir-tree', 'morning', \"'you\", 'lane', 'garden', 'accident', 'pie', 'run', 'mischief', 'basket', 'umbrella', 'wood', 'baker', 'loaf', 'bread', 'buns', 'lane', 'garden', 'gate', 'parsley', 'end', 'cucumber', 'frame', 'rake', 'thief', 'garden', 'way', 'gate', 'shoe', 'net', 'jacket', 'jacket', 'brass', 'sobs', 'excitement', 'sieve', 'top', 'time', 'jacket', 'thing', 'water', 'flower-pot', 'time', 'foot', 'window', 'window', 'work', 'breath', 'fright', 'idea', 'way', 'time', 'lippity', 'lippity', 'round', 'door', 'wall', 'room', 'rabbit', 'underneath', 'mouse', 'stone', 'doorstep', 'family', 'wood', 'way', 'gate', 'pea', 'mouth', 'head', 'way', 'garden', 'pond', 'cat', 'tip', 'tail', 'cousin', 'noise', 'hoe', 'scratch', 'scratch', 'scritch', 'nothing', 'wheelbarrow', 'thing', 'back', 'gate', 'wheelbarrow', 'walk', 'sight', 'corner', 'gate', 'wood', 'garden', 'jacket', 'fir-tree', 'sand', 'floor', 'mother', 'cooking', 'jacket', 'pair', 'fortnight', 'evening', 'mother', 'bed', 'tea', 'dose', 'milk', 'supper', 'END']\n"
]
}
],
"source": [
"print (index['NN'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Transform some input using the index\n",
"Use a *new* list to assemble a new sentence. Use string.join to produce the final text."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
" Today is the first day of the rest of your life\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Today is the first day of the rest of your life\n",
"END is a sorry thief of the sobs upon your run\n"
]
}
],
"source": [
"from random import choice\n",
"\n",
"i = input()\n",
"tokens = nltk.word_tokenize(i)\n",
"pos = nltk.pos_tag(tokens)\n",
"new = []\n",
"for word, tag in pos:\n",
" # print (word,tag)\n",
" # replace word with a random choice from the \"hat\" of words for the tag\n",
" if tag not in index:\n",
" # print (\"no replacement\")\n",
" new.append(word)\n",
" else:\n",
" newword = choice(index[tag])\n",
" new.append(newword)\n",
" # print (\"replace with\", newword)\n",
"print (i)\n",
"print (' '.join(new))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}