"Beatrix Botter"¶

An's original code used the pattern library, but it's possible to implement the same technique using just nltk. It relies on two key functions from nltk: word_tokenize and pos_tag.

The "parody algorithm"¶

The essence of the "parody algorithm" is to transform an input text by replacing each of its words with a randomly chosen word from a "source" text that has the same part of speech according to nltk's pos_tag function. For example, consider the first two lines of Peter Rabbit as a source:

Once upon a  time there were four little Rabbits, and their names were--
Flopsy, Mopsy, Cotton-tail, and Peter.

They lived with their Mother in a  sand-bank, underneath the root of a
very big fir-tree.

And then consider the input text to transform:

The blue pen is in the top drawer.

Applying word tokenization and part of speech tagging to both texts:

Once upon a  time there were four little Rabbits, and their names were--
RB   IN   DT NN   EX    VBD  CD   JJ     NNP    , CC  PRP$  NNS   VBD :

Flopsy, Mopsy, Cotton-tail, and Peter.
NNP   , NNP  , NNP        , CC  NNP  .

They lived with their Mother in a  sand-bank, underneath the root of a
PRP  VBD   IN   PRP$  NN     IN DT JJ       , IN         DT  NN   IN DT

very big fir-tree.
RB   JJ  NN      .

and

The blue pen is  in the top drawer.
DT  JJ   NN  VBZ IN DT  JJ  NN    .

Here's an overview of the parts of speech tags that NLTK is using.
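As a quick reference for the tags that appear in the examples on this page, here is a small hand-written glossary. The descriptions follow the Penn Treebank tag set that pos_tag uses; the names TAG_GLOSSARY and describe are made up for this sketch and are not part of nltk. (nltk can also print tag descriptions itself with nltk.help.upenn_tagset, which needs an extra data download.)

```python
# Hand-written glossary of the Penn Treebank tags used in the examples.
# TAG_GLOSSARY and describe() are illustrative names, not nltk API.
TAG_GLOSSARY = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential 'there'",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "VBZ": "verb, 3rd person singular present",
}

def describe(tag):
    """Return a short description of a tag, or a fallback if it is not listed."""
    return TAG_GLOSSARY.get(tag, "(not in this small glossary)")

for tag in ["DT", "JJ", "NN", "VBZ", "IN"]:
    print(tag, "=", describe(tag))
```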

To transform the input text, we consider each word in turn, look in the source for words with the same part of speech, and replace the input word with one of them. For instance, starting with "The", the part of speech is "DT" (determiner). In the source text, the following words are also tagged DT: a, a, the, a. So we pick one at random: a. Next, consider the word "blue": we search the source for all words tagged "JJ" (adjective): little, sand-bank, big. We pick "little". When we get to "is" (tagged VBZ), there's no match in the source, so we just keep the original word. Following these rules, we can produce the new text:

a little time is  upon the sand-bank Mother.
DT JJ     NN   VBZ IN   DT  JJ        NN    .
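The walkthrough above can be sketched in plain Python. To keep the sketch runnable without nltk or its data files, the (word, tag) pairs are copied by hand from the tagged lines shown earlier; the function names build_index and parody are made up for this example. Because the replacement words are chosen at random, the output changes from run to run.

```python
import random

# (word, tag) pairs copied by hand from the tagged Peter Rabbit lines above.
source_tagged = [
    ("Once", "RB"), ("upon", "IN"), ("a", "DT"), ("time", "NN"),
    ("there", "EX"), ("were", "VBD"), ("four", "CD"), ("little", "JJ"),
    ("Rabbits", "NNP"), (",", ","), ("and", "CC"), ("their", "PRP$"),
    ("names", "NNS"), ("were", "VBD"), ("--", ":"),
    ("Flopsy", "NNP"), (",", ","), ("Mopsy", "NNP"), (",", ","),
    ("Cotton-tail", "NNP"), (",", ","), ("and", "CC"), ("Peter", "NNP"),
    (".", "."),
    ("They", "PRP"), ("lived", "VBD"), ("with", "IN"), ("their", "PRP$"),
    ("Mother", "NN"), ("in", "IN"), ("a", "DT"), ("sand-bank", "JJ"),
    (",", ","), ("underneath", "IN"), ("the", "DT"), ("root", "NN"),
    ("of", "IN"), ("a", "DT"),
    ("very", "RB"), ("big", "JJ"), ("fir-tree", "NN"), (".", "."),
]

# The tagged input sentence, also copied from above.
input_tagged = [
    ("The", "DT"), ("blue", "JJ"), ("pen", "NN"), ("is", "VBZ"),
    ("in", "IN"), ("the", "DT"), ("top", "JJ"), ("drawer", "NN"), (".", "."),
]

def build_index(tagged):
    """Map each tag to the list of source words carrying that tag."""
    index = {}
    for word, tag in tagged:
        index.setdefault(tag, []).append(word)
    return index

def parody(tagged, index, rng=random):
    """Swap each word for a random source word with the same tag.

    Words whose tag never occurs in the source are kept unchanged.
    """
    out = []
    for word, tag in tagged:
        candidates = index.get(tag)
        out.append(rng.choice(candidates) if candidates else word)
    return out

index = build_index(source_tagged)
print(index["DT"])   # the "hat" of determiners to draw from
print(" ".join(parody(input_tagged, index)))
```

Passing a seeded random.Random instance as rng makes a run repeatable, which can help when debugging.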

Doing parts of speech tagging on a text¶

See: Chapter 5: Categorizing and Tagging Words in the NLTK book

In [4]:

import nltk

In [17]:

t = """The blue pen is in the top drawer."""

In [18]:

tt = nltk.word_tokenize(t)

In [19]:

tagged = nltk.pos_tag(tt)

In [20]:

print (tagged)

[('The', 'DT'), ('blue', 'JJ'), ('pen', 'NN'), ('is', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('top', 'JJ'), ('drawer', 'NN'), ('.', '.')]
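The result of pos_tag is an ordinary list of (word, tag) tuples, so it can be picked apart with normal Python. A small sketch, using the printed list above copied in by hand so that it runs without nltk:

```python
# The tagged sentence from the output above, pasted in as a plain list.
tagged = [('The', 'DT'), ('blue', 'JJ'), ('pen', 'NN'), ('is', 'VBZ'),
          ('in', 'IN'), ('the', 'DT'), ('top', 'JJ'), ('drawer', 'NN'),
          ('.', '.')]

# Unpack each tuple into word and tag, keeping only the nouns.
nouns = [word for word, tag in tagged if tag == 'NN']
print(nouns)           # ['pen', 'drawer']

# Or keep just the tags, to see the "shape" of the sentence.
tags = [tag for word, tag in tagged]
print(' '.join(tags))  # DT JJ NN VBZ IN DT JJ NN .
```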

Counting words¶

Recall the following code for counting words in a text. The code creates an empty dictionary called counts to store the count of each word. The text is stripped and split to make a list of words. The for loop then loops over this list, assigning each item in turn to the variable word. The if checks whether the word is already in the dictionary and, when it's not, initializes its count to 0. Finally, counts[word] is incremented.

In [26]:

text = """
this is a simple sentence . and this is another sentence .
"""
counts = {}
for word in text.strip().split():
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

In [27]:

print (counts)

{'this': 2, 'is': 2, 'a': 1, 'simple': 1, 'sentence': 2, '.': 2, 'and': 1, 'another': 1}
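For comparison, the standard library's collections.Counter does the same bookkeeping in one step. This is only an alternative spelling of the loop above; the explicit version is the one that gets adapted in the next step.

```python
from collections import Counter

text = """
this is a simple sentence . and this is another sentence .
"""

# Counter does the "if not there, start at 0, then add 1" dance for us.
counts = Counter(text.strip().split())

print(counts["this"])      # 2
print(counts["another"])   # 1
print(counts["missing"])   # 0 (unknown words count as zero, no KeyError)
```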

Step 1: Create the index¶

This is a variation on the word counting code: rather than counting each word, use the part-of-speech tags as the keys of the dictionary, and append each word to the list stored under its tag. NB: The code assumes there is a file named 1_peter_rabbit.txt containing your source text.

In [3]:

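import nltk

# Fetch the tokenizer and tagger data (skipped when already up to date)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Read the source text, closing the file when done
with open("1_peter_rabbit.txt") as f:
    source = f.read()
tokens = nltk.word_tokenize(source)
pos = nltk.pos_tag(tokens)

# Build the index: each tag maps to the list of source words with that tag
index = {}
for word, tag in pos:
    # print (word, "is", tag)
    if tag not in index:
        index[tag] = []
    index[tag].append(word)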

[nltk_data] Downloading package punkt to /home/murtaugh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/murtaugh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

In [4]:

print (index['NN'])

['time', 'Mother', 'root', 'fir-tree', 'morning', "'you", 'lane', 'garden', 'accident', 'pie', 'run', 'mischief', 'basket', 'umbrella', 'wood', 'baker', 'loaf', 'bread', 'buns', 'lane', 'garden', 'gate', 'parsley', 'end', 'cucumber', 'frame', 'rake', 'thief', 'garden', 'way', 'gate', 'shoe', 'net', 'jacket', 'jacket', 'brass', 'sobs', 'excitement', 'sieve', 'top', 'time', 'jacket', 'thing', 'water', 'flower-pot', 'time', 'foot', 'window', 'window', 'work', 'breath', 'fright', 'idea', 'way', 'time', 'lippity', 'lippity', 'round', 'door', 'wall', 'room', 'rabbit', 'underneath', 'mouse', 'stone', 'doorstep', 'family', 'wood', 'way', 'gate', 'pea', 'mouth', 'head', 'way', 'garden', 'pond', 'cat', 'tip', 'tail', 'cousin', 'noise', 'hoe', 'scratch', 'scratch', 'scritch', 'nothing', 'wheelbarrow', 'thing', 'back', 'gate', 'wheelbarrow', 'walk', 'sight', 'corner', 'gate', 'wood', 'garden', 'jacket', 'fir-tree', 'sand', 'floor', 'mother', 'cooking', 'jacket', 'pair', 'fortnight', 'evening', 'mother', 'bed', 'tea', 'dose', 'milk', 'supper', 'END']
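Notice that the list keeps duplicates ('jacket', 'gate' and 'garden' each appear several times). That matters later: a word that occurs more often in the source is more likely to be drawn at random. The sketch below shows the effect on a handful of hand-typed (word, tag) pairs taken from the two-line example earlier, and uses collections.defaultdict as a shorter way to build the same kind of index.

```python
from collections import Counter, defaultdict

# A few (word, tag) pairs typed in by hand, so this runs without nltk.
pairs = [("a", "DT"), ("time", "NN"), ("Mother", "NN"), ("a", "DT"),
         ("the", "DT"), ("root", "NN"), ("a", "DT"), ("fir-tree", "NN")]

# defaultdict(list) creates the empty list the first time a tag is seen.
index = defaultdict(list)
for word, tag in pairs:
    index[tag].append(word)

# Duplicates act as weights: how often would each word be picked?
for tag, words in index.items():
    total = len(words)
    for word, n in Counter(words).items():
        print(tag, word, f"{n}/{total}")
```

A test like `tag not in index` still behaves the same with a defaultdict, since `in` does not create missing keys.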

Step 2: Transform some input using the index¶

Use a new list to assemble the new sentence, then use str.join to produce the final text.

In [12]:

from random import choice

i = input()
tokens = nltk.word_tokenize(i)
pos = nltk.pos_tag(tokens)
new = []
for word, tag in pos:
    # print (word,tag)
    # replace word with a random choice from the "hat" of words for the tag
    if tag not in index:
        # print ("no replacement")
        new.append(word)
    else:
        newword = choice(index[tag])
        new.append(newword)
        # print ("replace with", newword)
print (i)
print (' '.join(new))

Today is the first day of the rest of your life
END is a sorry thief of the sobs upon your run
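One side effect of ' '.join is that punctuation tokens get a space in front of them ("Mother ." rather than "Mother."). A minimal sketch of one way to tidy that up, assuming only a few common closing punctuation marks need handling; join_tokens and NO_SPACE_BEFORE are names made up for this example. (nltk also ships a TreebankWordDetokenizer in nltk.tokenize.treebank that could be used instead.)

```python
# Punctuation tokens that should attach to the word before them.
# This short set is an assumption; extend it as needed.
NO_SPACE_BEFORE = {".", ",", "!", "?", ";", ":"}

def join_tokens(tokens):
    """Join tokens with spaces, except before closing punctuation."""
    text = ""
    for tok in tokens:
        if text and tok not in NO_SPACE_BEFORE:
            text += " "
        text += tok
    return text

new = ["a", "little", "time", "is", "upon", "the", "sand-bank", "Mother", "."]
print(' '.join(new))     # a little time is upon the sand-bank Mother .
print(join_tokens(new))  # a little time is upon the sand-bank Mother.
```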
