XPUB

S13-Words-for-the-Future-notebooks


https://www.nltk.org/book/

https://www.nltk.org/book/ch00.html#natural-language-toolkit-nltk

In [ ]:

import nltk

In [ ]:

nltk.download("book", download_dir="/usr/local/share/nltk_data")

In [ ]:

from nltk.book import *

In [ ]:

text1

In [ ]:

type(text1)

In [ ]:

from nltk.text import Text

In [ ]:

nltk.text.Text

In [ ]:

for line in text1.concordance_list("whale"):
    print (line.left_print, line.query, line.right_print)

In [ ]:

text5.tokens

Reading Words for the Future texts

Chapter 3 of the NLTK book discusses loading your own texts with urlopen and the nltk.text.Text class.

We can use urllib.request.urlopen to pull the "raw" URLs of materials from the SI13 materials repository on git.xpub.nl.

In [ ]:

url = "https://git.xpub.nl/XPUB/S13-Words-for-the-Future-materials/raw/branch/master/txt-essays/RESURGENCE%20Isabelle%20Stengers.txt"

In [ ]:

url

In [ ]:

from urllib.request import urlopen

In [ ]:

r = urlopen(url)

In [ ]:

rawtext = r.read()

In [ ]:

text = rawtext.decode()

In [ ]:

text = urlopen(url).read().decode()

In [ ]:

len(text)
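The fetch steps above can be wrapped in one small function. This is only a sketch: the name fetch_text and its encoding parameter are made up for illustration, and it assumes the essay files are UTF-8 encoded (which is also what a bare decode() assumes).

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its contents decoded as a string."""
    # The with-block closes the connection even if reading fails.
    with urlopen(url) as response:
        raw = response.read()  # bytes, not yet text
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")
```

With this in place, text = fetch_text(url) does the same job as the one-line version above.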

In [ ]:

# split() with no arguments breaks the text on any run of whitespace
words = text.split()

In [ ]:

len(words)

In [ ]:

from nltk import word_tokenize

In [ ]:

tokens = word_tokenize(text)

In [ ]:

len(tokens)

In [ ]:

tokens[-10:]

In [ ]:

stengers = Text(tokens)

In [ ]:

stengers.concordance("the", width=82, lines=74)

In [ ]:

for line in stengers.concordance_list("the", width=82, lines=74):
    print (line.left_print, line.query, line.right_print)

In [ ]:

with open ("patches/stengers_the.txt", "w") as output:
    for line in stengers.concordance_list("the", width=82, lines=74):
        print (line.left_print, line.query, line.right_print, file=output)
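The cell above fails if the patches folder does not exist yet. A possible workaround, sketched here with a hypothetical helper called save_lines, is to create the parent folder before opening the file.

```python
from pathlib import Path


def save_lines(lines, path):
    """Write each string in lines to path, one per line."""
    path = Path(path)
    # Create missing parent folders such as "patches/" first.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as output:
        for line in lines:
            print(line, file=output)
    return path
```

It takes any list (or generator) of strings, so the concordance lines would first be joined into one string each, for example with " ".join([line.left_print, line.query, line.right_print]).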

In [ ]:

for line in stengers.concordance_list("the", width=82, lines=74):
    print (line.query)

In [ ]:

stengers.concordance("the", width=3)

In [ ]:

stengers.common_contexts(["power", "victims"])

In [ ]:

stengers.dispersion_plot(["power", "the", "victims"])

In [ ]:

from nltk.probability import FreqDist

In [ ]:

freq = FreqDist(stengers)

In [ ]:

freq["WHALE"]

In [ ]:

freq['power']

In [ ]:

freq.plot(50)

In [ ]:

freq.plot(50, cumulative=True)

Counting Vocabulary

Making a function

Investigating a text as a list of words, we can compare the total number of words with the number of unique words. Dividing the first by the second gives the average number of times each distinct word is used.

In [ ]:

len(stengers)

In [ ]:

len(set(stengers))

In [ ]:

def lexical_diversity(text):
    return len(text) / len(set(text))

In [ ]:

lexical_diversity(stengers)
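To check the arithmetic by hand, here is the same function applied to a tiny made-up sentence instead of the Stengers text. The sentence is invented for illustration only.

```python
def lexical_diversity(tokens):
    # Average number of times each distinct token is used.
    return len(tokens) / len(set(tokens))


tokens = "the cat and the hat and the bat".split()

print(len(tokens))                # 8 words in total
print(len(set(tokens)))           # 5 distinct words
print(lexical_diversity(tokens))  # 1.6
```

Note that "The" and "the" would count as two different words here, since nothing is lowercased.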

In [ ]:

def percentage (count, total):
    return 100 * count / total

In [ ]:

percentage(4, 5)

NB: BE CAREFUL RUNNING THE FOLLOWING LINE ... IT'S REALLY SLOW... Not all code is equal, and just because two different methods produce the same result doesn't mean they're equally usable in practice.

Why? Because text1 (Moby Dick) behaves like a list: checking if (x not in text1) has to scan the whole list of words, AND THEN this scan is done FOR EVERY WORD in the stengers text. The total work is the number of words in one text multiplied by the number of words in the other. This is called "order n squared" (quadratic) execution: doubling the length of both texts makes the code roughly four times slower. That is not exponential growth, but it is the familiar cost of nested loops over large lists.... SSSSSSSSSLLLLLLLLLOOOOOOOOOOOWWWWWWWWWWW

In [ ]:

# stengers_unique = []
# for word in stengers.tokens:
#     if word not in text1:
#         stengers_unique.append(word)

In [ ]:

# stengers_unique = [x for x in stengers.tokens if x not in text1]

FIX: make a set based on the Moby Dick text. Checking if something is in a set is VERY FAST compared to scanning a list (on average constant time, order 1, instead of order n)...

In [ ]:

moby = set(text1)

In [ ]:

"the" in moby
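The difference in lookup cost can be measured without downloading any corpus. The sketch below uses made-up word lists (20,000 "words" to search in, 200 to look up) and the standard timeit module; the exact timings will vary by machine.

```python
import timeit

# Made-up stand-ins for a big text and a list of words to look up.
big_list = [f"word{i}" for i in range(20_000)]
big_set = set(big_list)
queries = [f"missing{i}" for i in range(200)]  # none of these are in big_list


def not_in_list():
    # Each "not in" scans the whole list before giving up.
    return [w for w in queries if w not in big_list]


def not_in_set():
    # Each "not in" is a single hash lookup.
    return [w for w in queries if w not in big_set]


list_seconds = timeit.timeit(not_in_list, number=3)
set_seconds = timeit.timeit(not_in_set, number=3)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.6f}s")
```

Both functions return the same words; only the time taken differs.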

Rather than n*n (n squared), the following is roughly just n: one fast set lookup for each word in the stengers text.

In [ ]:

stengers_unique = []
for word in stengers.tokens:
    if word not in moby:
        stengers_unique.append(word)

The above can also be expressed using the more compact form of a list comprehension

In [ ]:

stengers_unique = [word for word in stengers.tokens if word not in moby]
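A toy-sized check that the loop and the list comprehension really produce the same list. The word lists are invented for illustration. The last line shows a set-difference variant, which may be handy if only the distinct words are needed and their order does not matter.

```python
essay = ["witches", "and", "the", "commons", "and", "witches"]
reference = {"and", "the", "whale"}

# Loop form
unique_loop = []
for word in essay:
    if word not in reference:
        unique_loop.append(word)

# List comprehension form
unique_comp = [word for word in essay if word not in reference]

print(unique_loop == unique_comp)      # True
print(unique_comp)                     # ['witches', 'commons', 'witches']
print(sorted(set(essay) - reference))  # ['commons', 'witches']
```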

In [ ]:

len(stengers_unique)

In [ ]:

stengers_unique

In [ ]:

stengers_unique_text = Text(stengers_unique)

In [ ]:

freq = FreqDist(stengers_unique)

In [ ]:

freq.plot(50)

In [ ]:

stengers_unique_text.concordance("witches")

Increasing the default figure size

In [ ]:

from IPython.core.pylabtools import figsize

In [ ]:

figsize(20.0,20.0)

In [ ]:

stengers

Nami asks: How do I get concordances of just words ending in "ity"?

In [ ]:

t = stengers

In [ ]:

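ity = []
for w in stengers:
    # lowercase first so "Objectivity" and "OBJECTIVITY" end up as the same entry
    w = w.lower()
    if w.endswith("ity"):
        # print (w)
        ity.append(w)
ity = set(ity)  # a set keeps each word only once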

In [ ]:

for word in ity:
    stengers.concordance(word)
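A compact sketch of the same filter on a made-up token list, written as a set comprehension. It lowercases each word before testing its ending, so that all-caps forms are caught too.

```python
tokens = ["Objectivity", "CITY", "power", "reality", "Reality", "pity", "it"]

# The set removes duplicates; sorted() gives a stable, readable order.
ity_words = sorted({w.lower() for w in tokens if w.lower().endswith("ity")})

print(ity_words)  # ['city', 'objectivity', 'pity', 'reality']
```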

In [ ]:

"Objectivity".lower()

In [ ]:

set(ity)

Clara asks, what about lines that are shorter than the width you give?

https://www.peterbe.com/plog/how-to-pad-fill-string-by-variable-python

cwidth is how much "padding" is needed on each side: it's our page width minus the length of the word, divided by 2. In Python, // means "integer" (whole number) division.

In [ ]:

for line in stengers.concordance_list("resurgence", width=82, lines=74):
    cwidth = (82 - len("resurgence")) // 2
    # print (cwidth)
    print ( line.left_print.rjust(cwidth), line.query, line.right_print.ljust(cwidth) )
