XPUB

S13-Words-for-the-Future-notebooks


NLTK - Part of Speech¶

In [ ]:

import nltk
import random

In [ ]:

# Read the text and keep only the non-empty lines
with open('../txt/language.txt') as f:
    lines = [line.strip() for line in f if line.strip()]
sentence = random.choice(lines)
print(sentence)

Tokens¶

In [ ]:

tokens = nltk.word_tokenize(sentence)
print(tokens)

Part of Speech "tags"¶

In [ ]:

tagged = nltk.pos_tag(tokens)
print(tagged)

Now you could select, for example, all types of verbs, that is, every word whose tag contains 'VB':

In [ ]:

selection = []

for word, tag in tagged:
    if 'VB' in tag:
        selection.append(word)

print(selection)
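
The check `'VB' in tag` matches any tag that contains the letters VB. A slightly stricter variant, sketched below, uses `str.startswith` so that only tags beginning with a given prefix are selected. The list `tagged_sample` is hand-written illustrative data in the shape that `nltk.pos_tag` returns (not output from this notebook), and the helper name `select_by_prefix` is an assumption made for this sketch.

```python
# Hand-written sample in the same shape as the output of nltk.pos_tag():
# a list of (word, tag) tuples. The tags here are illustrative.
tagged_sample = [
    ("She", "PRP"),
    ("writes", "VBZ"),
    ("and", "CC"),
    ("rewrote", "VBD"),
    ("long", "JJ"),
    ("letters", "NNS"),
]


def select_by_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


verbs = select_by_prefix(tagged_sample, "VB")
nouns = select_by_prefix(tagged_sample, "NN")
print(verbs)
print(nouns)
```

The same helper works for any family of tags in the Penn Treebank tagset that shares a prefix, such as 'JJ' for adjectives or 'RB' for adverbs.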

Where do these tags come from?¶

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset.

From: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag

NLTK provides documentation for each tag, which can be queried using the tag, e.g. nltk.help.upenn_tagset('RB').

From: http://www.nltk.org/book_1ed/ch05.html

In [ ]:

nltk.help.upenn_tagset('PRP')

An alphabetical list of part-of-speech tags used in the Penn Treebank Project:

Number	Tag	Description
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending
18.	PRP	Personal pronoun
19.	PRP$	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner
34.	WP	Wh-pronoun
35.	WP$	Possessive wh-pronoun
36.	WRB	Wh-adverb
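
Many tags in the table share a prefix: the noun tags start with NN, the verb tags with VB, the adjective tags with JJ and the adverb tags with RB. As a rough sketch, a tag can be reduced to a coarse word class by checking those prefixes. The function name `coarse_class` and the class labels are assumptions for illustration and are not part of NLTK.

```python
# Prefixes taken from the Penn Treebank table; the labels are our own choice.
PREFIX_TO_CLASS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
}


def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse word class, or 'other'."""
    for prefix, label in PREFIX_TO_CLASS.items():
        if tag.startswith(prefix):
            return label
    return "other"


for tag in ["NNPS", "VBG", "JJR", "RBS", "DT"]:
    print(tag, "->", coarse_class(tag))
```

Tags outside these four families, such as DT or WRB, fall through to 'other' in this sketch.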

A telling/tricky case¶

It's important to realize that a POS tag is not a fixed property of a word; it depends on the context in which the word appears. The NLTK book gives an example with homonyms, here more precisely homographs: words that are spelled the same but are pronounced differently and have different meanings depending on their use.

In [ ]:

text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

From the book:

Notice that refuse and permit both appear as a present tense verb (VBP) and a noun (NN). E.g. refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS-tagging.)
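
To make the point concrete, the sketch below collects the tags that each word receives in this sentence. It hard-codes the tagged output that the NLTK book lists for the sentence, so the block runs without the tagger; a different tagger version may produce slightly different tags. The variable names are illustrative.

```python
from collections import defaultdict

# Tagged output for the sentence as listed in the NLTK book (chapter 5).
tagged_sentence = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Collect every tag seen for each word.
tags_per_word = defaultdict(set)
for word, tag in tagged_sentence:
    tags_per_word[word].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_per_word.items() if len(t) > 1}
print(ambiguous)
```

Only "refuse" and "permit" end up with more than one tag; "to" appears twice but receives the same tag both times.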

Applying to an entire text¶

In [ ]:

with open('../txt/language.txt') as f:
    language = f.read()
tokens = nltk.word_tokenize(language)
tagged = nltk.pos_tag(tokens)

In [ ]:

tagged
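
With a whole text tagged, one simple next step is to count how often each tag occurs. A minimal sketch using `collections.Counter` follows. The short `tagged_sample` list is hand-written stand-in data with illustrative tags so that the block runs on its own; in the notebook you would pass the `tagged` list instead.

```python
from collections import Counter

# Stand-in for the `tagged` list produced by nltk.pos_tag(); illustrative tags.
tagged_sample = [
    ("Language", "NN"), ("is", "VBZ"), ("a", "DT"), ("virus", "NN"),
    ("from", "IN"), ("outer", "JJ"), ("space", "NN"),
]

# Count the tags, ignoring the words themselves.
tag_counts = Counter(tag for word, tag in tagged_sample)

# most_common() lists (tag, count) pairs from most to least frequent.
for tag, count in tag_counts.most_common(3):
    print(tag, count)
```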

In [1]:

words = "in the beginning was heaven and earth and the time of the whatever".split()

In [ ]:

words

In [2]:

words.index("the")

Out[2]:

1

In [3]:

for i, word in enumerate(words):
    if word == "the":
        print(i, word)
    else:
        print(word.upper())

IN
1 the
BEGINNING
WAS
HEAVEN
AND
EARTH
AND
8 the
TIME
OF
11 the
WHATEVER
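
`words.index("the")` only reports the first match, while the loop prints every position where the word occurs. If only the positions are needed, a list comprehension over `enumerate` can collect them directly. The variable names `first` and `positions` are just illustrative.

```python
words = "in the beginning was heaven and earth and the time of the whatever".split()

# index() stops at the first occurrence.
first = words.index("the")

# enumerate() yields (position, word) pairs, so every match can be collected.
positions = [i for i, word in enumerate(words) if word == "the"]

print(first)
print(positions)
```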

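In [ ]:

import random

words = {}
words["VB"] = []

# Tag the tokens and keep only the verbs (tags containing 'VB') under the "VB" key
tagged_words = nltk.pos_tag(nltk.word_tokenize("in the beginning was heaven and earth and the time of the whatever"))
for word, tag in tagged_words:
    if 'VB' in tag:
        words["VB"].append(word)

random.choice(words["VB"])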
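
The cell above keeps a list of words under a tag key in a dictionary. Extending that idea, the sketch below builds one list per tag, so that a random word can be drawn for any tag that occurs. The `tagged_sample` pairs are hand-written stand-ins for `nltk.pos_tag` output, and the name `words_by_tag` is an assumption for this example.

```python
import random

# Hand-written (word, tag) pairs standing in for nltk.pos_tag() output.
tagged_sample = [
    ("the", "DT"), ("time", "NN"), ("was", "VBD"),
    ("the", "DT"), ("beginning", "NN"), ("and", "CC"),
]

# Build a dictionary that maps each tag to the list of words carrying it.
words_by_tag = {}
for word, tag in tagged_sample:
    words_by_tag.setdefault(tag, []).append(word)

print(words_by_tag)

# Pick a random word for one tag; the result varies from run to run.
print(random.choice(words_by_tag["NN"]))
```

Using `setdefault` avoids having to create an empty list for every tag in advance.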

In [ ]:
