NLTK - Part of Speech
import nltk
import random
lines = open('../txt/language.txt').readlines()
sentence = random.choice(lines)
print(sentence)
Tokens
tokens = nltk.word_tokenize(sentence)
print(tokens)
Part of Speech "tags"
tagged = nltk.pos_tag(tokens)
print(tagged)
Now you could select, for example, all the verbs, whatever their form:
selection = []
for word, tag in tagged:
    if 'VB' in tag:
        selection.append(word)
print(selection)
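The substring test `'VB' in tag` works because every Penn Treebank verb tag starts with `VB`. A slightly stricter variant checks the prefix explicitly. The sketch below uses a small hand-written list in the `(word, tag)` shape that `nltk.pos_tag` returns, so it runs without NLTK. The sample sentence, its tags, and the helper name `select_by_prefix` are assumptions made for illustration, not tagger output.

```python
# Hand-written sample in the (word, tag) shape that nltk.pos_tag returns.
# These tags are assumed for illustration, not produced by a tagger.
sample_tagged = [
    ("She", "PRP"), ("wrote", "VBD"), ("and", "CC"),
    ("is", "VBZ"), ("writing", "VBG"), ("long", "JJ"), ("letters", "NNS"),
]


def select_by_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


verbs = select_by_prefix(sample_tagged, "VB")
nouns = select_by_prefix(sample_tagged, "NN")
print(verbs)
print(nouns)
```

The same helper can be pointed at the real `tagged` list from the cell above, for instance with the prefix `"JJ"` to collect adjectives.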
Where do these tags come from?
An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset.
NLTK provides documentation for each tag, which can be queried using the tag, e.g. nltk.help.upenn_tagset('RB').
nltk.help.upenn_tagset('PRP')
An alphabetical list of part-of-speech tags used in the Penn Treebank Project:
Number | Tag | Description |
1. | CC | Coordinating conjunction |
2. | CD | Cardinal number |
3. | DT | Determiner |
4. | EX | Existential there |
5. | FW | Foreign word |
6. | IN | Preposition or subordinating conjunction |
7. | JJ | Adjective |
8. | JJR | Adjective, comparative |
9. | JJS | Adjective, superlative |
10. | LS | List item marker |
11. | MD | Modal |
12. | NN | Noun, singular or mass |
13. | NNS | Noun, plural |
14. | NNP | Proper noun, singular |
15. | NNPS | Proper noun, plural |
16. | PDT | Predeterminer |
17. | POS | Possessive ending |
18. | PRP | Personal pronoun |
19. | PRP$ | Possessive pronoun |
20. | RB | Adverb |
21. | RBR | Adverb, comparative |
22. | RBS | Adverb, superlative |
23. | RP | Particle |
24. | SYM | Symbol |
25. | TO | to |
26. | UH | Interjection |
27. | VB | Verb, base form |
28. | VBD | Verb, past tense |
29. | VBG | Verb, gerund or present participle |
30. | VBN | Verb, past participle |
31. | VBP | Verb, non-3rd person singular present |
32. | VBZ | Verb, 3rd person singular present |
33. | WDT | Wh-determiner |
34. | WP | Wh-pronoun |
35. | WP$ | Possessive wh-pronoun |
36. | WRB | Wh-adverb |
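Many of the tags in the table share a prefix: `NN*` for nouns, `VB*` for verbs, `JJ*` for adjectives, `RB*` for adverbs. When the fine distinctions are not needed, the tags can be collapsed into coarse classes. The following is a minimal sketch of that idea; the class labels and the names `COARSE_CLASSES` and `coarse_class` are choices made for this example and are not part of NLTK.

```python
# Coarse word classes keyed by tag prefix, following the table above.
# The labels are chosen for this sketch; they are not an NLTK feature.
COARSE_CLASSES = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
}


def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse class, or 'other' if none fits."""
    for prefix, label in COARSE_CLASSES.items():
        if tag.startswith(prefix):
            return label
    return "other"


for tag in ["NNPS", "VBZ", "JJR", "RBS", "DT", "PRP$"]:
    print(tag, "->", coarse_class(tag))
```

Because the match is on the prefix only, `WRB` (Wh-adverb) lands in `other` here; add a `"WRB"` entry if it should count as an adverb.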
A telling/tricky case
It's important to realize that a POS tag is not a fixed property of a word: it depends on the context in which the word appears. The NLTK book gives an example with homonyms, words that are written the same but are pronounced differently and have different meanings depending on their use.
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
From the book:
Notice that refuse and permit both appear as a present tense verb (VBP) and a noun (NN). E.g. refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS-tagging.)
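To see which words receive more than one tag, the tagged pairs can be grouped by word. The sketch below hard-codes the tagger output for this sentence as it is printed in the NLTK book, so it runs without NLTK; a different tagger version may assign different tags. The variable names are chosen for this example.

```python
from collections import defaultdict

# Tagger output for the sentence as printed in the NLTK book;
# a different tagger version may produce different tags.
book_tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Collect the set of tags seen for each word.
tags_by_word = defaultdict(set)
for word, tag in book_tagged:
    tags_by_word[word].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```

Only `refuse` and `permit` remain: `to` and `They` occur with a single tag each.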
Applying to an entire text
language = open('../txt/language.txt').read()
tokens = nltk.word_tokenize(language)
tagged = nltk.pos_tag(tokens)
tagged
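A long list of tagged pairs is hard to read at a glance. One way to summarize it is to count how often each tag occurs. The sketch below does this with `collections.Counter` on a short hand-written sample; the tags in `sample_tagged` are assumed for illustration and were not produced by the tagger.

```python
from collections import Counter

# Hand-written (word, tag) pairs; assumed tags, not tagger output.
sample_tagged = [
    ("in", "IN"), ("the", "DT"), ("beginning", "NN"), ("was", "VBD"),
    ("heaven", "NN"), ("and", "CC"), ("earth", "NN"),
]

# Count the tags, ignoring the words themselves.
tag_counts = Counter(tag for _, tag in sample_tagged)
for tag, count in tag_counts.most_common(3):
    print(tag, count)
```

Replacing `sample_tagged` with the real `tagged` list gives the tag distribution of the whole text.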
words = "in the beginning was heaven and earth and the time of the whatever".split()
words
words.index("the")
1
for i, word in enumerate(words):
    if word == "the":
        print(i, word)
    else:
        print(word.upper())
IN
1 the
BEGINNING
WAS
HEAVEN
AND
EARTH
AND
8 the
TIME
OF
11 the
WHATEVER
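`words.index("the")` reports only the first match, while the `enumerate` loop visits every position. If only the positions are needed, the loop can be condensed into a list comprehension, as in this short sketch (the name `positions` is chosen for this example):

```python
words = "in the beginning was heaven and earth and the time of the whatever".split()

# Collect the index of every occurrence, not just the first one.
positions = [i for i, word in enumerate(words) if word == "the"]
print(positions)  # [1, 8, 11]
```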
import random

words = {}
words["VB"] = []
# Note: no tagging happens here, so every token is stored under the "VB" key
for word in nltk.word_tokenize("in the beginning was heaven and earth and the time of the whatever"):
    words["VB"].append(word)
random.choice(words["VB"])
'VB'
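The cell above files every token under the single key `"VB"`. A natural extension is to key the dictionary by each token's actual tag, so that a random word of a given type can be drawn. The following is a minimal sketch under stated assumptions: `sample_tagged` is a hand-written stand-in for the output of `nltk.pos_tag`, and its tags are assumed for illustration.

```python
import random
from collections import defaultdict

# Hand-written (word, tag) pairs; assumed tags, not tagger output.
sample_tagged = [
    ("in", "IN"), ("the", "DT"), ("beginning", "NN"), ("was", "VBD"),
    ("heaven", "NN"), ("and", "CC"), ("earth", "NN"),
]

# Group the words by their tag.
words_by_tag = defaultdict(list)
for word, tag in sample_tagged:
    words_by_tag[tag].append(word)

# Draw a random noun from the sample.
print(random.choice(words_by_tag["NN"]))
```

Note that `random.choice` raises `IndexError` on an empty list, so check that a tag actually occurs before drawing from it.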