# NLTK - Part of Speech

In [2]:
import nltk
import random

In [3]:
lines = open('txt/language.txt').readlines()
sentence = random.choice(lines)
print(sentence)

To complicate things even further, computer science has its own understanding of “operational semantics” in programming languages, for example in the construction of a programming language interpreter or compiler.



## Tokens

In [4]:
tokens = nltk.word_tokenize(sentence)
print(tokens)

['To', 'complicate', 'things', 'even', 'further', ',', 'computer', 'science', 'has', 'its', 'own', 'understanding', 'of', '“', 'operational', 'semantics', '”', 'in', 'programming', 'languages', ',', 'for', 'example', 'in', 'the', 'construction', 'of', 'a', 'programming', 'language', 'interpreter', 'or', 'compiler', '.']


## Part of Speech "tags"

In [5]:
tagged = nltk.pos_tag(tokens)
print(tagged)

[('To', 'TO'), ('complicate', 'VB'), ('things', 'NNS'), ('even', 'RB'), ('further', 'RB'), (',', ','), ('computer', 'NN'), ('science', 'NN'), ('has', 'VBZ'), ('its', 'PRP$'), ('own', 'JJ'), ('understanding', 'NN'), ('of', 'IN'), ('“', 'NNP'), ('operational', 'JJ'), ('semantics', 'NNS'), ('”', 'VBP'), ('in', 'IN'), ('programming', 'NN'), ('languages', 'NNS'), (',', ','), ('for', 'IN'), ('example', 'NN'), ('in', 'IN'), ('the', 'DT'), ('construction', 'NN'), ('of', 'IN'), ('a', 'DT'), ('programming', 'JJ'), ('language', 'NN'), ('interpreter', 'NN'), ('or', 'CC'), ('compiler', 'NN'), ('.', '.')]


Now you can select, for example, all the **verbs**, whatever their form:

In [6]:
selection = []

for word, tag in tagged:
    if 'VB' in tag:
        selection.append(word)

print(selection)

['complicate', 'has', '”']
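The closing quotation mark `”` ends up in the selection because the tagger labelled it `VBP`. Below is a minimal sketch of a stricter filter, assuming you only want purely alphabetic tokens. `tagged_sample` is a hand-copied excerpt of the output above (not the full sentence), so the snippet runs without the tagger.

```python
# Excerpt of the tagged sentence above, copied by hand for illustration
tagged_sample = [
    ('To', 'TO'), ('complicate', 'VB'), ('things', 'NNS'),
    ('computer', 'NN'), ('science', 'NN'), ('has', 'VBZ'),
    ('semantics', 'NNS'), ('”', 'VBP'), ('in', 'IN'),
]

verbs = []

for word, tag in tagged_sample:
    # Penn Treebank verb tags all start with 'VB';
    # isalpha() drops tokens that are not made of letters only
    if tag.startswith('VB') and word.isalpha():
        verbs.append(word)

print(verbs)
```

Note that `isalpha()` is a blunt tool: it would also drop hyphenated words such as `so-called`, so whether it fits depends on the text.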


### Where do these tags come from?

> An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset.

From: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag

> NLTK provides documentation for each tag, which can be queried using the tag, e.g. nltk.help.upenn_tagset('RB').

From: http://www.nltk.org/book_1ed/ch05.html

In [7]:
nltk.help.upenn_tagset('PRP')

PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


------------

An alphabetical list of part-of-speech tags used in the Penn Treebank Project ([link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)):

<table cellspacing="2" cellpadding="2" border="0">
  <tbody><tr bgcolor="#DFDFFF" align="none"> 
    <td align="none"> 
      <div align="left">Number</div>
    </td>
    <td> 
      <div align="left">Tag</div>
    </td>
    <td> 
      <div align="left">Description</div>
    </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 1. </td>
    <td>CC </td>
    <td>Coordinating conjunction </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 2. </td>
    <td>CD </td>
    <td>Cardinal number </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 3. </td>
    <td>DT </td>
    <td>Determiner </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 4. </td>
    <td>EX </td>
    <td>Existential <i>there</i> </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 5. </td>
    <td>FW </td>
    <td>Foreign word </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 6. </td>
    <td>IN </td>
    <td>Preposition or subordinating conjunction </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 7. </td>
    <td>JJ </td>
    <td>Adjective </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 8. </td>
    <td>JJR </td>
    <td>Adjective, comparative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 9. </td>
    <td>JJS </td>
    <td>Adjective, superlative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 10. </td>
    <td>LS </td>
    <td>List item marker </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 11. </td>
    <td>MD </td>
    <td>Modal </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 12. </td>
    <td>NN </td>
    <td>Noun, singular or mass </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 13. </td>
    <td>NNS </td>
    <td>Noun, plural </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 14. </td>
    <td>NNP </td>
    <td>Proper noun, singular </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 15. </td>
    <td>NNPS </td>
    <td>Proper noun, plural </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 16. </td>
    <td>PDT </td>
    <td>Predeterminer </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 17. </td>
    <td>POS </td>
    <td>Possessive ending </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 18. </td>
    <td>PRP </td>
    <td>Personal pronoun </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 19. </td>
    <td>PRP\$ </td>
    <td>Possessive pronoun </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 20. </td>
    <td>RB </td>
    <td>Adverb </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 21. </td>
    <td>RBR </td>
    <td>Adverb, comparative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 22. </td>
    <td>RBS </td>
    <td>Adverb, superlative </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 23. </td>
    <td>RP </td>
    <td>Particle </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 24. </td>
    <td>SYM </td>
    <td>Symbol </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 25. </td>
    <td>TO </td>
    <td><i>to</i> </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 26. </td>
    <td>UH </td>
    <td>Interjection </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 27. </td>
    <td>VB </td>
    <td>Verb, base form </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 28. </td>
    <td>VBD </td>
    <td>Verb, past tense </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 29. </td>
    <td>VBG </td>
    <td>Verb, gerund or present participle </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 30. </td>
    <td>VBN </td>
    <td>Verb, past participle </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 31. </td>
    <td>VBP </td>
    <td>Verb, non-3rd person singular present </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 32. </td>
    <td>VBZ </td>
    <td>Verb, 3rd person singular present </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 33. </td>
    <td>WDT </td>
    <td>Wh-determiner </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 34. </td>
    <td>WP </td>
    <td>Wh-pronoun </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 35. </td>
    <td>WP$ </td>
    <td>Possessive wh-pronoun </td>
  </tr>
  <tr bgcolor="#FFFFCA"> 
    <td align="none"> 36. </td>
    <td>WRB </td>
    <td>Wh-adverb 
</td></tr></tbody></table>
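
The table can also be used directly in code, for instance to print a readable description next to each tagged word. This is only a sketch: `PENN_TAGS` is a hypothetical dictionary holding a handful of rows copied from the table above, and `describe` is a helper name made up for this example, not part of NLTK.

```python
# A few entries copied from the table above (not the full tagset)
PENN_TAGS = {
    'DT': 'Determiner',
    'IN': 'Preposition or subordinating conjunction',
    'JJ': 'Adjective',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'VBZ': 'Verb, 3rd person singular present',
}

def describe(tagged_words):
    """Return (word, tag, description) triples; tags not in PENN_TAGS get '?'."""
    return [(word, tag, PENN_TAGS.get(tag, '?')) for word, tag in tagged_words]

# Hand-copied excerpt of a tagged sentence
sample = [('computer', 'NN'), ('science', 'NN'), ('has', 'VBZ'), (',', ',')]

for word, tag, description in describe(sample):
    print(f'{word:10} {tag:5} {description}')
```

Punctuation tags such as `,` are not listed in the table, which is why the sketch falls back to `'?'` for them.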

## Applying to an entire text

In [8]:
language = open('txt/language.txt').read()
tokens = nltk.word_tokenize(language)
tagged = nltk.pos_tag(tokens)
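
Once a whole text is tagged, one thing you could do is count how often each tag occurs. The sketch below uses `collections.Counter` from the standard library. `tagged_sample` is a short hand-copied excerpt of the tagged text, used here so the snippet runs on its own; with the real data you would pass `tagged` instead.

```python
from collections import Counter

# Hand-copied excerpt of the tagged text, for illustration only
tagged_sample = [
    ('Software', 'NNP'), ('and', 'CC'), ('language', 'NN'), ('are', 'VBP'),
    ('intrinsically', 'RB'), ('related', 'VBN'), (',', ','), ('since', 'IN'),
    ('software', 'NN'), ('may', 'MD'), ('process', 'VB'), ('language', 'NN'),
]

# Count the tags only, ignoring the words themselves
tag_counts = Counter(tag for word, tag in tagged_sample)

# The most frequent tag in this excerpt, with its count
print(tag_counts.most_common(1))
```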

In [9]:
tagged

[('Language', 'NN'),
 ('Florian', 'JJ'),
 ('Cramer', 'NNP'),
 ('Software', 'NNP'),
 ('and', 'CC'),
 ('language', 'NN'),
 ('are', 'VBP'),
 ('intrinsically', 'RB'),
 ('related', 'VBN'),
 (',', ','),
 ('since', 'IN'),
 ('software', 'NN'),
 ('may', 'MD'),
 ('process', 'VB'),
 ('language', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('is', 'VBZ'),
 ('constructed', 'VBN'),
 ('in', 'IN'),
 ('language', 'NN'),
 ('.', '.'),
 ('Yet', 'CC'),
 ('language', 'NN'),
 ('means', 'VBZ'),
 ('different', 'JJ'),
 ('things', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('context', 'NN'),
 ('of', 'IN'),
 ('computing', 'VBG'),
 (':', ':'),
 ('formal', 'JJ'),
 ('languages', 'NNS'),
 ('in', 'IN'),
 ('which', 'WDT'),
 ('algorithms', 'EX'),
 ('are', 'VBP'),
 ('expressed', 'VBN'),
 ('and', 'CC'),
 ('software', 'NN'),
 ('is', 'VBZ'),
 ('implemented', 'VBN'),
 (',', ','),
 ('and', 'CC'),
 ('in', 'IN'),
 ('so-called', 'JJ'),
 ('“', 'NNP'),
 ('natural', 'JJ'),
 ('”', 'NNP'),
 ('spoken', 'NN'),
 ('languages', 'NNS'),
 ('.', '.'),
