
In [4]:
import pprint
  • Document: some text.
  • Corpus: a collection of documents.
  • Vector: a mathematically convenient representation of a document.
  • Model: an algorithm for transforming vectors from one representation to another.
In [5]:
document = 'Lorem ipsum dolor sit amet eheh 123 gelato'

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
In [6]:
# Cleaning the corpus

# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))

# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist] for document in text_corpus]

The list comprehension above looks dense, but it reads like this:

  • for every document in text_corpus:
  • lowercase the document and split it on whitespace to get a list of words
  • keep a word only if it is not in the stoplist

So the result is a list of lists of words, one inner list per document. An equivalent version with explicit loops is sketched below.
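For reference, this is what the comprehension does written out as plain loops (a sketch, not part of the original notebook; it assumes text_corpus and stoplist from the cells above):

texts_loop = []
for document in text_corpus:
    words = []
    for word in document.lower().split():
        if word not in stoplist:
            words.append(word)
    texts_loop.append(words)

# texts_loop contains the same list of token lists as texts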

In [7]:
# Count word frequencies

# we use defaultdict instead of a plain dict
# because it returns a default value (int() == 0 here) instead of raising a KeyError when a key is missing
from collections import defaultdict

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once

processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
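As a quick aside, this is the defaultdict behaviour the comment above relies on (a minimal standalone sketch):

from collections import defaultdict

counts = defaultdict(int)     # missing keys default to int() == 0
counts['apple'] += 1          # no KeyError, unlike a plain dict
print(counts['apple'])        # 1
print(counts['never-seen'])   # 0 (the key is created on first access)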
In [10]:
# to associate each word with a unique integer ID we use the Dictionary class provided by gensim.
# This dictionary defines the vocabulary of all words that our processing knows about.

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>
In [9]:
# print the id for each word
pprint.pprint(dictionary.token2id)
{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}
In [11]:
# create a bag-of-words vector for a new document, using the dictionary built from our corpus
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
[(0, 1), (1, 1)]

The first entry in each tuple is the ID of the token in the dictionary, and the second is the count of that token in the document. Note that "interaction" does not appear in the vector: tokens that are not in the dictionary are silently ignored by doc2bow.
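To read such a vector back in terms of words, we can map the IDs through the dictionary (a small sketch, assuming dictionary and new_vec from the cells above):

readable = [(dictionary[token_id], count) for token_id, count in new_vec]
print(readable)
# [('computer', 1), ('human', 1)]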

In [13]:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)
[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

Now we can use models, i.e., ways of transforming one document representation into another. One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation into a vector space where the raw counts are weighted according to the relative rarity of each word in the corpus.

In [14]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the 'system minors' string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])
[(5, 0.5898341626740045), (11, 0.8075244024440723)]
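With gensim's default settings, each term's weight is its count times log2(total_documents / document_frequency), and the resulting vector is L2-normalized. A small sketch reproducing the numbers printed above (the formula reflects gensim's defaults as I understand them; check the TfidfModel docs if you change any parameters):

import math

total_docs = 9     # documents in bow_corpus
df_system = 3      # 'system' occurs in 3 of the 9 documents
df_minors = 2      # 'minors' occurs in 2 of the 9 documents

w_system = math.log2(total_docs / df_system)   # ~1.585
w_minors = math.log2(total_docs / df_minors)   # ~2.170

norm = math.sqrt(w_system**2 + w_minors**2)
print(w_system / norm, w_minors / norm)        # ~0.5898 and ~0.8075, matching the output above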

We can save a trained model and load it back later, either to continue training it or to transform new documents; training does not have to happen all in one go.
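For example (a sketch; the filename is just a placeholder):

# persist the trained tf-idf model to disk
tfidf.save('tfidf.model')

# ...later, possibly in another session...
loaded_tfidf = models.TfidfModel.load('tfidf.model')
print(loaded_tfidf[dictionary.doc2bow('system minors'.split())])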

In [16]:
from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

query_document = 'system engineering'.lower().split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))
[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
In [18]:
# sorting the similarities by score

for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)
3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
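Putting the pieces together, the whole query step can be wrapped in a small helper (a sketch built from the objects defined above; the function name most_similar is just for illustration):

def most_similar(query, dictionary, tfidf, index):
    # return (document_number, score) pairs sorted by tf-idf cosine similarity
    query_bow = dictionary.doc2bow(query.lower().split())
    sims = index[tfidf[query_bow]]
    return sorted(enumerate(sims), key=lambda item: item[1], reverse=True)

print(most_similar('system engineering', dictionary, tfidf, index)[:3])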
In [ ]: