
Corpora and Vector Spaces

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
In [2]:
# start from documents as strings

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
In [3]:
# tokenize the documents, remove common words using the stoplist as well as words that only appear once

from pprint import pprint
from collections import defaultdict

# remove common words and tokenize

stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

To convert documents to vectors, we'll use a document representation called bag-of-words: each document is represented by a sparse vector of (word_id, word_count) pairs.

In [4]:
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict') # store the dictionary for future reference
print(dictionary)
2022-05-23 15:36:12,400 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-05-23 15:36:12,400 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2022-05-23 15:36:12,401 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2022-05-23T15:36:12.401796', 'gensim': '4.2.0', 'python': '3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'created'}
2022-05-23 15:36:12,401 : INFO : Dictionary lifecycle event {'fname_or_handle': '/tmp/deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-05-23T15:36:12.401796', 'gensim': '4.2.0', 'python': '3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'saving'}
2022-05-23 15:36:12,402 : INFO : saved /tmp/deerwester.dict
Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>
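
There are twelve distinct tokens in the processed corpus, so each document will be represented by a twelve-dimensional sparse vector. To see which integer id was assigned to which word, you can inspect the dictionary's token2id attribute (the mapping below is the one produced in this run, reconstructed from the outputs above):

print(dictionary.token2id)
# {'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4,
#  'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}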
In [5]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use
print(corpus)
2022-05-23 15:36:12,851 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2022-05-23 15:36:12,853 : INFO : saving sparse matrix to /tmp/deerwester.mm
2022-05-23 15:36:12,853 : INFO : PROGRESS: saving document #0
2022-05-23 15:36:12,854 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2022-05-23 15:36:12,855 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
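
The doc2bow() function simply counts the occurrences of each distinct word, converts the word to its integer id and returns the result as a sparse vector. Words not present in the dictionary are silently ignored, as a quick sketch shows:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
# [(0, 1), (1, 1)] -- "interaction" does not appear in the dictionary and is ignored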

Corpus Streaming - One Document at a Time

This is useful when working with large corpora, since documents are not loaded into memory all at once. Instead, they are streamed one document at a time, e.g. transparently from disk or the network via smart_open.

In [6]:
from smart_open import open # for transparently opening remote files

class MyCorpus:
    def __iter__(self):
        for line in open('https://radimrehurek.com/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

About this yield statement: https://stackoverflow.com/a/231855

About this MyCorpus class:

The assumption that each document occupies one line in a single file is not important; you can mold the __iter__ function to fit your input format, whatever it is. Walking directories, parsing XML, accessing the network… Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside __iter__.
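
For instance, here is a hypothetical variant that walks a directory of plain-text files instead of reading a single file (the class name and directory layout are made up for illustration):

import os

class MyDirCorpus:
    """Hypothetical variant of MyCorpus: one document per *.txt file in a directory."""
    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        for fname in sorted(os.listdir(self.top_dir)):
            if fname.endswith('.txt'):
                with open(os.path.join(self.top_dir, fname)) as f:
                    # parse the file into a clean list of tokens, then yield its sparse vector
                    yield dictionary.doc2bow(f.read().lower().split())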

In [7]:
corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
print(corpus_memory_friendly)
<__main__.MyCorpus object at 0x000002362CFCA530>
In [8]:
for vector in corpus_memory_friendly: # load one vector into memory at a time
    print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
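
Because MmCorpus.serialize accepts any iterable of sparse vectors, the streamed corpus can also be written straight to disk without ever materializing it as a list. A minimal sketch (the output path is arbitrary):

# streams documents from MyCorpus one at a time while writing to disk
corpora.MmCorpus.serialize('/tmp/corpus_stream.mm', corpus_memory_friendly)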

In [9]:
# We can also construct the dictionary without loading all texts into memory

# collect statistics about all tokens

dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))



stop_ids = [ 
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]

once_ids = [
    tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1
]

dictionary.filter_tokens(stop_ids + once_ids) # remove stopwords and words that appear only once
dictionary.compactify() # remove gaps in id sequence after words that were removed
print(dictionary)
2022-05-23 15:42:28,778 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-05-23 15:42:28,779 : INFO : built Dictionary<42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...> from 9 documents (total 69 corpus positions)
2022-05-23 15:42:28,779 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...> from 9 documents (total 69 corpus positions)", 'datetime': '2022-05-23T15:42:28.779876', 'gensim': '4.2.0', 'python': '3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'created'}
Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>
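
As a side note, the Dictionary class also offers filter_extremes() as a convenience for frequency-based pruning. A sketch of a roughly equivalent call (it prunes by document frequency only, so the stopword removal above is still needed separately):

# prune tokens appearing in fewer than 2 documents; keep everything else
dictionary.filter_extremes(no_below=2, no_above=1.0)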

Corpus Formats

There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.

One of the more notable file formats is the Matrix Market format.

In [12]:
corpus = [[(1, 0.5)], []] # two documents (one is empty!)


# To save
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

# To load
corpus = corpora.MmCorpus('/tmp/corpus.mm')
2022-05-23 15:50:56,704 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2022-05-23 15:50:56,705 : INFO : saving sparse matrix to /tmp/corpus.mm
2022-05-23 15:50:56,706 : INFO : PROGRESS: saving document #0
2022-05-23 15:50:56,707 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2022-05-23 15:50:56,708 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
2022-05-23 15:50:56,711 : INFO : loaded corpus index from /tmp/corpus.mm.index
2022-05-23 15:50:56,711 : INFO : initializing cython corpus reader from /tmp/corpus.mm
2022-05-23 15:50:56,714 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries
In [13]:
# Corpus objects are streams, so typically you won't be able to print them directly:
print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)
In [14]:
# one way of printing a corpus: load it entirely into memory

print(list(corpus))  # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]
In [15]:
# another way of doing it: print one document at a time, making use of the streaming interface
# (more memory friendly)
for doc in corpus:
    print(doc)
[(1, 0.5)]
[]
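
Gensim implements several other formats through the same streaming interface, among them Joachims' SVMlight format, Blei's LDA-C format, and GibbsLDA++'s LOW format; saving works analogously (the file names are arbitrary):

corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)  # Joachims' SVMlight format
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)         # Blei's LDA-C format
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)            # GibbsLDA++ format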