Corpora and Vector Spaces
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# start from documents as strings
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
# tokenize the documents, remove common words using the stoplist,
# as well as words that appear only once
from pprint import pprint
from collections import defaultdict

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
To convert documents to vectors, we’ll use a document representation called bag-of-words. In this representation, each document is represented by one sparse vector of (word_id, word_count) pairs.
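As a warm-up before using gensim's Dictionary, here is a minimal pure-Python sketch of the bag-of-words idea. The helper names build_vocab and to_bow are illustrative, not gensim API:

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to every distinct token (illustrative helper)."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(text, vocab):
    """Represent a document as sparse, sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())

texts = [['human', 'interface', 'computer'],
         ['system', 'human', 'system', 'eps']]
vocab = build_vocab(texts)
print(to_bow(['system', 'human', 'system', 'eps'], vocab))
# → [(0, 1), (3, 2), (4, 1)]
```

Note that 'system' appears twice, so its id carries a count of 2; tokens absent from the vocabulary are simply skipped, which is also how gensim's doc2bow treats unknown words.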
from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary for future reference
print(dictionary)
2022-05-23 15:36:12,400 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-05-23 15:36:12,400 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2022-05-23 15:36:12,401 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2022-05-23T15:36:12.401796', 'gensim': '4.2.0', 'python': '3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'created'}
2022-05-23 15:36:12,401 : INFO : Dictionary lifecycle event {'fname_or_handle': '/tmp/deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-05-23T15:36:12.401796', 'gensim': '4.2.0', 'python': '3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'saving'}
2022-05-23 15:36:12,402 : INFO : saved /tmp/deerwester.dict
Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
print(corpus)
2022-05-23 15:36:12,851 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2022-05-23 15:36:12,853 : INFO : saving sparse matrix to /tmp/deerwester.mm
2022-05-23 15:36:12,853 : INFO : PROGRESS: saving document #0
2022-05-23 15:36:12,854 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2022-05-23 15:36:12,855 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
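To read these sparse vectors back, map each id through the dictionary. A small pure-Python sketch, with the id→token table transcribed from the Dictionary built above (in gensim you would use dictionary.token2id or dictionary[id] instead of hard-coding it):

```python
# id -> token table, transcribed from the Dictionary output above
id2token = {0: 'computer', 1: 'human', 2: 'interface', 3: 'response',
            4: 'survey', 5: 'system', 6: 'time', 7: 'user',
            8: 'eps', 9: 'trees', 10: 'graph', 11: 'minors'}

def decode(bow):
    """Turn sparse (id, count) pairs back into readable (token, count) pairs."""
    return [(id2token[token_id], count) for token_id, count in bow]

print(decode([(1, 1), (5, 2), (8, 1)]))  # the fourth document
# → [('human', 1), ('system', 2), ('eps', 1)]
```

This matches the fourth input document, "System and human system engineering testing of EPS": 'system' occurs twice, while the stopwords and once-only words were filtered out earlier.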
Corpus Streaming - One Document at a Time
This is useful when working with large corpora, since the documents are not all loaded into memory at once. Instead, with smart_open they can be streamed one document at a time.
from smart_open import open  # for transparently opening remote files


class MyCorpus:
    def __iter__(self):
        for line in open('https://radimrehurek.com/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())
About this yield statement: https://stackoverflow.com/a/231855
About this MyCorpus class:
The assumption that each document occupies one line in a single file is not important; you can mold the __iter__ function to fit your input format, whatever it is. Walking directories, parsing XML, accessing the network… Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside __iter__.
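For example, a hypothetical variant that walks a directory, treating each plain-text file as one document, might look like this (DirCorpus is illustrative, not part of gensim):

```python
import os

class DirCorpus:
    """Hypothetical streaming corpus: one .txt file per document."""
    def __init__(self, dirname, dictionary):
        self.dirname = dirname
        self.dictionary = dictionary

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            if not fname.endswith('.txt'):
                continue
            with open(os.path.join(self.dirname, fname)) as f:
                tokens = f.read().lower().split()
            # convert the clean token list to a sparse vector, one doc at a time
            yield self.dictionary.doc2bow(tokens)
```

Only one file's tokens are held in memory at any moment; everything else stays on disk until the iterator reaches it.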
corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)
<__main__.MyCorpus object at 0x000002362CFCA530>
for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
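Incidentally, this is why MyCorpus is a class defining __iter__ rather than a bare generator function: a generator can be consumed only once, whereas the corpus object restarts its stream on every pass, so downstream models can iterate over it repeatedly. A quick pure-Python illustration:

```python
def gen_corpus():
    for doc in (['a'], ['b']):
        yield doc

g = gen_corpus()
print(list(g))   # → [['a'], ['b']]  first pass consumes the generator
print(list(g))   # → []              second pass: the generator is exhausted

class StreamCorpus:
    def __iter__(self):
        # a fresh generator is created on every iteration
        for doc in (['a'], ['b']):
            yield doc

c = StreamCorpus()
print(list(c))   # → [['a'], ['b']]
print(list(c))   # → [['a'], ['b']]  __iter__ restarts the stream each time
```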
# We can also construct the dictionary without loading all texts into memory
# collect statistics about all tokens
dictionary = corpora.Dictionary(
    line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt')
)

stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [
    tokenid
    for tokenid, docfreq in dictionary.dfs.items()
    if docfreq == 1
]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stopwords and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
2022-05-23 15:42:28,778 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-05-23 15:42:28,779 : INFO : built Dictionary<42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...> from 9 documents (total 69 corpus positions)
2022-05-23 15:42:28,779 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...> from 9 documents (total 69 corpus positions)", 'datetime': '2022-05-23T15:42:28.779876', 'gensim': '4.2.0', 'python': '3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'created'}
Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...>
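For intuition, the combined effect of filter_tokens followed by compactify can be sketched in plain Python (the token2id example below is made up for illustration, not the dictionary from this tutorial): first drop the unwanted ids, then renumber the survivors so the id range has no gaps:

```python
token2id = {'for': 0, 'human': 1, 'computer': 2, 'abc': 3, 'interface': 4}
bad_ids = {0, 3}  # e.g. a stopword and a once-only word

# filter step: drop the unwanted entries, leaving gaps in the id sequence
kept = {tok: i for tok, i in token2id.items() if i not in bad_ids}

# compactify step: reassign consecutive ids, preserving the old ordering
token2id = {tok: new_id
            for new_id, (tok, _) in enumerate(sorted(kept.items(), key=lambda kv: kv[1]))}
print(token2id)  # → {'human': 0, 'computer': 1, 'interface': 2}
```

Compact ids matter because later models allocate one array slot per id; gaps would waste memory.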
Corpus Formats
There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.
One of the more notable file formats is the Matrix Market format.
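Matrix Market's coordinate layout is plain text: a header line, a size line (rows, columns, non-zero entries), then one 1-based `row column value` entry per line. A minimal sketch of rendering a BoW corpus in this layout (gensim's MmCorpus.serialize does this for you, and additionally writes an index file):

```python
def to_mm(corpus, num_terms):
    """Render a list of BoW documents as Matrix Market coordinate text (sketch)."""
    lines = ['%%MatrixMarket matrix coordinate real general']
    entries = [(doc_no + 1, term_id + 1, weight)       # Matrix Market is 1-based
               for doc_no, doc in enumerate(corpus)
               for term_id, weight in doc]
    lines.append(f'{len(corpus)} {num_terms} {len(entries)}')
    lines += [f'{d} {t} {w}' for d, t, w in entries]
    return '\n'.join(lines)

print(to_mm([[(1, 0.5)], []], num_terms=2))
# prints:
# %%MatrixMarket matrix coordinate real general
# 2 2 1
# 1 2 0.5
```

The empty second document contributes no entry lines; only its presence in the size line records that the matrix has 2 rows, which matches the "saved 2x2 matrix, density=25.000% (1/4)" log below.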
corpus = [[(1, 0.5)], []]  # two documents (one of them empty!)

# To save:
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

# To load:
corpus = corpora.MmCorpus('/tmp/corpus.mm')
2022-05-23 15:50:56,704 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2022-05-23 15:50:56,705 : INFO : saving sparse matrix to /tmp/corpus.mm
2022-05-23 15:50:56,706 : INFO : PROGRESS: saving document #0
2022-05-23 15:50:56,707 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2022-05-23 15:50:56,708 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
2022-05-23 15:50:56,711 : INFO : loaded corpus index from /tmp/corpus.mm.index
2022-05-23 15:50:56,711 : INFO : initializing cython corpus reader from /tmp/corpus.mm
2022-05-23 15:50:56,714 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries
# Corpus objects are streams, so typically you won’t be able to print them directly:
print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)
# one way of printing a corpus: load it entirely into memory
print(list(corpus))  # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]
# another way of doing it: print one document at a time, making use of the streaming interface
# (more memory friendly)
for doc in corpus:
    print(doc)
[(1, 0.5)]
[]