{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NLTK - Term Frequency-Inverse Document Frequency (TF-IDF)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"import os, nltk\n",
"from math import log"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Making a corpus of txt files"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"corpus = {}\n",
"folder = '../txt/words-for-the-future/'\n",
"files = os.listdir(folder)\n",
"for f in files:\n",
"    if f.endswith('.txt'):\n",
"        # Read each text file (assumed to be UTF-8 encoded) and split it into word tokens\n",
"        with open(os.path.join(folder, f), encoding='utf-8') as fh:\n",
"            txt = fh.read()\n",
"        words = nltk.word_tokenize(txt)\n",
"        name = f.replace('.txt', '')\n",
"        corpus[name] = words"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'RESURGENCE': ['RESURGENCE', '|', 'Isabelle', 'Stengers', '“', 'We', 'are', 'the', 'grandchildren', 'of', 'the', 'witches', 'you', 'were', 'not', 'able', 'to', 'burn', '”', '', 'Tish', 'Thawer', 'I', 'will', 'take', 'this', 'motto', ',', 'which', 'has', 'flourished', 'in', 'recent', 'protests', 'in', 'the', 'United', 'States', ',', 'as', 'the', 'defiant', 'cry', 'of', 'resurgence', '', 'refusing', 'to', 'define', 'the', 'past', 'as', 'dead', 'and', 'buried', '.', 'Not', 'only', 'were', 'the', 'witches', 'killed', 'all', 'over', 'Europe', ',', 'but', 'their', 'memory', 'has', 'been', 'buried', 'by', 'the', 'many', 'retrospective', 'analyses', 'which', 'triumphantly', 'concluded', 'that', 'their', 'power', 'and', 'practices', 'were', 'a', 'matter', 'of', 'imaginary', 'collective', 'construction', 'affecting', 'both', 'the', 'victims', 'and', 'the', 'inquisitors', '.', 'Eco-feminists', 'have', 'proposed', 'a', 'very', 'different', 'understanding', 'of', 'the', '', 'burning', 'times.', '', 'They', 'associate', 'it', 'with', 'the', 'destruction', 'of', 'rural', 'cultures', 'and', 'their', 'old', 'rites', ',', 'with', 'the', 'violent', 'appropriation', 'of', 'the', 'commons', ',', 'with', 'the', 'rule', 'of', 'a', 'law', 'that', 'consecrated', 'the', 'unquestionable', 'rights', 'of', 'the', 'owner', ',', 'and', 'with', 'the', 'invention', 'of', 'the', 'modern', 'workers', 'who', 'can', 'only', 'sell', 'their', 'labour-power', 'on', 'the', 'market', 'as', 'a', 'commodity', '.', 'Listening', 'to', 'the', 'defiant', 'cry', 'of', 'the', 'women', 'who', 'name', 'themselves', 'granddaughters', 'of', 'the', 'past', 'witches', ',', 'I', 'will', 'go', 'further', '.', 'I', 'will', 'honour', 'the', 'vision', 'which', ',', 'since', 'the', 'Reagan', 'era', ',', 'has', 'sustained', 'reclaiming', 'witches', 'such', 'as', 'Starhawk', ',', 'who', 'associate', 'their', 'activism', 'with', 'the', 'memory', 'of', 'a', 'past', 'earth-based', 'religion', 'of', 'the', 'goddess', '-', 'who', 
'now', '', 'returns.', '', 'Against', 'the', 'ongoing', 'academic', 'critical', 'judgement', ',', 'I', 'claim', 'that', 'the', 'witches', '', 'resurgence', ',', 'their', 'chant', 'about', 'the', 'goddess', '', 'return', ',', 'and', 'inseparably', 'their', 'return', 'to', 'the', 'goddess', ',', 'should', 'not', 'be', 'taken', 'as', 'a', '', 'regression.', '', 'Given', 'the', 'threatening', 'unknown', 'our', 'future', 'is', 'facing', ',', 'the', 'question', 'of', 'academic', 'judgements', 'may', 'seem', 'like', 'a', 'rather', 'futile', 'one', '.', 'Very', 'few', ',', 'including', 'academics', 'themselves', ',', 'among', 'those', 'who', 'disqualify', 'the', 'resurgence', 'of', 'witches', 'as', 'regressive', ',', 'are', 'effectively', 'forced', 'to', 'think', 'by', 'this', 'future', ',', 'which', 'the', 'witches', 'resolutely', 'address', '.', 'They', 'are', 'too', 'busy', 'living', 'up', 'to', 'the', 'relentless', 'neoliberal', 'demands', 'which', 'they', 'have', 'now', 'to', 'satisfy', 'in', 'order', 'to', 'survive', '.', 'However', ',', 'if', 'there', 'is', 'something', 'to', 'be', 'learned', 'from', 'the', 'past', ',', 'it', 'may', 'well', 'be', 'the', 'way', 'in', 'which', 'defending', 'the', 'victims', 'of', 'eradicative', 'operations', 'has', 'so', 'often', 'deemed', 'futile', '.', 'In', 'one', 'way', 'or', 'another', ',', 'these', 'victims', 'deserved', 'their', 'fate', ',', 'or', 'this', 'fate', 'was', 'the', 'price', 'to', 'be', 'unhappily', 'paid', 'for', 'progress', '.', '“', 'Creative', 'destructions', ',', '”', 'economists', 'croon', '.', 'What', 'we', 'have', 'now', 'discovered', 'is', 'that', 'these', 'destructions', 'come', 'with', 'cascading', 'and', 'interconnecting', 'consequences', '.', 'Worlds', 'are', 'destroyed', 'and', 'no', 'such', 'destruction', 'is', 'ever', 'deserved', '.', 'This', 'is', 'why', 'I', 'will', 'address', 'the', 'academic', 'world', ',', 'which', ',', 'in', 'turns', ',', 'is', 'facing', 'its', 'own', 'destruction', '.', 
'Probably', ',', 'because', 'it', 'is', 'the', 'one', 'I', '
]
}
],
"source": [
"print(corpus)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_keys(['RESURGENCE', 'ATATA', 'UNDECIDABILITY', 'TENSE', 'PRACTICAL-VISION', 'OTHERNESS', '!?', 'HOPE', 'LIQUID', 'ECO-SWARAJ'])\n"
]
}
],
"source": [
"print(corpus.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# What do you want to query?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I would like to know how relevant the word"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"query = 'language'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"is for the text"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"text = 'HOPE'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"thanks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Term Frequency"
]
},
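{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what the next cell computes, using notation introduced here only for illustration (the symbols $t$, $d$ and $\\operatorname{count}$ do not appear in the original notebook): the term frequency of a query term $t$ in a text $d$ is the number of tokens in $d$ that equal $t$, divided by the total number of tokens in $d$.\n",
"\n",
"$$\\mathrm{tf}(t, d) = \\frac{\\operatorname{count}(t, d)}{|d|}$$\n",
"\n",
"The comparison is an exact string match, so `language` and `Language` are counted as different terms."
]
},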
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TF count: 0\n",
"Total number of words: 4067\n",
"TF - count/total: 0.0\n"
]
}
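],
"source": [
"tf_count = 0\n",
"\n",
"# Count exact (case-sensitive) matches of the query in the chosen text\n",
"for word in corpus[text]:\n",
"    if query == word:\n",
"        tf_count += 1\n",
"\n",
"# Divide by the length of the chosen text, not by a leftover `words` variable\n",
"total_words = len(corpus[text])\n",
"tf = tf_count / total_words\n",
"\n",
"print('TF count:', tf_count)\n",
"print('Total number of words:', total_words)\n",
"print('TF - count/total:', tf)"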
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inverse Document Frequency"
]
},
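{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the quantity the next cell computes, again with illustrative notation that is not part of the original notebook: $N$ is the number of documents in the corpus and $n_t$ is the number of documents that contain the query term $t$ at least once. Because `log` from Python's `math` module is the natural logarithm, the formula uses $\\ln$.\n",
"\n",
"$$\\mathrm{idf}(t) = \\ln \\frac{N}{n_t}$$\n",
"\n",
"For example, the recorded output of the cell shows 10 documents and a ratio of about 1.43, which corresponds to $n_t = 7$: $\\ln(10/7) \\approx 0.357$. A term that occurs in every document gets $\\ln(1) = 0$."
]
},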
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of documents: 10\n",
"documents/count 1.4285714285714286\n",
"IDF - log(documents/count) 0.3566749439387324\n"
]
}
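],
"source": [
"idf_count = 0\n",
"\n",
"# Count the documents that contain the query at least once\n",
"for name, tokens in corpus.items():\n",
"    if query in tokens:\n",
"        idf_count += 1\n",
"\n",
"# print('count:', idf_count)\n",
"\n",
"# Natural log; if no document contains the query, fall back to 0 to avoid dividing by zero\n",
"idf = log(len(corpus) / idf_count) if idf_count else 0.0\n",
"\n",
"print('Total number of documents:', len(corpus))\n",
"print('documents/count', len(corpus) / idf_count if idf_count else 'undefined (count is 0)')\n",
"print('IDF - log(documents/count)', idf)"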
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF: Term Frequency × Inverse Document Frequency"
]
},
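{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the final score, with illustrative notation that is not part of the original notebook: $\\mathrm{tf}(t, d)$ is the share of tokens in text $d$ that equal the query term $t$, and $\\mathrm{idf}(t)$ is the natural logarithm of the number of documents divided by the number of documents containing $t$. The two are multiplied:\n",
"\n",
"$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\mathrm{idf}(t)$$\n",
"\n",
"The score is 0 when the term does not occur in $d$ or when it occurs in every document, and it grows when the term is frequent in $d$ but rare across the corpus."
]
},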
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"tfidf_value = tf * idf"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TF-IDF: 0.0006138983544556496\n"
]
}
],
"source": [
"print('TF-IDF:', tfidf_value)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}