rita

Categorization_of_Files

Experimenting with automated classification of text files. The script searches for the most common words in the text and tries to match these words to a category.

rita 29636e5764 Upload files to ''		4 years ago
1600px-Common_words.png	Upload files to ''	4 years ago
README.md	Upload files to ''	4 years ago
anarchist_cookbook.txt	files	5 years ago
categorization.py	files	5 years ago
example.png	files	5 years ago
fifty_shades.txt	files	5 years ago
library_studies.txt	files	5 years ago
lolita.txt	files	5 years ago
mein_kampf.txt	files	5 years ago
own_nothing.txt	files	5 years ago
prideandprejudice.txt	files	5 years ago
stopwords.txt	files	5 years ago

README.md

Categorisation of text files

The actions of categorising and cataloging happen in the most mundane activities, but they are not innocent. They translate values and certain visions of the world.

At the Rietveld Academy Library, we saw how the librarians are challenging the Library of Congress classification. With Dušan, we browsed the Monoskop Index, an interesting combination of a “book index, library catalog, and tag cloud”.

With this script, I was experimenting with automated classification of text files. The script finds the three most common words in a text and tries to match them to a category. For example, if one of the most common words is “books”, the text is assigned to the category “Library Studies”. The same happens with the words “archives”, “author”, “bibliographic”, “bibliotheca”, “book”, “bookcase”, etc. The script has only one category right now, but it would be easy to add more. In doing so, I would be making associations that are very personal and sometimes inaccurate, and I would be building a bias into the catalog.
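
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `categorization.py`: the names `CATEGORIES`, `most_common_words` and `categorize` are hypothetical, the keyword set contains only the words quoted in this README, and filtering with a stopword set is an assumption based on the presence of `stopwords.txt` in the repository.

```python
import re
from collections import Counter

# Hypothetical keyword table: one category, using only the words quoted in the README.
CATEGORIES = {
    "Library Studies": {
        "archives", "author", "bibliographic", "bibliotheca",
        "book", "bookcase", "books",
    },
}


def most_common_words(text, stopwords=frozenset(), n=3):
    """Return the n most frequent words in text, ignoring stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(n)]


def categorize(text, stopwords=frozenset()):
    """Return the first category with a keyword among the top three words, else None."""
    top = most_common_words(text, stopwords)
    for category, keywords in CATEGORIES.items():
        if any(word in keywords for word in top):
            return category
    return None
```

Adding a category would mean adding another entry to `CATEGORIES`, which is exactly where the personal associations and the bias come in: whoever writes the keyword lists decides what each text is "about".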