TEST CORPORA
============
The corpora are for testing and evaluating the functionality of the Pattern module. They are not the original corpora but samples that have been reduced in size and/or balanced. The original corpora can be found by following the links below.
The corpora are meant for personal use; they are not covered by the module's BSD license.
1) Through the Looking-Glass, written by Lewis Carroll
- carroll-lookingglass.docx
- http://www.gutenberg.org/
- Chapter 1 of Through the Looking-Glass in Office Open XML format.
2) Alice in Wonderland, written by Lewis Carroll
- carroll-wonderland.pdf
- http://www.gutenberg.org/
- Full text of Alice in Wonderland in PDF format.
3) Clough & Stevenson's plagiarism corpus
- plagiarism-clough&stevenson.csv
- http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html
- 100 texts: authentic (0), heavy (1) or light revision (2), cut & paste (3).
4) Amazon.de German book reviews
- polarity-de-amazon.csv
- http://www.amazon.de/gp/bestsellers/books/
- 100 "positive" and 100 "negative" book reviews.
5) Amazon.fr French book reviews
- polarity-fr-amazon.csv
- http://www.amazon.fr/
- 750 "positive" and 750 "negative" book reviews.
6) Pang & Lee's sentence polarity dataset v1.0
- polarity-en-pang&lee1.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 2000 "positive" and 2000 "negative" sentences.
7) Pang & Lee's polarity dataset v2.0
- polarity-en-pang&lee2.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 750 "positive" and 750 "negative" movie reviews.
8) Bol.com Dutch book reviews
- polarity-nl-bol.com.csv
- http://www.bol.com/nl/m/nederlandse-boeken/literatuur/
- 1500 "positive" and 1500 "negative" book reviews.
9) German portion of Tiger Treebank (Brants et al.)
- tagged-de-tiger.txt
- http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora
- 250 German sentences with STTS part-of-speech tags.
10) English portion of Open American National Corpus (Ide et al.)
- tagged-en-oanc.txt
- http://www.anc.org/data/oanc/
- 1000 English sentences with Penn Treebank part-of-speech tags.
11) English portion of Penn Treebank (Marcus et al.)
- tagged-en-wsj.txt
- http://www.cis.upenn.edu/~treebank/home.html
- 1000 English sentences with Penn Treebank part-of-speech tags.
12) Spanish portion of Wikicorpus v.1.0 (Reese & Boleda et al.)
- tagged-es-wikicorpus.txt
- http://www.lsi.upc.edu/~nlp/wikicorpus/
- 1000 Spanish sentences with Parole part-of-speech tags.
13) Italian portion of WaCKy Corpus (Baroni et al.)
- tagged-it-wacky.txt
- http://wacky.sslmit.unibo.it/doku.php?id=corpora
- 1000 Italian sentences with Penn Treebank II part-of-speech tags.
14) Dutch portion of Twente Nieuws Corpus (Ordelman et al.)
- tagged-nl-twnc.txt
- http://hmi.ewi.utwente.nl/TwNC
- 1000 Dutch sentences with Wotan part-of-speech tags.
15) Apache SpamAssassin public mail corpus
- spam-apache.csv
- http://spamassassin.apache.org/publiccorpus/
- 125 "spam" and 125 (mostly technical) "ham" messages.
16) Birkbeck spelling error corpus
- spelling-birkbeck.csv
- http://www.ota.ox.ac.uk/headers/0643.xml
- 500 words and how they are commonly misspelled.
17) CoNLL 2010 Shared Task 1 - Wikipedia uncertainty
- uncertainty-conll2010.csv
- http://www.inf.u-szeged.hu/rgai/conll2010st/tasks.html#task1
- 1500 "certain" and 1500 "uncertain" Wikipedia sentences.
18) Celex 2.5 German word forms
- wordforms-de-celex.csv
- http://celex.mpi.nl/
- 250 singular nouns and their plural form.
- 250 predicative adjectives and their attributive form.
19) Celex 2.5 English word forms
- wordforms-en-celex.csv
- http://celex.mpi.nl/
- 4000 singular nouns and their plural form.
20) Celex 2.5 Dutch word forms
- wordforms-nl-celex.csv
- http://celex.mpi.nl/
- 1000 singular nouns and their plural form.
- 1000 predicative adjectives and their attributive form.
21) Davies Corpus del Español word forms
- wordforms-es-davies.csv
- http://www.wordfrequency.info/files/spanish/spanish_lemmas20k.txt
- 3000 word forms with lemma, part-of-speech and frequency.
22) Wiktionary Italian word forms
- wordforms-it-wiktionary.csv
- https://en.wiktionary.org/wiki/Category:Italian_language
- 2000 word forms with lemma, part-of-speech and gender.
23) Lexique 3 French word forms
- wordforms-fr-lexique.csv
- http://www.lexique.org/
- 2000 word forms with lemma and part-of-speech.
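
Several of the CSV samples above are described as balanced (for example, 1500 "positive" and 1500 "negative" reviews). The following is a minimal sketch of how that balance could be checked with Python's standard csv module. It assumes that each row carries its class label in the last column; this README does not document the column layout, so label_column may need adjusting. The helper names label_counts and label_counts_from_file are hypothetical and are not part of the Pattern module.

```python
import csv
from collections import Counter


def label_counts(rows, label_column=-1):
    """Count how often each label occurs in an iterable of CSV rows.

    ASSUMPTION: the label sits in the last column (label_column=-1).
    The README does not state the column order, so adjust if needed.
    """
    counts = Counter()
    for row in rows:
        if not row:  # csv.reader yields an empty list for a blank line
            continue
        counts[row[label_column].strip()] += 1
    return counts


def label_counts_from_file(path, label_column=-1, encoding="utf-8"):
    """Open a CSV corpus file and count its labels.

    ASSUMPTION: the file is UTF-8 encoded; pass another encoding if not.
    """
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding=encoding) as f:
        return label_counts(csv.reader(f), label_column)
```

If the column assumption holds, calling label_counts_from_file("polarity-nl-bol.com.csv") should report two labels with 1500 rows each, matching the description in entry 8.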