TEST CORPORA
============

The corpora are intended for testing and evaluating the functionality of the Pattern module. They are not the original corpora, but samples that have been reduced in size and/or balanced. The original corpora can be found by following the links below.

The corpora are meant for personal use; they are not part of the module's BSD license.

1) Through the Looking-Glass, written by Lewis Carroll
- carroll-lookingglass.docx
- http://www.gutenberg.org/
- Chapter 1 of Through the Looking-Glass in Office Open XML format.

2) Alice in Wonderland, written by Lewis Carroll
- carroll-wonderland.pdf
- http://www.gutenberg.org/
- Full text of Alice in Wonderland in PDF format.

3) Clough & Stevenson's plagiarism corpus
- plagiarism-clough&stevenson.csv
- http://ir.shef.ac.uk/cloughie/resources/plagiarism_corpus.html
- 100 texts: authentic (0), heavy revision (1), light revision (2), cut & paste (3).

4) Amazon.de German book reviews
- polarity-de-amazon.csv
- http://www.amazon.de/gp/bestsellers/books/
- 100 "positive" and 100 "negative" book reviews.

5) Amazon.fr French book reviews
- polarity-fr-amazon.csv
- http://www.amazon.fr/
- 750 "positive" and 750 "negative" book reviews.

6) Pang & Lee's sentence polarity dataset v1.0
- polarity-en-pang&lee1.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 2000 "positive" and 2000 "negative" sentences.

7) Pang & Lee's polarity dataset v2.0
- polarity-en-pang&lee2.csv
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- 750 "positive" and 750 "negative" movie reviews.

8) Bol.com Dutch book reviews
- polarity-nl-bol.com.csv
- http://www.bol.com/nl/m/nederlandse-boeken/literatuur/
- 1500 "positive" and 1500 "negative" book reviews.

9) German portion of Tiger Treebank (Brants et al.)
- tagged-de-tiger.txt
- http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora
- 250 German sentences with STTS part-of-speech tags.

10) English portion of Open American National Corpus (Ide et al.)
- tagged-en-oanc.txt
- http://www.anc.org/data/oanc/
- 1000 English sentences with Penn Treebank part-of-speech tags.

11) English portion of Penn Treebank (Marcus et al.)
- tagged-en-wsj.txt
- http://www.cis.upenn.edu/~treebank/home.html
- 1000 English sentences with Penn Treebank part-of-speech tags.

12) Spanish portion of Wikicorpus v.1.0 (Reese & Boleda et al.)
- tagged-es-wikicorpus.txt
- http://www.lsi.upc.edu/~nlp/wikicorpus/
- 1000 Spanish sentences with Parole part-of-speech tags.

13) Italian portion of WaCKy Corpus (Baroni et al.)
- tagged-it-wacky.txt
- http://wacky.sslmit.unibo.it/doku.php?id=corpora
- 1000 Italian sentences with Penn Treebank II part-of-speech tags.

14) Dutch portion of Twente Nieuws Corpus (Ordelman et al.)
- tagged-nl-twnc.txt
- http://hmi.ewi.utwente.nl/TwNC
- 1000 Dutch sentences with Wotan part-of-speech tags.

15) Apache SpamAssassin public mail corpus
- spam-apache.csv
- http://spamassassin.apache.org/publiccorpus/
- 125 "spam" and 125 (mostly technical) "ham" messages.

16) Birkbeck spelling error corpus
- spelling-birkbeck.csv
- http://www.ota.ox.ac.uk/headers/0643.xml
- 500 words and how they are commonly misspelled.

17) CoNLL 2010 Shared Task 1 - Wikipedia uncertainty
- uncertainty-conll2010.csv
- http://www.inf.u-szeged.hu/rgai/conll2010st/tasks.html#task1
- 1500 "certain" and 1500 "uncertain" Wikipedia sentences.

18) Celex 2.5 German word forms
- wordforms-de-celex.csv
- http://celex.mpi.nl/
- 250 singular nouns and their plural forms.
- 250 predicative adjectives and their attributive forms.

19) Celex 2.5 English word forms
- wordforms-en-celex.csv
- http://celex.mpi.nl/
- 4000 singular nouns and their plural forms.

20) Celex 2.5 Dutch word forms
- wordforms-nl-celex.csv
- http://celex.mpi.nl/
- 1000 singular nouns and their plural forms.
- 1000 predicative adjectives and their attributive forms.

21) Davies Corpus del Español word forms
- wordforms-es-davies.csv
- http://www.wordfrequency.info/files/spanish/spanish_lemmas20k.txt
- 3000 word forms with lemma, part-of-speech and frequency.

22) Wiktionary Italian word forms
- wordforms-it-wiktionary.csv
- https://en.wiktionary.org/wiki/Category:Italian_language
- 2000 word forms with lemma, part-of-speech and gender.

23) Lexique 3 French word forms
- wordforms-fr-lexique.csv
- http://www.lexique.org/
- 2000 word forms with lemma and part-of-speech.
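
Several of the CSV samples above are described as balanced (for example, equal numbers of "positive" and "negative" reviews). A quick way to check that on a local copy might look like the sketch below. It uses only Python's standard csv module. The column layout is an assumption (the label is taken from the last column), since the files' layout is not documented here, and count_labels and read_rows are hypothetical helper names. The sample rows are stand-in data, not taken from any corpus.

```python
import csv
import io
from collections import Counter

def read_rows(stream):
    """Parse all CSV rows from an open text stream."""
    return list(csv.reader(stream))

def count_labels(rows, label_column=-1):
    """Tally how often each label occurs in a list of CSV rows.

    label_column is an assumption: which column holds the label is not
    documented in this README, so inspect the file before relying on it.
    """
    return Counter(row[label_column] for row in rows if row)

# Stand-in data for illustration only (not taken from a corpus file).
sample = io.StringIO(
    '"a fine book",positive\n'
    '"a dull book",negative\n'
    '"great",positive\n'
)
counts = count_labels(read_rows(sample))
print(counts["positive"], counts["negative"])
```

To run the same check on a real file, pass a stream opened with open(path, newline="") to read_rows. The encoding of each file may need to be given explicitly.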