You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

78 lines
2.0 KiB
Plaintext

.. Copyright (C) 2001-2020 NLTK Project
5 years ago
.. For license information, see LICENSE.TXT
==========
Stemmers
==========
Overview
~~~~~~~~
Stemmers remove morphological affixes from words, leaving only the
word stem.
>>> from nltk.stem import *
Unit tests for the Porter stemmer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> from nltk.stem.porter import *
Create a new Porter stemmer.
>>> stemmer = PorterStemmer()
Test the stemmer on various pluralised words.
>>> plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
... 'died', 'agreed', 'owned', 'humbled', 'sized',
... 'meeting', 'stating', 'siezing', 'itemization',
... 'sensational', 'traditional', 'reference', 'colonizer',
... 'plotted']
>>> singles = [stemmer.stem(plural) for plural in plurals]
>>> print(' '.join(singles)) # doctest: +NORMALIZE_WHITESPACE
caress fli die mule deni die agre own humbl size meet
state siez item sensat tradit refer colon plot
Unit tests for Snowball stemmer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> from nltk.stem.snowball import SnowballStemmer
See which languages are supported.
>>> print(" ".join(SnowballStemmer.languages))
arabic danish dutch english finnish french german hungarian italian
norwegian porter portuguese romanian russian spanish swedish
Create a new instance of a language specific subclass.
>>> stemmer = SnowballStemmer("english")
Stem a word.
>>> print(stemmer.stem("running"))
run
Decide not to stem stopwords.
>>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
>>> print(stemmer.stem("having"))
have
>>> print(stemmer2.stem("having"))
having
The 'english' stemmer is better than the original 'porter' stemmer.
>>> print(SnowballStemmer("english").stem("generously"))
generous
>>> print(SnowballStemmer("porter").stem("generously"))
gener
.. note::
Extra stemmer tests can be found in `nltk.test.unit.test_stem`.