xref: /Lucene/lucene/analysis/stempel/src/java/overview.html (revision e290f91bb233f33cde4b2249d676298d5740e8b1)
1e8e4245dSRobert Muir<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2e8e4245dSRobert Muir<!--
3e8e4245dSRobert Muir Licensed to the Apache Software Foundation (ASF) under one or more
4e8e4245dSRobert Muir contributor license agreements.  See the NOTICE file distributed with
5e8e4245dSRobert Muir this work for additional information regarding copyright ownership.
6e8e4245dSRobert Muir The ASF licenses this file to You under the Apache License, Version 2.0
7e8e4245dSRobert Muir (the "License"); you may not use this file except in compliance with
8e8e4245dSRobert Muir the License.  You may obtain a copy of the License at
9e8e4245dSRobert Muir
10e8e4245dSRobert Muir     http://www.apache.org/licenses/LICENSE-2.0
11e8e4245dSRobert Muir
12e8e4245dSRobert Muir Unless required by applicable law or agreed to in writing, software
13e8e4245dSRobert Muir distributed under the License is distributed on an "AS IS" BASIS,
14e8e4245dSRobert Muir WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15e8e4245dSRobert Muir See the License for the specific language governing permissions and
16e8e4245dSRobert Muir limitations under the License.
17e8e4245dSRobert Muir-->
18e8e4245dSRobert Muir<html>
19e8e4245dSRobert Muir<head>
20e8e4245dSRobert Muir  <meta content="text/html; charset=UTF-8" http-equiv="content-type">
21e8e4245dSRobert Muir  <title>Stempel - Algorithmic Stemmer for Polish Language</title>
22e8e4245dSRobert Muir  <meta content="Andrzej Bialecki" name="author">
23e8e4245dSRobert Muir  <meta name="keywords"
24e8e4245dSRobert Muir content="stemming, stemmer, algorithmic stemmer, Polish stemmer">
25e8e4245dSRobert Muir  <meta
26e8e4245dSRobert Muir content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables."
27e8e4245dSRobert Muir name="description">
28e8e4245dSRobert Muir</head>
29e8e4245dSRobert Muir<body style="font-family: Arial,SansSerif;">
30e8e4245dSRobert Muir<h1><i>Stempel</i> - Algorithmic Stemmer for Polish Language</h1>
31e8e4245dSRobert Muir<h2>Introduction</h2>
32e8e4245dSRobert Muir<p>A method for conflation of different inflected word forms is an
33e8e4245dSRobert Muirimportant component of many Information Retrieval systems. It helps to
34e8e4245dSRobert Muirimprove the system's recall and can significantly reduce the index
35e8e4245dSRobert Muirsize. This is especially true for highly-inflectional languages like
36e8e4245dSRobert Muirthose from the Slavic language family (Czech, Slovak, Polish, Russian,
37e8e4245dSRobert MuirBulgarian, etc).</p>
38e8e4245dSRobert Muir<p>This page describes a software package consisting of high-quality
39e8e4245dSRobert Muirstemming tables for Polish, and a universal algorithmic stemmer, which
40e8e4245dSRobert Muiroperates using these tables. The stemmer code is taken virtually
41e8e4245dSRobert Muirunchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p>
42e8e4245dSRobert Muir<p>The software distribution includes stemmer
43e8e4245dSRobert Muirtables prepared using an extensive corpus of Polish language (see
44e8e4245dSRobert Muirdetails below).</p>
45e8e4245dSRobert Muir<p>This work is available under Apache-style Open Source license - the
46e8e4245dSRobert Muirstemmer code is covered by Egothor License, the tables and other
47e8e4245dSRobert Muiradditions are covered by Apache License 2.0. Both licenses allow to use
48e8e4245dSRobert Muirthe code in Open Source as well as commercial (closed source) projects.</p>
49e8e4245dSRobert Muir<h3>Terminology</h3>
50e8e4245dSRobert Muir<p>A short explanation is in order about the terminology used in this
51e8e4245dSRobert Muirtext.</p>
52e8e4245dSRobert Muir<p>In the following sections I make a distinction between <b>stem</b>
53e8e4245dSRobert Muirand <b>lemma</b>.</p>
54e8e4245dSRobert Muir<p>Lemma is a base grammatical form (dictionary form, headword) of a
55e8e4245dSRobert Muirword. Lemma is an existing, grammatically correct word in some human
56e8e4245dSRobert Muirlanguage.</p>
57e8e4245dSRobert Muir<p>Stem on the other hand is just a unique token, not necessarily
58e8e4245dSRobert Muirmaking any sense in any human language, but which can serve as a unique
59e8e4245dSRobert Muirlabel instead of lemma for the same set of inflected forms. Quite often
60e8e4245dSRobert Muirstem is referred to as a "root" of the word - which is incorrect and
61e8e4245dSRobert Muirmisleading (stems sometimes have very little to do with the linguistic
62e8e4245dSRobert Muirroot of a word, i.e. a pattern found in a word which is common to all
63e8e4245dSRobert Muirinflected forms or within a family of languages).</p>
64e8e4245dSRobert Muir<p>For an IR system stems are usually sufficient, for a morphological
65e8e4245dSRobert Muiranalysis system obviously lemmas are a must. In practice, various
66e8e4245dSRobert Muirstemmers produce a mix of stems and lemmas, as is the case with the
67e8e4245dSRobert Muirstemmer described here. Additionally, for some languages, which use
68e8e4245dSRobert Muirsuffix-based inflection rules many stemmers based on suffix-stripping
69e8e4245dSRobert Muirwill produce a large percentage of stems equivalent to lemmas. This is
70e8e4245dSRobert Muirhowever not the case for languages with complex, irregular inflection
71e8e4245dSRobert Muirrules (such as Slavic languages) - here simplistic suffix-stripping
72e8e4245dSRobert Muirstemmers produce very poor results.</p>
73e8e4245dSRobert Muir<h3>Background</h3>
74e8e4245dSRobert Muir<p>Lemmatization is a process of finding the base, non-inflected form
75e8e4245dSRobert Muirof a word. The result of lemmatization is a correct existing word,
76e8e4245dSRobert Muiroften in nominative case for nouns and infinitive form for verbs. A
77e8e4245dSRobert Muirgiven inflected form may correspond to several lemmas (e.g. "found"
78e8e4245dSRobert Muir-&gt; find, found) - the correct choice depends on the context.<br>
79e8e4245dSRobert Muir<br>
80e8e4245dSRobert MuirStemming is concerned mostly with finding a unique "root" of a word,
81e8e4245dSRobert Muirwhich not necessarily results in any existing word or lemma. The
82e8e4245dSRobert Muirquality of stemming is measured by the rate of collisions (overstemming
83e8e4245dSRobert Muir- which causes words with different lemmas to be incorrectly conflated
84e8e4245dSRobert Muirinto one "root"), and the rate of superfluous word "roots"
85e8e4245dSRobert Muir(understemming - which assigns several "roots" to words with the same
86e8e4245dSRobert Muirlemma). <br>
87e8e4245dSRobert Muir<br>
88e8e4245dSRobert MuirBoth stemmer and lemmatizer can be implemented in various ways. The two
89e8e4245dSRobert Muirmost common approaches are:<br>
90e8e4245dSRobert Muir</p>
91e8e4245dSRobert Muir<ul>
92e8e4245dSRobert Muir  <li>dictionary-based: where the stemmer uses an extensive dictionary
93e8e4245dSRobert Muirof morphological forms in order to find the corresponding stem or lemma</li>
94e8e4245dSRobert Muir  <li>algorithmic: where the stemmer uses an algorithm, based on
95e8e4245dSRobert Muirgeneral morphological properties of a given language plus a set of
96e8e4245dSRobert Muirheuristic rules<br>
97e8e4245dSRobert Muir  </li>
98e8e4245dSRobert Muir</ul>
99e8e4245dSRobert MuirThere are many existing and well-known implementations of stemmers for
100e8e4245dSRobert MuirEnglish (Porter, Lovins, Krovetz) and other European languages
101*e290f91bSDawid Weiss(<a href="https://snowballstem.org/">Snowball</a>). There are also
102e8e4245dSRobert Muirgood quality commercial lemmatizers for Polish. However, there is only
103e8e4245dSRobert Muirone
104e8e4245dSRobert Muirfreely available Polish stemmer, implemented by
105e8e4245dSRobert Muir<a
106e8e4245dSRobert Muir href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid
107e8e4245dSRobert MuirWeiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a
108e8e4245dSRobert Muir href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That
109e8e4245dSRobert Muirstemmer is dictionary-based. This means that even
110e8e4245dSRobert Muirthough it can achieve
111e8e4245dSRobert Muirperfect accuracy for previously known word forms found in its
112e8e4245dSRobert Muirdictionary, it
113e8e4245dSRobert Muircompletely fails in case of all other word forms. This deficiency is
114e8e4245dSRobert Muirsomewhat mitigated by the comprehensive dictionary distributed with
115e8e4245dSRobert Muirthis stemmer (so there is a high probability that most of the words in
116e8e4245dSRobert Muirthe input text will be found in the dictionary), however the problem
117e8e4245dSRobert Muirstill remains (please see the page above for more detailed description).<br>
118e8e4245dSRobert Muir<br>
119e8e4245dSRobert MuirThe implementation described here uses an algorithmic method. This
120e8e4245dSRobert Muirmethod
121e8e4245dSRobert Muirand particular algorithm implementation are described in detail in
122e8e4245dSRobert Muir[1][2].
123e8e4245dSRobert MuirThe main advantage of algorithmic stemmers is their ability to process
124e8e4245dSRobert Muirpreviously
125e8e4245dSRobert Muirunseen word forms with high accuracy. This particular algorithm uses a
126e8e4245dSRobert Muirset
127e8e4245dSRobert Muirof
128e8e4245dSRobert Muirtransformation rules (patch commands), which describe how a word with a
129e8e4245dSRobert Muirgiven pattern should be transformed to its stem. These rules are first
130e8e4245dSRobert Muirlearned from a training corpus. They don't
131e8e4245dSRobert Muircover
132e8e4245dSRobert Muirall possible cases, so there is always some loss of precision/recall
133e8e4245dSRobert Muir(which
134e8e4245dSRobert Muirmeans that even the words from the training corpus are sometimes
135e8e4245dSRobert Muirincorrectly stemmed).<br>
1360d339043SRobert Muir<h2>Algorithm and implementation</h2>
137e8e4245dSRobert MuirThe algorithm and its Java implementation is described in detail in the
138e8e4245dSRobert Muirpublications cited below. Here's just a short excerpt from [2]:<br>
139e8e4245dSRobert Muir<br>
1400d339043SRobert Muir<div style="width: 80%; text-align: center">"The aim is separation of the
141e8e4245dSRobert Muirstemmer execution code from the data
142e8e4245dSRobert Muirstructures [...]. In other words, a static algorithm configurable by
143e8e4245dSRobert Muirdata must be developed. The word transformations that happen in the
144e8e4245dSRobert Muirstemmer must be then encoded to the data tables.<br>
145e8e4245dSRobert Muir<br>
146e8e4245dSRobert MuirThe tacit input of our method is a sample set (a so-called dictionary)
147e8e4245dSRobert Muirof words (as keys) and their stems. Each record can be equivalently
148e8e4245dSRobert Muirstored as a key and the record of key's transformation to its
149e8e4245dSRobert Muirrespective stem. The transformation record is termed a patch command
150e8e4245dSRobert Muir(P-command). It must be ensured that P-commands are universal, and that
151e8e4245dSRobert MuirP-commands can transform any word to its stem. Our solution[6,8] is
152e8e4245dSRobert Muirbased on the Levenstein metric [10], which produces P-command as the
153e8e4245dSRobert Muirminimum cost path in a directed graph.<br>
154e8e4245dSRobert Muir<br>
155e8e4245dSRobert MuirOne can imagine the P-command as an algorithm for an operator (editor)
156e8e4245dSRobert Muirthat rewrites a string to another string. The operator can use these
157e8e4245dSRobert Muirinstructions (PP-command's): <span style="font-weight: bold;">removal </span>-
158e8e4245dSRobert Muirdeletes a sequence of characters starting at the current cursor
159e8e4245dSRobert Muirposition and moves the cursor to the next character. The length of this
160e8e4245dSRobert Muirsequence is the parameter; <span style="font-weight: bold;">insertion </span>-
161e8e4245dSRobert Muirinserts a character ch, without moving the cursor. The character ch is
162e8e4245dSRobert Muira parameter; <span style="font-weight: bold;">substitution&nbsp;</span>
163e8e4245dSRobert Muir- rewrites a character at the current cursor position to the character
164e8e4245dSRobert Muirch and moves the cursor to the next character. The character ch is a
165e8e4245dSRobert Muirparameter; <span style="font-weight: bold;">no operation</span> (NOOP)
166e8e4245dSRobert Muir- skip a sequence of characters starting at the current cursor
167e8e4245dSRobert Muirposition. The length of this sequence is the parameter.<br>
168e8e4245dSRobert Muir<br>
169e8e4245dSRobert MuirThe P-commands are applied from the end of a word (right to left). This
170e8e4245dSRobert Muirassumption can reduce the set of P-command's, because the last NOOP,
171e8e4245dSRobert Muirmoving the cursor to the end of a string without any changes, need not
172e8e4245dSRobert Muirbe stored."</div>
173e8e4245dSRobert Muir<br>
174e8e4245dSRobert MuirData structure used to keep the dictionary (words and their P-commands)
175e8e4245dSRobert Muiris a trie. Several optimization steps are applied in turn to reduce and
176e8e4245dSRobert Muiroptimize the initial trie, by eliminating useless information and
177e8e4245dSRobert Muirshortening the paths in the trie.<br>
178e8e4245dSRobert Muir<br>
179e8e4245dSRobert MuirFinally, in order to obtain a stem from the input word, the word is
180e8e4245dSRobert Muirpassed once through a matching path in the trie (applying at each node
181e8e4245dSRobert Muirthe P-commands stored there). The result is a word stem.<br>
182e8e4245dSRobert Muir<h2>Corpus</h2>
183e8e4245dSRobert Muir<p><i>(to be completed...)</i></p>
184e8e4245dSRobert Muir<p>The following Polish corpora have been used:</p>
185e8e4245dSRobert Muir<ul>
186e8e4245dSRobert Muir  <li><a
187e8e4245dSRobert Muir href="http://sourceforge.net/project/showfiles.php?group_id=49316&amp;package_id=65354">Polish
188e8e4245dSRobert Muirdictionary
189e8e4245dSRobert Muirfrom ispell distribution</a></li>
190e8e4245dSRobert Muir  <li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus
191e8e4245dSRobert Muirsłownika frekwencyjnego</a></li>
192e8e4245dSRobert Muir<!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>-->
193e8e4245dSRobert Muir<!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>--><li>The
194e8e4245dSRobert MuirBible (so called "TysiÄ…clecia") - unauthorized electronic version</li>
195e8e4245dSRobert Muir  <li><a
196e8e4245dSRobert Muir href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator
197e8e4245dSRobert Muirmorfologiczny SAM v. 3.4</a> - this was used to recover lemmas
198e8e4245dSRobert Muirmissing from other texts</li>
199e8e4245dSRobert Muir</ul>
200e8e4245dSRobert Muir<p>This step was the most time-consuming - and it would probably be
201e8e4245dSRobert Muireven more tedious and difficult if not for the
202e8e4245dSRobert Muirhelp of
203e8e4245dSRobert Muir<a href="http://www.python.org/">Python</a>. The source texts had to be
204e8e4245dSRobert Muirbrought to a common encoding (UTF-8) - some of them used quite ancient
205e8e4245dSRobert Muirencodings like Mazovia or DHN - and then scripts were written to
206e8e4245dSRobert Muircollect all lemmas and
207e8e4245dSRobert Muirinflected forms from the source texts. In cases when the source text
208e8e4245dSRobert Muirwas not
209e8e4245dSRobert Muirtagged,
210e8e4245dSRobert MuirI used the SAM analyzer to produce lemmas. In cases of ambiguous
211e8e4245dSRobert Muirlemmatization I decided to put references to inflected forms from all
212e8e4245dSRobert Muirbase forms.<br>
213e8e4245dSRobert Muir</p>
214e8e4245dSRobert Muir<p>All grammatical categories were allowed to appear in the corpus,
215e8e4245dSRobert Muiri.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting
216e8e4245dSRobert Muircorpus consisted of roughly 87,000+ inflection sets, i.e. each set
217e8e4245dSRobert Muirconsisted of one base form (lemma) and many inflected forms. However,
218e8e4245dSRobert Muirbecause of the nature of the training method I restricted these sets to
219e8e4245dSRobert Muirinclude only those where there were at least 4 inflected forms. Sets
220e8e4245dSRobert Muirwith 3 or less inflected forms were removed, so that the final corpus
221e8e4245dSRobert Muirconsisted of ~69,000 unique sets, which in turn contained ~1.5 mln
222e8e4245dSRobert Muirinflected forms. <br>
223e8e4245dSRobert Muir</p>
224e8e4245dSRobert Muir<h2>Testing</h2>
225e8e4245dSRobert Muir<p>I tested the stemmer tables produced using the implementation
226e8e4245dSRobert Muirdescribed above. The following sections give some details about
227e8e4245dSRobert Muirthe testing setup.
228e8e4245dSRobert Muir</p>
229e8e4245dSRobert Muir<h3>Testing procedure</h3>
230e8e4245dSRobert Muir<p>The testing procedure was as follows:
231e8e4245dSRobert Muir</p>
232e8e4245dSRobert Muir<ul>
233e8e4245dSRobert Muir  <li>the whole corpus of ~69,000 unique sets was shuffled, so that the
234e8e4245dSRobert Muirinput sets were in random order.</li>
235e8e4245dSRobert Muir  <li>the corpus was split into two parts - one with 30,000 sets (Part
236e8e4245dSRobert Muir1), the other with ~39,000 sets (Part 2).</li>
237e8e4245dSRobert Muir  <li>Training samples were drawn in sequential order from the Part 1.
238e8e4245dSRobert MuirSince the sets were already randomized, the training samples were also
239e8e4245dSRobert Muirrandomized, but this procedure ensured that each larger training sample
240e8e4245dSRobert Muircontained all smaller samples.</li>
241e8e4245dSRobert Muir  <li>Part 2 was used for testing. Note: this means that the testing
242e8e4245dSRobert Muirrun used <em>only</em> words previously unseen during the training
243e8e4245dSRobert Muirphase. This is the worst scenario, because it means that stemmer must
244e8e4245dSRobert Muirextrapolate the learned rules to unknown cases. This also means that in
245e8e4245dSRobert Muira real-life case (where the input is a mix between known and unknown
246e8e4245dSRobert Muirwords) the F-measure of the stemmer will be even higher than in the
247e8e4245dSRobert Muirtable below.</li>
248e8e4245dSRobert Muir</ul>
249e8e4245dSRobert Muir<h3>Test results</h3>
250e8e4245dSRobert Muir<p>The following table summarizes test results for varying sizes
251e8e4245dSRobert Muirof training samples. The meaning of the table columns is
252e8e4245dSRobert Muirdescribed below:
253e8e4245dSRobert Muir</p>
254e8e4245dSRobert Muir<ul>
255e8e4245dSRobert Muir  <li><b>training sets:</b> the number of training sets. One set
256e8e4245dSRobert Muirconsists of one lemma and at least 4 and up to ~80 inflected forms
257e8e4245dSRobert Muir(including pre- and suffixed forms).</li>
258e8e4245dSRobert Muir  <li><b>testing forms:</b> the number of testing forms. Only inflected
259e8e4245dSRobert Muirforms were used in testing.</li>
260e8e4245dSRobert Muir  <li><b>stem OK:</b> the number of cases when produced output was a
261e8e4245dSRobert Muircorrect (unique) stem. Note: quite often correct stems were also
262e8e4245dSRobert Muircorrect lemmas.</li>
263e8e4245dSRobert Muir  <li><b>lemma OK:</b> the number of cases when produced output was a
264e8e4245dSRobert Muircorrect lemma.</li>
265e8e4245dSRobert Muir  <li><b>missing:</b> the number of cases when stemmer was unable to
266e8e4245dSRobert Muirprovide any output.</li>
267e8e4245dSRobert Muir  <li><b>stem bad:</b> the number of cases when produced output was a
268e8e4245dSRobert Muirstem, but already in use identifying a different set.</li>
269e8e4245dSRobert Muir  <li><b>lemma bad:</b> the number of cases when produced output was an
270e8e4245dSRobert Muirincorrect lemma. Note: quite often in such case the output was a
271e8e4245dSRobert Muircorrect stem.</li>
272e8e4245dSRobert Muir  <li><b>table size:</b> the size in bytes of the stemmer table.</li>
273e8e4245dSRobert Muir</ul>
2740d339043SRobert Muir<table class="padding2" style="border: 1px solid; border-spacing: 0px; border-collapse: separate">
275f41eabdcSRobert Muir  <caption>test results for different sizes of training samples</caption>
276e8e4245dSRobert Muir  <tbody>
2770d339043SRobert Muir    <tr style="background-color: #a0b0c0">
278e8e4245dSRobert Muir      <th>Training sets</th>
279e8e4245dSRobert Muir      <th>Testing forms</th>
280e8e4245dSRobert Muir      <th>Stem OK</th>
281e8e4245dSRobert Muir      <th>Lemma OK</th>
282e8e4245dSRobert Muir      <th>Missing</th>
283e8e4245dSRobert Muir      <th>Stem Bad</th>
284e8e4245dSRobert Muir      <th>Lemma Bad</th>
285e8e4245dSRobert Muir      <th>Table size [B]</th>
286e8e4245dSRobert Muir    </tr>
2870d339043SRobert Muir    <tr style="text-align: right">
288e8e4245dSRobert Muir      <td>100</td>
289e8e4245dSRobert Muir      <td>1022985</td>
290e8e4245dSRobert Muir      <td>842209</td>
291e8e4245dSRobert Muir      <td>593632</td>
292e8e4245dSRobert Muir      <td>172711</td>
293e8e4245dSRobert Muir      <td>22331</td>
294e8e4245dSRobert Muir      <td>256642</td>
295e8e4245dSRobert Muir      <td>28438</td>
296e8e4245dSRobert Muir    </tr>
2970d339043SRobert Muir    <tr style="text-align: right">
298e8e4245dSRobert Muir      <td>200</td>
299e8e4245dSRobert Muir      <td>1022985</td>
300e8e4245dSRobert Muir      <td>862789</td>
301e8e4245dSRobert Muir      <td>646488</td>
302e8e4245dSRobert Muir      <td>153288</td>
303e8e4245dSRobert Muir      <td>16306</td>
304e8e4245dSRobert Muir      <td>223209</td>
305e8e4245dSRobert Muir      <td>48660</td>
306e8e4245dSRobert Muir    </tr>
3070d339043SRobert Muir    <tr style="text-align: right">
308e8e4245dSRobert Muir      <td>500</td>
309e8e4245dSRobert Muir      <td>1022985</td>
310e8e4245dSRobert Muir      <td>885786</td>
311e8e4245dSRobert Muir      <td>685009</td>
312e8e4245dSRobert Muir      <td>130772</td>
313e8e4245dSRobert Muir      <td>14856</td>
314e8e4245dSRobert Muir      <td>207204</td>
315e8e4245dSRobert Muir      <td>108798</td>
316e8e4245dSRobert Muir    </tr>
3170d339043SRobert Muir    <tr style="text-align: right">
318e8e4245dSRobert Muir      <td>700</td>
319e8e4245dSRobert Muir      <td>1022985</td>
320e8e4245dSRobert Muir      <td>909031</td>
321e8e4245dSRobert Muir      <td>704609</td>
322e8e4245dSRobert Muir      <td>107084</td>
323e8e4245dSRobert Muir      <td>15442</td>
324e8e4245dSRobert Muir      <td>211292</td>
325e8e4245dSRobert Muir      <td>139291</td>
326e8e4245dSRobert Muir    </tr>
3270d339043SRobert Muir    <tr style="text-align: right">
328e8e4245dSRobert Muir      <td>1000</td>
329e8e4245dSRobert Muir      <td>1022985</td>
330e8e4245dSRobert Muir      <td>926079</td>
331e8e4245dSRobert Muir      <td>725720</td>
332e8e4245dSRobert Muir      <td>90117</td>
333e8e4245dSRobert Muir      <td>14941</td>
334e8e4245dSRobert Muir      <td>207148</td>
335e8e4245dSRobert Muir      <td>183677</td>
336e8e4245dSRobert Muir    </tr>
3370d339043SRobert Muir    <tr style="text-align: right">
338e8e4245dSRobert Muir      <td>2000</td>
339e8e4245dSRobert Muir      <td>1022985</td>
340e8e4245dSRobert Muir      <td>942886</td>
341e8e4245dSRobert Muir      <td>746641</td>
342e8e4245dSRobert Muir      <td>73429</td>
343e8e4245dSRobert Muir      <td>14903</td>
344e8e4245dSRobert Muir      <td>202915</td>
345e8e4245dSRobert Muir      <td>313516</td>
346e8e4245dSRobert Muir    </tr>
3470d339043SRobert Muir    <tr style="text-align: right">
348e8e4245dSRobert Muir      <td>5000</td>
349e8e4245dSRobert Muir      <td>1022985</td>
350e8e4245dSRobert Muir      <td>954721</td>
351e8e4245dSRobert Muir      <td>759930</td>
352e8e4245dSRobert Muir      <td>61476</td>
353e8e4245dSRobert Muir      <td>14817</td>
354e8e4245dSRobert Muir      <td>201579</td>
355e8e4245dSRobert Muir      <td>640969</td>
356e8e4245dSRobert Muir    </tr>
3570d339043SRobert Muir    <tr style="text-align: right">
358e8e4245dSRobert Muir      <td>7000</td>
359e8e4245dSRobert Muir      <td>1022985</td>
360e8e4245dSRobert Muir      <td>956165</td>
361e8e4245dSRobert Muir      <td>764033</td>
362e8e4245dSRobert Muir      <td>60364</td>
363e8e4245dSRobert Muir      <td>14620</td>
364e8e4245dSRobert Muir      <td>198588</td>
365e8e4245dSRobert Muir      <td>839347</td>
366e8e4245dSRobert Muir    </tr>
3670d339043SRobert Muir    <tr style="text-align: right">
368e8e4245dSRobert Muir      <td>10000</td>
369e8e4245dSRobert Muir      <td>1022985</td>
370e8e4245dSRobert Muir      <td>965427</td>
371e8e4245dSRobert Muir      <td>775507</td>
372e8e4245dSRobert Muir      <td>50797</td>
373e8e4245dSRobert Muir      <td>14662</td>
374e8e4245dSRobert Muir      <td>196681</td>
375e8e4245dSRobert Muir      <td>1144537</td>
376e8e4245dSRobert Muir    </tr>
3770d339043SRobert Muir    <tr style="text-align: right">
378e8e4245dSRobert Muir      <td>12000</td>
379e8e4245dSRobert Muir      <td>1022985</td>
380e8e4245dSRobert Muir      <td>967664</td>
381e8e4245dSRobert Muir      <td>782143</td>
382e8e4245dSRobert Muir      <td>48722</td>
383e8e4245dSRobert Muir      <td>14284</td>
384e8e4245dSRobert Muir      <td>192120</td>
385e8e4245dSRobert Muir      <td>1313508</td>
386e8e4245dSRobert Muir    </tr>
3870d339043SRobert Muir    <tr style="text-align: right">
388e8e4245dSRobert Muir      <td>15000</td>
389e8e4245dSRobert Muir      <td>1022985</td>
390e8e4245dSRobert Muir      <td>973188</td>
391e8e4245dSRobert Muir      <td>788867</td>
392e8e4245dSRobert Muir      <td>43247</td>
393e8e4245dSRobert Muir      <td>14349</td>
394e8e4245dSRobert Muir      <td>190871</td>
395e8e4245dSRobert Muir      <td>1567902</td>
396e8e4245dSRobert Muir    </tr>
3970d339043SRobert Muir    <tr style="text-align: right">
398e8e4245dSRobert Muir      <td>17000</td>
399e8e4245dSRobert Muir      <td>1022985</td>
400e8e4245dSRobert Muir      <td>974203</td>
401e8e4245dSRobert Muir      <td>791804</td>
402e8e4245dSRobert Muir      <td>42319</td>
403e8e4245dSRobert Muir      <td>14333</td>
404e8e4245dSRobert Muir      <td>188862</td>
405e8e4245dSRobert Muir      <td>1733957</td>
406e8e4245dSRobert Muir    </tr>
4070d339043SRobert Muir    <tr style="text-align: right">
408e8e4245dSRobert Muir      <td>20000</td>
409e8e4245dSRobert Muir      <td>1022985</td>
410e8e4245dSRobert Muir      <td>976234</td>
411e8e4245dSRobert Muir      <td>791554</td>
412e8e4245dSRobert Muir      <td>40058</td>
413e8e4245dSRobert Muir      <td>14601</td>
414e8e4245dSRobert Muir      <td>191373</td>
415e8e4245dSRobert Muir      <td>1977615</td>
416e8e4245dSRobert Muir    </tr>
417e8e4245dSRobert Muir  </tbody>
418e8e4245dSRobert Muir</table>
419e8e4245dSRobert Muir<p>I also measured the time to produce a stem (which involves
420e8e4245dSRobert Muirtraversing a trie,
421e8e4245dSRobert Muirretrieving a patch command and applying the patch command to the input
422e8e4245dSRobert Muirstring).
423e8e4245dSRobert MuirOn a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
424e8e4245dSRobert MuirHotSpot),
425e8e4245dSRobert Muirfor tables ranging in size from 1,000 to 20,000 cells, the time to
426e8e4245dSRobert Muirproduce a
427e8e4245dSRobert Muirsingle stem varies between 5-10 microseconds.<br>
428e8e4245dSRobert Muir</p>
429e8e4245dSRobert Muir<p>This means that the stemmer can process up to <span
430e8e4245dSRobert Muir style="font-weight: bold;">200,000 words per second</span>, an
431e8e4245dSRobert Muiroutstanding result when compared to other stemmers (Morfeusz - ~2,000
432e8e4245dSRobert Muirw/s, FormAN (MS Word analyzer) - ~1,000 w/s).<br>
433e8e4245dSRobert Muir</p>
434e8e4245dSRobert Muir<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>,
435e8e4245dSRobert Muirwhich you can use to produce reports
436e8e4245dSRobert Muirlike the one below:<br>
437e8e4245dSRobert Muir</p>
438e8e4245dSRobert Muir<pre>--------- Stemmer benchmark report: -----------<br>Stemmer table:  /res/tables/stemmer_2000.out<br>Input file:     ../test3.txt<br>Number of runs: 3<br><br> RUN NUMBER:            1       2       3<br> Total input words      1378176 1378176 1378176<br> Missed output words    112     112     112<br> Time elapsed [ms]      6989    6940    6640<br> Hit rate percent       99.99%  99.99%  99.99%<br> Miss rate percent      00.01%  00.01%  00.01%<br> Words per second       197192  198584  207557<br> Time per word [us]     5.07    5.04    4.82<br></pre>
439e8e4245dSRobert Muir<h2>Summary</h2>
440e8e4245dSRobert Muir<p>The results of these tests are very encouraging. It seems that using
441e8e4245dSRobert Muirthe
442e8e4245dSRobert Muirtraining corpus and the stemming algorithm described above results in a
443e8e4245dSRobert Muirhigh-quality stemmer useful for most applications. Moreover, it can
444e8e4245dSRobert Muiralso
445e8e4245dSRobert Muirbe used as a better than average lemmatizer.</p>
446e8e4245dSRobert Muir<p>Both the author of the implementation
447e8e4245dSRobert Muir(Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
448e8e4245dSRobert Muirof this
449e8e4245dSRobert Muircompilation (Andrzej Bialecki &lt;ab AT getopt DOT org&gt;) would
450e8e4245dSRobert Muirappreciate any
451e8e4245dSRobert Muirfeedback and suggestions for further improvements.</p>
452e8e4245dSRobert Muir<h2>Bibliography</h2>
453e8e4245dSRobert Muir<ol>
454e8e4245dSRobert Muir  <li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
455e8e4245dSRobert MuirThesis,
456e8e4245dSRobert MuirFaculty of Mathematics and Physics, Charles University in Prague, in
457e8e4245dSRobert Muirpress.</li>
458e8e4245dSRobert Muir  <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
459e8e4245dSRobert MuirIntelligent Information Processing and Web Mining Conference, 2004,
460e8e4245dSRobert MuirZakopane, Poland.</li>
461e8e4245dSRobert Muir  <li>Galambos, L.: Lemmatizer for Document Information Retrieval
4620d339043SRobert MuirSystems in JAVA.<a
463e8e4245dSRobert Muir class="moz-txt-link-rfc2396E"
464e8e4245dSRobert Muir href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
465e8e4245dSRobert MuirSOFSEM 2001, Piestany, Slovakia. <br>
466e8e4245dSRobert Muir  </li>
467e8e4245dSRobert Muir</ol>
468e8e4245dSRobert Muir<br>
469e8e4245dSRobert Muir<br>
470e8e4245dSRobert Muir</body>
471e8e4245dSRobert Muir</html>
472