xref: /Lucene/lucene/analysis/stempel/src/java/overview.html (revision e8e4245d9b36123446546ff15967ac95429ea2b0)
1*e8e4245dSRobert Muir<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2*e8e4245dSRobert Muir<!--
3*e8e4245dSRobert Muir Licensed to the Apache Software Foundation (ASF) under one or more
4*e8e4245dSRobert Muir contributor license agreements.  See the NOTICE file distributed with
5*e8e4245dSRobert Muir this work for additional information regarding copyright ownership.
6*e8e4245dSRobert Muir The ASF licenses this file to You under the Apache License, Version 2.0
7*e8e4245dSRobert Muir (the "License"); you may not use this file except in compliance with
8*e8e4245dSRobert Muir the License.  You may obtain a copy of the License at
9*e8e4245dSRobert Muir
10*e8e4245dSRobert Muir     http://www.apache.org/licenses/LICENSE-2.0
11*e8e4245dSRobert Muir
12*e8e4245dSRobert Muir Unless required by applicable law or agreed to in writing, software
13*e8e4245dSRobert Muir distributed under the License is distributed on an "AS IS" BASIS,
14*e8e4245dSRobert Muir WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15*e8e4245dSRobert Muir See the License for the specific language governing permissions and
16*e8e4245dSRobert Muir limitations under the License.
17*e8e4245dSRobert Muir-->
18*e8e4245dSRobert Muir<html>
19*e8e4245dSRobert Muir<head>
20*e8e4245dSRobert Muir  <meta content="text/html; charset=UTF-8" http-equiv="content-type">
21*e8e4245dSRobert Muir  <title>Stempel - Algorithmic Stemmer for Polish Language</title>
22*e8e4245dSRobert Muir  <meta content="Andrzej Bialecki" name="author">
23*e8e4245dSRobert Muir  <meta name="keywords"
24*e8e4245dSRobert Muir content="stemming, stemmer, algorithmic stemmer, Polish stemmer">
25*e8e4245dSRobert Muir  <meta
26*e8e4245dSRobert Muir content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables."
27*e8e4245dSRobert Muir name="description">
28*e8e4245dSRobert Muir</head>
29*e8e4245dSRobert Muir<body style="font-family: Arial,SansSerif;">
30*e8e4245dSRobert Muir<h1><i>Stempel</i> - Algorithmic Stemmer for Polish Language</h1>
31*e8e4245dSRobert Muir<h2>Introduction</h2>
32*e8e4245dSRobert Muir<p>A method for conflation of different inflected word forms is an
33*e8e4245dSRobert Muirimportant component of many Information Retrieval systems. It helps to
34*e8e4245dSRobert Muirimprove the system's recall and can significantly reduce the index
35*e8e4245dSRobert Muirsize. This is especially true for highly-inflectional languages like
36*e8e4245dSRobert Muirthose from the Slavic language family (Czech, Slovak, Polish, Russian,
37*e8e4245dSRobert MuirBulgarian, etc).</p>
38*e8e4245dSRobert Muir<p>This page describes a software package consisting of high-quality
39*e8e4245dSRobert Muirstemming tables for Polish, and a universal algorithmic stemmer, which
40*e8e4245dSRobert Muiroperates using these tables. The stemmer code is taken virtually
41*e8e4245dSRobert Muirunchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p>
42*e8e4245dSRobert Muir<p>The software distribution includes stemmer
43*e8e4245dSRobert Muirtables prepared using an extensive corpus of Polish language (see
44*e8e4245dSRobert Muirdetails below).</p>
45*e8e4245dSRobert Muir<p>This work is available under Apache-style Open Source license - the
46*e8e4245dSRobert Muirstemmer code is covered by Egothor License, the tables and other
47*e8e4245dSRobert Muiradditions are covered by Apache License 2.0. Both licenses allow to use
48*e8e4245dSRobert Muirthe code in Open Source as well as commercial (closed source) projects.</p>
49*e8e4245dSRobert Muir<h3>Terminology</h3>
50*e8e4245dSRobert Muir<p>A short explanation is in order about the terminology used in this
51*e8e4245dSRobert Muirtext.</p>
52*e8e4245dSRobert Muir<p>In the following sections I make a distinction between <b>stem</b>
53*e8e4245dSRobert Muirand <b>lemma</b>.</p>
54*e8e4245dSRobert Muir<p>Lemma is a base grammatical form (dictionary form, headword) of a
55*e8e4245dSRobert Muirword. Lemma is an existing, grammatically correct word in some human
56*e8e4245dSRobert Muirlanguage.</p>
57*e8e4245dSRobert Muir<p>Stem on the other hand is just a unique token, not necessarily
58*e8e4245dSRobert Muirmaking any sense in any human language, but which can serve as a unique
59*e8e4245dSRobert Muirlabel instead of lemma for the same set of inflected forms. Quite often
60*e8e4245dSRobert Muirstem is referred to as a "root" of the word - which is incorrect and
61*e8e4245dSRobert Muirmisleading (stems sometimes have very little to do with the linguistic
62*e8e4245dSRobert Muirroot of a word, i.e. a pattern found in a word which is common to all
63*e8e4245dSRobert Muirinflected forms or within a family of languages).</p>
64*e8e4245dSRobert Muir<p>For an IR system stems are usually sufficient, for a morphological
65*e8e4245dSRobert Muiranalysis system obviously lemmas are a must. In practice, various
66*e8e4245dSRobert Muirstemmers produce a mix of stems and lemmas, as is the case with the
67*e8e4245dSRobert Muirstemmer described here. Additionally, for some languages, which use
68*e8e4245dSRobert Muirsuffix-based inflection rules many stemmers based on suffix-stripping
69*e8e4245dSRobert Muirwill produce a large percentage of stems equivalent to lemmas. This is
70*e8e4245dSRobert Muirhowever not the case for languages with complex, irregular inflection
71*e8e4245dSRobert Muirrules (such as Slavic languages) - here simplistic suffix-stripping
72*e8e4245dSRobert Muirstemmers produce very poor results.</p>
73*e8e4245dSRobert Muir<h3>Background</h3>
74*e8e4245dSRobert Muir<p>Lemmatization is a process of finding the base, non-inflected form
75*e8e4245dSRobert Muirof a word. The result of lemmatization is a correct existing word,
76*e8e4245dSRobert Muiroften in nominative case for nouns and infinitive form for verbs. A
77*e8e4245dSRobert Muirgiven inflected form may correspond to several lemmas (e.g. "found"
78*e8e4245dSRobert Muir-&gt; find, found) - the correct choice depends on the context.<br>
79*e8e4245dSRobert Muir<br>
80*e8e4245dSRobert MuirStemming is concerned mostly with finding a unique "root" of a word,
81*e8e4245dSRobert Muirwhich not necessarily results in any existing word or lemma. The
82*e8e4245dSRobert Muirquality of stemming is measured by the rate of collisions (overstemming
83*e8e4245dSRobert Muir- which causes words with different lemmas to be incorrectly conflated
84*e8e4245dSRobert Muirinto one "root"), and the rate of superfluous word "roots"
85*e8e4245dSRobert Muir(understemming - which assigns several "roots" to words with the same
86*e8e4245dSRobert Muirlemma). <br>
87*e8e4245dSRobert Muir<br>
88*e8e4245dSRobert MuirBoth stemmer and lemmatizer can be implemented in various ways. The two
89*e8e4245dSRobert Muirmost common approaches are:<br>
90*e8e4245dSRobert Muir</p>
91*e8e4245dSRobert Muir<ul>
92*e8e4245dSRobert Muir  <li>dictionary-based: where the stemmer uses an extensive dictionary
93*e8e4245dSRobert Muirof morphological forms in order to find the corresponding stem or lemma</li>
94*e8e4245dSRobert Muir  <li>algorithmic: where the stemmer uses an algorithm, based on
95*e8e4245dSRobert Muirgeneral morphological properties of a given language plus a set of
96*e8e4245dSRobert Muirheuristic rules<br>
97*e8e4245dSRobert Muir  </li>
98*e8e4245dSRobert Muir</ul>
99*e8e4245dSRobert MuirThere are many existing and well-known implementations of stemmers for
100*e8e4245dSRobert MuirEnglish (Porter, Lovins, Krovetz) and other European languages
101*e8e4245dSRobert Muir(<a href="http://snowball.tartarus.org">Snowball</a>). There are also
102*e8e4245dSRobert Muirgood quality commercial lemmatizers for Polish. However, there is only
103*e8e4245dSRobert Muirone
104*e8e4245dSRobert Muirfreely available Polish stemmer, implemented by
105*e8e4245dSRobert Muir<a
106*e8e4245dSRobert Muir href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid
107*e8e4245dSRobert MuirWeiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a
108*e8e4245dSRobert Muir href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That
109*e8e4245dSRobert Muirstemmer is dictionary-based. This means that even
110*e8e4245dSRobert Muirthough it can achieve
111*e8e4245dSRobert Muirperfect accuracy for previously known word forms found in its
112*e8e4245dSRobert Muirdictionary, it
113*e8e4245dSRobert Muircompletely fails in case of all other word forms. This deficiency is
114*e8e4245dSRobert Muirsomewhat mitigated by the comprehensive dictionary distributed with
115*e8e4245dSRobert Muirthis stemmer (so there is a high probability that most of the words in
116*e8e4245dSRobert Muirthe input text will be found in the dictionary), however the problem
117*e8e4245dSRobert Muirstill remains (please see the page above for more detailed description).<br>
118*e8e4245dSRobert Muir<br>
119*e8e4245dSRobert MuirThe implementation described here uses an algorithmic method. This
120*e8e4245dSRobert Muirmethod
121*e8e4245dSRobert Muirand particular algorithm implementation are described in detail in
122*e8e4245dSRobert Muir[1][2].
123*e8e4245dSRobert MuirThe main advantage of algorithmic stemmers is their ability to process
124*e8e4245dSRobert Muirpreviously
125*e8e4245dSRobert Muirunseen word forms with high accuracy. This particular algorithm uses a
126*e8e4245dSRobert Muirset
127*e8e4245dSRobert Muirof
128*e8e4245dSRobert Muirtransformation rules (patch commands), which describe how a word with a
129*e8e4245dSRobert Muirgiven pattern should be transformed to its stem. These rules are first
130*e8e4245dSRobert Muirlearned from a training corpus. They don't
131*e8e4245dSRobert Muircover
132*e8e4245dSRobert Muirall possible cases, so there is always some loss of precision/recall
133*e8e4245dSRobert Muir(which
134*e8e4245dSRobert Muirmeans that even the words from the training corpus are sometimes
135*e8e4245dSRobert Muirincorrectly stemmed).<br>
136*e8e4245dSRobert Muir<h2>Algorithm and implementation<span style="font-style: italic;"></span></h2>
137*e8e4245dSRobert MuirThe algorithm and its Java implementation is described in detail in the
138*e8e4245dSRobert Muirpublications cited below. Here's just a short excerpt from [2]:<br>
139*e8e4245dSRobert Muir<br>
140*e8e4245dSRobert Muir<center>
141*e8e4245dSRobert Muir<div style="width: 80%;" align="justify">"The aim is separation of the
142*e8e4245dSRobert Muirstemmer execution code from the data
143*e8e4245dSRobert Muirstructures [...]. In other words, a static algorithm configurable by
144*e8e4245dSRobert Muirdata must be developed. The word transformations that happen in the
145*e8e4245dSRobert Muirstemmer must be then encoded to the data tables.<br>
146*e8e4245dSRobert Muir<br>
147*e8e4245dSRobert MuirThe tacit input of our method is a sample set (a so-called dictionary)
148*e8e4245dSRobert Muirof words (as keys) and their stems. Each record can be equivalently
149*e8e4245dSRobert Muirstored as a key and the record of key's transformation to its
150*e8e4245dSRobert Muirrespective stem. The transformation record is termed a patch command
151*e8e4245dSRobert Muir(P-command). It must be ensured that P-commands are universal, and that
152*e8e4245dSRobert MuirP-commands can transform any word to its stem. Our solution[6,8] is
153*e8e4245dSRobert Muirbased on the Levenstein metric [10], which produces P-command as the
154*e8e4245dSRobert Muirminimum cost path in a directed graph.<br>
155*e8e4245dSRobert Muir<br>
156*e8e4245dSRobert MuirOne can imagine the P-command as an algorithm for an operator (editor)
157*e8e4245dSRobert Muirthat rewrites a string to another string. The operator can use these
158*e8e4245dSRobert Muirinstructions (PP-command's): <span style="font-weight: bold;">removal </span>-
159*e8e4245dSRobert Muirdeletes a sequence of characters starting at the current cursor
160*e8e4245dSRobert Muirposition and moves the cursor to the next character. The length of this
161*e8e4245dSRobert Muirsequence is the parameter; <span style="font-weight: bold;">insertion </span>-
162*e8e4245dSRobert Muirinserts a character ch, without moving the cursor. The character ch is
163*e8e4245dSRobert Muira parameter; <span style="font-weight: bold;">substitution&nbsp;</span>
164*e8e4245dSRobert Muir- rewrites a character at the current cursor position to the character
165*e8e4245dSRobert Muirch and moves the cursor to the next character. The character ch is a
166*e8e4245dSRobert Muirparameter; <span style="font-weight: bold;">no operation</span> (NOOP)
167*e8e4245dSRobert Muir- skip a sequence of characters starting at the current cursor
168*e8e4245dSRobert Muirposition. The length of this sequence is the parameter.<br>
169*e8e4245dSRobert Muir<br>
170*e8e4245dSRobert MuirThe P-commands are applied from the end of a word (right to left). This
171*e8e4245dSRobert Muirassumption can reduce the set of P-command's, because the last NOOP,
172*e8e4245dSRobert Muirmoving the cursor to the end of a string without any changes, need not
173*e8e4245dSRobert Muirbe stored."</div>
174*e8e4245dSRobert Muir</center>
175*e8e4245dSRobert Muir<br>
176*e8e4245dSRobert MuirData structure used to keep the dictionary (words and their P-commands)
177*e8e4245dSRobert Muiris a trie. Several optimization steps are applied in turn to reduce and
178*e8e4245dSRobert Muiroptimize the initial trie, by eliminating useless information and
179*e8e4245dSRobert Muirshortening the paths in the trie.<br>
180*e8e4245dSRobert Muir<br>
181*e8e4245dSRobert MuirFinally, in order to obtain a stem from the input word, the word is
182*e8e4245dSRobert Muirpassed once through a matching path in the trie (applying at each node
183*e8e4245dSRobert Muirthe P-commands stored there). The result is a word stem.<br>
184*e8e4245dSRobert Muir<h2>Corpus</h2>
185*e8e4245dSRobert Muir<p><i>(to be completed...)</i></p>
186*e8e4245dSRobert Muir<p>The following Polish corpora have been used:</p>
187*e8e4245dSRobert Muir<ul>
188*e8e4245dSRobert Muir  <li><a
189*e8e4245dSRobert Muir href="http://sourceforge.net/project/showfiles.php?group_id=49316&amp;package_id=65354">Polish
190*e8e4245dSRobert Muirdictionary
191*e8e4245dSRobert Muirfrom ispell distribution</a></li>
192*e8e4245dSRobert Muir  <li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus
193*e8e4245dSRobert Muirsłownika frekwencyjnego</a></li>
194*e8e4245dSRobert Muir<!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>-->
195*e8e4245dSRobert Muir<!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>--><li>The
196*e8e4245dSRobert MuirBible (so called "TysiÄ…clecia") - unauthorized electronic version</li>
197*e8e4245dSRobert Muir  <li><a
198*e8e4245dSRobert Muir href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator
199*e8e4245dSRobert Muirmorfologiczny SAM v. 3.4</a> - this was used to recover lemmas
200*e8e4245dSRobert Muirmissing from other texts</li>
201*e8e4245dSRobert Muir</ul>
202*e8e4245dSRobert Muir<p>This step was the most time-consuming - and it would probably be
203*e8e4245dSRobert Muireven more tedious and difficult if not for the
204*e8e4245dSRobert Muirhelp of
205*e8e4245dSRobert Muir<a href="http://www.python.org/">Python</a>. The source texts had to be
206*e8e4245dSRobert Muirbrought to a common encoding (UTF-8) - some of them used quite ancient
207*e8e4245dSRobert Muirencodings like Mazovia or DHN - and then scripts were written to
208*e8e4245dSRobert Muircollect all lemmas and
209*e8e4245dSRobert Muirinflected forms from the source texts. In cases when the source text
210*e8e4245dSRobert Muirwas not
211*e8e4245dSRobert Muirtagged,
212*e8e4245dSRobert MuirI used the SAM analyzer to produce lemmas. In cases of ambiguous
213*e8e4245dSRobert Muirlemmatization I decided to put references to inflected forms from all
214*e8e4245dSRobert Muirbase forms.<br>
215*e8e4245dSRobert Muir</p>
216*e8e4245dSRobert Muir<p>All grammatical categories were allowed to appear in the corpus,
217*e8e4245dSRobert Muiri.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting
218*e8e4245dSRobert Muircorpus consisted of roughly 87,000+ inflection sets, i.e. each set
219*e8e4245dSRobert Muirconsisted of one base form (lemma) and many inflected forms. However,
220*e8e4245dSRobert Muirbecause of the nature of the training method I restricted these sets to
221*e8e4245dSRobert Muirinclude only those where there were at least 4 inflected forms. Sets
222*e8e4245dSRobert Muirwith 3 or less inflected forms were removed, so that the final corpus
223*e8e4245dSRobert Muirconsisted of ~69,000 unique sets, which in turn contained ~1.5 mln
224*e8e4245dSRobert Muirinflected forms. <br>
225*e8e4245dSRobert Muir</p>
226*e8e4245dSRobert Muir<h2>Testing</h2>
227*e8e4245dSRobert Muir<p>I tested the stemmer tables produced using the implementation
228*e8e4245dSRobert Muirdescribed above. The following sections give some details about
229*e8e4245dSRobert Muirthe testing setup.
230*e8e4245dSRobert Muir</p>
231*e8e4245dSRobert Muir<h3>Testing procedure</h3>
232*e8e4245dSRobert Muir<p>The testing procedure was as follows:
233*e8e4245dSRobert Muir</p>
234*e8e4245dSRobert Muir<ul>
235*e8e4245dSRobert Muir  <li>the whole corpus of ~69,000 unique sets was shuffled, so that the
236*e8e4245dSRobert Muirinput sets were in random order.</li>
237*e8e4245dSRobert Muir  <li>the corpus was split into two parts - one with 30,000 sets (Part
238*e8e4245dSRobert Muir1), the other with ~39,000 sets (Part 2).</li>
239*e8e4245dSRobert Muir  <li>Training samples were drawn in sequential order from the Part 1.
240*e8e4245dSRobert MuirSince the sets were already randomized, the training samples were also
241*e8e4245dSRobert Muirrandomized, but this procedure ensured that each larger training sample
242*e8e4245dSRobert Muircontained all smaller samples.</li>
243*e8e4245dSRobert Muir  <li>Part 2 was used for testing. Note: this means that the testing
244*e8e4245dSRobert Muirrun used <em>only</em> words previously unseen during the training
245*e8e4245dSRobert Muirphase. This is the worst scenario, because it means that stemmer must
246*e8e4245dSRobert Muirextrapolate the learned rules to unknown cases. This also means that in
247*e8e4245dSRobert Muira real-life case (where the input is a mix between known and unknown
248*e8e4245dSRobert Muirwords) the F-measure of the stemmer will be even higher than in the
249*e8e4245dSRobert Muirtable below.</li>
250*e8e4245dSRobert Muir</ul>
251*e8e4245dSRobert Muir<h3>Test results</h3>
252*e8e4245dSRobert Muir<p>The following table summarizes test results for varying sizes
253*e8e4245dSRobert Muirof training samples. The meaning of the table columns is
254*e8e4245dSRobert Muirdescribed below:
255*e8e4245dSRobert Muir</p>
256*e8e4245dSRobert Muir<ul>
257*e8e4245dSRobert Muir  <li><b>training sets:</b> the number of training sets. One set
258*e8e4245dSRobert Muirconsists of one lemma and at least 4 and up to ~80 inflected forms
259*e8e4245dSRobert Muir(including pre- and suffixed forms).</li>
260*e8e4245dSRobert Muir  <li><b>testing forms:</b> the number of testing forms. Only inflected
261*e8e4245dSRobert Muirforms were used in testing.</li>
262*e8e4245dSRobert Muir  <li><b>stem OK:</b> the number of cases when produced output was a
263*e8e4245dSRobert Muircorrect (unique) stem. Note: quite often correct stems were also
264*e8e4245dSRobert Muircorrect lemmas.</li>
265*e8e4245dSRobert Muir  <li><b>lemma OK:</b> the number of cases when produced output was a
266*e8e4245dSRobert Muircorrect lemma.</li>
267*e8e4245dSRobert Muir  <li><b>missing:</b> the number of cases when stemmer was unable to
268*e8e4245dSRobert Muirprovide any output.</li>
269*e8e4245dSRobert Muir  <li><b>stem bad:</b> the number of cases when produced output was a
270*e8e4245dSRobert Muirstem, but already in use identifying a different set.</li>
271*e8e4245dSRobert Muir  <li><b>lemma bad:</b> the number of cases when produced output was an
272*e8e4245dSRobert Muirincorrect lemma. Note: quite often in such case the output was a
273*e8e4245dSRobert Muircorrect stem.</li>
274*e8e4245dSRobert Muir  <li><b>table size:</b> the size in bytes of the stemmer table.</li>
275*e8e4245dSRobert Muir</ul>
276*e8e4245dSRobert Muir<div align="center">
277*e8e4245dSRobert Muir<table border="1" cellpadding="2" cellspacing="0">
278*e8e4245dSRobert Muir  <tbody>
279*e8e4245dSRobert Muir    <tr bgcolor="#a0b0c0">
280*e8e4245dSRobert Muir      <th>Training sets</th>
281*e8e4245dSRobert Muir      <th>Testing forms</th>
282*e8e4245dSRobert Muir      <th>Stem OK</th>
283*e8e4245dSRobert Muir      <th>Lemma OK</th>
284*e8e4245dSRobert Muir      <th>Missing</th>
285*e8e4245dSRobert Muir      <th>Stem Bad</th>
286*e8e4245dSRobert Muir      <th>Lemma Bad</th>
287*e8e4245dSRobert Muir      <th>Table size [B]</th>
288*e8e4245dSRobert Muir    </tr>
289*e8e4245dSRobert Muir    <tr align="right">
290*e8e4245dSRobert Muir      <td>100</td>
291*e8e4245dSRobert Muir      <td>1022985</td>
292*e8e4245dSRobert Muir      <td>842209</td>
293*e8e4245dSRobert Muir      <td>593632</td>
294*e8e4245dSRobert Muir      <td>172711</td>
295*e8e4245dSRobert Muir      <td>22331</td>
296*e8e4245dSRobert Muir      <td>256642</td>
297*e8e4245dSRobert Muir      <td>28438</td>
298*e8e4245dSRobert Muir    </tr>
299*e8e4245dSRobert Muir    <tr align="right">
300*e8e4245dSRobert Muir      <td>200</td>
301*e8e4245dSRobert Muir      <td>1022985</td>
302*e8e4245dSRobert Muir      <td>862789</td>
303*e8e4245dSRobert Muir      <td>646488</td>
304*e8e4245dSRobert Muir      <td>153288</td>
305*e8e4245dSRobert Muir      <td>16306</td>
306*e8e4245dSRobert Muir      <td>223209</td>
307*e8e4245dSRobert Muir      <td>48660</td>
308*e8e4245dSRobert Muir    </tr>
309*e8e4245dSRobert Muir    <tr align="right">
310*e8e4245dSRobert Muir      <td>500</td>
311*e8e4245dSRobert Muir      <td>1022985</td>
312*e8e4245dSRobert Muir      <td>885786</td>
313*e8e4245dSRobert Muir      <td>685009</td>
314*e8e4245dSRobert Muir      <td>130772</td>
315*e8e4245dSRobert Muir      <td>14856</td>
316*e8e4245dSRobert Muir      <td>207204</td>
317*e8e4245dSRobert Muir      <td>108798</td>
318*e8e4245dSRobert Muir    </tr>
319*e8e4245dSRobert Muir    <tr align="right">
320*e8e4245dSRobert Muir      <td>700</td>
321*e8e4245dSRobert Muir      <td>1022985</td>
322*e8e4245dSRobert Muir      <td>909031</td>
323*e8e4245dSRobert Muir      <td>704609</td>
324*e8e4245dSRobert Muir      <td>107084</td>
325*e8e4245dSRobert Muir      <td>15442</td>
326*e8e4245dSRobert Muir      <td>211292</td>
327*e8e4245dSRobert Muir      <td>139291</td>
328*e8e4245dSRobert Muir    </tr>
329*e8e4245dSRobert Muir    <tr align="right">
330*e8e4245dSRobert Muir      <td>1000</td>
331*e8e4245dSRobert Muir      <td>1022985</td>
332*e8e4245dSRobert Muir      <td>926079</td>
333*e8e4245dSRobert Muir      <td>725720</td>
334*e8e4245dSRobert Muir      <td>90117</td>
335*e8e4245dSRobert Muir      <td>14941</td>
336*e8e4245dSRobert Muir      <td>207148</td>
337*e8e4245dSRobert Muir      <td>183677</td>
338*e8e4245dSRobert Muir    </tr>
339*e8e4245dSRobert Muir    <tr align="right">
340*e8e4245dSRobert Muir      <td>2000</td>
341*e8e4245dSRobert Muir      <td>1022985</td>
342*e8e4245dSRobert Muir      <td>942886</td>
343*e8e4245dSRobert Muir      <td>746641</td>
344*e8e4245dSRobert Muir      <td>73429</td>
345*e8e4245dSRobert Muir      <td>14903</td>
346*e8e4245dSRobert Muir      <td>202915</td>
347*e8e4245dSRobert Muir      <td>313516</td>
348*e8e4245dSRobert Muir    </tr>
349*e8e4245dSRobert Muir    <tr align="right">
350*e8e4245dSRobert Muir      <td>5000</td>
351*e8e4245dSRobert Muir      <td>1022985</td>
352*e8e4245dSRobert Muir      <td>954721</td>
353*e8e4245dSRobert Muir      <td>759930</td>
354*e8e4245dSRobert Muir      <td>61476</td>
355*e8e4245dSRobert Muir      <td>14817</td>
356*e8e4245dSRobert Muir      <td>201579</td>
357*e8e4245dSRobert Muir      <td>640969</td>
358*e8e4245dSRobert Muir    </tr>
359*e8e4245dSRobert Muir    <tr align="right">
360*e8e4245dSRobert Muir      <td>7000</td>
361*e8e4245dSRobert Muir      <td>1022985</td>
362*e8e4245dSRobert Muir      <td>956165</td>
363*e8e4245dSRobert Muir      <td>764033</td>
364*e8e4245dSRobert Muir      <td>60364</td>
365*e8e4245dSRobert Muir      <td>14620</td>
366*e8e4245dSRobert Muir      <td>198588</td>
367*e8e4245dSRobert Muir      <td>839347</td>
368*e8e4245dSRobert Muir    </tr>
369*e8e4245dSRobert Muir    <tr align="right">
370*e8e4245dSRobert Muir      <td>10000</td>
371*e8e4245dSRobert Muir      <td>1022985</td>
372*e8e4245dSRobert Muir      <td>965427</td>
373*e8e4245dSRobert Muir      <td>775507</td>
374*e8e4245dSRobert Muir      <td>50797</td>
375*e8e4245dSRobert Muir      <td>14662</td>
376*e8e4245dSRobert Muir      <td>196681</td>
377*e8e4245dSRobert Muir      <td>1144537</td>
378*e8e4245dSRobert Muir    </tr>
379*e8e4245dSRobert Muir    <tr align="right">
380*e8e4245dSRobert Muir      <td>12000</td>
381*e8e4245dSRobert Muir      <td>1022985</td>
382*e8e4245dSRobert Muir      <td>967664</td>
383*e8e4245dSRobert Muir      <td>782143</td>
384*e8e4245dSRobert Muir      <td>48722</td>
385*e8e4245dSRobert Muir      <td>14284</td>
386*e8e4245dSRobert Muir      <td>192120</td>
387*e8e4245dSRobert Muir      <td>1313508</td>
388*e8e4245dSRobert Muir    </tr>
389*e8e4245dSRobert Muir    <tr align="right">
390*e8e4245dSRobert Muir      <td>15000</td>
391*e8e4245dSRobert Muir      <td>1022985</td>
392*e8e4245dSRobert Muir      <td>973188</td>
393*e8e4245dSRobert Muir      <td>788867</td>
394*e8e4245dSRobert Muir      <td>43247</td>
395*e8e4245dSRobert Muir      <td>14349</td>
396*e8e4245dSRobert Muir      <td>190871</td>
397*e8e4245dSRobert Muir      <td>1567902</td>
398*e8e4245dSRobert Muir    </tr>
399*e8e4245dSRobert Muir    <tr align="right">
400*e8e4245dSRobert Muir      <td>17000</td>
401*e8e4245dSRobert Muir      <td>1022985</td>
402*e8e4245dSRobert Muir      <td>974203</td>
403*e8e4245dSRobert Muir      <td>791804</td>
404*e8e4245dSRobert Muir      <td>42319</td>
405*e8e4245dSRobert Muir      <td>14333</td>
406*e8e4245dSRobert Muir      <td>188862</td>
407*e8e4245dSRobert Muir      <td>1733957</td>
408*e8e4245dSRobert Muir    </tr>
409*e8e4245dSRobert Muir    <tr align="right">
410*e8e4245dSRobert Muir      <td>20000</td>
411*e8e4245dSRobert Muir      <td>1022985</td>
412*e8e4245dSRobert Muir      <td>976234</td>
413*e8e4245dSRobert Muir      <td>791554</td>
414*e8e4245dSRobert Muir      <td>40058</td>
415*e8e4245dSRobert Muir      <td>14601</td>
416*e8e4245dSRobert Muir      <td>191373</td>
417*e8e4245dSRobert Muir      <td>1977615</td>
418*e8e4245dSRobert Muir    </tr>
419*e8e4245dSRobert Muir  </tbody>
420*e8e4245dSRobert Muir</table>
421*e8e4245dSRobert Muir</div>
422*e8e4245dSRobert Muir<p>I also measured the time to produce a stem (which involves
423*e8e4245dSRobert Muirtraversing a trie,
424*e8e4245dSRobert Muirretrieving a patch command and applying the patch command to the input
425*e8e4245dSRobert Muirstring).
426*e8e4245dSRobert MuirOn a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
427*e8e4245dSRobert MuirHotSpot),
428*e8e4245dSRobert Muirfor tables ranging in size from 1,000 to 20,000 cells, the time to
429*e8e4245dSRobert Muirproduce a
430*e8e4245dSRobert Muirsingle stem varies between 5-10 microseconds.<br>
431*e8e4245dSRobert Muir</p>
432*e8e4245dSRobert Muir<p>This means that the stemmer can process up to <span
433*e8e4245dSRobert Muir style="font-weight: bold;">200,000 words per second</span>, an
434*e8e4245dSRobert Muiroutstanding result when compared to other stemmers (Morfeusz - ~2,000
435*e8e4245dSRobert Muirw/s, FormAN (MS Word analyzer) - ~1,000 w/s).<br>
436*e8e4245dSRobert Muir</p>
437*e8e4245dSRobert Muir<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>,
438*e8e4245dSRobert Muirwhich you can use to produce reports
439*e8e4245dSRobert Muirlike the one below:<br>
440*e8e4245dSRobert Muir</p>
441*e8e4245dSRobert Muir<pre>--------- Stemmer benchmark report: -----------<br>Stemmer table:  /res/tables/stemmer_2000.out<br>Input file:     ../test3.txt<br>Number of runs: 3<br><br> RUN NUMBER:            1       2       3<br> Total input words      1378176 1378176 1378176<br> Missed output words    112     112     112<br> Time elapsed [ms]      6989    6940    6640<br> Hit rate percent       99.99%  99.99%  99.99%<br> Miss rate percent      00.01%  00.01%  00.01%<br> Words per second       197192  198584  207557<br> Time per word [us]     5.07    5.04    4.82<br></pre>
442*e8e4245dSRobert Muir<h2>Summary</h2>
443*e8e4245dSRobert Muir<p>The results of these tests are very encouraging. It seems that using
444*e8e4245dSRobert Muirthe
445*e8e4245dSRobert Muirtraining corpus and the stemming algorithm described above results in a
446*e8e4245dSRobert Muirhigh-quality stemmer useful for most applications. Moreover, it can
447*e8e4245dSRobert Muiralso
448*e8e4245dSRobert Muirbe used as a better than average lemmatizer.</p>
449*e8e4245dSRobert Muir<p>Both the author of the implementation
450*e8e4245dSRobert Muir(Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
451*e8e4245dSRobert Muirof this
452*e8e4245dSRobert Muircompilation (Andrzej Bialecki &lt;ab AT getopt DOT org&gt;) would
453*e8e4245dSRobert Muirappreciate any
454*e8e4245dSRobert Muirfeedback and suggestions for further improvements.</p>
455*e8e4245dSRobert Muir<h2>Bibliography</h2>
456*e8e4245dSRobert Muir<ol>
457*e8e4245dSRobert Muir  <li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
458*e8e4245dSRobert MuirThesis,
459*e8e4245dSRobert MuirFaculty of Mathematics and Physics, Charles University in Prague, in
460*e8e4245dSRobert Muirpress.</li>
461*e8e4245dSRobert Muir  <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
462*e8e4245dSRobert MuirIntelligent Information Processing and Web Mining Conference, 2004,
463*e8e4245dSRobert MuirZakopane, Poland.</li>
464*e8e4245dSRobert Muir  <li>Galambos, L.: Lemmatizer for Document Information Retrieval
465*e8e4245dSRobert MuirSystems in JAVA.<span style="text-decoration: underline;"> </span><a
466*e8e4245dSRobert Muir class="moz-txt-link-rfc2396E"
467*e8e4245dSRobert Muir href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
468*e8e4245dSRobert MuirSOFSEM 2001, Piestany, Slovakia. <br>
469*e8e4245dSRobert Muir  </li>
470*e8e4245dSRobert Muir</ol>
471*e8e4245dSRobert Muir<br>
472*e8e4245dSRobert Muir<br>
473*e8e4245dSRobert Muir</body>
474*e8e4245dSRobert Muir</html>
475