1*e8e4245dSRobert Muir<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 2*e8e4245dSRobert Muir<!-- 3*e8e4245dSRobert Muir Licensed to the Apache Software Foundation (ASF) under one or more 4*e8e4245dSRobert Muir contributor license agreements. See the NOTICE file distributed with 5*e8e4245dSRobert Muir this work for additional information regarding copyright ownership. 6*e8e4245dSRobert Muir The ASF licenses this file to You under the Apache License, Version 2.0 7*e8e4245dSRobert Muir (the "License"); you may not use this file except in compliance with 8*e8e4245dSRobert Muir the License. You may obtain a copy of the License at 9*e8e4245dSRobert Muir 10*e8e4245dSRobert Muir http://www.apache.org/licenses/LICENSE-2.0 11*e8e4245dSRobert Muir 12*e8e4245dSRobert Muir Unless required by applicable law or agreed to in writing, software 13*e8e4245dSRobert Muir distributed under the License is distributed on an "AS IS" BASIS, 14*e8e4245dSRobert Muir WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15*e8e4245dSRobert Muir See the License for the specific language governing permissions and 16*e8e4245dSRobert Muir limitations under the License. 17*e8e4245dSRobert Muir--> 18*e8e4245dSRobert Muir<html> 19*e8e4245dSRobert Muir<head> 20*e8e4245dSRobert Muir <meta content="text/html; charset=UTF-8" http-equiv="content-type"> 21*e8e4245dSRobert Muir <title>Stempel - Algorithmic Stemmer for Polish Language</title> 22*e8e4245dSRobert Muir <meta content="Andrzej Bialecki" name="author"> 23*e8e4245dSRobert Muir <meta name="keywords" 24*e8e4245dSRobert Muir content="stemming, stemmer, algorithmic stemmer, Polish stemmer"> 25*e8e4245dSRobert Muir <meta 26*e8e4245dSRobert Muir content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables." 27*e8e4245dSRobert Muir name="description"> 28*e8e4245dSRobert Muir</head> 29*e8e4245dSRobert Muir<body style="font-family: Arial,SansSerif;"> 30*e8e4245dSRobert Muir<h1><i>Stempel</i> - Algorithmic Stemmer for Polish Language</h1> 31*e8e4245dSRobert Muir<h2>Introduction</h2> 32*e8e4245dSRobert Muir<p>A method for conflation of different inflected word forms is an 33*e8e4245dSRobert Muirimportant component of many Information Retrieval systems. It helps to 34*e8e4245dSRobert Muirimprove the system's recall and can significantly reduce the index 35*e8e4245dSRobert Muirsize. This is especially true for highly-inflectional languages like 36*e8e4245dSRobert Muirthose from the Slavic language family (Czech, Slovak, Polish, Russian, 37*e8e4245dSRobert MuirBulgarian, etc).</p> 38*e8e4245dSRobert Muir<p>This page describes a software package consisting of high-quality 39*e8e4245dSRobert Muirstemming tables for Polish, and a universal algorithmic stemmer, which 40*e8e4245dSRobert Muiroperates using these tables. The stemmer code is taken virtually 41*e8e4245dSRobert Muirunchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p> 42*e8e4245dSRobert Muir<p>The software distribution includes stemmer 43*e8e4245dSRobert Muirtables prepared using an extensive corpus of Polish language (see 44*e8e4245dSRobert Muirdetails below).</p> 45*e8e4245dSRobert Muir<p>This work is available under Apache-style Open Source license - the 46*e8e4245dSRobert Muirstemmer code is covered by Egothor License, the tables and other 47*e8e4245dSRobert Muiradditions are covered by Apache License 2.0. Both licenses allow to use 48*e8e4245dSRobert Muirthe code in Open Source as well as commercial (closed source) projects.</p> 49*e8e4245dSRobert Muir<h3>Terminology</h3> 50*e8e4245dSRobert Muir<p>A short explanation is in order about the terminology used in this 51*e8e4245dSRobert Muirtext.</p> 52*e8e4245dSRobert Muir<p>In the following sections I make a distinction between <b>stem</b> 53*e8e4245dSRobert Muirand <b>lemma</b>.</p> 54*e8e4245dSRobert Muir<p>Lemma is a base grammatical form (dictionary form, headword) of a 55*e8e4245dSRobert Muirword. Lemma is an existing, grammatically correct word in some human 56*e8e4245dSRobert Muirlanguage.</p> 57*e8e4245dSRobert Muir<p>Stem on the other hand is just a unique token, not necessarily 58*e8e4245dSRobert Muirmaking any sense in any human language, but which can serve as a unique 59*e8e4245dSRobert Muirlabel instead of lemma for the same set of inflected forms. Quite often 60*e8e4245dSRobert Muirstem is referred to as a "root" of the word - which is incorrect and 61*e8e4245dSRobert Muirmisleading (stems sometimes have very little to do with the linguistic 62*e8e4245dSRobert Muirroot of a word, i.e. a pattern found in a word which is common to all 63*e8e4245dSRobert Muirinflected forms or within a family of languages).</p> 64*e8e4245dSRobert Muir<p>For an IR system stems are usually sufficient, for a morphological 65*e8e4245dSRobert Muiranalysis system obviously lemmas are a must. In practice, various 66*e8e4245dSRobert Muirstemmers produce a mix of stems and lemmas, as is the case with the 67*e8e4245dSRobert Muirstemmer described here. Additionally, for some languages, which use 68*e8e4245dSRobert Muirsuffix-based inflection rules many stemmers based on suffix-stripping 69*e8e4245dSRobert Muirwill produce a large percentage of stems equivalent to lemmas. This is 70*e8e4245dSRobert Muirhowever not the case for languages with complex, irregular inflection 71*e8e4245dSRobert Muirrules (such as Slavic languages) - here simplistic suffix-stripping 72*e8e4245dSRobert Muirstemmers produce very poor results.</p> 73*e8e4245dSRobert Muir<h3>Background</h3> 74*e8e4245dSRobert Muir<p>Lemmatization is a process of finding the base, non-inflected form 75*e8e4245dSRobert Muirof a word. The result of lemmatization is a correct existing word, 76*e8e4245dSRobert Muiroften in nominative case for nouns and infinitive form for verbs. A 77*e8e4245dSRobert Muirgiven inflected form may correspond to several lemmas (e.g. "found" 78*e8e4245dSRobert Muir-> find, found) - the correct choice depends on the context.<br> 79*e8e4245dSRobert Muir<br> 80*e8e4245dSRobert MuirStemming is concerned mostly with finding a unique "root" of a word, 81*e8e4245dSRobert Muirwhich not necessarily results in any existing word or lemma. The 82*e8e4245dSRobert Muirquality of stemming is measured by the rate of collisions (overstemming 83*e8e4245dSRobert Muir- which causes words with different lemmas to be incorrectly conflated 84*e8e4245dSRobert Muirinto one "root"), and the rate of superfluous word "roots" 85*e8e4245dSRobert Muir(understemming - which assigns several "roots" to words with the same 86*e8e4245dSRobert Muirlemma). <br> 87*e8e4245dSRobert Muir<br> 88*e8e4245dSRobert MuirBoth stemmer and lemmatizer can be implemented in various ways. The two 89*e8e4245dSRobert Muirmost common approaches are:<br> 90*e8e4245dSRobert Muir</p> 91*e8e4245dSRobert Muir<ul> 92*e8e4245dSRobert Muir <li>dictionary-based: where the stemmer uses an extensive dictionary 93*e8e4245dSRobert Muirof morphological forms in order to find the corresponding stem or lemma</li> 94*e8e4245dSRobert Muir <li>algorithmic: where the stemmer uses an algorithm, based on 95*e8e4245dSRobert Muirgeneral morphological properties of a given language plus a set of 96*e8e4245dSRobert Muirheuristic rules<br> 97*e8e4245dSRobert Muir </li> 98*e8e4245dSRobert Muir</ul> 99*e8e4245dSRobert MuirThere are many existing and well-known implementations of stemmers for 100*e8e4245dSRobert MuirEnglish (Porter, Lovins, Krovetz) and other European languages 101*e8e4245dSRobert Muir(<a href="http://snowball.tartarus.org">Snowball</a>). There are also 102*e8e4245dSRobert Muirgood quality commercial lemmatizers for Polish. However, there is only 103*e8e4245dSRobert Muirone 104*e8e4245dSRobert Muirfreely available Polish stemmer, implemented by 105*e8e4245dSRobert Muir<a 106*e8e4245dSRobert Muir href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid 107*e8e4245dSRobert MuirWeiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a 108*e8e4245dSRobert Muir href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That 109*e8e4245dSRobert Muirstemmer is dictionary-based. This means that even 110*e8e4245dSRobert Muirthough it can achieve 111*e8e4245dSRobert Muirperfect accuracy for previously known word forms found in its 112*e8e4245dSRobert Muirdictionary, it 113*e8e4245dSRobert Muircompletely fails in case of all other word forms. This deficiency is 114*e8e4245dSRobert Muirsomewhat mitigated by the comprehensive dictionary distributed with 115*e8e4245dSRobert Muirthis stemmer (so there is a high probability that most of the words in 116*e8e4245dSRobert Muirthe input text will be found in the dictionary), however the problem 117*e8e4245dSRobert Muirstill remains (please see the page above for more detailed description).<br> 118*e8e4245dSRobert Muir<br> 119*e8e4245dSRobert MuirThe implementation described here uses an algorithmic method. This 120*e8e4245dSRobert Muirmethod 121*e8e4245dSRobert Muirand particular algorithm implementation are described in detail in 122*e8e4245dSRobert Muir[1][2]. 123*e8e4245dSRobert MuirThe main advantage of algorithmic stemmers is their ability to process 124*e8e4245dSRobert Muirpreviously 125*e8e4245dSRobert Muirunseen word forms with high accuracy. This particular algorithm uses a 126*e8e4245dSRobert Muirset 127*e8e4245dSRobert Muirof 128*e8e4245dSRobert Muirtransformation rules (patch commands), which describe how a word with a 129*e8e4245dSRobert Muirgiven pattern should be transformed to its stem. These rules are first 130*e8e4245dSRobert Muirlearned from a training corpus. They don't 131*e8e4245dSRobert Muircover 132*e8e4245dSRobert Muirall possible cases, so there is always some loss of precision/recall 133*e8e4245dSRobert Muir(which 134*e8e4245dSRobert Muirmeans that even the words from the training corpus are sometimes 135*e8e4245dSRobert Muirincorrectly stemmed).<br> 136*e8e4245dSRobert Muir<h2>Algorithm and implementation<span style="font-style: italic;"></span></h2> 137*e8e4245dSRobert MuirThe algorithm and its Java implementation is described in detail in the 138*e8e4245dSRobert Muirpublications cited below. Here's just a short excerpt from [2]:<br> 139*e8e4245dSRobert Muir<br> 140*e8e4245dSRobert Muir<center> 141*e8e4245dSRobert Muir<div style="width: 80%;" align="justify">"The aim is separation of the 142*e8e4245dSRobert Muirstemmer execution code from the data 143*e8e4245dSRobert Muirstructures [...]. In other words, a static algorithm configurable by 144*e8e4245dSRobert Muirdata must be developed. The word transformations that happen in the 145*e8e4245dSRobert Muirstemmer must be then encoded to the data tables.<br> 146*e8e4245dSRobert Muir<br> 147*e8e4245dSRobert MuirThe tacit input of our method is a sample set (a so-called dictionary) 148*e8e4245dSRobert Muirof words (as keys) and their stems. Each record can be equivalently 149*e8e4245dSRobert Muirstored as a key and the record of key's transformation to its 150*e8e4245dSRobert Muirrespective stem. The transformation record is termed a patch command 151*e8e4245dSRobert Muir(P-command). It must be ensured that P-commands are universal, and that 152*e8e4245dSRobert MuirP-commands can transform any word to its stem. Our solution[6,8] is 153*e8e4245dSRobert Muirbased on the Levenstein metric [10], which produces P-command as the 154*e8e4245dSRobert Muirminimum cost path in a directed graph.<br> 155*e8e4245dSRobert Muir<br> 156*e8e4245dSRobert MuirOne can imagine the P-command as an algorithm for an operator (editor) 157*e8e4245dSRobert Muirthat rewrites a string to another string. The operator can use these 158*e8e4245dSRobert Muirinstructions (PP-command's): <span style="font-weight: bold;">removal </span>- 159*e8e4245dSRobert Muirdeletes a sequence of characters starting at the current cursor 160*e8e4245dSRobert Muirposition and moves the cursor to the next character. The length of this 161*e8e4245dSRobert Muirsequence is the parameter; <span style="font-weight: bold;">insertion </span>- 162*e8e4245dSRobert Muirinserts a character ch, without moving the cursor. The character ch is 163*e8e4245dSRobert Muira parameter; <span style="font-weight: bold;">substitution </span> 164*e8e4245dSRobert Muir- rewrites a character at the current cursor position to the character 165*e8e4245dSRobert Muirch and moves the cursor to the next character. The character ch is a 166*e8e4245dSRobert Muirparameter; <span style="font-weight: bold;">no operation</span> (NOOP) 167*e8e4245dSRobert Muir- skip a sequence of characters starting at the current cursor 168*e8e4245dSRobert Muirposition. The length of this sequence is the parameter.<br> 169*e8e4245dSRobert Muir<br> 170*e8e4245dSRobert MuirThe P-commands are applied from the end of a word (right to left). This 171*e8e4245dSRobert Muirassumption can reduce the set of P-command's, because the last NOOP, 172*e8e4245dSRobert Muirmoving the cursor to the end of a string without any changes, need not 173*e8e4245dSRobert Muirbe stored."</div> 174*e8e4245dSRobert Muir</center> 175*e8e4245dSRobert Muir<br> 176*e8e4245dSRobert MuirData structure used to keep the dictionary (words and their P-commands) 177*e8e4245dSRobert Muiris a trie. Several optimization steps are applied in turn to reduce and 178*e8e4245dSRobert Muiroptimize the initial trie, by eliminating useless information and 179*e8e4245dSRobert Muirshortening the paths in the trie.<br> 180*e8e4245dSRobert Muir<br> 181*e8e4245dSRobert MuirFinally, in order to obtain a stem from the input word, the word is 182*e8e4245dSRobert Muirpassed once through a matching path in the trie (applying at each node 183*e8e4245dSRobert Muirthe P-commands stored there). The result is a word stem.<br> 184*e8e4245dSRobert Muir<h2>Corpus</h2> 185*e8e4245dSRobert Muir<p><i>(to be completed...)</i></p> 186*e8e4245dSRobert Muir<p>The following Polish corpora have been used:</p> 187*e8e4245dSRobert Muir<ul> 188*e8e4245dSRobert Muir <li><a 189*e8e4245dSRobert Muir href="http://sourceforge.net/project/showfiles.php?group_id=49316&package_id=65354">Polish 190*e8e4245dSRobert Muirdictionary 191*e8e4245dSRobert Muirfrom ispell distribution</a></li> 192*e8e4245dSRobert Muir <li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus 193*e8e4245dSRobert Muirsłownika frekwencyjnego</a></li> 194*e8e4245dSRobert Muir<!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>--> 195*e8e4245dSRobert Muir<!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>--><li>The 196*e8e4245dSRobert MuirBible (so called "Tysiąclecia") - unauthorized electronic version</li> 197*e8e4245dSRobert Muir <li><a 198*e8e4245dSRobert Muir href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator 199*e8e4245dSRobert Muirmorfologiczny SAM v. 3.4</a> - this was used to recover lemmas 200*e8e4245dSRobert Muirmissing from other texts</li> 201*e8e4245dSRobert Muir</ul> 202*e8e4245dSRobert Muir<p>This step was the most time-consuming - and it would probably be 203*e8e4245dSRobert Muireven more tedious and difficult if not for the 204*e8e4245dSRobert Muirhelp of 205*e8e4245dSRobert Muir<a href="http://www.python.org/">Python</a>. The source texts had to be 206*e8e4245dSRobert Muirbrought to a common encoding (UTF-8) - some of them used quite ancient 207*e8e4245dSRobert Muirencodings like Mazovia or DHN - and then scripts were written to 208*e8e4245dSRobert Muircollect all lemmas and 209*e8e4245dSRobert Muirinflected forms from the source texts. In cases when the source text 210*e8e4245dSRobert Muirwas not 211*e8e4245dSRobert Muirtagged, 212*e8e4245dSRobert MuirI used the SAM analyzer to produce lemmas. In cases of ambiguous 213*e8e4245dSRobert Muirlemmatization I decided to put references to inflected forms from all 214*e8e4245dSRobert Muirbase forms.<br> 215*e8e4245dSRobert Muir</p> 216*e8e4245dSRobert Muir<p>All grammatical categories were allowed to appear in the corpus, 217*e8e4245dSRobert Muiri.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting 218*e8e4245dSRobert Muircorpus consisted of roughly 87,000+ inflection sets, i.e. each set 219*e8e4245dSRobert Muirconsisted of one base form (lemma) and many inflected forms. However, 220*e8e4245dSRobert Muirbecause of the nature of the training method I restricted these sets to 221*e8e4245dSRobert Muirinclude only those where there were at least 4 inflected forms. Sets 222*e8e4245dSRobert Muirwith 3 or less inflected forms were removed, so that the final corpus 223*e8e4245dSRobert Muirconsisted of ~69,000 unique sets, which in turn contained ~1.5 mln 224*e8e4245dSRobert Muirinflected forms. <br> 225*e8e4245dSRobert Muir</p> 226*e8e4245dSRobert Muir<h2>Testing</h2> 227*e8e4245dSRobert Muir<p>I tested the stemmer tables produced using the implementation 228*e8e4245dSRobert Muirdescribed above. The following sections give some details about 229*e8e4245dSRobert Muirthe testing setup. 230*e8e4245dSRobert Muir</p> 231*e8e4245dSRobert Muir<h3>Testing procedure</h3> 232*e8e4245dSRobert Muir<p>The testing procedure was as follows: 233*e8e4245dSRobert Muir</p> 234*e8e4245dSRobert Muir<ul> 235*e8e4245dSRobert Muir <li>the whole corpus of ~69,000 unique sets was shuffled, so that the 236*e8e4245dSRobert Muirinput sets were in random order.</li> 237*e8e4245dSRobert Muir <li>the corpus was split into two parts - one with 30,000 sets (Part 238*e8e4245dSRobert Muir1), the other with ~39,000 sets (Part 2).</li> 239*e8e4245dSRobert Muir <li>Training samples were drawn in sequential order from the Part 1. 240*e8e4245dSRobert MuirSince the sets were already randomized, the training samples were also 241*e8e4245dSRobert Muirrandomized, but this procedure ensured that each larger training sample 242*e8e4245dSRobert Muircontained all smaller samples.</li> 243*e8e4245dSRobert Muir <li>Part 2 was used for testing. Note: this means that the testing 244*e8e4245dSRobert Muirrun used <em>only</em> words previously unseen during the training 245*e8e4245dSRobert Muirphase. This is the worst scenario, because it means that stemmer must 246*e8e4245dSRobert Muirextrapolate the learned rules to unknown cases. This also means that in 247*e8e4245dSRobert Muira real-life case (where the input is a mix between known and unknown 248*e8e4245dSRobert Muirwords) the F-measure of the stemmer will be even higher than in the 249*e8e4245dSRobert Muirtable below.</li> 250*e8e4245dSRobert Muir</ul> 251*e8e4245dSRobert Muir<h3>Test results</h3> 252*e8e4245dSRobert Muir<p>The following table summarizes test results for varying sizes 253*e8e4245dSRobert Muirof training samples. The meaning of the table columns is 254*e8e4245dSRobert Muirdescribed below: 255*e8e4245dSRobert Muir</p> 256*e8e4245dSRobert Muir<ul> 257*e8e4245dSRobert Muir <li><b>training sets:</b> the number of training sets. One set 258*e8e4245dSRobert Muirconsists of one lemma and at least 4 and up to ~80 inflected forms 259*e8e4245dSRobert Muir(including pre- and suffixed forms).</li> 260*e8e4245dSRobert Muir <li><b>testing forms:</b> the number of testing forms. Only inflected 261*e8e4245dSRobert Muirforms were used in testing.</li> 262*e8e4245dSRobert Muir <li><b>stem OK:</b> the number of cases when produced output was a 263*e8e4245dSRobert Muircorrect (unique) stem. Note: quite often correct stems were also 264*e8e4245dSRobert Muircorrect lemmas.</li> 265*e8e4245dSRobert Muir <li><b>lemma OK:</b> the number of cases when produced output was a 266*e8e4245dSRobert Muircorrect lemma.</li> 267*e8e4245dSRobert Muir <li><b>missing:</b> the number of cases when stemmer was unable to 268*e8e4245dSRobert Muirprovide any output.</li> 269*e8e4245dSRobert Muir <li><b>stem bad:</b> the number of cases when produced output was a 270*e8e4245dSRobert Muirstem, but already in use identifying a different set.</li> 271*e8e4245dSRobert Muir <li><b>lemma bad:</b> the number of cases when produced output was an 272*e8e4245dSRobert Muirincorrect lemma. Note: quite often in such case the output was a 273*e8e4245dSRobert Muircorrect stem.</li> 274*e8e4245dSRobert Muir <li><b>table size:</b> the size in bytes of the stemmer table.</li> 275*e8e4245dSRobert Muir</ul> 276*e8e4245dSRobert Muir<div align="center"> 277*e8e4245dSRobert Muir<table border="1" cellpadding="2" cellspacing="0"> 278*e8e4245dSRobert Muir <tbody> 279*e8e4245dSRobert Muir <tr bgcolor="#a0b0c0"> 280*e8e4245dSRobert Muir <th>Training sets</th> 281*e8e4245dSRobert Muir <th>Testing forms</th> 282*e8e4245dSRobert Muir <th>Stem OK</th> 283*e8e4245dSRobert Muir <th>Lemma OK</th> 284*e8e4245dSRobert Muir <th>Missing</th> 285*e8e4245dSRobert Muir <th>Stem Bad</th> 286*e8e4245dSRobert Muir <th>Lemma Bad</th> 287*e8e4245dSRobert Muir <th>Table size [B]</th> 288*e8e4245dSRobert Muir </tr> 289*e8e4245dSRobert Muir <tr align="right"> 290*e8e4245dSRobert Muir <td>100</td> 291*e8e4245dSRobert Muir <td>1022985</td> 292*e8e4245dSRobert Muir <td>842209</td> 293*e8e4245dSRobert Muir <td>593632</td> 294*e8e4245dSRobert Muir <td>172711</td> 295*e8e4245dSRobert Muir <td>22331</td> 296*e8e4245dSRobert Muir <td>256642</td> 297*e8e4245dSRobert Muir <td>28438</td> 298*e8e4245dSRobert Muir </tr> 299*e8e4245dSRobert Muir <tr align="right"> 300*e8e4245dSRobert Muir <td>200</td> 301*e8e4245dSRobert Muir <td>1022985</td> 302*e8e4245dSRobert Muir <td>862789</td> 303*e8e4245dSRobert Muir <td>646488</td> 304*e8e4245dSRobert Muir <td>153288</td> 305*e8e4245dSRobert Muir <td>16306</td> 306*e8e4245dSRobert Muir <td>223209</td> 307*e8e4245dSRobert Muir <td>48660</td> 308*e8e4245dSRobert Muir </tr> 309*e8e4245dSRobert Muir <tr align="right"> 310*e8e4245dSRobert Muir <td>500</td> 311*e8e4245dSRobert Muir <td>1022985</td> 312*e8e4245dSRobert Muir <td>885786</td> 313*e8e4245dSRobert Muir <td>685009</td> 314*e8e4245dSRobert Muir <td>130772</td> 315*e8e4245dSRobert Muir <td>14856</td> 316*e8e4245dSRobert Muir <td>207204</td> 317*e8e4245dSRobert Muir <td>108798</td> 318*e8e4245dSRobert Muir </tr> 319*e8e4245dSRobert Muir <tr align="right"> 320*e8e4245dSRobert Muir <td>700</td> 321*e8e4245dSRobert Muir <td>1022985</td> 322*e8e4245dSRobert Muir <td>909031</td> 323*e8e4245dSRobert Muir <td>704609</td> 324*e8e4245dSRobert Muir <td>107084</td> 325*e8e4245dSRobert Muir <td>15442</td> 326*e8e4245dSRobert Muir <td>211292</td> 327*e8e4245dSRobert Muir <td>139291</td> 328*e8e4245dSRobert Muir </tr> 329*e8e4245dSRobert Muir <tr align="right"> 330*e8e4245dSRobert Muir <td>1000</td> 331*e8e4245dSRobert Muir <td>1022985</td> 332*e8e4245dSRobert Muir <td>926079</td> 333*e8e4245dSRobert Muir <td>725720</td> 334*e8e4245dSRobert Muir <td>90117</td> 335*e8e4245dSRobert Muir <td>14941</td> 336*e8e4245dSRobert Muir <td>207148</td> 337*e8e4245dSRobert Muir <td>183677</td> 338*e8e4245dSRobert Muir </tr> 339*e8e4245dSRobert Muir <tr align="right"> 340*e8e4245dSRobert Muir <td>2000</td> 341*e8e4245dSRobert Muir <td>1022985</td> 342*e8e4245dSRobert Muir <td>942886</td> 343*e8e4245dSRobert Muir <td>746641</td> 344*e8e4245dSRobert Muir <td>73429</td> 345*e8e4245dSRobert Muir <td>14903</td> 346*e8e4245dSRobert Muir <td>202915</td> 347*e8e4245dSRobert Muir <td>313516</td> 348*e8e4245dSRobert Muir </tr> 349*e8e4245dSRobert Muir <tr align="right"> 350*e8e4245dSRobert Muir <td>5000</td> 351*e8e4245dSRobert Muir <td>1022985</td> 352*e8e4245dSRobert Muir <td>954721</td> 353*e8e4245dSRobert Muir <td>759930</td> 354*e8e4245dSRobert Muir <td>61476</td> 355*e8e4245dSRobert Muir <td>14817</td> 356*e8e4245dSRobert Muir <td>201579</td> 357*e8e4245dSRobert Muir <td>640969</td> 358*e8e4245dSRobert Muir </tr> 359*e8e4245dSRobert Muir <tr align="right"> 360*e8e4245dSRobert Muir <td>7000</td> 361*e8e4245dSRobert Muir <td>1022985</td> 362*e8e4245dSRobert Muir <td>956165</td> 363*e8e4245dSRobert Muir <td>764033</td> 364*e8e4245dSRobert Muir <td>60364</td> 365*e8e4245dSRobert Muir <td>14620</td> 366*e8e4245dSRobert Muir <td>198588</td> 367*e8e4245dSRobert Muir <td>839347</td> 368*e8e4245dSRobert Muir </tr> 369*e8e4245dSRobert Muir <tr align="right"> 370*e8e4245dSRobert Muir <td>10000</td> 371*e8e4245dSRobert Muir <td>1022985</td> 372*e8e4245dSRobert Muir <td>965427</td> 373*e8e4245dSRobert Muir <td>775507</td> 374*e8e4245dSRobert Muir <td>50797</td> 375*e8e4245dSRobert Muir <td>14662</td> 376*e8e4245dSRobert Muir <td>196681</td> 377*e8e4245dSRobert Muir <td>1144537</td> 378*e8e4245dSRobert Muir </tr> 379*e8e4245dSRobert Muir <tr align="right"> 380*e8e4245dSRobert Muir <td>12000</td> 381*e8e4245dSRobert Muir <td>1022985</td> 382*e8e4245dSRobert Muir <td>967664</td> 383*e8e4245dSRobert Muir <td>782143</td> 384*e8e4245dSRobert Muir <td>48722</td> 385*e8e4245dSRobert Muir <td>14284</td> 386*e8e4245dSRobert Muir <td>192120</td> 387*e8e4245dSRobert Muir <td>1313508</td> 388*e8e4245dSRobert Muir </tr> 389*e8e4245dSRobert Muir <tr align="right"> 390*e8e4245dSRobert Muir <td>15000</td> 391*e8e4245dSRobert Muir <td>1022985</td> 392*e8e4245dSRobert Muir <td>973188</td> 393*e8e4245dSRobert Muir <td>788867</td> 394*e8e4245dSRobert Muir <td>43247</td> 395*e8e4245dSRobert Muir <td>14349</td> 396*e8e4245dSRobert Muir <td>190871</td> 397*e8e4245dSRobert Muir <td>1567902</td> 398*e8e4245dSRobert Muir </tr> 399*e8e4245dSRobert Muir <tr align="right"> 400*e8e4245dSRobert Muir <td>17000</td> 401*e8e4245dSRobert Muir <td>1022985</td> 402*e8e4245dSRobert Muir <td>974203</td> 403*e8e4245dSRobert Muir <td>791804</td> 404*e8e4245dSRobert Muir <td>42319</td> 405*e8e4245dSRobert Muir <td>14333</td> 406*e8e4245dSRobert Muir <td>188862</td> 407*e8e4245dSRobert Muir <td>1733957</td> 408*e8e4245dSRobert Muir </tr> 409*e8e4245dSRobert Muir <tr align="right"> 410*e8e4245dSRobert Muir <td>20000</td> 411*e8e4245dSRobert Muir <td>1022985</td> 412*e8e4245dSRobert Muir <td>976234</td> 413*e8e4245dSRobert Muir <td>791554</td> 414*e8e4245dSRobert Muir <td>40058</td> 415*e8e4245dSRobert Muir <td>14601</td> 416*e8e4245dSRobert Muir <td>191373</td> 417*e8e4245dSRobert Muir <td>1977615</td> 418*e8e4245dSRobert Muir </tr> 419*e8e4245dSRobert Muir </tbody> 420*e8e4245dSRobert Muir</table> 421*e8e4245dSRobert Muir</div> 422*e8e4245dSRobert Muir<p>I also measured the time to produce a stem (which involves 423*e8e4245dSRobert Muirtraversing a trie, 424*e8e4245dSRobert Muirretrieving a patch command and applying the patch command to the input 425*e8e4245dSRobert Muirstring). 426*e8e4245dSRobert MuirOn a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 427*e8e4245dSRobert MuirHotSpot), 428*e8e4245dSRobert Muirfor tables ranging in size from 1,000 to 20,000 cells, the time to 429*e8e4245dSRobert Muirproduce a 430*e8e4245dSRobert Muirsingle stem varies between 5-10 microseconds.<br> 431*e8e4245dSRobert Muir</p> 432*e8e4245dSRobert Muir<p>This means that the stemmer can process up to <span 433*e8e4245dSRobert Muir style="font-weight: bold;">200,000 words per second</span>, an 434*e8e4245dSRobert Muiroutstanding result when compared to other stemmers (Morfeusz - ~2,000 435*e8e4245dSRobert Muirw/s, FormAN (MS Word analyzer) - ~1,000 w/s).<br> 436*e8e4245dSRobert Muir</p> 437*e8e4245dSRobert Muir<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>, 438*e8e4245dSRobert Muirwhich you can use to produce reports 439*e8e4245dSRobert Muirlike the one below:<br> 440*e8e4245dSRobert Muir</p> 441*e8e4245dSRobert Muir<pre>--------- Stemmer benchmark report: -----------<br>Stemmer table: /res/tables/stemmer_2000.out<br>Input file: ../test3.txt<br>Number of runs: 3<br><br> RUN NUMBER: 1 2 3<br> Total input words 1378176 1378176 1378176<br> Missed output words 112 112 112<br> Time elapsed [ms] 6989 6940 6640<br> Hit rate percent 99.99% 99.99% 99.99%<br> Miss rate percent 00.01% 00.01% 00.01%<br> Words per second 197192 198584 207557<br> Time per word [us] 5.07 5.04 4.82<br></pre> 442*e8e4245dSRobert Muir<h2>Summary</h2> 443*e8e4245dSRobert Muir<p>The results of these tests are very encouraging. It seems that using 444*e8e4245dSRobert Muirthe 445*e8e4245dSRobert Muirtraining corpus and the stemming algorithm described above results in a 446*e8e4245dSRobert Muirhigh-quality stemmer useful for most applications. Moreover, it can 447*e8e4245dSRobert Muiralso 448*e8e4245dSRobert Muirbe used as a better than average lemmatizer.</p> 449*e8e4245dSRobert Muir<p>Both the author of the implementation 450*e8e4245dSRobert Muir(Leo Galambos, <leo.galambos AT egothor DOT org>) and the author 451*e8e4245dSRobert Muirof this 452*e8e4245dSRobert Muircompilation (Andrzej Bialecki <ab AT getopt DOT org>) would 453*e8e4245dSRobert Muirappreciate any 454*e8e4245dSRobert Muirfeedback and suggestions for further improvements.</p> 455*e8e4245dSRobert Muir<h2>Bibliography</h2> 456*e8e4245dSRobert Muir<ol> 457*e8e4245dSRobert Muir <li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD 458*e8e4245dSRobert MuirThesis, 459*e8e4245dSRobert MuirFaculty of Mathematics and Physics, Charles University in Prague, in 460*e8e4245dSRobert Muirpress.</li> 461*e8e4245dSRobert Muir <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International 462*e8e4245dSRobert MuirIntelligent Information Processing and Web Mining Conference, 2004, 463*e8e4245dSRobert MuirZakopane, Poland.</li> 464*e8e4245dSRobert Muir <li>Galambos, L.: Lemmatizer for Document Information Retrieval 465*e8e4245dSRobert MuirSystems in JAVA.<span style="text-decoration: underline;"> </span><a 466*e8e4245dSRobert Muir class="moz-txt-link-rfc2396E" 467*e8e4245dSRobert Muir href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01"><http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01></a> 468*e8e4245dSRobert MuirSOFSEM 2001, Piestany, Slovakia. <br> 469*e8e4245dSRobert Muir </li> 470*e8e4245dSRobert Muir</ol> 471*e8e4245dSRobert Muir<br> 472*e8e4245dSRobert Muir<br> 473*e8e4245dSRobert Muir</body> 474*e8e4245dSRobert Muir</html> 475