1e8e4245dSRobert Muir<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 2e8e4245dSRobert Muir<!-- 3e8e4245dSRobert Muir Licensed to the Apache Software Foundation (ASF) under one or more 4e8e4245dSRobert Muir contributor license agreements. See the NOTICE file distributed with 5e8e4245dSRobert Muir this work for additional information regarding copyright ownership. 6e8e4245dSRobert Muir The ASF licenses this file to You under the Apache License, Version 2.0 7e8e4245dSRobert Muir (the "License"); you may not use this file except in compliance with 8e8e4245dSRobert Muir the License. You may obtain a copy of the License at 9e8e4245dSRobert Muir 10e8e4245dSRobert Muir http://www.apache.org/licenses/LICENSE-2.0 11e8e4245dSRobert Muir 12e8e4245dSRobert Muir Unless required by applicable law or agreed to in writing, software 13e8e4245dSRobert Muir distributed under the License is distributed on an "AS IS" BASIS, 14e8e4245dSRobert Muir WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15e8e4245dSRobert Muir See the License for the specific language governing permissions and 16e8e4245dSRobert Muir limitations under the License. 17e8e4245dSRobert Muir--> 18e8e4245dSRobert Muir<html> 19e8e4245dSRobert Muir<head> 20e8e4245dSRobert Muir <meta content="text/html; charset=UTF-8" http-equiv="content-type"> 21e8e4245dSRobert Muir <title>Stempel - Algorithmic Stemmer for Polish Language</title> 22e8e4245dSRobert Muir <meta content="Andrzej Bialecki" name="author"> 23e8e4245dSRobert Muir <meta name="keywords" 24e8e4245dSRobert Muir content="stemming, stemmer, algorithmic stemmer, Polish stemmer"> 25e8e4245dSRobert Muir <meta 26e8e4245dSRobert Muir content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables." 27e8e4245dSRobert Muir name="description"> 28e8e4245dSRobert Muir</head> 29e8e4245dSRobert Muir<body style="font-family: Arial,SansSerif;"> 30e8e4245dSRobert Muir<h1><i>Stempel</i> - Algorithmic Stemmer for Polish Language</h1> 31e8e4245dSRobert Muir<h2>Introduction</h2> 32e8e4245dSRobert Muir<p>A method for conflation of different inflected word forms is an 33e8e4245dSRobert Muirimportant component of many Information Retrieval systems. It helps to 34e8e4245dSRobert Muirimprove the system's recall and can significantly reduce the index 35e8e4245dSRobert Muirsize. This is especially true for highly-inflectional languages like 36e8e4245dSRobert Muirthose from the Slavic language family (Czech, Slovak, Polish, Russian, 37e8e4245dSRobert MuirBulgarian, etc).</p> 38e8e4245dSRobert Muir<p>This page describes a software package consisting of high-quality 39e8e4245dSRobert Muirstemming tables for Polish, and a universal algorithmic stemmer, which 40e8e4245dSRobert Muiroperates using these tables. The stemmer code is taken virtually 41e8e4245dSRobert Muirunchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p> 42e8e4245dSRobert Muir<p>The software distribution includes stemmer 43e8e4245dSRobert Muirtables prepared using an extensive corpus of Polish language (see 44e8e4245dSRobert Muirdetails below).</p> 45e8e4245dSRobert Muir<p>This work is available under Apache-style Open Source license - the 46e8e4245dSRobert Muirstemmer code is covered by Egothor License, the tables and other 47e8e4245dSRobert Muiradditions are covered by Apache License 2.0. Both licenses allow to use 48e8e4245dSRobert Muirthe code in Open Source as well as commercial (closed source) projects.</p> 49e8e4245dSRobert Muir<h3>Terminology</h3> 50e8e4245dSRobert Muir<p>A short explanation is in order about the terminology used in this 51e8e4245dSRobert Muirtext.</p> 52e8e4245dSRobert Muir<p>In the following sections I make a distinction between <b>stem</b> 53e8e4245dSRobert Muirand <b>lemma</b>.</p> 54e8e4245dSRobert Muir<p>Lemma is a base grammatical form (dictionary form, headword) of a 55e8e4245dSRobert Muirword. Lemma is an existing, grammatically correct word in some human 56e8e4245dSRobert Muirlanguage.</p> 57e8e4245dSRobert Muir<p>Stem on the other hand is just a unique token, not necessarily 58e8e4245dSRobert Muirmaking any sense in any human language, but which can serve as a unique 59e8e4245dSRobert Muirlabel instead of lemma for the same set of inflected forms. Quite often 60e8e4245dSRobert Muirstem is referred to as a "root" of the word - which is incorrect and 61e8e4245dSRobert Muirmisleading (stems sometimes have very little to do with the linguistic 62e8e4245dSRobert Muirroot of a word, i.e. a pattern found in a word which is common to all 63e8e4245dSRobert Muirinflected forms or within a family of languages).</p> 64e8e4245dSRobert Muir<p>For an IR system stems are usually sufficient, for a morphological 65e8e4245dSRobert Muiranalysis system obviously lemmas are a must. In practice, various 66e8e4245dSRobert Muirstemmers produce a mix of stems and lemmas, as is the case with the 67e8e4245dSRobert Muirstemmer described here. Additionally, for some languages, which use 68e8e4245dSRobert Muirsuffix-based inflection rules many stemmers based on suffix-stripping 69e8e4245dSRobert Muirwill produce a large percentage of stems equivalent to lemmas. This is 70e8e4245dSRobert Muirhowever not the case for languages with complex, irregular inflection 71e8e4245dSRobert Muirrules (such as Slavic languages) - here simplistic suffix-stripping 72e8e4245dSRobert Muirstemmers produce very poor results.</p> 73e8e4245dSRobert Muir<h3>Background</h3> 74e8e4245dSRobert Muir<p>Lemmatization is a process of finding the base, non-inflected form 75e8e4245dSRobert Muirof a word. The result of lemmatization is a correct existing word, 76e8e4245dSRobert Muiroften in nominative case for nouns and infinitive form for verbs. A 77e8e4245dSRobert Muirgiven inflected form may correspond to several lemmas (e.g. "found" 78e8e4245dSRobert Muir-> find, found) - the correct choice depends on the context.<br> 79e8e4245dSRobert Muir<br> 80e8e4245dSRobert MuirStemming is concerned mostly with finding a unique "root" of a word, 81e8e4245dSRobert Muirwhich not necessarily results in any existing word or lemma. The 82e8e4245dSRobert Muirquality of stemming is measured by the rate of collisions (overstemming 83e8e4245dSRobert Muir- which causes words with different lemmas to be incorrectly conflated 84e8e4245dSRobert Muirinto one "root"), and the rate of superfluous word "roots" 85e8e4245dSRobert Muir(understemming - which assigns several "roots" to words with the same 86e8e4245dSRobert Muirlemma). <br> 87e8e4245dSRobert Muir<br> 88e8e4245dSRobert MuirBoth stemmer and lemmatizer can be implemented in various ways. The two 89e8e4245dSRobert Muirmost common approaches are:<br> 90e8e4245dSRobert Muir</p> 91e8e4245dSRobert Muir<ul> 92e8e4245dSRobert Muir <li>dictionary-based: where the stemmer uses an extensive dictionary 93e8e4245dSRobert Muirof morphological forms in order to find the corresponding stem or lemma</li> 94e8e4245dSRobert Muir <li>algorithmic: where the stemmer uses an algorithm, based on 95e8e4245dSRobert Muirgeneral morphological properties of a given language plus a set of 96e8e4245dSRobert Muirheuristic rules<br> 97e8e4245dSRobert Muir </li> 98e8e4245dSRobert Muir</ul> 99e8e4245dSRobert MuirThere are many existing and well-known implementations of stemmers for 100e8e4245dSRobert MuirEnglish (Porter, Lovins, Krovetz) and other European languages 101*e290f91bSDawid Weiss(<a href="https://snowballstem.org/">Snowball</a>). There are also 102e8e4245dSRobert Muirgood quality commercial lemmatizers for Polish. However, there is only 103e8e4245dSRobert Muirone 104e8e4245dSRobert Muirfreely available Polish stemmer, implemented by 105e8e4245dSRobert Muir<a 106e8e4245dSRobert Muir href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid 107e8e4245dSRobert MuirWeiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a 108e8e4245dSRobert Muir href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That 109e8e4245dSRobert Muirstemmer is dictionary-based. This means that even 110e8e4245dSRobert Muirthough it can achieve 111e8e4245dSRobert Muirperfect accuracy for previously known word forms found in its 112e8e4245dSRobert Muirdictionary, it 113e8e4245dSRobert Muircompletely fails in case of all other word forms. This deficiency is 114e8e4245dSRobert Muirsomewhat mitigated by the comprehensive dictionary distributed with 115e8e4245dSRobert Muirthis stemmer (so there is a high probability that most of the words in 116e8e4245dSRobert Muirthe input text will be found in the dictionary), however the problem 117e8e4245dSRobert Muirstill remains (please see the page above for more detailed description).<br> 118e8e4245dSRobert Muir<br> 119e8e4245dSRobert MuirThe implementation described here uses an algorithmic method. This 120e8e4245dSRobert Muirmethod 121e8e4245dSRobert Muirand particular algorithm implementation are described in detail in 122e8e4245dSRobert Muir[1][2]. 123e8e4245dSRobert MuirThe main advantage of algorithmic stemmers is their ability to process 124e8e4245dSRobert Muirpreviously 125e8e4245dSRobert Muirunseen word forms with high accuracy. This particular algorithm uses a 126e8e4245dSRobert Muirset 127e8e4245dSRobert Muirof 128e8e4245dSRobert Muirtransformation rules (patch commands), which describe how a word with a 129e8e4245dSRobert Muirgiven pattern should be transformed to its stem. These rules are first 130e8e4245dSRobert Muirlearned from a training corpus. They don't 131e8e4245dSRobert Muircover 132e8e4245dSRobert Muirall possible cases, so there is always some loss of precision/recall 133e8e4245dSRobert Muir(which 134e8e4245dSRobert Muirmeans that even the words from the training corpus are sometimes 135e8e4245dSRobert Muirincorrectly stemmed).<br> 1360d339043SRobert Muir<h2>Algorithm and implementation</h2> 137e8e4245dSRobert MuirThe algorithm and its Java implementation is described in detail in the 138e8e4245dSRobert Muirpublications cited below. Here's just a short excerpt from [2]:<br> 139e8e4245dSRobert Muir<br> 1400d339043SRobert Muir<div style="width: 80%; text-align: center">"The aim is separation of the 141e8e4245dSRobert Muirstemmer execution code from the data 142e8e4245dSRobert Muirstructures [...]. In other words, a static algorithm configurable by 143e8e4245dSRobert Muirdata must be developed. The word transformations that happen in the 144e8e4245dSRobert Muirstemmer must be then encoded to the data tables.<br> 145e8e4245dSRobert Muir<br> 146e8e4245dSRobert MuirThe tacit input of our method is a sample set (a so-called dictionary) 147e8e4245dSRobert Muirof words (as keys) and their stems. Each record can be equivalently 148e8e4245dSRobert Muirstored as a key and the record of key's transformation to its 149e8e4245dSRobert Muirrespective stem. The transformation record is termed a patch command 150e8e4245dSRobert Muir(P-command). It must be ensured that P-commands are universal, and that 151e8e4245dSRobert MuirP-commands can transform any word to its stem. Our solution[6,8] is 152e8e4245dSRobert Muirbased on the Levenstein metric [10], which produces P-command as the 153e8e4245dSRobert Muirminimum cost path in a directed graph.<br> 154e8e4245dSRobert Muir<br> 155e8e4245dSRobert MuirOne can imagine the P-command as an algorithm for an operator (editor) 156e8e4245dSRobert Muirthat rewrites a string to another string. The operator can use these 157e8e4245dSRobert Muirinstructions (PP-command's): <span style="font-weight: bold;">removal </span>- 158e8e4245dSRobert Muirdeletes a sequence of characters starting at the current cursor 159e8e4245dSRobert Muirposition and moves the cursor to the next character. The length of this 160e8e4245dSRobert Muirsequence is the parameter; <span style="font-weight: bold;">insertion </span>- 161e8e4245dSRobert Muirinserts a character ch, without moving the cursor. The character ch is 162e8e4245dSRobert Muira parameter; <span style="font-weight: bold;">substitution </span> 163e8e4245dSRobert Muir- rewrites a character at the current cursor position to the character 164e8e4245dSRobert Muirch and moves the cursor to the next character. The character ch is a 165e8e4245dSRobert Muirparameter; <span style="font-weight: bold;">no operation</span> (NOOP) 166e8e4245dSRobert Muir- skip a sequence of characters starting at the current cursor 167e8e4245dSRobert Muirposition. The length of this sequence is the parameter.<br> 168e8e4245dSRobert Muir<br> 169e8e4245dSRobert MuirThe P-commands are applied from the end of a word (right to left). This 170e8e4245dSRobert Muirassumption can reduce the set of P-command's, because the last NOOP, 171e8e4245dSRobert Muirmoving the cursor to the end of a string without any changes, need not 172e8e4245dSRobert Muirbe stored."</div> 173e8e4245dSRobert Muir<br> 174e8e4245dSRobert MuirData structure used to keep the dictionary (words and their P-commands) 175e8e4245dSRobert Muiris a trie. Several optimization steps are applied in turn to reduce and 176e8e4245dSRobert Muiroptimize the initial trie, by eliminating useless information and 177e8e4245dSRobert Muirshortening the paths in the trie.<br> 178e8e4245dSRobert Muir<br> 179e8e4245dSRobert MuirFinally, in order to obtain a stem from the input word, the word is 180e8e4245dSRobert Muirpassed once through a matching path in the trie (applying at each node 181e8e4245dSRobert Muirthe P-commands stored there). The result is a word stem.<br> 182e8e4245dSRobert Muir<h2>Corpus</h2> 183e8e4245dSRobert Muir<p><i>(to be completed...)</i></p> 184e8e4245dSRobert Muir<p>The following Polish corpora have been used:</p> 185e8e4245dSRobert Muir<ul> 186e8e4245dSRobert Muir <li><a 187e8e4245dSRobert Muir href="http://sourceforge.net/project/showfiles.php?group_id=49316&package_id=65354">Polish 188e8e4245dSRobert Muirdictionary 189e8e4245dSRobert Muirfrom ispell distribution</a></li> 190e8e4245dSRobert Muir <li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus 191e8e4245dSRobert Muirsłownika frekwencyjnego</a></li> 192e8e4245dSRobert Muir<!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>--> 193e8e4245dSRobert Muir<!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>--><li>The 194e8e4245dSRobert MuirBible (so called "Tysiąclecia") - unauthorized electronic version</li> 195e8e4245dSRobert Muir <li><a 196e8e4245dSRobert Muir href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator 197e8e4245dSRobert Muirmorfologiczny SAM v. 3.4</a> - this was used to recover lemmas 198e8e4245dSRobert Muirmissing from other texts</li> 199e8e4245dSRobert Muir</ul> 200e8e4245dSRobert Muir<p>This step was the most time-consuming - and it would probably be 201e8e4245dSRobert Muireven more tedious and difficult if not for the 202e8e4245dSRobert Muirhelp of 203e8e4245dSRobert Muir<a href="http://www.python.org/">Python</a>. The source texts had to be 204e8e4245dSRobert Muirbrought to a common encoding (UTF-8) - some of them used quite ancient 205e8e4245dSRobert Muirencodings like Mazovia or DHN - and then scripts were written to 206e8e4245dSRobert Muircollect all lemmas and 207e8e4245dSRobert Muirinflected forms from the source texts. In cases when the source text 208e8e4245dSRobert Muirwas not 209e8e4245dSRobert Muirtagged, 210e8e4245dSRobert MuirI used the SAM analyzer to produce lemmas. In cases of ambiguous 211e8e4245dSRobert Muirlemmatization I decided to put references to inflected forms from all 212e8e4245dSRobert Muirbase forms.<br> 213e8e4245dSRobert Muir</p> 214e8e4245dSRobert Muir<p>All grammatical categories were allowed to appear in the corpus, 215e8e4245dSRobert Muiri.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting 216e8e4245dSRobert Muircorpus consisted of roughly 87,000+ inflection sets, i.e. each set 217e8e4245dSRobert Muirconsisted of one base form (lemma) and many inflected forms. However, 218e8e4245dSRobert Muirbecause of the nature of the training method I restricted these sets to 219e8e4245dSRobert Muirinclude only those where there were at least 4 inflected forms. Sets 220e8e4245dSRobert Muirwith 3 or less inflected forms were removed, so that the final corpus 221e8e4245dSRobert Muirconsisted of ~69,000 unique sets, which in turn contained ~1.5 mln 222e8e4245dSRobert Muirinflected forms. <br> 223e8e4245dSRobert Muir</p> 224e8e4245dSRobert Muir<h2>Testing</h2> 225e8e4245dSRobert Muir<p>I tested the stemmer tables produced using the implementation 226e8e4245dSRobert Muirdescribed above. The following sections give some details about 227e8e4245dSRobert Muirthe testing setup. 228e8e4245dSRobert Muir</p> 229e8e4245dSRobert Muir<h3>Testing procedure</h3> 230e8e4245dSRobert Muir<p>The testing procedure was as follows: 231e8e4245dSRobert Muir</p> 232e8e4245dSRobert Muir<ul> 233e8e4245dSRobert Muir <li>the whole corpus of ~69,000 unique sets was shuffled, so that the 234e8e4245dSRobert Muirinput sets were in random order.</li> 235e8e4245dSRobert Muir <li>the corpus was split into two parts - one with 30,000 sets (Part 236e8e4245dSRobert Muir1), the other with ~39,000 sets (Part 2).</li> 237e8e4245dSRobert Muir <li>Training samples were drawn in sequential order from the Part 1. 238e8e4245dSRobert MuirSince the sets were already randomized, the training samples were also 239e8e4245dSRobert Muirrandomized, but this procedure ensured that each larger training sample 240e8e4245dSRobert Muircontained all smaller samples.</li> 241e8e4245dSRobert Muir <li>Part 2 was used for testing. Note: this means that the testing 242e8e4245dSRobert Muirrun used <em>only</em> words previously unseen during the training 243e8e4245dSRobert Muirphase. This is the worst scenario, because it means that stemmer must 244e8e4245dSRobert Muirextrapolate the learned rules to unknown cases. This also means that in 245e8e4245dSRobert Muira real-life case (where the input is a mix between known and unknown 246e8e4245dSRobert Muirwords) the F-measure of the stemmer will be even higher than in the 247e8e4245dSRobert Muirtable below.</li> 248e8e4245dSRobert Muir</ul> 249e8e4245dSRobert Muir<h3>Test results</h3> 250e8e4245dSRobert Muir<p>The following table summarizes test results for varying sizes 251e8e4245dSRobert Muirof training samples. The meaning of the table columns is 252e8e4245dSRobert Muirdescribed below: 253e8e4245dSRobert Muir</p> 254e8e4245dSRobert Muir<ul> 255e8e4245dSRobert Muir <li><b>training sets:</b> the number of training sets. One set 256e8e4245dSRobert Muirconsists of one lemma and at least 4 and up to ~80 inflected forms 257e8e4245dSRobert Muir(including pre- and suffixed forms).</li> 258e8e4245dSRobert Muir <li><b>testing forms:</b> the number of testing forms. Only inflected 259e8e4245dSRobert Muirforms were used in testing.</li> 260e8e4245dSRobert Muir <li><b>stem OK:</b> the number of cases when produced output was a 261e8e4245dSRobert Muircorrect (unique) stem. Note: quite often correct stems were also 262e8e4245dSRobert Muircorrect lemmas.</li> 263e8e4245dSRobert Muir <li><b>lemma OK:</b> the number of cases when produced output was a 264e8e4245dSRobert Muircorrect lemma.</li> 265e8e4245dSRobert Muir <li><b>missing:</b> the number of cases when stemmer was unable to 266e8e4245dSRobert Muirprovide any output.</li> 267e8e4245dSRobert Muir <li><b>stem bad:</b> the number of cases when produced output was a 268e8e4245dSRobert Muirstem, but already in use identifying a different set.</li> 269e8e4245dSRobert Muir <li><b>lemma bad:</b> the number of cases when produced output was an 270e8e4245dSRobert Muirincorrect lemma. Note: quite often in such case the output was a 271e8e4245dSRobert Muircorrect stem.</li> 272e8e4245dSRobert Muir <li><b>table size:</b> the size in bytes of the stemmer table.</li> 273e8e4245dSRobert Muir</ul> 2740d339043SRobert Muir<table class="padding2" style="border: 1px solid; border-spacing: 0px; border-collapse: separate"> 275f41eabdcSRobert Muir <caption>test results for different sizes of training samples</caption> 276e8e4245dSRobert Muir <tbody> 2770d339043SRobert Muir <tr style="background-color: #a0b0c0"> 278e8e4245dSRobert Muir <th>Training sets</th> 279e8e4245dSRobert Muir <th>Testing forms</th> 280e8e4245dSRobert Muir <th>Stem OK</th> 281e8e4245dSRobert Muir <th>Lemma OK</th> 282e8e4245dSRobert Muir <th>Missing</th> 283e8e4245dSRobert Muir <th>Stem Bad</th> 284e8e4245dSRobert Muir <th>Lemma Bad</th> 285e8e4245dSRobert Muir <th>Table size [B]</th> 286e8e4245dSRobert Muir </tr> 2870d339043SRobert Muir <tr style="text-align: right"> 288e8e4245dSRobert Muir <td>100</td> 289e8e4245dSRobert Muir <td>1022985</td> 290e8e4245dSRobert Muir <td>842209</td> 291e8e4245dSRobert Muir <td>593632</td> 292e8e4245dSRobert Muir <td>172711</td> 293e8e4245dSRobert Muir <td>22331</td> 294e8e4245dSRobert Muir <td>256642</td> 295e8e4245dSRobert Muir <td>28438</td> 296e8e4245dSRobert Muir </tr> 2970d339043SRobert Muir <tr style="text-align: right"> 298e8e4245dSRobert Muir <td>200</td> 299e8e4245dSRobert Muir <td>1022985</td> 300e8e4245dSRobert Muir <td>862789</td> 301e8e4245dSRobert Muir <td>646488</td> 302e8e4245dSRobert Muir <td>153288</td> 303e8e4245dSRobert Muir <td>16306</td> 304e8e4245dSRobert Muir <td>223209</td> 305e8e4245dSRobert Muir <td>48660</td> 306e8e4245dSRobert Muir </tr> 3070d339043SRobert Muir <tr style="text-align: right"> 308e8e4245dSRobert Muir <td>500</td> 309e8e4245dSRobert Muir <td>1022985</td> 310e8e4245dSRobert Muir <td>885786</td> 311e8e4245dSRobert Muir <td>685009</td> 312e8e4245dSRobert Muir <td>130772</td> 313e8e4245dSRobert Muir <td>14856</td> 314e8e4245dSRobert Muir <td>207204</td> 315e8e4245dSRobert Muir <td>108798</td> 316e8e4245dSRobert Muir </tr> 3170d339043SRobert Muir <tr style="text-align: right"> 318e8e4245dSRobert Muir <td>700</td> 319e8e4245dSRobert Muir <td>1022985</td> 320e8e4245dSRobert Muir <td>909031</td> 321e8e4245dSRobert Muir <td>704609</td> 322e8e4245dSRobert Muir <td>107084</td> 323e8e4245dSRobert Muir <td>15442</td> 324e8e4245dSRobert Muir <td>211292</td> 325e8e4245dSRobert Muir <td>139291</td> 326e8e4245dSRobert Muir </tr> 3270d339043SRobert Muir <tr style="text-align: right"> 328e8e4245dSRobert Muir <td>1000</td> 329e8e4245dSRobert Muir <td>1022985</td> 330e8e4245dSRobert Muir <td>926079</td> 331e8e4245dSRobert Muir <td>725720</td> 332e8e4245dSRobert Muir <td>90117</td> 333e8e4245dSRobert Muir <td>14941</td> 334e8e4245dSRobert Muir <td>207148</td> 335e8e4245dSRobert Muir <td>183677</td> 336e8e4245dSRobert Muir </tr> 3370d339043SRobert Muir <tr style="text-align: right"> 338e8e4245dSRobert Muir <td>2000</td> 339e8e4245dSRobert Muir <td>1022985</td> 340e8e4245dSRobert Muir <td>942886</td> 341e8e4245dSRobert Muir <td>746641</td> 342e8e4245dSRobert Muir <td>73429</td> 343e8e4245dSRobert Muir <td>14903</td> 344e8e4245dSRobert Muir <td>202915</td> 345e8e4245dSRobert Muir <td>313516</td> 346e8e4245dSRobert Muir </tr> 3470d339043SRobert Muir <tr style="text-align: right"> 348e8e4245dSRobert Muir <td>5000</td> 349e8e4245dSRobert Muir <td>1022985</td> 350e8e4245dSRobert Muir <td>954721</td> 351e8e4245dSRobert Muir <td>759930</td> 352e8e4245dSRobert Muir <td>61476</td> 353e8e4245dSRobert Muir <td>14817</td> 354e8e4245dSRobert Muir <td>201579</td> 355e8e4245dSRobert Muir <td>640969</td> 356e8e4245dSRobert Muir </tr> 3570d339043SRobert Muir <tr style="text-align: right"> 358e8e4245dSRobert Muir <td>7000</td> 359e8e4245dSRobert Muir <td>1022985</td> 360e8e4245dSRobert Muir <td>956165</td> 361e8e4245dSRobert Muir <td>764033</td> 362e8e4245dSRobert Muir <td>60364</td> 363e8e4245dSRobert Muir <td>14620</td> 364e8e4245dSRobert Muir <td>198588</td> 365e8e4245dSRobert Muir <td>839347</td> 366e8e4245dSRobert Muir </tr> 3670d339043SRobert Muir <tr style="text-align: right"> 368e8e4245dSRobert Muir <td>10000</td> 369e8e4245dSRobert Muir <td>1022985</td> 370e8e4245dSRobert Muir <td>965427</td> 371e8e4245dSRobert Muir <td>775507</td> 372e8e4245dSRobert Muir <td>50797</td> 373e8e4245dSRobert Muir <td>14662</td> 374e8e4245dSRobert Muir <td>196681</td> 375e8e4245dSRobert Muir <td>1144537</td> 376e8e4245dSRobert Muir </tr> 3770d339043SRobert Muir <tr style="text-align: right"> 378e8e4245dSRobert Muir <td>12000</td> 379e8e4245dSRobert Muir <td>1022985</td> 380e8e4245dSRobert Muir <td>967664</td> 381e8e4245dSRobert Muir <td>782143</td> 382e8e4245dSRobert Muir <td>48722</td> 383e8e4245dSRobert Muir <td>14284</td> 384e8e4245dSRobert Muir <td>192120</td> 385e8e4245dSRobert Muir <td>1313508</td> 386e8e4245dSRobert Muir </tr> 3870d339043SRobert Muir <tr style="text-align: right"> 388e8e4245dSRobert Muir <td>15000</td> 389e8e4245dSRobert Muir <td>1022985</td> 390e8e4245dSRobert Muir <td>973188</td> 391e8e4245dSRobert Muir <td>788867</td> 392e8e4245dSRobert Muir <td>43247</td> 393e8e4245dSRobert Muir <td>14349</td> 394e8e4245dSRobert Muir <td>190871</td> 395e8e4245dSRobert Muir <td>1567902</td> 396e8e4245dSRobert Muir </tr> 3970d339043SRobert Muir <tr style="text-align: right"> 398e8e4245dSRobert Muir <td>17000</td> 399e8e4245dSRobert Muir <td>1022985</td> 400e8e4245dSRobert Muir <td>974203</td> 401e8e4245dSRobert Muir <td>791804</td> 402e8e4245dSRobert Muir <td>42319</td> 403e8e4245dSRobert Muir <td>14333</td> 404e8e4245dSRobert Muir <td>188862</td> 405e8e4245dSRobert Muir <td>1733957</td> 406e8e4245dSRobert Muir </tr> 4070d339043SRobert Muir <tr style="text-align: right"> 408e8e4245dSRobert Muir <td>20000</td> 409e8e4245dSRobert Muir <td>1022985</td> 410e8e4245dSRobert Muir <td>976234</td> 411e8e4245dSRobert Muir <td>791554</td> 412e8e4245dSRobert Muir <td>40058</td> 413e8e4245dSRobert Muir <td>14601</td> 414e8e4245dSRobert Muir <td>191373</td> 415e8e4245dSRobert Muir <td>1977615</td> 416e8e4245dSRobert Muir </tr> 417e8e4245dSRobert Muir </tbody> 418e8e4245dSRobert Muir</table> 419e8e4245dSRobert Muir<p>I also measured the time to produce a stem (which involves 420e8e4245dSRobert Muirtraversing a trie, 421e8e4245dSRobert Muirretrieving a patch command and applying the patch command to the input 422e8e4245dSRobert Muirstring). 423e8e4245dSRobert MuirOn a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 424e8e4245dSRobert MuirHotSpot), 425e8e4245dSRobert Muirfor tables ranging in size from 1,000 to 20,000 cells, the time to 426e8e4245dSRobert Muirproduce a 427e8e4245dSRobert Muirsingle stem varies between 5-10 microseconds.<br> 428e8e4245dSRobert Muir</p> 429e8e4245dSRobert Muir<p>This means that the stemmer can process up to <span 430e8e4245dSRobert Muir style="font-weight: bold;">200,000 words per second</span>, an 431e8e4245dSRobert Muiroutstanding result when compared to other stemmers (Morfeusz - ~2,000 432e8e4245dSRobert Muirw/s, FormAN (MS Word analyzer) - ~1,000 w/s).<br> 433e8e4245dSRobert Muir</p> 434e8e4245dSRobert Muir<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>, 435e8e4245dSRobert Muirwhich you can use to produce reports 436e8e4245dSRobert Muirlike the one below:<br> 437e8e4245dSRobert Muir</p> 438e8e4245dSRobert Muir<pre>--------- Stemmer benchmark report: -----------<br>Stemmer table: /res/tables/stemmer_2000.out<br>Input file: ../test3.txt<br>Number of runs: 3<br><br> RUN NUMBER: 1 2 3<br> Total input words 1378176 1378176 1378176<br> Missed output words 112 112 112<br> Time elapsed [ms] 6989 6940 6640<br> Hit rate percent 99.99% 99.99% 99.99%<br> Miss rate percent 00.01% 00.01% 00.01%<br> Words per second 197192 198584 207557<br> Time per word [us] 5.07 5.04 4.82<br></pre> 439e8e4245dSRobert Muir<h2>Summary</h2> 440e8e4245dSRobert Muir<p>The results of these tests are very encouraging. It seems that using 441e8e4245dSRobert Muirthe 442e8e4245dSRobert Muirtraining corpus and the stemming algorithm described above results in a 443e8e4245dSRobert Muirhigh-quality stemmer useful for most applications. Moreover, it can 444e8e4245dSRobert Muiralso 445e8e4245dSRobert Muirbe used as a better than average lemmatizer.</p> 446e8e4245dSRobert Muir<p>Both the author of the implementation 447e8e4245dSRobert Muir(Leo Galambos, <leo.galambos AT egothor DOT org>) and the author 448e8e4245dSRobert Muirof this 449e8e4245dSRobert Muircompilation (Andrzej Bialecki <ab AT getopt DOT org>) would 450e8e4245dSRobert Muirappreciate any 451e8e4245dSRobert Muirfeedback and suggestions for further improvements.</p> 452e8e4245dSRobert Muir<h2>Bibliography</h2> 453e8e4245dSRobert Muir<ol> 454e8e4245dSRobert Muir <li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD 455e8e4245dSRobert MuirThesis, 456e8e4245dSRobert MuirFaculty of Mathematics and Physics, Charles University in Prague, in 457e8e4245dSRobert Muirpress.</li> 458e8e4245dSRobert Muir <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International 459e8e4245dSRobert MuirIntelligent Information Processing and Web Mining Conference, 2004, 460e8e4245dSRobert MuirZakopane, Poland.</li> 461e8e4245dSRobert Muir <li>Galambos, L.: Lemmatizer for Document Information Retrieval 4620d339043SRobert MuirSystems in JAVA.<a 463e8e4245dSRobert Muir class="moz-txt-link-rfc2396E" 464e8e4245dSRobert Muir href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01"><http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01></a> 465e8e4245dSRobert MuirSOFSEM 2001, Piestany, Slovakia. <br> 466e8e4245dSRobert Muir </li> 467e8e4245dSRobert Muir</ol> 468e8e4245dSRobert Muir<br> 469e8e4245dSRobert Muir<br> 470e8e4245dSRobert Muir</body> 471e8e4245dSRobert Muir</html> 472