<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<html>
<head>
  <meta content="text/html; charset=UTF-8" http-equiv="content-type">
  <title>Stempel - Algorithmic Stemmer for Polish Language</title>
  <meta content="Andrzej Bialecki" name="author">
  <meta name="keywords"
        content="stemming, stemmer, algorithmic stemmer, Polish stemmer">
  <meta name="description"
        content="This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables.">
</head>
<body style="font-family: Arial,SansSerif;">
<h1><i>Stempel</i> - Algorithmic Stemmer for Polish Language</h1>
<h2>Introduction</h2>
<p>A method for conflating different inflected word forms is an
important component of many Information Retrieval systems. It helps to
improve the system's recall and can significantly reduce the index
size.
This is especially true for highly inflectional languages like
those from the Slavic language family (Czech, Slovak, Polish, Russian,
Bulgarian, etc.).</p>
<p>This page describes a software package consisting of high-quality
stemming tables for Polish, and a universal algorithmic stemmer, which
operates using these tables. The stemmer code is taken virtually
unchanged from the <a href="http://www.egothor.org">Egothor project</a>.</p>
<p>The software distribution includes stemmer
tables prepared using an extensive corpus of the Polish language (see
details below).</p>
<p>This work is available under an Apache-style Open Source license: the
stemmer code is covered by the Egothor License, while the tables and other
additions are covered by the Apache License 2.0. Both licenses allow the code
to be used in Open Source as well as commercial (closed source) projects.</p>
<h3>Terminology</h3>
<p>A short explanation is in order about the terminology used in this
text.</p>
<p>In the following sections I make a distinction between a <b>stem</b>
and a <b>lemma</b>.</p>
<p>A lemma is the base grammatical form (dictionary form, headword) of a
word. A lemma is an existing, grammatically correct word in some human
language.</p>
<p>A stem, on the other hand, is just a unique token, not necessarily
a meaningful word in any human language, but one that can serve as a unique
label instead of the lemma for the same set of inflected forms. Quite often
a stem is referred to as the "root" of the word, which is incorrect and
misleading (stems sometimes have very little to do with the linguistic
root of a word, i.e. a pattern found in a word that is common to all
inflected forms or within a family of languages).</p>
<p>For an IR system stems are usually sufficient; for a morphological
analysis system lemmas are obviously a must. In practice, various
stemmers produce a mix of stems and lemmas, as is the case with the
stemmer described here.
Additionally, for languages that use
suffix-based inflection rules, many stemmers based on suffix stripping
will produce a large percentage of stems equivalent to lemmas. This is,
however, not the case for languages with complex, irregular inflection
rules (such as the Slavic languages), where simplistic suffix-stripping
stemmers produce very poor results.</p>
<h3>Background</h3>
<p>Lemmatization is the process of finding the base, non-inflected form
of a word. The result of lemmatization is a correct existing word,
often in the nominative case for nouns and the infinitive form for verbs. A
given inflected form may correspond to several lemmas (e.g. "found"
-> find, found) - the correct choice depends on the context.<br>
<br>
Stemming is concerned mostly with finding a unique "root" of a word,
which does not necessarily result in any existing word or lemma. The
quality of stemming is measured by the rate of collisions (overstemming,
which causes words with different lemmas to be incorrectly conflated
into one "root"), and the rate of superfluous word "roots"
(understemming, which assigns several "roots" to words with the same
lemma).<br>
<br>
Both stemmers and lemmatizers can be implemented in various ways. The two
most common approaches are:<br>
</p>
<ul>
  <li>dictionary-based: the stemmer uses an extensive dictionary
of morphological forms in order to find the corresponding stem or lemma</li>
  <li>algorithmic: the stemmer uses an algorithm, based on the
general morphological properties of a given language plus a set of
heuristic rules<br>
  </li>
</ul>
There are many existing and well-known implementations of stemmers for
English (Porter, Lovins, Krovetz) and other European languages
(<a href="https://snowballstem.org/">Snowball</a>). There are also
good quality commercial lemmatizers for Polish.
However, there is only one
freely available Polish stemmer, implemented by
<a href="http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en">Dawid
Weiss</a>, based on the "ispell" dictionary and Jan Daciuk's <a
 href="http://www.eti.pg.gda.pl/%7Ejandac/">FSA package</a>. That
stemmer is dictionary-based. This means that even though it can achieve
perfect accuracy for previously known word forms found in its dictionary, it
completely fails for all other word forms. This deficiency is
somewhat mitigated by the comprehensive dictionary distributed with
that stemmer (so there is a high probability that most of the words in
the input text will be found in the dictionary), but the problem
still remains (please see the page above for a more detailed description).<br>
<br>
The implementation described here uses an algorithmic method. This
method and the particular algorithm implementation are described in detail in
[1][2]. The main advantage of algorithmic stemmers is their ability to process
previously unseen word forms with high accuracy. This particular algorithm uses
a set of transformation rules (patch commands), which describe how a word with
a given pattern should be transformed to its stem. These rules are first
learned from a training corpus. They don't cover
all possible cases, so there is always some loss of precision/recall
(which means that even words from the training corpus are sometimes
incorrectly stemmed).<br>
<h2>Algorithm and implementation</h2>
The algorithm and its Java implementation are described in detail in the
publications cited below. Here's just a short excerpt from [2]:<br>
<br>
<div style="width: 80%; text-align: center">"The aim is separation of the
stemmer execution code from the data
structures [...]. In other words, a static algorithm configurable by
data must be developed.
The word transformations that happen in the
stemmer must then be encoded in the data tables.<br>
<br>
The tacit input of our method is a sample set (a so-called dictionary)
of words (as keys) and their stems. Each record can be equivalently
stored as a key and the record of the key's transformation to its
respective stem. The transformation record is termed a patch command
(P-command). It must be ensured that P-commands are universal, and that
P-commands can transform any word to its stem. Our solution [6,8] is
based on the Levenstein metric [10], which produces a P-command as the
minimum cost path in a directed graph.<br>
<br>
One can imagine the P-command as an algorithm for an operator (editor)
that rewrites a string to another string. The operator can use these
instructions (P-commands): <span style="font-weight: bold;">removal</span> -
deletes a sequence of characters starting at the current cursor
position and moves the cursor to the next character. The length of this
sequence is the parameter; <span style="font-weight: bold;">insertion</span> -
inserts a character ch, without moving the cursor. The character ch is
a parameter; <span style="font-weight: bold;">substitution</span>
- rewrites a character at the current cursor position to the character
ch and moves the cursor to the next character. The character ch is a
parameter; <span style="font-weight: bold;">no operation</span> (NOOP)
- skips a sequence of characters starting at the current cursor
position. The length of this sequence is the parameter.<br>
<br>
The P-commands are applied from the end of a word (right to left). This
assumption can reduce the set of P-commands, because the last NOOP,
moving the cursor to the end of a string without any changes, need not
be stored."</div>
<br>
The data structure used to keep the dictionary (words and their P-commands)
is a trie.
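<p>To make the four instruction types concrete, here is a minimal Java sketch
of an editor applying such commands to a word from right to left. The class
name <code>PatchSketch</code>, the two-character command encoding
(<code>D3</code>, <code>Sa</code>, ...), and the example words are invented
for illustration only; the actual Egothor/Stempel implementation stores
P-commands in a different, compact serialized form inside the trie.</p>

```java
// Toy illustration of applying a P-command to a word, right to left.
// The encoding below is invented for this sketch: "D<n>" removal of n chars,
// "I<ch>" insertion of ch, "S<ch>" substitution by ch, "N<n>" NOOP over n chars.
public class PatchSketch {

    /** Applies a sequence of simplified patch instructions to a word. */
    public static String apply(String word, String[] commands) {
        StringBuilder sb = new StringBuilder(word);
        int cursor = sb.length() - 1;              // start at the end of the word
        for (String cmd : commands) {
            char op = cmd.charAt(0);
            char arg = cmd.charAt(1);
            switch (op) {
                case 'D': {                        // removal: delete n chars ending at
                    int n = arg - '0';             // the cursor, then move cursor left
                    sb.delete(cursor - n + 1, cursor + 1);
                    cursor -= n;
                    break;
                }
                case 'I':                          // insertion: insert ch after the
                    sb.insert(cursor + 1, arg);    // cursor, cursor does not move
                    break;
                case 'S':                          // substitution: rewrite the char at
                    sb.setCharAt(cursor, arg);     // the cursor, then move cursor left
                    cursor--;
                    break;
                case 'N':                          // NOOP: skip n chars unchanged
                    cursor -= arg - '0';
                    break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "kotami" (instrumental plural of "kot", cat): strip 3 chars -> stem "kot"
        System.out.println(apply("kotami", new String[]{"D3"}));
        // "nogi" (plural of "noga", leg): substitute final char -> lemma "noga"
        System.out.println(apply("nogi", new String[]{"Sa"}));
    }
}
```

<p>Note how the right-to-left convention pays off: a suffix-only change such as
"kotami" → "kot" needs a single short command, with no trailing NOOP to record.</p>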
Several optimization steps are applied in turn to reduce and
optimize the initial trie, by eliminating useless information and
shortening the paths in the trie.<br>
<br>
Finally, in order to obtain a stem from the input word, the word is
passed once through a matching path in the trie (applying at each node
the P-commands stored there). The result is the word stem.<br>
<h2>Corpus</h2>
<p><i>(to be completed...)</i></p>
<p>The following Polish corpora have been used:</p>
<ul>
  <li><a
 href="http://sourceforge.net/project/showfiles.php?group_id=49316&package_id=65354">Polish
dictionary from the ispell distribution</a></li>
  <li><a href="http://www.mimuw.edu.pl/polszczyzna/">Wzbogacony korpus
słownika frekwencyjnego</a> (an enriched corpus based on the Polish frequency
dictionary)</li>
<!--<li><a href="http://www.korpus.pl">Korpus IPI PAN</a></li>-->
<!--<li>The Bible (so called "Warsaw Bible" or "Brytyjka")</li>-->
  <li>The Bible (the so-called "Tysiąclecia" translation) - unauthorized
electronic version</li>
  <li><a
 href="http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb">Analizator
morfologiczny SAM v. 3.4</a> (the SAM morphological analyzer) - this was used
to recover lemmas missing from the other texts</li>
</ul>
<p>This step was the most time-consuming - and it would probably have been
even more tedious and difficult if not for the help of
<a href="http://www.python.org/">Python</a>. The source texts had to be
brought to a common encoding (UTF-8) - some of them used quite ancient
encodings like Mazovia or DHN - and then scripts were written to
collect all lemmas and inflected forms from the source texts. In cases where
the source text was not tagged, I used the SAM analyzer to produce lemmas. In
cases of ambiguous lemmatization I decided to put references to the inflected
forms from all base forms.<br>
</p>
<p>All grammatical categories were allowed to appear in the corpus,
i.e. nouns, verbs, adjectives, numerals, and pronouns.
The resulting
corpus consisted of roughly 87,000 inflection sets, i.e. each set
consisted of one base form (lemma) and many inflected forms. However,
because of the nature of the training method I restricted these sets to
include only those with at least 4 inflected forms. Sets
with 3 or fewer inflected forms were removed, so that the final corpus
consisted of ~69,000 unique sets, which in turn contained ~1.5 million
inflected forms.<br>
</p>
<h2>Testing</h2>
<p>I tested the stemmer tables produced using the implementation
described above. The following sections give some details about
the testing setup.
</p>
<h3>Testing procedure</h3>
<p>The testing procedure was as follows:
</p>
<ul>
  <li>The whole corpus of ~69,000 unique sets was shuffled, so that the
input sets were in random order.</li>
  <li>The corpus was split into two parts - one with 30,000 sets (Part
1), the other with ~39,000 sets (Part 2).</li>
  <li>Training samples were drawn in sequential order from Part 1.
Since the sets were already randomized, the training samples were also
randomized, but this procedure ensured that each larger training sample
contained all smaller samples.</li>
  <li>Part 2 was used for testing. Note: this means that the testing
run used <em>only</em> words previously unseen during the training
phase. This is the worst-case scenario, because it means that the stemmer must
extrapolate the learned rules to unknown cases. This also means that in
a real-life case (where the input is a mix of known and unknown
words) the F-measure of the stemmer will be even higher than in the
table below.</li>
</ul>
<h3>Test results</h3>
<p>The following table summarizes test results for varying sizes
of training samples. The meaning of the table columns is
described below:
</p>
<ul>
  <li><b>training sets:</b> the number of training sets.
One set
consists of one lemma and at least 4 and up to ~80 inflected forms
(including pre- and suffixed forms).</li>
  <li><b>testing forms:</b> the number of testing forms. Only inflected
forms were used in testing.</li>
  <li><b>stem OK:</b> the number of cases when the produced output was a
correct (unique) stem. Note: quite often correct stems were also
correct lemmas.</li>
  <li><b>lemma OK:</b> the number of cases when the produced output was a
correct lemma.</li>
  <li><b>missing:</b> the number of cases when the stemmer was unable to
provide any output.</li>
  <li><b>stem bad:</b> the number of cases when the produced output was a
stem, but one already in use identifying a different set.</li>
  <li><b>lemma bad:</b> the number of cases when the produced output was an
incorrect lemma. Note: quite often in such cases the output was a
correct stem.</li>
  <li><b>table size:</b> the size in bytes of the stemmer table.</li>
</ul>
<table class="padding2" style="border: 1px solid; border-spacing: 0px; border-collapse: separate">
  <caption>test results for different sizes of training samples</caption>
  <tbody>
    <tr style="background-color: #a0b0c0">
      <th>Training sets</th>
      <th>Testing forms</th>
      <th>Stem OK</th>
      <th>Lemma OK</th>
      <th>Missing</th>
      <th>Stem Bad</th>
      <th>Lemma Bad</th>
      <th>Table size [B]</th>
    </tr>
    <tr style="text-align: right">
      <td>100</td>
      <td>1022985</td>
      <td>842209</td>
      <td>593632</td>
      <td>172711</td>
      <td>22331</td>
      <td>256642</td>
      <td>28438</td>
    </tr>
    <tr style="text-align: right">
      <td>200</td>
      <td>1022985</td>
      <td>862789</td>
      <td>646488</td>
      <td>153288</td>
      <td>16306</td>
      <td>223209</td>
      <td>48660</td>
    </tr>
    <tr style="text-align: right">
      <td>500</td>
      <td>1022985</td>
      <td>885786</td>
      <td>685009</td>
      <td>130772</td>
      <td>14856</td>
      <td>207204</td>
      <td>108798</td>
    </tr>
    <tr style="text-align: right">
      <td>700</td>
      <td>1022985</td>
      <td>909031</td>
      <td>704609</td>
      <td>107084</td>
      <td>15442</td>
      <td>211292</td>
      <td>139291</td>
    </tr>
    <tr style="text-align: right">
      <td>1000</td>
      <td>1022985</td>
      <td>926079</td>
      <td>725720</td>
      <td>90117</td>
      <td>14941</td>
      <td>207148</td>
      <td>183677</td>
    </tr>
    <tr style="text-align: right">
      <td>2000</td>
      <td>1022985</td>
      <td>942886</td>
      <td>746641</td>
      <td>73429</td>
      <td>14903</td>
      <td>202915</td>
      <td>313516</td>
    </tr>
    <tr style="text-align: right">
      <td>5000</td>
      <td>1022985</td>
      <td>954721</td>
      <td>759930</td>
      <td>61476</td>
      <td>14817</td>
      <td>201579</td>
      <td>640969</td>
    </tr>
    <tr style="text-align: right">
      <td>7000</td>
      <td>1022985</td>
      <td>956165</td>
      <td>764033</td>
      <td>60364</td>
      <td>14620</td>
      <td>198588</td>
      <td>839347</td>
    </tr>
    <tr style="text-align: right">
      <td>10000</td>
      <td>1022985</td>
      <td>965427</td>
      <td>775507</td>
      <td>50797</td>
      <td>14662</td>
      <td>196681</td>
      <td>1144537</td>
    </tr>
    <tr style="text-align: right">
      <td>12000</td>
      <td>1022985</td>
      <td>967664</td>
      <td>782143</td>
      <td>48722</td>
      <td>14284</td>
      <td>192120</td>
      <td>1313508</td>
    </tr>
    <tr style="text-align: right">
      <td>15000</td>
      <td>1022985</td>
      <td>973188</td>
      <td>788867</td>
      <td>43247</td>
      <td>14349</td>
      <td>190871</td>
      <td>1567902</td>
    </tr>
    <tr style="text-align: right">
      <td>17000</td>
      <td>1022985</td>
      <td>974203</td>
      <td>791804</td>
      <td>42319</td>
      <td>14333</td>
      <td>188862</td>
      <td>1733957</td>
    </tr>
    <tr style="text-align: right">
      <td>20000</td>
      <td>1022985</td>
      <td>976234</td>
      <td>791554</td>
      <td>40058</td>
      <td>14601</td>
      <td>191373</td>
      <td>1977615</td>
    </tr>
  </tbody>
</table>
<p>I also measured the time to produce a stem (which involves
traversing a trie, retrieving a patch command and applying the patch command
to the input string). On a machine running Windows XP (Pentium 4, 1.7 GHz,
JDK 1.4.2_03 HotSpot), for tables ranging in size from 1,000 to 20,000 cells,
the time to produce a single stem varies between 5 and 10 microseconds.<br>
</p>
<p>This means that the stemmer can process up to <span
 style="font-weight: bold;">200,000 words per second</span>, an
outstanding result when compared to other stemmers (Morfeusz - ~2,000
w/s, FormAN (MS Word analyzer) - ~1,000 w/s).<br>
</p>
<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>,
which you can use to produce reports like the one below:<br>
</p>
<pre>--------- Stemmer benchmark report: -----------
Stemmer table:       /res/tables/stemmer_2000.out
Input file:          ../test3.txt
Number of runs:      3

 RUN NUMBER:               1        2        3
 Total input words   1378176  1378176  1378176
 Missed output words      112      112      112
 Time elapsed [ms]       6989     6940     6640
 Hit rate percent      99.99%   99.99%   99.99%
 Miss rate percent     00.01%   00.01%   00.01%
 Words per second      197192   198584   207557
 Time per word [us]      5.07     5.04     4.82
</pre>
<h2>Summary</h2>
<p>The results of these tests are very encouraging. It seems that using
the training corpus and the stemming algorithm described above results in a
high-quality stemmer useful for most applications.
Moreover, it can
also be used as a better-than-average lemmatizer.</p>
<p>Both the author of the implementation
(Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
of this compilation (Andrzej Bialecki, &lt;ab AT getopt DOT org&gt;) would
appreciate any feedback and suggestions for further improvements.</p>
<h2>Bibliography</h2>
<ol>
  <li>Galambos, L.: Multilingual Stemmer in Web Environment. PhD
Thesis, Faculty of Mathematics and Physics, Charles University in Prague,
in press.</li>
  <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
Intelligent Information Processing and Web Mining Conference, 2004,
Zakopane, Poland.</li>
  <li>Galambos, L.: Lemmatizer for Document Information Retrieval
Systems in JAVA. <a
 class="moz-txt-link-rfc2396E"
 href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
SOFSEM 2001, Piestany, Slovakia.<br>
  </li>
</ol>
<br>
<br>
</body>
</html>