<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>
      Apache Lucene ICU integration module
    </title>
  </head>
<body>
<p>
This module exposes functionality from
<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
library that enhances Java's internationalization support by improving
performance, keeping current with the Unicode Standard, and providing richer
APIs.
</p>
<p>
For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation.
</p>
<p>
This module exposes the following functionality:
</p>
<ul>
  <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
  properties and rules defined in Unicode.</li>
  <li><a href="#collation">Collation</a>: Compares strings according to the
  conventions and standards of a particular language, region or country.</li>
  <li><a href="#normalization">Normalization</a>: Converts text to a unique,
  equivalent form.</li>
  <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
  Unicode's Default Caseless Matching algorithm.</li>
  <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
  (such as accent marks) between similar characters for a loose or fuzzy search.</li>
  <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
  a context-sensitive fashion, e.g. mapping Traditional Chinese to Simplified Chinese.</li>
</ul>
<hr>
<h1><a id="segmentation">Text Segmentation</a></h1>
<p>
Text Segmentation (Tokenization) divides document and query text into index terms
(typically words). Unicode provides special properties and rules so that this can
be done in a manner that works well with most languages.
</p>
<p>
Text Segmentation implements the word segmentation specified in
<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
Additionally, the algorithm can be tailored based on writing system; for example,
text in the Thai script is automatically delegated to a dictionary-based segmentation
algorithm.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for StandardTokenizer that works well for
    most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
<pre class="prettyprint">
  /**
   * This tokenizer will work well in general for most languages.
   */
  Tokenizer tokenizer = new ICUTokenizer();
  tokenizer.setReader(reader);
</pre>
<hr>
<h1><a id="collation">Collation</a></h1>
<p>
  <code>ICUCollationKeyAnalyzer</code>
  converts each token into its binary <code>CollationKey</code> using the
  provided <code>Collator</code>, allowing it to be
  stored as an index term.
</p>
<p>
  <code>ICUCollationKeyAnalyzer</code> depends on ICU4J to produce the
  <code>CollationKey</code>s.
</p>

<h2>Use Cases</h2>

<ul>
  <li>
    Efficient sorting of terms in languages that use non-Unicode character
    orderings. (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages that
    use non-Unicode character orderings. (Range queries using a Locale can be
    very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>

<h2>Example Usages</h2>

<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new ULocale("ar"));
  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
  Document doc = new Document();
  doc.add(new TextField("content", "\u0633\u0627\u0628", Field.Store.YES));
  writer.addDocument(doc);
  writer.close();
  IndexReader reader = DirectoryReader.open(dir);
  IndexSearcher is = new IndexSearcher(reader);

  QueryParser aqp = new QueryParser("content", analyzer);
  aqp.setAnalyzeRangeTerms(true);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a range query
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>

<h3>Danish Sorting</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new ULocale("da", "DK"));
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig());
  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
  for (int i = 0 ; i < data.length ; ++i) {
    Document doc = new Document();
    doc.add(new StoredField("tracer", tracer[i]));
    // Sorting reads doc values, so index the collation key with
    // ICUCollationDocValuesField rather than as analyzed terms.
    ICUCollationDocValuesField sortKey = new ICUCollationDocValuesField("contents", collator);
    sortKey.setStringValue(data[i]);
    doc.add(sortKey);
    writer.addDocument(doc);
  }
  writer.close();
  IndexReader reader = DirectoryReader.open(dir);
  IndexSearcher searcher = new IndexSearcher(reader);
  Sort sort = new Sort(new SortField("contents", SortField.Type.STRING));
  ScoreDoc[] result = searcher.search(new MatchAllDocsQuery(), 1000, sort).scoreDocs;
  for (int i = 0 ; i < result.length ; ++i) {
    Document doc = searcher.doc(result[i].doc);
    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
  }
</pre>

<h3>Turkish Case Normalization</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
  collator.setStrength(Collator.PRIMARY);
  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
  Document doc = new Document();
  doc.add(new TextField("contents", "DIGY", Field.Store.NO));
  writer.addDocument(doc);
  writer.close();
  IndexReader reader = DirectoryReader.open(dir);
  IndexSearcher is = new IndexSearcher(reader);
  QueryParser parser = new QueryParser("contents", analyzer);
  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
  ScoreDoc[] result = is.search(query, 1000).scoreDocs;
  assertEquals("The index Term should be included.", 1, result.length);
</pre>

<h2>Caveats and Comparisons</h2>
<p>
  <strong>WARNING:</strong> Make sure you use exactly the same
  <code>Collator</code> at index and query time -- <code>CollationKey</code>s
  are only comparable when produced by
  the same <code>Collator</code>.  Since {@link java.text.RuleBasedCollator}s
  are not independently versioned, it is unsafe to search against stored
  <code>CollationKey</code>s unless the following are exactly the same (best
  practice is to store this information with the index and check that it
  remains the same at query time):
</p>
<ol>
  <li>JVM vendor</li>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used - see {@link java.text.Collator#setStrength(int)}
  </li>
</ol>
<p>
  <code>ICUCollationKeyAnalyzer</code> uses ICU4J's <code>Collator</code>, which
  makes its version available, thus allowing collation to be versioned
  independently from the JVM. <code>ICUCollationKeyAnalyzer</code> is also
  significantly faster and generates significantly shorter keys than
  <code>CollationKeyAnalyzer</code>.
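</p>
<p>
The caveats above can be illustrated with the JDK alone. The following sketch
(using <code>java.text.Collator</code> rather than ICU4J; the class name is
illustrative) shows that collation keys are only comparable when produced by
identically configured collators: changing the strength changes the key bytes.
</p>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorConfigDemo {
  public static void main(String[] args) {
    // Two collators for the same locale, configured with different strengths.
    Collator primary = Collator.getInstance(Locale.US);
    primary.setStrength(Collator.PRIMARY);
    Collator tertiary = Collator.getInstance(Locale.US);
    tertiary.setStrength(Collator.TERTIARY);

    // PRIMARY strength ignores case and accent differences, so the keys match.
    boolean samePrimary = Arrays.equals(
        primary.getCollationKey("resume").toByteArray(),
        primary.getCollationKey("R\u00E9sum\u00E9").toByteArray());

    // The same string keyed at a different strength yields different bytes:
    // keys from differently configured collators must never be mixed.
    boolean sameAcrossStrengths = Arrays.equals(
        primary.getCollationKey("resume").toByteArray(),
        tertiary.getCollationKey("resume").toByteArray());

    System.out.println(samePrimary);          // true
    System.out.println(sameAcrossStrengths);  // false
  }
}
```

<p>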
  See
  <a href="http://site.icu-project.org/charts/collation-icu4j-sun"
  >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
  generation timing and key length comparisons between ICU4J and
  <code>java.text.Collator</code> over several languages.
</p>
<p>
  <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
  not compatible with those generated by ICU Collators. Specifically, if
  you use <code>CollationKeyAnalyzer</code> to generate index terms, do not use
  <code>ICUCollationKeyAnalyzer</code> on the query side, or vice versa.
</p>
<hr>
<h1><a id="normalization">Normalization</a></h1>
<p>
  <code>ICUNormalizer2Filter</code> normalizes term text to a
  <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
  that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
  forms are standardized to a unique form.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Removing differences in width for Asian-language text.
  </li>
  <li>
    Standardizing complex text with non-spacing marks so that characters are
    ordered consistently.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
<pre class="prettyprint">
  /**
   * Normalizer2 objects are immutable and may be shared across threads.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  /**
   * This filter will normalize to NFC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
</pre>
<hr>
<h1><a id="casefolding">Case Folding</a></h1>
<p>
Default caseless matching, or case-folding, is more than just conversion to
lowercase. For example, it handles cases such as the Greek sigma, so that
"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
</p>
<p>
Case-folding is still only an approximation of the language-specific rules
governing case.
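</p>
<p>
Both points are visible even in the JDK's plain case mapping (a sketch using
<code>java.lang.String</code>, not ICU's case folding; the class name is
illustrative): a full case mapping may change a string's length, and the
correct mapping depends on the language.
</p>

```java
import java.util.Locale;

public class CaseMappingDemo {
  public static void main(String[] args) {
    // Full case mapping can change length: German ß (U+00DF) uppercases to "SS".
    String upper = "\u00DF".toUpperCase(Locale.ROOT);
    System.out.println(upper); // SS

    // Case mapping is language-specific: Turkish lowercases I to dotless ı (U+0131),
    // while the root locale lowercases it to the ordinary i.
    String turkish = "I".toLowerCase(Locale.forLanguageTag("tr"));
    String english = "I".toLowerCase(Locale.ROOT);
    System.out.println(turkish.equals(english)); // false
  }
}
```

<p>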
If the specific language is known, consider using ICUCollationKeyAnalyzer
and indexing collation keys instead. This implementation
performs the "full" case-folding specified in the Unicode standard, and this
may change the length of the term. For example, the German ß is case-folded
to the string 'ss'.
</p>
<p>
Case folding is related to normalization, and as such is coupled with it in
this integration. To perform case-folding, use normalization with the form
"nfkc_cf" (which is the default).
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for LowerCaseFilter that has good behavior
    for most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold and normalize to NFKC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
</pre>
<hr>
<h1><a id="searchfolding">Search Term Folding</a></h1>
<p>
Search term folding removes distinctions (such as accent marks) between
similar characters. It is useful for a fuzzy or loose search.
</p>
<p>
Search term folding implements many of the foldings specified in
<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
as a special normalization form. This folding applies NFKC, Case Folding, and
many character foldings recursively.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
    that applies the same ideas to many more languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Removing accents</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold, remove accents and other distinctions, and
   * normalize to NFKC.
   */
  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
</pre>
<hr>
<h1><a id="transform">Text Transformation</a></h1>
<p>
ICU provides text-transformation functionality via its Transliteration API. This allows
you to transform text in a variety of ways, taking context into account.
</p>
<p>
For more information, see the
<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
and
<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Convert Traditional Chinese to Simplified Chinese
  </li>
  <li>
    Transliterate between different writing systems: e.g. Romanization
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
<pre class="prettyprint">
  /**
   * This filter will map Traditional Chinese to Simplified Chinese.
   */
  TokenStream tokenstream =
      new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
<pre class="prettyprint">
  /**
   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules.
   */
  TokenStream tokenstream =
      new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
</pre>
<hr>
<h1><a id="backcompat">Backwards Compatibility</a></h1>
<p>
This module exists to provide up-to-date Unicode functionality that supports
the most recent version of Unicode (currently 11.0). However, some users who require
stronger backwards compatibility can restrict
{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
<pre class="prettyprint">
  /**
   * This filter will do NFC normalization, but will ignore any characters that
   * did not exist as of Unicode 5.0. Because of the normalization stability policy
   * of Unicode, this is an easy way to force normalization to a specific version.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  UnicodeSet set = new UnicodeSet("[:age=5.0:]");
  // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
  set.freeze();
  FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
</pre>
</body>
</html>