<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>
      Apache Lucene ICU integration module
    </title>
  </head>
<body>
<p>
This module exposes functionality from
<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
library that enhances Java's internationalization support by improving
performance, keeping current with the Unicode Standard, and providing richer
APIs.
</p>
<p>
For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation.
</p>
<p>
This module exposes the following functionality:
</p>
<ul>
  <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
  properties and rules defined in Unicode.</li>
  <li><a href="#collation">Collation</a>: Compares strings according to the
  conventions and standards of a particular language, region or country.</li>
  <li><a href="#normalization">Normalization</a>: Converts text to a unique,
  equivalent form.</li>
  <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
  Unicode's Default Caseless Matching algorithm.</li>
  <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
  (such as accent marks) between similar characters for a loose or fuzzy search.</li>
  <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
  a context-sensitive fashion, e.g. mapping Traditional to Simplified Chinese.</li>
</ul>
<hr>
<h1><a id="segmentation">Text Segmentation</a></h1>
<p>
Text Segmentation (Tokenization) divides document and query text into index terms
(typically words). Unicode provides special properties and rules so that this can
be done in a manner that works well with most languages.
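</p>
<p>
The idea of rule-based word boundaries can be tried without Lucene. The following is a
minimal sketch that uses the JDK's <code>java.text.BreakIterator</code> as a stand-in, not
<code>ICUTokenizer</code>; the class name <code>WordSegmentationSketch</code> and the method
<code>words</code> are hypothetical, and the JDK's rules and data may differ from ICU's.
</p>
<pre class="prettyprint">
import java.text.BreakIterator;
import java.util.Locale;

final class WordSegmentationSketch {
  /** Returns the words of text joined by '|', found with the JDK's word BreakIterator. */
  static String words(String text) {
    BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
    boundaries.setText(text);
    StringBuilder out = new StringBuilder();
    int start = boundaries.first();
    for (int end = boundaries.next(); end != BreakIterator.DONE;
         start = end, end = boundaries.next()) {
      String segment = text.substring(start, end);
      // Keep segments that begin with a letter or digit; skip spaces and punctuation.
      if (Character.isLetterOrDigit(segment.codePointAt(0))) {
        if (out.length() != 0) {
          out.append('|');
        }
        out.append(segment);
      }
    }
    return out.toString();
  }
}
</pre>
<p>
Under these assumptions, <code>WordSegmentationSketch.words("Hello, world!")</code> returns
<code>Hello|world</code>: the comma, the space and the exclamation mark each form their own
segment and are dropped.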
</p>
<p>
Text Segmentation implements the word segmentation specified in
<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
Additionally, the algorithm can be tailored based on writing system: for example,
text in the Thai script is automatically delegated to a dictionary-based segmentation
algorithm.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for StandardTokenizer that works well for
    most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
<pre class="prettyprint">
  /**
   * This tokenizer will work well in general for most languages.
   */
  Tokenizer tokenizer = new ICUTokenizer(reader);
</pre>
<hr>
<h1><a id="collation">Collation</a></h1>
<p>
  <code>ICUCollationKeyAnalyzer</code>
  converts each token into its binary <code>CollationKey</code> using the
  provided <code>Collator</code>, allowing it to be
  stored as an index term.
</p>
<p>
  <code>ICUCollationKeyAnalyzer</code> depends on ICU4J to produce the
  <code>CollationKey</code>s.
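</p>
<p>
  Why a binary key is useful as an index term can be seen with the JDK alone. The sketch
  below is an illustration under stated assumptions: it uses <code>java.text.Collator</code>
  rather than ICU4J's <code>Collator</code>, so its keys are not the ones
  <code>ICUCollationKeyAnalyzer</code> would index, and the class and method names are
  hypothetical.
</p>
<pre class="prettyprint">
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

final class CollationKeySketch {
  /** Orders two strings by their collation keys, the way stored keys would be ordered. */
  static int compareByKey(Collator collator, String a, String b) {
    CollationKey keyA = collator.getCollationKey(a);
    CollationKey keyB = collator.getCollationKey(b);
    return Integer.signum(keyA.compareTo(keyB));
  }

  /** Orders the same two strings by asking the collator directly. */
  static int compareDirect(Collator collator, String a, String b) {
    return Integer.signum(collator.compare(a, b));
  }
}
</pre>
<p>
  With <code>Collator.getInstance(Locale.ENGLISH)</code>, both methods place "apple" before
  "Banana", whereas <code>"apple".compareTo("Banana")</code> is positive because plain
  code unit order puts uppercase letters first. Comparing keys gives the collator's order
  without consulting the collator again, which is what makes the keys usable as index terms.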
</p>

<h2>Use Cases</h2>

<ul>
  <li>
    Efficient sorting of terms in languages that use non-Unicode character
    orderings.  (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages that
    use non-Unicode character orderings.  (Range queries using a Locale can be
    very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>

<h2>Example Usages</h2>

<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new ULocale("ar"));
  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
  Document doc = new Document();
  doc.add(new Field("content", "\u0633\u0627\u0628",
                    Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(dir, true);

  QueryParser aqp = new QueryParser("content", analyzer);
  aqp.setAnalyzeRangeTerms(true);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>

<h3>Danish Sorting</h3>
<pre class="prettyprint">
  Analyzer analyzer
    = new ICUCollationKeyAnalyzer(Collator.getInstance(new ULocale("da", "dk")));
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
  // Danish sorts U+00D8 (O with stroke) before U+00C5 (A with ring), both after Z,
  // so the expected order is HAT, HOT, HUT, H\u00D8T, H\u00C5T.
  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
  for (int i = 0 ; i < data.length ; ++i) {
    Document doc = new Document();
    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);
  }
  writer.close();
  IndexSearcher searcher = new IndexSearcher(dir, true);
  Sort sort = new Sort();
  sort.setSort(new SortField("contents", SortField.STRING));
  Query query = new MatchAllDocsQuery();
  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
  for (int i = 0 ; i < result.length ; ++i) {
    Document doc = searcher.doc(result[i].doc);
    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
  }
</pre>

<h3>Turkish Case Normalization</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
  collator.setStrength(Collator.PRIMARY);
  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
  Document doc = new Document();
  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.close();
  IndexSearcher is = new IndexSearcher(dir, true);
  QueryParser parser = new QueryParser("contents", analyzer);
  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
  assertEquals("The index Term should be included.", 1, result.length);
</pre>

<h2>Caveats and Comparisons</h2>
<p>
  <strong>WARNING:</strong> Make sure you use exactly the same
  <code>Collator</code> at index and query time -- <code>CollationKey</code>s
  are only comparable when produced by
  the same <code>Collator</code>.
  Since {@link java.text.RuleBasedCollator}s
  are not independently versioned, it is unsafe to search against stored
  <code>CollationKey</code>s unless the following are exactly the same (best
  practice is to store this information with the index and check that they
  remain the same at query time):
</p>
<ol>
  <li>JVM vendor</li>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used - see {@link java.text.Collator#setStrength(int)}
  </li>
</ol>
<p>
  <code>ICUCollationKeyAnalyzer</code> uses ICU4J's <code>Collator</code>, which
  makes its version available, thus allowing collation to be versioned
  independently from the JVM.  <code>ICUCollationKeyAnalyzer</code> is also
  significantly faster and generates significantly shorter keys than
  <code>CollationKeyAnalyzer</code>.  See
  <a href="http://site.icu-project.org/charts/collation-icu4j-sun"
    >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
  generation timing and key length comparisons between ICU4J and
  <code>java.text.Collator</code> over several languages.
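</p>
<p>
  The check recommended in the warning above (record the four items at index time and
  verify them at query time) could be sketched as follows for a
  <code>java.text.Collator</code>. This is only one possible approach: the class name
  <code>CollatorFingerprintSketch</code>, its methods and the fingerprint format are
  hypothetical, and where the fingerprint is stored is left to the application.
</p>
<pre class="prettyprint">
import java.util.Locale;

final class CollatorFingerprintSketch {
  /** Describes the JVM vendor, JVM version, locale and strength in one string. */
  static String fingerprint(Locale locale, int strength) {
    return String.join(";",
        "vendor=" + System.getProperty("java.vendor"),
        "version=" + System.getProperty("java.version"),
        "locale=" + locale.toLanguageTag(),
        "strength=" + strength);
  }

  /** Throws if the fingerprint recorded at index time differs from the current one. */
  static void checkCompatible(String stored, Locale locale, int strength) {
    String current = fingerprint(locale, strength);
    if (!stored.equals(current)) {
      throw new IllegalStateException(
          "Collator settings changed since indexing: " + stored + " vs " + current);
    }
  }
}
</pre>
<p>
  An application would call <code>fingerprint</code> once when the index is built, keep the
  result next to the index, and call <code>checkCompatible</code> before running collated
  queries; a changed strength, locale or JVM then fails fast instead of silently returning
  wrong results.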
</p>
<p>
  <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
  not compatible with those generated by ICU Collators.  Specifically, if
  you use <code>CollationKeyAnalyzer</code> to generate index terms, do not use
  <code>ICUCollationKeyAnalyzer</code> on the query side, or vice versa.
</p>
<hr>
<h1><a id="normalization">Normalization</a></h1>
<p>
  <code>ICUNormalizer2Filter</code> normalizes term text to a
  <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
  that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
  forms are standardized to a unique form.
</p>
<h2>Use Cases</h2>
<ul>
  <li> Removing differences in width for Asian-language text.
  </li>
  <li> Standardizing complex text with non-spacing marks so that characters are
  ordered consistently.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
<pre class="prettyprint">
  /**
   * Normalizer2 objects are unmodifiable and immutable.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  /**
   * This filter will normalize to NFC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
</pre>
<hr>
<h1><a id="casefolding">Case Folding</a></h1>
<p>
Default caseless matching, or case-folding, is more than just conversion to
lowercase. For example, it handles cases such as the Greek sigma, so that
"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
</p>
<p>
Case-folding is still only an approximation of the language-specific rules
governing case. If the specific language is known, consider using
ICUCollationKeyAnalyzer and indexing collation keys instead. This implementation
performs the "full" case-folding specified in the Unicode standard, and this
may change the length of the term. For example, the German ß is case-folded
to the string 'ss'.
</p>
<p>
Case folding is related to normalization, and as such is coupled with it in
this integration. To perform case-folding, you use normalization with the form
"nfkc_cf" (which is the default).
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for LowerCaseFilter that has good behavior
    for most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold and normalize to NFKC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
</pre>
<hr>
<h1><a id="searchfolding">Search Term Folding</a></h1>
<p>
Search term folding removes distinctions (such as accent marks) between
similar characters. It is useful for a fuzzy or loose search.
</p>
<p>
Search term folding implements many of the foldings specified in
<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
as a special normalization form.  This folding applies NFKC, Case Folding, and
many character foldings recursively.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
    that applies the same ideas to many more languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Removing accents</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold, remove accents and other distinctions, and
   * normalize to NFKC.
   */
  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
</pre>
<hr>
<h1><a id="transform">Text Transformation</a></h1>
<p>
ICU provides text-transformation functionality via its Transliteration API. This allows
you to transform text in a variety of ways, taking context into account.
</p>
<p>
For more information, see the
<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
and
<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Convert Traditional to Simplified Chinese
  </li>
  <li>
    Transliterate between different writing systems: e.g.
    Romanization
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
<pre class="prettyprint">
  /**
   * This filter will map Traditional Chinese to Simplified Chinese
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
<pre class="prettyprint">
  /**
   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
</pre>
<hr>
<h1><a id="backcompat">Backwards Compatibility</a></h1>
<p>
This module exists to provide up-to-date Unicode functionality that supports
the most recent version of Unicode (currently 11.0). However, users who need
stronger backwards compatibility can restrict
{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
<pre class="prettyprint">
  /**
   * This filter will do NFC normalization, but will ignore any characters that
   * did not exist as of Unicode 5.0. Because of the normalization stability policy
   * of Unicode, this is an easy way to force normalization to a specific version.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  UnicodeSet set = new UnicodeSet("[:age=5.0:]");
  // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
  set.freeze();
  FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
</pre>
</body>
</html>