<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<html>
  <head>
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>
      Apache Lucene ICU integration module
    </title>
  </head>
<body>
<p>
This module exposes functionality from
<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
library that enhances Java's internationalization support by improving
performance, keeping current with the Unicode Standard, and providing richer
APIs.
</p>
<p>
For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation.
</p>
<p>
This module exposes the following functionality:
</p>
<ul>
  <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
  properties and rules defined in Unicode.</li>
  <li><a href="#collation">Collation</a>: Compares strings according to the
  conventions and standards of a particular language, region, or country.</li>
  <li><a href="#normalization">Normalization</a>: Converts text to a unique,
  equivalent form.</li>
  <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
  Unicode's Default Caseless Matching algorithm.</li>
  <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
  (such as accent marks) between similar characters for a loose or fuzzy search.</li>
  <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
  a context-sensitive fashion, e.g. mapping Traditional to Simplified Chinese.</li>
</ul>
<hr>
<h1><a id="segmentation">Text Segmentation</a></h1>
<p>
Text Segmentation (Tokenization) divides document and query text into index terms
(typically words). Unicode provides special properties and rules so that this can
be done in a manner that works well with most languages.
</p>
<p>
Text Segmentation implements the word segmentation specified in
<a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
Additionally, the algorithm can be tailored based on writing system; for example,
text in the Thai script is automatically delegated to a dictionary-based segmentation
algorithm.
</p>
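<p>
As an aside, the JDK ships its own UAX#29-style word iterator,
<code>java.text.BreakIterator</code>, which can illustrate the kind of
rule-based word segmentation described above. This sketch is only an
illustration using the standard library; the module's own tokenizer is the
<code>ICUTokenizer</code> shown below:
</p>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
  /** Collect word tokens using the JDK's UAX#29-style word iterator. */
  public static List<String> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> result = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String candidate = text.substring(start, end);
      // Keep only segments containing a letter or digit (skip spaces/punctuation).
      if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
        result.add(candidate);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(words("Hello, world!")); // [Hello, world]
  }
}
```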
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for StandardTokenizer that works well for
    most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
<pre class="prettyprint">
  /**
   * This tokenizer will work well in general for most languages.
   */
  Tokenizer tokenizer = new ICUTokenizer(reader);
</pre>
<hr>
<h1><a id="collation">Collation</a></h1>
<p>
  <code>ICUCollationKeyAnalyzer</code>
  converts each token into its binary <code>CollationKey</code> using the
  provided <code>Collator</code>, allowing it to be
  stored as an index term.
</p>
<p>
  <code>ICUCollationKeyAnalyzer</code> depends on ICU4J to produce the
  <code>CollationKey</code>s.
</p>

<h2>Use Cases</h2>

<ul>
  <li>
    Efficient sorting of terms in languages that use non-Unicode character
    orderings.  (Lucene Sort using a Locale can be very slow.)
  </li>
  <li>
    Efficient range queries over fields that contain terms in languages that
    use non-Unicode character orderings.  (Range queries using a Locale can be
    very slow.)
  </li>
  <li>
    Effective Locale-specific normalization (case differences, diacritics, etc.).
    ({@link org.apache.lucene.analysis.LowerCaseFilter} and
    {@link org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter} provide these services
    in a generic way that doesn't take into account locale-specific needs.)
  </li>
</ul>

<h2>Example Usages</h2>

<h3>Farsi Range Queries</h3>
<pre class="prettyprint">
  Collator collator = Collator.getInstance(new ULocale("ar"));
  ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(collator);
  Path indexPath = Files.createTempDirectory("tempIndex");
  Directory dir = FSDirectory.open(indexPath);
  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
  Document doc = new Document();
  doc.add(new TextField("content", "\u0633\u0627\u0628", Field.Store.YES));
  writer.addDocument(doc);
  writer.close();
  IndexReader reader = DirectoryReader.open(dir);
  IndexSearcher is = new IndexSearcher(reader);

  QueryParser aqp = new QueryParser("content", analyzer);

  // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
  // orders the U+0698 character before the U+0633 character, so the single
  // indexed Term above should NOT be returned by a TermRangeQuery
  // with a Farsi Collator (or an Arabic one for the case when Farsi is not
  // supported).
  ScoreDoc[] result
    = is.search(aqp.parse("[ \u062F TO \u0698 ]"), 1000).scoreDocs;
  assertEquals("The index Term should not be included.", 0, result.length);
</pre>
140
141<h3>Danish Sorting</h3>
142<pre class="prettyprint">
143  Analyzer analyzer
144    = new ICUCollationKeyAnalyzer(Collator.getInstance(new ULocale("da", "dk")));
145  Path indexPath = Files.createTempDirectory("tempIndex");
146  Directory dir = FSDirectory.open(indexPath);
147  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
148  String[] tracer = new String[] { "A", "B", "C", "D", "E" };
149  String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
150  String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
151  for (int i = 0 ; i &lt; data.length ; ++i) {
152    Document doc = new Document();
153    doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
154    doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
155    writer.addDocument(doc);
156  }
157  writer.close();
158  IndexSearcher searcher = new IndexSearcher(dir, true);
159  Sort sort = new Sort();
160  sort.setSort(new SortField("contents", SortField.STRING));
161  Query query = new MatchAllDocsQuery();
162  ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
163  for (int i = 0 ; i &lt; result.length ; ++i) {
164    Document doc = searcher.doc(result[i].doc);
165    assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
166  }
167</pre>
168
169<h3>Turkish Case Normalization</h3>
170<pre class="prettyprint">
171  Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
172  collator.setStrength(Collator.PRIMARY);
173  Analyzer analyzer = new ICUCollationKeyAnalyzer(collator);
174  Path indexPath = Files.createTempDirectory("tempIndex");
175  Directory dir = FSDirectory.open(indexPath);
176  IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
177  Document doc = new Document();
178  doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
179  writer.addDocument(doc);
180  writer.close();
181  IndexSearcher is = new IndexSearcher(dir, true);
182  QueryParser parser = new QueryParser("contents", analyzer);
183  Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
184  ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
185  assertEquals("The index Term should be included.", 1, result.length);
186</pre>
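<p>
The collated match above relies on Turkish casing rules, where uppercase
<code>I</code> corresponds to dotless <code>ı</code> (U+0131). The same mapping
can be seen with nothing but the JDK's locale-sensitive lowercasing; this is
shown only to motivate the example and is not part of this module's API:
</p>

```java
import java.util.Locale;

public class TurkishCaseDemo {
  public static void main(String[] args) {
    Locale turkish = new Locale("tr", "TR");
    // In Turkish, uppercase I lowercases to dotless i (U+0131), not to i.
    System.out.println("DIGY".toLowerCase(turkish));      // dıgy
    System.out.println("DIGY".toLowerCase(Locale.ROOT));  // digy
  }
}
```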

<h2>Caveats and Comparisons</h2>
<p>
  <strong>WARNING:</strong> Make sure you use exactly the same
  <code>Collator</code> at index and query time -- <code>CollationKey</code>s
  are only comparable when produced by
  the same <code>Collator</code>.  Since {@link java.text.RuleBasedCollator}s
  are not independently versioned, it is unsafe to search against stored
  <code>CollationKey</code>s unless the following are exactly the same (best
  practice is to store this information with the index and check that it
  remains the same at query time):
</p>
<ol>
  <li>JVM vendor</li>
  <li>JVM version, including patch version</li>
  <li>
    The language (and country and variant, if specified) of the Locale
    used when constructing the collator via
    {@link java.text.Collator#getInstance(java.util.Locale)}.
  </li>
  <li>
    The collation strength used -- see {@link java.text.Collator#setStrength(int)}.
  </li>
</ol>
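<p>
One way to follow that best practice is to capture the four items above as a
small "stamp" saved alongside the index (e.g. in a sidecar properties file) and
compare it at query time. The sketch below is illustrative only; the property
names and helper class are hypothetical and not part of any Lucene API:
</p>

```java
import java.util.Locale;
import java.util.Properties;

public class CollationStamp {
  /** Record the JVM and collator parameters in effect at index time. */
  public static Properties record(Locale locale, int strength) {
    Properties p = new Properties();
    p.setProperty("jvm.vendor", System.getProperty("java.vendor"));
    p.setProperty("jvm.version", System.getProperty("java.version"));
    p.setProperty("collator.locale", locale.toLanguageTag());
    p.setProperty("collator.strength", Integer.toString(strength));
    return p;
  }

  /** Check at query time that nothing in the recorded stamp has changed. */
  public static boolean matchesCurrent(Properties stamp, Locale locale, int strength) {
    return record(locale, strength).equals(stamp);
  }
}
```

The stamp can be persisted with `Properties.store` next to the index directory
and reloaded with `Properties.load` before opening a searcher.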
<p>
  <code>ICUCollationKeyAnalyzer</code> uses ICU4J's <code>Collator</code>, which
  makes its version available, thus allowing collation to be versioned
  independently from the JVM.  <code>ICUCollationKeyAnalyzer</code> is also
  significantly faster and generates significantly shorter keys than
  <code>CollationKeyAnalyzer</code>.  See
  <a href="http://site.icu-project.org/charts/collation-icu4j-sun"
    >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
  generation timing and key length comparisons between ICU4J and
  <code>java.text.Collator</code> over several languages.
</p>
<p>
  <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
  not compatible with those generated by ICU Collators.  Specifically, if
  you use <code>CollationKeyAnalyzer</code> to generate index terms, do not use
  <code>ICUCollationKeyAnalyzer</code> on the query side, or vice versa.
</p>
<hr>
<h1><a id="normalization">Normalization</a></h1>
<p>
  <code>ICUNormalizer2Filter</code> normalizes term text to a
  <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
  that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
  forms are standardized to a unique form.
</p>
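<p>
What "equivalent forms standardized to a unique form" means can be seen with the
JDK's own <code>java.text.Normalizer</code>: the decomposed sequence
<code>e</code> + U+0301 (combining acute accent) and the precomposed character
<code>é</code> (U+00E9) normalize to the same NFC string. This is a stdlib
illustration only; the filter itself uses ICU4J's <code>Normalizer2</code>:
</p>

```java
import java.text.Normalizer;

public class NfcDemo {
  public static void main(String[] args) {
    String decomposed = "e\u0301";   // 'e' + combining acute accent
    String precomposed = "\u00e9";   // 'é' as a single code point
    // The two are canonically equivalent but not equal as strings...
    System.out.println(decomposed.equals(precomposed));  // false
    // ...until both are normalized to the same form.
    System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                                 .equals(precomposed)); // true
  }
}
```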
<h2>Use Cases</h2>
<ul>
  <li> Removing differences in width for Asian-language text.
  </li>
  <li> Standardizing complex text with non-spacing marks so that characters are
  ordered consistently.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
<pre class="prettyprint">
  /**
   * Normalizer2 objects are unmodifiable and immutable.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  /**
   * This filter will normalize to NFC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
</pre>
<hr>
<h1><a id="casefolding">Case Folding</a></h1>
<p>
Default caseless matching, or case folding, is more than just conversion to
lowercase. For example, it handles cases such as the Greek sigma, so that
"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
</p>
<p>
Case folding is still only an approximation of the language-specific rules
governing case. If the specific language is known, consider using
ICUCollationKeyAnalyzer and indexing collation keys instead. This implementation
performs the "full" case folding specified in the Unicode standard, and this
may change the length of the term. For example, the German ß is case-folded
to the string 'ss'.
</p>
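<p>
A quick contrast with plain lowercasing shows why full case folding matters.
This is a stdlib illustration only; <code>ICUNormalizer2Filter</code> performs
the actual folding via ICU4J:
</p>

```java
import java.util.Locale;

public class CaseFoldDemo {
  public static void main(String[] args) {
    // Plain lowercasing leaves the German eszett alone, so the two
    // spellings of "white" do not match:
    System.out.println("wei\u00DF".toLowerCase(Locale.ROOT));  // weiß
    System.out.println("WEISS".toLowerCase(Locale.ROOT));      // weiss
    // Full case folding maps ß to "ss", so both sides fold to "weiss".
  }
}
```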
<p>
Case folding is related to normalization, and as such is coupled with it in
this integration. To perform case folding, use normalization with the form
"nfkc_cf" (which is the default).
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for LowerCaseFilter that has good behavior
    for most languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold and normalize to NFKC.
   */
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
</pre>
<hr>
<h1><a id="searchfolding">Search Term Folding</a></h1>
<p>
Search term folding removes distinctions (such as accent marks) between
similar characters. It is useful for a fuzzy or loose search.
</p>
<p>
Search term folding implements many of the foldings specified in
<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
as a special normalization form.  This folding applies NFKC, Case Folding, and
many character foldings recursively.
</p>
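<p>
One of the simplest of these foldings, accent removal, can be approximated with
the JDK alone by decomposing to NFD and stripping combining marks. This is a
rough sketch for illustration; <code>ICUFoldingFilter</code> below applies far
more foldings, plus case folding and NFKC normalization:
</p>

```java
import java.text.Normalizer;

public class AccentFoldDemo {
  /** Decompose to NFD, then drop combining marks: a rough accent fold. */
  public static String stripAccents(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD)
                     .replaceAll("\\p{M}+", "");
  }

  public static void main(String[] args) {
    System.out.println(stripAccents("r\u00E9sum\u00E9"));  // resume
  }
}
```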
<h2>Use Cases</h2>
<ul>
  <li>
    As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
    that applies the same ideas to many more languages.
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Removing accents</h3>
<pre class="prettyprint">
  /**
   * This filter will case-fold, remove accents and other distinctions, and
   * normalize to NFKC.
   */
  TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
</pre>
<hr>
<h1><a id="transform">Text Transformation</a></h1>
<p>
ICU provides text-transformation functionality via its Transliteration API. This allows
you to transform text in a variety of ways, taking context into account.
</p>
<p>
For more information, see the
<a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
and
<a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
</p>
<h2>Use Cases</h2>
<ul>
  <li>
    Convert Traditional Chinese to Simplified Chinese
  </li>
  <li>
    Transliterate between different writing systems: e.g. Romanization
  </li>
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
<pre class="prettyprint">
  /**
   * This filter will map Traditional Chinese to Simplified Chinese.
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
<pre class="prettyprint">
  /**
   * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules.
   */
  TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
</pre>
<hr>
<h1><a id="backcompat">Backwards Compatibility</a></h1>
<p>
This module exists to provide up-to-date Unicode functionality that supports
the most recent version of Unicode (currently 11.0). However, users who
require stronger backwards compatibility can restrict
{@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
a specific Unicode version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
<pre class="prettyprint">
  /**
   * This filter will do NFC normalization, but will ignore any characters that
   * did not exist as of Unicode 5.0. Because of the normalization stability policy
   * of Unicode, this is an easy way to force normalization to a specific version.
   */
  Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
  UnicodeSet set = new UnicodeSet("[:age=5.0:]");
  // see FilteredNormalizer2 docs: the set should be frozen or performance will suffer
  set.freeze();
  FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
  TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
</pre>
</body>
</html>