<!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
-->

# Apache Lucene Migration Guide

## Migration from Lucene 9.x to Lucene 10.0

### PersianStemFilter is added to PersianAnalyzer (LUCENE-10312)

`PersianAnalyzer` now includes `PersianStemFilter`, which changes analysis results. If you need exactly the same analysis
behaviour as 9.x, clone `PersianAnalyzer` from 9.x or build a custom analyzer with `CustomAnalyzer`.

### AutomatonQuery/CompiledAutomaton/RunAutomaton/RegExp no longer determinize (LUCENE-10010)

These classes no longer take a `determinizeWorkLimit` and no longer determinize
behind the scenes. It is the responsibility of the caller to call
`Operations.determinize()` for DFA execution.

### DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery removed in favor of FieldExistsQuery (LUCENE-10436)

These classes have been removed and consolidated into `FieldExistsQuery`. To migrate, callers simply replace those classes
with the new one during object instantiation.

### Normalizer and stemmer classes are now package private (LUCENE-10561)

With a few exceptions, normalizer and stemmer classes are now package private. If your code depends on
constants defined in them, copy the constant values and re-define them in your code.

## Migration from Lucene 9.0 to Lucene 9.1

### Test framework package migration and module (LUCENE-10301)

The test framework is now a Java module. All the classes have been moved from
`org.apache.lucene.*` to `org.apache.lucene.tests.*` to avoid package name conflicts
with the core module. If you were using the Lucene test framework, the migration should be
fairly automatic (a package prefix change).

### Minor syntactical changes in StandardQueryParser (LUCENE-10223)

Added interval functions and min-should-match support to `StandardQueryParser`. This
means that interval function prefixes (`fn:`) and the `@` character after parentheses will
parse differently than before. If you need the exact previous behavior, clone the
`StandardSyntaxParser` from the previous version of Lucene and create a custom query parser
with that parser.

### Lucene Core now depends on java.logging (JUL) module (LUCENE-10342)

Lucene Core now logs certain warnings and errors using Java Util Logging (JUL).
It is therefore recommended to install wrapper libraries with JUL logging handlers to
feed the log events into your app's own logging system.

Under normal circumstances Lucene won't log anything, but in the case of a problem
users should find the logged information in the usual log files.

Lucene also provides a `JavaLoggingInfoStream` implementation that logs `IndexWriter`
events using JUL.

To feed Lucene's log events into the well-known Log4J system, we refer to
the [Log4j JDK Logging Adapter](https://logging.apache.org/log4j/2.x/log4j-jul/index.html)
in combination with the corresponding system property:
`java.util.logging.manager=org.apache.logging.log4j.jul.LogManager`.
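
Where no ready-made adapter fits, the handler idea can be sketched with the JDK alone. The example below is a minimal illustration under stated assumptions, not a Lucene API: the class `ForwardingHandler`, its in-memory `sink`, and the logger name used in the usage note are all hypothetical, and the list merely stands in for an application's own logging system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical bridge (not part of Lucene): collects JUL records for another logging system. */
final class ForwardingHandler extends Handler {
  // Stand-in for the application's own logging system.
  private final List<String> sink = new ArrayList<>();
  // Strong reference: LogManager only holds loggers weakly, so keep the logger alive here.
  private Logger owner;

  @Override
  public synchronized void publish(LogRecord record) {
    if (isLoggable(record)) {
      sink.add(record.getLevel() + " " + record.getLoggerName() + ": " + record.getMessage());
    }
  }

  @Override
  public void flush() {}

  @Override
  public void close() {}

  /** Returns a copy of everything captured so far. */
  synchronized List<String> messages() {
    return new ArrayList<>(sink);
  }

  /** Attaches a new handler to the named logger and returns it. */
  static ForwardingHandler install(String loggerName) {
    ForwardingHandler handler = new ForwardingHandler();
    handler.owner = Logger.getLogger(loggerName);
    handler.owner.addHandler(handler);
    return handler;
  }
}
```

Calling `ForwardingHandler.install("org.apache.lucene.example")` and then logging a warning on that logger should leave one formatted entry in `messages()`. Which logger names Lucene actually uses is not stated here, so attaching to a parent logger or the root logger may be the more practical choice.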

### Kuromoji and Nori analysis component constructors for custom dictionaries

The Kuromoji and Nori analysis modules offered ways to customize the backing dictionaries
by passing a path to file or classpath resources through some inconsistently implemented
APIs. This was buggy from the beginning, but some users made use of it. With the move to the Java
module system, resource lookup on the classpath in particular stopped working correctly.
The Lucene team therefore implemented new APIs to create dictionary implementations
with custom data files. Unfortunately there were some shortcomings in the 9.1 version,
including when using the now deprecated constructors, so users are advised to upgrade to
Lucene 9.2 or stay with 9.0.

See LUCENE-10558 for more details and workarounds.

## Migration from Lucene 8.x to Lucene 9.0

### Rename of binary artifacts from '**-analyzers-**' to '**-analysis-**' (LUCENE-9562)

All binary analysis packages (and corresponding Maven artifacts) have been renamed and are
now consistent with repository module `analysis`.
You will need to adjust build dependencies
to the new coordinates:

| Old Artifact Coordinates                    | New Artifact Coordinates                   |
|---------------------------------------------|--------------------------------------------|
|org.apache.lucene:lucene-analyzers-common    |org.apache.lucene:lucene-analysis-common    |
|org.apache.lucene:lucene-analyzers-icu       |org.apache.lucene:lucene-analysis-icu       |
|org.apache.lucene:lucene-analyzers-kuromoji  |org.apache.lucene:lucene-analysis-kuromoji  |
|org.apache.lucene:lucene-analyzers-morfologik|org.apache.lucene:lucene-analysis-morfologik|
|org.apache.lucene:lucene-analyzers-nori      |org.apache.lucene:lucene-analysis-nori      |
|org.apache.lucene:lucene-analyzers-opennlp   |org.apache.lucene:lucene-analysis-opennlp   |
|org.apache.lucene:lucene-analyzers-phonetic  |org.apache.lucene:lucene-analysis-phonetic  |
|org.apache.lucene:lucene-analyzers-smartcn   |org.apache.lucene:lucene-analysis-smartcn   |
|org.apache.lucene:lucene-analyzers-stempel   |org.apache.lucene:lucene-analysis-stempel   |


### LucenePackage class removed (LUCENE-10260)

`LucenePackage` class has been removed. The implementation string can be
retrieved from `Version.getPackageImplementationVersion()`.

### Directory API is now little-endian (LUCENE-9047)

`DataOutput`'s `writeShort()`, `writeInt()`, and `writeLong()` methods now encode with
little-endian byte order.
If you have custom subclasses of `DataInput`/`DataOutput`, you
will need to adjust them from big-endian byte order to little-endian byte order.

### NativeUnixDirectory removed and replaced by DirectIODirectory (LUCENE-8982)

Java 11 supports Direct IO from Java code without native wrappers.
`NativeUnixDirectory` in the misc module was therefore removed and replaced
by `DirectIODirectory`. To use it, you need a JVM and operating system that
support Direct IO.

### BM25Similarity.setDiscountOverlaps and LegacyBM25Similarity.setDiscountOverlaps methods removed (LUCENE-9646)

The `discountOverlaps` parameter for both `BM25Similarity` and `LegacyBM25Similarity`
is now set by the constructor of those classes.

### Packages in misc module are renamed (LUCENE-9600)

These packages in the `lucene-misc` module are renamed:

| Old Package Name         | New Package Name              |
|--------------------------|-------------------------------|
|org.apache.lucene.document|org.apache.lucene.misc.document|
|org.apache.lucene.index   |org.apache.lucene.misc.index   |
|org.apache.lucene.search  |org.apache.lucene.misc.search  |
|org.apache.lucene.store   |org.apache.lucene.misc.store   |
|org.apache.lucene.util    |org.apache.lucene.misc.util    |

The following classes were moved to the `lucene-core` module:

- org.apache.lucene.document.InetAddressPoint
- org.apache.lucene.document.InetAddressRange

### Packages in sandbox module are renamed (LUCENE-9319)

These packages in the `lucene-sandbox` module are renamed:

| Old Package Name         | New Package Name                 |
|--------------------------|----------------------------------|
|org.apache.lucene.codecs  |org.apache.lucene.sandbox.codecs  |
|org.apache.lucene.document|org.apache.lucene.sandbox.document|
|org.apache.lucene.search  |org.apache.lucene.sandbox.search  |

### Backward codecs are renamed (LUCENE-9318)

These packages in the `lucene-backwards-codecs` module are renamed:

| Old Package Name       | New Package Name                |
|------------------------|---------------------------------|
|org.apache.lucene.codecs|org.apache.lucene.backward_codecs|

### JapanesePartOfSpeechStopFilterFactory loads default stop tags if "tags" argument not specified (LUCENE-9567)

Previously, `JapanesePartOfSpeechStopFilterFactory` added no filter if `args` didn't include "tags". Now, it will load
the default stop tags returned by `JapaneseAnalyzer.getDefaultStopTags()` (i.e. the tags from `stoptags.txt` in the
`lucene-analysis-kuromoji` jar).

### ICUCollationKeyAnalyzer is renamed (LUCENE-9558)

These packages in the `lucene-analysis-icu` module are renamed:

| Old Package Name          | New Package Name             |
|---------------------------|------------------------------|
|org.apache.lucene.collation|org.apache.lucene.analysis.icu|

### Base and concrete analysis factories are moved / package renamed (LUCENE-9317)

Base analysis factories are moved to `lucene-core`, and their package names are renamed.

| Old Class Name                                   | New Class Name                               |
|--------------------------------------------------|----------------------------------------------|
|org.apache.lucene.analysis.util.TokenizerFactory  |org.apache.lucene.analysis.TokenizerFactory   |
|org.apache.lucene.analysis.util.CharFilterFactory |org.apache.lucene.analysis.CharFilterFactory  |
|org.apache.lucene.analysis.util.TokenFilterFactory|org.apache.lucene.analysis.TokenFilterFactory |

The service provider files placed in `META-INF/services` for custom analysis factories should be renamed as follows:

- META-INF/services/org.apache.lucene.analysis.TokenizerFactory
- META-INF/services/org.apache.lucene.analysis.CharFilterFactory
- META-INF/services/org.apache.lucene.analysis.TokenFilterFactory

`StandardTokenizerFactory` is moved to `lucene-core` module.

The `org.apache.lucene.analysis.standard` package in `lucene-analysis-common` module
is split into `org.apache.lucene.analysis.classic` and `org.apache.lucene.analysis.email`.

### RegExpQuery now rejects invalid backslashes (LUCENE-9370)

We now follow the [Java rules](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs) for accepting backslashes.
Alphabetic characters other than s, S, w, W, d or D that are preceded by a backslash are considered illegal syntax and will throw an exception.
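
To locate affected patterns before upgrading, a pre-check along the following lines may help. This is a minimal sketch of the rule as stated above and not Lucene's parser: the class `BackslashCheck` and its method are hypothetical, "alphabetic" is assumed to mean ASCII letters, and only the character directly after each backslash is inspected.

```java
/** Hypothetical pre-check mirroring the rule described above; not part of Lucene. */
final class BackslashCheck {
  // Letters that remain legal after a backslash, per the migration note.
  private static final String ALLOWED_LETTERS = "sSwWdD";

  /** Returns the index of the first backslash followed by a disallowed letter, or -1 if none. */
  static int firstIllegalEscape(String regex) {
    for (int i = 0; i < regex.length(); i++) {
      if (regex.charAt(i) != '\\') {
        continue;
      }
      if (i + 1 == regex.length()) {
        return -1; // a trailing backslash is outside the scope of this sketch
      }
      char next = regex.charAt(i + 1);
      boolean asciiLetter = (next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z');
      if (asciiLetter && ALLOWED_LETTERS.indexOf(next) < 0) {
        return i;
      }
      i++; // skip the escaped character so an escaped backslash is not re-read
    }
    return -1;
  }
}
```

For instance, `firstIllegalEscape("foo\\q")` reports index 3, while `firstIllegalEscape("a\\db")` and `firstIllegalEscape("\\\\q")` report -1 because `\d` is allowed and `\\` is an escaped backslash followed by a plain `q`.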

### RegExp certain regular expressions now match differently (LUCENE-9336)

The commonly used regular expressions `\w`, `\W`, `\d`, `\D`, `\s` and `\S` now work the same way [Java Pattern](https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html#CHART) matching works. Previously these expressions were (mis)interpreted as searches for the literal characters w, d, s etc.

### NGramFilterFactory "keepShortTerm" option was fixed to "preserveOriginal" (LUCENE-9259)

The factory option name to output the original term was corrected in accordance with its Javadoc.

### IndexMergeTool defaults changes (LUCENE-9206)

This command-line tool no longer forceMerges to a single segment. Instead, by
default it just follows the (configurable) merge policy. If you really want to merge
to a single segment, you can pass `-max-segments 1`.

### FST Builder is renamed FSTCompiler with fluent-style Builder (LUCENE-9089)

Simply use `FSTCompiler` instead of the previous `Builder`. Use either the simple constructor with default settings, or
the `FSTCompiler.Builder` to tune and tweak any parameter.

### Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933)

User dictionary now strictly validates if the (concatenated) segment is the same as the surface form. This change avoids
unexpected runtime exceptions or behaviours.
For example, these entries are not allowed at all and an exception is thrown when loading the dictionary file.

```
# concatenated "日本経済新聞" does not match the surface form "日経新聞"
日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

# concatenated "日経新聞" does not match the surface form "日本経済新聞"
日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞
```

### JapaneseTokenizer no longer emits original (compound) tokens by default when the mode is not NORMAL (LUCENE-9123)

`JapaneseTokenizer` and `JapaneseAnalyzer` no longer emit original tokens when the `discardCompoundToken` option is not specified.
The constructor option was introduced in Lucene 8.5.0, and its default value is changed to `true`.

When given the text "株式会社", `JapaneseTokenizer` (mode != NORMAL) emits decompounded tokens "株式" and "会社" only and no
longer outputs the original token "株式会社" by default. To output original tokens, the `discardCompoundToken` option should be
explicitly set to `false`. Be aware that if this option is set to `false`, `SynonymFilter` or `SynonymGraphFilter` does not work
correctly (see LUCENE-9173).

### Analysis factories now have customizable symbolic names (LUCENE-8778) and need additional no-arg constructor (LUCENE-9281)

The SPI names for concrete subclasses of `TokenizerFactory`, `TokenFilterFactory`, and `CharFilterFactory` are no longer
derived from their class name.
Instead, each factory must have a static "NAME" field like this:

```java
  /** o.a.l.a.standard.StandardTokenizerFactory's SPI name */
  public static final String NAME = "standard";
```

A factory can be resolved/instantiated with its `NAME` by using methods such as `TokenizerFactory.lookupClass(String)`
or `TokenizerFactory.forName(String, Map<String,String>)`.

If there are any user-defined factory classes that don't have a proper `NAME` field, an exception will be thrown
when (re)loading factories, e.g., when calling `TokenizerFactory.reloadTokenizers(ClassLoader)`.

In addition, all factories need to implement a public no-arg constructor. The reason for this
change comes from the fact that Lucene now uses `java.util.ServiceLoader` instead of its own implementation to
load the factory classes to be compatible with Java Module System changes (e.g., load factories from modules).
In the future, extensions to Lucene developed on the Java Module System may expose the factories from their
`module-info.java` file instead of `META-INF/services`.

This constructor is never called by Lucene, so by default it throws an `UnsupportedOperationException`.
User-defined
factory classes should implement it in the following way:

```java
  /** Default ctor for compatibility with SPI */
  public StandardTokenizerFactory() {
    throw defaultCtorException();
  }
```

(`defaultCtorException()` is a protected static helper method)

### TermsEnum is now fully abstract (LUCENE-8292, LUCENE-8662)

`TermsEnum` has been changed to be fully abstract, so non-abstract subclasses must implement all its methods.
Non-performance-critical `TermsEnum`s can use `BaseTermsEnum` as a base class instead. The change was motivated
by several performance issues with `FilterTermsEnum` that caused significant slowdowns and massive memory consumption due
to not delegating all methods from `TermsEnum`.

### RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed (LUCENE-8474)

RAM-based directory implementations have been removed.
`ByteBuffersDirectory` can be used as a RAM-resident replacement, although it
is discouraged in favor of the default `MMapDirectory`.

### Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014)

`SpanQuery` and `PhraseQuery` now always calculate their slops as
`(1.0 / (1.0 + distance))`. Payload factor calculation is performed by
`PayloadDecoder` in the `lucene-queries` module.
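
If custom code previously overrode the slop computation, the fixed formula is simple to reproduce for comparison. The sketch below only restates the arithmetic quoted above; the class and method names are illustrative and are not Lucene API.

```java
/** Illustrative helper (not part of Lucene) restating the fixed slop formula. */
final class SlopFactor {
  /** Returns 1 / (1 + distance) for the given match distance. */
  static float sloppyFreq(int distance) {
    return 1.0f / (1.0f + distance);
  }
}
```

A distance of 0 yields a factor of 1, and the factor shrinks as the distance grows: 0.5 at a distance of 1 and 0.25 at a distance of 3.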

### Scorer must produce positive scores (LUCENE-7996)

`Scorer`s are no longer allowed to produce negative scores. If you have custom
query implementations, you should make sure their score formula may never produce
negative scores.

As a side-effect of this change, negative boosts are now rejected and
`FunctionScoreQuery` maps negative values to 0.

### CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099)

Instead use `FunctionScoreQuery` and a `DoubleValuesSource` implementation. `BoostedQuery`
and `BoostingQuery` may be replaced by calls to `FunctionScoreQuery.boostByValue()` and
`FunctionScoreQuery.boostByQuery()`.
To replace more complex calculations in
`CustomScoreQuery`, use the `lucene-expressions` module:

```java
SimpleBindings bindings = new SimpleBindings();
bindings.add("score", DoubleValuesSource.SCORES);
bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));
```

### IndexOptions can no longer be changed dynamically (LUCENE-8134)

Changing `IndexOptions` for a field on the fly will now result in an
`IllegalArgumentException`. If a field is indexed
(`FieldType.indexOptions() != IndexOptions.NONE`) then all documents must have
the same index options for that field.


### IndexSearcher.createNormalizedWeight() removed (LUCENE-8242)

Instead use `IndexSearcher.createWeight()`, rewriting the query first, and using
a boost of `1f`.

### Memory codecs removed (LUCENE-8267)

Memory codecs (`MemoryPostingsFormat`, `MemoryDocValuesFormat`) have been removed from the codebase.

### Direct doc-value format removed (LUCENE-8917)

The `Direct` doc-value format has been removed from the codebase.

### QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144)

Caching everything is discouraged as it disables the ability to skip non-interesting documents.
`ALWAYS_CACHE` can be replaced by a `UsageTrackingQueryCachingPolicy` with an appropriate config.

### English stopwords are no longer removed by default in StandardAnalyzer (LUCENE-7444)

To retain the old behaviour, pass `EnglishAnalyzer.ENGLISH_STOP_WORDS_SET` as an argument
to the constructor.

### StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved

English stop words are now defined in `EnglishAnalyzer.ENGLISH_STOP_WORDS_SET` in the
`analysis-common` module.

### TopDocs.maxScore removed

`TopDocs.maxScore` is removed. `IndexSearcher` and `TopFieldCollector` no longer have
an option to compute the maximum score when sorting by field. If you need to
know the maximum score for a query, the recommended approach is to run a
separate query:

```java
  TopDocs topHits = searcher.search(query, 1);
  float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;
```

Thanks to other optimizations that were added to Lucene 8, this query will be
able to efficiently select the top-scoring document without having to visit
all matches.

### TopFieldCollector always assumes fillFields=true

Because filling sort values doesn't have a significant overhead, the `fillFields`
option has been removed from `TopFieldCollector` factory methods. Everything
behaves as if it was previously set to `true`.

### TopFieldCollector no longer takes a trackDocScores option

Computing scores at collection time is less efficient than running a second
request in order to only compute scores for documents that made it to the top
hits. As a consequence, the `trackDocScores` option has been removed and can be
replaced with the new `TopFieldCollector.populateScores()` helper method.

### IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long

Lucene 8 received optimizations for collection of top-k matches by not visiting
all matches. However these optimizations won't help if all matches still need
to be visited in order to compute the total number of hits.
As a consequence,
`IndexSearcher`'s `search()` and `searchAfter()` methods were changed to only count hits
accurately up to 1,000, and `TopDocs.totalHits` was changed from a `long` to an
object that says whether the hit count is accurate or a lower bound of the
actual hit count.

### RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated (LUCENE-8467, LUCENE-8438)

This RAM-based directory implementation is an old piece of code that uses inefficient
thread synchronization primitives and can be mistaken as "faster" than the NIO-based
`MMapDirectory`. It is deprecated and scheduled for removal in future versions of
Lucene.

### LeafCollector.setScorer() now takes a Scorable rather than a Scorer (LUCENE-6228)

`Scorer` has a number of methods that should never be called from `Collector`s, for example
those that advance the underlying iterators. To hide these, `LeafCollector.setScorer()`
now takes a `Scorable`, an abstract class that scorers can extend, with methods
`docID()` and `score()`.

### Scorers must have non-null Weights

If a custom `Scorer` implementation does not have an associated `Weight`, it can probably
be replaced with a `Scorable` instead.

### Suggesters now return Long instead of long for weight() during indexing, and double instead of long at suggest time

Most code should just require recompilation, though possibly requiring some added casts.

### TokenStreamComponents is now final

Instead of overriding `TokenStreamComponents.setReader()` to customise analyzer
initialisation, you should now pass a `Consumer<Reader>` instance to the
`TokenStreamComponents` constructor.

### LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed

`LowerCaseTokenizer` combined tokenization and filtering in a way that broke token
normalization, so they have been removed. Instead, use a `LetterTokenizer` followed by
a `LowerCaseFilter`.

### CharTokenizer no longer takes a normalizer function

`CharTokenizer` now only performs tokenization. To perform any type of filtering
use a `TokenFilter` chain as you would with any other `Tokenizer`.

### Highlighter and FastVectorHighlighter no longer support ToParent/ToChildBlockJoinQuery

Both `Highlighter` and `FastVectorHighlighter` need a custom `WeightedSpanTermExtractor` or `FieldQuery`, respectively,
in order to support `ToParentBlockJoinQuery`/`ToChildBlockJoinQuery`.

### MultiTermAwareComponent replaced by CharFilterFactory.normalize() and TokenFilterFactory.normalize()

Normalization is now type-safe, with `CharFilterFactory.normalize()` returning a `Reader` and
`TokenFilterFactory.normalize()` returning a `TokenFilter`.

### k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563)

Scores computed by the `BM25Similarity` are lower than previously as the `k1+1`
constant factor was removed from the numerator of the scoring formula.
Ordering of results is preserved unless scores are computed from multiple
fields using different similarities. The previous behaviour is now exposed
by the `LegacyBM25Similarity` class which can be found in the lucene-misc jar.

### IndexWriter.maxDoc()/numDocs() removed in favor of IndexWriter.getDocStats()

Use `IndexWriter.getDocStats()` instead of `maxDoc()` / `numDocs()`; it offers a consistent
view of document stats. Previously, calling two methods in order to get point-in-time stats was subject
to concurrent changes.

### maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811)

`IndexSearcher` now performs max clause count checks on all types of queries (including BooleanQueries).
This led to a logical move of the clauses count from `BooleanQuery` to `IndexSearcher`.
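
To make the BM25 change described above (LUCENE-8563) concrete, here is a minimal arithmetic sketch in plain Java. It is not Lucene's implementation: the class `Bm25NumeratorDemo` and the methods `tfPart()` and `legacyTfPart()` are hypothetical names used only for illustration. The IDF and boost terms are left out because the change does not touch them. The values `k1 = 1.2` and `b = 0.75` are assumed to match the `BM25Similarity` defaults.

```java
public class Bm25NumeratorDemo {
    // Assumed to match the default BM25Similarity parameters.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Term-frequency part of BM25 without the (k1 + 1) factor in the numerator. */
    public static double tfPart(double freq, double docLen, double avgDocLen) {
        double lengthNorm = K1 * (1 - B + B * docLen / avgDocLen);
        return freq / (freq + lengthNorm);
    }

    /** The same quantity with the (k1 + 1) factor kept in the numerator. */
    public static double legacyTfPart(double freq, double docLen, double avgDocLen) {
        return (K1 + 1) * tfPart(freq, docLen, avgDocLen);
    }

    public static void main(String[] args) {
        double avgDocLen = 120;
        // Each row is {term frequency, document length}; the numbers are made up.
        double[][] docs = {{1, 80}, {3, 100}, {5, 400}};
        for (double[] d : docs) {
            System.out.printf("freq=%.0f len=%.0f current=%.4f legacy=%.4f%n",
                d[0], d[1],
                tfPart(d[0], d[1], avgDocLen),
                legacyTfPart(d[0], d[1], avgDocLen));
        }
    }
}
```

Within a single similarity the two variants differ only by the constant `k1 + 1`, so the relative order of documents is unchanged. The order can only shift when scores from fields with different similarities (and hence different constants) are combined.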

### TopDocs.merge shall no longer allow setting of shard indices

`TopDocs.merge()` no longer accepts a parameter indicating whether it should
set shard indices for hits as they are seen during the merge process. This simplifies the API
and makes it more flexible with regard to passing in custom tie breakers.
If shard indices are to be used for tie breaking docs with equal scores during `TopDocs.merge()`, then it is
mandatory that the input `ScoreDocs` have their shard indices set to valid values prior to calling `merge()`.

### TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments

`TopDocsCollector` shall no longer return an empty `TopDocs` for malformed arguments.
Rather, an `IllegalArgumentException` shall be thrown. This is introduced for better
defence and to ensure that there is no bubbling up of errors when Lucene is
used in multi-level applications.

### Assumption of data consistency between different data-structures sharing the same field name

Sorting on a numeric field that is indexed with both doc values and points may use an
optimization to skip non-competitive documents. This optimization relies on the assumption
that the same data is stored in these points and doc values.

### Require consistency between data-structures on a per-field basis

The per field data-structures are implicitly defined by the first document
indexed that contains a certain field. Once defined, the per field
data-structures are not changeable for the whole index. For example, if you
first index a document where a certain field is indexed with doc values and
points, all subsequent documents containing this field must also have this
field indexed with only doc values and points.

This also means that an index created in the previous version that doesn't
satisfy this requirement can not be updated.

### Doc values updates are allowed only for doc values only fields

Previously IndexWriter could update doc values for a binary or numeric docValue
field that was also indexed with other data structures (e.g. postings, vectors
etc). This is not allowed anymore. A field must be indexed with only doc values
to be allowed for doc values updates in `IndexWriter`.

### SortedDocValues no longer extends BinaryDocValues (LUCENE-9796)

`SortedDocValues` no longer extends `BinaryDocValues`: `SortedDocValues` do not have a per-document
binary value, they have a per-document numeric `ordValue()`.
The ordinal can then be dereferenced
to its binary form with `lookupOrd()`, but it was a performance trap to implement a `binaryValue()`
on the SortedDocValues api that does this behind-the-scenes on every document.

You can replace calls of `binaryValue()` with `lookupOrd(ordValue())` as a "quick fix", but it is
better to use the ordinal alone (integer-based datastructures) for per-document access, and only
call `lookupOrd()` a few times at the end (e.g. for the hits you want to display). Otherwise, if you
really don't want per-document ordinals, but instead a per-document `byte[]`, use a `BinaryDocValues`
field.

### Removed CodecReader.ramBytesUsed() (LUCENE-9387)

Lucene index readers are now using so little memory with the default codec that
it was decided to remove the ability to estimate their RAM usage.

### LongValueFacetCounts no longer accepts multiValued param in constructors (LUCENE-9948)

`LongValueFacetCounts` will now automatically detect whether-or-not an indexed field is single- or
multi-valued. The user no longer needs to provide this information to the ctors. Migrating should
be as simple as no longer providing this boolean.

### SpanQuery and subclasses have moved from core/ to the queries module

They can now be found in the `org.apache.lucene.queries.spans` package.

### SpanBoostQuery has been removed (LUCENE-8143)

`SpanBoostQuery` was a no-op unless used at the top level of a `SpanQuery` nested
structure. Use a standard `BoostQuery` here instead.

### Sort is immutable (LUCENE-9325)

Rather than using `setSort()` to change sort values, you should instead create
a new `Sort` instance with the new values.

### Taxonomy-based faceting uses more modern encodings (LUCENE-9450, LUCENE-10062, LUCENE-10122)

The side-car taxonomy index now uses doc values for ord-to-path lookup (LUCENE-9450) and parent
lookup (LUCENE-10122) instead of stored fields and positions (respectively). Document ordinals
are now encoded with `SortedNumericDocValues` instead of using a custom (v-int) binary format.
Performance gains have been observed with these encoding changes. These changes were introduced
in 9.0, and 9.x releases remain backwards-compatible with 8.x indexes, but starting with 10.0,
only the newer formats are supported. Users will need to create a new index with all their
documents using 9.0 or later to pick up the new format and remain compatible with 10.x releases.
Just re-adding documents to an existing index is not enough to pick up the changes as the
format will "stick" to whatever version was used to initially create the index.

Additionally, `OrdinalsReader` (and sub-classes) are fully removed starting with 10.0.
These classes were `@Deprecated` starting with 9.0. Users are encouraged to rely on the default
taxonomy facet encodings where possible. If custom formats are needed, users will need
to manage the indexed data on their own and create new `Facet` implementations to use it.