<!--
    Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements.  See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License.  You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
 -->

# Apache Lucene Migration Guide

## Migration from Lucene 9.x to Lucene 10.0

### PersianStemFilter is added to PersianAnalyzer (LUCENE-10312)

`PersianAnalyzer` now includes `PersianStemFilter`, which changes analysis results. If you need exactly the same analysis
behaviour as 9.x, clone `PersianAnalyzer` from 9.x or build a custom analyzer with `CustomAnalyzer`.

### AutomatonQuery/CompiledAutomaton/RunAutomaton/RegExp no longer determinize (LUCENE-10010)

These classes no longer take a `determinizeWorkLimit` and no longer determinize
behind the scenes. It is the responsibility of the caller to call
`Operations.determinize()` for DFA execution.

### DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery removed in favor of FieldExistsQuery (LUCENE-10436)

These classes have been removed and consolidated into `FieldExistsQuery`. To migrate, simply replace those classes
with the new one when instantiating the query.

### Normalizer and stemmer classes are now package private (LUCENE-10561)

With a few exceptions, normalizer and stemmer classes are now package private. If your code depends on
constants defined in them, copy the constant values and redefine them in your code.

## Migration from Lucene 9.0 to Lucene 9.1

### Test framework package migration and module (LUCENE-10301)

The test framework is now a Java module. All the classes have been moved from
`org.apache.lucene.*` to `org.apache.lucene.tests.*` to avoid package name conflicts
with the core module. If you were using the Lucene test framework, the migration should be
fairly mechanical: only the package prefix changes.

### Minor syntactical changes in StandardQueryParser (LUCENE-10223)

Interval functions and min-should-match support were added to `StandardQueryParser`. This
means that interval function prefixes (`fn:`) and the `@` character after parentheses will
parse differently than before. If you need the exact previous behavior, clone the
`StandardSyntaxParser` from the previous version of Lucene and create a custom query parser
with that parser.

### Lucene Core now depends on java.logging (JUL) module (LUCENE-10342)

Lucene Core now logs certain warnings and errors using Java Util Logging (JUL).
It is therefore recommended to install wrapper libraries with JUL logging handlers to
feed the log events into your app's own logging system.

Under normal circumstances Lucene won't log anything, but in the case of a problem
users should find the logged information in the usual log files.

Lucene also provides a `JavaLoggingInfoStream` implementation that logs `IndexWriter`
events using JUL.

To feed Lucene's log events into the well-known Log4J system, use
the [Log4j JDK Logging Adapter](https://logging.apache.org/log4j/2.x/log4j-jul/index.html)
in combination with the corresponding system property:
`java.util.logging.manager=org.apache.logging.log4j.jul.LogManager`.
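
If no adapter library fits your setup, a plain JUL `Handler` can do the forwarding. The following is a minimal sketch using only the JDK. It assumes that Lucene's loggers live under the `org.apache.lucene` namespace, the logger name `org.apache.lucene.example` is made up for the demonstration, and the in-memory list stands in for your application's logging system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class JulForwardingSketch {
  // JUL holds loggers weakly, so keep a strong reference to every logger we configure.
  private static final List<Logger> PINNED = new ArrayList<>();

  /** Collects JUL events; a real handler would hand them to your logging framework instead. */
  static final class ForwardingHandler extends Handler {
    final List<String> forwarded = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
      if (isLoggable(record)) {
        forwarded.add(record.getLevel() + " " + record.getLoggerName() + ": " + record.getMessage());
      }
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}
  }

  /** Attaches a new forwarding handler to the given logger namespace and returns it. */
  static ForwardingHandler install(String loggerName) {
    Logger logger = Logger.getLogger(loggerName);
    ForwardingHandler handler = new ForwardingHandler();
    logger.addHandler(handler);
    PINNED.add(logger);
    return handler;
  }

  public static void main(String[] args) {
    // Assumption: Lucene's loggers are children of "org.apache.lucene".
    ForwardingHandler handler = install("org.apache.lucene");
    // Simulates a library warning; this logger name is hypothetical.
    Logger.getLogger("org.apache.lucene.example").warning("simulated warning");
    System.out.println(handler.forwarded);
  }
}
```

Because child loggers pass records to their parents' handlers by default, one handler on the namespace root is enough to see events from every logger below it.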

### Kuromoji and Nori analysis component constructors for custom dictionaries

The Kuromoji and Nori analysis modules used to allow customizing the backing dictionaries
by passing a path to file or classpath resources through some inconsistently implemented
APIs. This was buggy from the beginning, but some users made use of it. With the move to the Java
module system, resource lookup on the classpath in particular stopped working correctly.
The Lucene team therefore implemented new APIs to create dictionary implementations
with custom data files. Unfortunately there were some shortcomings in the 9.1 version,
which also affect the now deprecated constructors, so users are advised to upgrade to
Lucene 9.2 or stay with 9.0.

See LUCENE-10558 for more details and workarounds.

## Migration from Lucene 8.x to Lucene 9.0

### Rename of binary artifacts from '**-analyzers-**' to '**-analysis-**' (LUCENE-9562)

All binary analysis packages (and corresponding Maven artifacts) have been renamed and are
now consistent with repository module `analysis`. You will need to adjust build dependencies
to the new coordinates:

|         Old Artifact Coordinates            |        New Artifact Coordinates            |
|---------------------------------------------|--------------------------------------------|
|org.apache.lucene:lucene-analyzers-common    |org.apache.lucene:lucene-analysis-common    |
|org.apache.lucene:lucene-analyzers-icu       |org.apache.lucene:lucene-analysis-icu       |
|org.apache.lucene:lucene-analyzers-kuromoji  |org.apache.lucene:lucene-analysis-kuromoji  |
|org.apache.lucene:lucene-analyzers-morfologik|org.apache.lucene:lucene-analysis-morfologik|
|org.apache.lucene:lucene-analyzers-nori      |org.apache.lucene:lucene-analysis-nori      |
|org.apache.lucene:lucene-analyzers-opennlp   |org.apache.lucene:lucene-analysis-opennlp   |
|org.apache.lucene:lucene-analyzers-phonetic  |org.apache.lucene:lucene-analysis-phonetic  |
|org.apache.lucene:lucene-analyzers-smartcn   |org.apache.lucene:lucene-analysis-smartcn   |
|org.apache.lucene:lucene-analyzers-stempel   |org.apache.lucene:lucene-analysis-stempel   |

### LucenePackage class removed (LUCENE-10260)

The `LucenePackage` class has been removed. The implementation version string can be
retrieved from `Version.getPackageImplementationVersion()`.

### Directory API is now little-endian (LUCENE-9047)

`DataOutput`'s `writeShort()`, `writeInt()`, and `writeLong()` methods now encode with
little-endian byte order. If you have custom subclasses of `DataInput`/`DataOutput`, you
will need to adjust them from big-endian byte order to little-endian byte order.
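
To make the difference concrete, here is a small JDK-only sketch of the two byte orders for a 32-bit value. It does not use Lucene's `DataInput`/`DataOutput` classes, and the class and method names are invented for the illustration. A custom subclass that assembles multi-byte values by hand would need the little-endian variant.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderSketch {
  /** Encodes an int most significant byte first (the old, big-endian order). */
  static byte[] intBigEndian(int v) {
    return ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.BIG_ENDIAN).putInt(v).array();
  }

  /** Encodes an int least significant byte first (little-endian order). */
  static byte[] intLittleEndian(int v) {
    return new byte[] {(byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)};
  }

  /** Decodes four little-endian bytes back into an int. */
  static int readIntLittleEndian(byte[] b) {
    return (b[0] & 0xFF) | (b[1] & 0xFF) << 8 | (b[2] & 0xFF) << 16 | (b[3] & 0xFF) << 24;
  }

  public static void main(String[] args) {
    int value = 0x01020304;
    System.out.println(java.util.Arrays.toString(intBigEndian(value)));    // [1, 2, 3, 4]
    System.out.println(java.util.Arrays.toString(intLittleEndian(value))); // [4, 3, 2, 1]
  }
}
```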

### NativeUnixDirectory removed and replaced by DirectIODirectory (LUCENE-8982)

Java 11 supports Direct IO from Java code without native wrappers.
`NativeUnixDirectory` in the misc module was therefore removed and replaced
by `DirectIODirectory`. To use it, you need a JVM and operating system that
support Direct IO.

### BM25Similarity.setDiscountOverlaps and LegacyBM25Similarity.setDiscountOverlaps methods removed (LUCENE-9646)

The `discountOverlaps` parameter for both `BM25Similarity` and `LegacyBM25Similarity`
is now set by the constructor of those classes.

### Packages in misc module are renamed (LUCENE-9600)

These packages in the `lucene-misc` module are renamed:

|    Old Package Name      |       New Package Name        |
|--------------------------|-------------------------------|
|org.apache.lucene.document|org.apache.lucene.misc.document|
|org.apache.lucene.index   |org.apache.lucene.misc.index   |
|org.apache.lucene.search  |org.apache.lucene.misc.search  |
|org.apache.lucene.store   |org.apache.lucene.misc.store   |
|org.apache.lucene.util    |org.apache.lucene.misc.util    |

The following classes were moved to the `lucene-core` module:

- org.apache.lucene.document.InetAddressPoint
- org.apache.lucene.document.InetAddressRange

### Packages in sandbox module are renamed (LUCENE-9319)

These packages in the `lucene-sandbox` module are renamed:

|    Old Package Name      |       New Package Name           |
|--------------------------|----------------------------------|
|org.apache.lucene.codecs  |org.apache.lucene.sandbox.codecs  |
|org.apache.lucene.document|org.apache.lucene.sandbox.document|
|org.apache.lucene.search  |org.apache.lucene.sandbox.search  |

### Backward codecs are renamed (LUCENE-9318)

These packages in the `lucene-backwards-codecs` module are renamed:

|    Old Package Name    |       New Package Name          |
|------------------------|---------------------------------|
|org.apache.lucene.codecs|org.apache.lucene.backward_codecs|

### JapanesePartOfSpeechStopFilterFactory loads default stop tags if "tags" argument not specified (LUCENE-9567)

Previously, `JapanesePartOfSpeechStopFilterFactory` added no filter if `args` didn't include "tags". Now, it will load
the default stop tags returned by `JapaneseAnalyzer.getDefaultStopTags()` (i.e. the tags from `stoptags.txt` in the
`lucene-analysis-kuromoji` jar).

### ICUCollationKeyAnalyzer is renamed (LUCENE-9558)

These packages in the `lucene-analysis-icu` module are renamed:

|    Old Package Name       |       New Package Name       |
|---------------------------|------------------------------|
|org.apache.lucene.collation|org.apache.lucene.analysis.icu|

### Base and concrete analysis factories are moved / package renamed (LUCENE-9317)

Base analysis factories are moved to `lucene-core`, and their packages are renamed.

|                Old Class Name                    |               New Class Name                |
|--------------------------------------------------|---------------------------------------------|
|org.apache.lucene.analysis.util.TokenizerFactory  |org.apache.lucene.analysis.TokenizerFactory  |
|org.apache.lucene.analysis.util.CharFilterFactory |org.apache.lucene.analysis.CharFilterFactory |
|org.apache.lucene.analysis.util.TokenFilterFactory|org.apache.lucene.analysis.TokenFilterFactory|

The service provider files placed in `META-INF/services` for custom analysis factories should be renamed as follows:

- META-INF/services/org.apache.lucene.analysis.TokenizerFactory
- META-INF/services/org.apache.lucene.analysis.CharFilterFactory
- META-INF/services/org.apache.lucene.analysis.TokenFilterFactory

`StandardTokenizerFactory` is moved to the `lucene-core` module.

The `org.apache.lucene.analysis.standard` package in the `lucene-analysis-common` module
is split into `org.apache.lucene.analysis.classic` and `org.apache.lucene.analysis.email`.

### RegExpQuery now rejects invalid backslashes (LUCENE-9370)

We now follow the [Java rules](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs) for accepting backslashes.
Alphabetic characters other than `s`, `S`, `w`, `W`, `d` or `D` that are preceded by a backslash are considered illegal syntax and cause an exception to be thrown.
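
The Java rule referenced above can be observed directly with `java.util.regex`. This sketch only illustrates that rule: it does not call Lucene's `RegExp`, which per the text accepts a narrower set of escaped letters than Java does, and the class name is invented for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackslashRuleSketch {
  /** Returns true if java.util.regex accepts the expression, false on a syntax error. */
  static boolean compiles(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(compiles("\\d+"));   // a defined escape: accepted
    System.out.println(compiles("\\q"));    // backslash before an undefined letter: rejected
    System.out.println(compiles("\\\\q"));  // escaped backslash followed by a literal q: accepted
  }
}
```

If a query needs a literal backslash, escaping the backslash itself (as in the third call) avoids the error.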

### RegExp certain regular expressions now match differently (LUCENE-9336)

The commonly used regular expressions `\w`, `\W`, `\d`, `\D`, `\s` and `\S` now work the same way as [Java Pattern](https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html#CHART) matching. Previously these expressions were (mis)interpreted as searches for the literal characters `w`, `d`, `s` etc.
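
The difference between the two interpretations can be sketched with `java.util.regex` as a stand-in for Lucene's `RegExp`. The old behaviour is only emulated here by dropping the backslash, which is an assumption made for illustration and not how Lucene implemented it.

```java
import java.util.regex.Pattern;

class PredefinedClassSketch {
  /** New behaviour: the escape denotes a character class, as in java.util.regex. */
  static boolean matchesAsClass(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  /** Emulated old behaviour: the backslash is ignored and the letter is matched literally. */
  static boolean matchesAsLiteral(String regex, String input) {
    return Pattern.matches(regex.replace("\\", ""), input);
  }

  public static void main(String[] args) {
    System.out.println(matchesAsClass("\\d", "7"));   // digit class matches a digit
    System.out.println(matchesAsLiteral("\\d", "7")); // literal "d" does not match "7"
    System.out.println(matchesAsLiteral("\\d", "d")); // literal "d" matches "d"
  }
}
```

Queries that relied on the old literal matching need to be rewritten without the backslash.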

### NGramFilterFactory "keepShortTerm" option was fixed to "preserveOriginal" (LUCENE-9259)

The factory option name to output the original term was corrected in accordance with its Javadoc.

### IndexMergeTool defaults changed (LUCENE-9206)

This command-line tool no longer forceMerges to a single segment. Instead, by
default it just follows the (configurable) merge policy. If you really want to merge
to a single segment, you can pass `-max-segments 1`.

### FST Builder is renamed FSTCompiler with fluent-style Builder (LUCENE-9089)

Simply use `FSTCompiler` instead of the previous `Builder`. Use either the simple constructor with default settings, or
the `FSTCompiler.Builder` to tune and tweak any parameter.

### Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933)

The user dictionary now strictly validates that the (concatenated) segmentation is the same as the surface form. This change avoids
unexpected runtime exceptions or behaviours.
For example, these entries are not allowed at all, and an exception is thrown when loading the dictionary file.

```
# concatenated "日本経済新聞" does not match the surface form "日経新聞"
日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

# concatenated "日経新聞" does not match the surface form "日本経済新聞"
日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞
```
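
The rule can be checked ahead of time with a few lines of plain Java. This is a minimal sketch of the surface-form check only, not Lucene's actual validation code, and the class and method names are hypothetical. It assumes the first CSV column is the surface form and the second is the space-separated segmentation, as in the entries above.

```java
class UserDictionaryCheckSketch {
  /**
   * Returns true if the space-separated segmentation (second CSV column)
   * concatenates back to the surface form (first CSV column).
   */
  static boolean segmentationMatchesSurface(String entry) {
    String[] columns = entry.split(",");
    if (columns.length < 2) {
      return false; // not a complete entry
    }
    String surface = columns[0];
    String concatenated = columns[1].replace(" ", "");
    return surface.equals(concatenated);
  }

  public static void main(String[] args) {
    // Pass dictionary lines as arguments; prints one verdict per entry.
    for (String entry : args) {
      System.out.println(segmentationMatchesSurface(entry) + "\t" + entry);
    }
  }
}
```

Both rejected entries shown above fail this check, while an entry such as `日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞` passes it.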

### JapaneseTokenizer no longer emits original (compound) tokens by default when the mode is not NORMAL (LUCENE-9123)

`JapaneseTokenizer` and `JapaneseAnalyzer` no longer emit original tokens when the `discardCompoundToken` option is not specified.
The constructor option was introduced in Lucene 8.5.0, and its default value has changed to `true`.

When given the text "株式会社", `JapaneseTokenizer` (mode != NORMAL) emits only the decompounded tokens "株式" and "会社" and no
longer outputs the original token "株式会社" by default. To output original tokens, the `discardCompoundToken` option should be
explicitly set to `false`. Be aware that if this option is set to `false`, `SynonymFilter` and `SynonymGraphFilter` do not work
correctly (see LUCENE-9173).

### Analysis factories now have customizable symbolic names (LUCENE-8778) and need additional no-arg constructor (LUCENE-9281)

The SPI names for concrete subclasses of `TokenizerFactory`, `TokenFilterFactory`, and `CharFilterFactory` are no longer
derived from their class name. Instead, each factory must have a static "NAME" field like this:

```java
    /** o.a.l.a.standard.StandardTokenizerFactory's SPI name */
    public static final String NAME = "standard";
```

A factory can be resolved/instantiated with its `NAME` by using methods such as `TokenizerFactory.lookupClass(String)`
or `TokenizerFactory.forName(String, Map<String,String>)`.

If there are any user-defined factory classes that don't have a proper `NAME` field, an exception will be thrown
when (re)loading factories, e.g. when calling `TokenizerFactory.reloadTokenizers(ClassLoader)`.

In addition, all factories now need to implement a public no-arg constructor. The reason for this
change is that Lucene now uses `java.util.ServiceLoader` instead of its own implementation to
load the factory classes, to be compatible with Java Module System changes (e.g., load factories from modules).
In the future, extensions to Lucene developed on the Java Module System may expose the factories from their
`module-info.java` file instead of `META-INF/services`.

This constructor is never called by Lucene, so by default it throws an `UnsupportedOperationException`. User-defined
factory classes should implement it in the following way:

```java
    /** Default ctor for compatibility with SPI */
    public StandardTokenizerFactory() {
      throw defaultCtorException();
    }
```

(`defaultCtorException()` is a protected static helper method.)

### TermsEnum is now fully abstract (LUCENE-8292, LUCENE-8662)

`TermsEnum` has been changed to be fully abstract, so non-abstract subclasses must implement all its methods.
Non-performance-critical `TermsEnum`s can use `BaseTermsEnum` as a base class instead. The change was motivated
by several performance issues with `FilterTermsEnum` that caused significant slowdowns and massive memory consumption due
to not delegating all methods from `TermsEnum`.

### RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed (LUCENE-8474)

RAM-based directory implementations have been removed.
`ByteBuffersDirectory` can be used as a RAM-resident replacement, although it
is discouraged in favor of the default `MMapDirectory`.

### Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014)

`SpanQuery` and `PhraseQuery` now always calculate their slop factors as
`(1.0 / (1.0 + distance))`.  Payload factor calculation is performed by
`PayloadDecoder` in the `lucene-queries` module.
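
As a quick worked example of the formula above, the following JDK-only sketch evaluates it for a few distances. The class and method names are invented for illustration and are not part of Lucene's API.

```java
class SlopFactorSketch {
  /** Evaluates 1.0 / (1.0 + distance) for a match at the given edit distance. */
  static float slopFactor(int distance) {
    return 1.0f / (1.0f + distance);
  }

  public static void main(String[] args) {
    for (int distance = 0; distance <= 3; distance++) {
      System.out.println(distance + " -> " + slopFactor(distance));
    }
  }
}
```

An exact match (distance 0) keeps its full weight of 1.0, a distance of 1 halves it, and a distance of 3 reduces it to 0.25.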

### Scorer must produce positive scores (LUCENE-7996)

`Scorer`s are no longer allowed to produce negative scores. If you have custom
query implementations, you should make sure their score formula can never produce
negative scores.

As a side-effect of this change, negative boosts are now rejected and
`FunctionScoreQuery` maps negative values to 0.
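
One way a custom formula can go negative is a logarithm applied to a boost below 1. The sketch below shows such a formula and a clamped variant. It is a hypothetical illustration using plain arithmetic, not Lucene code, and clamping is only one possible fix; whether it is appropriate depends on your scoring model.

```java
class NonNegativeScoreSketch {
  /** A formula that goes negative whenever boost < 1, because ln(boost) < 0 there. */
  static double rawScore(double score, double boost) {
    return score * Math.log(boost);
  }

  /** Clamps the result at zero so the formula can never yield a negative score. */
  static double clampedScore(double score, double boost) {
    return Math.max(0.0, rawScore(score, boost));
  }

  public static void main(String[] args) {
    System.out.println(rawScore(2.0, 0.5));     // negative
    System.out.println(clampedScore(2.0, 0.5)); // 0.0
  }
}
```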

### CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099)

Instead use `FunctionScoreQuery` and a `DoubleValuesSource` implementation.  `BoostedQuery`
and `BoostingQuery` may be replaced by calls to `FunctionScoreQuery.boostByValue()` and
`FunctionScoreQuery.boostByQuery()`.  To replace more complex calculations in
`CustomScoreQuery`, use the `lucene-expressions` module:

```java
SimpleBindings bindings = new SimpleBindings();
bindings.add("score", DoubleValuesSource.SCORES);
bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));
```

### IndexOptions can no longer be changed dynamically (LUCENE-8134)

Changing `IndexOptions` for a field on the fly will now result in an
`IllegalArgumentException`. If a field is indexed
(`FieldType.indexOptions() != IndexOptions.NONE`) then all documents must have
the same index options for that field.

### IndexSearcher.createNormalizedWeight() removed (LUCENE-8242)

Instead use `IndexSearcher.createWeight()`, rewriting the query first, and using
a boost of `1f`.

### Memory codecs removed (LUCENE-8267)

Memory codecs (`MemoryPostingsFormat`, `MemoryDocValuesFormat`) have been removed from the codebase.

### Direct doc-value format removed (LUCENE-8917)

The `Direct` doc-value format has been removed from the codebase.

### QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144)

Caching everything is discouraged as it disables the ability to skip non-interesting documents.
`ALWAYS_CACHE` can be replaced by a `UsageTrackingQueryCachingPolicy` with an appropriate config.

### English stopwords are no longer removed by default in StandardAnalyzer (LUCENE-7444)

To retain the old behaviour, pass `EnglishAnalyzer.ENGLISH_STOP_WORDS_SET` as an argument
to the constructor.

### StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved

English stop words are now defined in `EnglishAnalyzer.ENGLISH_STOP_WORDS_SET` in the
`analysis-common` module.

### TopDocs.maxScore removed

`TopDocs.maxScore` is removed. `IndexSearcher` and `TopFieldCollector` no longer have
an option to compute the maximum score when sorting by field. If you need to
know the maximum score for a query, the recommended approach is to run a
separate query:

```java
  TopDocs topHits = searcher.search(query, 1);
  float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;
```

Thanks to other optimizations that were added to Lucene 8, this query will be
able to efficiently select the top-scoring document without having to visit
all matches.

### TopFieldCollector always assumes fillFields=true

Because filling sort values doesn't have a significant overhead, the `fillFields`
option has been removed from `TopFieldCollector` factory methods. Everything
behaves as if it was previously set to `true`.

### TopFieldCollector no longer takes a trackDocScores option

Computing scores at collection time is less efficient than running a second
request in order to only compute scores for documents that made it to the top
hits. As a consequence, the `trackDocScores` option has been removed and can be
replaced with the new `TopFieldCollector.populateScores()` helper method.

### IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long

Lucene 8 received optimizations for collection of top-k matches by not visiting
all matches. However these optimizations won't help if all matches still need
to be visited in order to compute the total number of hits. As a consequence,
`IndexSearcher`'s `search()` and `searchAfter()` methods were changed to only count hits
accurately up to 1,000, and `TopDocs.totalHits` was changed from a `long` to an
object that says whether the hit count is accurate or a lower bound of the
actual hit count.

### RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated (LUCENE-8467, LUCENE-8438)

This RAM-based directory implementation is an old piece of code that uses inefficient
thread synchronization primitives and can be mistaken as "faster" than the NIO-based
`MMapDirectory`. It is deprecated and scheduled for removal in future versions of
Lucene.

### LeafCollector.setScorer() now takes a Scorable rather than a Scorer (LUCENE-6228)

`Scorer` has a number of methods that should never be called from `Collector`s, for example
those that advance the underlying iterators.  To hide these, `LeafCollector.setScorer()`
now takes a `Scorable`, an abstract class that scorers can extend, with methods
`docID()` and `score()`.

### Scorers must have non-null Weights

If a custom `Scorer` implementation does not have an associated `Weight`, it can probably
be replaced with a `Scorable` instead.

### Suggesters now return Long instead of long for weight() during indexing, and double instead of long at suggest time

Most code should just require recompilation, though possibly requiring some added casts.

### TokenStreamComponents is now final

Instead of overriding `TokenStreamComponents.setReader()` to customise analyzer
initialisation, you should now pass a `Consumer<Reader>` instance to the
`TokenStreamComponents` constructor.
c7697b08STomoko Uchida
4dc3e8abSRobert Muir### LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed
c7697b08STomoko Uchida
4dc3e8abSRobert Muir`LowerCaseTokenizer` combined tokenization and filtering in a way that broke token
4dc3e8abSRobert Muirnormalization, so they have been removed. Instead, use a `LetterTokenizer` followed by
4dc3e8abSRobert Muira `LowerCaseFilter`.
c7697b08STomoko Uchida
4dc3e8abSRobert Muir### CharTokenizer no longer takes a normalizer function
c7697b08STomoko Uchida
4dc3e8abSRobert Muir`CharTokenizer` now only performs tokenization. To perform any type of filtering
4dc3e8abSRobert Muiruse a `TokenFilter` chain as you would with any other `Tokenizer`.
c7697b08STomoko Uchida
4dc3e8abSRobert Muir### Highlighter and FastVectorHighlighter no longer support ToParent/ToChildBlockJoinQuery
c7697b08STomoko Uchida
4dc3e8abSRobert MuirBoth `Highlighter` and `FastVectorHighlighter` need a custom `WeightedSpanTermExtractor` or `FieldQuery`, respectively,
4dc3e8abSRobert Muirin order to support `ToParentBlockJoinQuery`/`ToChildBlockJoinQuery`.
c7697b08STomoko Uchida
4dc3e8abSRobert Muir### MultiTermAwareComponent replaced by CharFilterFactory.normalize() and TokenFilterFactory.normalize()
c7697b08STomoko Uchida
4dc3e8abSRobert MuirNormalization is now type-safe, with `CharFilterFactory.normalize()` returning a `Reader` and
4dc3e8abSRobert Muir`TokenFilterFactory.normalize()` returning a `TokenFilter`.

### k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563)

Scores computed by `BM25Similarity` are lower than before because the `k1+1`
constant factor was removed from the numerator of the scoring formula.
Ordering of results is preserved unless scores are computed from multiple
fields using different similarities. The previous behaviour is now exposed
by the `LegacyBM25Similarity` class, which can be found in the lucene-misc jar.
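The effect of dropping the constant can be checked with plain arithmetic. The sketch below is not Lucene code: the class name, method names and sample numbers are made up for illustration, and `k1 = 1.2`, `b = 0.75` are assumed as the usual BM25 defaults. It compares the two forms of the per-term score and shows that they differ only by the factor `k1 + 1`.

```java
/** Compares the BM25 term score with and without the (k1 + 1) numerator factor. */
final class Bm25FactorSketch {
  // Usual BM25 defaults; treated as assumptions for this illustration.
  static final double K1 = 1.2;
  static final double B = 0.75;

  /** Length normalisation term shared by both variants. */
  static double norm(double docLen, double avgDocLen) {
    return K1 * (1 - B + B * docLen / avgDocLen);
  }

  /** Current form: idf * tf / (tf + norm). */
  static double current(double idf, double tf, double docLen, double avgDocLen) {
    return idf * tf / (tf + norm(docLen, avgDocLen));
  }

  /** Previous form: the same expression multiplied by (k1 + 1). */
  static double legacy(double idf, double tf, double docLen, double avgDocLen) {
    return (K1 + 1) * current(idf, tf, docLen, avgDocLen);
  }

  public static void main(String[] args) {
    double idf = 2.0;
    double avgDocLen = 100;
    double[][] docs = {{1, 80}, {3, 120}, {5, 300}}; // {term frequency, document length}
    for (double[] d : docs) {
      double now = current(idf, d[0], d[1], avgDocLen);
      double before = legacy(idf, d[0], d[1], avgDocLen);
      // The ratio is identical for every document, so relative order is unchanged.
      System.out.printf(
          "tf=%.0f len=%.0f current=%.4f legacy=%.4f ratio=%.2f%n",
          d[0], d[1], now, before, before / now);
    }
  }
}
```

Because the ratio is the same for every document, ranking within a single similarity is unaffected. Absolute score thresholds, and sums that mix fields scored by different similarities, are where the change becomes visible.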

### IndexWriter.maxDoc()/numDocs() removed in favor of IndexWriter.getDocStats()

Use `IndexWriter.getDocStats()` instead of `maxDoc()` / `numDocs()`; it offers a consistent
view of document stats. Previously, calling two methods to get point-in-time stats was subject
to concurrent changes.

### maxClausesCount moved from BooleanQuery to IndexSearcher (LUCENE-8811)

`IndexSearcher` now performs max clause count checks on all types of queries (including `BooleanQuery`).
Accordingly, the max clause count setting has moved from `BooleanQuery` to `IndexSearcher`.

### TopDocs.merge shall no longer allow setting of shard indices

`TopDocs.merge()` no longer accepts a parameter indicating whether it should
set shard indices for hits as they are seen during the merge process. This simplifies the API
and makes it more flexible for passing in custom tie breakers.
If shard indices are to be used for tie breaking docs with equal scores during `TopDocs.merge()`, then
the input `ScoreDocs` must have their shard indices set to valid values prior to calling `merge()`.
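To see why the shard indices must be set up front, consider the following sketch. It does not use Lucene classes: `Hit`, `setShardIndices()` and `merge()` are hypothetical stand-ins, and the tie-break order (shard index, then doc id) is an assumption chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ShardTieBreakSketch {
  /** Hypothetical stand-in for a hit; -1 marks an unset shard index. */
  static final class Hit {
    final int doc;
    final float score;
    int shardIndex = -1;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /** Caller-side step: tag every hit with the index of the shard it came from. */
  static void setShardIndices(List<List<Hit>> shards) {
    for (int i = 0; i < shards.size(); i++) {
      for (Hit h : shards.get(i)) {
        h.shardIndex = i;
      }
    }
  }

  /** Merges by descending score, breaking ties by shard index, then doc id. */
  static List<Hit> merge(List<List<Hit>> shards, int topN) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> shard : shards) {
      all.addAll(shard);
    }
    all.sort(
        Comparator.comparingDouble((Hit h) -> -h.score)
            .thenComparingInt(h -> h.shardIndex)
            .thenComparingInt(h -> h.doc));
    return all.subList(0, Math.min(topN, all.size()));
  }

  public static void main(String[] args) {
    List<Hit> shard0 = Arrays.asList(new Hit(7, 1.0f));
    List<Hit> shard1 = Arrays.asList(new Hit(3, 1.0f), new Hit(9, 2.0f));
    List<List<Hit>> shards = Arrays.asList(shard0, shard1);
    setShardIndices(shards); // must happen before merging
    for (Hit h : merge(shards, 3)) {
      System.out.println("shard=" + h.shardIndex + " doc=" + h.doc + " score=" + h.score);
    }
  }
}
```

If `setShardIndices()` were skipped in this sketch, every hit would carry the same unset value and the shard-based tie break would have nothing to distinguish equal-scoring hits from different shards.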

### TopDocsCollector shall throw IllegalArgumentException for malformed arguments

`TopDocsCollector` shall no longer return an empty `TopDocs` for malformed arguments.
Rather, an `IllegalArgumentException` shall be thrown. This makes the API more
defensive and ensures that errors do not bubble up when Lucene is
used in multi-level applications.

### Assumption of data consistency between different data-structures sharing the same field name

Sorting on a numeric field that is indexed with both doc values and points may use an
optimization to skip non-competitive documents. This optimization relies on the assumption
that the same data is stored in these points and doc values.

### Require consistency between data-structures on a per-field basis

The per-field data-structures are implicitly defined by the first document
indexed that contains a certain field. Once defined, the per-field
data-structures are not changeable for the whole index. For example, if you
first index a document where a certain field is indexed with doc values and
points, all subsequent documents containing this field must also have this
field indexed with only doc values and points.

This also means that an index created in the previous version that doesn't
satisfy this requirement cannot be updated.
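The rule can be pictured as a schema that is fixed by the first document containing a field. The sketch below is only a model of that rule, not Lucene's implementation: the `Structure` enum, the `FieldSchemaSketch` class and the choice of `IllegalArgumentException` are all assumptions made for illustration.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class FieldSchemaSketch {
  /** Hypothetical labels for per-field data structures. */
  enum Structure { POSTINGS, DOC_VALUES, POINTS, VECTORS }

  private final Map<String, Set<Structure>> schema = new HashMap<>();

  /** The first call for a field fixes its structures; later calls must match exactly. */
  void addField(String name, EnumSet<Structure> structures) {
    Set<Structure> existing = schema.putIfAbsent(name, EnumSet.copyOf(structures));
    if (existing != null && !existing.equals(structures)) {
      throw new IllegalArgumentException(
          "field \"" + name + "\" was defined with " + existing + " but got " + structures);
    }
  }

  public static void main(String[] args) {
    FieldSchemaSketch index = new FieldSchemaSketch();
    // First document defines "price" as doc values + points.
    index.addField("price", EnumSet.of(Structure.DOC_VALUES, Structure.POINTS));
    // Same structures again: accepted.
    index.addField("price", EnumSet.of(Structure.DOC_VALUES, Structure.POINTS));
    try {
      // Doc values only: rejected, because it differs from the first definition.
      index.addField("price", EnumSet.of(Structure.DOC_VALUES));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```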

### Doc values updates are allowed only for doc values only fields

Previously, `IndexWriter` could update doc values for a binary or numeric docValue
field that was also indexed with other data structures (e.g. postings, vectors,
etc.). This is not allowed anymore. A field must be indexed with only doc values
to be eligible for doc values updates in `IndexWriter`.

### SortedDocValues no longer extends BinaryDocValues (LUCENE-9796)

`SortedDocValues` no longer extends `BinaryDocValues`: `SortedDocValues` do not have a per-document
binary value, they have a per-document numeric `ordValue()`. The ordinal can then be dereferenced
to its binary form with `lookupOrd()`, but it was a performance trap to implement a `binaryValue()`
on the `SortedDocValues` API that does this behind the scenes on every document.

You can replace calls of `binaryValue()` with `lookupOrd(ordValue())` as a "quick fix", but it is
better to use the ordinal alone (integer-based data structures) for per-document access, and only
call `lookupOrd()` a few times at the end (e.g. for the hits you want to display). Otherwise, if you
really don't want per-document ordinals, but instead a per-document `byte[]`, use a `BinaryDocValues`
field.
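The recommended access pattern, integers per document and a lookup only at the end, can be sketched without any Lucene classes. In the example below the `DICTIONARY` array and the static `lookupOrd()` helper are hypothetical stand-ins for the sorted term dictionary and for `SortedDocValues.lookupOrd()`, and the `int[]` of ordinals stands in for the per-document `ordValue()`.

```java
import java.util.Arrays;

final class OrdinalCountingSketch {
  /** Hypothetical stand-in for the sorted term dictionary; array index = ordinal. */
  static final String[] DICTIONARY = {"apple", "banana", "cherry"};

  /** Stand-in for lookupOrd(): dereferences an ordinal to its term. */
  static String lookupOrd(int ord) {
    return DICTIONARY[ord];
  }

  /** Counts documents per ordinal using integers only; no term is materialised here. */
  static int[] countByOrdinal(int[] ordPerDoc) {
    int[] counts = new int[DICTIONARY.length];
    for (int ord : ordPerDoc) {
      counts[ord]++;
    }
    return counts;
  }

  /** Resolves only the winning ordinal to its term, once, at the end. */
  static String mostFrequentTerm(int[] ordPerDoc) {
    int[] counts = countByOrdinal(ordPerDoc);
    int best = 0;
    for (int ord = 1; ord < counts.length; ord++) {
      if (counts[ord] > counts[best]) {
        best = ord;
      }
    }
    return lookupOrd(best);
  }

  public static void main(String[] args) {
    int[] ordPerDoc = {2, 0, 2, 1, 2, 0}; // one ordinal per document
    System.out.println(Arrays.toString(countByOrdinal(ordPerDoc))); // [2, 1, 3]
    System.out.println(mostFrequentTerm(ordPerDoc)); // cherry
  }
}
```

The per-document loop touches only `int` values; the single `lookupOrd()` call happens after the loop. The "quick fix" of `lookupOrd(ordValue())` on every document would instead perform one dictionary lookup per document.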

### Removed CodecReader.ramBytesUsed() (LUCENE-9387)

Lucene index readers are now using so little memory with the default codec that
it was decided to remove the ability to estimate their RAM usage.

### LongValueFacetCounts no longer accepts multiValued param in constructors (LUCENE-9948)

`LongValueFacetCounts` will now automatically detect whether an indexed field is single- or
multi-valued. The user no longer needs to provide this information to the constructors. Migrating should
be as simple as no longer providing this boolean.

### SpanQuery and subclasses have moved from core/ to the queries module

They can now be found in the `org.apache.lucene.queries.spans` package.

### SpanBoostQuery has been removed (LUCENE-8143)

`SpanBoostQuery` was a no-op unless used at the top level of a `SpanQuery` nested
structure. Use a standard `BoostQuery` here instead.

### Sort is immutable (LUCENE-9325)

Rather than using `setSort()` to change sort values, you should instead create
a new `Sort` instance with the new values.

### Taxonomy-based faceting uses more modern encodings (LUCENE-9450, LUCENE-10062, LUCENE-10122)

The side-car taxonomy index now uses doc values for ord-to-path lookup (LUCENE-9450) and parent
lookup (LUCENE-10122) instead of stored fields and positions (respectively). Document ordinals
are now encoded with `SortedNumericDocValues` instead of using a custom (v-int) binary format.
Performance gains have been observed with these encoding changes. These changes were introduced
in 9.0, and 9.x releases remain backwards-compatible with 8.x indexes, but starting with 10.0,
only the newer formats are supported. Users will need to create a new index with all their
documents using 9.0 or later to pick up the new format and remain compatible with 10.x releases.
Just re-adding documents to an existing index is not enough to pick up the changes as the
format will "stick" to whatever version was used to initially create the index.

Additionally, `OrdinalsReader` (and sub-classes) are fully removed starting with 10.0. These
classes were `@Deprecated` starting with 9.0. Users are encouraged to rely on the default
taxonomy facet encodings where possible. If custom formats are needed, users will need
to manage the indexed data on their own and create new `Facet` implementations to use it.