<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Designing file formats

## Use little JVM heap

Lucene generally prefers to avoid loading gigabytes of data into the JVM heap.
Could this data be stored in a file and accessed using a
`org.apache.lucene.store.RandomAccessInput` instead?

## Avoid options

One of the hardest problems with file formats is maintaining backward
compatibility. Avoid giving options to the user, and instead let the file
format make decisions based on the information it has.
If an expert user wants to optimize for a specific case, they can write a
custom codec and maintain it on their own.

## How to split the data into files?

Most file formats split the data into 3 files:
 - metadata,
 - index data,
 - raw data.

The metadata file contains all the data that is read once at open time. This
helps on several fronts:
 - One can validate the checksums of this data at open time without significant
   overhead since all data needs to be read anyway. This helps detect
   corruption early.
 - There is no need to perform expensive seeks into the index/raw data files at
   open time. One can create slices into these files from offsets that have
   been written into the metadata file.

The index file contains data structures that help search the raw data. For KD
trees, this would be the inner nodes, for doc values this would be jump tables,
for KNN vectors this would be the HNSW graph structure, for terms this would be
the FST that stores term prefixes, etc. Having it in a separate file from the
data file enables users to do things like `MMapDirectory#setPreload(boolean)`
on these files, which are generally rather small and accessed randomly. It is
also convenient at times, because index and raw data can then be written on the
fly without buffering all index data into memory.
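
The idea of creating slices from offsets stored in the metadata file can be
sketched as follows. This is an illustration only: plain `ByteBuffer`s stand in
for the metadata and data files, and the class name `SlicedData` and its
metadata layout (a count followed by field number, offset and length entries)
are hypothetical. In Lucene itself one would use `IndexInput#slice` or
`IndexInput#randomAccessSlice` on the opened data file instead.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Illustration only: in-memory buffers stand in for the metadata and data files. */
final class SlicedData {
  private final ByteBuffer data;
  // field number -> {offset, length} into the data buffer
  private final Map<Integer, long[]> ranges = new HashMap<>();

  /**
   * Reads the metadata sequentially, once. Hypothetical layout: an int count,
   * then for each field an int field number, a long offset and a long length.
   */
  SlicedData(ByteBuffer meta, ByteBuffer data) {
    this.data = data;
    ByteBuffer in = meta.duplicate();
    int count = in.getInt();
    for (int i = 0; i < count; i++) {
      int field = in.getInt();
      long offset = in.getLong();
      long length = in.getLong();
      // Validate against the data size so that bad offsets fail at open time.
      if (offset < 0 || length < 0 || offset > data.limit() || length > data.limit() - offset) {
        throw new IllegalStateException("corrupt metadata: range out of bounds for field " + field);
      }
      ranges.put(field, new long[] {offset, length});
    }
  }

  /** Returns a view over the bytes of one field without reading or copying the data. */
  ByteBuffer slice(int field) {
    long[] range = ranges.get(field);
    if (range == null) {
      throw new IllegalArgumentException("unknown field " + field);
    }
    ByteBuffer dup = data.duplicate();
    dup.position((int) range[0]);
    dup.limit((int) (range[0] + range[1]));
    return dup.slice();
  }
}
```

Opening only touches the metadata buffer. The data buffer is not read until a
caller consumes a slice, which is the property the metadata file is meant to
provide.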

The raw file contains the data that needs to be retrieved.

Some file formats are simpler, e.g. the compound file format's index is so
small that it can be loaded fully into memory at open time. So it becomes
read-once and can be stored in the same file as metadata.

Some file formats are more complex, e.g. postings have multiple types of data
(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so
they use multiple data files in order not to have to read lots of useless data.

## Don't use too many files

The maximum number of file descriptors is usually not infinite. It's ok to use
multiple files per segment as described above, but this number should always be
small. For instance, it would be a bad practice to use a different file per
field.

## Add codec headers and footers to all files

Use `CodecUtil` to add headers and footers to all files of the index. This
helps make sure that we are opening the right file and helps differentiate
Lucene bugs from file corruption.
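
To make the purpose of headers and footers concrete, here is a minimal sketch
of the idea using only the JDK. It is not Lucene's on-disk layout and does not
use `CodecUtil`: the class `FramedFile`, its magic values and its byte layout
are all hypothetical. Real file formats should call the `CodecUtil` helpers
(for instance `writeIndexHeader`, `writeFooter`, `checkIndexHeader` and
`checkFooter`) rather than reimplementing this.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Illustration only: a simplified header/footer scheme, not Lucene's actual format. */
final class FramedFile {
  static final int HEADER_MAGIC = 0x48454144; // hypothetical value
  static final int FOOTER_MAGIC = 0x464F4F54; // hypothetical value
  static final int FOOTER_LENGTH = Integer.BYTES + Long.BYTES; // magic + checksum
  static final int MIN_LENGTH = 3 * Integer.BYTES + FOOTER_LENGTH; // empty name and payload

  /** Layout: header magic, name length, name, version, payload, footer magic, CRC32. */
  static byte[] write(String codecName, int version, byte[] payload) {
    byte[] name = codecName.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(MIN_LENGTH + name.length + payload.length);
    buf.putInt(HEADER_MAGIC).putInt(name.length).put(name).putInt(version);
    buf.put(payload);
    buf.putInt(FOOTER_MAGIC);
    CRC32 crc = new CRC32();
    crc.update(buf.array(), 0, buf.position()); // everything written so far
    buf.putLong(crc.getValue());
    return buf.array();
  }

  /** Validates header, footer and checksum, then returns the payload. */
  static byte[] read(byte[] file, String expectedCodec, int minVersion, int maxVersion) {
    if (file.length < MIN_LENGTH) {
      throw new IllegalStateException("file too short for a header and footer: truncated?");
    }
    ByteBuffer buf = ByteBuffer.wrap(file);
    if (buf.getInt() != HEADER_MAGIC) {
      throw new IllegalStateException("header magic mismatch: this is not the right kind of file");
    }
    int nameLength = buf.getInt();
    if (nameLength < 0 || nameLength > file.length - MIN_LENGTH) {
      throw new IllegalStateException("invalid codec name length: " + nameLength);
    }
    byte[] name = new byte[nameLength];
    buf.get(name);
    String codec = new String(name, StandardCharsets.UTF_8);
    if (!codec.equals(expectedCodec)) {
      throw new IllegalStateException("expected codec " + expectedCodec + " but got " + codec);
    }
    int version = buf.getInt();
    if (version < minVersion || version > maxVersion) {
      throw new IllegalStateException("unsupported version: " + version);
    }
    int payloadStart = buf.position();
    int footerStart = file.length - FOOTER_LENGTH;
    if (buf.getInt(footerStart) != FOOTER_MAGIC) {
      throw new IllegalStateException("footer magic mismatch: truncated file?");
    }
    CRC32 crc = new CRC32();
    crc.update(file, 0, file.length - Long.BYTES);
    if (crc.getValue() != buf.getLong(file.length - Long.BYTES)) {
      throw new IllegalStateException("checksum mismatch: file is corrupt");
    }
    return Arrays.copyOfRange(file, payloadStart, footerStart);
  }
}
```

Each failure mode gets its own message: a wrong header points at opening the
wrong file, a bad footer or checksum points at corruption, and a payload that
passes all checks but still decodes incorrectly points at a bug in the code.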

## Validate checksums of the metadata file when opening a segment

If data has been organized in such a way that the metadata file only contains
read-once data, then verifying checksums is very cheap and helps detect
corruption early. It also lets us give users a meaningful error message that
tells them that their index is corrupt, rather than a confusing exception that
tells them that Lucene tried to read data beyond the end of the file or
anything like that.

## Validate structures of other files when opening a segment

One of the most frequent cases of index corruption that we have observed over
the years is file truncation. Verifying that index files have the expected
codec header and a correct structure for the codec footer when opening a
segment helps detect a significant spectrum of cases of corruption.

## Do as many consistency checks as reasonable

It is common for some data to be redundant, e.g. data from the metadata file
might be redundant with information from `FieldInfos`, or all files from the
same file format should have the same version in their codec header. Checking
that these redundant pieces of information are consistent is always a good
idea, as it makes cases of corruption much easier to debug.
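
As a rough sketch of what such checks can look like, the helper below compares
redundant values and fails with a message that names both sides. The class
`ConsistencyChecks` and its method names are hypothetical, and it throws
`IllegalStateException` only to stay free of dependencies. Lucene code would
typically report these cases with `CorruptIndexException`.

```java
import java.util.Map;

/** Illustration only: hypothetical helpers for cross-checking redundant information. */
final class ConsistencyChecks {

  /**
   * Checks that all files of one format report the same header version.
   *
   * @param versionByFile version read from the codec header of each file, keyed by file name
   * @return the shared version
   */
  static int checkSameVersion(Map<String, Integer> versionByFile) {
    String firstFile = null;
    int expected = 0;
    for (Map.Entry<String, Integer> entry : versionByFile.entrySet()) {
      if (firstFile == null) {
        firstFile = entry.getKey();
        expected = entry.getValue();
      } else if (entry.getValue() != expected) {
        throw new IllegalStateException(
            "version mismatch: " + firstFile + " has version " + expected
                + " but " + entry.getKey() + " has version " + entry.getValue());
      }
    }
    if (firstFile == null) {
      throw new IllegalArgumentException("no files to check");
    }
    return expected;
  }

  /** Checks that a value stored in the metadata file agrees with the one from FieldInfos. */
  static void checkMatches(String what, String field, int fromMetadata, int fromFieldInfos) {
    if (fromMetadata != fromFieldInfos) {
      throw new IllegalStateException(
          what + " mismatch for field " + field + ": metadata says " + fromMetadata
              + " but FieldInfos says " + fromFieldInfos);
    }
  }
}
```

The value of these checks is mostly in the message: an error that names the
two files or the field involved is far easier to act on than a failure that
surfaces later while decoding.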

## Make sure to not leak files

Be paranoid regarding where exceptions might be thrown and make sure that files
would be closed on all paths. E.g. imagine that opening the data file fails
while the index file is already open: make sure that the index file would also
get closed in that case. Lucene has tests that randomly throw exceptions when
interacting with the `Directory` in order to detect such bugs, but it might
take many runs before randomization hits the exact case that triggers a bug.

## Verify checksums upon merges

Merges need to read most if not all input data anyway, so make sure to verify
checksums before starting a merge by calling `checkIntegrity()` on the file
format reader in order to make sure that file corruptions don't get propagated
by merges. All default implementations do this.

## How to make backward-compatible changes to file formats?

See [here](../lucene/backward-codecs/README.md).