<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Designing file formats

## Use little JVM heap

Lucene generally prefers to avoid loading gigabytes of data into the JVM heap.
Could this data be stored in a file and accessed using a
`org.apache.lucene.store.RandomAccessInput` instead?

## Avoid options

One of the hardest problems with file formats is maintaining backward
compatibility. Avoid giving options to the user, and instead let the file
format make decisions based on the information it has. If an expert user wants
to optimize for a specific case, they can write a custom codec and maintain it
on their own.

## How to split the data into files?

Most file formats split the data into 3 files:
 - metadata,
 - index data,
 - raw data.

The metadata file contains all the data that is read once at open time. This
helps on several fronts:
 - One can validate the checksums of this data at open time without significant
   overhead since all data needs to be read anyway, this helps detect
   corruptions early.
 - No need to perform expensive seeks into the index/raw data files at open
   time, one can create slices into these files from offsets that have been
   written into the metadata file.

The index file contains data-structures that help search the raw data. For KD
trees, this would be the inner nodes, for doc values this would be jump tables,
for KNN vectors this would be the HNSW graph structure, for terms this would be
the FST that stores term prefixes, etc. Having it in a separate file from the
data file enables users to do things like `MMapDirectory#setPreload(boolean)`
on these files which are generally rather small and accessed randomly. It is
also convenient at times so that index and raw data can be written on the fly
without buffering all index data into memory.

The raw file contains the data that needs to be retrieved.

Some file formats are simpler, e.g. the compound file format's index is so
small that it can be loaded fully into memory at open time. So it becomes
read-once and can be stored in the same file as metadata.

Some file formats are more complex, e.g. postings have multiple types of data
(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so
they use multiple data files in order not to have to read lots of useless data.

## Don't use too many files

The maximum number of file descriptors is usually not infinite. It's ok to use
multiple files per segment as described above, but this number should always be
small. For instance, it would be a bad practice to use a different file per
field.

## Add codec headers and footers to all files

Use `CodecUtil` to add headers and footers to all files of the index. This
helps make sure that we are opening the right file and helps differentiate
Lucene bugs from file corruptions.
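
To illustrate why headers and footers help, here is a minimal sketch that
frames a payload with a header (magic number, format name, version) and a
footer (magic number, CRC32 checksum), using only the JDK. The class
`FramedFile`, its magic values and its byte layout are hypothetical and are not
Lucene's on-disk format. In Lucene itself, this is the job of `CodecUtil`
methods such as `writeIndexHeader`, `writeFooter`, `checkIndexHeader` and
`checkFooter`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical framing helper; the layout is illustrative, not Lucene's format. */
final class FramedFile {
  static final int HEADER_MAGIC = 0x48454144; // arbitrary value chosen for this sketch
  static final int FOOTER_MAGIC = 0x464F4F54; // arbitrary value chosen for this sketch
  static final int FOOTER_LENGTH = Integer.BYTES + Long.BYTES;

  /** Writes: header magic, format name, version, payload, footer magic, CRC32. */
  static byte[] write(String formatName, int version, byte[] payload) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(HEADER_MAGIC);
    out.writeUTF(formatName);
    out.writeInt(version);
    out.write(payload);
    out.writeInt(FOOTER_MAGIC);
    out.flush();
    // The checksum covers every byte written so far, including the footer magic.
    CRC32 crc = new CRC32();
    crc.update(bytes.toByteArray());
    out.writeLong(crc.getValue());
    out.flush();
    return bytes.toByteArray();
  }

  /** Returns the header version, or throws with a message that names the problem. */
  static int check(byte[] file, String formatName, int minVersion, int maxVersion)
      throws IOException {
    // 1. Footer structure: a misplaced footer magic usually means truncation.
    if (file.length < FOOTER_LENGTH) {
      throw new IOException("file is truncated: too short to hold a footer");
    }
    ByteBuffer footer = ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH);
    if (footer.getInt() != FOOTER_MAGIC) {
      throw new IOException("file is truncated or has a malformed footer");
    }
    // 2. Checksum: detects flipped bits anywhere before the stored checksum.
    long expected = footer.getLong();
    CRC32 crc = new CRC32();
    crc.update(file, 0, file.length - Long.BYTES);
    if (crc.getValue() != expected) {
      throw new IOException("checksum mismatch: file is corrupt");
    }
    // 3. Header: from here on, a failure means wrong file or a bug, not corruption.
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
    if (in.readInt() != HEADER_MAGIC) {
      throw new IOException("bad header magic: not a framed file");
    }
    String actualName = in.readUTF();
    if (!actualName.equals(formatName)) {
      throw new IOException("expected format " + formatName + " but got " + actualName);
    }
    int version = in.readInt();
    if (version < minVersion || version > maxVersion) {
      throw new IOException("unsupported version " + version);
    }
    return version;
  }

  public static void main(String[] args) throws IOException {
    byte[] file = write("Demo", 1, new byte[] {1, 2, 3});
    System.out.println("version = " + check(file, "Demo", 0, 1));
  }
}
```

Because the footer and checksum are checked before the header is interpreted,
a failure in the later steps points at a wrong file name or a bug in the
writer rather than at a damaged file.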

## Validate checksums of the metadata file when opening a segment

If data has been organized in such a way that the metadata file only contains
read-once data, then verifying checksums is very cheap. It helps detect
corruptions early and lets us give users a meaningful error message that tells
them that their index is corrupt, rather than a confusing exception that tells
them that Lucene tried to read data beyond the end of the file or anything
like that.

## Validate structures of other files when opening a segment

One of the most frequent cases of index corruption that we have observed over
the years is file truncation. Verifying that index files have the expected
codec header and a correct structure for the codec footer when opening a
segment helps detect a significant spectrum of cases of corruption.

## Do as many consistency checks as reasonable

It is common for some data to be redundant, e.g. data from the metadata file
might be redundant with information from `FieldInfos`, or all files from the
same file format should have the same version in their codec header. Checking
that these redundant pieces of information are consistent is always a good
idea, as it would make cases of corruption much easier to debug.

## Make sure to not leak files

Be paranoid regarding where exceptions might be thrown and make sure that files
would be closed on all paths. E.g. imagine that opening the data file fails
while the index file is already open: make sure that the index file would also
get closed in that case. Lucene has tests that randomly throw exceptions when
interacting with the `Directory` in order to detect some bugs, but it might
take many runs before randomization triggers the exact case that triggers a
bug.
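
The index/data example above can be sketched with a `success` flag and a
`finally` block. This is a minimal JDK-only illustration: `TwoFileReader` and
`FileOpener` are hypothetical names, and `closeQuietly` is a stand-in for a
helper such as Lucene's `IOUtils.closeWhileHandlingException`.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical reader that owns two files; names are illustrative, not Lucene APIs. */
final class TwoFileReader implements Closeable {
  /** Stand-in for whatever opens a file, e.g. a directory abstraction. */
  interface FileOpener {
    Closeable open(String name) throws IOException;
  }

  private final Closeable index;
  private final Closeable data;

  TwoFileReader(FileOpener opener) throws IOException {
    Closeable indexIn = null;
    Closeable dataIn = null;
    boolean success = false;
    try {
      indexIn = opener.open("index");
      dataIn = opener.open("data"); // may throw while the index file is already open
      success = true;
    } finally {
      if (!success) {
        // Release whatever was opened, without masking the original exception.
        closeQuietly(indexIn, dataIn);
      }
    }
    this.index = indexIn;
    this.data = dataIn;
  }

  private static void closeQuietly(Closeable... resources) {
    for (Closeable resource : resources) {
      if (resource != null) {
        try {
          resource.close();
        } catch (IOException ignored) {
          // Ignored on purpose: the exception that caused the failure matters more.
        }
      }
    }
  }

  @Override
  public void close() throws IOException {
    // try-with-resources closes both files even if the first close() throws.
    try (Closeable i = index; Closeable d = data) {
      // Nothing to do here; the resources are closed when the block exits.
    }
  }

  public static void main(String[] args) throws IOException {
    // An opener whose files do nothing on close, just to exercise the happy path.
    try (TwoFileReader reader = new TwoFileReader(name -> () -> {})) {
      System.out.println("opened both files");
    }
  }
}
```

The same shape extends to any number of files: every handle opened before the
failure is passed to the cleanup call, and `null` entries are skipped.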

## Verify checksums upon merges

Merges need to read most if not all input data anyway, so make sure to verify
checksums before starting a merge by calling `checkIntegrity()` on the file
format reader in order to make sure that file corruptions don't get propagated
by merges. All default implementations do this.

## How to make backward-compatible changes to file formats?

See [here](../lucene/backward-codecs/README.md).