<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

# Designing file formats

## Use little JVM heap

Lucene generally prefers to avoid loading gigabytes of data into the JVM heap.
Could this data be stored in a file and accessed using an
`org.apache.lucene.store.RandomAccessInput` instead?

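As a rough, JDK-only illustration of the idea (this is not Lucene code): the
sketch below stores fixed-width `long` values in a file and fetches a single
value with a positional read, so only eight bytes are on heap per lookup.
`OffHeapLongs` and its methods are hypothetical names invented for this
example. Inside a codec, the same access pattern would go through
`RandomAccessInput`, for instance its `readLong(long pos)` method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: fixed-width longs stored in a file and read in place. */
final class OffHeapLongs {

  /** Writes each value as 8 big-endian bytes. Buffered in memory only to keep the demo short. */
  static void write(Path file, long[] values) throws IOException {
    ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long value : values) {
      buffer.putLong(value);
    }
    Files.write(file, buffer.array());
  }

  /** Reads the value at the given index without loading the rest of the file. */
  static long get(FileChannel channel, long index) throws IOException {
    ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES);
    long position = index * Long.BYTES;
    while (buffer.hasRemaining()) {
      // Positional read: does not move the channel's own position.
      int read = channel.read(buffer, position + buffer.position());
      if (read < 0) {
        throw new IOException("read past EOF at index " + index);
      }
    }
    buffer.flip();
    return buffer.getLong();
  }

  /** Round-trips the values through a temporary file and returns values[index] as read from disk. */
  static long roundTrip(long[] values, int index) {
    try {
      Path file = Files.createTempFile("longs", ".bin");
      try {
        write(file, values);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
          return get(channel, index);
        }
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The heap cost of `get` is independent of the file size, which is the property
this section asks for.
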
## Avoid options

One of the hardest problems with file formats is maintaining backward
compatibility. Avoid giving options to the user, and instead let the file
format make decisions based on the information it has. If an expert user wants
to optimize for a specific case, they can write a custom codec and maintain it
on their own.

## How to split the data into files?

Most file formats split the data into 3 files:
 - metadata,
 - index data,
 - raw data.

The metadata file contains all the data that is read once at open time. This
helps on several fronts:
 - One can validate the checksums of this data at open time without significant
   overhead, since all of it needs to be read anyway. This helps detect
   corruption early.
 - There is no need to perform expensive seeks into the index/raw data files at
   open time: one can create slices into these files from offsets that have
   been written into the metadata file.

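The second point can be sketched without any Lucene dependency. The example
below assumes a hypothetical `FieldMeta` entry that records an offset and a
length in the metadata file, and uses `ByteBuffer` views to stand in for file
slices. In Lucene, the corresponding call would be
`IndexInput#slice(String, long, long)`.

```java
import java.nio.ByteBuffer;

/** Hypothetical metadata entry: where one field's bytes live in the data file. */
final class FieldMeta {
  static final int BYTES = 2 * Long.BYTES;

  final long offset;
  final long length;

  FieldMeta(long offset, long length) {
    this.offset = offset;
    this.length = length;
  }

  /** Appends this entry to the metadata buffer. */
  void writeTo(ByteBuffer meta) {
    meta.putLong(offset);
    meta.putLong(length);
  }

  /** Reads one entry; this is the only work needed at open time. */
  static FieldMeta readFrom(ByteBuffer meta) {
    long offset = meta.getLong();
    long length = meta.getLong();
    return new FieldMeta(offset, length);
  }

  /** Creates a view over this field's bytes without reading or copying them. */
  ByteBuffer slice(ByteBuffer dataFile) {
    ByteBuffer view = dataFile.duplicate(); // leaves dataFile's position untouched
    view.position((int) offset);
    view.limit((int) (offset + length));
    return view.slice();
  }
}
```

Opening a reader then only touches the metadata bytes. The data file is not
read until a caller actually consumes a slice.
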
The index file contains data structures that help search the raw data. For KD
trees, this would be the inner nodes; for doc values, the jump tables; for KNN
vectors, the HNSW graph structure; for terms, the FST that stores term
prefixes; etc. Keeping it in a separate file from the data file enables users
to do things like `MMapDirectory#setPreload(boolean)` on these files, which are
generally rather small and accessed randomly. A separate file is also
convenient at times because index and raw data can then be written on the fly,
without buffering all index data in memory.

The raw file contains the data that needs to be retrieved.

Some file formats are simpler, e.g. the compound file format's index is so
small that it can be loaded fully into memory at open time. It thus becomes
read-once and can be stored in the same file as the metadata.

Some file formats are more complex, e.g. postings have multiple types of data
(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so
they use multiple data files in order to avoid reading lots of useless data.

## Don't use too many files

The maximum number of file descriptors is usually limited. It's ok to use
multiple files per segment as described above, but this number should always be
small. For instance, it would be bad practice to use a different file per
field.

## Add codec headers and footers to all files

Use `CodecUtil` to add headers and footers to all files of the index. This
helps make sure that we are opening the right file and differentiate Lucene
bugs from file corruptions.

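To make the purpose of headers and footers concrete, here is a minimal
JDK-only sketch. It is not Lucene's on-disk layout and does not use
`CodecUtil`: the class `FramedFile`, its magic values and its `Status` enum are
all made up for illustration. The header identifies the format and version,
and the footer carries a CRC32 of everything written before it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical framing: [magic, format name, version] body [footer magic, CRC32]. */
final class FramedFile {
  enum Status { OK, WRONG_FILE, UNSUPPORTED_VERSION, CORRUPT }

  static final int HEADER_MAGIC = 0xCAFE0001; // made-up value
  static final int FOOTER_MAGIC = ~HEADER_MAGIC;
  static final int FOOTER_LENGTH = Integer.BYTES + Long.BYTES;

  static byte[] write(String format, int version, byte[] body) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(HEADER_MAGIC);
      out.writeUTF(format);
      out.writeInt(version);
      out.write(body);
      out.writeInt(FOOTER_MAGIC);
      // The checksum covers every byte written so far, footer magic included.
      CRC32 crc = new CRC32();
      crc.update(bytes.toByteArray());
      out.writeLong(crc.getValue());
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static Status check(byte[] file, String format, int minVersion, int maxVersion) {
    if (file.length < FOOTER_LENGTH) {
      return Status.CORRUPT; // too short to even hold a footer
    }
    // Footer: a bad magic or checksum means the bytes are not what was written.
    ByteBuffer footer = ByteBuffer.wrap(file, file.length - FOOTER_LENGTH, FOOTER_LENGTH);
    int footerMagic = footer.getInt();
    long expectedChecksum = footer.getLong();
    CRC32 crc = new CRC32();
    crc.update(file, 0, file.length - Long.BYTES);
    if (footerMagic != FOOTER_MAGIC || crc.getValue() != expectedChecksum) {
      return Status.CORRUPT;
    }
    // Header: the bytes are intact, so a mismatch here is a wrong file or version.
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
      if (in.readInt() != HEADER_MAGIC || !in.readUTF().equals(format)) {
        return Status.WRONG_FILE;
      }
      int version = in.readInt();
      return version < minVersion || version > maxVersion
          ? Status.UNSUPPORTED_VERSION
          : Status.OK;
    } catch (IOException e) {
      return Status.CORRUPT;
    }
  }
}
```

Because the checksum is verified first, a later parsing failure on a file that
passed the check points at a bug in the reader or writer rather than at damaged
bytes. This is the distinction the section is after.
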
## Validate checksums of the metadata file when opening a segment

If data has been organized in such a way that the metadata file only contains
read-once data, then verifying checksums is very cheap and helps detect
corruption early. It also lets us give users a meaningful error message that
tells them that their index is corrupt, rather than a confusing exception
saying that Lucene tried to read data beyond the end of the file or something
similar.

## Validate structures of other files when opening a segment

One of the most frequent cases of index corruption that we have observed over
the years is file truncation. Verifying that index files have the expected
codec header and a correct structure for the codec footer when opening a
segment helps detect a significant spectrum of cases of corruption.

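A structural footer check is cheap because it only looks at the last few bytes
of a file, whatever its size. The sketch below is a JDK-only illustration that
assumes a 16-byte footer made of a magic `int`, an algorithm id `int` and a
CRC32 stored in a `long`. The class name and the magic value are hypothetical,
and the stored checksum is returned without being verified against the file
contents.

```java
import java.nio.ByteBuffer;

/** Hypothetical constant-time footer check; the layout is an assumption of this sketch. */
final class FooterStructure {
  static final int FOOTER_MAGIC = 0xC0DEF007; // made-up value
  static final int FOOTER_LENGTH = 2 * Integer.BYTES + Long.BYTES;

  /** Returns the stored checksum, or -1 if the footer is missing or malformed. */
  static long retrieveChecksum(ByteBuffer file) {
    if (file.limit() < FOOTER_LENGTH) {
      return -1; // shorter than a footer: certainly truncated
    }
    int start = file.limit() - FOOTER_LENGTH;
    int magic = file.getInt(start);
    int algorithm = file.getInt(start + Integer.BYTES);
    long checksum = file.getLong(start + 2 * Integer.BYTES);
    // A CRC32 fits in 32 bits, so the upper half of the long must be zero.
    boolean valid =
        magic == FOOTER_MAGIC && algorithm == 0 && (checksum & 0xFFFFFFFF00000000L) == 0;
    return valid ? checksum : -1;
  }
}
```

A truncated file almost never ends with a well-formed footer, so this check
catches truncation without reading the body of the file.
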
## Do as many consistency checks as reasonable

It is common for some data to be redundant, e.g. data from the metadata file
might be redundant with information from `FieldInfos`, or all files from the
same file format should have the same version in their codec header. Checking
that these redundant pieces of information are consistent is always a good
idea, as it makes cases of corruption much easier to debug.

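The two examples mentioned above could look roughly like the following. This
is a hedged sketch: `ConsistencyChecks` and its method names are hypothetical,
and it throws `IllegalStateException` only to stay JDK-only, where Lucene code
would typically throw `CorruptIndexException`.

```java
import java.util.Set;

/** Hypothetical open-time checks; names and messages are illustrative only. */
final class ConsistencyChecks {

  /** All files written by one format instance should carry the same version. */
  static void checkSameVersion(int metaVersion, int indexVersion, int dataVersion) {
    if (metaVersion != indexVersion || metaVersion != dataVersion) {
      throw new IllegalStateException(
          "version mismatch: meta=" + metaVersion
              + ", index=" + indexVersion
              + ", data=" + dataVersion);
    }
  }

  /** A field number read from the metadata file must be known to the segment. */
  static void checkKnownField(int fieldNumber, Set<Integer> segmentFieldNumbers) {
    if (!segmentFieldNumbers.contains(fieldNumber)) {
      throw new IllegalStateException("unknown field number in metadata: " + fieldNumber);
    }
  }
}
```

Each check costs a comparison or a set lookup, and the error message names the
exact values that disagree.
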
## Make sure to not leak files

Be paranoid regarding where exceptions might be thrown and make sure that files
are closed on all paths. E.g. imagine that opening the data file fails while
the index file is already open: make sure that the index file also gets closed
in that case. Lucene has tests that randomly throw exceptions when interacting
with the `Directory` in order to detect such bugs, but it might take many runs
before randomization hits the exact case that triggers a bug.

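The index-file/data-file scenario above can be sketched with plain
`Closeable`s. `TwoFileReader` and its `Opener` interface are hypothetical
stand-ins for a format reader and a `Directory`. In Lucene code, the cleanup
branch is commonly written with `IOUtils.closeWhileHandlingException`.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical reader that owns two files and never leaks one if opening the other fails. */
final class TwoFileReader implements Closeable {

  /** Stand-in for whatever opens files, e.g. a directory abstraction. */
  interface Opener {
    Closeable open(String name) throws IOException;
  }

  private final Closeable index;
  private final Closeable data;

  TwoFileReader(Opener opener) throws IOException {
    Closeable index = null;
    Closeable data = null;
    boolean success = false;
    try {
      index = opener.open("index");
      data = opener.open("data"); // may throw while index is already open
      success = true;
    } finally {
      if (!success) {
        // Close whatever was opened; the original exception keeps propagating.
        closeQuietly(index);
        closeQuietly(data);
      }
    }
    this.index = index;
    this.data = data;
  }

  private static void closeQuietly(Closeable closeable) {
    if (closeable != null) {
      try {
        closeable.close();
      } catch (IOException ignored) {
        // Suppressed so that it does not mask the exception that caused the cleanup.
      }
    }
  }

  @Override
  public void close() throws IOException {
    try {
      index.close();
    } finally {
      data.close(); // runs even if closing the index file throws
    }
  }
}
```

The `success` flag keeps the cleanup in one place: any exception thrown between
the first `open` and the end of the `try` block closes everything opened so
far.
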
## Verify checksums upon merges

Merges need to read most if not all input data anyway, so make sure to verify
checksums before starting a merge by calling `checkIntegrity()` on the file
format reader in order to make sure that file corruptions don't get propagated
by merges. All default implementations do this.

## How to make backward-compatible changes to file formats?

See the [backward-codecs README](../lucene/backward-codecs/README.md).