xref: /JGit/Documentation/technical/reftable.md (revision c217d33ff89bc39b084e4269cf0255a5d4a9ae93)
1b9e818b5SShawn Pearce# reftable
2b9e818b5SShawn Pearce
3b9e818b5SShawn Pearce[TOC]
4b9e818b5SShawn Pearce
5b9e818b5SShawn Pearce## Overview
6b9e818b5SShawn Pearce
7b9e818b5SShawn Pearce### Problem statement
8b9e818b5SShawn Pearce
9b9e818b5SShawn PearceSome repositories contain a lot of references (e.g.  android at 866k,
10b9e818b5SShawn Pearcerails at 31k).  The existing packed-refs format takes up a lot of
11b9e818b5SShawn Pearcespace (e.g.  62M), and does not scale with additional references.
12b9e818b5SShawn PearceLookup of a single reference requires linearly scanning the file.
13b9e818b5SShawn Pearce
14b9e818b5SShawn PearceAtomic pushes modifying multiple references require copying the
15b9e818b5SShawn Pearceentire packed-refs file, which can be a considerable amount of data
16b9e818b5SShawn Pearcemoved (e.g. 62M in, 62M out) for even small transactions (2 refs
17b9e818b5SShawn Pearcemodified).
18b9e818b5SShawn Pearce
19b9e818b5SShawn PearceRepositories with many loose references occupy a large number of disk
20b9e818b5SShawn Pearceblocks from the local file system, as each reference is its own file
21b9e818b5SShawn Pearcestoring 41 bytes (and another file for the corresponding reflog).
22b9e818b5SShawn PearceThis negatively affects the number of inodes available when a large
23b9e818b5SShawn Pearcenumber of repositories are stored on the same filesystem.  Readers can
24b9e818b5SShawn Pearcebe penalized due to the larger number of syscalls required to traverse
25b9e818b5SShawn Pearceand read the `$GIT_DIR/refs` directory.
26b9e818b5SShawn Pearce
27b9e818b5SShawn Pearce### Objectives
28b9e818b5SShawn Pearce
29b9e818b5SShawn Pearce- Near constant time lookup for any single reference, even when the
30b9e818b5SShawn Pearce  repository is cold and not in process or kernel cache.
31b9e818b5SShawn Pearce- Near constant time verification if a SHA-1 is referred to by at
32b9e818b5SShawn Pearce  least one reference (for allow-tip-sha1-in-want).
33b9e818b5SShawn Pearce- Efficient lookup of an entire namespace, such as `refs/tags/`.
34b9e818b5SShawn Pearce- Support atomic push with `O(size_of_update)` operations.
35b9e818b5SShawn Pearce- Combine reflog storage with ref storage for small transactions.
36b9e818b5SShawn Pearce- Separate reflog storage for base refs and historical logs.
37b9e818b5SShawn Pearce
38b9e818b5SShawn Pearce### Description
39b9e818b5SShawn Pearce
40b9e818b5SShawn PearceA reftable file is a portable binary file format customized for
41b9e818b5SShawn Pearcereference storage. References are sorted, enabling linear scans,
42b9e818b5SShawn Pearcebinary search lookup, and range scans.
43b9e818b5SShawn Pearce
44b9e818b5SShawn PearceStorage in the file is organized into variable sized blocks.  Prefix
45b9e818b5SShawn Pearcecompression is used within a single block to reduce disk space.  Block
46b9e818b5SShawn Pearcesize and alignment is tunable by the writer.
47b9e818b5SShawn Pearce
48b9e818b5SShawn Pearce### Performance
49b9e818b5SShawn Pearce
50b9e818b5SShawn PearceSpace used, packed-refs vs. reftable:
51b9e818b5SShawn Pearce
52b9e818b5SShawn Pearcerepository | packed-refs | reftable | % original | avg ref  | avg obj
53b9e818b5SShawn Pearce-----------|------------:|---------:|-----------:|---------:|--------:
54b9e818b5SShawn Pearceandroid    |      62.2 M |   36.1 M |     58.0%  | 33 bytes | 5 bytes
55b9e818b5SShawn Pearcerails      |       1.8 M |    1.1 M |     57.7%  | 29 bytes | 4 bytes
56b9e818b5SShawn Pearcegit        |      78.7 K |   48.1 K |     61.0%  | 50 bytes | 4 bytes
57b9e818b5SShawn Pearcegit (heads)|       332 b |    269 b |     81.0%  | 33 bytes | 0 bytes
58b9e818b5SShawn Pearce
59b9e818b5SShawn PearceScan (read 866k refs), by reference name lookup (single ref from 866k
60b9e818b5SShawn Pearcerefs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
61b9e818b5SShawn Pearce
62b9e818b5SShawn Pearceformat      | cache | scan    | by name        | by SHA-1
63b9e818b5SShawn Pearce------------|------:|--------:|---------------:|---------------:
64b9e818b5SShawn Pearcepacked-refs | cold  |  402 ms | 409,660.1 usec | 412,535.8 usec
65b9e818b5SShawn Pearcepacked-refs | hot   |         |   6,844.6 usec |  20,110.1 usec
66b9e818b5SShawn Pearcereftable    | cold  |  112 ms |      33.9 usec |     323.2 usec
67b9e818b5SShawn Pearcereftable    | hot   |         |      20.2 usec |     320.8 usec
68b9e818b5SShawn Pearce
69b9e818b5SShawn PearceSpace used for 149,932 log entries for 43,061 refs,
70b9e818b5SShawn Pearcereflog vs. reftable:
71b9e818b5SShawn Pearce
72b9e818b5SShawn Pearceformat        | size  | avg entry
73b9e818b5SShawn Pearce--------------|------:|-----------:
74b9e818b5SShawn Pearce$GIT_DIR/logs | 173 M | 1209 bytes
75b9e818b5SShawn Pearcereftable      |   5 M |   37 bytes
76b9e818b5SShawn Pearce
77b9e818b5SShawn Pearce## Details
78b9e818b5SShawn Pearce
79b9e818b5SShawn Pearce### Peeling
80b9e818b5SShawn Pearce
81b9e818b5SShawn PearceReferences stored in a reftable are peeled, a record for an annotated
82b9e818b5SShawn Pearce(or signed) tag records both the tag object, and the object it refers
83b9e818b5SShawn Pearceto.
84b9e818b5SShawn Pearce
85b9e818b5SShawn Pearce### Reference name encoding
86b9e818b5SShawn Pearce
87b9e818b5SShawn PearceReference names are an uninterpreted sequence of bytes that must pass
88b9e818b5SShawn Pearce[git-check-ref-format][ref-fmt] as a valid reference name.
89b9e818b5SShawn Pearce
90b9e818b5SShawn Pearce[ref-fmt]: https://git-scm.com/docs/git-check-ref-format
91b9e818b5SShawn Pearce
927c75a68bSHan-Wen Nienhuys### Key unicity
937c75a68bSHan-Wen Nienhuys
947c75a68bSHan-Wen NienhuysEach entry must have a unique key; repeated keys are disallowed.
957c75a68bSHan-Wen Nienhuys
96b9e818b5SShawn Pearce### Network byte order
97b9e818b5SShawn Pearce
98b9e818b5SShawn PearceAll multi-byte, fixed width fields are in network byte order.
99b9e818b5SShawn Pearce
100b9e818b5SShawn Pearce### Ordering
101b9e818b5SShawn Pearce
102b9e818b5SShawn PearceBlocks are lexicographically ordered by their first reference.
103b9e818b5SShawn Pearce
104b9e818b5SShawn Pearce### Directory/file conflicts
105b9e818b5SShawn Pearce
106b9e818b5SShawn PearceThe reftable format accepts both `refs/heads/foo` and
107b9e818b5SShawn Pearce`refs/heads/foo/bar` as distinct references.
108b9e818b5SShawn Pearce
109b9e818b5SShawn PearceThis property is useful for retaining log records in reftable, but may
110b9e818b5SShawn Pearceconfuse versions of Git using `$GIT_DIR/refs` directory tree to
111b9e818b5SShawn Pearcemaintain references.  Users of reftable may choose to continue to
112b9e818b5SShawn Pearcereject `foo` and `foo/bar` type conflicts to prevent problems for
113b9e818b5SShawn Pearcepeers.
114b9e818b5SShawn Pearce
115b9e818b5SShawn Pearce## File format
116b9e818b5SShawn Pearce
117b9e818b5SShawn Pearce### Structure
118b9e818b5SShawn Pearce
119b9e818b5SShawn PearceA reftable file has the following high-level structure:
120b9e818b5SShawn Pearce
121b9e818b5SShawn Pearce    first_block {
122b9e818b5SShawn Pearce      header
123b9e818b5SShawn Pearce      first_ref_block
124b9e818b5SShawn Pearce    }
125b9e818b5SShawn Pearce    ref_block*
126b9e818b5SShawn Pearce    ref_index*
127b9e818b5SShawn Pearce    obj_block*
128b9e818b5SShawn Pearce    obj_index*
129b9e818b5SShawn Pearce    log_block*
130b9e818b5SShawn Pearce    log_index*
131b9e818b5SShawn Pearce    footer
132b9e818b5SShawn Pearce
133b9e818b5SShawn PearceA log-only file omits the `ref_block`, `ref_index`, `obj_block` and
134b9e818b5SShawn Pearce`obj_index` sections, containing only the file header and log block:
135b9e818b5SShawn Pearce
136b9e818b5SShawn Pearce    first_block {
137b9e818b5SShawn Pearce      header
138b9e818b5SShawn Pearce    }
139b9e818b5SShawn Pearce    log_block*
140b9e818b5SShawn Pearce    log_index*
141b9e818b5SShawn Pearce    footer
142b9e818b5SShawn Pearce
143b9e818b5SShawn Pearcein a log-only file the first log block immediately follows the file
144b9e818b5SShawn Pearceheader, without padding to block alignment.
145b9e818b5SShawn Pearce
146b9e818b5SShawn Pearce### Block size
147b9e818b5SShawn Pearce
148b9e818b5SShawn PearceThe file's block size is arbitrarily determined by the writer, and
149b9e818b5SShawn Pearcedoes not have to be a power of 2.  The block size must be larger than
150b9e818b5SShawn Pearcethe longest reference name or log entry used in the repository, as
151b9e818b5SShawn Pearcereferences cannot span blocks.
152b9e818b5SShawn Pearce
153b9e818b5SShawn PearcePowers of two that are friendly to the virtual memory system or
154b9e818b5SShawn Pearcefilesystem (such as 4k or 8k) are recommended.  Larger sizes (64k) can
155b9e818b5SShawn Pearceyield better compression, with a possible increased cost incurred by
156b9e818b5SShawn Pearcereaders during access.
157b9e818b5SShawn Pearce
158b9e818b5SShawn PearceThe largest block size is `16777215` bytes (15.99 MiB).
159b9e818b5SShawn Pearce
160b9e818b5SShawn Pearce### Block alignment
161b9e818b5SShawn Pearce
162b9e818b5SShawn PearceWriters may choose to align blocks at multiples of the block size by
163b9e818b5SShawn Pearceincluding `padding` filled with NUL bytes at the end of a block to
164b9e818b5SShawn Pearceround out to the chosen alignment.  When alignment is used, writers
165b9e818b5SShawn Pearcemust specify the alignment with the file header's `block_size` field.
166b9e818b5SShawn Pearce
167b9e818b5SShawn PearceBlock alignment is not required by the file format.  Unaligned files
168b9e818b5SShawn Pearcemust set `block_size = 0` in the file header, and omit `padding`.
169b9e818b5SShawn PearceUnaligned files with more than one ref block must include the
170b9e818b5SShawn Pearce[ref index](#Ref-index) to support fast lookup.  Readers must be
171b9e818b5SShawn Pearceable to read both aligned and non-aligned files.
172b9e818b5SShawn Pearce
173b9e818b5SShawn PearceVery small files (e.g. 1 only ref block) may omit `padding` and the
174b9e818b5SShawn Pearceref index to reduce total file size.
175b9e818b5SShawn Pearce
176b9e818b5SShawn Pearce### Header
177b9e818b5SShawn Pearce
178b9e818b5SShawn PearceA 24-byte header appears at the beginning of the file:
179b9e818b5SShawn Pearce
180b9e818b5SShawn Pearce    'REFT'
181b9e818b5SShawn Pearce    uint8( version_number = 1 )
182b9e818b5SShawn Pearce    uint24( block_size )
183b9e818b5SShawn Pearce    uint64( min_update_index )
184b9e818b5SShawn Pearce    uint64( max_update_index )
185b9e818b5SShawn Pearce
186b9e818b5SShawn PearceAligned files must specify `block_size` to configure readers with the
187b9e818b5SShawn Pearceexpected block alignment.  Unaligned files must set `block_size = 0`.
188b9e818b5SShawn Pearce
189b9e818b5SShawn PearceThe `min_update_index` and `max_update_index` describe bounds for the
190b9e818b5SShawn Pearce`update_index` field of all log records in this file.  When reftables
191b9e818b5SShawn Pearceare used in a stack for [transactions](#Update-transactions), these
192b9e818b5SShawn Pearcefields can order the files such that the prior file's
193b9e818b5SShawn Pearce`max_update_index + 1` is the next file's `min_update_index`.
194b9e818b5SShawn Pearce
195b9e818b5SShawn Pearce### First ref block
196b9e818b5SShawn Pearce
197b9e818b5SShawn PearceThe first ref block shares the same block as the file header, and is
198b9e818b5SShawn Pearce24 bytes smaller than all other blocks in the file.  The first block
199b9e818b5SShawn Pearceimmediately begins after the file header, at position 24.
200b9e818b5SShawn Pearce
201b9e818b5SShawn PearceIf the first block is a log block (a log-only file), its block header
202b9e818b5SShawn Pearcebegins immediately at position 24.
203b9e818b5SShawn Pearce
204b9e818b5SShawn Pearce### Ref block format
205b9e818b5SShawn Pearce
206b9e818b5SShawn PearceA ref block is written as:
207b9e818b5SShawn Pearce
208b9e818b5SShawn Pearce    'r'
209b9e818b5SShawn Pearce    uint24( block_len )
210b9e818b5SShawn Pearce    ref_record+
211b9e818b5SShawn Pearce    uint24( restart_offset )+
212b9e818b5SShawn Pearce    uint16( restart_count )
213b9e818b5SShawn Pearce
214b9e818b5SShawn Pearce    padding?
215b9e818b5SShawn Pearce
216b9e818b5SShawn PearceBlocks begin with `block_type = 'r'` and a 3-byte `block_len` which
217b9e818b5SShawn Pearceencodes the number of bytes in the block up to, but not including the
218b9e818b5SShawn Pearceoptional `padding`.  This is always less than or equal to the file's
219b9e818b5SShawn Pearceblock size.  In the first ref block, `block_len` includes 24 bytes
220b9e818b5SShawn Pearcefor the file header.
221b9e818b5SShawn Pearce
222b9e818b5SShawn PearceThe 2-byte `restart_count` stores the number of entries in the
223b9e818b5SShawn Pearce`restart_offset` list, which must not be empty.  Readers can use
224b9e818b5SShawn Pearce`restart_count` to binary search between restarts before starting a
225b9e818b5SShawn Pearcelinear scan.
226b9e818b5SShawn Pearce
227b9e818b5SShawn PearceExactly `restart_count` 3-byte `restart_offset` values precedes the
228b9e818b5SShawn Pearce`restart_count`.  Offsets are relative to the start of the block and
229b9e818b5SShawn Pearcerefer to the first byte of any `ref_record` whose name has not been
230b9e818b5SShawn Pearceprefix compressed.  Entries in the `restart_offset` list must be
231b9e818b5SShawn Pearcesorted, ascending.  Readers can start linear scans from any of these
232b9e818b5SShawn Pearcerecords.
233b9e818b5SShawn Pearce
234b9e818b5SShawn PearceA variable number of `ref_record` fill the middle of the block,
235b9e818b5SShawn Pearcedescribing reference names and values.  The format is described below.
236b9e818b5SShawn Pearce
237b9e818b5SShawn PearceAs the first ref block shares the first file block with the file
238b9e818b5SShawn Pearceheader, all `restart_offset` in the first block are relative to the
239b9e818b5SShawn Pearcestart of the file (position 0), and include the file header.  This
240b9e818b5SShawn Pearceforces the first `restart_offset` to be `28`.
241b9e818b5SShawn Pearce
242b9e818b5SShawn Pearce#### ref record
243b9e818b5SShawn Pearce
244b9e818b5SShawn PearceA `ref_record` describes a single reference, storing both the name and
245b9e818b5SShawn Pearceits value(s). Records are formatted as:
246b9e818b5SShawn Pearce
247b9e818b5SShawn Pearce    varint( prefix_length )
248b9e818b5SShawn Pearce    varint( (suffix_length << 3) | value_type )
249b9e818b5SShawn Pearce    suffix
25044a75d9eSShawn Pearce    varint( update_index_delta )
251b9e818b5SShawn Pearce    value?
252b9e818b5SShawn Pearce
253b9e818b5SShawn PearceThe `prefix_length` field specifies how many leading bytes of the
254b9e818b5SShawn Pearceprior reference record's name should be copied to obtain this
255b9e818b5SShawn Pearcereference's name.  This must be 0 for the first reference in any
256b9e818b5SShawn Pearceblock, and also must be 0 for any `ref_record` whose offset is listed
257b9e818b5SShawn Pearcein the `restart_offset` table at the end of the block.
258b9e818b5SShawn Pearce
259b9e818b5SShawn PearceRecovering a reference name from any `ref_record` is a simple concat:
260b9e818b5SShawn Pearce
261b9e818b5SShawn Pearce    this_name = prior_name[0..prefix_length] + suffix
262b9e818b5SShawn Pearce
263b9e818b5SShawn PearceThe `suffix_length` value provides the number of bytes available in
264b9e818b5SShawn Pearce`suffix` to copy from `suffix` to complete the reference name.
265b9e818b5SShawn Pearce
26644a75d9eSShawn PearceThe `update_index` that last modified the reference can be obtained by
26744a75d9eSShawn Pearceadding `update_index_delta` to the `min_update_index` from the file
26844a75d9eSShawn Pearceheader: `min_update_index + update_index_delta`.
26944a75d9eSShawn Pearce
270b9e818b5SShawn PearceThe `value` follows.  Its format is determined by `value_type`, one of
271b9e818b5SShawn Pearcethe following:
272b9e818b5SShawn Pearce
273b9e818b5SShawn Pearce- `0x0`: deletion; no value data (see transactions, below)
274b9e818b5SShawn Pearce- `0x1`: one 20-byte object id; value of the ref
275b9e818b5SShawn Pearce- `0x2`: two 20-byte object ids; value of the ref, peeled target
276b9e818b5SShawn Pearce- `0x3`: symbolic reference: `varint( target_len ) target`
277b9e818b5SShawn Pearce
278b9e818b5SShawn PearceSymbolic references use `0x3`, followed by the complete name of the
279b9e818b5SShawn Pearcereference target.  No compression is applied to the target name.
280b9e818b5SShawn Pearce
281b9e818b5SShawn PearceTypes `0x4..0x7` are reserved for future use.
282b9e818b5SShawn Pearce
283b9e818b5SShawn Pearce### Ref index
284b9e818b5SShawn Pearce
285b9e818b5SShawn PearceThe ref index stores the name of the last reference from every ref
286b9e818b5SShawn Pearceblock in the file, enabling reduced disk seeks for lookups.  Any
287b9e818b5SShawn Pearcereference can be found by searching the index, identifying the
288b9e818b5SShawn Pearcecontaining block, and searching within that block.
289b9e818b5SShawn Pearce
290b9e818b5SShawn PearceThe index may be organized into a multi-level index, where the 1st
291b9e818b5SShawn Pearcelevel index block points to additional ref index blocks (2nd level),
292b9e818b5SShawn Pearcewhich may in turn point to either additional index blocks (e.g. 3rd
293b9e818b5SShawn Pearcelevel) or ref blocks (leaf level).  Disk reads required to access a
294b9e818b5SShawn Pearceref go up with higher index levels.  Multi-level indexes may be
295b9e818b5SShawn Pearcerequired to ensure no single index block exceeds the file format's max
296b9e818b5SShawn Pearceblock size of `16777215` bytes (15.99 MiB).  To acheive constant O(1)
297b9e818b5SShawn Pearcedisk seeks for lookups the index must be a single level, which is
298b9e818b5SShawn Pearcepermitted to exceed the file's configured block size, but not the
299b9e818b5SShawn Pearceformat's max block size of 15.99 MiB.
300b9e818b5SShawn Pearce
301b9e818b5SShawn PearceIf present, the ref index block(s) appears after the last ref block.
302b9e818b5SShawn Pearce
303b9e818b5SShawn PearceIf there are at least 4 ref blocks, a ref index block should be
304b9e818b5SShawn Pearcewritten to improve lookup times.  Cold reads using the index require
305b9e818b5SShawn Pearce2 disk reads (read index, read block), and binary searching < 4 blocks
306b9e818b5SShawn Pearcealso requires <= 2 reads.  Omitting the index block from smaller files
307b9e818b5SShawn Pearcesaves space.
308b9e818b5SShawn Pearce
309b9e818b5SShawn PearceIf the file is unaligned and contains more than one ref block, the ref
310b9e818b5SShawn Pearceindex must be written.
311b9e818b5SShawn Pearce
312b9e818b5SShawn PearceIndex block format:
313b9e818b5SShawn Pearce
314b9e818b5SShawn Pearce    'i'
315b9e818b5SShawn Pearce    uint24( block_len )
316b9e818b5SShawn Pearce    index_record+
317b9e818b5SShawn Pearce    uint24( restart_offset )+
318b9e818b5SShawn Pearce    uint16( restart_count )
319b9e818b5SShawn Pearce
320b9e818b5SShawn Pearce    padding?
321b9e818b5SShawn Pearce
322b9e818b5SShawn PearceThe index blocks begin with `block_type = 'i'` and a 3-byte
323b9e818b5SShawn Pearce`block_len` which encodes the number of bytes in the block,
324b9e818b5SShawn Pearceup to but not including the optional `padding`.
325b9e818b5SShawn Pearce
326b9e818b5SShawn PearceThe `restart_offset` and `restart_count` fields are identical in
327b9e818b5SShawn Pearceformat, meaning and usage as in ref blocks.
328b9e818b5SShawn Pearce
329b9e818b5SShawn PearceTo reduce the number of reads required for random access in very large
330b9e818b5SShawn Pearcefiles the index block may be larger than other blocks.  However,
331b9e818b5SShawn Pearcereaders must hold the entire index in memory to benefit from this, so
332b9e818b5SShawn Pearceit's a time-space tradeoff in both file size and reader memory.
333b9e818b5SShawn Pearce
334b9e818b5SShawn PearceIncreasing the file's block size decreases the index size.
335b9e818b5SShawn PearceAlternatively a multi-level index may be used, keeping index blocks
336b9e818b5SShawn Pearcewithin the file's block size, but increasing the number of blocks
337b9e818b5SShawn Pearcethat need to be accessed.
338b9e818b5SShawn Pearce
339b9e818b5SShawn Pearce#### index record
340b9e818b5SShawn Pearce
341b9e818b5SShawn PearceAn index record describes the last entry in another block.
342b9e818b5SShawn PearceIndex records are written as:
343b9e818b5SShawn Pearce
344b9e818b5SShawn Pearce    varint( prefix_length )
345b9e818b5SShawn Pearce    varint( (suffix_length << 3) | 0 )
346b9e818b5SShawn Pearce    suffix
347b9e818b5SShawn Pearce    varint( block_position )
348b9e818b5SShawn Pearce
349b9e818b5SShawn PearceIndex records use prefix compression exactly like `ref_record`.
350b9e818b5SShawn Pearce
351b9e818b5SShawn PearceIndex records store `block_position` after the suffix, specifying the
352b9e818b5SShawn Pearceabsolute position in bytes (from the start of the file) of the block
353b9e818b5SShawn Pearcethat ends with this reference. Readers can seek to `block_position` to
354b9e818b5SShawn Pearcebegin reading the block header.
355b9e818b5SShawn Pearce
356b9e818b5SShawn PearceReaders must examine the block header at `block_position` to determine
357b9e818b5SShawn Pearceif the next block is another level index block, or the leaf-level ref
358b9e818b5SShawn Pearceblock.
359b9e818b5SShawn Pearce
360b9e818b5SShawn Pearce#### Reading the index
361b9e818b5SShawn Pearce
362b9e818b5SShawn PearceReaders loading the ref index must first read the footer (below) to
363b9e818b5SShawn Pearceobtain `ref_index_position`. If not present, the position will be 0.
364b9e818b5SShawn PearceThe `ref_index_position` is for the 1st level root of the ref index.
365b9e818b5SShawn Pearce
366b9e818b5SShawn Pearce### Obj block format
367b9e818b5SShawn Pearce
368b9e818b5SShawn PearceObject blocks are optional.  Writers may choose to omit object blocks,
369b9e818b5SShawn Pearceespecially if readers will not use the SHA-1 to ref mapping.
370b9e818b5SShawn Pearce
371b9e818b5SShawn PearceObject blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
372b9e818b5SShawn Pearceto ref blocks containing references pointing to that object directly,
373b9e818b5SShawn Pearceor as the peeled value of an annotated tag.  Like ref blocks, object
374b9e818b5SShawn Pearceblocks use the file's standard block size. The abbrevation length is
375b9e818b5SShawn Pearceavailable in the footer as `obj_id_len`.
376b9e818b5SShawn Pearce
377b9e818b5SShawn PearceTo save space in small files, object blocks may be omitted if the ref
378b9e818b5SShawn Pearceindex is not present, as brute force search will only need to read a
379b9e818b5SShawn Pearcefew ref blocks.  When missing, readers should brute force a linear
380b9e818b5SShawn Pearcesearch of all references to lookup by SHA-1.
381b9e818b5SShawn Pearce
382b9e818b5SShawn PearceAn object block is written as:
383b9e818b5SShawn Pearce
384b9e818b5SShawn Pearce    'o'
385b9e818b5SShawn Pearce    uint24( block_len )
386b9e818b5SShawn Pearce    obj_record+
387b9e818b5SShawn Pearce    uint24( restart_offset )+
388b9e818b5SShawn Pearce    uint16( restart_count )
389b9e818b5SShawn Pearce
390b9e818b5SShawn Pearce    padding?
391b9e818b5SShawn Pearce
392b9e818b5SShawn PearceFields are identical to ref block.  Binary search using the restart
393b9e818b5SShawn Pearcetable works the same as in reference blocks.
394b9e818b5SShawn Pearce
395b9e818b5SShawn PearceBecause object identifiers are abbreviated by writers to the shortest
396b9e818b5SShawn Pearceunique abbreviation within the reftable, obj key lengths are variable
397b9e818b5SShawn Pearcebetween 2 and 20 bytes.  Readers must compare only for common prefix
398b9e818b5SShawn Pearcematch within an obj block or obj index.
399b9e818b5SShawn Pearce
400b9e818b5SShawn Pearce#### obj record
401b9e818b5SShawn Pearce
402b9e818b5SShawn PearceAn `obj_record` describes a single object abbreviation, and the blocks
403b9e818b5SShawn Pearcecontaining references using that unique abbreviation:
404b9e818b5SShawn Pearce
405b9e818b5SShawn Pearce    varint( prefix_length )
406b9e818b5SShawn Pearce    varint( (suffix_length << 3) | cnt_3 )
407b9e818b5SShawn Pearce    suffix
408b9e818b5SShawn Pearce    varint( cnt_large )?
409b9e818b5SShawn Pearce    varint( position_delta )*
410b9e818b5SShawn Pearce
411b9e818b5SShawn PearceLike in reference blocks, abbreviations are prefix compressed within
412b9e818b5SShawn Pearcean obj block.  On large reftables with many unique objects, higher
413b9e818b5SShawn Pearceblock sizes (64k), and higher restart interval (128), a
414b9e818b5SShawn Pearce`prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
415b9e818b5SShawn Pearceobj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).
416b9e818b5SShawn Pearce
417b9e818b5SShawn PearceEach record contains `position_count` number of positions for matching
418b9e818b5SShawn Pearceref blocks.  For 1-7 positions the count is stored in `cnt_3`.  When
419b9e818b5SShawn Pearce`cnt_3 = 0` the actual count follows in a varint, `cnt_large`.
420b9e818b5SShawn Pearce
421b9e818b5SShawn PearceThe use of `cnt_3` bets most objects are pointed to by only a single
422b9e818b5SShawn Pearcereference, some may be pointed to by a couple of references, and very
423b9e818b5SShawn Pearcefew (if any) are pointed to by more than 7 references.
424b9e818b5SShawn Pearce
425b9e818b5SShawn PearceA special case exists when `cnt_3 = 0` and `cnt_large = 0`: there
426b9e818b5SShawn Pearceare no `position_delta`, but at least one reference starts with this
427b9e818b5SShawn Pearceabbreviation.  A reader that needs exact reference names must scan all
428b9e818b5SShawn Pearcereferences to find which specific references have the desired object.
429b9e818b5SShawn PearceWriters should use this format when the `position_delta` list would have
430b9e818b5SShawn Pearceoverflowed the file's block size due to a high number of references
431b9e818b5SShawn Pearcepointing to the same object.
432b9e818b5SShawn Pearce
433b9e818b5SShawn PearceThe first `position_delta` is the position from the start of the file.
434b9e818b5SShawn PearceAdditional `position_delta` entries are sorted ascending and relative
435b9e818b5SShawn Pearceto the prior entry, e.g.  a reader would perform:
436b9e818b5SShawn Pearce
437b9e818b5SShawn Pearce    pos = position_delta[0]
438b9e818b5SShawn Pearce    prior = pos
439b9e818b5SShawn Pearce    for (j = 1; j < position_count; j++) {
440b9e818b5SShawn Pearce      pos = prior + position_delta[j]
441b9e818b5SShawn Pearce      prior = pos
442b9e818b5SShawn Pearce    }
443b9e818b5SShawn Pearce
444b9e818b5SShawn PearceWith a position in hand, a reader must linearly scan the ref block,
445b9e818b5SShawn Pearcestarting from the first `ref_record`, testing each reference's SHA-1s
446b9e818b5SShawn Pearce(for `value_type = 0x1` or `0x2`) for full equality.  Faster searching
447b9e818b5SShawn Pearceby SHA-1 within a single ref block is not supported by the reftable
448b9e818b5SShawn Pearceformat.  Smaller block sizes reduce the number of candidates this step
449b9e818b5SShawn Pearcemust consider.
450b9e818b5SShawn Pearce
451b9e818b5SShawn Pearce### Obj index
452b9e818b5SShawn Pearce
453b9e818b5SShawn PearceThe obj index stores the abbreviation from the last entry for every
454b9e818b5SShawn Pearceobj block in the file, enabling reduced disk seeks for all lookups.
455b9e818b5SShawn PearceIt is formatted exactly the same as the ref index, but refers to obj
456b9e818b5SShawn Pearceblocks.
457b9e818b5SShawn Pearce
458b9e818b5SShawn PearceThe obj index should be present if obj blocks are present, as
459b9e818b5SShawn Pearceobj blocks should only be written in larger files.
460b9e818b5SShawn Pearce
461b9e818b5SShawn PearceReaders loading the obj index must first read the footer (below) to
462b9e818b5SShawn Pearceobtain `obj_index_position`.  If not present, the position will be 0.
463b9e818b5SShawn Pearce
464b9e818b5SShawn Pearce### Log block format
465b9e818b5SShawn Pearce
466b9e818b5SShawn PearceUnlike ref and obj blocks, log blocks are always unaligned.
467b9e818b5SShawn Pearce
468b9e818b5SShawn PearceLog blocks are variable in size, and do not match the `block_size`
469b9e818b5SShawn Pearcespecified in the file header or footer.  Writers should choose an
470b9e818b5SShawn Pearceappropriate buffer size to prepare a log block for deflation, such as
471b9e818b5SShawn Pearce`2 * block_size`.
472b9e818b5SShawn Pearce
473b9e818b5SShawn PearceA log block is written as:
474b9e818b5SShawn Pearce
475b9e818b5SShawn Pearce    'g'
476b9e818b5SShawn Pearce    uint24( block_len )
477b9e818b5SShawn Pearce    zlib_deflate {
478b9e818b5SShawn Pearce      log_record+
479b9e818b5SShawn Pearce      uint24( restart_offset )+
480b9e818b5SShawn Pearce      uint16( restart_count )
481b9e818b5SShawn Pearce    }
482b9e818b5SShawn Pearce
483b9e818b5SShawn PearceLog blocks look similar to ref blocks, except `block_type = 'g'`.
484b9e818b5SShawn Pearce
485b9e818b5SShawn PearceThe 4-byte block header is followed by the deflated block contents
486b9e818b5SShawn Pearceusing zlib deflate.  The `block_len` in the header is the inflated
487b9e818b5SShawn Pearcesize (including 4-byte block header), and should be used by readers to
488b9e818b5SShawn Pearcepreallocate the inflation output buffer.  A log block's `block_len`
489b9e818b5SShawn Pearcemay exceed the file's block size.
490b9e818b5SShawn Pearce
491b9e818b5SShawn PearceOffsets within the log block (e.g.  `restart_offset`) still include
492b9e818b5SShawn Pearcethe 4-byte header.  Readers may prefer prefixing the inflation output
493b9e818b5SShawn Pearcebuffer with the 4-byte header.
494b9e818b5SShawn Pearce
495b9e818b5SShawn PearceWithin the deflate container, a variable number of `log_record`
496b9e818b5SShawn Pearcedescribe reference changes.  The log record format is described
497b9e818b5SShawn Pearcebelow.  See ref block format (above) for a description of
498b9e818b5SShawn Pearce`restart_offset` and `restart_count`.
499b9e818b5SShawn Pearce
500b9e818b5SShawn PearceBecause log blocks have no alignment or padding between blocks,
501b9e818b5SShawn Pearcereaders must keep track of the bytes consumed by the inflater to
502b9e818b5SShawn Pearceknow where the next log block begins.
503b9e818b5SShawn Pearce
504b9e818b5SShawn Pearce#### log record
505b9e818b5SShawn Pearce
506b9e818b5SShawn PearceLog record keys are structured as:
507b9e818b5SShawn Pearce
508b9e818b5SShawn Pearce    ref_name '\0' reverse_int64( update_index )
509b9e818b5SShawn Pearce
510b9e818b5SShawn Pearcewhere `update_index` is the unique transaction identifier.  The
511b9e818b5SShawn Pearce`update_index` field must be unique within the scope of a `ref_name`.
512b9e818b5SShawn PearceSee the update transactions section below for further details.
513b9e818b5SShawn Pearce
514b9e818b5SShawn PearceThe `reverse_int64` function inverses the value so lexographical
515b9e818b5SShawn Pearceordering the network byte order encoding sorts the more recent records
516b9e818b5SShawn Pearcewith higher `update_index` values first:
517b9e818b5SShawn Pearce
518b9e818b5SShawn Pearce    reverse_int64(int64 t) {
519b9e818b5SShawn Pearce      return 0xffffffffffffffff - t;
520b9e818b5SShawn Pearce    }
521b9e818b5SShawn Pearce
522b9e818b5SShawn PearceLog records have a similar starting structure to ref and index
523b9e818b5SShawn Pearcerecords, utilizing the same prefix compression scheme applied to the
524b9e818b5SShawn Pearcelog record key described above.
525b9e818b5SShawn Pearce
526b9e818b5SShawn Pearce```
527b9e818b5SShawn Pearce    varint( prefix_length )
528b9e818b5SShawn Pearce    varint( (suffix_length << 3) | log_type )
529b9e818b5SShawn Pearce    suffix
530b9e818b5SShawn Pearce    log_data {
531b9e818b5SShawn Pearce      old_id
532b9e818b5SShawn Pearce      new_id
533b9e818b5SShawn Pearce      varint( name_length    )  name
534b9e818b5SShawn Pearce      varint( email_length   )  email
535b9e818b5SShawn Pearce      varint( time_seconds )
536b9e818b5SShawn Pearce      sint16( tz_offset )
537b9e818b5SShawn Pearce      varint( message_length )  message
538b9e818b5SShawn Pearce    }?
539b9e818b5SShawn Pearce```
540b9e818b5SShawn Pearce
541b9e818b5SShawn PearceLog record entries use `log_type` to indicate what follows:
542b9e818b5SShawn Pearce
543b9e818b5SShawn Pearce- `0x0`: deletion; no log data.
544b9e818b5SShawn Pearce- `0x1`: standard git reflog data using `log_data` above.
545b9e818b5SShawn Pearce
546b9e818b5SShawn PearceThe `log_type = 0x0` is mostly useful for `git stash drop`, removing
547b9e818b5SShawn Pearcean entry from the reflog of `refs/stash` in a transaction file
548b9e818b5SShawn Pearce(below), without needing to rewrite larger files.  Readers reading a
549b9e818b5SShawn Pearcestack of reflogs must treat this as a deletion.
550b9e818b5SShawn Pearce
551b9e818b5SShawn PearceFor `log_type = 0x1`, the `log_data` section follows
552b9e818b5SShawn Pearce[git update-ref][update-ref] logging, and includes:
553b9e818b5SShawn Pearce
554b9e818b5SShawn Pearce- two 20-byte SHA-1s (old id, new id)
555b9e818b5SShawn Pearce- varint string of committer's name
556b9e818b5SShawn Pearce- varint string of committer's email
557b9e818b5SShawn Pearce- varint time in seconds since epoch (Jan 1, 1970)
558b9e818b5SShawn Pearce- 2-byte timezone offset in minutes (signed)
559b9e818b5SShawn Pearce- varint string of message
560b9e818b5SShawn Pearce
561b9e818b5SShawn Pearce`tz_offset` is the absolute number of minutes from GMT the committer
562b9e818b5SShawn Pearcewas at the time of the update.  For example `GMT-0800` is encoded in
563b9e818b5SShawn Pearcereftable as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
564b9e818b5SShawn Pearce
565b9e818b5SShawn PearceThe committer email does not contain `<` or `>`, it's the value
566b9e818b5SShawn Pearcenormally found between the `<>` in a git commit object header.
567b9e818b5SShawn Pearce
568b9e818b5SShawn PearceThe `message_length` may be 0, in which case there was no message
569b9e818b5SShawn Pearcesupplied for the update.
570b9e818b5SShawn Pearce
571b9e818b5SShawn Pearce[update-ref]: https://git-scm.com/docs/git-update-ref#_logging_updates
572b9e818b5SShawn Pearce
573c517725bSHan-Wen NienhuysContrary to traditional reflog (which is a file), renames are encoded as a
574c517725bSHan-Wen Nienhuyscombination of ref deletion and ref creation.
575c517725bSHan-Wen Nienhuys
576c517725bSHan-Wen Nienhuys
577b9e818b5SShawn Pearce#### Reading the log
578b9e818b5SShawn Pearce
579b9e818b5SShawn PearceReaders accessing the log must first read the footer (below) to
580b9e818b5SShawn Pearcedetermine the `log_position`.  The first block of the log begins at
581b9e818b5SShawn Pearce`log_position` bytes since the start of the file.  The `log_position`
582b9e818b5SShawn Pearceis not block aligned.
583b9e818b5SShawn Pearce
584b9e818b5SShawn Pearce#### Importing logs
585b9e818b5SShawn Pearce
586b9e818b5SShawn PearceWhen importing from `$GIT_DIR/logs` writers should globally order all
587b9e818b5SShawn Pearcelog records roughly by timestamp while preserving file order, and
588b9e818b5SShawn Pearceassign unique, increasing `update_index` values for each log line.
589b9e818b5SShawn PearceNewer log records get higher `update_index` values.
590b9e818b5SShawn Pearce
591b9e818b5SShawn PearceAlthough an import may write only a single reftable file, the reftable
592b9e818b5SShawn Pearcefile must span many unique `update_index`, as each log line requires
593b9e818b5SShawn Pearceits own `update_index` to preserve semantics.
594b9e818b5SShawn Pearce
595b9e818b5SShawn Pearce### Log index
596b9e818b5SShawn Pearce
597b9e818b5SShawn PearceThe log index stores the log key (`refname \0 reverse_int64(update_index)`)
598b9e818b5SShawn Pearcefor the last log record of every log block in the file, supporting
599b9e818b5SShawn Pearcebounded-time lookup.
600b9e818b5SShawn Pearce
601b9e818b5SShawn PearceA log index block must be written if 2 or more log blocks are written
602b9e818b5SShawn Pearceto the file.  If present, the log index appears after the last log
603b9e818b5SShawn Pearceblock.  There is no padding used to align the log index to block
604b9e818b5SShawn Pearcealignment.
605b9e818b5SShawn Pearce
606b9e818b5SShawn PearceLog index format is identical to ref index, except the keys are 9
607b9e818b5SShawn Pearcebytes longer to include `'\0'` and the 8-byte
608b9e818b5SShawn Pearce`reverse_int64(update_index)`.  Records use `block_position` to
609b9e818b5SShawn Pearcerefer to the start of a log block.
610b9e818b5SShawn Pearce
611b9e818b5SShawn Pearce#### Reading the index
612b9e818b5SShawn Pearce
613b9e818b5SShawn PearceReaders loading the log index must first read the footer (below) to
614b9e818b5SShawn Pearceobtain `log_index_position`. If not present, the position will be 0.
615b9e818b5SShawn Pearce
616b9e818b5SShawn Pearce### Footer
617b9e818b5SShawn Pearce
618b9e818b5SShawn PearceAfter the last block of the file, a file footer is written.  It begins
619b9e818b5SShawn Pearcelike the file header, but is extended with additional data.
620b9e818b5SShawn Pearce
621b9e818b5SShawn PearceA 68-byte footer appears at the end:
622b9e818b5SShawn Pearce
623b9e818b5SShawn Pearce```
624b9e818b5SShawn Pearce    'REFT'
625b9e818b5SShawn Pearce    uint8( version_number = 1 )
626b9e818b5SShawn Pearce    uint24( block_size )
627b9e818b5SShawn Pearce    uint64( min_update_index )
628b9e818b5SShawn Pearce    uint64( max_update_index )
629b9e818b5SShawn Pearce
630b9e818b5SShawn Pearce    uint64( ref_index_position )
631b9e818b5SShawn Pearce    uint64( (obj_position << 5) | obj_id_len )
632b9e818b5SShawn Pearce    uint64( obj_index_position )
633b9e818b5SShawn Pearce
634b9e818b5SShawn Pearce    uint64( log_position )
635b9e818b5SShawn Pearce    uint64( log_index_position )
636b9e818b5SShawn Pearce
637b9e818b5SShawn Pearce    uint32( CRC-32 of above )
638b9e818b5SShawn Pearce```
639b9e818b5SShawn Pearce
640b9e818b5SShawn PearceIf a section is missing (e.g. ref index) the corresponding position
641b9e818b5SShawn Pearcefield (e.g. `ref_index_position`) will be 0.
642b9e818b5SShawn Pearce
643b9e818b5SShawn Pearce- `obj_position`: byte position for the first obj block.
644b9e818b5SShawn Pearce- `obj_id_len`: number of bytes used to abbreviate object identifiers
645b9e818b5SShawn Pearce  in obj blocks.
646b9e818b5SShawn Pearce- `log_position`: byte position for the first log block.
647b9e818b5SShawn Pearce- `ref_index_position`: byte position for the start of the ref index.
648b9e818b5SShawn Pearce- `obj_index_position`: byte position for the start of the obj index.
649b9e818b5SShawn Pearce- `log_index_position`: byte position for the start of the log index.
650b9e818b5SShawn Pearce
651b9e818b5SShawn Pearce#### Reading the footer
652b9e818b5SShawn Pearce
653b9e818b5SShawn PearceReaders must seek to `file_length - 68` to access the footer.  A
654b9e818b5SShawn Pearcetrusted external source (such as `stat(2)`) is necessary to obtain
655b9e818b5SShawn Pearce`file_length`.  When reading the footer, readers must verify:
656b9e818b5SShawn Pearce
657b9e818b5SShawn Pearce- 4-byte magic is correct
658b9e818b5SShawn Pearce- 1-byte version number is recognized
659b9e818b5SShawn Pearce- 4-byte CRC-32 matches the other 64 bytes (including magic, and version)
660b9e818b5SShawn Pearce
661b9e818b5SShawn PearceOnce verified, the other fields of the footer can be accessed.
662b9e818b5SShawn Pearce
663b9e818b5SShawn Pearce### Varint encoding
664b9e818b5SShawn Pearce
665b9e818b5SShawn PearceVarint encoding is identical to the ofs-delta encoding method used
666b9e818b5SShawn Pearcewithin pack files.
667b9e818b5SShawn Pearce
668b9e818b5SShawn PearceDecoder works such as:
669b9e818b5SShawn Pearce
670b9e818b5SShawn Pearce    val = buf[ptr] & 0x7f
671b9e818b5SShawn Pearce    while (buf[ptr] & 0x80) {
672b9e818b5SShawn Pearce      ptr++
673b9e818b5SShawn Pearce      val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
674b9e818b5SShawn Pearce    }
675b9e818b5SShawn Pearce
676b9e818b5SShawn Pearce### Binary search
677b9e818b5SShawn Pearce
678b9e818b5SShawn PearceBinary search within a block is supported by the `restart_offset`
679b9e818b5SShawn Pearcefields at the end of the block.  Readers can binary search through the
680b9e818b5SShawn Pearcerestart table to locate between which two restart points the sought
681b9e818b5SShawn Pearcereference or key should appear.
682b9e818b5SShawn Pearce
683b9e818b5SShawn PearceEach record identified by a `restart_offset` stores the complete key
684b9e818b5SShawn Pearcein the `suffix` field of the record, making the compare operation
685b9e818b5SShawn Pearceduring binary search straightforward.
686b9e818b5SShawn Pearce
687b9e818b5SShawn PearceOnce a restart point lexicographically before the sought reference has
688b9e818b5SShawn Pearcebeen identified, readers can linearly scan through the following
689b9e818b5SShawn Pearcerecord entries to locate the sought record, terminating if the current
690b9e818b5SShawn Pearcerecord sorts after (and therefore the sought key is not present).
691b9e818b5SShawn Pearce
692b9e818b5SShawn Pearce#### Restart point selection
693b9e818b5SShawn Pearce
694b9e818b5SShawn PearceWriters determine the restart points at file creation.  The process is
695b9e818b5SShawn Pearcearbitrary, but every 16 or 64 records is recommended.  Every 16 may
696b9e818b5SShawn Pearcebe more suitable for smaller block sizes (4k or 8k), every 64 for
697b9e818b5SShawn Pearcelarger block sizes (64k).
698b9e818b5SShawn Pearce
699b9e818b5SShawn PearceMore frequent restart points reduces prefix compression and increases
700b9e818b5SShawn Pearcespace consumed by the restart table, both of which increase file size.
701b9e818b5SShawn Pearce
702b9e818b5SShawn PearceLess frequent restart points makes prefix compression more effective,
703b9e818b5SShawn Pearcedecreasing overall file size, with increased penalities for readers
704b9e818b5SShawn Pearcewalking through more records after the binary search step.
705b9e818b5SShawn Pearce
706b9e818b5SShawn PearceA maximum of `65535` restart points per block is supported.
707b9e818b5SShawn Pearce
708b9e818b5SShawn Pearce## Considerations
709b9e818b5SShawn Pearce
710b9e818b5SShawn Pearce### Lightweight refs dominate
711b9e818b5SShawn Pearce
712b9e818b5SShawn PearceThe reftable format assumes the vast majority of references are single
713b9e818b5SShawn PearceSHA-1 valued with common prefixes, such as Gerrit Code Review's
714b9e818b5SShawn Pearce`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
715b9e818b5SShawn Pearcelightweight tags in the `refs/tags/` namespace.
716b9e818b5SShawn Pearce
717b9e818b5SShawn PearceAnnotated tags storing the peeled object cost an additional 20 bytes
718b9e818b5SShawn Pearceper reference.
719b9e818b5SShawn Pearce
720b9e818b5SShawn Pearce### Low overhead
721b9e818b5SShawn Pearce
722b9e818b5SShawn PearceA reftable with very few references (e.g. git.git with 5 heads)
723b9e818b5SShawn Pearceis 269 bytes for reftable, vs. 332 bytes for packed-refs.  This
724b9e818b5SShawn Pearcesupports reftable scaling down for transaction logs (below).
725b9e818b5SShawn Pearce
726b9e818b5SShawn Pearce### Block size
727b9e818b5SShawn Pearce
728b9e818b5SShawn PearceFor a Gerrit Code Review type repository with many change refs, larger
729b9e818b5SShawn Pearceblock sizes (64 KiB) and less frequent restart points (every 64) yield
730b9e818b5SShawn Pearcebetter compression due to more references within the block compressing
731b9e818b5SShawn Pearceagainst the prior reference.
732b9e818b5SShawn Pearce
733b9e818b5SShawn PearceLarger block sizes reduce the index size, as the reftable will
734b9e818b5SShawn Pearcerequire fewer blocks to store the same number of references.
735b9e818b5SShawn Pearce
736b9e818b5SShawn Pearce### Minimal disk seeks
737b9e818b5SShawn Pearce
738b9e818b5SShawn PearceAssuming the index block has been loaded into memory, binary searching
739b9e818b5SShawn Pearcefor any single reference requires exactly 1 disk seek to load the
740b9e818b5SShawn Pearcecontaining block.
741b9e818b5SShawn Pearce
742b9e818b5SShawn Pearce### Scans and lookups dominate
743b9e818b5SShawn Pearce
744b9e818b5SShawn PearceScanning all references and lookup by name (or namespace such as
745b9e818b5SShawn Pearce`refs/heads/`) are the most common activities performed on repositories.
746b9e818b5SShawn PearceSHA-1s are stored directly with references to optimize this use case.
747b9e818b5SShawn Pearce
748b9e818b5SShawn Pearce### Logs are infrequently read
749b9e818b5SShawn Pearce
750b9e818b5SShawn PearceLogs are infrequently accessed, but can be large.  Deflating log
751b9e818b5SShawn Pearceblocks saves disk space, with some increased penalty at read time.
752b9e818b5SShawn Pearce
753b9e818b5SShawn PearceLogs are stored in an isolated section from refs, reducing the burden
754b9e818b5SShawn Pearceon reference readers that want to ignore logs.  Further, historical
755b9e818b5SShawn Pearcelogs can be isolated into log-only files.
756b9e818b5SShawn Pearce
757b9e818b5SShawn Pearce### Logs are read backwards
758b9e818b5SShawn Pearce
759b9e818b5SShawn PearceLogs are frequently accessed backwards (most recent N records for
760b9e818b5SShawn Pearcemaster to answer `master@{4}`), so log records are grouped by
761b9e818b5SShawn Pearcereference, and sorted descending by update index.
762b9e818b5SShawn Pearce
763b9e818b5SShawn Pearce## Repository format
764b9e818b5SShawn Pearce
765b9e818b5SShawn Pearce### Version 1
766b9e818b5SShawn Pearce
767b9e818b5SShawn PearceA repository must set its `$GIT_DIR/config` to configure reftable:
768b9e818b5SShawn Pearce
769b9e818b5SShawn Pearce    [core]
770b9e818b5SShawn Pearce        repositoryformatversion = 1
771b9e818b5SShawn Pearce    [extensions]
772b9e818b5SShawn Pearce        refStorage = reftable
773b9e818b5SShawn Pearce
774b9e818b5SShawn Pearce### Layout
775b9e818b5SShawn Pearce
776b9e818b5SShawn PearceA collection of reftable files are stored in the `$GIT_DIR/reftable/`
777b9e818b5SShawn Pearcedirectory:
778b9e818b5SShawn Pearce
779cf11a03bSHan-Wen Nienhuys    00000001-00000001.log
780cf11a03bSHan-Wen Nienhuys    00000002-00000002.ref
781cf11a03bSHan-Wen Nienhuys    00000003-00000003.ref
782b9e818b5SShawn Pearce
783b9e818b5SShawn Pearcewhere reftable files are named by a unique name such as produced by
784cf11a03bSHan-Wen Nienhuysthe function `${min_update_index}-${max_update_index}.ref`.
785b9e818b5SShawn Pearce
786b9e818b5SShawn PearceLog-only files use the `.log` extension, while ref-only and mixed ref
787b9e818b5SShawn Pearceand log files use `.ref`.  extension.
788b9e818b5SShawn Pearce
789b9e818b5SShawn Pearce
790*c217d33fSHan-Wen NienhuysThe stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the current
791*c217d33fSHan-Wen Nienhuysfiles, one per line, in order, from oldest (base) to newest (most recent):
792*c217d33fSHan-Wen Nienhuys
793*c217d33fSHan-Wen Nienhuys    $ cat .git/reftable/tables.list
794cf11a03bSHan-Wen Nienhuys    00000001-00000001.log
795cf11a03bSHan-Wen Nienhuys    00000002-00000002.ref
796cf11a03bSHan-Wen Nienhuys    00000003-00000003.ref
797b9e818b5SShawn Pearce
798*c217d33fSHan-Wen NienhuysReaders must read `$GIT_DIR/reftable/tables.list` to determine which files are
799*c217d33fSHan-Wen Nienhuysrelevant right now, and search through the stack in reverse order (last reftable
800*c217d33fSHan-Wen Nienhuysis examined first).
801b9e818b5SShawn Pearce
802*c217d33fSHan-Wen NienhuysReftable files not listed in `tables.list` may be new (and about to be added
803b9e818b5SShawn Pearceto the stack by the active writer), or ancient and ready to be pruned.
804b9e818b5SShawn Pearce
805*c217d33fSHan-Wen Nienhuys### Backward compatibility
806*c217d33fSHan-Wen Nienhuys
807*c217d33fSHan-Wen NienhuysOlder clients should continue to recognize the directory as a git repository so
808*c217d33fSHan-Wen Nienhuysthey don't look for an enclosing repository in parent directories. To this end,
809*c217d33fSHan-Wen Nienhuysa reftable-enabled repository must contain the following dummy files
810*c217d33fSHan-Wen Nienhuys
811*c217d33fSHan-Wen Nienhuys*   `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`.
812*c217d33fSHan-Wen Nienhuys*   `.git/refs/`, a directory
813*c217d33fSHan-Wen Nienhuys*   `.git/refs/heads`, a regular file
814*c217d33fSHan-Wen Nienhuys
815b9e818b5SShawn Pearce### Readers
816b9e818b5SShawn Pearce
817b9e818b5SShawn PearceReaders can obtain a consistent snapshot of the reference space by
818b9e818b5SShawn Pearcefollowing:
819b9e818b5SShawn Pearce
820*c217d33fSHan-Wen Nienhuys1.  Open and read the `tables.list` file.
821b9e818b5SShawn Pearce2.  Open each of the reftable files that it mentions.
822b9e818b5SShawn Pearce3.  If any of the files is missing, goto 1.
823b9e818b5SShawn Pearce4.  Read from the now-open files as long as necessary.
824b9e818b5SShawn Pearce
825b9e818b5SShawn Pearce### Update transactions
826b9e818b5SShawn Pearce
827b9e818b5SShawn PearceAlthough reftables are immutable, mutations are supported by writing a
828b9e818b5SShawn Pearcenew reftable and atomically appending it to the stack:
829b9e818b5SShawn Pearce
830*c217d33fSHan-Wen Nienhuys1. Acquire `tables.list.lock`.
831*c217d33fSHan-Wen Nienhuys2. Read `tables.list` to determine current reftables.
832b9e818b5SShawn Pearce3. Select `update_index` to be most recent file's `max_update_index + 1`.
833cf11a03bSHan-Wen Nienhuys4. Prepare temp reftable `tmp_XXXXXX`, including log entries.
834cf11a03bSHan-Wen Nienhuys5. Rename `tmp_XXXXXX` to `${update_index}-${update_index}.ref`.
835*c217d33fSHan-Wen Nienhuys6. Copy `tables.list` to `tables.list.lock`, appending file from (5).
836*c217d33fSHan-Wen Nienhuys7. Rename `tables.list.lock` to `tables.list`.
837b9e818b5SShawn Pearce
838b9e818b5SShawn PearceDuring step 4 the new file's `min_update_index` and `max_update_index`
839b9e818b5SShawn Pearceare both set to the `update_index` selected by step 3.  All log
840b9e818b5SShawn Pearcerecords for the transaction use the same `update_index` in their keys.
841b9e818b5SShawn PearceThis enables later correlation of which references were updated by the
842b9e818b5SShawn Pearcesame transaction.
843b9e818b5SShawn Pearce
844*c217d33fSHan-Wen NienhuysBecause a single `tables.list.lock` file is used to manage locking, the
845b9e818b5SShawn Pearcerepository is single-threaded for writers.  Writers may have to
846*c217d33fSHan-Wen Nienhuysbusy-spin (with backoff) around creating `tables.list.lock`, for up to an
847b9e818b5SShawn Pearceacceptable wait period, aborting if the repository is too busy to
848b9e818b5SShawn Pearcemutate.  Application servers wrapped around repositories (e.g.  Gerrit
849b9e818b5SShawn PearceCode Review) can layer their own lock/wait queue to improve fairness
850b9e818b5SShawn Pearceto writers.
851b9e818b5SShawn Pearce
852b9e818b5SShawn Pearce### Reference deletions
853b9e818b5SShawn Pearce
854b9e818b5SShawn PearceDeletion of any reference can be explicitly stored by setting the
855b9e818b5SShawn Pearce`type` to `0x0` and omitting the `value` field of the `ref_record`.
856b9e818b5SShawn PearceThis serves as a tombstone, overriding any assertions about the
857b9e818b5SShawn Pearceexistence of the reference from earlier files in the stack.
858b9e818b5SShawn Pearce
859b9e818b5SShawn Pearce### Compaction
860b9e818b5SShawn Pearce
861b9e818b5SShawn PearceA partial stack of reftables can be compacted by merging references
862b9e818b5SShawn Pearceusing a straightforward merge join across reftables, selecting the
863b9e818b5SShawn Pearcemost recent value for output, and omitting deleted references that do
864b9e818b5SShawn Pearcenot appear in remaining, lower reftables.
865b9e818b5SShawn Pearce
866b9e818b5SShawn PearceA compacted reftable should set its `min_update_index` to the smallest of
867b9e818b5SShawn Pearcethe input files' `min_update_index`, and its `max_update_index`
868b9e818b5SShawn Pearcelikewise to the largest input `max_update_index`.
869b9e818b5SShawn Pearce
870b9e818b5SShawn PearceFor sake of illustration, assume the stack currently consists of
871b9e818b5SShawn Pearcereftable files (from oldest to newest): A, B, C, and D. The compactor
872b9e818b5SShawn Pearceis going to compact B and C, leaving A and D alone.
873b9e818b5SShawn Pearce
874*c217d33fSHan-Wen Nienhuys1.  Obtain lock `tables.list.lock` and read the `tables.list` file.
875b9e818b5SShawn Pearce2.  Obtain locks `B.lock` and `C.lock`.
876b9e818b5SShawn Pearce    Ownership of these locks prevents other processes from trying
877b9e818b5SShawn Pearce    to compact these files.
878*c217d33fSHan-Wen Nienhuys3.  Release `tables.list.lock`.
879cf11a03bSHan-Wen Nienhuys4.  Compact `B` and `C` into a temp file `${min_update_index}-${max_update_index}_XXXXXX`.
880*c217d33fSHan-Wen Nienhuys5.  Reacquire lock `tables.list.lock`.
881b9e818b5SShawn Pearce6.  Verify that `B` and `C` are still in the stack, in that order. This
882b9e818b5SShawn Pearce    should always be the case, assuming that other processes are adhering
883b9e818b5SShawn Pearce    to the locking protocol.
884cf11a03bSHan-Wen Nienhuys7.  Rename `${min_update_index}-${max_update_index}_XXXXXX` to
885cf11a03bSHan-Wen Nienhuys    `${min_update_index}-${max_update_index}.ref`.
886*c217d33fSHan-Wen Nienhuys8.  Write the new stack to `tables.list.lock`, replacing `B` and `C` with the
887b9e818b5SShawn Pearce    file from (4).
888*c217d33fSHan-Wen Nienhuys9.  Rename `tables.list.lock` to `tables.list`.
889b9e818b5SShawn Pearce10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
890b9e818b5SShawn Pearce    readers to backtrack.
891b9e818b5SShawn Pearce
892b9e818b5SShawn PearceThis strategy permits compactions to proceed independently of updates.
893b9e818b5SShawn Pearce
894cf11a03bSHan-Wen NienhuysEach reftable (compacted or not) is uniquely identified by its name, so open
895cf11a03bSHan-Wen Nienhuysreftables can be cached by their name.
896cf11a03bSHan-Wen Nienhuys
897b9e818b5SShawn Pearce## Alternatives considered
898b9e818b5SShawn Pearce
899b9e818b5SShawn Pearce### bzip packed-refs
900b9e818b5SShawn Pearce
901b9e818b5SShawn Pearce`bzip2` can significantly shrink a large packed-refs file (e.g. 62
902b9e818b5SShawn PearceMiB compresses to 23 MiB, 37%).  However the bzip format does not support
903b9e818b5SShawn Pearcerandom access to a single reference. Readers must inflate and discard
904b9e818b5SShawn Pearcewhile performing a linear scan.
905b9e818b5SShawn Pearce
906b9e818b5SShawn PearceBreaking packed-refs into chunks (individually compressing each chunk)
907b9e818b5SShawn Pearcewould reduce the amount of data a reader must inflate, but still
908b9e818b5SShawn Pearceleaves the problem of indexing chunks to support readers efficiently
909b9e818b5SShawn Pearcelocating the correct chunk.
910b9e818b5SShawn Pearce
911b9e818b5SShawn PearceGiven the compression achieved by reftable's encoding, it does not
912b9e818b5SShawn Pearceseem necessary to add the complexity of bzip/gzip/zlib.
913b9e818b5SShawn Pearce
914b9e818b5SShawn Pearce### Michael Haggerty's alternate format
915b9e818b5SShawn Pearce
916b9e818b5SShawn PearceMichael Haggerty proposed [an alternate][mh-alt] format to reftable on
917b9e818b5SShawn Pearcethe Git mailing list.  This format uses smaller chunks, without the
918b9e818b5SShawn Pearcerestart table, and avoids block alignment with padding.  Reflog entries
919b9e818b5SShawn Pearceimmediately follow each ref, and are thus interleaved between refs.
920b9e818b5SShawn Pearce
921b9e818b5SShawn PearcePerformance testing indicates reftable is faster for lookups (51%
922b9e818b5SShawn Pearcefaster, 11.2 usec vs.  5.4 usec), although reftable produces a
923b9e818b5SShawn Pearceslightly larger file (+ ~3.2%, 28.3M vs 29.2M):
924b9e818b5SShawn Pearce
925b9e818b5SShawn Pearceformat    |  size  | seek cold | seek hot  |
926b9e818b5SShawn Pearce---------:|-------:|----------:|----------:|
927b9e818b5SShawn Pearcemh-alt    | 28.3 M | 23.4 usec | 11.2 usec |
928b9e818b5SShawn Pearcereftable  | 29.2 M | 19.9 usec |  5.4 usec |
929b9e818b5SShawn Pearce
930b9e818b5SShawn Pearce[mh-alt]: https://public-inbox.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe+wGvEJ7A@mail.gmail.com/
931b9e818b5SShawn Pearce
932b9e818b5SShawn Pearce### JGit Ketch RefTree
933b9e818b5SShawn Pearce
934b9e818b5SShawn Pearce[JGit Ketch][ketch] proposed [RefTree][reftree], an encoding of
935b9e818b5SShawn Pearcereferences inside Git tree objects stored as part of the repository's
936b9e818b5SShawn Pearceobject database.
937b9e818b5SShawn Pearce
938b9e818b5SShawn PearceThe RefTree format adds additional load on the object database storage
939b9e818b5SShawn Pearcelayer (more loose objects, more objects in packs), and relies heavily
940b9e818b5SShawn Pearceon the packer's delta compression to save space.  Namespaces which are
941b9e818b5SShawn Pearceflat (e.g.  thousands of tags in refs/tags) initially create very
942b9e818b5SShawn Pearcelarge loose objects, and so RefTree does not address the problem of
943b9e818b5SShawn Pearcecopying many references to modify a handful.
944b9e818b5SShawn Pearce
945b9e818b5SShawn PearceFlat namespaces are not efficiently searchable in RefTree, as tree
946b9e818b5SShawn Pearceobjects in canonical formatting cannot be binary searched. This fails
947b9e818b5SShawn Pearcethe need to handle a large number of references in a single namespace,
948b9e818b5SShawn Pearcesuch as GitHub's `refs/pulls`, or a project with many tags.
949b9e818b5SShawn Pearce
950b9e818b5SShawn Pearce[ketch]: https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html
951b9e818b5SShawn Pearce[reftree]: https://public-inbox.org/git/CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com/
952b9e818b5SShawn Pearce
953b9e818b5SShawn Pearce### LMDB
954b9e818b5SShawn Pearce
955b9e818b5SShawn PearceDavid Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight
956b9e818b5SShawn Pearce(64k of runtime code) and GPL-compatible license.
957b9e818b5SShawn Pearce
958b9e818b5SShawn PearceA downside of LMDB is its reliance on a single C implementation.  This
959b9e818b5SShawn Pearcemakes embedding inside JGit (a popular reimplemenation of Git)
960b9e818b5SShawn Pearcedifficult, and hoisting onto virtual storage (for JGit DFS) virtually
961b9e818b5SShawn Pearceimpossible.
962b9e818b5SShawn Pearce
963b9e818b5SShawn PearceA common format that can be supported by all major Git implementations
964b9e818b5SShawn Pearce(git-core, JGit, libgit2) is strongly preferred.
965b9e818b5SShawn Pearce
966b9e818b5SShawn Pearce[dt-lmdb]: https://public-inbox.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/
967b9e818b5SShawn Pearce
968b9e818b5SShawn Pearce## Future
969b9e818b5SShawn Pearce
970b9e818b5SShawn Pearce### Longer hashes
971b9e818b5SShawn Pearce
972b9e818b5SShawn PearceVersion will bump (e.g.  2) to indicate `value` uses a different
973b9e818b5SShawn Pearceobject id length other than 20.  The length could be stored in an
974b9e818b5SShawn Pearceexpanded file header, or hardcoded as part of the version.
975