Regeneration
============

Lucene has a number of machine-generated resources - some of these are
resource (binary) files, others are Java source files that are stored
(and compiled) with the rest of Lucene source code.

If you're reading this, chances are that:

1) you've hit a precommit check error that said you've modified a generated
   resource and some checksums are out of sync.

2) you need to regenerate one (or more) of these resources.

In many cases hitting (1) means you'll have to do (2) so let's discuss
these in order.


Checksum validation errors
--------------------------

LUCENE-9868 introduced a system of storing (and validating) checksums of
generated files so that they are not accidentally modified. This checksum
system will fail the build with a message similar to this one:

Execution failed for task ':lucene:core:generateStandardTokenizerChecksumCheck'.
> Checksums mismatch for derived resources; you might have modified a generated resource (regenerate task: :lucene:core:generateStandardTokenizerIfChanged):
  Actual:
    lucene/core/[...]/StandardTokenizerImpl.java=3298326986432483248962398462938649869326

  Expected:
    lucene/core/[...]/StandardTokenizerImpl.java=8e33c2698446c1c7a9479796a41316d1932ceda8

The message shows you not only which resources have checksum mismatches (in this
case StandardTokenizerImpl.java) but also the *module* where the generated
resource exists and the *task name* that should be used to regenerate this resource:

:lucene:core:generateStandardTokenizerIfChanged
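
If you capture the failed build output in a script, the regenerate task can be
pulled out of the message automatically. This is only a minimal sketch: it
assumes the message keeps the "(regenerate task: ...)" form shown above, and
the function name extract_regenerate_task is made up for this example (it is
not part of the Lucene build).

```shell
# Hypothetical helper (not part of Lucene): reads build output on stdin and
# prints every task named in a "(regenerate task: ...)" hint, one per line.
# Assumes the message format shown in the example above.
extract_regenerate_task() {
  sed -n 's/.*(regenerate task: \([^)]*\)).*/\1/p'
}

# Usage (assumes a Unix-like shell and the gradlew wrapper in the current directory):
#   ./gradlew check 2>&1 | extract_regenerate_task
```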

To resolve the problem:

1) "git diff" the changes that caused the build failure (to see why the checksums
changed) and then decide whether to update the generated resource's template (or
whatever else is used to emit the generated resource);

2) regenerate the derived resources, possibly saving new checksums. If you decide to
regenerate, just run the task hinted at in the error message, for example:

gradlew :lucene:core:generateStandardTokenizerIfChanged

This regenerates all resources the task "generateStandardTokenizer" produces
and updates the corresponding checksums.


Resource regeneration
---------------------

The "convention" task for regenerating all derived resources in a given
module is called "regenerate" and you can apply it to all Lucene modules
by running:

gradlew regenerate

It is typically much wiser to limit the scope of regeneration to only
the module you're working with though:

gradlew -p lucene/analysis/common regenerate

If you're interested in what specific generation tasks are available, see
the task list for the generation group:

gradlew tasks --group generation

or limit the output to a particular module:

gradlew -p lucene/analysis/common tasks --group generation

which displays (at the moment of writing):

generateClassicTokenizer - Regenerate ClassicTokenizerImpl.java (if sources changed)
generateHTMLStripCharFilter - Regenerate HTMLStripCharFilter.java (if sources changed)
generateTlds - Regenerate top-level domain jflex macros and tests (if sources changed)
generateUAX29URLEmailTokenizer - Regenerate UAX29URLEmailTokenizerImpl.java (if sources changed)
generateWikipediaTokenizer - Regenerate WikipediaTokenizerImpl.java (if sources changed)
regenerate - Rerun any code or static data generation tasks.
snowball - Regenerates snowball stemmers.

You may wonder why none of these tasks are defined under these names in the gradle
source files, which only contain identically named tasks with an "Internal" suffix.
The next section explains how the two relate.


Resource checksums, incremental generation and advanced topics
--------------------------------------------------------------

Many resource generation tasks require specific tools (perl, python, bash shell)
and resources that may not be available on all platforms. In LUCENE-9868 we tried
to make resource generation tasks "incremental" so that they only run if their
sources (or outputs) have changed. So if you run the generic "regenerate" task, many of the
actual regeneration sub-tasks will be "skipped" - you can see this if you run gradle with
plain console, for example:

gradlew -p lucene/analysis/common regenerate --console=plain

...
> Task :lucene:analysis:common:generateUnicodeProps
Checksums consistent with sources, skipping task: :lucene:analysis:common:generateUnicodePropsInternal
...

This shouldn't worry you at all - the internal tasks are skipped by wrappers
if the inputs and outputs of the internal task have not changed. If they have changed,
the task is re-run and followed up by other tasks, such as code-formatting (tidy).
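
To get a quick overview of which internal tasks were skipped, the plain-console
output can be filtered. This is a minimal sketch that assumes the log line keeps
the wording shown in the example above; list_skipped_tasks is a made-up name,
not something the Lucene build provides.

```shell
# Hypothetical helper (not part of Lucene): reads plain-console build output
# on stdin and prints the internal tasks that were skipped because their
# checksums matched. Assumes the log line format shown in the example above.
list_skipped_tasks() {
  sed -n 's/^Checksums consistent with sources, skipping task: //p'
}

# Usage (assumes a Unix-like shell and the gradlew wrapper in the current directory):
#   ./gradlew -p lucene/analysis/common regenerate --console=plain | list_skipped_tasks
```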

Of course, sometimes you may want to *force* the regeneration task to run, even if the
checksums indicate nothing has changed. Typical reasons are:

- the generation task has outputs but no inputs, or the inputs are volatile. In this case
only the outputs have checksums and the task will be skipped if the outputs haven't changed.

- you may want to run the regeneration task just to see that it actually runs and produces
the same checksums (git diff should be clean). This would be a wise periodic sanity check
to ensure everything works as expected.

If you want to force-run the regeneration, use gradle's "--rerun-tasks" option:

gradlew regenerate --rerun-tasks

Scoping the call to a particular module will also work:

gradlew -p lucene/analysis/common regenerate --rerun-tasks

Scoping the call to a particular task will also work:

gradlew -p lucene/analysis/common generateUnicodeProps --rerun-tasks

You *should not* call the underlying generation task directly; this is possible
but discouraged:

gradlew -p lucene/analysis/common generateUnicodePropsInternal --rerun-tasks

The reason is that some of these generation tasks require follow-up (for example
source code tidying) and, more importantly, the checksums for these
regenerated resources won't be saved (so the next time you run 'check' it'll fail
with checksum mismatches).
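
The periodic sanity check mentioned earlier (force-run the wrapper task, then
confirm that git diff is clean) could be scripted roughly as follows. This is a
sketch under assumptions: regen_sanity_check and the GRADLEW override variable
are made up for this example, and it expects a Unix-like shell with a clean
working tree to start from.

```shell
# Hypothetical helper (not part of Lucene): force-runs "regenerate" for one
# module and fails if tracked files under that module changed as a result.
# GRADLEW is an assumed override for the wrapper path (defaults to ./gradlew).
regen_sanity_check() {
  module_dir="$1"
  "${GRADLEW:-./gradlew}" -p "$module_dir" regenerate --rerun-tasks || return 1
  # git diff --exit-code returns non-zero when tracked files differ from the index.
  git diff --exit-code -- "$module_dir"
}

# Usage, run from the repository root with a clean working tree:
#   regen_sanity_check lucene/analysis/common
```

Note that this goes through the "regenerate" wrapper rather than an "Internal"
task, so any follow-up tasks and checksum updates still happen.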

Finally, if you do feel like force-regenerating everything, remember to exclude this
monster...

gradlew regenerate -x generateUAX29URLEmailTokenizerInternal --rerun-tasks

and on Windows, exclude snowball regeneration (requires bash):

gradlew regenerate -x generateUAX29URLEmailTokenizerInternal -x snowball --rerun-tasks
153