<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Apache Lucene - Building and Installing the Basic Demo</title>
</head>
<body>
<p>The demo module offers simple example code to show the features of Lucene.</p>
<h1>Apache Lucene - Building and Installing the Basic Demo</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li><a href="#About_this_Document">About this Document</a></li>
<li><a href="#About_the_Demo">About the Demo</a></li>
<li><a href="#Setting_your_CLASSPATH">Setting your CLASSPATH</a></li>
<li><a href="#Indexing_Files">Indexing Files</a></li>
<li><a href="#About_the_code">About the code</a></li>
<li><a href="#Location_of_the_source">Location of the source</a></li>
<li><a href="#IndexFiles">IndexFiles</a></li>
<li><a href="#Searching_Files">Searching Files</a></li>
<li><a href="#Embeddings">Working with vector embeddings</a></li>
</ul>
</div>
<a id="About_this_Document"></a>
<h2 class="boxed">About this Document</h2>
<div class="section">
<p>This document is intended as a "getting started" guide to using and running
the Lucene demos. It walks you through some basic installation and
configuration.</p>
</div>
<a id="About_the_Demo"></a>
<h2 class="boxed">About the Demo</h2>
<div class="section">
<p>The Lucene command-line demo code consists of an application that
demonstrates various functionalities of Lucene and how you can add Lucene to
your applications.</p>
</div>
<a id="Setting_your_CLASSPATH"></a>
<h2 class="boxed">Setting your CLASSPATH</h2>
<div class="section">
<p>First, you should <a href=
"http://www.apache.org/dyn/closer.cgi/lucene/java/">download</a> the latest
Lucene distribution and then extract it to a working directory.</p>
<p>You need four JARs: the Lucene JAR, the queryparser JAR, the common analysis JAR, and the Lucene
demo JAR. You should see the Lucene JAR file in the modules/ directory that was created
when you extracted the archive; it should be named something like
<span class="codefrag">lucene-core-{version}.jar</span>. You should also see
files called <span class="codefrag">lucene-queryparser-{version}.jar</span>,
<span class=
"codefrag">lucene-analysis-common-{version}.jar</span> and <span class=
"codefrag">lucene-demo-{version}.jar</span> under queryparser/, analysis/common/ and demo/,
respectively.</p>
<p>Put all four of these files in your Java CLASSPATH.</p>
</div>
<a id="Indexing_Files"></a>
<h2 class="boxed">Indexing Files</h2>
<div class="section">
<p>Once you've gotten this far you're probably itching to go. Let's <b>build an
index!</b> Assuming you've set your CLASSPATH correctly, just type:</p>
<pre>
    java org.apache.lucene.demo.IndexFiles -docs {path-to-lucene}
</pre>
<p>This will produce a subdirectory called <span class="codefrag">index</span>
which will contain an index of all of the Lucene source code.</p>
<p>To <b>search the index</b> type:</p>
<pre>
    java org.apache.lucene.demo.SearchFiles
</pre>
<p>You'll be prompted for a query. Type in a gibberish or made-up word (for example:
"superca<!-- need to break up the word in a way that is not visible so it doesn't cause this file to match a search on this word -->lifragilisticexpialidocious").
You'll see that there are no matching results in the Lucene source code.
Now try entering the word "string". That should return a whole bunch
of documents. The results will page at every tenth result and ask you whether
you want more results.</p>
</div>
<a id="About_the_code"></a>
<h2 class="boxed">About the code</h2>
<div class="section">
<p>In this section we walk through the sources behind the command-line Lucene
demo: where to find them, their parts and their function. This section is
intended for Java developers wishing to understand how to use Lucene in their
applications.</p>
</div>
<a id="Location_of_the_source"></a>
<h2 class="boxed">Location of the source</h2>
<div class="section">
<p>The files discussed here are linked into this documentation directly:</p>
  <ul>
     <li><a href="src-html/org/apache/lucene/demo/IndexFiles.html">IndexFiles.java</a>: code to create a Lucene index.</li>
     <li><a href="src-html/org/apache/lucene/demo/SearchFiles.html">SearchFiles.java</a>: code to search a Lucene index.</li>
  </ul>
</div>
<a id="IndexFiles"></a>
<h2 class="boxed">IndexFiles</h2>
<div class="section">
<p>As we discussed in the previous walk-through, the <a href=
"src-html/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class creates
a Lucene index. Let's take a look at how it does this.</p>
<p>The <span class="codefrag">main()</span> method parses the command-line
parameters, then in preparation for instantiating
{@link org.apache.lucene.index.IndexWriter IndexWriter}, opens a
{@link org.apache.lucene.store.Directory Directory}, and
instantiates {@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer}
and {@link org.apache.lucene.index.IndexWriterConfig IndexWriterConfig}.</p>
<p>The value of the <span class="codefrag">-index</span> command-line parameter
is the name of the filesystem directory where all index information should be
stored. If <span class="codefrag">-index</span> is given a relative path, or is
omitted so that the default relative index path "<span class=
"codefrag">index</span>" is used, the index path will be created as a
subdirectory of the current working directory (if it does not already exist).
On some platforms, the index path may be created in a different directory (such
as the user's home directory).</p>
<p>The <span class="codefrag">-docs</span> command-line parameter value is the
location of the directory containing files to be indexed.</p>
<p>The <span class="codefrag">-update</span> command-line parameter tells
<span class="codefrag">IndexFiles</span> not to delete the index if it already
exists. When <span class="codefrag">-update</span> is not given, <span class=
"codefrag">IndexFiles</span> will first wipe the slate clean before indexing
any documents.</p>
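<p>The parameter handling described above can be sketched roughly as follows. This is a
minimal illustration that uses only the JDK, not the demo's actual source: the class name
<span class="codefrag">IndexArgsSketch</span>, its fields, and the choice to reject unknown
parameters are all assumptions made for the example.</p>

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of the command-line handling; not the demo's actual code. */
class IndexArgsSketch {
    final String indexPath;
    final String docsPath;
    final boolean create;

    IndexArgsSketch(String indexPath, String docsPath, boolean create) {
        this.indexPath = indexPath;
        this.docsPath = docsPath;
        this.create = create;
    }

    static IndexArgsSketch parse(String[] args) {
        String indexPath = "index"; // default relative index path
        String docsPath = null;     // no default: the caller must supply -docs
        boolean create = true;      // without -update, start from a clean index
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-index":
                    indexPath = args[++i]; // a missing value fails with an index error
                    break;
                case "-docs":
                    docsPath = args[++i];
                    break;
                case "-update":
                    create = false;
                    break;
                default:
                    // Assumption of this sketch: unknown parameters are rejected.
                    throw new IllegalArgumentException("unknown parameter " + args[i]);
            }
        }
        return new IndexArgsSketch(indexPath, docsPath, create);
    }

    /** Resolves a relative index path against the current working directory. */
    Path resolvedIndexPath() {
        return Paths.get(indexPath).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        IndexArgsSketch parsed = parse(args);
        System.out.println("index:  " + parsed.resolvedIndexPath());
        System.out.println("docs:   " + parsed.docsPath);
        System.out.println("create: " + parsed.create);
    }
}
```

<p>Running the sketch with <span class="codefrag">-docs src -update</span> keeps the default
<span class="codefrag">index</span> path, resolved against the working directory, and
reports <span class="codefrag">create: false</span>.</p>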
<p>Lucene {@link org.apache.lucene.store.Directory Directory}s are used by
the <span class="codefrag">IndexWriter</span> to store information in the
index. In addition to the {@link org.apache.lucene.store.FSDirectory FSDirectory}
implementation we are using, there are several other <span class=
"codefrag">Directory</span> subclasses that can write to RAM, to databases,
etc.</p>
<p>Lucene {@link org.apache.lucene.analysis.Analyzer Analyzer}s are
processing pipelines that break up text into indexed tokens, a.k.a. terms, and
optionally perform other operations on these tokens, e.g. downcasing, synonym
insertion, filtering out unwanted tokens, etc. The <span class=
"codefrag">Analyzer</span> we are using is <span class=
"codefrag">StandardAnalyzer</span>, which creates tokens using the Word Break
rules from the Unicode Text Segmentation algorithm specified in <a href=
"http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>; converts
tokens to lowercase; and then filters out stopwords. Stopwords are common
language words such as articles (a, an, the, etc.) and other tokens that may
have less value for searching. Note that these rules differ from language to
language, and you should use the proper analyzer for each.
Lucene currently provides Analyzers for a number of different languages (see
the javadocs under <a href=
"../analysis/common/overview-summary.html">lucene/analysis/common/src/java/org/apache/lucene/analysis</a>).</p>
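<p>To make the three analysis stages concrete, here is a deliberately simplified sketch
that uses only the JDK. It is not how <span class="codefrag">StandardAnalyzer</span> is
implemented: the regular expression is a crude stand-in for the UAX #29 word break rules,
and the class name <span class="codefrag">AnalyzerSketch</span> and its three-word
stopword set are assumptions made for illustration.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, simplified analysis pipeline: tokenize, lowercase, drop stopwords. */
class AnalyzerSketch {
    // Illustrative stopword set; not the set used by any Lucene analyzer.
    static final Set<String> STOPWORDS = Set.of("a", "an", "the");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Stage 1: split on runs of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator yields one empty token
            }
            // Stage 2: lowercase independently of the default locale.
            String lower = token.toLowerCase(Locale.ROOT);
            // Stage 3: filter out stopwords.
            if (!STOPWORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, and, string]
        System.out.println(analyze("The Quick brown fox, and a String!"));
    }
}
```

<p>Because the same pipeline is applied to documents and to queries, a query for
"String" and an indexed occurrence of "string" end up as the same term.</p>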
<p>The <span class="codefrag">IndexWriterConfig</span> instance holds all
configuration for <span class="codefrag">IndexWriter</span>. For example, we
set the <span class="codefrag">OpenMode</span> to use here based on the value
of the <span class="codefrag">-update</span> command-line parameter.</p>
<p>Looking further down in the file, after <span class=
"codefrag">IndexWriter</span> is instantiated, you should see the <span class=
"codefrag">indexDocs()</span> code. This recursive function crawls the
directories and creates {@link org.apache.lucene.document.Document Document} objects. The
<span class="codefrag">Document</span> is simply a data object to represent the
text content from the file as well as its creation time and location. These
instances are added to the <span class="codefrag">IndexWriter</span>. If the
<span class="codefrag">-update</span> command-line parameter is given, the
<span class="codefrag">IndexWriterConfig</span> <span class=
"codefrag">OpenMode</span> will be set to {@link org.apache.lucene.index.IndexWriterConfig.OpenMode#CREATE_OR_APPEND
OpenMode.CREATE_OR_APPEND}, and rather than adding documents
to the index, the <span class="codefrag">IndexWriter</span> will
<strong>update</strong> them in the index: it attempts to find an
already-indexed document with the same identifier (in our case, the file path
serves as the identifier), deletes it from the index if it exists, and then
adds the new document to the index.</p>
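<p>The difference between adding and updating can be illustrated without Lucene at all.
In the sketch below a plain list stands in for the index and a small
<span class="codefrag">FileDoc</span> class stands in for a
<span class="codefrag">Document</span>; both are hypothetical, as are the class and method
names. The directory recursion is delegated to the JDK's
<span class="codefrag">Files.walkFileTree</span>.</p>

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of crawl-and-index with add versus update semantics. */
class IndexDocsSketch {
    /** Stand-in for a Document: the file's location and creation time. */
    static final class FileDoc {
        final String path;
        final long created;

        FileDoc(String path, long created) {
            this.path = path;
            this.created = created;
        }
    }

    /** Visits every regular file under root and adds or updates its entry. */
    static void indexDocs(Path root, List<FileDoc> index, boolean update) {
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    FileDoc doc = new FileDoc(file.toString(), attrs.creationTime().toMillis());
                    if (update) {
                        // Update: delete any entry with the same identifier (the path) first.
                        index.removeIf(old -> old.path.equals(doc.path));
                    }
                    index.add(doc);
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Indexes a one-file temporary tree twice and returns the number of entries. */
    static int indexTwice(boolean update) {
        try {
            Path root = Files.createTempDirectory("docs");
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.write(sub.resolve("a.txt"), "hello".getBytes(StandardCharsets.UTF_8));
            List<FileDoc> index = new ArrayList<>();
            indexDocs(root, index, update);
            indexDocs(root, index, update);
            return index.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("entries after two add passes:    " + indexTwice(false));
        System.out.println("entries after two update passes: " + indexTwice(true));
    }
}
```

<p>Two plain add passes leave two entries for the same file, while two update passes
leave one, which is why the file path has to act as a stable identifier.</p>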
</div>
<a id="Searching_Files"></a>
<h2 class="boxed">Searching Files</h2>
<div class="section">
<p>The <a href=
"src-html/org/apache/lucene/demo/SearchFiles.html">SearchFiles</a> class is
quite simple. It primarily collaborates with an
{@link org.apache.lucene.search.IndexSearcher IndexSearcher}, a
{@link org.apache.lucene.analysis.standard.StandardAnalyzer StandardAnalyzer}
(which is used in the <a href=
"src-html/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class as well)
and a {@link org.apache.lucene.queryparser.classic.QueryParser QueryParser}. The
query parser is constructed with an analyzer used to interpret your query text
in the same way the documents are interpreted: finding word boundaries,
downcasing, and removing useless words like 'a', 'an' and 'the'. The
{@link org.apache.lucene.queryparser.classic.QueryParser QueryParser} produces a
{@link org.apache.lucene.search.Query Query} object, which
is passed to the searcher. Note that it's also possible to programmatically
construct a rich {@link org.apache.lucene.search.Query Query} object without using
the query parser. The query parser just enables decoding the <a href=
"../queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description">
Lucene query syntax</a> into the corresponding
{@link org.apache.lucene.search.Query Query} object.</p>
<p><span class="codefrag">SearchFiles</span> uses the
{@link org.apache.lucene.search.IndexSearcher#search(org.apache.lucene.search.Query,int)
IndexSearcher.search(query,n)} method, which returns
{@link org.apache.lucene.search.TopDocs TopDocs} with at most
<span class="codefrag">n</span> hits. The results are printed in pages, sorted
by score (i.e. relevance).</p>
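<p>The "top <span class="codefrag">n</span> hits, shown a page at a time" behaviour can be
sketched with plain collections. This is an illustration under assumptions, not the code in
<span class="codefrag">SearchFiles</span>: the <span class="codefrag">Hit</span> class and the
<span class="codefrag">topN</span> and <span class="codefrag">page</span> helpers are
hypothetical names.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: rank hits by score, keep the top n, print them in pages. */
class PagingSketch {
    /** Stand-in for a scored search hit. */
    static final class Hit {
        final String path;
        final float score;

        Hit(String path, float score) {
            this.path = path;
            this.score = score;
        }
    }

    /** Returns at most n hits, highest score first. */
    static List<Hit> topN(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    /** Returns one page of an already ranked list; empty when past the end. */
    static List<Hit> page(List<Hit> ranked, int pageIndex, int pageSize) {
        int from = Math.min(pageIndex * pageSize, ranked.size());
        int to = Math.min(from + pageSize, ranked.size());
        return ranked.subList(from, to);
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            hits.add(new Hit("doc" + i + ".txt", i / 10f));
        }
        List<Hit> ranked = topN(hits, 25);
        for (int p = 0; !page(ranked, p, 10).isEmpty(); p++) {
            System.out.println("page " + (p + 1));
            for (Hit hit : page(ranked, p, 10)) {
                System.out.println("  " + hit.path + " score=" + hit.score);
            }
        }
    }
}
```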
</div>
<h2 id="Embeddings" class="boxed">Working with vector embeddings</h2>
<div class="section">
    <p>In addition to indexing and searching text, IndexFiles and SearchFiles can also index and search
        numeric vectors derived from that text, known as "embeddings." This demo code uses pre-computed embeddings
        provided by the <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> project, which are in the public
        domain. The dictionary here is a tiny subset of the full GloVe dataset. It includes only the words that occur
        in the toy data set, and is definitely <i>not ready for production use</i>! If you use this code to create
        a vector index for a larger document set, the indexer will throw an exception because
        a more complete set of embeddings is needed to get reasonable results.
    </p>
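    <p>As a rough illustration of what "vectors derived from text" can mean, the sketch below builds a
        document vector by summing per-word vectors from a small dictionary and normalizing the result, then
        compares two texts by dot product. This is one common approach, shown under assumptions, and not
        necessarily what the demo code does: the class <span class="codefrag">EmbeddingSketch</span>, the
        two-dimensional toy dictionary, and the choice to skip unknown words are all hypothetical.</p>

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: text to vector via a word-vector dictionary. */
class EmbeddingSketch {
    /** Sums the vectors of known words and scales the sum to unit length. */
    static float[] embed(String text, Map<String, float[]> dict, int dims) {
        float[] sum = new float[dims];
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            float[] vector = dict.get(word);
            if (vector == null) {
                continue; // assumption of this sketch: unknown words are skipped
            }
            for (int i = 0; i < dims; i++) {
                sum[i] += vector[i];
            }
        }
        double norm = 0;
        for (float component : sum) {
            norm += component * component;
        }
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < dims; i++) {
                sum[i] /= (float) norm;
            }
        }
        return sum; // all zeros when no word was found in the dictionary
    }

    /** Dot product; for unit-length vectors this equals cosine similarity. */
    static float dot(float[] a, float[] b) {
        float result = 0;
        for (int i = 0; i < a.length; i++) {
            result += a[i] * b[i];
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, float[]> dict = new HashMap<>();
        dict.put("cat", new float[] {1f, 0f}); // toy vectors, not GloVe data
        dict.put("dog", new float[] {0f, 1f});
        float[] query = embed("cat", dict, 2);
        System.out.println(dot(query, embed("cat dog", dict, 2)));
        System.out.println(dot(query, embed("dog", dict, 2)));
    }
}
```

    <p>A text made up only of words missing from the dictionary yields an all-zero vector here, which
        shows why a tiny dictionary cannot give reasonable results on a larger document set.</p>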
</div>
</body>
</html>