
Lucene Domain Index

This project was originally sponsored by Lending Club, an online social lending network where people
can borrow and lend money among themselves based upon their affinities and/or social connections.
The project is under Apache V2 License:

Copyright 2004 The Apache Software Foundation


Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software


distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Lucene Domain Index ...................................................................................... 1


1. Introduction................................................................................................. 3
1.1 What is Lucene ........................................................................................... 3
1.2 What is Lucene Domain Index.......................................................................3
1.3 Why do I use Lucene Domain Index? .............................................................4
2. Install .......................................................................................................... 5
2.1 Requirements ............................................................................................. 5
2.2 Install binary distributions ............................................................................ 5
2.2.1 11g Binary Distribution........................................................................... 5
2.2.2 10g Binary Distribution........................................................................... 6
2.3 Install Instructions to compile from sources ....................................................7
2.3.1 Generating Maven's artifacts ...................................................................8
2.4 Optimizations ............................................................................................. 9
2.4.1 Using NCOMP on 10g ............................................................................. 9
2.4.2 Using JIT on 11g ................................................................................... 9
2.5 Testing Lucene Domain Index .......................................................................9
Required grants for regular Oracle users ...........................................................9
3. Examples................................................................................................... 12
IMPORTANT: Before starting to use Lucene Domain Index, grant this to any Oracle
user other than LUCENE: ................................................ 12
3.1 Create a Lucene Domain Index ................................................................ 12
3.1.1 Single column index............................................................................. 12
IMPORTANT: Lucene Domain Index name can not be larger than 21 characters.... 13
3.1.2 Multiple columns ................................................................................. 13
3.1.3 Multiple tables..................................................................................... 13
3.1.4 Padding and formatting ........................................................................ 15
3.1.5 Functional columns .............................................................................. 16
3.1.6 Create OnLine index ............................................................................ 17
3.1.7 Populate Index .................................................................................... 17
3.1.8 Parallel Index Operations...................................................................... 17
3.2 Alter ..................................................................................................... 18
3.2 Rebuild ................................................................................................. 18
3.2.1 Manual............................................................................................... 19
3.2.2 On Line .............................................................................................. 19
3.3 Drop..................................................................................................... 19
3.4 Querying............................................................................................... 19
3.4.1 Simple columns................................................................................... 20
3.4.2 Multiple columns ................................................................................. 20
3.4.3 Pagination .......................................................................................... 21
3.4.4 Sort................................................................................................... 21
3.4.5 Count Hits Function ............................................................................. 24
3.4.6 First Rows Optimizer Hint ..................................................................... 26
3.4.6 Highlighting ........................................................................................ 26
3.4.7 Highlighting using pipeline table functions............................................... 28
3.4.8 More like this functionality .................................................................... 29
3.4.9 Facets................................................................................................ 30
3.4.10 Terms pipeline table functions ............................................................. 33
3.4.11 Did You Mean functionality .................................................................. 34
3.5 Synchronize ............................................................................................. 36
3.6 Optimize .................................................................................................. 36
3.6 XMLDB Export .......................................................................................... 37
3.7 Exporting/Importing functional index with exp/imp Oracle tools....................... 41
4. Locking and Performance ............................................................................. 44
4.1 Lock used by Lucene Domain Index .......................................................... 44
4.2 Performance tips .................................................................................... 44
4.2.1 Index Writer parameters ...................................................................... 44
4.2.2 Auto Tune Memory functionality ............................................................ 44
4.2.3 Keep Index on RAM ............................................................................. 45
4.2.4 Compare your execution plan ................................................................ 45
5 Known caveats............................................................................. 46
Appendixes .................................................................................................... 48
A. Parameter reference and syntax ................................................................... 48
A.1 Lucene Index Writer parameters .............................................................. 49
A.1.1 MergeFactor ....................................................................................... 49
A.1.2 MaxBufferedDocs ................................................................................ 49
A.1.3 MaxMergeDocs................................................................................... 49
A.1.4 MaxBufferedDeleteTerms...................................................................... 49
A.1.5 UseCompoundFile ................................................................................ 49
A.2 Analyzer parameters .............................................................................. 49
A.2.1 Analyzer............................................................................................. 50
A.2.2 Stemmer............................................................................................ 50
A.2.3 PerFieldAnalyzer.................................................................................. 50
A.3 User Data Store parameters .................................................................... 51
A.3.1 ExtraCols ........................................................................................... 51
A.3.2 ExtraTabs........................................................................................... 51
A.3.3 WhereCondition .................................................................................. 52
A.3.4 UserDataStore .................................................................................... 52
A.3.2 FormatCols ......................................................................................... 52
A.4 General parameters................................................................................ 52
A.4.1 SyncMode .......................................................................................... 53
A.4.2 AutoTuneMemory ................................................................................ 53
A.4.3 LobStorageParameters ......................................................................... 53
A.4.4 LogLevel ............................................................................................ 53
A.4.5 CachedRowIdSize ................................................................................ 54
A.5 Query parameters .................................................................................. 54
A.5.1 DefaultColumn .................................................................................... 54
A.5.2 DefaultOperator .................................................................................. 54
A.5.3 NormalizeScore ................................................................................... 54
A.5.4 PreserveDocIdOrder ............................................................................ 55
A.6 Highlight parameters .............................................................................. 55
A.6.1 Formatter........................................................................................... 55
A.6.2 MaxNumFragmentsRequired ................................................................. 55
A.6.3 FragmentSize ..................................................................................... 55
A.6.4 FragmentSeparator.............................................................................. 55
B Lucene Domain Index Storage ....................................................................... 55
C JUnit test suites explained ............................................................................ 56
C.1 DBTestCase base class............................................................................ 56
C.2 TestDBIndex ......................................................................................... 57
C.3 TestDBIndexAddDoc ............................................................................... 57
C.4 TestDBIndexDelDoc................................................................................ 58
C.5 TestDBIndexParallel ............................................................................... 58
C.6 TestDBIndexSearchDoc........................................................................... 60
C.7 TestQueryHits........................................................................................ 61
D Functions, operators and utilities ................................................................... 63
D.1 lcontains ancillary operator ..................................................................... 63
D.2 lscore ancillary operator.......................................................................... 67
D.3 lhighlight ancillary operator ..................................................................... 68
D.4 phighlight pipeline table function.............................................................. 68
D.5 rhighlight pipeline table function .............................................................. 69
D.6 MoreLike.this function ............................................................................ 70
D.7 lfacets aggregate function ....................................................................... 71
D.8 index_terms pipeline table function .......................................................... 71
D.9 high_freq_terms pipeline table function .................................................... 72
D.10 DidYouMean package ............................................................................ 72
E Project Change Log ...................................................................................... 74
2.9.2.1.0 Production Release based on Lucene 2.9 (2.9.2) core base .................. 74
2.9.1.1.0 Production Release based on Lucene 2.9 (2.9.1) core base .................. 74
2.9.0.1.0 Production release based on Lucene 2.9.0 core base, 29/Sep/09 .......... 74
2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09) ................... 75
2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09) ................................. 75
2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09) ................... 75
2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)........................ 75
2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08) .............................. 76
2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08) ........................................... 76
2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07) ......................................... 76
2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07) ......... 76
2.2.0.1.1 (second release, 27/Sep/07 05:39 AM) ............................................. 76
2.2.0.1.0 (first release synchronized with lucene 2.2.0, 14/Sep/07 06:44 AM) ..... 77
2.0.0.1.3 (third release, 09/Jan/07 11:40 AM) ................................................. 77
2.0.0.1.2 (second release, 20/Dec/06 02:03 PM) ............................................. 78
2.0.0.1.1 (first release, 28/Nov/06 01:04 PM) ................................................. 78
2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM).................................... 78

1. Introduction
1.1 What is Lucene
Apache Lucene is a high-performance, full-featured text search engine library written
entirely in Java. It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.
Since Lucene is a pure Java framework, why not run it inside the Oracle Database JVM
environment?

1.2 What is Lucene Domain Index


Lucene Domain Index is a full integration of the Lucene project running inside the Oracle
database using the Oracle JVM.
Oracle provides a full-featured JVM inside the database, compliant with JDK 1.4 in the 10g
release and JDK 1.5 in 11g.
OJVMDirectory replaces Lucene's file-system storage with a BLOB-based storage; the name
comes from the Lucene class it overrides (Directory.java). Here is a short list of the points
taken into account in choosing this storage:
• Using a traditional file system to store the inverted index is not a good option for
some users: there is no commit or rollback behavior, no integrated backup, and so on.
• Storing the inverted index in a BLOB while running Lucene outside the Oracle
database performs poorly because of the many network round trips and the data
marshaling involved.
• Indexing relational data such as tables with VARCHAR2, CLOB or XMLType columns
with Lucene running outside the database has the same problem as the previous
point.
• The JVM included inside the Oracle database can scale to 10,000+ concurrent
sessions without memory leaks or deadlocks, and all operations on tables happen in
the same memory space.
In addition, Oracle provides a Data Cartridge interface (ODCI), also called the Extensible
Indexing mechanism, because it lets you write your own Domain Index and integrate it with
the Oracle engine and optimizer.
Integrating Lucene through ODCI brings some important benefits:
• Changes on rows are automatically notified to Lucene; these changes are enqueued
using Oracle AQ, and the user can control whether they are applied OnLine or
Deferred.
• The Oracle optimizer can choose a proper execution plan when a Domain Index is
present.
• You can mix the lcontains(), lhighlight() and lscore() operators in your queries.
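As a sketch of how these operators combine, assuming the single-column index it1 on t1(f2) created in the examples later in this document (the query string and column aliases are illustrative):

```sql
-- Sketch: lcontains() filters rows through the Lucene index,
-- lscore() returns each hit's relevance, lhighlight() the marked-up text.
-- Assumes a Lucene Domain Index it1 on t1(f2), as created in section 3.1.
SELECT f1, lscore(1) AS score, lhighlight(1) AS snippet
  FROM t1
 WHERE lcontains(f2, 'lucene AND oracle', 1) > 0
 ORDER BY lscore(1) DESC;
```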

1.3 Why do I use Lucene Domain Index?


Oracle includes a full-featured enterprise engine named Oracle Text, written in C and fully
integrated with the Oracle engine, but:
with Oracle Text you cannot:
• control which functionality will be included in the next release.
• easily customize it for your needs.
• index Index Organized Tables (IOT)
• index joined tables
• index unlimited extra columns
• easily highlight text
• index NCLOB and NVARCHAR data types
on 10g you cannot:
• index multiple columns in the same index
• sort and filter on indexed columns at the index level
on 11g you cannot:
• filter or sort on TIMESTAMP WITH TIME ZONE columns, commonly
used in XML DB because it is the official data type for the xsd:date type
using Lucene inside Oracle:
• indexes are usually small, because Lucene Domain Index does not store any column except
the rowid inside the Lucene inverted index structure. Using a rowid, Oracle can look up any
column value faster than retrieving it from the Lucene inverted index.
• supports padding for text columns
• supports formatting (rounding/padding) for number and date/time columns
• you can create an index on-line even on Standard Edition databases (for Oracle Text this
feature is available only in EE)
• by extending the DefaultUserDataStore class, an application can implement any data type
mapping, especially BLOB, which in common cases has a non-standard encoding
• an experimental native REST web service can be used to query the index
• the Lucene inverted index is transactional: if a SQL operation is rolled back, the index
remains consistent, avoiding phantom reads or false negatives (rows which should be
included as hits but were not present in the Lucene index)
• it is a ready-to-use, up-to-date solution for any programming language, for example Ruby,
.Net, Python or PHP
• it offers an elegant solution for highlighting text using pipelined table functions
• it provides a high-level abstraction layer over the Lucene IR library; developers only deal
with SQL
• transparent compression and encryption of the Lucene storage if you enable Oracle
Transparent Data Encryption and SecureFiles compression

2. Install
2.1 Requirements
• JDeveloper 11g only if you want to edit Java code.
• Ant 1.7.0
• Sun JDK 1.5.0_05/1.4.2 ($ORACLE_HOME/jdk directory works fine as Java Home for compiling
on 10g and 11g)
• Linux/Windows Database Oracle 10g 10.2/11g production

2.2 Install binary distributions


Binary distributions are available at SourceForge.net and provide a very straightforward
installation.

2.2.1 11g Binary Distribution

Edit your ~/build.properties file with your database values (Windows users can find the
build.properties file in the C:\Documents and Settings\username folder):

db.str=test
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4

db.str is the SQL*Net connect string for your target database; check it first with tnsping.
This is an example environment setting before installing on an 11g database:

MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/11.1.0.6.0/db_1
ORACLE_SID=test
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG

Upload, install and test the code in the database:

# ant install-ojvm
# ant test-domain-index

For Oracle 11g you can perform a post-installation step:

# ant jit-lucene-classes

This target forces the translation of all Lucene, Snowball and OJVMDirectory classes to
native code, instead of waiting for the database to compile them after detecting the most
used classes and methods.

2.2.2 10g Binary Distribution

First edit your ~/build.properties with something like this:

db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4

The db.str property is the SQL*Net connect string for the target database.


The ORACLE_HOME environment variable is required and must point to an Oracle 10g
database layout; finally, execute ant without arguments.
Here is an example environment setting for a 10g database:

MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/10.2.0/db_1
ORACLE_SID=orcl
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG
If you are re-installing the Oracle Lucene OJVM integration, first drop any Lucene Domain
Index that is not installed in the LUCENE schema.
The default target first drops the LUCENE schema if it exists. Additionally (recommended for
production systems), you can run "ant ncomp-ojvm", which translates all Lucene classes to
C using JAccelerator, for example:

# ant ncomp-ojvm
# ant test-domain-index

2.3 Install Instructions to compile from sources


- Unpack or check out the Lucene sources.
- Check out the OJVM sources; for now only anonymous CVS access is provided. You can
download them from the SourceForge servers with:

cd /tmp
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P
ojvm

- Copy to $LUCENE_ROOT/contrib

# cd $LUCENE_ROOT/contrib
# cp -rp /tmp/ojvm .

- Edit $LUCENE_ROOT/common-build.xml adding a target for creating a jar file with test
sources.

<target name="jar-test" depends="compile-test">
  <jar destfile="${build.dir}/${final.name}-test.jar" basedir="${build.dir}/classes/test"
       excludes="**/*.java"/>
</target>

- Also edit the same file at the target named test, adding the db.usr, db.pwd and db.str
properties as system properties so they are available to the Lucene Domain Index JUnit
suites.

<target name="test" depends="compile-test" description="Runs unit tests">


<fail unless="junit.present">

################################################################
JUnit not found.
Please make sure junit.jar is in ANT_HOME/lib, or made available
to Ant using other mechanisms like -lib or CLASSPATH.

################################################################
</fail>
............
<!-- contrib/ojvm uses these system properties to connect to the target database -->
<sysproperty key="db.str" value="${db.str}"/>
<sysproperty key="db.usr" value="${db.usr}"/>
<sysproperty key="db.pwd" value="${db.pwd}"/>
............
<delete file="${build.dir}/test/junitfailed.flag" />

</target>

- (OPTIONAL) Update Lucene's BufferedIndexInput.BUFFER_SIZE according to your
db_block_size init.ora parameter.
Before compiling and uploading the Lucene core library you can change the
org.apache.lucene.store.BufferedIndexInput.BUFFER_SIZE constant to the value of your
db_block_size init parameter; this change improves read performance by using the same
buffer size as the physical block size your database uses.
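To find the value to match, you can query the target database (a standard dictionary query; access to v$parameter is assumed):

```sql
-- Returns the physical block size in bytes (for example 8192);
-- use this value for BufferedIndexInput.BUFFER_SIZE before compiling.
SELECT value FROM v$parameter WHERE name = 'db_block_size';
```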
- Compile the OJVM Directory sources and tests; these targets automatically copy all the
libraries Lucene Domain Index requires from your $ORACLE_HOME and the Internet. Starting
with OJVM 2.4.0.1.x, the build.xml file automatically compiles all Lucene contrib module
dependencies.

# cd $LUCENE_ROOT/contrib/ojvm
# ant jar-core
# ant jar-test

- Edit your ~/build.properties file with your Database values:

db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4

db.str is the SQL*Net connect string for your target database; check it first with the tnsping
utility. Also note that for an 11g database the user and password are case sensitive, so leave
LUCENE in uppercase.
- Upload your code to the database

# ant install-ojvm

2.3.1 Generating Maven's artifacts

You can generate the Lucene and OJVM Directory Maven artifacts by following the previous
steps and then executing:

# ant generate-maven-artifacts
2.4 Optimizations

2.4.1 Using NCOMP on 10g

It is strongly recommended, before going into production, to install Oracle Lucene
Domain Index NCOMPed on 10g databases. NCOMP automatically translates the Lucene
and OJVMDirectory Java code to native code and installs it as a dynamic link
library (.so/.dll) in your Oracle home. To do this, simply execute this Ant task
instead of the install-ojvm target:

# ant ncomp-ojvm

2.4.2 Using JIT on 11g

First verify that your database parameter java_jit_enabled is TRUE. Oracle 11g
includes JIT technology which automatically translates the most used Java methods
to native code. If you want to pre-compile all the Lucene Java code and not
wait for the database to detect commonly used code, you can execute these targets:

ant jit-lucene-classes
ant jit-oracle-classes

2.5 Testing Lucene Domain Index

Required grants for regular Oracle users

IMPORTANT: Before starting to use Lucene Domain Index, grant this permission to any
Oracle user other than LUCENE:

-- connected as sysdba
begin
  dbms_java.grant_permission('SCOTT', 'SYS:java.util.logging.LoggingPermission',
                             'control', '');
  commit;
end;
/

Lucene Domain Index has two kinds of test suites to check that everything is OK after
installation.
The first test suite, which can be launched using Ant, is pure SQL and uses SQL*Plus; to
launch it, simply execute:

[mochoa@mochoa ojvm]$ ant test-domain-index


Buildfile: build.xml
[echo] Building ojvm...

test-domain-index:
[exec]
[exec] SQL*Plus: Release 11.1.0.6.0 - Production on Wed Dec 5 17:43:24 2007
[exec]
[exec] Copyright (c) 1982, 2007, Oracle. All rights reserved.
[exec]
[exec]
[exec] Connected to:
[exec] Oracle Database 11g Release 11.1.0.6.0 - Production
[exec]
[exec]
[exec] Table dropped.
[exec]
[exec]
[exec] Table created.
[exec]
[exec] SQL> Disconnected from Oracle Database 11g Release 11.1.0.6.0 - Production
[echo] See output at ../../build/testLuceneDomainIndex.txt

Except for the test that uses the test_source_small table, which writes its log to .trc files,
the others write their log information to the ../../build/testLuceneDomainIndex.txt file.

The second test suite is a set of JUnit tests that simulates middle-tier environments; it also
uses a connection pool. To start these suites, run:

[mochoa@mochoa ojvm]$ ant -Ddb.usr=scott -Ddb.pwd=tiger -Ddb.str=test "-Djunit.includes=**/AllTests.java" ojvm-test
Buildfile: build.xml
[echo] Building ojvm...

ojvm-test:
[echoproperties] #Ant properties
[echoproperties] #Wed Dec 05 17:56:30 ART 2007
.........
common.test:
[junit] Testsuite: org.apache.lucene.index.TestDBIndex
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.883 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 40 total bytes inserted: 421 avg text length: 10
[junit] Index synced: IT1 elapsed time: 249 ms.
[junit] Avg Sync time: 6
[junit] Index optimized: IT1 elapsed time: 46 ms.
[junit] Avg Optimize time: 1
[junit] Row deleted 41, from: 10 to: 50 elapsed time: 2005 ms. Avg time: 48 ms.
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] ------------- ---------------- ---------------
.............
[junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.158 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] iteration from: 13775 to: 13785
[junit] Step time: 1291 ms.
[junit] iteration from: 13785 to: 13795
[junit] Step time: 157 ms.
[junit] iteration from: 13795 to: 13805
[junit] Step time: 144 ms.
[junit] iteration from: 13805 to: 13815
[junit] Step time: 147 ms.
[junit] iteration from: 13815 to: 13825
[junit] Step time: 145 ms.
[junit] iteration from: 13825 to: 13835
[junit] Step time: 147 ms.
[junit] iteration from: 13835 to: 13845
[junit] Step time: 145 ms.
[junit] iteration from: 13845 to: 13855
[junit] Step time: 150 ms.
[junit] iteration from: 13855 to: 13865
[junit] Step time: 278 ms.
[junit] iteration from: 13865 to: 13875
[junit] Step time: 146 ms.
[junit] Elapsed time: 3159
[junit] Hits: 18387
[junit] Elapsed time: 653
[junit] ------------- ---------------- ---------------
[junit] Testsuite: org.apache.lucene.indexer.TestTableIndexer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.685 sec
[junit]
[delete] Deleting: /u01/src/lucene-2.2.0/build/contrib/ojvm/test/junitfailed.flag

BUILD SUCCESSFUL
Total time: 8 minutes 22 seconds

org.apache.lucene.indexer.TestQueryHits uses a table that is too big to create and
destroy in the setUp() and tearDown() methods. Before running this test, create the table
with:

create table test_source_big as (select * from all_source);

and its index on 10g with:

create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;ExtraCols:line "line"');

Or in 11g with:

create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;FormatCols:line(0000);ExtraCols:line "line";Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500;LobStorageParameters:PCTVERSION 0 DISABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');
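The FormatCols:line(0000) setting is what makes sorting and range queries on the numeric line column work: Lucene stores terms as text, so numbers must be zero-padded for lexicographic order to match numeric order. A minimal illustration, as a plain query against dual (not part of the index DDL):

```sql
-- '10' sorts before '2' as text, but '0010' sorts after '0002',
-- which is why FormatCols pads the line column to four digits.
SELECT line,
       TO_CHAR(line)           AS unpadded_term,
       TO_CHAR(line, 'FM0000') AS padded_term
  FROM (SELECT 2 AS line FROM dual
        UNION ALL
        SELECT 10 FROM dual)
 ORDER BY padded_term;
```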
If you want to execute only one specific test or a subset, override Ant's junit.includes
property, for example:

ant -Ddb.usr=scott -Ddb.pwd=tiger -Ddb.str=test "-Djunit.includes=**/TestDBIndex.java" ojvm-test

Note that this argument is enclosed in double quotes to prevent Unix shell expansion.

3. Examples

IMPORTANT: Before starting to use Lucene Domain Index, grant this permission to any
Oracle user other than LUCENE:

-- connected as sysdba
begin
  dbms_java.grant_permission('SCOTT', 'SYS:java.util.logging.LoggingPermission',
                             'control', '');
  commit;
end;
/

3.1 Create a Lucene Domain Index

3.1.1 Single column index

Table example:

create table t1 (
f1 number,
f2 varchar2(200),
f3 varchar2(200),
f4 number unique);

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer');

This creates a domain index on table t1, column f2, using SimpleAnalyzer as the Lucene
Analyzer.
After this DDL command is executed, two new tables, one AQ queue and one index exist in
the user's schema, named IT1$T, IT1$QT, IT1$Q and IT1$DI respectively.
Other example but instead using some Lucene Analyzer, using Snoball Stemmer.

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English');

This creates a domain index on column f2 of table t1 using the English stemmer.

IMPORTANT: A Lucene Domain Index name cannot be longer than 21 characters.
This limit is due to a limitation on Oracle AQ table names.
Every Lucene Domain Index has an associated AQ queue named
idx_name$Q and its queue table idx_name$QT.

3.1.2 Multiple columns

For the previous single-table example you can also index extra columns by passing the
information as an index parameter, because Oracle 10g does not support Domain Indexes on
compound columns. Here is an example:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;ExtraCols:F1 "f1"');

Creating an index with the ExtraCols parameter causes Lucene to index both columns: the
master column f2 indexed as F2, and F1 indexed as "f1". As you can see below in the query
examples, the lcontains() operator provides Lucene's Query Parser syntax, which can select
multiple fields using, for example, f1:text. Using the ExtraCols parameter implies that the
create index operation performs a full scan on table t1 with a syntax like:
SELECT ROWID,F2,F1 "f1" FROM T1.
Because the ODCI API will not detect changes on columns other than the master, you need to
create a trigger that fires an update on the master column when a change on the ExtraCols
list is detected. Here is an example:

CREATE OR REPLACE TRIGGER L$IT1
BEFORE UPDATE OF f1 ON t1
FOR EACH ROW
BEGIN
  :new.f2 := :new.f2;
END;
/

Any change on f1 will also force a change on f2; ODCI will then notify Lucene that a
specific rowid was updated, and Lucene Domain Index, based on its parameter definition, will
update the inverted index to reflect the changes in both columns.

3.1.3 Multiple tables

Lucene Domain Index supports indexing multiple columns across multiple tables that can
be joined in a natural form, by defining a list of tables with the ExtraTabs parameter and a
join condition with the WhereCondition parameter. Here is an example:

create table t2 (
f4 number primary key,
f5 VARCHAR2(200));
create table t1 (
f1 number,
f2 VARCHAR2(4000),
f3 number,
CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
REFERENCES t2(f4) ON DELETE cascade);

You can index both tables, using t1 as the master table, with:

create index it1 on t1(f3) indextype is lucene.LuceneIndex
parameters('ExtraCols:L$MT.f2 "f2",t2.f5 "f5";ExtraTabs:t2;WhereCondition:L$MT.f3=t2.f4');

Note that tables t1 and t2 are joined directly by a foreign key, so t2 can be considered a
satellite table of t1. With this set of parameters, when the ODCI API detects a change on
the it1 master column (f3), a select like this is executed:

select L$MT.f3,L$MT.f2 "f2",t2.f5 "f5" from t1 L$MT,t2 where L$MT.rowid=? and L$MT.f3=t2.f4;

Parts of this query are injected by the Lucene Domain Index implementation, and the rest
is extracted from the ExtraCols and ExtraTabs parameters. The table alias L$MT is
automatically added by Lucene Domain Index for the master table; this alias is important
for creating complex joins with Object Tables that use the existsNode or extractValue
operator, functionality added starting with the 2.9.0.1.0 release.
With the above scenario, a trigger for keeping the Lucene index synced with changes in any
column defined in the ExtraCols parameter is a bit more complex; it requires a combination
of two triggers:

CREATE OR REPLACE TRIGGER L$IT1
BEFORE UPDATE OF f2 ON t1
FOR EACH ROW
BEGIN
  :new.f3 := :new.f3;
END;
/
CREATE OR REPLACE TRIGGER LT$IT1
AFTER UPDATE OF f5 ON t2
FOR EACH ROW
DECLARE
ridlist sys.ODCIRidList;
BEGIN
SELECT ROWID
BULK COLLECT INTO ridlist
FROM T1 WHERE F3=:NEW.f4;
LuceneDomainIndex.enqueueChange(USER||'.IT1',ridlist,'update');
END;
/

The first trigger is similar to the previous example; the second trigger, on the satellite
table, looks up all rowids in the master table that reference the satellite row, then uses
the LuceneDomainIndex.enqueueChange procedure to notify Lucene Domain Index of the changes.
sys.ODCIRidList is a special ODCI structure that holds a group of rowids.
3.1.4 Padding and formatting

Lucene Domain Index can be customized with a parameter named UserDataStore, which defines
the class responsible for creating Lucene documents. Lucene documents are a list of Fields,
one for each indexed column, plus an extra field named rowid stored compressed and
untokenized. By default UserDataStore is set to
org.apache.lucene.indexer.DefaultUserDataStore.
The default User Data Store supports left zero-padding for NUMBER or FLOAT columns, and
left character padding for VARCHAR2 or CHAR columns. To define padding, the FormatCols
parameter can be used in a create or alter index DDL command. Here is an example:

create table t1 (f1 number primary key, f2 varchar2(200), f3 number(4,2))
ORGANIZATION INDEX;
insert into t1 values (1, 'ravi', 3.46);
insert into t1 values (3, 'murthy', 15.87);
commit;

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');

The example above shows that all values of column F2 will be automatically left-padded to
15 characters with z, and column F3 formatted using the 00.00 pattern; these rows will then
be indexed as the following Lucene documents:

Document<stored/compressed,indexed<rowid:*BAEAPBQCwQL+>
indexed,tokenized<F2:zzzzzzzzzzzravi> indexed<F3:03.46>>
Document<stored/compressed,indexed<rowid:*BAEAPBQCwQT+>
indexed,tokenized<F2:zzzzzzzzzmurthy> indexed<F3:15.87>>
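The two padding rules can be sketched in plain Java (no Lucene dependency; the class and method names here are illustrative, not part of Lucene Domain Index): DecimalFormat reproduces the 00.00 number pattern, and the character pattern is a simple left-fill.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class PaddingDemo {
    // Left-pads a value to the target width with the given fill character,
    // mirroring the character pattern F2(zzzzzzzzzzzzzzz).
    static String leftPad(String value, int width, char fill) {
        StringBuilder sb = new StringBuilder();
        for (int i = value.length(); i < width; i++) sb.append(fill);
        return sb.append(value).toString();
    }

    // Formats a number with the pattern 00.00, mirroring FormatCols:F3(00.00).
    static String numberPad(double value) {
        return new DecimalFormat("00.00", new DecimalFormatSymbols(Locale.US)).format(value);
    }

    public static void main(String[] args) {
        System.out.println(leftPad("ravi", 15, 'z'));   // zzzzzzzzzzzravi
        System.out.println(leftPad("murthy", 15, 'z')); // zzzzzzzzzmurthy
        System.out.println(numberPad(3.46));            // 03.46
        System.out.println(numberPad(15.87));           // 15.87
    }
}
```

Left-padding numbers this way keeps their lexicographic term order consistent with their numeric order, which is what makes range queries and sorting on these fields behave as expected.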

For columns based on Oracle XMLType, the FormatCols parameter can be used to define an
XPath expression that controls the subset of XML nodes to be indexed.

create table t1 (f1 VARCHAR2(10), f2 XMLType);
insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));
commit;

create index it1 on t1(f1) indextype is lucene.LuceneIndex
parameters('ExtraCols:F2;FormatCols:F1(000),F2(/emp/name)');

The rows above will be indexed as:

Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA>
indexed,tokenized<F1:001> indexed,tokenized<F2:ravi >>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB>
indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >>
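The node selection that an XPath format performs on an XMLType column can be sketched with the JDK's built-in XPath support (a standalone illustration; Lucene Domain Index does this inside the database, and the class name here is hypothetical):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathDemo {
    // Evaluates an XPath expression against an XML string and returns the
    // selected node's text, the way FormatCols:F2(/emp/name) narrows what
    // gets indexed for an XMLType column.
    static String extract(String xml, String path) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate(path, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<emp id=\"1\"><name>ravi</name></emp>", "/emp/name"));   // ravi
        System.out.println(extract("<emp id=\"3\"><name>murthy</name></emp>", "/emp/name")); // murthy
    }
}
```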

For columns of type VARCHAR/CHAR/CLOB the special strings NOT_ANALYZED,
NOT_ANALYZED_STORED, ANALYZED_WITH_OFFSETS, ANALYZED_WITH_POSITIONS and
ANALYZED_WITH_POSITIONS_OFFSETS can be used as the format. These constants tell the User
Data Store class whether a field will be stored and whether it is indexed untokenized;
untokenized fields can then be used as sort fields. Here is an example:

create table emails (
  emailFrom VARCHAR2(256),
  emailTo VARCHAR2(256),
  subject VARCHAR2(4000),
  emailDate DATE,
  bodyText CLOB)
/

create index emailbodyText on emails(bodyText) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo"');

-- required to sort by subject
alter index emailbodyText
parameters('LogLevel:INFO;FormatCols:subject(NOT_ANALYZED),emailFrom(NOT_ANALYZED),emailTo(NOT_ANALYZED)');

The translation rules to Lucene Fields are:


• NOT_ANALYZED - Field(name, value, Field.Store.NO, Field.Index.NOT_ANALYZED)
• NOT_ANALYZED_STORED - Field(name, value, Field.Store.YES,
Field.Index.NOT_ANALYZED)
• ANALYZED - Field(name, value, Field.Store.YES, Field.Index.ANALYZED,
Field.TermVector.NO)
• ANALYZED_WITH_VECTORS - Field(name, value, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.YES)
• ANALYZED_WITH_OFFSETS - Field(name, value, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.WITH_OFFSETS)
• ANALYZED_WITH_POSITIONS - Field(name, value, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS)
• ANALYZED_WITH_POSITIONS_OFFSETS - Field(name, value, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS)
If no format value is associated with a column of type VARCHAR/CHAR/CLOB, it is converted
to a Lucene Field as Field(name, val, Field.Store.NO, Field.Index.ANALYZED).
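The translation rules above can be captured in a plain-Java lookup (no Lucene dependency; the class name is illustrative, and the returned triple stands for the Field.Store, Field.Index and Field.TermVector values listed in the bullets):

```java
public class FieldFlags {
    // Returns {Field.Store, Field.Index, Field.TermVector} for a FormatCols
    // constant, following the translation rules listed above; the default
    // branch covers columns with no format value.
    static String[] flagsFor(String format) {
        switch (format) {
            case "NOT_ANALYZED":
                return new String[]{"NO", "NOT_ANALYZED", "NO"};
            case "NOT_ANALYZED_STORED":
                return new String[]{"YES", "NOT_ANALYZED", "NO"};
            case "ANALYZED":
                return new String[]{"YES", "ANALYZED", "NO"};
            case "ANALYZED_WITH_VECTORS":
                return new String[]{"YES", "ANALYZED", "YES"};
            case "ANALYZED_WITH_OFFSETS":
                return new String[]{"YES", "ANALYZED", "WITH_OFFSETS"};
            case "ANALYZED_WITH_POSITIONS":
                return new String[]{"YES", "ANALYZED", "WITH_POSITIONS"};
            case "ANALYZED_WITH_POSITIONS_OFFSETS":
                return new String[]{"YES", "ANALYZED", "WITH_POSITIONS_OFFSETS"};
            default: // no format: analyzed, not stored, no term vectors
                return new String[]{"NO", "ANALYZED", "NO"};
        }
    }

    public static void main(String[] args) {
        String[] f = flagsFor("NOT_ANALYZED_STORED");
        System.out.println(f[0] + "/" + f[1] + "/" + f[2]); // YES/NOT_ANALYZED/NO
    }
}
```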

3.1.5 Functional columns

The ExtraCols parameter can also define functional columns for the Lucene index, which
means any SQL function valid in a select list is allowed. For example, using the table
definition above:

create index it1 on t1(f1) indextype is lucene.LuceneIndex
parameters('ExtraCols:F2,extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),F2(/emp/name),id(00)');

This creates the following set of indexed Lucene documents:

Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA>
indexed,tokenized<F1:001> indexed,tokenized<F2:ravi > indexed,tokenized<id:01>>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB>
indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >
indexed,tokenized<id:03>>

Note that a virtual column was defined and indexed as "id"; this column is then available
to the lcontains operator.

3.1.6 Create OnLine index

If you put SyncMode:OnLine in the create index DDL operation, Lucene Domain Index
enqueues all rowids of the master table for indexing in batches of BatchCount rows
(default is 115). Immediately after the command returns, the index is ready, and a PL/SQL
AQ callback will populate the Lucene index structure in the background. For example:

create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('SyncMode:OnLine;LogLevel:WARNING;Stemmer:Spanish;ExtraCols:extractValue(object_value,''/page/title'') "title",extractValue(object_value,''/page/revision/comment'') "comment",extract(object_value,''/page/revision/text/text()'') "text",extractValue(object_value,''/page/revision/timestamp'') "revisionDate";IncludeMasterColumn:false;LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

3.1.7 Populate Index

Using PopulateIndex:false in the create index DDL statement causes the Lucene index
structure to be created empty while the Domain Index is ready. You can then call the alter
index rebuild DDL statement to populate it. Here is an example:

create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('PopulateIndex:false;LogLevel:ALL;IncludeMasterColumn:false;ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),id(00)');

-- At this point the it1 index is ready but not populated
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;
-- Populate the index
alter index it1 rebuild
parameters('Analyzer:org.apache.lucene.analysis.WhitespaceAnalyzer');
-- Query again; now it returns one row
select lscore(1),f2 from t1 where lcontains(f2, 'name:ravi',1) > 0;

3.1.8 Parallel Index Operations

Starting with Lucene Domain Index 2.9.1.1.0, you can enable parallel operations. The
ParallelDegree parameter, which can be 0 or 2 to 9, is implemented using multiple Data
Storages to process insert operations in parallel; this is useful when you have multi-core
chips or RAC environments. For now only inserts are parallelized, and the index must be
configured in OnLine mode. Here is an example of index creation with parallel inserts
enabled:
create index source_big_lidx on test_source_big(text)
indextype is lucene.luceneindex
parameters('BatchCount:1000;ParallelDegree:4;SyncMode:OnLine;LogLevel:INFO;AutoTuneMemory:true;ExtraCols:line "line"');

After this index DDL statement is executed, five new tables will be visible in the user's
schema: SOURCE_BIG_LIDX$T (master index storage) and SOURCE_BIG_LIDX$[0..3]$T (storage for
the slave processes); a sequence SOURCE_BIG_LIDX$S is also created, generating numbers
from 0 to 3.
The parallel implementation enqueues batches of 1000 rows (the BatchCount parameter) on
the master queue related to the index. The AQ callback enabled for this queue dequeues
each batch of rows and enqueues it on the slave queues; as a result, Oracle AQ will
execute multiple AQ server processes, and you can see multiple ora_j00x_sid processes
running.
With Oracle 11g we saw that the AQ implementation does not start new slave processes if
one callback is using a lot of CPU; my experience shows that a BatchCount parameter of up
to 250 leaves a level of backlog on the AQ queues which guarantees that multiple slave
processes will be executed, resulting in truly parallel insert operations.

3.2 Alter

The SQL DDL alter index command can be used with Lucene Domain Index to change any
parameter after index creation time. Lucene Domain Index parameters are a simple list of
name:value pairs stored in the Lucene OJVMDirectory storage. To remove a parameter from
the storage, prepend ~ to the parameter name.
Here are some examples of alter index:

alter index it1
parameters('MaxBufferedDocs:500;AutoTuneMemory:false');

This changes the Lucene IndexWriter parameter MaxBufferedDocs to 500 and disables the
Auto Tune Memory functionality.

alter index it1
parameters('MaxBufferedDocs:500;AutoTuneMemory:false;SyncMode:OnLine');

Similar to the previous example, but also setting SyncMode to OnLine.

alter index it1 parameters('~SyncMode:OnLine');

This disables SyncMode from the example above; you can get similar functionality by
setting SyncMode:Deferred, which is the default value for SyncMode.

3.2 Rebuild

The SQL DDL alter index command also allows you to rebuild an index from scratch. This is
useful when a Lucene Domain Index is damaged or corrupted, or when you need to change a
parameter that must be applied to rows already indexed, for example the Lucene Analyzer
parameter.

3.2.1 Manual

A manual index rebuild is the typical way; here is an example:

alter index it1 rebuild
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;MaxBufferedDocs:500;AutoTuneMemory:false');

The example above shows how to change the Lucene index Analyzer. If you change your index
Analyzer it is necessary to rebuild the complete index, because you should not query an
index with an analyzer different from the one used at index time.

3.2.2 On Line

Alter index rebuild does not return until the complete operation is finished. Rebuild
online is functionality of Oracle indexes available in Enterprise Edition databases, but
with a little trick you can rebuild a Lucene Domain Index online too.
If you are working with SyncMode:Deferred you need to change to SyncMode:OnLine; then you
can rebuild the index by using:

alter index it1 rebuild
parameters('SyncMode:OnLine;MergeFactor:100;BatchCount:1000');
commit; -- notify change to AQ Callback

The rebuild command enqueues batches of 1000 rowids of the master table (it1) for addition
to the Lucene index structure; the Lucene Domain Index AQ callback then processes these
messages using background database processes and automatically commits the changes when it
finishes.

3.3 Drop

Dropping a Lucene index does not differ from dropping any other index. Just call:

drop index it1;

This operation drops the Lucene Domain Index table (IT1$T in the example above) and the AQ
queue IT1$Q with its storage table IT1$QT. If the index is configured with
SyncMode:OnLine, the AQ callback is disabled first.
If something goes wrong during the drop index command, you can add "force" at the end of
the command to be sure that the system views will not keep any reference to the index.

3.4 Querying

Lucene Domain Index defines a new SQL operator named lcontains() with its ancillary
operators lscore() and lhighlight(); their functionality is similar to the Oracle Text
contains and score operators. The next examples show operator functionality and parameters.
3.4.1 Simple columns

For the table and index defined in sections 3.1.4/3.1.5, a simple usage of lcontains and
lscore is:

SQL> select lscore(1),f2 from t1 where lcontains(f1, 'F2:ravi',1) > 0;

LSCORE(1)
----------
F2
----------------------------------------------------------------------------------------------------
----------------------------------------
1
<emp id="1">
<name>ravi</name>
</emp>
SQL>

The first parameter of the lcontains operator is the column that has the Lucene Domain
Index attached; this is the master column of the index and the default field for the Query
Parser syntax.
The second parameter is a Lucene Query Parser syntax string. The table example above has
the Lucene Domain Index defined on column f1, so F2 is not the default field for the
query; with this definition, to query for a string inside column F2 it is necessary to
explicitly prefix "F2:".
If you want to use lscore it is necessary to specify a correlation id as the third
argument of lcontains, in this example "1"; this correlation id then matches lscore(1) to
associate the ancillary operator with the proper lcontains.
If you are querying the master column of the index you can simply omit the column
qualifier, for example:

SQL> select lscore(1) from t1 where lcontains(f1,'001',1)>0;

LSCORE(1)
----------
1
SQL>

The lcontains() operator must always be compared with > 0.

3.4.2 Multiple columns

Query Parser syntax supports many logical operators and term modifiers; you can combine
any of them with each indexed column. Here is a practical example using the table and
index from sections 3.1.4/3.1.5:

SQL> select f1,lscore(1) sc,extractValue(f2,'/emp/@id') id from t1
  2 where lcontains(f1, '003 OR (F2:ravi AND id:01)',1)>0;

F1 SC ID
---------- ---------------- -----------------------------
1 .577350259 1
3 .288675129 3

Note that the first row matches column F2:ravi and the functional column id:01, and the
second row matches F1 equal to 003 (remember the F1 qualifier is not necessary because it
is the master column of the index defined in 3.1.5).

3.4.3 Pagination

The lcontains operator has an extension to the Query Parser syntax to include in-line
pagination information for the Lucene Domain Index hits result.
You can select a specific window (page) of your query by injecting a Query Parser-like
range inside the lcontains() operator. For example:

select /*+ DOMAIN_INDEX_SORT */ ... where lcontains(col,'rownum:[20 TO 40] AND word1',1)>0 order by lscore(1) DESC;

The Lucene Domain Index implementation automatically extracts the pagination information
rownum:[n TO m] AND from the beginning of the query syntax and returns to the Oracle
optimizer a subset of 20 rowids. This extension provides a large performance improvement
by eliminating the need for Oracle's Top-N syntax, which must collect all rowids and then
filter to compute the window.
Because in-line pagination is a home-made extension to the Query Parser syntax, it has two
limitations or known caveats:
• rownum:[n TO m] AND must be at the beginning of the Query Parser string, exactly as
shown; the string positions of the rownum and AND reserved keywords are simply used to
extract the start and stop indexes of the window.
• the pagination is concatenated using the AND boolean operator, but it is not strictly
an AND operator because it is a simple substring-splitting operation; this means AND/OR
grouping priority is not applied to the first part: rownum:[n TO m] AND xx OR bb should be
evaluated as (rownum:[n TO m] AND xx) OR bb, but instead Lucene will search xx OR bb and
ODCI will extract the n TO m window.
Note: the DOMAIN_INDEX_SORT optimizer hint is required for sorting; see the next section.
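The substring-splitting behavior described in the caveats above can be sketched in plain Java (the names are illustrative; this is not the actual ODCI implementation). It returns null when no pagination prefix is present:

```java
public class RownumParser {
    // Holds the extracted window and the remaining Query Parser string.
    static final class Window {
        final int start, stop;
        final String query;
        Window(int start, int stop, String query) {
            this.start = start; this.stop = stop; this.query = query;
        }
    }

    // Strips a leading "rownum:[n TO m] AND " using plain string positions
    // of the rownum and AND keywords, exactly as the caveats describe:
    // no boolean precedence is considered, the prefix is simply cut off.
    static Window parse(String qry) {
        if (!qry.startsWith("rownum:[")) return null;
        int close = qry.indexOf(']');
        String[] range = qry.substring(8, close).split(" TO ");
        int and = qry.indexOf("AND", close);
        return new Window(Integer.parseInt(range[0].trim()),
                          Integer.parseInt(range[1].trim()),
                          qry.substring(and + 3).trim());
    }

    public static void main(String[] args) {
        Window w = parse("rownum:[20 TO 40] AND word1");
        System.out.println(w.start + ".." + w.stop + " -> " + w.query); // 20..40 -> word1
    }
}
```

Note how parsing "rownum:[20 TO 40] AND xx OR bb" leaves the rest of the string as "xx OR bb", which reproduces the second caveat: the remainder is searched as a whole, with no grouping applied.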

3.4.4 Sort

Lucene provides sorting over the result of a particular query; Lucene Domain Index exposes
it through an extra argument of the lcontains() operator. Here are examples of sorting
using the emails table created in section 3.1.4:

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC',1)>0;

SUBJECT
----------------------------------------------------------------------------------------------------
----------------------------------------
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
lucene injection
Elapsed: 00:00:00.04
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
2 where lcontains(bodytext,'security','subject:DESC',1)>0;

SUBJECT
----------------------------------------------------------------------------------------------------
----------------------------------------
lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection

Elapsed: 00:00:00.17

The sort parameter syntax is a comma-separated string of field[:ORDER[:TYPE]] values;
fields included in the sorting spec should be NOT_ANALYZED or NOT_ANALYZED_STORED (see
the FormatCols argument above). ORDER can be ASC or DESC; the default value is ASC. TYPE
can be string, float or int; starting with Lucene 2.9.0 the default value is string.
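The field[:ORDER[:TYPE]] grammar can be sketched as a small parser in plain Java (illustrative, not the actual implementation), applying the ASC and string defaults described above:

```java
public class SortSpec {
    final String field, order, type;
    SortSpec(String field, String order, String type) {
        this.field = field; this.order = order; this.type = type;
    }

    // Parses a comma-separated sort string; for each entry, ORDER defaults
    // to ASC and TYPE to string, as documented above.
    static SortSpec[] parse(String spec) {
        String[] parts = spec.split(",");
        SortSpec[] out = new SortSpec[parts.length];
        for (int i = 0; i < parts.length; i++) {
            String[] p = parts[i].split(":");
            out[i] = new SortSpec(p[0],
                                  p.length > 1 ? p[1] : "ASC",
                                  p.length > 2 ? p[2] : "string");
        }
        return out;
    }

    public static void main(String[] args) {
        for (SortSpec s : parse("subject:ASC:string,emailFrom:DESC")) {
            System.out.println(s.field + " " + s.order + " " + s.type);
        }
    }
}
```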
Using the table above, a slightly more complicated sort spec is:

SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
  2 where lcontains(bodytext,'security','subject:ASC:string,emailFrom:DESC:string',1)>0;

SUBJECT EMAILFROM
----------------------------------------- ----------------------------------------------------------
---------------
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
lucene injection [email protected]

Elapsed: 00:00:00.06
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
2 where lcontains(bodytext,'security','subject:ASC:string,emailFrom:ASC:string',1)>0;

SUBJECT EMAILFROM
------------------------------------------ ---------------------------------------------------------
---------------
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
lucene injection [email protected]

Elapsed: 00:00:00.05
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
2 where lcontains(bodytext,'security',1)>0;

SUBJECT EMAILFROM
------------------------------------------- --------------------------------------------------------
--------------
lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]

Elapsed: 00:00:00.09

The last query doesn't include a sort, so it is sorted by score. An abbreviated syntax for
the sort string is ASC or DESC, meaning sort by score ascending or descending; this short
format is equivalent to using order by syntax with the lscore operator, for example:

SQL> SELECT lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security',1)>0;

LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------------------------
-------------------
.241440386 lucene injection
.22763218 Re: lucene injection
.199178159 Re: lucene injection
.140840232 Re: lucene injection
.140840232 Re: lucene injection

Elapsed: 00:00:00.10
SQL> SELECT lscore(1),subject FROM emails
2 where lcontains(bodytext,'security',1)>0 order by lscore(1) asc;

LSCORE(1) SUBJECT
------------- ------------------------------------------------------------------------------------
-----------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection

Elapsed: 00:00:00.11
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
2 where lcontains(bodytext,'security','subject:DESC',1)>0;

LSCORE(1) SUBJECT
------------- ------------------------------------------------------------------------------------
----------------
.241440386 lucene injection
.22763218 Re: lucene injection
.199178159 Re: lucene injection
.140840232 Re: lucene injection
.140840232 Re: lucene injection

Elapsed: 00:00:00.07
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
2 where lcontains(bodytext,'security','subject:ASC',1)>0;

LSCORE(1) SUBJECT
------------- -----------------------------------------------------------------------------------
--------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection

Elapsed: 00:00:00.07

The first example uses the default sort by score descending; the second uses order by
syntax, overriding the default sort and changing it to score ascending; the others are
equivalent but use the lcontains sort argument string.
Note that if you are using the lcontains sort string, you have to add the
DOMAIN_INDEX_SORT optimizer hint; this hint tells the Oracle optimizer that the order of
the rows will be dictated by Lucene Domain Index.
Using lscore(anc_id) in conjunction with lcontains(column,query,sort_str,anc_id) makes no
sense and produces extra overhead on the score computation which can be avoided: if you
are querying a Lucene Domain Index and want the result ordered by columns other than
relevance, why compute the score? AVOID the lscore() function in the select list and you
will get a faster query. For example:

SQL> SELECT /*+ DOMAIN_INDEX_SORT BAD */ lscore(1),subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC',1)>0;
LSCORE(1) SUBJECT
------------- ----------------------------------------------------------------------
---------------------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection
Elapsed: 00:00:00.07

SQL> SELECT /*+ DOMAIN_INDEX_SORT GOOD */ subject FROM emails
  2 where lcontains(bodytext,'security','subject:ASC')>0;
SUBJECT
--------------------------------------------------------------------------------------
-----------
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
lucene injection
Elapsed: 00:00:00.02

3.4.5 Count Hits Function

The count hits function is a Lucene Domain Index optimization that replaces SQL count(*)
functionality. It is extremely fast because no rowid information is passed from the Lucene
Data Cartridge to the Oracle engine to count matching rows. Here is an example:
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits from dual;

HITS
----------
5

Elapsed: 00:00:00.02

The first argument of the count hits function is a string with the Lucene Domain Index
name (IDX_NAME); the second argument is a Query Parser syntax string equal to the second
argument of the lcontains function. Optionally you can use a three-argument version of
countHits to use an index in another schema: the first argument is the schema, the second
is the index name, and the last one is the Query Parser syntax string. After a count hits
call you can use a select with the lcontains function; if the count hits query matches the
lcontains query, lcontains will have cached information for returning the matching rowids.
Here are some examples of count hits and its correlated query using cached results:

SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits from dual;

HITS
----------
5

Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
2 where lcontains(bodytext,'security',1)>0;

EMAILFROM
----------------------------------------------------------------------------------------------------
----------------------------------------
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Elapsed: 00:00:00.08
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') from dual;

LUCENEDOMAININDEX.COUNTHITS('EMAILBODYTEXT','SECURITY')
------------------------------------------------------------------------------
5

Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
2 where lcontains(bodytext,'security','emailFrom:ASC',1)>0;

EMAILFROM
----------------------------------------------------------------------------------------------------
----------------------------------------
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Elapsed: 00:00:00.04

In both queries lcontains found a cached hits structure evaluated by the count hits
function. Lucene Domain Index stores cached hits information; to locate it, a key composed
of sort_string(QueryParser.toString()) is used, so both arguments of count hits and
lcontains should match to reuse a cached hits structure. For the last query example the
string emailFrom:(security) is used as the key.

3.4.6 First Rows Optimizer Hint

Starting with the 2.4.0.1.0 release we have replaced the deprecated Lucene Hits class with
the TopDocs class. If you use the FIRST_ROWS optimizer hint in conjunction with lcontains
in-line pagination, Lucene Domain Index will call TopDocs to get only the first M hits.
For example:

SQL> select /*+ FIRST_ROWS DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title')
  2 from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

FIRST_ROWS and rownum:[1 TO 10] tell Lucene Domain Index to perform a Lucene query for
the first 10 hits only. A following query with rownum:[10 TO 20] will have most of the
Lucene structures cached in memory, such as the Searcher and the ROWID<->Lucene DocID
association, but it will re-query the Lucene index to get the first 20 hits (1..20). This
cache-miss behavior could be seen as a bad solution, but it is extremely useful if 90% of
queries only visit the first page of hits, the typical behavior of Internet search.
On the other hand, if you omit the FIRST_ROWS optimizer hint, Oracle by default switches
to ALL_ROWS mode, which means: if you are using pagination (rownum:[n TO m]) with m
greater than 2000, Lucene Domain Index will fetch the first m hits, but if m is lower than
2000, Lucene Domain Index will try to fetch 2000 hits by default. The magic number 2000
comes from the Oracle ODCI API calling the ODCIFetch routine in batches of 2000 rowids.
If FIRST_ROWS and in-line pagination are not included in the query, Lucene Domain Index
works in batches of 2000 hits, causing several cache misses in a full-scan mode. For
example, the query:

SQL> select count(*) from pages where lcontains(object_value,'musica tango rock')>0;

causes Lucene Domain Index to fetch the first 2000 hits; then, knowing that the hits
length is 2736, it re-fetches (cache miss) all 2736 hits. Obviously you can use the
LuceneDomainIndex.countHits() function to count hits faster than with the previous query.

3.4.7 Highlighting

The lhighlight ancillary operator works like lscore but returns a VARCHAR2 text with the
words matched during the evaluation of the lcontains function highlighted. The tag used to
mark matching words is not customizable yet and is <B>; the fragment separator and the
maximum number of fragments were also constant (... and 4, respectively). Starting with
the 2.4.1.1.0 release these parameters are customizable through the alter index ...
parameters() DDL command. A highlighting example:
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
2 FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

TXT SC
SUBJECT
On Dec 21, 2006, at 4:56 AM, Deepan wrote:> I am bothered about <B>security</B>
.27477634 Re: lucene injection
problems with lucene. Is it vulnerable to> any kind of injection like <B>mysql</
B> injection? many times the query from> user is passed to lucene for search wit
hout validating.Rest easy. There are no known <B>security</B> issues with Lucen
e, and ithas even undergone a recent static code analysis by Fortify (see theluc
ene-dev e-mail list

Highlighting only works with columns of type VARCHAR2, CLOB and XMLType. You can perform
the highlighting operation even if your master column is not indexed/stored, for example
for an index created with:

create index pages_lidx_all on pages p (value(p))
indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:INFO;
Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;
ExtraCols:extractValue(object_value,''/page/title'') "title",
extractValue(object_value,''/page/revision/comment'') "comment",
extract(object_value,''/page/revision/text/text()'') "text",
extractValue(object_value,''/page/revision/timestamp'') "revisionDate";
FormatCols:revisionDate(day);IncludeMasterColumn:false;
LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE
READS FILESYSTEM_LIKE_LOGGING');

SQL> select /*+ DOMAIN_INDEX_SORT */ lhighlight(1),extractValue(object_value,'/page/title')
  2  from pages where lcontains(object_value,'rownum:[1 TO 10] AND (musica tango rock)',1)>0;

<page xmlns="http://www.mediawiki.org/xml/export-0.3/">
<title><B>Música</B> de Argentina... [[Latinoamérica|latinoamericanos]] con más desarrollo en su
[[<B>música</B>]].

Se encuentra una gran... argentinos, un instrumento tradicional andino]]


Aún se mantiene la <B>música</B> de los [[Indígenas_en_Argentina... de grandes corrientes de
[[inmigración|inmigrantes]] europeos, la <B>música</B> argentina se enriqueció
Música de Argentina
musical emparentado con la [[habanera]] y el [[<B>tango</B> (<B>música</B>)|<B>tango</B>]].

==Diferencias con el <B>tango</B>==

Aunque tanto la milonga como el <B>tango</B> están en [[compás]] de 2/4, las 8 [[semicorchea]]s de la
milonga están distribuidas en 3 + 3 + 2 en cambio el <B>tango</B> posee un ritmo más «cuadrado». Las
letras...]] criticó en algún momento el <B>tango</B> y prefirió la milonga, que no trasmite la melancolía
Milonga (género musical)

The index creation DDL has IncludeMasterColumn:false, which means the whole XMLType
representation of the Spanish Wikipedia page dump is not indexed; only the virtual columns
title, comment, text and revisionDate are processed by Lucene. However, the TextHighlight
Java function attached to the lhighlight operator receives the XMLType from the RDBMS
engine, so it calls the Lucene Highlighter class with the whole XMLType object (note that
page titles appear in bold only to separate rows in the output).
Parameters supported by the highlighting functions are:
• Formatter, a valid class name which implements the Lucene Formatter interface
and has a no-argument constructor; default value
org.apache.lucene.search.highlight.SimpleHTMLFormatter.
• MaxNumFragmentsRequired, the number of text fragments returned by the highlight
function; default value 4.
• FragmentSize, the size of each fragment returned; default value 100.
• FragmentSeparator, the string used as fragment separator; default is "...". Note that
you cannot use ";" or ":" as the fragment separator because they are used as parameter
and value delimiters in the alter index ... parameters(..) DDL statement.
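For example, all four parameters can be changed in a single DDL statement. This is an illustrative sketch (the parameter values shown are arbitrary; remember that the separator may not contain ";" or ":"):

```sql
alter index emailbodytext parameters(
'Formatter:org.apache.lucene.search.highlight.SimpleHTMLFormatter;MaxNumFragmentsRequired:3;FragmentSize:80;FragmentSeparator: | ');
```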
There is no customization allowed by passing constructor arguments to the Formatter class,
but you can easily create your own Formatter which calls SimpleHTMLFormatter with
arguments; your Formatter will look like:

create or replace and compile java source named
"org.apache.lucene.search.highlight.MyHTMLFormatter" as
package org.apache.lucene.search.highlight;

public class MyHTMLFormatter extends SimpleHTMLFormatter {
  public MyHTMLFormatter() {
super("<span class=\"myhighlightclass\">","</span>");
}
}
/
show errors

alter index emailbodyText
parameters('Formatter:org.apache.lucene.search.highlight.MyHTMLFormatter;MaxNumFragmentsRe

3.4.7 Highlighting using pipeline table functions

phighlight and rhighlight provide more general usage patterns of the Lucene highlighting
functionality. phighlight receives a SQL query as a string and performs highlighting on a set
of user-defined columns of the query result. rhighlight receives a SYS_REFCURSOR
argument and performs highlighting on a set of user-defined query columns; unlike
phighlight, rhighlight requires that the user define a return type for the query, usually
a TABLE OF collection, because with a SYS_REFCURSOR argument there is no way to
know the return type of the query at compilation time.
Both phighlight and rhighlight support the highlighting parameters defined during create
index or alter index DDL statements; see section 3.4.6 for more information.
Here are two examples of the highlighting features using pipeline table functions; table emails
is the example table/index of section 3.1.4:

SELECT * FROM
TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,''security OR mysql'',''subject:ASC'',1)>0'
));

SELECT * FROM
TABLE(rhighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'EMAILRSET',
CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0)
));

The first three arguments of both pipeline functions are the same: the Lucene Domain Index
to use, the Lucene Query Syntax argument (it should match the lcontains argument) and
the columns of the query which will be highlighted.
The last argument of phighlight is a VARCHAR2 with the SQL query to be executed
by the DBMS_SQL package; note the doubled single quotes used as the escape sequence
to encode SQL single-quote characters.
For rhighlight two more arguments are required. The first is the type returned by the cursor,
in this example EMAILRSET, defined as:

CREATE TYPE EMAILR AS OBJECT
(
sc NUMBER,
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB
);

CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;

Note that EMAILR is a record which holds all columns of the table EMAILS plus the score
returned by the lscore() function; EMAILRSET is then a simple collection type, TABLE OF
EMAILR, which is the required type for the CURSOR value.
Finally, the last argument is of CURSOR type, which accepts any SQL query.

3.4.8 More like this functionality

The Lucene "More Like This" functionality is wrapped by the following function:
MoreLike.this(index_name IN VARCHAR2,
x IN ROWID,
f IN NUMBER DEFAULT 1,
t IN NUMBER DEFAULT 10,
minTermFreq IN NUMBER DEFAULT 2,
minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist
where index_name can be an owner.index_name pair as in other Lucene Domain Index
procedures. A typical use case is:
select rowid,lscore(1),text from test_source_big where lcontains(text,'"procedure
java"~10',1)>0 order by lscore(1) desc;
AAAOaPAAEAAAAnnABV 1.00000003 procedure (C, Java or PL/SQL), optionally
qualified
AAAOaPAAEAAAA0aAAV .84852819 STATIC PROCEDURE refreshParameterCache as
LANGUAGE JAVA NAME
.........
declare
ridlist sys.odciridlist;
begin
ridlist :=
MoreLike.this(index_name=>'SOURCE_BIG_LIDX',x=>'AAAOaPAAEAAAAnnABV',minTermFreq=>1);
FOR i IN (select rowid,text from test_source_big where rowid in (select * from
table(ridlist_table(ridlist)))) LOOP
dbms_output.put_line('rowid: '||i.rowid||' text: '||i.text);
END LOOP;
end;
/
rowid: AAAOaPAAEAAAAhLAAc text: -- after issuing insert, update, delete or anonymous
PL/SQL calls
rowid: AAAOaPAAEAAAAjrAAo text: -- QUALIFIED_SQL_NAME
rowid: AAAOaPAAEAAAAk5AAe text: -- ORA-06502: PL/SQL: numeric or value error:
character string buffer
.....
rowid: AAAOaPAAEAAAAtXAAb text: -- The name of the Java class, PL/SQL package or
object type implementing

Note that the anonymous PL/SQL block takes the first ROWID returned by the first
query as a pivot, then expands the result set with other rows which also include terms like
"procedure (C, Java or PL/SQL), optionally qualified"; "C" is not taken into account because
it was eliminated as a stop word.
Refer to Appendix D.6 for a full explanation of each parameter.

3.4.9 Facets

Starting with Lucene Domain Index 2.4.1.1.0, the Lucene facets functionality is available
through a SQL aggregate function, lfacets():
lfacets(index_name_and_categories IN VARCHAR2
) RETURN LUCENE.agg_tbl
where index_name_and_categories is an encoded string with the Lucene index name and the
categories; an aggregate function only accepts one scalar value as argument, so the index
and categories must be encoded in a comma-separated list, for example:

SQL> select lfacets('SOURCE_BIG_LIDX,TEXT:procedure,TEXT:java') from dual;

Using the index created in the example of section 2.5, Testing Lucene Domain Index: the
index name can use SCHEMA.IDX_NAME syntax, and there can be one or two categories,
expressed in Lucene Query Syntax. In the above example TEXT is the indexed column,
procedure is the main category and java the sub-category.
Creating a table with categories and linking the rows to their parent is an option to
automatically generate facets, for example:
create table source_categories (
cat_code number(4),
cat_name varchar2(256),
cat_parent number(4),
CONSTRAINT PK_SOURCE_CATEGORIES PRIMARY KEY (cat_code),
CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
REFERENCES source_categories (cat_code)
);
insert into source_categories values (1,'TEXT:procedure',null);
insert into source_categories values (2,'TEXT:function',null);
...
insert into source_categories values (6,'TEXT:java',1);
insert into source_categories values (7,'TEXT:(pl sql)',1);
insert into source_categories values (8,'TEXT:wrapped',1);
...
insert into source_categories values (21,'line:[1 TO 1000]',1);
insert into source_categories values (22,'line:[1001 TO 2000]',1);
insert into source_categories values (23,'line:[2001 TO 3000]',1);

Now we can query the above table, calling lfacets with the category and sub-category:

SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||
case level when 1 then cat_name
ELSE PRIOR cat_name||','|| cat_name
END
)), cat_code,level
FROM source_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_code,level;

LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASELEVELWHEN1THENCAT_NAMEELSEPRIORCAT_NAME||','||CAT_NAMEEND))
CAT_CODE LEVEL
TEXT:procedure(5116)
1 1
TEXT:function(5574)
2 1
TEXT:trigger(96)
3 1
TEXT:package(860)
4 1
TEXT:(object type)(5140)
5 1
TEXT:procedure,TEXT:java(9)
6 2
.....
TEXT:procedure,line:[1 TO 1000](3)
21 2
TEXT:procedure,line:[1001 TO 2000](615)
22 2
...
SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||
case level when 1 then cat_name
ELSE PRIOR cat_name||','|| cat_name
END
)), cat_parent
FROM source_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;

LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASELEVELWHEN1THENCAT_NAMEELSEPRIORCAT_NAME||','||CAT_NAMEEND))
CAT_PARENT
-----------------------------------------------------------------------------------------------------------------------------
------------ ---------------
TEXT:procedure,TEXT:java(11),TEXT:procedure,TEXT:(pl sql)(70),TEXT:procedure,line:[1 TO
1000](3),TEXT:procedure,TEXT:wrapped(21),TEXT:proced
ure,line:[1001 TO 2000](675),TEXT:procedure,line:[3001 TO 4000](105),TEXT:procedure,line:[4001 TO
5000](10),TEXT:procedure,line:[2001 TO 300
0](199) 1
TEXT:function,TEXT:java(22),TEXT:function,TEXT:wrapped(85),TEXT:function,line:[1 TO
1000](0),TEXT:function,TEXT:(pl sql)(87),TEXT:function,l
ine:[1001 TO 2000](835),TEXT:function,line:[3001 TO 4000](21),TEXT:function,line:[4001 TO
5000](0),TEXT:function,line:[2001 TO 3000](338) 2
TEXT:trigger,TEXT:java(1),TEXT:trigger,line:[1 TO
1000](0),TEXT:trigger,TEXT:wrapped(0),TEXT:trigger,TEXT:(pl sql)(1),TEXT:trigger,line:[100
1 TO 2000](33),TEXT:trigger,line:[3001 TO 4000](0),TEXT:trigger,line:[4001 TO
5000](0),TEXT:trigger,line:[2001 TO 3000](0) 3
TEXT:package,TEXT:java(7),TEXT:package,line:[1 TO 1000](0),TEXT:package,TEXT:(pl
sql)(25),TEXT:package,TEXT:wrapped(137),TEXT:package,line:[
1001 TO 2000](54),TEXT:package,line:[3001 TO 4000](5),TEXT:package,line:[4001 TO
5000](0),TEXT:package,line:[2001 TO 3000](5) 4
TEXT:(object type),TEXT:java(56),TEXT:(object type),TEXT:(pl sql)(106),TEXT:(object type),line:[1 TO
1000](1),TEXT:(object type),TEXT:wrappe
d(76),TEXT:(object type),line:[1001 TO 2000](441),TEXT:(object type),line:[4001 TO 5000](0),TEXT:(object
type),line:[3001 TO 4000](28),TEXT:
(object type),line:[2001 TO 3000](119) 5
TEXT:procedure(5574),TEXT:(object type)(5584),TEXT:package(868),TEXT:trigger(114),TEXT:function(6167)

6 rows selected.

Note that we are using the ljoin() function, which converts the agg_tbl type to a
comma-separated string plus its cardinality. The first row does not have a sub-category
because its parent column is null, so 5116 is the number of rows which include the text
procedure. The last row shown includes a category and a sub-category:
TEXT:procedure,line:[1001 TO 2000] implies the bit AND intersection between the set of
rows which include procedure and the set of rows which match line:[1001 TO 2000]. The
group by cat_code causes the Oracle ODCI API to first calculate the bit set for procedure
and then iterate over all its sub-categories (java, pl sql, wrapped), doing the bit AND
intersections; this is fast, and once a facet is computed it is stored as a Filter in the
Lucene Domain Index memory structures.
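The bit-set intersection described above can be sketched in a few lines of Python. This is a toy model (plain sets standing in for Lucene bit-set Filters, with invented row ids), not the actual Java implementation:

```python
# Toy model of faceted counting via bit-set intersection.
# Each set holds the row ids matching one category query; in the real
# index these are cached Lucene Filters over document ids.
rows_procedure = {1, 2, 3, 5, 8, 13}   # rows containing "procedure"
rows_java      = {2, 3, 21}            # rows containing "java"
rows_line_1k   = {1, 2, 5, 34}         # rows with line in [1001 TO 2000]

def facet_count(parent_rows, child_rows):
    """Facet cardinality = size of the AND intersection of the two bit sets."""
    return len(parent_rows & child_rows)

print(facet_count(rows_procedure, rows_java))     # prints 2
print(facet_count(rows_procedure, rows_line_1k))  # prints 3
```

Computing the parent bit set once and intersecting it with each child set is what makes iterating over all sub-categories of one parent cheap.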
When the number of rows or the amount of categories is big, we can use a materialized view
to act as a cache of the facets computation. For example:

CREATE MATERIALIZED VIEW source_facets
AS
select ljoin(lfacets('SOURCE_BIG_LIDX,'||
case level when 1 then cat_name
ELSE PRIOR cat_name||','|| cat_name
END
)), cat_code,level
FROM source_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_code,level;

Now the source_facets materialized view can be queried like any other table, and access to it
will be very fast. The materialized view can then be refreshed by the application at a specific
point in time.

3.4.10 Terms pipeline table functions

Starting with Lucene Domain Index 2.9.1.1.0, two pipeline table functions have been
included to iterate over the terms of the Lucene index structure. The first, high_freq_terms():
FUNCTION high_freq_terms(index_name VARCHAR2,
term_name VARCHAR2,
num_terms NUMBER) RETURN term_info_set
is available for getting the Top-N (num_terms) most used terms in the whole index or in
a particular field. term_info_set is defined as:
TYPE term_info AS OBJECT (
term VARCHAR2(4000),
docFreq NUMBER(10)
);
TYPE term_info_set AS TABLE OF term_info;
You can query your index by using:

select * from table(high_freq_terms('SOURCE_BIG_LIDX','TEXT',10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
select * from table(high_freq_terms('SOURCE_BIG_LIDX','line',100));

and, index_terms():
FUNCTION index_terms(index_name VARCHAR2,
term_name VARCHAR2) RETURN term_info_set

select * from table(index_terms('SOURCE_BIG_LIDX','TEXT')) order by docFreq desc;
select * from table(index_terms('SOURCE_BIG_LIDX','TEXT'));
select * from table(index_terms('SOURCE_BIG_LIDX',null)) where rownum<10;
select * from (select * from table(index_terms('SOURCE_BIG_LIDX','line'))
order by docFreq desc) where rownum<=10;

On both functions, if the term argument is NULL, they will iterate over all index
terms. The natural order of high_freq_terms() is descending by docFreq,
whereas index_terms() is ordered by term_name:term_value ascending. Note that if you
pass a non-NULL value for term, to start with the first value of that specific term,
index_terms() does not stop when all the values of that term have been returned; this
behavior is similar to the Lucene Java method reader.terms(new Term(term)). Here is an
example if you want to iterate only over a specific term name:

BEGIN
  FOR term_rec IN (SELECT * FROM table(index_terms('SOURCE_BIG_LIDX','line')))
  LOOP
    -- exit as soon as a term with a different term name is fetched
    EXIT WHEN substr(term_rec.term,1,length('line'))<>'line';
    -- process data record
    dbms_output.put_line('Name = ' || term_rec.term || ' ' || term_rec.docFreq);
  END LOOP;
END;

You can use index_terms() to get the Top-N terms ordered by docFreq, for example:

SQL> select * from (select * from table(index_terms('SOURCE_BIG_LIDX',null))
  2  order by docFreq desc) where rownum<=10;
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241
Elapsed: 00:00:06.09

SQL> select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241
Elapsed: 00:00:00.02

The two queries are semantically equivalent, but high_freq_terms() is more efficient because
it uses a TermInfoQueue structure for sorting, caches its computation once it is executed,
and does not create a lot of term_info objects which must then be sorted by the RDBMS engine.

3.4.11 Did You Mean functionality

Starting with Lucene Domain Index 2.9.2.1.0, the Lucene Did You Mean functionality was
added as an extended LDI feature, using the Lucene SpellChecker library to create a
dictionary index from the main index. Finally, the dictionary index is merged into the
main index.

PROCEDURE indexDictionary(
index_name IN VARCHAR2,
spellColumns IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein')

is available to create the dictionary index to be merged with the main index.

You can create the dictionary by using:

SQL> call didyoumean.indexdictionary('SOURCE_BIG_LIDX');

Call completed.
Elapsed: 00:01:11.61
SQL> exec
didyoumean.indexdictionary('EMAILBODYTEXT','BODYTEXT,subject,emailFrom,emailTo','NGram')
PL/SQL procedure successfully completed.
Elapsed: 00:00:01.62

Only index_name is mandatory. If the spellColumns parameter is NULL, the master
column of the main index will be used. By default the Levenstein distance algorithm (a.k.a.
edit distance) is applied; the other options are Jaro (the Jaro-Winkler metric) and NGram
distance.
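For reference, the edit distance that the default algorithm computes can be sketched in a few lines of Python. This is a textbook dynamic-programming version for illustration, not the SpellChecker code itself:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution (0 if equal)
        prev = cur
    return prev[-1]

print(levenshtein("sorce", "source"))  # prints 1: one inserted letter
```

A suggestion like source for the misspelling sorce is close precisely because its edit distance is 1.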

Note: The dictionary structure creates the "word", "gramN", "startN" and "endN"
Lucene fields, so be careful if you already have these fields in the main index. The structure
of this index (for a 3-4 gram) is:

Index Structure Example


word kings
gram3 kin, ing, ngs
gram4 king, ings
start3 kin
start4 king
end3 ngs
end4 ings
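The decomposition shown in the table can be reproduced with a small Python sketch. This is only an illustration of how the n-gram fields are derived from a word (it assumes the word is at least max_n characters long), not the SpellChecker implementation:

```python
def spell_fields(word, min_n=3, max_n=4):
    """Build the n-gram dictionary fields for a word (3-4 gram by default)."""
    fields = {"word": word}
    for n in range(min_n, max_n + 1):
        # all contiguous n-character slices of the word
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        fields["gram%d" % n] = grams
        fields["start%d" % n] = grams[0]    # leading n-gram
        fields["end%d" % n] = grams[-1]     # trailing n-gram
    return fields

print(spell_fields("kings"))
# gram3: ['kin', 'ing', 'ngs'], start3: 'kin', end3: 'ngs',
# gram4: ['king', 'ings'], start4: 'king', end4: 'ings'
```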

and,

FUNCTION suggest
(
index_name IN VARCHAR2,
cmpval IN VARCHAR2,
highlight IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2

is available to query the dictionary index. You can query the dictionary by using:

SQL> select didyoumean.suggest('SOURCE_BIG_LIDX','sorce') suggestion from dual;

SUGGESTION
--------------------------------------------------------------------------------
source
Elapsed: 00:00:00.31

SQL> select didyoumean.suggest('SOURCE_BIG_LIDX','sorce','b') suggestion from dual;
SUGGESTION
--------------------------------------------------------------------------------
<b>source</b>
Elapsed: 00:00:00.09

SQL> select didyoumean.suggest('SOURCE_BIG_LIDX','sorce','b','Jaro') suggestion from dual;
SUGGESTION
--------------------------------------------------------------------------------
<b>source</b>
Elapsed: 00:00:00.07

SQL> select didyoumean.suggest('EMAILBODYTEXT','lucene searhc','i') suggestion from dual;
SUGGESTION
--------------------------------------------------------------------------------
lucene <i>search</i>
Elapsed: 00:00:00.06

SQL> select didyoumean.suggest('EMAILBODYTEXT','lucine injetion','b','Levenstein') suggestion from dual;
SUGGESTION
--------------------------------------------------------------------------------
<b>lucene</b> <b>injection</b>
Elapsed: 00:00:00.06

The index_name parameter and the word to respell (cmpval) are mandatory. Optionally,
you can define the highlight tag to be used (e.g. b for bold, i for italic, etc.) and the
distance algorithm to apply.

3.5 Synchronize
Working with SyncMode:Deferred you have to synchronize your index manually; this means
updating the Lucene Domain Index structure by applying pending changes such as inserts and
updates. Delete operations are always applied immediately because the ODCI API does not
accept the rowids of deleted rows.
Here is an example:

begin
LuceneDomainIndex.sync('IT1');
commit; -- release locks
end;

The LuceneDomainIndex.sync procedure requires an argument of type VARCHAR2 with the index
object name; index object names are usually capitalized and have the syntax
SCHEMA_OWNER.IDX_NAME.
The synchronize operation can raise an exception if some rows being indexed are locked for
update; in that case you have to release the locked rows first and re-sync the index.
An exclusive lock on the Lucene index storage is obtained during index synchronization, so you
have to commit or rollback the connection immediately after this operation to release the
exclusive lock.
Since Lucene Domain Index 2.4.0.1.0 you can use LuceneDomainIndex.sync('IT1') or
LuceneDomainIndex.sync(USER,'IT1'); both procedures are equivalent.
Note: Due to a limitation of the SYS.ODCIRidList() array you cannot enqueue more than 32767
additions or deletions; an update is counted as one deletion plus one addition by the Lucene
implementation code. This limitation will be removed in future releases of Lucene Domain Index.
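With SyncMode:Deferred the sync call is typically scheduled rather than issued by hand. A periodic job can be sketched with the DBMS_JOB package (the index name IT1 and the 5-minute interval are illustrative):

```sql
-- submit a job that syncs the IT1 index every 5 minutes
declare
  jobno binary_integer;
begin
  dbms_job.submit(jobno,
                  'begin LuceneDomainIndex.sync(''IT1''); commit; end;',
                  sysdate,
                  'sysdate + 5/1440');
  commit;  -- make the job visible to the job queue processes
end;
/
```

The commit inside the job action releases the exclusive storage lock right after each sync, as recommended above.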

3.6 Optimize
Optionally you can optimize the Lucene index storage; to do so, execute:
begin
LuceneDomainIndex.optimize('IT1');
commit; -- release locks
end;

Like the sync operation, this procedure gets an exclusive lock on the Lucene index storage table
and performs an optimization of the Lucene index, for example merging multiple segments into a
new one. You can still perform select operations (read-only) using Lucene Domain Index during
the optimization; the Oracle concurrency system (redo logs) provides this functionality. Once you
perform a commit operation any other concurrent session will automatically see the index changes.

3.7 XMLDB Export

You can perform an XMLDB export operation for your Lucene Domain Index; this operation
provides an easy way to make the Lucene Domain Index information available as Lucene files.
Once the operation is done you can get all the files from your database using a WebDAV explorer
or FTP. For example:

begin
LuceneDomainIndex.xdbExport('IT1');
commit; -- makes change visible to Ftp or WebDAV
end;

Your index will be visible, for example, at the /public/lucene/SCOTT.IT1 directory.
Once you copy these files onto the file system you can open them with any Lucene-compatible
application, such as Luke. Here are some screenshots of Luke using the Lucene Domain Index
export information.
3.8 Exporting/Importing a functional index with the exp/imp Oracle tools
You can perform an Oracle exp operation for your Lucene Domain Index. The Oracle exp tool
exports by default the functional indexes of every table included in the backup process.
As mentioned earlier, Lucene Domain Index creates a table named IDX_NAME$T which holds the
Lucene file storage as BLOBs; a DBMS AQ queue is also created at index creation time,
associated with a table IDX_NAME$QT. Both tables are flagged as SECONDARY, which means
that you cannot export these tables alone, but they are automatically included when the Lucene
Domain Index is part of the export.
During the import operation Oracle re-creates the index using a create index ... parameters('your
lucene parameters') DDL statement; all Lucene Domain Index parameters are included except
PopulateIndex, which is always stored as false in Oracle's system views.
This parameter is altered intentionally by Lucene Domain Index because, if it were set to true,
during the import operation Lucene Domain Index would try to re-create the Lucene index
structure instead of using the information restored into the IDX_NAME$T table.
As an alternative to the XMLDB export or the Oracle exp tool, you can also export your Lucene
Domain Index storage using a create table as ... DDL statement. For example:

SQL> create table SOURCE_BIG_LIDX$T$BK as (select * from SOURCE_BIG_LIDX$T);

... you can now export SOURCE_BIG_LIDX$T$BK using the exp tool because it is a regular table ...
-bash-3.2$ exp
Export: Release 10.2.0.3.0 - Production on Fri Mar 27 02:46:18 2009

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Username: scott/tiger

Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production


Enter array fetch buffer size: 4096 >

Export file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp

(2)U(sers), or (3)T(ables): (2)U > 3

Export table data (yes/no): yes > yes

Compress extents (yes/no): yes >

Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)

About to export specified tables via Conventional Path ...


Table(T) or Partition(T:P) to be exported: (RETURN to quit) > SOURCE_BIG_LIDX$T$BK

. . exporting table SOURCE_BIG_LIDX$T$BK 19 rows exported


Table(T) or Partition(T:P) to be exported: (RETURN to quit) >

Export terminated successfully without warnings.


.... Now you can drop your index and re-create again without populating it ....
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

COUNT(*)
----------
6167

SQL> drop index SOURCE_BIG_LIDX;

Index dropped.

SQL> create index source_big_lidx on test_source_big(text)
  2  indextype is lucene.LuceneIndex
  3  parameters('PopulateIndex:false;AutoTuneMemory:true;Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;MergeFactor:500
"line"');
"line"');

Index created.

SQL> drop table SOURCE_BIG_LIDX$T$BK;

Table dropped.
.... Restore your .dmp now and check again if your index returns a correct result ....
-bash-3.2$ imp scott/tiger

Import: Release 10.2.0.3.0 - Production on Fri Mar 27 02:49:40 2009

Copyright (c) 1982, 2005, Oracle. All rights reserved.


Connected to: Oracle Database 10g Release 10.2.0.3.0 - Production

Import file: expdat.dmp > SOURCE_BIG_LIDX_BK.dmp

Enter insert buffer size (minimum is 8192) 30720>

Export file created by EXPORT:V10.02.01 via conventional path


import done in US7ASCII character set and AL16UTF16 NCHAR character set
import server uses AL32UTF8 character set (possible charset conversion)
List contents of import file only (yes/no): no >

Ignore create error due to object existence (yes/no): no >

Import grants (yes/no): yes >

Import table data (yes/no): yes >

Import entire export file (yes/no): no > yes

. importing SCOTT's objects into SCOTT


. importing SCOTT's objects into SCOTT
. . importing table "SOURCE_BIG_LIDX$T$BK" 19 rows imported
Import terminated successfully without warnings.
.... Check first that your index does not have information, then populate it with the Lucene index
information ....
SQL> conn scott/tiger
Connected.
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

COUNT(*)
----------
0

SQL> truncate table SOURCE_BIG_LIDX$T;

Table truncated.

SQL> insert into SOURCE_BIG_LIDX$T (select * from SOURCE_BIG_LIDX$T$BK);

19 rows created.
SQL> exit
..... and connect again to refresh the Lucene Domain Index in-memory structures ....
SQL> conn scott/tiger
Connected.
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;

COUNT(*)
----------
6167

As you can see, the Lucene Domain Index structure can be exported alone without exporting the
master table; this is useful when you are upgrading Lucene Domain Index to a version that
requires all indexes to be dropped first and you do not want to re-create a very big index.
4. Locking and Performance

4.1 Lock used by Lucene Domain Index

Operation          Base Table (row/table)(1)  Index Table (SCHEMA.IDX$T)   Queue Table (SCHEMA.IDX$QT)
Insert             X/RX                       NONE                         NONE
Update             X/RX                       NONE                         NONE
Delete             X/RX                       X/RS(updateCount)|X/RX (3)   NONE
Manually Sync      X/RS (2)                   X/RS(updateCount)|X/RX       DBMS_AQ.BLOCKED (4)
Automatically Sync X/RS                       X/RS(updateCount)|X/RX       DBMS_AQ.BLOCKED
Optimize           NONE                       X/RS(updateCount)|X/RX       NONE
1. X = row exclusive lock on the row being inserted; RX = table row exclusive lock.
2. X = row exclusive lock on the rows being indexed; RS = table row share lock. A select
... for update nowait is performed on all rows being added to the Lucene index.
3. X/RS is performed on the row where name='updateCount'; this is the writer-lock
semaphore of the Lucene index and serializes write operations. X/RX is performed on
many rows of this table because Lucene creates and deletes many files.
4. To perform massive dequeue operations on the DBMS AQ queue, Sync scans the queue
with the DBMS_AQ.BLOCKED option.

4.2 Performance tips

4.2.1 Index Writer parameters

The Lucene IndexWriter class uses several parameters to control the index structure. Lucene
Domain Index passes several parameters to the IndexWriter, such as MergeFactor and
MaxBufferedDocs, among others.
As a best practice, if you want to index thousands of rows you can override the default Lucene
parameters with values which speed up indexing. With create index or alter index
rebuild you can set MergeFactor to 100 and MaxBufferedDocs to 4000.
These parameters increase indexing performance, but DML operations on the base table
will then batch small sets of rows, so after the DDL commands change MergeFactor to 2 and
MaxBufferedDocs to 100. A good place to start learning about the behavior of these parameters
is the Lucene Wiki page Improving Indexing Speed.
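The two-phase tuning described above can be sketched as follows (the index and table names are illustrative, and the parameter values are the ones suggested in the text):

```sql
-- bulk-load phase: large write batches for fast initial indexing
create index my_lidx on my_table(text)
indextype is lucene.LuceneIndex
parameters('MergeFactor:100;MaxBufferedDocs:4000');

-- OLTP phase: small batches so per-DML sync work stays cheap
alter index my_lidx parameters('MergeFactor:2;MaxBufferedDocs:100');
```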

4.2.2 Auto Tune Memory functionality

Lucene Domain Index has a parameter called AutoTuneMemory; a true value means that
for IndexWriter operations it will try to use up to 90% of the Java Pool Size configured in
the Oracle SGA to adjust how many documents are buffered (MaxBufferedDocs) before
calling IndexWriter.flush().
With AutoTuneMemory:true, MaxBufferedDocs is not required; it is calculated using the free
RAM in the SGA, but you have to set MergeFactor.
Because Java Pool Size is a global parameter, this rule does not hold if you want to create many
indexes over parallel connections: two connections would each try to use 90% of the SGA, so one
of them would run out of memory.
4.2.3 Keep Index on RAM

OJVMDirectory replaces Lucene's file system storage with table storage based on BLOBs. For
every Lucene Domain Index created there is a new table which stores every Lucene file as
a row with a BLOB column (see section 6 for more detail). Using a strategy similar to Oracle
Text, you can keep this table in RAM. Unlike Oracle Text, which uses multiple tables to store
the inverted index, Lucene Domain Index uses one table; execute this DDL command to keep
the Lucene index in RAM:

create index source_small_lidx on test_source_small(text)
indextype is lucene.LuceneIndex
parameters('FormatCols:line(0000);ExtraCols:line
"line";Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:100');

alter index source_small_lidx parameters('MergeFactor:2');

alter table source_small_lidx$t storage (buffer_pool keep) modify lob (data) (storage
(buffer_pool keep));

During index creation use AutoTuneMemory:true (the default value) and a high MergeFactor,
because many rows will be indexed at that time. Then change MergeFactor to 2 to work
better after each DML/sync operation. Finally, change the OJVMDirectory storage table and
LOB to keep them in RAM.
Be sure that your SGA has enough RAM to hold the index. To know how big your index is, you
can query the table:

SQL> select sum(file_size) from source_small_lidx$t where deleted='N';

SUM(FILE_SIZE)
--------------
147444

Finally, as Tom Kyte says: tkprof, tkprof, .... ;)
You can see Lucene Domain Index IO operations with an "alter session set events
'10046 trace name context forever, level 12'"; then you can find the operations on the Lucene
Domain Index table SCHEMA.IDX_NAME$T. Using the TKPROF information you can alter the
table and LOB storage parameters manually.

4.2.4 Compare your execution plan

To be sure that your Lucene Domain Index is properly used, compare your execution
plans and try to avoid unnecessary filter by or sort order by predicates by using the in-line
sort or multiple-field Query Parser conditions.
Here are examples of sorting using the emails table created in section 3.1.4:

SQL> explain plan for
  2  SELECT subject FROM emails where lcontains(bodytext,'security',1)>0
  3  order by subject ASC;

Explained.

Elapsed: 00:00:00.58
SQL> set echo off
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------
Plan hash value: 1542204867
| Id | Operation                    | Name          | Rows | Bytes | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT             |               |    1 |  4016 |     3  (34)| 00:00:01 |
|  1 |  SORT ORDER BY               |               |    1 |  4016 |     3  (34)| 00:00:01 |
|  2 |   TABLE ACCESS BY INDEX ROWID| EMAILS        |    1 |  4016 |     2   (0)| 00:00:01 |
|* 3 |    DOMAIN INDEX              | EMAILBODYTEXT |      |       |            |          |


Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security',1)>0)

The above execution plan shows that you are using the Lucene Domain Index, but you can get a
better optimizer plan by using the lcontains sort:

SQL> explain plan for
  2  SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  3  where lcontains(bodytext,'security','subject:ASC',1)>0;

Explained.

Elapsed: 00:00:00.01
SQL> set echo off

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------
Plan hash value: 1450245214
| Id | Operation                    | Name          | Rows | Bytes | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT             |               |    1 |  4016 |     2   (0)| 00:00:01 |
|  1 |  TABLE ACCESS BY INDEX ROWID | EMAILS        |    1 |  4016 |     2   (0)| 00:00:01 |
|* 2 |   DOMAIN INDEX               | EMAILBODYTEXT |      |       |            |          |


Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security','subject:ASC',1)>0)

Here we have a better optimizer plan with a lower cost, and the SORT ORDER BY step is gone.

5 Known caveats
1. Lucene Domain Index uses the Java Util Logging API, which means that the following grant is
required to create and operate any index:

dbms_java.grant_permission( 'USER_NAME',
'SYS:java.util.logging.LoggingPermission', 'control', '' )
2. SyncMode:OnLine should be reserved for indexes where the number of update/insert/delete
operations is small compared to select operations, because almost every message processed
requires opening an IndexWriter/IndexReader on the associated Lucene Index by a
background process, except for bulk collect operations or "insert into ... select ... from", which
are processed in batches of 150 rows. Tables with many insert/update operations per second
should instead use the LuceneDomainIndex.sync(idx) procedure, called periodically by
DBMS_JOB or by the application.
3. The in-line pagination syntax is only supported at the beginning of the query: if you want to
perform pagination, the lcontains() query syntax must start with "rownum:[n TO m] AND",
and note that this syntax is case sensitive. Also, this extraction is performed by splitting the
query by position and does not take grouping operators into account, so the query
"rownum:[1 TO 10] AND word1 OR word2" will be passed to Lucene's Query Parser as "word1
OR word2", which is not semantically equal to the original if you consider operator
precedence. The Query Parser class may be modified in the future to solve this semantic issue.
4. Since October 25, column names are case sensitive in the ExtraCols and FormatCols parameters,
following traditional SQL behavior. This means that for this DDL index creation:

create index it1 on t1(f2) indextype is lucene.LuceneIndex


parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');

You can use ExtraCols with f3 or F3, but FormatCols must use F3 because f3 is returned by
the SQL select operation as F3 during the table full scan; the Lucene Index will also have a
document with a Field F3 instead of f3. If you want to use f3 as-is, you can re-write the DDL
index creation as:

create index it1 on t1(f2) indextype is lucene.LuceneIndex


parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),f3(00.00);ExtraCols:F3
"f3"');

With this statement Lucene will create documents with two Fields, F2 and f3. F2 is uppercase
because it is the master column of the index and is passed as "F2" by the ODCI API; since it is
the default Field of the query, you can omit its name in the lcontains syntax. The extra column
is now aliased in lowercase and will be indexed as a Field "f3".
5. Since November, index parameters are pre-cached in memory for faster response. Due to the
isolation behaviour of Oracle JVM sessions, if you alter an index or re-create a new one in
another session, you need to close all SQL sessions that have already pre-loaded the index
parameter storage.
By calling LuceneDomainIndex.getParameter('owner.index_name','parameter_name') you can
see the value of any parameter passed to the ODCI API by either create index or alter
index.
Alternatively, you can call the LuceneDomainIndex.refreshParameterCache stored procedure.
6. If you re-install Lucene Domain Index without first deleting existing indexes, you can
manually drop the resources associated with an old index. For example:

SQL> drop index source_big_lidx force;


Index dropped.
SQL> select table_name from tabs;

TABLE_NAME
------------------------------
DEPT
EMP
BONUS
SALGRADE
SOURCE_BIG_LIDX$QT
DR$SOURCE_BIG_IDX$I
DR$SOURCE_BIG_IDX$R
SOURCE_BIG_LIDX$T
TEST_SOURCE_BIG
DR$SOURCE_BIG_IDX$N
DR$SOURCE_BIG_IDX$K

11 rows selected.
SQL> drop table SOURCE_BIG_LIDX$T;

Table dropped.
SQL> conn / as sysdba
connected.
SQL>exec DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q')
BEGIN DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q'); END;

*
ERROR at line 1:
ORA-01403: no data found
ORA-06512: at "SYS.DBMS_AQADM_SYS", line 3359
ORA-06512: at "SYS.DBMS_AQADM", line 167
ORA-06512: at line 1

SQL> exec DBMS_AQADM.DROP_QUEUE_TABLE(queue_table =>


'SCOTT.SOURCE_BIG_LIDX$QT', force=>true);

PL/SQL procedure successfully completed.

SQL> exit

Note that "drop index ... force" de-registers the Lucene Domain Index from Oracle's system
views; the Lucene Domain Index storage table is then dropped manually, and finally, connected
as SYS, the Lucene Domain Index AQ table is dropped.
7. Oracle 11g has a known bug, "6445561 - ORA-00600 [26599] [62] DUE TO INCORRECT
PERSISTENCE OF BY INVOKER PIN"; please apply patch
p6445561_111060_LINUX.zip available at Metalink. This bug affects select count(*) with
large results.
8. Up to Lucene Domain Index 2.9.0 there is a known problem with the WhereCondition
parameter using the OR SQL operator; see section A.3.3 for the workaround.

Appendixes
A. Parameter reference and syntax
Lucene Domain Index accepts several parameters which can be passed using the create index or alter
index DDL commands. These parameters are divided into several categories: Index Writer, Analyzer,
User Data Store, General, Query and Highlight parameters.
A.1 Lucene Index Writer parameters

This section covers the Lucene Index Writer parameters; for more information about these parameters
see the Lucene docs and Wiki.

A.1.1 MergeFactor

Determines how often segment indices are merged by addDocument(). If you are creating
a new index over a table with thousands of rows, a value of 100 to 500 is a good choice.
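As a sketch (the table, column and index names are hypothetical), a bulk load can use a high MergeFactor at creation time and lower it afterwards with alter index, following the pattern the JUnit suite in appendix C uses:

```sql
-- Populate a brand new index quickly with a high MergeFactor
create index big_lidx on big_table(text_col) indextype is lucene.LuceneIndex
  parameters('MergeFactor:100');

-- Once the table is loaded, a lower value favors search performance
alter index big_lidx parameters('MergeFactor:10');
```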

A.1.2 MaxBufferedDocs

Determines the minimal number of documents required before the buffered in-memory
documents are merged and a new Segment is created. This value can cause an out of
memory exception if you provide a value larger than the available user space. A typical
SGA configuration can accept values of 4000 or 5000, depending on how big the rows
being indexed are. If you are not sure how many megabytes your rows may consume,
you can use the AutoTuneMemory:true parameter (the default); when it is true,
MaxBufferedDocs is ignored and Lucene Domain Index will try to use 90% of the Oracle
Java Pool Size value.

A.1.3 MaxMergeDocs

Determines the largest number of documents ever merged by addDocument().

A.1.4 MaxBufferedDeleteTerms

Determines the minimal number of delete terms required before the buffered in-memory
delete terms are applied and flushed.

A.1.5 UseCompoundFile

Setting to turn on usage of a compound file. When on, multiple files for each segment are
merged into a single file once the segment creation is finished. This is done regardless of
what directory is in use. By default Lucene Domain Index do not use compound file format
because its not affected by max open file descriptors.

A.2 Analyzer parameters

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting
index terms from text.
Typical implementations first build a Tokenizer, which breaks the stream of characters from the
Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the
Tokenizer.
The Analyzer, PerFieldAnalyzer and Stemmer parameters affect indexing and query expressions,
so if you want to change one of these parameters on an existing index you must rebuild it. The
priority of these three parameters is: first check for Stemmer; if it is not present, check for
PerFieldAnalyzer; if that is not present, check for the Analyzer parameter; finally, if none of
them are defined, SimpleAnalyzer is used.
A.2.1 Analyzer

This parameter is fully qualified Java class name which extends


org.apache.lucene.analysis.Analyzer. For example:
• BrazilianAnalyzer
• ChineseAnalyzer
• CJKAnalyzer
• CzechAnalyzer
• DutchAnalyzer
• FrenchAnalyzer
• GermanAnalyzer
• GreekAnalyzer
• KeywordAnalyzer
• PatternAnalyzer
• RussianAnalyzer
• SimpleAnalyzer
• StandardAnalyzer
• StopAnalyzer
• ThaiAnalyzer
• WhitespaceAnalyzer
See the Lucene Java Docs for more details. The default analyzer is SimpleAnalyzer.

A.2.2 Stemmer

A Stemmer is another kind of analyzer which handles words, stop words and other
term-related objects based on a specific language. The Stemmer parameter uses the Snowball
Analyzer; possible values for the Stemmer parameter using the Lucene 2.2.0 distribution are:
• Danish
• Dutch
• English
• Finnish
• French
• German
• German2
• Italian
• Kp
• Lovins
• Norwegian
• Porter
• Portuguese
• Russian
• Spanish
• Swedish
The Stemmer parameter overrides the Analyzer parameter.
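Because the Stemmer affects both indexing and query parsing, an existing index must be rebuilt to switch stemmers; a minimal sketch against the emails index used in the execution-plan examples above:

```sql
-- Switch the emails index to the English Snowball stemmer;
-- a rebuild is required so already-indexed rows are re-analyzed
alter index emailbodytext rebuild parameters('Stemmer:English');
```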

A.2.3 PerFieldAnalyzer

PerFieldAnalyzer is a wrapper around other analyzers which provides an independent analyzer
for each column being indexed; see the PerFieldAnalyzerWrapper class in the Lucene
documentation. Each column can have its own analyzer, which must extend
org.apache.lucene.analysis.Analyzer. If a column is not in the list, StandardAnalyzer
will be used as the default. For example:

create table t1 (f1 VARCHAR2(10), f2 XMLType);


insert into t1 values ('1', XMLType('<emp id="1"><name>ravi</name></emp>'));
insert into t1 values ('3', XMLType('<emp id="3"><name>murthy</name></emp>'));

create index it1 on t1(f2) indextype is lucene.LuceneIndex


parameters('IncludeMasterColumn:false;
ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/
@id'') "id";
FormatCols:F1(000),id(00)');

alter index it1 rebuild


parameters('PerFieldAnalyzer:F1(org.apache.lucene.analysis.KeywordAnalyzer),id(org.apache.lucene.analysis.KeywordAnalyzer)');

In the above example four Fields are indexed by Lucene Domain Index: rowid
(added by default) using KeywordAnalyzer; F1 and id (added by the ExtraCols parameter),
also using KeywordAnalyzer; and finally name, which is not included in the
PerFieldAnalyzer parameter and therefore uses StandardAnalyzer.

A.3 User Data Store parameters

Lucene Domain Index implements User Data Store functionality, which provides
several parameters to control which columns will be included in the Lucene
Document inserted into the index.
The first three parameters are used to choose which columns will be added to the index
in addition to the master column. An Oracle Domain Index is bound to a single column;
this is a limitation of the Oracle 10g version. To work around it, by passing ExtraCols,
ExtraTabs and WhereCondition you can easily build a set of new columns from the master table
and others. Basically, a select DML statement is built using these parameters. To clarify,
Lucene Domain Index will perform queries like:

Full table scan (create index statement):


SELECT rowid,MasterTable.MasterColumn,ExtraCols FROM MasterTable,ExtraTabs where
WhereCondition;

Find a particular rowid (insert,update operations):


SELECT MasterTable.MasterColumn,ExtraCols FROM MasterTable,ExtraTabs where
MasterTable.rowid=:rowid AND WhereCondition;

In the statements above, the SELECT/FROM/WHERE skeleton and the rowid references are injected by Lucene Domain Index, while MasterTable, MasterColumn, ExtraCols, ExtraTabs and WhereCondition are user defined.

A.3.1 ExtraCols

A comma-separated list of columns from the master table being indexed or from the tables
defined in the ExtraTabs parameter. Note that if you don't define column aliases, column
names are capitalized by default on Oracle databases. For example: 'ExtraCols:F2 "f2",T2.F3
"f3"'. Note that you can omit the master table name if there are no collisions.

A.3.2 ExtraTabs

A comma-separated list of table names and aliases for these tables. For example: 'ExtraTabs:T2
aliasT2,T3 aliasT3'. Note that the ODCI API will only detect changes on the index master column;
to notify changes based on the ExtraCols list you need to attach triggers, see the examples in the
sections above for more detail.
A.3.3 WhereCondition

An SQL where condition used to join the index's master table with the ExtraTabs tables. For
example: 'WhereCondition:T1.f1=T2.f2(+) AND T1.F1=aliasT3.f3'. Be careful to produce a
correct join condition that guarantees a single-row result; multiple or zero-row results based on
the master table values are not allowed.

Note: Up to Lucene Domain Index 2.9.0, if you use a WhereCondition which
contains an OR operator, enclose the condition in parentheses, because the
higher precedence of AND over OR makes some queries return more rows than
the correct behavior. For example, instead of:
WhereCondition:T1.F1='AA' OR T1.F1='BB'
use:
WhereCondition:(T1.F1='AA' OR T1.F1='BB')
This workaround fixes some problems when working in OnLine mode. Starting
with the 2.9.1 version these extra parentheses are not required.
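A minimal sketch combining the three User Data Store parameters (the table and column names are hypothetical); the outer join keeps exactly one result row per master table row even when no lookup row exists:

```sql
-- Index T1.F1 plus a description column taken from lookup table T2;
-- the (+) outer join guarantees a single-row result per T1 row
create index it1 on t1(f1) indextype is lucene.LuceneIndex
  parameters('ExtraTabs:T2;ExtraCols:T2.DESCRIPTION "descr";WhereCondition:T1.ID=T2.T1_ID(+)');
```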

A.3.4 UserDataStore

This is a fully qualified Java class name which implements the
org.apache.lucene.indexer.UserDataStore interface; you can create your own Data
Store class implementing this interface. By default Lucene Domain Index provides an
implementation which covers most typical scenarios, the class
org.apache.lucene.indexer.DefaultUserDataStore, which uses the FormatCols parameter to
create Lucene Fields.

A.3.5 FormatCols

A comma-separated list of column(format) strings interpreted by the User Data Store class to
control how a specific database column will be transformed into a Lucene Field. For example,
you can choose padding, un-tokenized values and so on.
The formats supported by the Default Data Store class are:
• Number padding for numeric columns using java.text.DecimalFormat class
syntax; the default is 0000000000.
• Date rounding for timestamp and date columns using
org.apache.lucene.document.DateTools; the default is day.
• Character left padding for VARCHAR2 or CHAR columns using the
org.apache.lucene.util.StringUtils class (leftPad method); the default is no left char
padding. Any char can be used for left padding.
• XPath expression for XMLType columns; this XPath string will be passed to the
XMLType.extract("format","") method, and the result of the XPath extraction will be
a new XMLType object on which getStringVal() will be executed. If you want a more
customized XMLType-to-Field extraction, extend the DefaultUserDataStore class
or use virtual column indexing.
• For columns of type VARCHAR2 or CHAR you can use the special string
NOT_ANALYZED or NOT_ANALYZED_STORED as the format, which tells the Default
User Data Store class that this column will be indexed but un-tokenized; this is
useful for columns which will be used for sorting.
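A sketch combining number padding with an un-tokenized sort column (the table and column names are hypothetical; the F1(000) padding format follows the example in the caveats section):

```sql
-- F1 is zero-padded to 3 digits so term-ordered range queries sort numerically;
-- F3 is indexed un-tokenized, making it usable for sorting
create index it1 on t1(f2) indextype is lucene.LuceneIndex
  parameters('ExtraCols:F1,F3;FormatCols:F1(000),F3(NOT_ANALYZED)');
```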

A.4 General parameters

This set of parameters is specific to Lucene Domain Index.


A.4.1 SyncMode

SyncMode tells Lucene Domain Index which strategy is used to update the index.
SyncMode:Deferred (the default) leaves it to the application to decide when the index is synced,
either by calling the LuceneDomainIndex.sync procedure after a set of pending changes or by a
DBMS_SCHEDULER process at a specific time. With SyncMode:Deferred, update and
insert operations are queued using the DBMS_AQ package. Delete operations are never
enqueued because they require an update on the Lucene Index so that rowids of deleted rows
are not returned.
SyncMode:OnLine is implemented using a DBMS_AQ PLSQL callback, so immediately
after a commit operation which involves inserted or updated rows, a parallel process dbms_j*
is automatically started by the DBMS_AQ package to apply pending changes.
SyncMode:OnLine should be reserved for indexes where update, insert or delete operations
are much less frequent than selects; AQ callbacks cannot handle exceptions very well during sync
time, for example when a row being indexed is locked by another session, so some changes
can be lost in this scenario.
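With Deferred mode, a periodic sync can be scheduled in the database itself. This is only a sketch: the index name is hypothetical, and the single-string-argument form of LuceneDomainIndex.sync is an assumption based on the sync(idx) call mentioned in the caveats:

```sql
-- Apply pending inserts/updates to the IT1 index every 5 minutes
begin
  dbms_scheduler.create_job(
    job_name        => 'SYNC_IT1',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin LuceneDomainIndex.sync(''IT1''); end;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=5',
    enabled         => true);
end;
/
```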

A.4.2 AutoTuneMemory

AutoTuneMemory:true (the default) overrides the MaxBufferedDocs parameter; it defines
MaxBufferedDocs dynamically based on how much memory is reported by the
OracleRuntime.getJavaPoolSize() method.
After each document is added to the index, it calls writer.ramSizeInBytes() and tests that it
is not over 90% of the free RAM.
This parameter works in most common cases, but you can get a Java out of memory
error in multiuser environments because the Java Pool Size is a parameter common to all
sessions. If you get an exception during index creation time, set AutoTuneMemory:false
and adjust MaxBufferedDocs to a value which does not raise an out of memory exception.
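If auto-tuning fails in a multiuser environment, the manual fallback looks like this sketch (the table and index names are hypothetical; 4000 is the typical value suggested for MaxBufferedDocs above):

```sql
-- Re-create the index without memory auto-tuning, capping the
-- in-memory document buffer explicitly
create index it1 on t1(f2) indextype is lucene.LuceneIndex
  parameters('AutoTuneMemory:false;MaxBufferedDocs:4000');
```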

A.4.3 LobStorageParameters

Lucene Domain Index uses a BLOB column named "data" for storing Lucene inverted
index files. You can control any LOB storage parameter with this parameter at index
creation time; its default value is 'LobStorageParameters:PCTVERSION 0 ENABLE
STORAGE IN ROW CACHE READS NOLOGGING'. For 11g databases you can use
better optimized storage with the newer Secure LOB parameters, for example:
'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768
CACHE READS FILESYSTEM_LIKE_LOGGING'

A.4.4 LogLevel

Lucene Domain Index uses the JDK Java Util Logging package. The LogLevel parameter is any
of the strings accepted by the Level.parse() method, for example: LogLevel:ALL. By default
the logging level is WARNING.
Lucene Domain Index uses:
• SEVERE for non-recoverable error conditions
• FINER for debugging purposes, such as ODCI API arguments
• INFO for checking index operations, such as the value being indexed
• WARNING for error messages which are reported as ERROR through the ODCI API
• CONFIG to see user parameters changed by users
Logging information is sent by default to the Oracle .trc files, but you can redirect this output
using the dbms_java.set_output procedure, for example.
If you are not sure which fields are added to the index, and how, change LogLevel
to INFO and check for lines starting with: "INFO: Document<"
The exiting and throwing methods do not print messages even with the log level set to ALL.
This is because the logging level used by these methods is controlled by the ConsoleHandler
level.
To get these methods to work, copy the logging.properties file from your JAVA_HOME/jre/lib to
the ORACLE_HOME/javavm/lib directory and edit the line which includes the level property:

# Limit the message that are printed on the console to INFO and above.
java.util.logging.ConsoleHandler.level = ALL
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

Then shutdown and startup your Oracle database.
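A sketch of raising the log level for one index and redirecting the JVM output to the current SQL*Plus session instead of the .trc files (the index name is hypothetical):

```sql
-- Log indexed documents ("INFO: Document<...") for index IT1
alter index it1 parameters('LogLevel:INFO');

-- Redirect Oracle JVM output to the session, with a 10000-byte buffer
set serveroutput on size 1000000
exec dbms_java.set_output(10000)
```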

A.4.5 CachedRowIdSize

CachedRowIdSize sets the size of an LRU cache used to maintain the association between a
Lucene Doc ID and a particular Oracle ROWID. For very big tables, using an array to
store this association can consume a lot of SGA RAM, so starting with Lucene Domain Index
2.9.0.1.0 only 10,000 ROWIDs are stored in this cache. Tables with a high frequency of
updates can use a small LRU cache, because every update causes the LRU cache to be
completely flushed, but tables with a low frequency of updates/deletes can get a large
performance improvement by using a larger LRU cache size.

A.5 Query parameters

This set of parameters affects the QueryParser and search functionality.

A.5.1 DefaultColumn

DefaultColumn defines which column is used as the default column in the QueryParser syntax;
if this parameter is not set, the master column of the index is used. This name is a Lucene Field
name. Here is an example:

create index pages_lidx_all on pages p (value(p))


indextype is Lucene.LuceneIndex
parameters('PopulateIndex:false;DefaultColumn:text;SyncMode:Deferred;LogLevel:WARNING;Analyzer:org.apa
page/title'') "title",extractValue(object_value,''/page/revision/comment'')
"comment",extract(object_value,''/page/revision/text/text()'')
"text",extractValue(object_value,''/page/revision/timestamp'')
"revisionDate";FormatCols:revisionDate(day);IncludeMasterColumn:false;LobStorageParameters:PCTVERSION
0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

Note the correlation between DefaultColumn and ExtraCols. ExtraCols defines a Lucene
Field named "text" with a value calculated by the SQL expression
extract(object_value,''/page/revision/text/text()''); you can then use the Lucene
Field text as the default Field in the QueryParser syntax.

A.5.2 DefaultOperator

DefaultOperator defines which Boolean operator is used in the QueryParser syntax; if this
parameter is not set, the OR operator is the default.
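For example, to make multi-term queries conjunctive by default (so 'security breach' behaves like 'security AND breach'), assuming the parameter accepts Lucene's AND/OR operator names:

```sql
alter index emailbodytext parameters('DefaultOperator:AND');
```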

A.5.3 NormalizeScore

NormalizeScore is used during the Lucene Index scan to decide whether to track the
maximum score; the maximum score is then used to normalize the result of the lscore() operator
so it returns only values between 0 and 1. If you don't need a normalized score range, you
can avoid this computation and your query will be faster. Note that a non-normalized score
does not imply that the documents are not in order of relevance.
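A sketch against the emails table from section 3.1.4; with normalization disabled, the scan skips the maximum-score tracking, and lscore() values stay relevance-ordered but are no longer bounded by 1. The lscore(1) label pairing with lcontains(...,1) is an assumption modeled on Oracle Text's ancillary-operator convention:

```sql
alter index emailbodytext parameters('NormalizeScore:false');

select lscore(1) sc, subject
  from emails
 where lcontains(bodytext, 'security', 1) > 0;
```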

A.5.4 PreserveDocIdOrder

PreserveDocIdOrder is an internal parameter used by Lucene in some kinds of operators. If
you don't need the result to preserve Lucene Doc ID order rather than relevance order, you
can leave this value at false (the default) and some operators will be faster.

A.6 Highlight parameters

This set of parameters affects the lhighlight, phighlight and rhighlight functionality.

A.6.1 Formatter

Formatter defines a valid class name which implements the Lucene Formatter interface
and has a no-argument constructor; the default value is
org.apache.lucene.search.highlight.SimpleHTMLFormatter.

A.6.2 MaxNumFragmentsRequired

MaxNumFragmentsRequired defines the number of text fragments returned by the highlight
functions; the default value is 4.

A.6.3 FragmentSize

FragmentSize defines the size in characters of each fragment returned; the default value
is 100.

A.6.4 FragmentSeparator

FragmentSeparator defines a String used as the fragment separator; the default value is "...".
Note that you cannot use ";" or ":" as the fragment separator because they are used as
parameter and value delimiters in the alter index ... parameters(..) DDL statement.

B Lucene Domain Index Storage


The OJVMDirectory class creates a set of Oracle objects to represent the Lucene inverted index and the
Domain Index functionality. First it creates a table named IDX_NAME$T (IDX_NAME is the Lucene
Domain Index name used in the create index DDL statement) with this structure:
Name Null? Type
NAME NOT NULL VARCHAR2(30)
LAST_MODIFIED TIMESTAMP(6)
FILE_SIZE NUMBER(38)
DATA BLOB
DELETED CHAR(1)

There is also an index on the IDX_NAME$T.DELETED column to speed up purge operations.
To enqueue operations on the index, a DBMS_AQ queue IDX_NAME$Q is defined with its storage table
IDX_NAME$QT.
The IDX_NAME$Q queue has its payload defined as the LUCENE_MSG_TYP object type, defined
as:
Name Null? Type
RIDLIST SYS.ODCIRIDLIST
OPERATION VARCHAR2(32)

SYS.ODCIRIDLIST is a special structure defined by the ODCI API to hold the list of rowids changed by
a DML operation. OPERATION is one of the insert, delete, update, rebuild or optimize reserved
keywords. The rebuild and optimize operations are used with SyncMode:OnLine to perform these tasks
automatically using a background process.
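You can inspect the storage table directly; for example, for an index named IT1 this sketch lists the live Lucene files and their sizes (DELETED='N' marks live rows, as in the FILE_SIZE query shown in the tuning section):

```sql
select name, file_size, last_modified
  from it1$t
 where deleted = 'N'
 order by file_size desc;
```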

C JUnit test suites explained

C.1 DBTestCase base class

This is the base class for most of the included test suites.


It provides a connection pool using OracleDataSource with a minimum of two ready-to-use
connections, growing to 5; after this it will wait up to 20 seconds for a free connection. This
connection pool is created in the class constructor.
Utility methods provided by this class (each method uses its own SQLConnection, so they are
autonomous transactions):
• createTable(), creates a test table as follows (T1 is a constant value defined as TABLE):

create table T1 (
f1 number primary key,
f2 varchar2(200),
f3 varchar2(200),
f4 number)

• dropTable(), drops the table created above.


• createIndex(), adds a Lucene Domain Index to the previously created table as follows
(LogLevel, Analyzer, MergeFactor, ExtraCols and FormatCols are customizable at the class
level; after index creation MergeFactor is reduced to 2):

create index IT1 on T1(f2) indextype is lucene.LuceneIndex

parameters('LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;ExtraCols:F1;Forma

• dropIndex(), drops the previously created index.


• int insertRows(int startIndex, int endIndex), inserts a set of rows into the above table with
the F1 column varying from startIndex to endIndex. The F2 column is an English text
representation of F1, F4 is F1*10 and F3 is an English text representation of F1*10.
Returns the number of rows inserted. If there are problems, such as a primary key violation,
it rolls back the transaction.
• int deleteRows(int startIndex, int endIndex), deletes the set of rows where F1 is between
startIndex and endIndex.
Returns the number of rows deleted. If there are problems, it rolls back the transaction.
Note that deleting rows automatically updates the Lucene Index.
• int updateRows(int startIndex, int endIndex), updates the F2 column with its own value to
fire the ODCI update method on each row between startIndex and endIndex.
Returns the number of rows updated.
• findRows(int n), finds rows whose F2 matches a text representation of n using the
lcontains operator. It only tests for a result having 0 or more rows.
• long syncIndex(), performs a sync operation on the Lucene Domain Index, applying pending
changes (inserts, updates). If there are errors, usually caused by another transaction
holding an exclusive lock on a row being indexed, it rolls back the operation. The next
successful sync will apply the pending changes of failed operations.
Returns a long value with the number of milliseconds spent during sync.
• long optimizeIndex(), performs an optimize operation on the Lucene Domain Index, merging
segments into a new one. If there are errors, usually caused by another transaction holding
an exclusive lock on the index, it rolls back the operation.
Returns a long value with the number of milliseconds spent during optimize.

C.2 TestDBIndex

A simple test which creates a table and its index, performs insertions, sync, optimize and deletions,
and finally drops the index and table. Its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndex


[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 3.836 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 40 total char inserted: 415 avg text length: 10
[junit] Index synced: IT1 elapsed time: 265 ms.
[junit] Avg Sync time: 6
[junit] Index optimized: IT1 elapsed time: 40 ms.
[junit] Avg Optimize time: 1
[junit] Row deleted 40, from: 10 to: 49 elapsed time: 1303 ms. Avg time: 32 ms.
[junit] Index droped: IT1
[junit] Table droped: T1

C.3 TestDBIndexAddDoc

Performs several insertions and syncs, starting with 10 rows, then 90 and so on, ending with
3,000 insertions using the insertRows method of the DBTestCase base class. After each batch of
insertions it calls the syncIndex method, calculating the average sync time for each row inserted.
Its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexAddDoc


[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 64.696 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Index synced: IT1 elapsed time: 126 ms.
[junit] Inserted rows: 10 total char inserted: 49 avg text length: 4
[junit] Index synced: IT1 elapsed time: 142 ms.
[junit] Avg Sync time: 14
[junit] Inserted rows: 90 total char inserted: 988 avg text length: 10
[junit] Index synced: IT1 elapsed time: 374 ms.
[junit] Avg Sync time: 4
[junit] Inserted rows: 400 total char inserted: 9201 avg text length: 23
[junit] Index synced: IT1 elapsed time: 1276 ms.
[junit] Avg Sync time: 3
[junit] Inserted rows: 500 total char inserted: 11726 avg text length: 23
[junit] Index synced: IT1 elapsed time: 1601 ms.
[junit] Avg Sync time: 3
[junit] Inserted rows: 1000 total char inserted: 35950 avg text length: 35
[junit] Index synced: IT1 elapsed time: 4675 ms.
[junit] Avg Sync time: 4
[junit] Inserted rows: 3000 total char inserted: 110851 avg text length: 36
[junit] Index synced: IT1 elapsed time: 25480 ms.
[junit] Avg Sync time: 8
[junit] Index droped: IT1
[junit] Table droped: T1

C.4 TestDBIndexDelDoc

In its setup method this test case creates a table and fills it with 500 rows. Then it performs
deletion batches of 10, 90 and 400 rows, calculating the average time for each row deleted. Its
output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexDelDoc


[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 20.543 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 500 total char inserted: 10238 avg text length: 20
[junit] Index synced: IT1 elapsed time: 1643 ms.
[junit] Row deleted 10, from: 1 to: 10 elapsed time: 356 ms. Avg time: 35 ms.
[junit] Row deleted 90, from: 11 to: 100 elapsed time: 2535 ms. Avg time: 28 ms.
[junit] Row deleted 400, from: 101 to: 500 elapsed time: 11526 ms. Avg time: 28 ms.
[junit] Index droped: IT1
[junit] Table droped: T1

C.5 TestDBIndexParallel

This is a more complex test case to check concurrent access to the Lucene Domain Index. To do
this it creates several threads: some simulating batch insertions of 10 rows, others simulating
batch deletions of 10 rows, another simulating batch updates of 10 rows, and finally several
threads searching for rows every 0.5 seconds.
By default it creates 3 threads for each kind of operation, and each thread performs:
• 20 inserts
• 5 deletes
• 5 updates
• 100 searches
Each thread takes its own connection from the connection pool and does its job. If the fastSync
constant is true, after each successful insert and update it calls the syncIndex method to update
the Lucene Index; if fastSync is false, another thread is started performing an index sync every 1
second.
The test ends when all threads (inserts, deletes, updates) finish.
Here is part of its output:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexParallel


[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 97.7 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] FastSync: true
[junit] Deleter 1 deleting at block 70
[junit] Updater 1 updating at block 70
[junit] Inserter 2 inserting at block 90
[junit] No Row deleted at: 70 to: 79 elapsed time: 131 ms.
[junit] No Row updated at: 70 to: 79 elapsed time: 12 ms.
[junit] Searcher 2 searching row 30
[junit] Searcher 1 searching row 77
[junit] Not Found rows with: thirty elapsed time: 211 ms.
[junit] Not Found rows with: seventy-seven elapsed time: 170 ms.
[junit] Inserted rows: 10 total char inserted: 115 avg text length: 11
[junit] Searcher 2 searching row 62
[junit] Searcher 0 searching row 63
[junit] Searcher 1 searching row 49
[junit] Not Found rows with: sixty-two elapsed time: 64 ms.
[junit] Index synced: IT1 elapsed time: 283 ms.
[junit] Not Found rows with: sixty-three elapsed time: 215 ms.
[junit] Searcher 2 searching row 74
[junit] Not Found rows with: seventy-four elapsed time: 39 ms.
[junit] Not Found rows with: forty-nine elapsed time: 137 ms.
[junit] Searcher 1 searching row 95
[junit] Searcher 2 searching row 46
[junit] Found rows with: ninety-five elapsed time: 103 ms.
....
[junit] Updater 2 updating at block 20
[junit] No Row updated at: 20 to: 29 elapsed time: 3 ms.
[junit] Inserted rows: 10 total char inserted: 80 avg text length: 8
[junit] Searcher 0 searching row 97
[junit] Found rows with: ninety-seven elapsed time: 60 ms.
[junit] Index synced: IT1 elapsed time: 147 ms.
.....
[junit] Searcher 2 searching row 39
[junit] Searcher 1 searching row 84
[junit] Not Found rows with: thirty-nine elapsed time: 33 ms.
[junit] Not Found rows with: eighty-four elapsed time: 38 ms.
[junit] Updater 0 updating at block 90
[junit] Row updated 10, from: 90 to: 99 elapsed time: 16 ms. Avg time: 1 ms.
[junit] Index synced: IT1 elapsed time: 162 ms.
......
[junit] Inserted rows: 10 total char inserted: 125 avg text length: 12
[junit] Searcher 0 searching row 57
[junit] Searcher 1 searching row 28
[junit] Deleter 1 deleting at block 80
[junit] Searcher 2 searching row 64
[junit] No Row deleted at: 80 to: 89 elapsed time: 58 ms.
[junit] Not Found rows with: twenty-eight elapsed time: 112 ms.
[junit] Not Found rows with: fifty-seven elapsed time: 155 ms.
[junit] Index synced: IT1 elapsed time: 242 ms.
[junit] Searcher 0 searching row 98
[junit] Found rows with: ninety-eight elapsed time: 72 ms.
[junit] Not Found rows with: sixty-four elapsed time: 175 ms.
[junit] Searcher 0 searching row 27
[junit] Not Found rows with: twenty-seven elapsed time: 75 ms.
[junit] Searcher 1 searching row 5
[junit] Deleter 2 deleting at block 50
[junit] Searcher 2 searching row 84
[junit] Not Found rows with: eighty-four elapsed time: 20 ms.
[junit] Updater 2 updating at block 10
[junit] No Row deleted at: 50 to: 59 elapsed time: 28 ms.
[junit] Row updated 10, from: 10 to: 19 elapsed time: 36 ms. Avg time: 3 ms.
[junit] Found rows with: five elapsed time: 216 ms.
.................
[junit] Inserter 1 inserting at block 50
[junit] Found rows at: 50 position, ignoring insertions
[junit] Index droped: IT1
[junit] Table droped: T1

C.6 TestDBIndexSearchDoc

This test checks some special features of the lcontains operator, such as in-line pagination, sort-by
and filter-by expressions.
It first creates a table with 200 rows and then queries them; its output looks like:

[junit] Testsuite: org.apache.lucene.index.TestDBIndexSearchDoc


[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 14.001 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 746 ms.
[junit] testFilterAll()
[junit] Excecution time: 129 ms.
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 119 Score: 0.25453204 str: one hundred nineteen
[junit] 118 Score: 0.25453204 str: one hundred eighteen
[junit] 117 Score: 0.25453204 str: one hundred seventeen
[junit] 116 Score: 0.25453204 str: one hundred sixteen
[junit] 115 Score: 0.25453204 str: one hundred fifteen
[junit] 114 Score: 0.25453204 str: one hundred fourteen
[junit] 113 Score: 0.25453204 str: one hundred thirteen
[junit] 112 Score: 0.25453204 str: one hundred twelve
[junit] 111 Score: 0.25453204 str: one hundred eleven
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 721 ms.
[junit] testFilterBy()
[junit] Excecution time: 162 ms.
[junit] 103 Score: 1.0 str: one hundred three
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 101 Score: 0.28600293 str: one hundred one
[junit] 100 Score: 0.27352643 str: one hundred
....
[junit] 115 Score: 0.25453204 str: one hundred fifteen
[junit] 116 Score: 0.25453204 str: one hundred sixteen
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 751 ms.
[junit] testFilterByOrderBy()
[junit] Excecution time: 138 ms.
[junit] 120 Score: 0.9606395 str: one hundred twenty
[junit] 119 Score: 0.25453204 str: one hundred nineteen
....
[junit] 103 Score: 1.0 str: one hundred three
[junit] 102 Score: 0.25453204 str: one hundred two
[junit] 101 Score: 0.28600293 str: one hundred one
[junit] 100 Score: 0.27352643 str: one hundred
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 761 ms.
[junit] testPagination()
[junit] Excecution time: 193 ms.
[junit] 117 Score: 0.03489425 str: one hundred seventeen
[junit] 118 Score: 0.03489425 str: one hundred eighteen
....
[junit] 132 Score: 0.03489425 str: one hundred thirty-two
[junit] 134 Score: 0.03489425 str: one hundred thirty-four
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 200 total char inserted: 3262 avg text length: 16
[junit] Index synced: IT1 elapsed time: 743 ms.
[junit] testCountHits()
[junit] Excecution time: 53 ms.
[junit] Hits: 126
[junit] Index droped: IT1
[junit] Table droped: T1

C.7 TestQueryHits

This test is not autonomous because it requires an additional step before running. Before running it, create a
table and its Lucene index with:
create table test_source_big as (select * from all_source);
create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');

For 11g databases you can create a better-optimized Lucene index using some of the new SecureFiles LOB
features:

create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('FormatCols:line(0000);ExtraCols:line
"line";Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;LobStorageParameters:PCTVERSION
0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS FILESYSTEM_LIKE_LOGGING');

On 10g, running it as SCOTT, the TEST_SOURCE_BIG table will have 220731 rows in a typical
installation based on the database templates.
Using the above table, two tests check performance with a query which returns 18387 hits: one
calls the LuceneDomainIndex.countHits function and the other iterates over the result in pages of ten
rows, a typical scenario for web applications. Its output looks like:

[junit] Testsuite: org.apache.lucene.indexer.TestQueryHits


[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 2.656 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] iteration from: 13775 to: 13785
[junit] Step time: 791 ms.
[junit] iteration from: 13785 to: 13795
[junit] Step time: 49 ms.
[junit] iteration from: 13795 to: 13805
[junit] Step time: 40 ms.
[junit] iteration from: 13805 to: 13815
[junit] Step time: 44 ms.
[junit] iteration from: 13815 to: 13825
[junit] Step time: 40 ms.
[junit] iteration from: 13825 to: 13835
[junit] Step time: 42 ms.
[junit] iteration from: 13835 to: 13845
[junit] Step time: 41 ms.
[junit] iteration from: 13845 to: 13855
[junit] Step time: 50 ms.
[junit] iteration from: 13855 to: 13865
[junit] Step time: 41 ms.
[junit] iteration from: 13865 to: 13875
[junit] Step time: 41 ms.
[junit] Elapsed time: 1877
[junit] Hits: 18387
[junit] Elapsed time: 564

Note that the first iteration took more time because it includes parsing time and caching; also, to
simulate a real-world web application, an SQLConnection is taken from and returned to the pool on each
iteration.
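
The countHits call exercised by the first test can be sketched as follows; this is not taken verbatim from the test, and the search term 'function' is illustrative:

```sql
-- A minimal sketch: count the hits of a Lucene query without fetching any
-- rows. Assumes the SOURCE_BIG_LIDX index created above in this section.
SELECT LuceneDomainIndex.countHits('SOURCE_BIG_LIDX', 'function') AS hits
  FROM dual;
```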
D Functions, operators and utilities

D.1 lcontains ancillary operator

The lcontains operator is similar to the Oracle Text CONTAINS operator, but differs in its query argument and
supports an additional argument to define in-line sorting.
Syntax

LCONTAINS(
[schema.]column,
text_query VARCHAR2
[,sort VARCHAR2]
[,label NUMBER])
RETURN NUMBER;

[schema.]column
Specify the Lucene text column to be searched on. This column must have a Lucene
Domain Index associated with it.
text_query
Specify a query in Lucene Query Parser syntax. In addition to Lucene Query Parser
syntax, Lucene Domain Index supports in-line pagination at lcontains; to use it, the
query must start with rownum:[nn TO mm] AND, where nn and mm are rownum values
of the result rows which will be returned. As in Oracle syntax, rownum starts with 1, and the
boundaries are inclusive, which means that for 20 to 30 we get 11 rows.
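In-line pagination can be sketched as follows; the table and column names are borrowed from the lscore example later in this appendix:

```sql
-- A minimal sketch: return result rows 20 to 30 (11 rows, since both bounds
-- are inclusive) of the Lucene query "security". Assumes an emails table
-- whose bodytext column has a Lucene Domain Index.
SELECT subject
  FROM emails
 WHERE lcontains(bodytext, 'rownum:[20 TO 30] AND security', 1) > 0;
```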
Following is an excerpt of the Lucene Query Parser syntax.
Terms
A query is broken up into terms and operators. There are two types of terms:
Single Terms and Phrases.
A Single Term is a single word such as "test" or "hello".
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
Multiple terms can be combined together with Boolean operators to form a more
complex query (see below).
Note: The analyzer used to create the index will be used on the terms and phrases
in the query string. So it is important to choose an analyzer that will not interfere
with the terms used in the query string.
Fields
Lucene supports fielded data. When performing a search you can either specify a
field, or use the default field. The field names and default field is implementation
specific.
You can search any field by typing the field name followed by a colon ":" and then
the term you are looking for.
As an example, let's assume a Lucene index contains two fields, title and text and
text is the default field. If you want to find the document entitled "The Right Way"
which contains the text "don't go this way", you can enter:

title:"The Right Way" AND text:go

or

title:"Do it right" AND right

Since text is the default field, the field indicator is not required.
Note: The field is only valid for the term that it directly precedes, so the query
title:Do it right

will only find "Do" in the title field. It will find "it" and "right" in the default field (in
this case the text field).
Term Modifiers
Lucene supports modifying query terms to provide a wide range of searching
options.
Wildcard Searches
Lucene supports single and multiple character wildcard searches within single terms
(not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single
character replaced. For example, to search for "text" or "test" you can use the
search:

te?t

Multiple character wildcard searches look for 0 or more characters. For example, to
search for test, tests or tester, you can use the search:

test*

You can also use the wildcard searches in the middle of a term.

te*t

Note: You cannot use a * or ? symbol as the first character of a search.


Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance
algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single
word Term. For example to search for a term similar in spelling to "roam" use the
fuzzy search:

roam~

This search will find terms like foam and roams.


Starting with Lucene 1.9 an additional (optional) parameter can specify the required
similarity. The value is between 0 and 1, with a value closer to 1 only terms with a
higher similarity will be matched. For example:

roam~0.8

The default that is used if the parameter is not given is 0.5.


Proximity Searches
Lucene supports finding words that are within a specified distance of each other. To do a
proximity search use the tilde, "~", symbol at the end of a Phrase. For example, to
search for "apache" and "jakarta" within 10 words of each other in a document
use the search:
"jakarta apache"~10

Range Searches
Range Queries allow one to match documents whose field(s) values are between
the lower and upper bound specified by the Range Query. Range Queries can be
inclusive or exclusive of the upper and lower bounds. Sorting is done
lexicographically.

mod_date:[20020101 TO 20030101]

This will find documents whose mod_date fields have values between 20020101
and 20030101, inclusive. Note that Range Queries are not reserved for date fields.
You could also use range queries with non-date fields:

title:{Aida TO Carmen}

This will find all documents whose titles are between Aida and Carmen, but not
including Aida and Carmen.
Inclusive range queries are denoted by square brackets. Exclusive range queries
are denoted by curly brackets.
Boosting a Term
Lucene provides the relevance level of matching documents based on the terms
found. To boost a term use the caret, "^", symbol with a boost factor (a number) at
the end of the term you are searching. The higher the boost factor, the more
relevant the term will be.
Boosting allows you to control the relevance of a document by boosting its term.
For example, if you are searching for

jakarta apache

and you want the term "jakarta" to be more relevant boost it using the ^ symbol
along with the boost factor next to the term. You would type:

jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also
boost Phrase Terms as in the example:

"jakarta apache"^4 "Apache Lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can
be less than 1 (e.g. 0.2)
Boolean Operators
Boolean operators allow terms to be combined through logic operators. Lucene
supports AND, "+", OR, NOT and "-" as Boolean operators(Note: Boolean operators
must be ALL CAPS).
The OR operator is the default conjunction operator. This means that if there is no
Boolean operator between two terms, the OR operator is used. The OR operator
links two terms and finds a matching document if either of the terms exist in a
document. This is equivalent to a union using sets. The symbol || can be used in
place of the word OR.
To search for documents that contain either "jakarta apache" or just "jakarta" use
the query:

"jakarta apache" jakarta

or

"jakarta apache" OR jakarta

AND
The AND operator matches documents where both terms exist anywhere in the text
of a single document. This is equivalent to an intersection using sets. The symbol
&& can be used in place of the word AND.
To search for documents that contain "jakarta apache" and "Apache Lucene" use
the query:

"jakarta apache" AND "Apache Lucene"

+
The "+" or required operator requires that the term after the "+" symbol exist
somewhere in a the field of a single document.
To search for documents that must contain "jakarta" and may contain "lucene" use
the query:

+jakarta lucene

NOT
The NOT operator excludes documents that contain the term after NOT. This is
equivalent to a difference using sets. The symbol ! can be used in place of the word
NOT.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use
the query:

"jakarta apache" NOT "Apache Lucene"

Note: The NOT operator cannot be used with just one term. For example, the
following search will return no results:
NOT "jakarta apache"
-
The "-" or prohibit operator excludes documents that contain the term after the "-"
symbol.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use
the query:

"jakarta apache" -"Apache Lucene"

Grouping
Lucene supports using parentheses to group clauses to form sub queries. This can
be very useful if you want to control the boolean logic for a query.
To search for either "jakarta" or "apache" and "website" use the query:

(jakarta OR apache) AND website

This eliminates any confusion and makes sure that website must exist and
either term jakarta or apache may exist.
Field Grouping
Lucene supports using parentheses to group multiple clauses to a single field.
To search for a title that contains both the word "return" and the phrase "pink
panther" use the query:

title:(+return +"pink panther")

Escaping Special Characters


Lucene supports escaping special characters that are part of the query syntax. The
current list of special characters is

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters use the \ before the character. For example, to search for
(1+1):2 use the query:

\(1\+1\)\:2

sort
The sort string has the syntax sortField1[[:(ASC|DESC)]:[type]], for example
revisionDate:DESC:string; ASC or DESC is optional, as is the type, which is either string, int or
float. Multiple fields can be used for sorting; the sort strings must be separated by commas, for
example revisionDate:DESC:string,title:ASC.
If you do not include the sort argument at the lcontains operator, Lucene's natural order, which is
score descending, will be used. For any other field ASC is the default sort order.
label
A number used in conjunction with the lscore operator to identify which lcontains
operator is used for each lscore.
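
The sort and label arguments can be combined as in the following sketch; the table and column names are borrowed from the lscore example below:

```sql
-- A minimal sketch: search bodytext, sort the hits by subject ascending, and
-- tag the lcontains with label 1 so lscore(1) can retrieve its score.
-- Assumes an emails table with a Lucene Domain Index on bodytext.
SELECT lscore(1) sc, subject
  FROM emails
 WHERE lcontains(bodytext, 'security', 'subject:ASC', 1) > 0;
```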

D.2 lscore ancillary operator

Use the LSCORE operator in a SELECT statement to return the score values produced by a
LCONTAINS query. The LSCORE operator can be used in a SELECT, ORDER BY, or GROUP BY
clause.
Syntax

LSCORE(label NUMBER)

label
Specify a number to identify the score produced by the query. Use this number to identify
the LCONTAINS clause which returns this score.
Example
SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
where lcontains(bodytext,'security',1)>0;

D.3 lhighlight ancillary operator

Use the LHIGHLIGHT operator in a select statement to return a highlighted version of the
master column of the index associated with the LCONTAINS query. For now, only highlighting
of the master column of the index is supported, and the return value of this
function is a VARCHAR2 data type with the text highlighted. The VARCHAR2 limitation is not a big
problem because the highlighted text is usually a small part of the original text of the column,
shown to the user as a preview of the original document.
Syntax

LHIGHLIGHT(label NUMBER):VARCHAR2

label
Specify a number to identify the score produced by the query. Use this number to identify
the LCONTAINS clause which returns this score.
Example

SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;

D.4 phighlight pipeline table function

The PHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2
or CLOB of the input query. Columns not included in the cols argument are not affected and
are returned as is.
Syntax

PHIGHLIGHT(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN
VARCHAR2) RETURN ANYDATASET

index_name
Specify a Lucene Index to use.
qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the
Lucene Domain Index extension for pagination.
cols
A comma-separated list of columns to highlight; note that column names are uppercased
unless you use a column alias.
stmt
Any SQL text of the query to execute via the DBMS_SQL package. Remember to use two
single quotes to represent a SQL single quote inside the string. Columns returned by this
query should be mappable to the String, BigDecimal, Timestamp, CLOB, TIMESTAMP,
TIMESTAMPTZ and TIMESTAMPLTZ Java types; this means, for example, that for a table
with a column VARCHAR2(40) the associated Java type inside the OJVM is String, so
it can be highlighted or returned by this pipeline table function.
Example
SELECT * FROM
TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select lscore(1) sc,e.* from eMails e where lcontains(bodytext,''rownum:[1 TO 10]
AND (security OR mysql)'',''subject:ASC'',1)>0'
));

D.5 rhighlight pipeline table function

The RHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2
or CLOB of the input query. Columns not included in the cols argument are not affected and
are returned as is. This is a variant of PHighlight which requires an additional
argument (rType) telling this function the type that will be returned. This version is free of
any kind of SQL injection, and the RDBMS can start several invocations in parallel based on
the information of the last argument.
Syntax

RHIGHLIGHT(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN
VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET

index_name
Specify a Lucene Index to use.
qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the
Lucene Domain Index extension for pagination.
cols
A comma-separated list of columns to highlight; note that column names are uppercased
unless you use a column alias.
rType
The name of the collection type to be returned by the RHighlight table function, usually
declared as "colType TABLE OF aRowType".
rws
Any SQL query, wrapped by the CURSOR function if you are using SQL*Plus for example,
or a JDBC ResultSet passed as setObject(n,rs) if you are using a Java application.
Columns returned by this query should be mappable to the String, BigDecimal, Timestamp,
CLOB, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Java types; this means, for
example, that for a table with a column VARCHAR2(40) the associated Java type inside
the OJVM is String, so it can be highlighted or returned by this pipeline table function.
Example

CREATE TYPE EMAILR AS OBJECT
( sc NUMBER,
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB);

CREATE OR REPLACE TYPE EMAILRSET AS TABLE OF EMAILR;


SELECT * FROM
TABLE(rhighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'EMAILRSET',
CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,'rownum:[1 TO 10] AND (security OR
mysql)','subject:ASC',1)>0)
));

D.6 MoreLike.this function

The MoreLike.this function has two declarations: one using an index_name argument, which uses the
currently connected user's schema, and one using an owner,index_name pair for using an index
in another database schema.
Syntax

FUNCTION this(index_name IN VARCHAR2,
x IN ROWID,
f IN NUMBER DEFAULT 1,
t IN NUMBER DEFAULT 10,
minTermFreq IN NUMBER DEFAULT 2,
minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist

index_name
Specify a Lucene Index to use.
x
ROWID used as a pivot; it defines which row is used to extract the text whose terms drive
the More Like This Lucene functionality. The DefaultColumn parameter of the index is used to
define the column used to get the text; only columns of type VARCHAR2, CLOB or
XMLType are supported.
f,t
From/to pagination information; default values are 1 and 10.
minTermFreq,minDocFreq
minTermFreq is the frequency below which terms will be ignored in the source doc;
minDocFreq is the frequency at which words will be ignored if they do not occur in at
least this many docs. Default values are 2 and 5.
SYS.odciridlist
An array of ROWIDs, which can be wrapped with the pipeline table function ridlist_table
for selecting its values, for example (select * from table(ridlist_table(ridlist))).

FUNCTION this(owner IN VARCHAR2,
index_name IN VARCHAR2,
x IN ROWID,
f IN NUMBER DEFAULT 1,
t IN NUMBER DEFAULT 10,
minTermFreq IN NUMBER DEFAULT 2,
minDocFreq IN NUMBER DEFAULT 5) RETURN sys.odciridlist
owner
Database schema owner of the index, for example SCOTT. The previous overloaded definition
of this function queries the ALL_INDEXES system view to find the owner of the
index name.
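
A complete call can be sketched as follows; the table, column and index names are hypothetical, reusing the T1/IT1 names from the test sections:

```sql
-- A minimal sketch: fetch the 10 rows most similar to a pivot row.
-- T1/IT1 and the column f1 used to locate the pivot row are hypothetical.
SELECT *
  FROM TABLE(ridlist_table(
         MoreLike.this('IT1',
                       (SELECT rowid FROM t1 WHERE f1 = 50),
                       1, 10)));
```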

D.7 lfacets aggregate function

The lfacets() aggregate function has one argument, which is a comma-separated list of the index
name and the category and sub-category to be queried.
Syntax

FUNCTION lfacets(input IN VARCHAR2) RETURN LUCENE.agg_tbl

input
A comma-separated list including the index name, a category and, optionally, a sub-category.
The category and sub-category are in Lucene Query Parser syntax, including the indexed
column name (the Lucene field), for example text:(Ciencias naturales y formales), line:[1 TO 10]
and so on. When a category and sub-category are present, the ODCI API starts the
computation by calculating the bit set of the main category and then iterates over each
sub-category, doing a bit-and operation between the two bit sets.
LUCENE.AGG_TBL
A TABLE OF agg_attributes, where AGG_ATTRIBUTES is an object type with two fields,
qryText VARCHAR2(4000) and hits NUMBER. When a category and sub-category are
passed as arguments, the return value will be a table with each row representing the
cardinality of the intersection between the category and the sub-category, that is, a table
with a number of rows equal to the number of sub-categories.
To help format the output in a traditional query there is the function ljoin(), which
receives as input an agg_tbl type plus a separator character and returns a string with all the
rows; here is the syntax:

FUNCTION ljoin(i_tbl in agg_tbl,
i_glue IN VARCHAR2 := ',') RETURN VARCHAR2

i_tbl
A LUCENE.agg_tbl table to scan.
i_glue
A VARCHAR2 string to use as separator, default value ",".
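
The two functions can be combined as in the following sketch; the index name and Lucene fields are illustrative, reusing the SOURCE_BIG_LIDX example from appendix C.7:

```sql
-- A minimal sketch: compute the hits of each sub-category within the main
-- category and join the resulting rows into a single string. The text and
-- line fields match the FormatCols/ExtraCols setup of SOURCE_BIG_LIDX.
SELECT ljoin(lfacets('SOURCE_BIG_LIDX,text:(function),line:[1 TO 10]'), '; ')
  FROM dual;
```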

D.8 index_terms pipeline table function

The index_terms() pipeline table function returns a list of Lucene term values and their
frequency. It has two arguments: the first argument is the Lucene Domain Index name and the
second argument a Lucene term name.
Syntax

FUNCTION index_terms(index_name VARCHAR2, term_name VARCHAR2) RETURN
LUCENE.term_info_set

index_name
Lucene index name, with the syntax SCHEMA.IDX_NAME, or IDX_NAME if the current user is
the owner.
term_name
Lucene term name, if this argument is NULL the information of all Lucene Index terms will
be returned.
LUCENE.term_info_set
A TABLE OF term_info, where TERM_INFO is an object type with two fields, term
VARCHAR2(4000) and docFreq NUMBER(10). This table can be easily iterated with a
traditional SELECT FROM construction, for example:

SQL> select * from (select * from table(index_terms('SOURCE_BIG_LIDX',null)) order by
docFreq desc) where rownum<=10;
TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241

The natural order returned by index_terms() is ordered by term_name:term_value.

D.9 high_freq_terms pipeline table function

The high_freq_terms() pipeline table function returns the Top-N most frequent Lucene term
values and their frequency. It has three arguments: the first argument is the Lucene Domain
Index name, the second argument is a Lucene term name, and the last argument how many Top-N
terms should be returned.
Syntax

FUNCTION high_freq_terms(index_name VARCHAR2, term_name VARCHAR2,
num_terms NUMBER) RETURN LUCENE.term_info_set

index_name
Lucene index name, with the syntax SCHEMA.IDX_NAME, or IDX_NAME if the current user is
the owner.
term_name
Lucene term name, if this argument is NULL the information of all Lucene Index terms will
be returned.
num_terms
How many Top-N high frequency terms should be returned.
LUCENE.term_info_set
A TABLE OF term_info, where TERM_INFO is an object type with two fields, term
VARCHAR2(4000) and docFreq NUMBER(10). This table can be easily iterated with a
traditional SELECT FROM construction, for example:

SQL> select * from table(high_freq_terms('SOURCE_BIG_LIDX',null,10));

TEXT:in 24952
TEXT:varchar 16996
...
TEXT:return 6241

The natural order returned by high_freq_terms() is ordered by docFreq descending.


Note that this result is similar to the example for index_terms(), but this function is more
efficient.

D.10 DidYouMean package


The DidYouMean package provides dictionary creation and querying for the Lucene Did You Mean
functionality in Lucene Domain Index. This package has a procedure to create or update
the dictionary and a function to query it.

Syntax

PROCEDURE indexDictionary(index_name VARCHAR2, spellColumns VARCHAR2 DEFAULT
null, distancealg IN VARCHAR2 DEFAULT 'Levenstein')
PROCEDURE indexDictionary(owner IN VARCHAR2, index_name VARCHAR2, spellColumns
VARCHAR2 DEFAULT null, distancealg IN VARCHAR2 DEFAULT 'Levenstein')

owner
Lucene index owner.
index_name
Lucene index name.
spellColumns
Lucene Domain Index columns to be included in Did You Mean dictionary.
distanceAlg
Distance Algorithm used when create the dictionary, possible values Levenstein,
NGram or Jaro, default Levenstein.

This procedure updates the Lucene Domain Index structure, adding a new field that stores
the information required for the Did You Mean functionality.

FUNCTION suggestwords
(
owner IN VARCHAR2,
index_name IN VARCHAR2,
cmpval IN VARCHAR2,
highlight IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2

owner
Lucene index owner.
index_name
Lucene index name.
cmpval
String with values to be replaced by Did You Mean algorithm.
highlight
Tag using for highlighting if it is not null, for example i will be used to return the
tag <i>text</i>
distanceAlg
Distance Algorithm used when create the dictionary, possible values Levenstein,
NGram or Jaro, default Levenstein.
This function queries the Lucene Domain Index to compute Did You Mean words for the
input string.
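
A build-then-query sequence can be sketched as follows; the index name, owner and the misspelled input are hypothetical:

```sql
-- A minimal sketch: build the dictionary for index IT1 (owned by SCOTT),
-- then ask for suggestions for the misspelled word 'securty', highlighted
-- with the <i> tag. Run from SQL*Plus; IT1/SCOTT are example names.
exec DidYouMean.indexDictionary('IT1')
SELECT DidYouMean.suggestwords('SCOTT', 'IT1', 'securty', 'i') FROM dual;
```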

E Project Change Log

2.9.2.1.0 Production Release based on Lucene 2.9 (2.9.2) core base

• Added elapsed time information when log level is INFO


• Removed deprecated usage of LUCENE_CURRENT constant
• Fixed facets inconsistency due to ignoring the internal parameter ColName
• Initial implementation of DidYouMean functionality contributed by Pedro Pinheiro
• Temporary fix until Lucene defines clear semantics for Directory.fileLength (see Lucene issue
2316)

2.9.1.1.0 Production Release based on Lucene 2.9 (2.9.1) core base

• New Lucene Core base libraries


• Full Lucene Test Suites certified
• Fixed a bug that enqueued more rowids than required when using OnLine mode and the ExtraTabs,
WhereCondition parameters
• Fixed operator priority when WhereCondition has an OR operator
• DefaultUserDataStore now uses an array of cached fields to improve performance
• The Spanish Analyzer uses the latest ASCIIFoldingFilter
• high_freq_terms(idx_name,term,max_num_term) pipeline table function was added to return
high-frequency terms and the associated docFreq value
• index_terms(idx_name,term) pipeline table function was added to return a list of terms and
their associated frequency
• DefaultUserDataStore now have support for ANALYZED, ANALYZED_WITH_VECTORS,
ANALYZED_WITH_OFFSETS, ANALYZED_WITH_POSITIONS and
ANALYZED_WITH_POSITIONS_OFFSETS Lucene Field option values
• OJVMLock was replaced by SingleInstanceLockFactory for per-instance locking; cross-session
locking is implemented by SELECT FOR UPDATE functionality
• An automatic upgrade from 2.9.0 is possible without index deletion or rebuild; you have to
execute:

ant upgrade-domain-index
ant ncomp-lucene-ojvm (10g only)
ant jit-lucene-classes (11g only)

2.9.0.1.0 Production release based on Lucene 2.9.0 core base, 29/Sep/09

• Tested with Oracle 11gR2, 11gR1 and 10.2 databases


• DefaultUserDataStore does SAX parsing to get text nodes and attributes from an XMLType
value.
• A SimpleLRUCache is used to load rowids and their associated Lucene doc ids; this reduces
memory consumption when querying very big tables. A new parameter has been added,
CachedRowIdSize (by default 10000), to control the size of the LRU cache.
• Lucene Domain Index core was updated to use TopFieldCollector and to avoid computation
time when lscore() is not used.
• Two new parameters have been added: NormalizeScore, which controls when to track the Max
Score, and PreserveDocIdOrder when querying; both parameters are a consequence of the new
Lucene Collector API and boost performance when querying.
• A table alias L$MT is defined for the master table associated with the index, to be used in
complex queries to associate columns from master tables and columns from dependent tables
2.4.1.1.0 (maintenance release based on Lucene 2.4.1, 27/Mar/09)

• Internal parameters are no longer stored in system views, and PopulateIndex:false is forced
• After every sync, files marked as deleted are now purged to free BLOB storage
• Added the lfacets aggregate function for doing facets
• The CountHits function no longer requires a sort argument
• Filters are stored/retrieved using only the QueryParser.toString() key
• The UN_TOKENIZED format string in the DefaultUserDataStore class was replaced by NOT_ANALYZED
or NOT_ANALYZED_STORED according to the new Lucene definitions.
• Fixed a bug when sync tries to process more than 32767 enqueued rowids.
• Added parameters for highlighting functions Formatter, MaxNumFragmentsRequired,
FragmentSeparator and FragmentSize.
• Added the PerFieldAnalyzer parameter to use an independent Analyzer for each column.
• Added a sample custom Formatter, org.apache.lucene.search.highlight.MyHTMLFormatter

2.4.1.0.0 (first release based on Lucene 2.4.1, 9/Mar/09)

• Fixed a compatibility problem between the 10g/11g SQL Date representations in pipeline table
functions.

2.4.0.1.0 (maintenance release based on Lucene 2.4.0, 10/Jan/09)

• Added the Rhighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, rType IN
VARCHAR2, rws IN SYS_REFCURSOR) RETURN ANYDATASET pipeline table function
• Added the Phighlight(index_name VARCHAR2, qry VARCHAR2, cols VARCHAR2, stmt IN
VARCHAR2) RETURN ANYDATASET pipeline table function
• Added the lhighlight(NUMBER):VARCHAR2 ancillary operator
• Removed usage of deprecated Lucene APIs (Hits and IndexWriter, for example)
• Usage of the FIRST_ROWS optimizer hint to decide how many rows to load at first
• The sync, optimize and rebuild interfaces now use index_name or [owner,index_name] arguments
• A better build system to build Lucene Domain Index from sources
• More tests
• Tested against 11.1.0.7 and 10.2.0.3
• See the online docs for usage of FIRST_ROWS and the lhighlight() operator

2.4.0.0.0 (production release based on Lucene 2.4.0, 10/10/08)

• Added parameter for CLOB encoding
• More Like This function
• NGram analyzer
• EnglishWikipediaAnalyzer
• DataStore interface include API for setting current connection
• The analyzers, queries, snowball and WikiPedia contrib packages are now required

2.3.2.0.0 (binary release based on Lucene 2.3.2, 1/Jun/08)

• Compiled against Lucene 2.3.2 production release
• Used latest API for merging based on RAM usage
• Use the Writer for deletions during Sync
• Confirmed the 4x indexing improvement reported by the Lucene dev group
• Fixed a workaround which changed the order of the rowids in the ODCIRidList
• Added a Spanish WikiPedia Analyzer for testing
• Reports IOException instead of RuntimeException to signal EOF or File Not Found
• Decoupled the Flush functionality from the TableIndexer

2.2.0.2.2 (fixpack for 2.2.0.2.0 release, 5/Apr/08)

• Added rowid to Lucene doc id caching.
• Usage of LoadFirstFieldSelector during Document loading to load only the rowid field.
• Added a test suite which indexes a Wikipedia dump inside the OJVM.

2.2.0.2.1 (fixpack for 2.2.0.2.0 release, 12/Dec/07)

• DefaultUserDataStore requires usage of the XPath text() expression for getting only the
textual value
• Added logging of the SQL being executed by the table indexer
• Changed document logging to the FINER level
• More pre-defined mappings in DefaultUserDataStore for the NUMBER, BINARY_FLOAT,
BINARY_DOUBLE, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Oracle types.
• New parameter PopulateIndex:[true|false] to control whether the Lucene index is populated
at creation time.
• New parameter IncludeMasterColumn:[true|false] to choose whether or not to index the
master column; useful with Virtual Columns and XMLType.
• New parameter BatchCount:integer to choose how many rows are enqueued per batch for
indexing when using create index ... parameters('SyncMode:OnLine');
• Creating an index with SyncMode:OnLine causes the Lucene Domain Index to enqueue
batches of "BatchCount" rows for indexing by an AQ PL/SQL callback in the background. The
index is immediately ready for querying after creation.
• Batch rowid indexing is done using a pipeline function.
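Combining the points above, an online-synchronized index could be created as follows (the table, column, and index names are illustrative, and the ';'-separated multi-parameter syntax is an assumption extrapolated from the single-parameter examples elsewhere in this document):

```sql
-- SyncMode:OnLine defers population to background AQ callbacks,
-- which enqueue and index batches of BatchCount rowids; the index
-- is queryable immediately after creation.
create index emails_body_lidx on emails(body)
  indextype is lucene.LuceneIndex
  parameters('SyncMode:OnLine;BatchCount:500');
```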

2.2.0.2.0 (third major release synchronized with Lucene 2.2.0, 12/Dec/07)

Binary download (see package ojvm):
http://sourceforge.net/project/showfiles.php?group_id=56183
CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm
• Sort by column passed via the lcontains(col, query_parser_str, sort_str, corr_id) syntax
• Logging support using Java Util Logging package
• JUnit test suites emulating middle tier environment
• Support for online rebuild and optimize for SyncMode:OnLine indexes
• XMLDB Export
• AutoTuneMemory parameter for replacing MaxBufferedDocs parameter
• Functional column support
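As a hedged illustration of the four-argument lcontains() syntax listed above (the table and columns are hypothetical, and the exact sort-string format is an assumption; only the argument order comes from this document):

```sql
-- lcontains(col, query_parser_str, sort_str, corr_id): the fourth
-- argument is the correlation number used by ancillary operators.
-- 'F1:DESC' is an assumed sort-string format; check the online docs.
select f1, f2
  from t1
 where lcontains(f2, 'lucene AND index', 'F1:DESC', 1) > 0;
```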

2.2.0.1.1 (second release, 27/Sep/07 05:39 AM)

Binary download:
https://issues.apache.org/jira/secure/attachment/12366661/ojvm-09-27-07.tar.gz
CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm

• LuceneDomainIndex.countHits() function to replace the select count from .. where
lcontains(..)>0 syntax.
• Support inline pagination via the lcontains(col, 'rownum:[n TO m] AND ...') function
• Rounding and padding support for DATE, TIMESTAMP, NUMBER, FLOAT, VARCHAR2 and
CHAR columns
• ODCI API array DML support
• BLOB parameter support
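A sketch of the two query-side features above (the index and table names are hypothetical, and the countHits() argument list is an assumption; only the function name and the rownum pseudo-field syntax come from this document):

```sql
-- Assumed call shape: count hits without the full
-- "select count ... where lcontains(..)>0" scan.
select LuceneDomainIndex.countHits('IT1', 'oracle AND lucene') from dual;

-- Inline pagination: restrict the result to hits 10 through 20
-- using the rownum pseudo-field inside the Lucene query itself.
select f1 from t1
 where lcontains(f2, 'rownum:[10 TO 20] AND oracle') > 0;
```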

2.2.0.1.0 (first release synchronized with Lucene 2.2.0, 14/Sep/07 06:44 AM)

CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm

• Synchronized with the latest Lucene 2.2.0 production release
• Replaced the in-memory storage (a Vector-based implementation) with direct BLOB IO,
reducing memory usage for large indexes.
• Support for user data stores: you are no longer limited to indexing one column at a time (a
Data Cartridge API limitation on 10g); you can now index multiple columns of the base table
together with columns of related, joined tables.
• User Data Stores can be customized by the user: by writing a simple Java class, users can
control which columns are indexed, the padding used, or any other functionality prior to the
document-adding step.
• There is a DefaultUserDataStore which gets all columns of the query and builds a Lucene
Document with Fields representing each database column; these fields are automatically
padded if they hold NUMBER data or rounded if they hold DATE data, for example.
• The lcontains() SQL operator supports the full Lucene QueryParser syntax to provide access
to all indexed columns; see the examples below.
• Support for the DOMAIN_INDEX_SORT and FIRST_ROWS hints: if you want rows ordered by
the lscore() operator (ascending or descending), the optimizer hint assumes that the Lucene
Domain Index returns rowids in the proper order, avoiding an inline view to sort them.
• Automatic index synchronization using an AQ callback.
• Lucene Domain Index creates an extra table named IndexName$T and an Oracle AQ named
IndexName$Q with its storage table IndexName$QT in the user's schema, so you can alter
the storage preferences if you want.
• The ojvm project is in SourceForge.net CVS, so anybody can get it and collaborate
• Tested against 10gR2 and 11gR1 databases.
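Putting the points above together, a sketch of a query combining QueryParser field syntax, the score ancillary operator, and the sort hint (table t1 and columns f1, f2, f3 are hypothetical; the correlation-number pattern mirrors Oracle's standard contains/score ancillary-operator convention):

```sql
-- DOMAIN_INDEX_SORT tells the optimizer the domain index already
-- returns rowids ordered by score, so no inline view is needed.
select /*+ DOMAIN_INDEX_SORT */ f1, lscore(1) as score
  from t1
 where lcontains(f2, 'f2:oracle AND f3:lucene', 1) > 0
 order by lscore(1) desc;
```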

2.0.0.1.3 (third release, 09/Jan/07 11:40 AM)

https://issues.apache.org/jira/secure/attachment/12348574/ojvm-01-09-07.tar.gz
• The Data Cartridge API is used without column data, to reduce the data stored in the queue
of changes and to speed up the synchronize method.
• Query hits are cached, associated with the index search and the string returned by the
QueryParser.toString() method.
• If no ancillary operator is used in the select, the score list is not stored.
• The "Stemmer" argument is recognized as a parameter giving the argument for the Snowball
analyzer, for example:
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English');

• Before installing the ojvm extension it is necessary to execute "ant jar-core" in the snowball
directory.
• IndexWriter.setUseCompoundFile(false) is called to use multi-file storage (faster than the
compound file format) because there is no file-descriptor limitation inside the OJVM; BLOBs
are used instead of Files.
• Files are marked for deletion and purged when the Sync or Optimize methods are called.
• BLOBs are created and populated in one call using the Oracle SQL RETURNING clause.
• A testing script using the OE sample schema, with query comparisons against an Oracle Text
ctxsys.context index.

2.0.0.1.2 (second release, 20/Dec/06 02:03 PM)

https://issues.apache.org/jira/secure/attachment/12347614/ojvm-12-20-06.tar.gz
This new release of the OJVMDirectory Lucene Store includes a fully functional Oracle Domain Index
with a queue for massive update/insert operations and many performance improvements.

2.0.0.1.1 (first release, 28/Nov/06 01:04 PM)

https://issues.apache.org/jira/secure/attachment/12345967/ojvm-11-28-06.tar.gz
• The complete API for the Oracle Domain Index was implemented, but the solution for the
contains operator outside the where clause is not good.
• I will implement a singleton solution for the OJVMDirectory object when it is used in read-only
mode, typically when a user performs select operations against tables which have columns
indexed with Lucene. This implementation will greatly increase the final performance because
the index reader will be ready for each select operation. Obviously I will check whether another
user or thread has made a write operation on the index, in order to reload the read-only
singleton.
• The queue for storing the changes to the index is not implemented yet; I'll add it shortly.

2.0.0.1.0 (initial implementation, 22/Nov/06 03:45 PM)

https://issues.apache.org/jira/secure/attachment/12345516/ojvm.tar.gz
