Lucene Domain Index
This project was originally sponsored by Lending Club, an online social lending network where people
can borrow and lend money among themselves based upon their affinities and/or social connections.
The project is licensed under the Apache License, Version 2.0:
http://www.apache.org/licenses/LICENSE-2.0
1. Introduction
1.1 What is Lucene
Apache Lucene is a high-performance, full-featured text search engine library written
entirely in Java. It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.
If Lucene is a pure Java framework, why not use it inside the Oracle Database JVM
environment?
2. Install
2.1 Requirements
• JDeveloper 11g, only if you want to edit the Java code.
• Ant 1.7.0
• Sun JDK 1.5.0_05/1.4.2 (the $ORACLE_HOME/jdk directory works fine as the Java Home for
compiling on 10g and 11g)
• Linux/Windows Oracle Database 10g 10.2/11g production
Edit your ~/build.properties file with your database values (Windows users can find the
build.properties file in the C:\Documents and Settings\username folder):
db.str=test
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4
db.str is the SQL*Net connect string for your target database; check it first with the tnsping utility.
This is an example environment setting before installing on an 11g database:
MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/11.1.0.6.0/db_1
ORACLE_SID=test
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG
# ant install-ojvm
# ant test-domain-index
# ant jit-lucene-classes
This target forces translation of all Lucene, Snowball and OJVMDirectory classes to
native code, instead of waiting for the database to compile them after detecting the
most-used classes and methods.
db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4
MAVEN_HOME=/usr/local/maven
ORACLE_BASE=/u01/app/oracle
ORACLE_HOME=$ORACLE_BASE/product/10.2.0/db_1
ORACLE_SID=orcl
JAVA_HOME=$ORACLE_HOME/jdk
PATH=$MAVEN_HOME/bin:$HOME/bin:$ORACLE_HOME/bin:$JAVA_HOME/bin:/usr/local/bin:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/usr/local/lib
CVS_RSH=ssh
umask 022
export PATH LD_LIBRARY_PATH ORACLE_HOME ORACLE_BASE ORACLE_SID JAVA_HOME CVS_RSH NLS_LANG
If you are re-installing the Oracle Lucene OJVM integration, first drop any Lucene Domain
Index not installed in Lucene's schema.
The default target will first drop the Lucene schema if it exists. Additionally (recommended
for production systems) you can run "ant ncomp-ojvm", which translates all Lucene classes
to C using JAccelerator, for example:
# ant ncomp-ojvm
# ant test-domain-index
cd /tmp
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm
- Copy to $LUCENE_ROOT/contrib
# cd $LUCENE_ROOT/contrib
# cp -rp /tmp/ojvm .
- Edit $LUCENE_ROOT/common-build.xml, adding a target for creating a jar file with the test
sources.
- Also edit the same file at the target named test, adding the db.usr, db.pwd and db.str
properties as system properties so they are available to the Lucene Domain Index JUnit suites.
################################################################
JUnit not found.
Please make sure junit.jar is in ANT_HOME/lib, or made available
to Ant using other mechanisms like -lib or CLASSPATH.
################################################################
</fail>
............
<!-- contrib/ojvm uses these system properties to connect to the target database -->
<sysproperty key="db.str" value="${db.str}"/>
<sysproperty key="db.usr" value="${db.usr}"/>
<sysproperty key="db.pwd" value="${db.pwd}"/>
............
<delete file="${build.dir}/test/junitfailed.flag" />
</target>
# cd $LUCENE_ROOT/contrib/ojvm
# ant jar-core
# ant jar-test
db.str=orcl
db.usr=LUCENE
db.pwd=LUCENE
dba.usr=sys
dba.pwd=change_on_install
javac.debug=true
javac.source=1.4
javac.target=1.4
db.str is the SQL*Net connect string for your target database; check it first with the tnsping
utility. Also note that on 11g databases user and password are case sensitive, so leave
LUCENE in uppercase.
- Upload your code to the database
# ant install-ojvm
You can generate the Lucene and OJVMDirectory Maven artifacts by following the previous steps,
then execute:
# ant generate-maven-artifacts
2.4 Optimizations
# ant ncomp-ojvm
First verify that your database parameter java_jit_enabled is TRUE. Oracle 11g
includes JIT technology which automatically translates the most-used Java methods
to native code. If you want to pre-compile all Lucene Java code instead of waiting
for the database to detect commonly used code, you can execute these targets:
ant jit-lucene-classes
ant jit-oracle-classes
IMPORTANT: Before you start using Lucene Domain Index, grant this to any Oracle user other
than LUCENE:
-- connected as sysdba
begin
dbms_java.grant_permission('SCOTT','SYS:java.util.logging.LoggingPermission',
'control', '' );
commit;
end;
/
Lucene Domain Index has two kinds of test suites to check that everything is OK after
installation.
The first test suite, which can be launched using Ant, is pure SQL and uses SQL*Plus; to launch
it simply execute:
test-domain-index:
[exec]
[exec] SQL*Plus: Release 11.1.0.6.0 - Production on Wed Dec 5 17:43:24 2007
[exec]
[exec] Copyright (c) 1982, 2007, Oracle. All rights reserved.
[exec]
[exec]
[exec] Connected to:
[exec] Oracle Database 11g Release 11.1.0.6.0 - Production
[exec]
[exec]
[exec] Table dropped.
[exec]
[exec]
[exec] Table created.
[exec]
[exec] SQL> Disconnected from Oracle Database 11g Release 11.1.0.6.0 - Production
[echo] See output at ../../build/testLuceneDomainIndex.txt
Except for the test which uses the test_source_small table, which writes its log to .trc files,
the others write their log information to the ../../build/testLuceneDomainIndex.txt file.
The second test suite is a set of JUnit tests that simulate middle-tier environments; it also
uses a connection pool. To start these suites run:
ojvm-test:
[echoproperties] #Ant properties
[echoproperties] #Wed Dec 05 17:56:30 ART 2007
.........
common.test:
[junit] Testsuite: org.apache.lucene.index.TestDBIndex
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 5.883 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] Table created: T1
[junit] Index created: IT1
[junit] Index altered: IT1
[junit] Inserted rows: 40 total bytes inserted: 421 avg text length: 10
[junit] Index synced: IT1 elapsed time: 249 ms.
[junit] Avg Sync time: 6
[junit] Index optimized: IT1 elapsed time: 46 ms.
[junit] Avg Optimize time: 1
[junit] Row deleted 41, from: 10 to: 50 elapsed time: 2005 ms. Avg time: 48 ms.
[junit] Index droped: IT1
[junit] Table droped: T1
[junit] ------------- ---------------- ---------------
.............
[junit] Testsuite: org.apache.lucene.indexer.TestQueryHits
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 4.158 sec
[junit]
[junit] ------------- Standard Output ---------------
[junit] iteration from: 13775 to: 13785
[junit] Step time: 1291 ms.
[junit] iteration from: 13785 to: 13795
[junit] Step time: 157 ms.
[junit] iteration from: 13795 to: 13805
[junit] Step time: 144 ms.
[junit] iteration from: 13805 to: 13815
[junit] Step time: 147 ms.
[junit] iteration from: 13815 to: 13825
[junit] Step time: 145 ms.
[junit] iteration from: 13825 to: 13835
[junit] Step time: 147 ms.
[junit] iteration from: 13835 to: 13845
[junit] Step time: 145 ms.
[junit] iteration from: 13845 to: 13855
[junit] Step time: 150 ms.
[junit] iteration from: 13855 to: 13865
[junit] Step time: 278 ms.
[junit] iteration from: 13865 to: 13875
[junit] Step time: 146 ms.
[junit] Elapsed time: 3159
[junit] Hits: 18387
[junit] Elapsed time: 653
[junit] ------------- ---------------- ---------------
[junit] Testsuite: org.apache.lucene.indexer.TestTableIndexer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.685 sec
[junit]
[delete] Deleting: /u01/src/lucene-2.2.0/build/contrib/ojvm/test/junitfailed.flag
BUILD SUCCESSFUL
Total time: 8 minutes 22 seconds
Or in 11g with:
Note that this argument is enclosed by "" to prevent Unix shell replacement.
3. Examples
IMPORTANT: Before you start using Lucene Domain Index, grant this to any
Oracle user other than LUCENE:
-- connected as sysdba
begin
dbms_java.grant_permission('SCOTT','SYS:java.util.logging.LoggingPermission',
'control', '' );
commit;
end;
/
Table example:
create table t1 (
f1 number,
f2 varchar2(200),
f3 varchar2(200),
f4 number unique);
For the previous table example you can also index extra columns by passing the information
as a parameter to the index, because Oracle 10g does not support Domain Indexes on compound
columns. Here is an example:
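The create index statement itself appears to be missing from this copy. A minimal sketch of what the following paragraph describes, assuming the index name it1 used elsewhere in this document (the exact parameter string is an assumption):

```sql
-- Sketch: master column f2, plus f1 indexed as the extra field "f1"
create index it1 on t1(f2)
indextype is lucene.LuceneIndex
parameters('ExtraCols:F1 "f1"');
```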
Creating an index with the ExtraCols parameter causes Lucene to index both columns: the
master column f2 indexed as F2, and F1 indexed as "f1". As you can see below in the query
section examples, the lcontains() operator provides Lucene's Query Parser syntax, which can
select multiple fields using e.g. f1:text. Using the ExtraCols parameter implies that the
create index operation performs a full scan on table t1 with a syntax like:
SELECT ROWID,F2,F1 "f1" FROM T1.
Because the ODCI API will not detect changes on columns other than the master, you need to
create a trigger that fires an update on the master column when a change on the ExtraCols list
is detected. Here is an example:
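The trigger example seems to have been lost in this copy. A minimal sketch of the kind of trigger described, assuming the t1 layout from this section (the trigger name and the no-op update are illustrative):

```sql
-- Sketch: touch the master column f2 whenever the extra column f1 changes,
-- so the ODCI API notifies Lucene Domain Index about the affected rowid.
create or replace trigger t1_f1_sync
before update of f1 on t1
for each row
begin
  :new.f2 := :new.f2; -- force an update event on the master column
end;
/
```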
Any change on f1 will also force a change on f2; ODCI will then notify Lucene that a
specific rowid was updated, and Lucene Domain Index, based on its parameter definition, will
update the inverted index to reflect the changes in both columns.
Lucene Domain Index supports indexing multiple columns and multiple tables which
can be joined in a natural form, by defining a list of tables with the ExtraTabs
parameter and a where condition with the WhereCondition parameter. Here is an example:
create table t2 (
f4 number primary key,
f5 VARCHAR2(200));
create table t1 (
f1 number,
f2 VARCHAR2(4000),
f3 number,
CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
REFERENCES t2(f4) ON DELETE cascade);
You can index both tables using t1 as master index definition with:
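The DDL itself is not present in this copy. Based on the parameter names described above (ExtraTabs, WhereCondition, the L$MT master-table alias) and the t1/t2 layout, a sketch might look like this (the exact parameter string is an assumption):

```sql
-- Sketch: t1.f2 is the master column; t2.f5 is indexed through the join
create index it1 on t1(f2)
indextype is lucene.LuceneIndex
parameters('ExtraCols:f1 "f1",f5 "f5";ExtraTabs:t2;WhereCondition:t2.f4=l$mt.f3');
```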
Note that tables t1 and t2 are joined directly by a foreign key, so t2 can be considered a
satellite table of t1. With this set of parameters, when the ODCI API detects a change on the
it1 master column (f3), a select like this is executed:
Parts of the query are injected by the Lucene Domain Index implementation, and other parts
are extracted from the ExtraCols and ExtraTabs parameters. The table alias L$MT is
automatically added by Lucene Domain Index to the master table; this alias is important
for creating complex joins with Object Tables which use the existsNode or extractValue
operators, functionality added starting with the 2.9.0.1.0 release.
With the above scenario, a trigger keeping the Lucene index synced with changes on any of
the columns defined in the ExtraCols parameter is a bit more complex; it requires a
combination of two triggers:
The first trigger is similar to the previous example; the second trigger, on the satellite
table, looks up all rowids in the master table that reference the satellite row, then uses the
LuceneDomainIndex.enqueueChange procedure to notify Lucene Domain Index of the changes.
sys.ODCIRidList is a special ODCI structure that holds a group of rowids.
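A sketch of the satellite-table trigger described above. The exact signature of LuceneDomainIndex.enqueueChange is not shown in this document; the call below assumes it takes the index name and a sys.ODCIRidList:

```sql
-- Sketch: when a satellite row changes, notify the index about every
-- master-table row that references it (enqueueChange signature assumed).
create or replace trigger t2_f5_sync
after update of f5 on t2
for each row
declare
  rids sys.ODCIRidList;
begin
  select rowidtochar(rowid) bulk collect into rids
    from t1 where f3 = :new.f4;
  LuceneDomainIndex.enqueueChange('IT1', rids);
end;
/
```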
3.1.4 Padding and formatting
parameters('Stemmer:English;FormatCols:F2(zzzzzzzzzzzzzzz),F3(00.00);ExtraCols:F3');
The above example shows that for the F2 column all values will be automatically padded to 15
characters (z), and the F3 column formatted using 00.00; these rows will then be indexed as
Lucene documents:
Document<stored/compressed,indexed<rowid:*BAEAPBQCwQL+>
indexed,tokenized<F2:zzzzzzzzzzzravi> indexed<F3:03.46>>
Document<stored/compressed,indexed<rowid:*BAEAPBQCwQT+>
indexed,tokenized<F2:zzzzzzzzzmurthy> indexed<F3:15.87>>
For columns based on Oracle XMLType, FormatCols parameter can be used to define an
XPath expression which controls a subset of XML nodes to be indexed.
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA>
indexed,tokenized<F1:001> indexed,tokenized<F2:ravi >>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB>
indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >>
The ExtraCols parameter also makes it possible to define functional columns for the Lucene
index, which means any SQL function valid in a select list is allowed. For example, using the
above table definition:
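The create index statement appears to be missing. Judging from the "id" documents shown below and the parameters string later in this section, a sketch could be (the exact expression is an assumption):

```sql
-- Sketch: a functional column "id" computed with extractValue, formatted 00
create index it1 on t1(f1)
indextype is lucene.LuceneIndex
parameters('ExtraCols:F2,extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),id(00)');
```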
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAA>
indexed,tokenized<F1:001> indexed,tokenized<F2:ravi > indexed,tokenized<id:01>>
Document<stored/compressed,indexed<rowid:AAATciAAEAAADwcAAB>
indexed,tokenized<F1:003> indexed,tokenized<F2:murthy >
indexed,tokenized<id:03>>
Note that a virtual column was defined and indexed as "id"; this column is then available
to the lcontains operator.
If you set SyncMode:OnLine during the create index DDL operation, Lucene Domain Index
enqueues all rowids of the master table for indexing in batches of BatchCount rows (the
default is 115). As soon as the command returns, the index is ready, and a PL/SQL AQ callback
will populate the Lucene index structure in the background. For example:
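The example seems missing here; a minimal sketch consistent with the parameters just described (index and table names follow the earlier examples, the BatchCount value is arbitrary):

```sql
-- Sketch: the statement returns immediately; an AQ callback indexes in background
create index it1 on t1(f2)
indextype is lucene.LuceneIndex
parameters('SyncMode:OnLine;BatchCount:200');
```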
Using PopulateIndex:false during the create index DDL statement causes the Lucene index
structure to be created empty while the Domain Index is ready. Then you can call the alter
index rebuild DDL statement to populate it. Here is an example:
parameters('PopulateIndex:false;LogLevel:ALL;IncludeMasterColumn:false;ExtraCols:F1,extractValue(F2,''/emp/name/text()'') "name",extractValue(F2,''/emp/@id'') "id";FormatCols:F1(000),id(00)');
Starting with Lucene Domain Index 2.9.1.1.0, you can enable parallel operations. The
ParallelDegree parameter, which can be 0 or 2 to 9, is implemented using multiple Data
Storages to process insert operations in parallel; this is useful when you have multi-core
chips or RAC environments. For now only inserts are parallelized, and the index must be
configured in OnLine mode. Following is an example of index creation with parallel inserts
enabled:
create index source_big_lidx on test_source_big(text)
indextype is lucene.luceneindex
parameters('BatchCount:1000;ParallelDegree:4;SyncMode:OnLine;LogLevel:INFO;AutoTuneMemory:true;ExtraCols:line "line"');
After this index DDL statement is executed, five new tables will be visible in the user's
schema: SOURCE_BIG_LIDX$T (the master index storage) and SOURCE_BIG_LIDX$[0..3]$T, the
storage for the slave processes; a sequence SOURCE_BIG_LIDX$S is also created, generating
numbers from 0 to 3.
The parallel implementation enqueues batches of 1000 rows (the BatchCount parameter) on
the master queue related to the index; the AQ callback enabled for this queue dequeues
each batch of rows and enqueues them in the slave queues. The result of these operations
is that Oracle AQ will execute multiple AQ server processes; you can see multiple
ora_j00x_sid processes running.
With Oracle 11g we saw that the AQ implementation does not start new slave processes if
one callback is using a lot of CPU; experience shows that a BatchCount parameter of 250
leaves a level of load on the AQ queues which guarantees that multiple slave processes
will be executed, resulting in truly parallel insert operations.
3.2 Alter
The SQL DDL alter index command can be used with Lucene Domain Index to change any
parameter after index creation time. Lucene Domain Index parameters are a simple list of
name:value pairs stored in the Lucene OJVMDirectory storage. To remove a parameter from
the storage, prepend ~ to the parameter name.
Here are some examples of alter index:
Change the Lucene IndexWriter parameter MaxBufferedDocs to 500 and disable the Auto Tune
Memory functionality.
Disable SyncMode from the above example; you can get similar functionality by setting
SyncMode:Deferred, which is the default value for SyncMode.
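The alter statements described above appear to have been lost in this copy. A sketch, using the it1 index from the earlier examples (the exact form of the ~ removal syntax is an assumption based on the description above):

```sql
-- Change MaxBufferedDocs and disable Auto Tune Memory
alter index it1 parameters('MaxBufferedDocs:500;AutoTuneMemory:false');

-- Remove SyncMode from the stored parameters (falls back to the
-- default, SyncMode:Deferred)
alter index it1 parameters('~SyncMode:OnLine');
```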
3.2 Rebuild
The SQL DDL alter index command allows you to rebuild an index from scratch. This is useful
when the Lucene Domain Index is damaged or corrupted, or when you need to change a
parameter that must be applied to rows already indexed, for example the Lucene Analyzer
parameter.
3.2.1 Manual
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;MaxBufferedDocs:500;AutoTuneMemory:false');
The above example shows how to change the Lucene index Analyzer. If you change your index
Analyzer it is necessary to rebuild the complete index, because you should not query an
index with an analyzer different from the one used at index time.
3.2.2 On Line
An alter index rebuild will not return until the complete operation is finished. Rebuild
Online is functionality for Oracle indexes available in Enterprise Edition databases, but
with a little trick you can rebuild a Lucene Domain Index online too.
If you are working with SyncMode:Deferred you need to change to SyncMode:OnLine;
then you can rebuild the index by using:
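The statements themselves are not present in this copy; a sketch consistent with the description above, using the it1 index from the earlier examples:

```sql
-- Switch the index to OnLine sync, then rebuild; the AQ callback
-- processes the enqueued rowids in background
alter index it1 parameters('SyncMode:OnLine');
alter index it1 rebuild;
```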
The rebuild command enqueues batches of 1000 rowids of the master table (it1) for addition
to the Lucene index structure; the Lucene Domain Index AQ callback will then process these
messages using background database processes and automatically commit the changes when it
finishes.
3.3 Drop
Dropping a Lucene index does not differ from dropping any other index. Just call:
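For the it1 index used throughout these examples, the call is simply:

```sql
drop index it1;
-- or, if something went wrong during a previous drop:
drop index it1 force;
```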
This operation drops the Lucene Domain Index table, for the above example IT1$T, and an AQ
queue IT1$Q with its storage table IT1$QT. If the index is configured with SyncMode:OnLine,
the AQ callback is disabled first.
If something goes wrong during the index drop command, you can add "force" at the end of
the command to be sure that the system views will not retain any reference to the index.
3.4 Querying
Lucene Domain Index defines a new SQL operator named lcontains(), with its ancillary
operators lscore() and lhighlight(); their functionality is similar to the Oracle Text
contains and score operators. The next examples show operator functionality and parameters.
3.4.1 Simple columns
For the table and index defined in sections 3.1.4/3.1.5, a simple usage of lcontains and
lscore is:
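The query producing the output below appears to be missing; a sketch consistent with the result shown (an lscore of 1 and the XMLType value of F2), with the query string as an assumption:

```sql
SELECT lscore(1), f2 FROM t1
  WHERE lcontains(f1, 'F2:ravi', 1) > 0;
```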
LSCORE(1)
----------
F2
----------------------------------------------------------------------------------------------------
----------------------------------------
1
<emp id="1">
<name>ravi</name>
</emp>
SQL>
The first parameter of the lcontains operator is the column that has the Lucene Domain
Index attached; this is the master column of the index and is the default field for Query
Parser syntax.
The second parameter is a Lucene Query Parser syntax string. The above table example
defines the Lucene Domain Index on the f1 column, so F2 is not the default field for the
query; with this definition, to query for a string inside the F2 column it is necessary to
explicitly prefix "F2:".
If you want to use lscore it is necessary to specify a correlation id as the third argument
of lcontains, in this example "1"; this correlation id then matches lscore(1), associating
the ancillary operator with the proper lcontains.
If you are querying the master column of the index you can simply omit the column
qualifier, for example:
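The query itself is not present in this copy; a sketch against the master column (F1 is formatted with 000 per section 3.1.4, and the query term is an assumption):

```sql
SELECT lscore(1) FROM t1
  WHERE lcontains(f1, '001', 1) > 0;
```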
LSCORE(1)
----------
1
SQL>
Query Parser syntax supports many logical operators and term modifiers; you can combine
any of them with each indexed column. Here is a practical example using the table and index
from sections 3.1.4/3.1.5:
F1 SC ID
---------- ---------------- -----------------------------
1 .577350259 1
3 .288675129 3
Note that the first row matches against column F2:ravi and the functional column id:01,
while the second row matches F1 equal to 003 (remember that the F1 qualifier is not
necessary because it is the master column of the index defined in 3.1.5).
3.4.3 Pagination
The lcontains operator has an extension to Query Parser syntax to include in-line pagination
information in the Lucene Domain Index hits result.
You can select a specific window (page) of your query by injecting a Query Parser-like
range inside the lcontains() operator. For example:
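A sketch using the rownum:[n TO m] range syntax described in section 3.4.5; the placement of the rownum term inside the query string is an assumption:

```sql
-- Fetch only the first page (hits 1..10) of the matching rows
SELECT subject FROM emails
  WHERE lcontains(bodytext, 'rownum:[1 TO 10] AND security', 1) > 0;
```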
3.4.4 Sort
Lucene provides sorting over the results of a particular query; Lucene Domain Index exposes
it through an extra argument of the lcontains() operator. Here are examples of sorting using
the emails table created in section 3.1.4:
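The first query of this series appears to have been lost; judging from the output below (subjects in ascending order) and the queries that follow, it was presumably:

```sql
SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  WHERE lcontains(bodytext, 'security', 'subject:ASC', 1) > 0;
```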
SUBJECT
----------------------------------------------------------------------------------------------------
----------------------------------------
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
lucene injection
Elapsed: 00:00:00.04
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
2 where lcontains(bodytext,'security','subject:DESC',1)>0;
SUBJECT
----------------------------------------------------------------------------------------------------
----------------------------------------
lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
Re: lucene injection
Elapsed: 00:00:00.17
SUBJECT EMAILFROM
----------------------------------------- ----------------------------------------------------------
---------------
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
lucene injection [email protected]
Elapsed: 00:00:00.06
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
2 where lcontains(bodytext,'security','subject:ASC:string,emailFrom:ASC:string',1)>0;
SUBJECT EMAILFROM
------------------------------------------ ---------------------------------------------------------
---------------
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
lucene injection [email protected]
Elapsed: 00:00:00.05
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ subject,emailFrom FROM emails
2 where lcontains(bodytext,'security',1)>0;
SUBJECT EMAILFROM
------------------------------------------- --------------------------------------------------------
--------------
lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Re: lucene injection [email protected]
Elapsed: 00:00:00.09
The last query doesn't include a sort, so it is sorted by score. An abbreviated syntax for
the sort string is ASC or DESC, which means sort by score ascending or descending; this
short format is equivalent to using order by syntax with the lscore operator, for example:
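The first query block below seems to have lost its SQL; given the description (default sort by score descending) and the order by variant that follows, it was presumably:

```sql
SELECT lscore(1), subject FROM emails
  WHERE lcontains(bodytext, 'security', 1) > 0;
```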
LSCORE(1) SUBJECT
-------------- -----------------------------------------------------------------------------------
-------------------
.241440386 lucene injection
.22763218 Re: lucene injection
.199178159 Re: lucene injection
.140840232 Re: lucene injection
.140840232 Re: lucene injection
Elapsed: 00:00:00.10
SQL> SELECT lscore(1),subject FROM emails
2 where lcontains(bodytext,'security',1)>0 order by lscore(1) asc;
LSCORE(1) SUBJECT
------------- ------------------------------------------------------------------------------------
-----------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection
Elapsed: 00:00:00.11
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
2 where lcontains(bodytext,'security','subject:DESC',1)>0;
LSCORE(1) SUBJECT
------------- ------------------------------------------------------------------------------------
----------------
.241440386 lucene injection
.22763218 Re: lucene injection
.199178159 Re: lucene injection
.140840232 Re: lucene injection
.140840232 Re: lucene injection
Elapsed: 00:00:00.07
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
2 where lcontains(bodytext,'security','subject:ASC',1)>0;
LSCORE(1) SUBJECT
------------- -----------------------------------------------------------------------------------
--------------
.140840232 Re: lucene injection
.140840232 Re: lucene injection
.199178159 Re: lucene injection
.22763218 Re: lucene injection
.241440386 lucene injection
Elapsed: 00:00:00.07
The first example uses the default sort, by score descending; the second uses order by
syntax, overriding the default sort and changing it to score ascending; the remaining ones
are equivalent but use the lcontains sort argument string.
Note that if you are using the lcontains sort string, you have to add the
DOMAIN_INDEX_SORT optimizer hint; this hint tells the Oracle optimizer that the order of
the rows will be dictated by Lucene Domain Index.
Using lscore(anc_id) in conjunction with lcontains(column,query,sort_str,anc_id) makes no
sense and produces extra overhead on the score computation which can be avoided: if you
are querying a Lucene Domain Index and want the result ordered by columns other than
relevance, why compute the score at all? AVOID the lscore() function in the select list and
you will get a faster query. For example:
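A sketch of such a query: ordered by a column through the sort argument, with no lscore() in the select list:

```sql
SELECT /*+ DOMAIN_INDEX_SORT */ subject FROM emails
  WHERE lcontains(bodytext, 'security', 'subject:ASC', 1) > 0;
```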
The count hits function is a Lucene Domain Index optimization that replaces SQL count(*)
functionality. It is extremely fast because no rowid information is passed from the
Lucene Data Cartridge to the Oracle engine to count matching rows. Here is an example:
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') hits from dual;
HITS
----------
5
Elapsed: 00:00:00.02
The first argument of the count hits function is a string with Lucene Domain Index syntax
(IDX_NAME); the second argument is a Query Parser syntax string equal to the second argument
of the lcontains function. Optionally you can use a three-argument version of the countHits
function to use an index in another schema: the first argument is the schema, the second is
the index name and the last one is the Query Parser syntax string. After a count hits
function call you can run a select with the lcontains function; if the count hits query
matches the lcontains query, lcontains will have cached information for returning the
matching rowids. Following are some examples of count hits and its correlated query using
cached results:
HITS
----------
5
Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
2 where lcontains(bodytext,'security',1)>0;
EMAILFROM
----------------------------------------------------------------------------------------------------
----------------------------------------
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Elapsed: 00:00:00.08
SQL> select LuceneDomainIndex.countHits('EMAILBODYTEXT','security') from dual;
LUCENEDOMAININDEX.COUNTHITS('EMAILBODYTEXT','SECURITY')
------------------------------------------------------------------------------
5
Elapsed: 00:00:00.02
SQL> select emailFrom FROM emails
2 where lcontains(bodytext,'security','emailFrom:ASC',1)>0;
EMAILFROM
----------------------------------------------------------------------------------------------------
----------------------------------------
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Elapsed: 00:00:00.04
In both queries lcontains found a cached hits structure evaluated by the count hits function.
Lucene Domain Index stores cached hits information; to locate it, a key composed of
sort_string(QueryParser.toString()) is used, so both arguments of count hits and lcontains
should match to re-use a cached hits structure. For the last query example the string
emailFrom:(security) is used as the key.
Starting with the 2.4.0.1.0 release we have replaced the deprecated Lucene Hits class with
the TopDocs class. If you use the FIRST_ROWS optimizer hint in conjunction with lcontains
in-line pagination, Lucene Domain Index will call TopDocs to get only the first M hits. For
example:
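The example query seems to be missing; a sketch combining the FIRST_ROWS hint with the in-line pagination range described below (the placement of the rownum term in the query string is an assumption):

```sql
SELECT /*+ FIRST_ROWS */ subject FROM emails
  WHERE lcontains(bodytext, 'rownum:[1 TO 10] AND security', 1) > 0;
```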
FIRST_ROWS and rownum:[1 TO 10] tell Lucene Domain Index to perform a Lucene query for
the first 10 hits only. The next query with rownum:[10 TO 20] will have most of the Lucene
structures cached in memory, such as the Searcher and the ROWID<->Lucene DocID
association, but it will re-query the Lucene index to get the first 20 hits (1..20). This
cache-miss behavior could be seen as a bad solution, but it is extremely useful if 90% of
queries only visit the first page of hits, typical behavior in Internet search.
On the other hand, if you omit the FIRST_ROWS optimizer hint, Oracle by default switches to
ALL_ROWS mode, which means that if you are using pagination (rownum:[n TO m]) with m
greater than 2000, Lucene Domain Index will fetch the first m hits, but if m is lower than
2000, Lucene Domain Index will try to fetch 2000 hits by default. The magic number 2000
comes from the Oracle ODCI API, which calls the ODCIFetch routine in batches of 2000 rowids.
If FIRST_ROWS and in-line pagination are not included in the query, Lucene Domain Index
works in batches of 2000 hits, causing several cache misses in a full-scan mode. For example,
given a query:
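The query itself is not present in this copy; any un-hinted, un-paginated lcontains select fits the description, for example this sketch against the test_source_big table used earlier:

```sql
SELECT text FROM test_source_big
  WHERE lcontains(text, 'lucene', 1) > 0;
```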
Such a query causes Lucene Domain Index to fetch the first 2000 hits; then, knowing that the
hits length is 2736, it re-fetches (cache miss) all 2736 hits. Obviously you can use the
LuceneDomainIndex.countHits() function to count hits faster than with the previous query.
3.4.6 Highlighting
The lhighlight ancillary operator works like lscore but returns a VARCHAR2 text with the
words highlighted during the evaluation of the lcontains function. The tag used to mark
matching words is not customizable yet and is <B>; the fragment separator and the maximum
number of fragments are also constants (... and 4, respectively). Starting with the
2.4.1.1.0 release these parameters are customizable through the alter index ... parameters()
DDL command. Highlighting example:
SQL> SELECT /*+ DOMAIN_INDEX_SORT */ lhighlight(1) txt,lscore(1) sc,subject
2 FROM emails where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0;
TXT SC
SUBJECT
On Dec 21, 2006, at 4:56 AM, Deepan wrote:> I am bothered about <B>security</B>
.27477634 Re: lucene injection
problems with lucene. Is it vulnerable to> any kind of injection like <B>mysql</
B> injection? many times the query from> user is passed to lucene for search wit
hout validating.Rest easy. There are no known <B>security</B> issues with Lucen
e, and ithas even undergone a recent static code analysis by Fortify (see theluc
ene-dev e-mail list
Highlighting only works with columns of type VARCHAR2, CLOB and XMLType. You can
perform the highlighting operation even if your master column is not indexed/stored, for
example for an index created with:
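The DDL is not present in this copy. Based on the virtual columns mentioned below (title, comment, text, revisionDate), a sketch could look like this; the table name, column name, XPath expressions and namespace handling are all assumptions:

```sql
-- Sketch: the XMLType master column itself is not indexed or stored
create index pages_lidx on pages(object_value)
indextype is lucene.LuceneIndex
parameters('IncludeMasterColumn:false;ExtraCols:extractValue(object_value,
  ''/page/title/text()'') "title",extractValue(object_value,
  ''/page/revision/comment/text()'') "comment"');
```

Note that the MediaWiki export schema declares a default XML namespace, which a real extractValue call would have to account for.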
<page xmlns="http://www.mediawiki.org/xml/export-0.3/">
<title><B>Música</B> de Argentina... [[Latinoamérica|latinoamericanos]] con más desarrollo en su
[[<B>música</B>]].
Aunque tanto la milonga como el <B>tango</B> están en [[compás]] de 2/4, las 8 [[semicorchea]]s de la
milonga están distribuidas en 3 + 3 + 2 en cambio el <B>tango</B> posee un ritmo más «cuadrado». Las
letras...]] criticó en algún momento el <B>tango</B> y prefirió la milonga, que no trasmite la melancolía
Milonga (género musical)
The index creation DDL has IncludeMasterColumn:false, which means the whole XMLType
representation of the Spanish Wikipedia page dump is not indexed; only the virtual columns
title, comment, text and revisionDate are processed by Lucene. However, the TextHighlight
Java function attached to the lhighlight operator receives the XMLType from the RDBMS
engine, so it calls the Lucene Highlighter class with the whole XMLType object (note that
page titles are in bold only to separate rows in the output).
Parameters supported by highlighting functions are:
• Formatter, a valid class name which implements the Lucene Formatter interface and
has a no-argument constructor; default value
org.apache.lucene.search.highlight.SimpleHTMLFormatter.
• MaxNumFragmentsRequired, the number of text fragments returned by the highlight
function; default value 4.
• FragmentSize, the size of each fragment returned; default value 100.
• FragmentSeparator, the String used as fragment separator; default "...". Note that
you cannot use ";" or ":" as the fragment separator because they are used as parameter
and value delimiters in the alter index ... parameters(..) DDL statement.
No customization through constructor arguments of the Formatter class is allowed,
but you can easily create your own Formatter which calls SimpleHTMLFormatter with
arguments; your Formatter will look like:
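A minimal Java sketch (the package, class name and tags are assumptions; the class must expose a no-argument constructor because that is how Lucene Domain Index instantiates it, and it needs lucene-core on the classpath):

```java
package com.example.highlight; // hypothetical package

import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TokenGroup;

// Delegates to SimpleHTMLFormatter configured with custom tags,
// while keeping the no-arg constructor Lucene Domain Index requires.
public class SpanFormatter implements Formatter {
    private final SimpleHTMLFormatter delegate =
        new SimpleHTMLFormatter("<span class=\"hit\">", "</span>");

    public String highlightTerm(String originalText, TokenGroup tokenGroup) {
        return delegate.highlightTerm(originalText, tokenGroup);
    }
}
```

After loading the class into the database with loadjava, you would point the index at it with something like alter index ... parameters('Formatter:com.example.highlight.SpanFormatter').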
phighlight and rhighlight provide more general usage patterns of the Lucene highlighting
functionality. phighlight receives an SQL query as a string and performs highlighting on a
set of user-defined columns of the query result. rhighlight receives a SYS_REFCURSOR
argument and performs highlighting on a set of user-defined query columns; unlike
phighlight, rhighlight requires the user to define the return type of the query, usually
a TABLE OF collection, because with a SYS_REFCURSOR argument there is no way to
know the return type of the query at compilation time.
Both phighlight and rhighlight support the highlighting parameters defined during create
index or alter index DDL statements; see section 3.4.6 for more information.
Here are two examples of the highlighting features using pipelined table functions; the
emails table is the example table/index from section 3.1.4:
SELECT * FROM
TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,''security OR mysql'',''subject:ASC'',1)>0'
));
SELECT * FROM
TABLE(rhighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'EMAILRSET',
CURSOR(select /*+ DOMAIN_INDEX_SORT FIRST_ROW */ lscore(1) sc,e.*
from eMails e where lcontains(bodytext,'security OR mysql','subject:ASC',1)>0)
));
The first three arguments of both pipelined functions are the same: the Lucene Domain
Index used, the Lucene Query Syntax argument (it should match the lcontains argument)
and finally the columns of the query which will be highlighted.
The last argument of phighlight is a VARCHAR2 holding the SQL query to be executed
by the DBMS_SQL package; note the doubled single quotes used as the escape sequence
for SQL single-quote characters.
For rhighlight two more arguments are required. The first is the type returned by the
cursor, in this example EMAILRSET, defined as:
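A sketch of what those type definitions might look like (the exact column list is an assumption; EMAILR must mirror the cursor projection, lscore(1) plus e.* from the eMails table of section 3.1.4):

```sql
-- EMAILR: one record per result row, score first, then the table columns
create or replace type EMAILR as object (
  sc        number,          -- value returned by lscore(1)
  emailfrom varchar2(256),
  emailto   varchar2(256),
  subject   varchar2(4000),
  emaildate date,
  bodytext  clob
);
/
-- EMAILRSET: the collection type required for the CURSOR argument
create or replace type EMAILRSET as table of EMAILR;
/
```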
Note that EMAILR is a record which holds all columns of the EMAILS table plus the score
returned by the lscore() function; EMAILRSET is then a simple collection type, TABLE OF
EMAILR, which is the type required for the CURSOR value.
Finally, the last argument is of CURSOR type, which accepts any SQL query.
Note that the anonymous PL/SQL block takes the first ROWID returned by the first
query as pivot, then expands the result set with other rows which also include terms like
"procedure (C, Java or PL/SQL), optionally qualified"; "C" is not taken into account
because it was eliminated as a stop word.
Refer to Appendix D.6 for a full explanation of each parameter.
3.4.9 Facets
Starting with Lucene Domain Index 2.4.1.1.0, Lucene Facets functionality is available
through an SQL aggregate function lfacets():
lfacets(index_name_and_categories IN VARCHAR2
) RETURN LUCENE.agg_tbl
where index_name_and_categories is an encoded string holding the Lucene index name and
the categories; an aggregate function only accepts one scalar value as argument, so we need
to encode the index and categories in a comma-separated list, for example:
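For example (the schema and index names are assumptions), the encoded argument puts the index name first, followed by one or two categories expressed in Lucene Query Syntax:

```sql
-- index name, main category, optional sub-category, all in one string
select lfacets('SCOTT.SOURCE_BIG_LIDX,TEXT:procedure,TEXT:java')
  from dual;
```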
Using the index created in the example of section 2.5 (Testing Lucene Domain Index), the
index name can use SCHEMA.IDX_NAME syntax; there can be one or two categories and they
are expressed in Lucene Query Syntax. In the above example TEXT is the indexed column,
procedure is the main category and java the sub-category.
Creating a table with categories and linking the rows to a parent is one option to
automatically generate facets, for example:
create table source_categories (
cat_code number(4),
cat_name varchar2(256),
cat_parent number(4),
CONSTRAINT PK_SOURCE_CATEGORIES PRIMARY KEY (cat_code),
CONSTRAINT FK_CAT_PARENT FOREIGN KEY (cat_parent)
REFERENCES source_categories (cat_code)
);
insert into source_categories values (1,'TEXT:procedure',null);
insert into source_categories values (2,'TEXT:function',null);
...
insert into source_categories values (6,'TEXT:java',1);
insert into source_categories values (7,'TEXT:(pl sql)',1);
insert into source_categories values (8,'TEXT:wrapped',1);
...
insert into source_categories values (21,'line:[1 TO 1000]',1);
insert into source_categories values (22,'line:[1001 TO 2000]',1);
insert into source_categories values (23,'line:[2001 TO 3000]',1);
Now we can query the above table, calling lfacets with the category and sub-category:
LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASE LEVEL WHEN 1 THEN CAT_NAME ELSE PRIOR CAT_NAME||','||CAT_NAME END))  CAT_CODE LEVEL
TEXT:procedure(5116)                                            1     1
TEXT:function(5574)                                             2     1
TEXT:trigger(96)                                                3     1
TEXT:package(860)                                               4     1
TEXT:(object type)(5140)                                        5     1
TEXT:procedure,TEXT:java(9)                                     6     2
.....
TEXT:procedure,line:[1 TO 1000](3)                             21     2
TEXT:procedure,line:[1001 TO 2000](615)                        22     2
...
SQL> select ljoin(lfacets('SOURCE_BIG_LIDX,'||
case level when 1 then cat_name
ELSE PRIOR cat_name||','|| cat_name
END
)), cat_parent
FROM source_categories
start with cat_parent is null
CONNECT BY PRIOR cat_code = cat_parent
group by cat_parent;
LJOIN(LFACETS('SOURCE_BIG_LIDX,'||CASE LEVEL WHEN 1 THEN CAT_NAME ELSE PRIOR CAT_NAME||','||CAT_NAME END))  CAT_PARENT
----------------------------------------------------------------------------------------------------------- ----------
TEXT:procedure,TEXT:java(11),TEXT:procedure,TEXT:(pl sql)(70),TEXT:procedure,line:[1 TO 1000](3),TEXT:procedure,TEXT:wrapped(21),TEXT:procedure,line:[1001 TO 2000](675),TEXT:procedure,line:[3001 TO 4000](105),TEXT:procedure,line:[4001 TO 5000](10),TEXT:procedure,line:[2001 TO 3000](199)  1
TEXT:function,TEXT:java(22),TEXT:function,TEXT:wrapped(85),TEXT:function,line:[1 TO 1000](0),TEXT:function,TEXT:(pl sql)(87),TEXT:function,line:[1001 TO 2000](835),TEXT:function,line:[3001 TO 4000](21),TEXT:function,line:[4001 TO 5000](0),TEXT:function,line:[2001 TO 3000](338)  2
TEXT:trigger,TEXT:java(1),TEXT:trigger,line:[1 TO 1000](0),TEXT:trigger,TEXT:wrapped(0),TEXT:trigger,TEXT:(pl sql)(1),TEXT:trigger,line:[1001 TO 2000](33),TEXT:trigger,line:[3001 TO 4000](0),TEXT:trigger,line:[4001 TO 5000](0),TEXT:trigger,line:[2001 TO 3000](0)  3
TEXT:package,TEXT:java(7),TEXT:package,line:[1 TO 1000](0),TEXT:package,TEXT:(pl sql)(25),TEXT:package,TEXT:wrapped(137),TEXT:package,line:[1001 TO 2000](54),TEXT:package,line:[3001 TO 4000](5),TEXT:package,line:[4001 TO 5000](0),TEXT:package,line:[2001 TO 3000](5)  4
TEXT:(object type),TEXT:java(56),TEXT:(object type),TEXT:(pl sql)(106),TEXT:(object type),line:[1 TO 1000](1),TEXT:(object type),TEXT:wrapped(76),TEXT:(object type),line:[1001 TO 2000](441),TEXT:(object type),line:[4001 TO 5000](0),TEXT:(object type),line:[3001 TO 4000](28),TEXT:(object type),line:[2001 TO 3000](119)  5
TEXT:procedure(5574),TEXT:(object type)(5584),TEXT:package(868),TEXT:trigger(114),TEXT:function(6167)
6 rows selected.
Note that we are using the ljoin() function, which converts the agg_tbl type to a
comma-separated string plus its cardinality. The last row does not have a sub-category
because the parent column is null, so 5116 is the number of rows which include the term
procedure. The other rows combine a category and a sub-category; TEXT:procedure,line:[1001
TO 2000] implies a bitwise AND intersection between the set of rows which include
procedure and the set of rows which match line:[1001 TO 2000]. The group by cat_parent
causes the Oracle ODCI API to first calculate the bit set for procedure and then iterate
over all its sub-categories (java, pl sql, wrapped), doing the bitwise AND intersections;
this is fast, and once a facet is computed it is stored as a Filter in Lucene Domain Index
memory structures.
When the number of rows or the number of categories is big, we can use a materialized view
to act as a cache of the facets computation. For example:
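A sketch of such a materialized view (the name and refresh options are assumptions; the defining query is the hierarchical lfacets query shown earlier):

```sql
-- cache the facet counts so repeated queries avoid recomputation
create materialized view source_facets
  build immediate
  refresh complete on demand
as
select ljoin(lfacets('SOURCE_BIG_LIDX,'||
         case level when 1 then cat_name
                    else prior cat_name||','||cat_name end)) facets,
       cat_parent
  from source_categories
 start with cat_parent is null
connect by prior cat_code = cat_parent
 group by cat_parent;
```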
Now the source_facets materialized view can be queried like any other table and access to it
will be very fast. The materialized view can then be refreshed by the application at a
specific point in time.
Starting with Lucene Domain Index 2.9.1.1.0, two pipelined table functions have been
included to iterate over the terms of the Lucene Index structure. The first is high_freq_terms():
FUNCTION high_freq_terms(index_name VARCHAR2,
term_name VARCHAR2,
num_terms NUMBER) RETURN term_info_set
is available for getting the Top-N (num_terms) most frequently used terms in the whole index
or in a particular field. term_info_set is defined as:
TYPE term_info AS OBJECT (
term VARCHAR2(4000),
docFreq NUMBER(10)
);
TYPE term_info_set AS TABLE OF term_info;
You can query your index by using:
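For example (the index and field names are assumptions):

```sql
-- Top 10 most frequent terms of the TEXT field; pass NULL as
-- term_name to consider every field of the index
select * from table(high_freq_terms('SOURCE_BIG_LIDX','TEXT',10));
```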
and, index_terms():
FUNCTION index_terms(index_name VARCHAR2,
term_name VARCHAR2) RETURN term_info_set
On both functions, if the term argument is NULL they will iterate over all index
terms. The natural order for high_freq_terms() is descending by docFreq,
while index_terms() is ordered by term_name:term_value ascending. Note that if you
pass a non-NULL value for term in order to start at the first value of that specific
term, index_terms() does not stop when all the values of this term have been returned;
this behavior is similar to the Lucene Java method reader.terms(new Term(term)). Here is an
example if you only want to iterate over a specific term name:
BEGIN
  FOR term_rec IN (SELECT * FROM table(index_terms('SOURCE_BIG_LIDX','line')))
  LOOP
    /* Fetch from cursor variable. */
    EXIT WHEN substr(term_rec.term,1,length('line'))<>'line'; -- exit when last row is fetched
    -- process data record
    dbms_output.put_line('Name = ' || term_rec.term || ' ' || term_rec.docFreq);
  END LOOP;
END;
You can use index_terms() to get the Top-N terms ordered by docFreq, for example:
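A sketch of the equivalent index_terms() query (index and field names are assumptions):

```sql
-- sort by docFreq in SQL and keep only the first 10 rows
select * from (
  select * from table(index_terms('SOURCE_BIG_LIDX','TEXT'))
   order by docFreq desc)
 where rownum <= 10;
```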
The two queries are semantically equivalent, but high_freq_terms() is more efficient
because it uses a TermInfoQueue structure for sorting, caches its computation once it has
executed, and does not create a lot of term_info objects which must then be sorted by the
RDBMS engine.
Starting with Lucene Domain Index 2.9.2.1.0, Did You Mean Lucene functionality was
added as an extended LDI property, using the Lucene SpellChecker library to create a
dictionary index from the main index. Finally, the dictionary index is merged into the
main index.
PROCEDURE indexDictionary(
index_name IN VARCHAR2,
spellColumns IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein')
Note: The dictionary structure creates the "word", "gramN", "startN" and "endN"
Lucene fields, so be careful if you already have these fields in the main index. The
structure of this index (for a 3-4 gram) is this:
and,
FUNCTION suggest
(
index_name IN VARCHAR2,
cmpval IN VARCHAR2,
highlight IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2
is available to query the dictionary index. You can query the dictionary by using:
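For example, a call like this sketch (the index name and the misspelled word are assumptions) produces output like that shown below:

```sql
-- ask the dictionary for a respelling of a misspelled word
select suggest('SOURCE_BIG_LIDX','sorce') suggestion from dual;
```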
SUGGESTION
--------------------------------------------------------------------------------
source
Elapsed: 00:00:00.31
The index_name parameter and the word to respell (cmpval) are mandatory. You can
optionally define the highlight tag to be used (e.g. b for bold, i for italic, etc.) and
the distance algorithm to apply.
3.5 Synchronize
When working with SyncMode:Deferred you have to synchronize your index manually, which means
updating the Lucene Domain Index structure by applying pending changes such as inserts and
updates. Delete operations are always applied immediately, because the ODCI API must not
return rowids of deleted rows.
Here is an example:
begin
LuceneDomainIndex.sync('IT1');
commit; -- release locks
end;
3.6 Optimize
Optionally you can optimize the Lucene Index storage; to do so, execute:
begin
LuceneDomainIndex.optimize('IT1');
commit; -- release locks
end;
Like the sync operation, this procedure takes an exclusive lock on the Lucene Index storage
table and performs an optimization of the Lucene Index, for example merging multiple segments
into a new one. You can still perform select operations (read-only) against the Lucene Domain
Index during optimization time; the Oracle concurrency system (redo logs) provides this
capability. Once you perform a commit, any other concurrent session will automatically see
the index changes.
begin
LuceneDomainIndex.xdbExport('IT1');
commit; -- makes change visible to Ftp or WebDAV
end;
Username: scott/tiger
Export done in US7ASCII character set and AL16UTF16 NCHAR character set
server uses AL32UTF8 character set (possible charset conversion)
COUNT(*)
----------
6167
Index dropped.
Index created.
Table dropped.
.... Restore your .dmp now and check again if your index returns a correct result ....
-bash-3.2$ imp scott/tiger
COUNT(*)
----------
0
Table truncated.
19 rows created.
SQL> exit
..... and connect again to refresh Lucene Domain Index in memory structures ....
SQL> conn scott/tiger
Connected.
SQL> select count(*) from test_source_big where lcontains(text,'function')>0;
COUNT(*)
----------
6167
As you can see, the Lucene Domain Index structure can be exported alone, without exporting
the master table. This is useful when you are upgrading Lucene Domain Index, which requires
that all indexes be dropped first, and you don't want to re-create a very big index.
4. Locking and Performance
The Lucene IndexWriter class uses several parameters to control the index structure. Lucene
Domain Index passes several parameters to IndexWriter, such as MergeFactor and
MaxBufferedDocs, among others.
As a best practice, if you want to index thousands of rows you can override the default
Lucene parameters with values which speed up indexing time. With create index or alter index
rebuild you can set MergeFactor to 100 and MaxBufferedDocs to 4000.
These parameters increase indexing performance, but afterwards DML operations at the base
table will only batch small sets of rows, so after such DDL commands change MergeFactor back
to 2 and MaxBufferedDocs to 100. A good place to start learning the behavior of these
parameters is the Lucene Wiki page Improving Indexing Speed.
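A sketch of the practice described above (index, table and column names are assumptions):

```sql
-- bulk-load settings for the initial indexing of thousands of rows
create index it1 on t1(f2) indextype is lucene.LuceneIndex
  parameters('MergeFactor:100;MaxBufferedDocs:4000');

-- once the initial load is done, switch to OLTP-friendly values
alter index it1 parameters('MergeFactor:2;MaxBufferedDocs:100');
```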
Lucene Domain Index has a parameter called AutoTuneMemory; a true value means that
for IndexWriter operations it will try to use up to 90% of the Java Pool Size configured in
the Oracle SGA to adjust how many documents are buffered (MaxBufferedDocs) before
calling IndexWriter.flush().
With AutoTuneMemory:true, MaxBufferedDocs is not required; it is calculated using the free
RAM in the SGA, but you still have to set MergeFactor.
Because Java Pool Size is a global parameter, this rule is not valid if you want to create
many indexes from parallel connections: two connections will each try to use 90% of the SGA,
so one of them will run out of memory.
4.2.3 Keep Index on RAM
OJVMDirectory replaces Lucene file system storage with table storage based on BLOBs. For
every Lucene Domain Index created there is a new table which stores each Lucene file as
a row with a BLOB column (see section 6 for more detail). Using a strategy similar to Oracle
Text, you can keep this table in RAM. Unlike Oracle Text, which uses multiple tables for
storing the inverted index, Lucene Domain Index uses one table; execute this DDL
command to keep the Lucene Index in RAM:
alter table source_small_lidx$t storage (buffer_pool keep) modify lob (data) (storage
(buffer_pool keep));
During index creation use AutoTuneMemory:true (the default value) and a high MergeFactor,
because many rows will be indexed at that time. Then change MergeFactor to 2 to work
better after each DML/sync operation. Finally, alter the OJVMDirectory storage table and
LOB to keep them in RAM.
Be sure that your SGA has enough RAM to hold it. To find out how big your index is, you can
query the storage table:
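The storage table follows the IDX_NAME$T naming pattern (see section 6), so a query like this sketch reports the total index size:

```sql
-- sum the size of all Lucene files stored for this index
select sum(file_size) from source_small_lidx$t;
```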
SUM(FILE_SIZE)
--------------
147444
To be sure that your Lucene Domain Index is being used properly, compare your execution
plans and try to avoid unnecessary filter-by or sort-order-by predicates by using in-line
sort or multiple-field QueryParser conditions.
Here are examples of sorting using the emails table created in section 3.1.4:
Explained.
Elapsed: 00:00:00.58
SQL> set echo off
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------
Plan hash value: 1542204867
Id Operation Name Rows Bytes Cost (%CPU) Time
0 SELECT STATEMENT 1 4016 3 (34) 00:00:01
3 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security',1)>0)
The execution plan above shows that you are using Lucene Domain Index, but you can get a
better optimizer plan by using an lcontains in-line sort:
Explained.
Elapsed: 00:00:00.01
SQL> set echo off
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------
Plan hash value: 1450245214
Id Operation Name Rows Bytes Cost (%CPU) Time
0 SELECT STATEMENT 1 4016 2 (0) 00:00:01
2 - access("LUCENE"."LCONTAINS"("BODYTEXT",'security','subject:ASC',1)>0)
5. Known caveats
1. Lucene Domain Index uses the Java Util Logging API, which means a grant is required to
   create and operate any index:
dbms_java.grant_permission( 'USER_NAME',
'SYS:java.util.logging.LoggingPermission', 'control', '' )
2. SyncMode:OnLine should be reserved only for indexes where the number of
   update/insert/delete operations is very small compared to select operations, because each
   message processed requires a background process to open at least an
   IndexWriter/IndexReader on the associated Lucene Index, except for bulk collect operations
   or "insert into ... select ... from", which are processed in batches of 150 rows. Tables
   with many insert/update operations per second should use the LuceneDomainIndex.sync(idx)
   procedure, called periodically by DBMS_JOB or by the application.
3. The syntax for in-line pagination is only supported at the beginning of the query: if you
   want to perform pagination, the lcontains() query syntax must start with "rownum:[n TO m]
   AND". Note that this syntax is case sensitive. Also, this extraction is performed by
   splitting the query by position and does not take grouping operators into account, so the
   query "rownum:[1 TO 10] AND word1 OR word2" will be passed to Lucene's QueryParser as
   "word1 OR word2", which is not semantically the original query if you consider operator
   precedence. We may modify the QueryParser class in the future to solve this semantic issue.
4. Since October 25, column names are case sensitive in the ExtraCols and FormatCols
   parameters, following traditional SQL behavior. It means that for this DDL index creation:
   You can use ExtraCols with f3 or F3, but FormatCols should be F3 because f3 is returned by
   the SQL select operation as F3 during the table full scan; the Lucene Index will also have
   a document with a Field F3 instead of f3. If you want to use f3 as-is, you can re-write the
   DDL index creation with:
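A sketch of such a DDL (table, column and index names are assumptions; the quoted alias keeps f3 lowercase):

```sql
-- the double-quoted alias "f3" preserves the lowercase field name
create index it1 on T1(F2) indextype is lucene.LuceneIndex
  parameters('ExtraCols:F3 "f3"');
```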
With this sentence Lucene will create documents with two fields, F2 and f3. F2 is uppercase
because it is the master column of the index and is passed as "F2" by the ODCI API, but
since it is the default field of the query you can omit its name in lcontains syntax; F3 is
now lowercase and will be indexed as a Field "f3".
5. Since November, index parameters are pre-cached in memory for faster response. Due to the
   isolation behaviour of Oracle JVM sessions, if you call alter index or re-create an index
   in another session, you need to close all SQL sessions that have already pre-loaded the
   index parameter storage.
   By calling LuceneDomainIndex.getParameter('owner.index_name','parameter_name') you can
   see the value of any parameter passed to the ODCI API by either create index or alter
   index.
   Alternatively, you can call the LuceneDomainIndex.refreshParameterCache stored procedure.
6. If you re-install Lucene Domain Index without first deleting existing indexes, you can
   manually drop the resources associated with an old index. For example:
TABLE_NAME
------------------------------
DEPT
EMP
BONUS
SALGRADE
SOURCE_BIG_LIDX$QT
DR$SOURCE_BIG_IDX$I
DR$SOURCE_BIG_IDX$R
SOURCE_BIG_LIDX$T
TEST_SOURCE_BIG
DR$SOURCE_BIG_IDX$N
DR$SOURCE_BIG_IDX$K
11 rows selected.
SQL> drop table SOURCE_BIG_LIDX$T;
Table dropped.
SQL> conn / as sysdba
connected.
SQL>exec DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q')
BEGIN DBMS_AQADM.DROP_QUEUE ('SCOTT.SOURCE_BIG_LIDX$Q'); END;
*
ERROR at line 1:
ORA-01403: no data found
ORA-06512: at "SYS.DBMS_AQADM_SYS", line 3359
ORA-06512: at "SYS.DBMS_AQADM", line 167
ORA-06512: at line 1
SQL> exit
Note that "drop index ... force" de-registers the Lucene Domain Index from Oracle's system
views; then the Lucene Domain Index storage table is manually dropped; finally, connected as
SYS, the Lucene Domain Index AQ table is dropped.
7. Oracle 11g has a known bug, "6445561 - ORA-00600 [26599] [62] DUE TO INCORRECT
   PERSISTENCE OF BY INVOKER PIN"; please apply patch p6445561_111060_LINUX.zip,
   available on Metalink. This bug affects select count(*) with large results.
8. Up to Lucene Domain Index 2.9.0 there is a known problem with the WhereCondition
   parameter using the OR SQL operator; see section A.3.3 for the workaround.
Appendixes
A. Parameter reference and syntax
Lucene Domain Index accepts several parameters which can be passed using create index or
alter index DDL commands. These parameters are divided into four categories: Index Writer,
Analyzer, User Data Store and General parameters.
A.1 Lucene Index Writer parameters
This section covers the Lucene Index Writer parameters; for more information about these
parameters see the Lucene docs and Wiki.
A.1.1 MergeFactor
Determines how often segment indices are merged by addDocument(). If you are creating
a new index over a table with thousands of rows, a value of 100 to 500 is a good choice.
A.1.2 MaxBufferedDocs
Determines the minimal number of documents required before the buffered in-memory
documents are merged and a new segment is created. This value can cause an out-of-memory
exception if you provide a value larger than the user space available. A typical
SGA configuration can accept values of 4000 or 5000, depending on how big the rows
being indexed are. If you are not sure how many megabytes your rows can consume, you
can use the AutoTuneMemory:true parameter, which is the default; when it is true,
MaxBufferedDocs will be ignored and Lucene Domain Index will try to use 90% of the Oracle
Java Pool Size value.
A.1.3 MaxMergeDocs
A.1.4 MaxBufferedDeleteTerms
Determines the minimal number of delete terms required before the buffered in-memory
delete terms are applied and flushed.
A.1.5 UseCompoundFile
A setting to turn on usage of the compound file format. When on, the multiple files of each
segment are merged into a single file once the segment creation is finished. This is done
regardless of which Directory implementation is in use. By default Lucene Domain Index does
not use the compound file format, because its BLOB-based storage is not affected by the
limit on open file descriptors.
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting
index terms from text.
Typical implementations first build a Tokenizer, which breaks the stream of characters from the
Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the
Tokenizer.
The Analyzer, PerFieldAnalyzer and Stemmer parameters affect indexing and query expressions,
so if you want to change one of these parameters on an existing index you must rebuild it.
The priority of these three parameters is: first check for Stemmer; if it is not present,
check for PerFieldAnalyzer; if that is not present, check for the Analyzer parameter;
finally, if none of them is defined, SimpleAnalyzer is used.
A.2.1 Analyzer
A.2.2 Stemmer
Stemmer is another kind of analyzer which handles words, stop words and other term-related
objects based on a specific language. The Stemmer parameter uses the Snowball Analyzer;
possible values for the Stemmer parameter using the Lucene 2.2.0 distribution are:
• Danish
• Dutch
• English
• Finnish
• French
• German
• German2
• Italian
• Kp
• Lovins
• Norwegian
• Porter
• Portuguese
• Russian
• Spanish
• Swedish
The Stemmer parameter overrides the Analyzer parameter.
A.2.3 PerFieldAnalyzer
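A sketch of an index creation using PerFieldAnalyzer (the table, column and index names are assumptions, and the exact syntax of the parameter value shown here is an assumption as well):

```sql
-- rowid, F1 and id get KeywordAnalyzer; name falls back to the default
create index it1 on T1(F1) indextype is lucene.LuceneIndex
  parameters('ExtraCols:id,name;Analyzer:org.apache.lucene.analysis.standard.StandardAnalyzer;PerFieldAnalyzer:rowid(org.apache.lucene.analysis.KeywordAnalyzer),F1(org.apache.lucene.analysis.KeywordAnalyzer),id(org.apache.lucene.analysis.KeywordAnalyzer)');
```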
In the above example four columns are indexed by Lucene Domain Index: rowid (added by
default) using KeywordAnalyzer; F1 and id (added by the ExtraCols parameter) also using
KeywordAnalyzer; and finally name, which is not included in the PerFieldAnalyzer parameter
and therefore uses StandardAnalyzer.
Lucene Domain Index implements User Data Store functionality, which provides many
parameters to control which columns are included in the Lucene Document inserted into
the index.
The first three parameters are used to choose which columns will be added to the index
in addition to the master column. Oracle Domain Indexes are bound to a single column;
this is a limitation of the Oracle 10g version. To work around this problem, by passing
ExtraCols, ExtraTabs and WhereCondition you can easily build a set of new columns from the
master table and other tables. Basically, a select DML statement is built using these
parameters. To clarify, Lucene Domain Index will perform a query like:
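A sketch of the generated statement; the bracketed parts are placeholders filled from the user parameters, the rest is injected by Lucene Domain Index:

```sql
-- MASTER_TABLE/MASTER_COLUMN stand for the indexed table and column
select rowid, MASTER_COLUMN, [ExtraCols]
  from MASTER_TABLE [, ExtraTabs]
 where MASTER_TABLE.rowid = :rid
   [ and WhereCondition ];
```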
Text in italic is injected by Lucene Domain Index and text in bold is user defined.
A.3.1 ExtraCols
A comma-separated list of columns of the master table being indexed or of the tables
defined in the ExtraTabs parameter. Note that if you don't define column aliases, column
names are capitalized by default on Oracle databases. For example 'ExtraCols:F2 "f2",T2.F3
"f3"'; note that you can omit the master table name if there are no collisions.
A.3.2 ExtraTabs
A comma-separated list of table names and aliases for these tables. For example 'ExtraTabs:T2
aliasT2,T3 aliasT3'. Note that the ODCI API will only detect changes to the index master
column; to notify changes based on the ExtraCols list you need to attach triggers, see the
examples in the sections above for more detail.
A.3.3 WhereCondition
An SQL where condition used to join the index's master table with the ExtraTabs tables. For
example: 'WhereCondition:T1.f1=T2.f2(+) AND T1.F1=aliasT3.f3'. Be careful to produce a
correct join condition that guarantees a single-row result; multiple- or zero-row results
based on the master table values are not allowed.
A.3.4 UserDataStore
A.3.5 FormatCols
A comma-separated list of column(format) strings interpreted by the User Data Store class to
control how a specific database column will be transformed into a Lucene Field. For example
you can choose padding, un-tokenized values and so on.
Supported formats by Default Data Store class are:
• Number padding for numeric columns using java.text.DecimalFormat class
syntax, default is 0000000000.
• Date rounding for timestamp and date columns using
org.apache.lucene.document.DateTools, default is day.
• Character left padding for VARCHAR2 or CHAR columns using
org.apache.lucene.util.StringUtils class (leftPad method), default is no left char
padding. Any char can be used for left padding.
• XPath expression for XMLType columns; this XPath string will be passed to the
XMLType.extract("format","") method, the result of the XPath extraction will be
a new XMLType object over which getStringVal() will be executed. If you want to perform
more user-defined XMLType-to-Field extraction, extend the DefaultUserDataStore class
or use virtual column indexing.
• For columns of type VARCHAR2 or CHAR you can use the special string
NOT_ANALYZED or NOT_ANALYZED_STORED as format, which tells the Default
User Data Store class that this column will be indexed but un-tokenized; this is
useful for columns which will be used for sorting.
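A sketch of FormatCols usage (index and column names are assumptions): pad a numeric column to five digits and index a character column un-tokenized for sorting:

```sql
-- F1(00000): DecimalFormat padding; F3(NOT_ANALYZED): keep as one token
alter index it1 parameters('FormatCols:F1(00000),F3(NOT_ANALYZED)');
```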
A.4.1 SyncMode
SyncMode tells Lucene Domain Index which strategy is used to update the index.
SyncMode:Deferred (the default) leaves it to the application to decide when the index is
synced, either by calling the LuceneDomainIndex.sync procedure after a set of pending
changes or by a DBMS_SCHEDULER process at a specific time. With SyncMode:Deferred, update
and insert operations are queued using the DBMS_AQ package. Delete operations are never
enqueued because they require an immediate update of the Lucene Index so it does not return
rowids of deleted rows.
SyncMode:OnLine is implemented using a DBMS_AQ PL/SQL callback, so immediately
after a commit operation which involves inserted or updated rows, a parallel process dbms_j*
is automatically started by the DBMS_AQ package to apply pending changes.
SyncMode:OnLine should be reserved for indexes where update, insert or delete operations
are much less frequent than selects; AQ callbacks cannot handle exceptions very well during
sync time, for example when a row being indexed is locked by another session, so some
changes can be lost in this scenario.
A.4.2 AutoTuneMemory
A.4.3 LobStorageParameters
Lucene Domain Index uses a BLOB column named "data" for storing the Lucene inverted
index files. You can control any LOB storage parameter with this parameter at index
creation time; its default value is 'LobStorageParameters:PCTVERSION 0 ENABLE
STORAGE IN ROW CACHE READS NOLOGGING'. For 11g databases you can use
better-optimized storage through the newer SecureFiles LOB parameters, for example:
'LobStorageParameters:PCTVERSION 0 ENABLE STORAGE IN ROW CHUNK 32768
CACHE READS FILESYSTEM_LIKE_LOGGING'
A.4.4 LogLevel
Lucene Domain Index uses the JDK Java Util Logging package. The LogLevel parameter is any
of the strings accepted by the Level.parse() method, for example LogLevel:ALL. By default
the logging level is set to WARNING.
Lucene Domain Index uses:
• SEVERE for non recoverable error conditions
• FINER for debugging purpose such as ODCI API arguments
• INFO for checking index operations such as value being indexed
• WARNING for error messages which are reported as ERROR through ODCI API
• CONFIG to see user parameters changed by users
Logging information is sent by default to Oracle .trc files, but you can redirect this
output, for example using the dbms_java.set_output procedure.
If you are not sure which fields are added to the index and how, change LogLevel
to INFO and check for lines starting with "INFO: Document<".
The exiting and throwing methods do not print messages even with the log level set to ALL;
this is because the logging level used by these methods is controlled by the ConsoleHandler
level.
To get these methods work copy logging.properties file from your JAVA_HOME/jre/lib to
ORACLE_HOME/javavm/lib directory and edit the line which includes level property:
# Limit the message that are printed on the console to INFO and above.
java.util.logging.ConsoleHandler.level = ALL
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
A.4.5 CachedRowIdSize
A.5.1 DefaultColumn
Note the correlation between DefaultColumn and ExtraCols. ExtraCols defines a Lucene
Field named "text" with a value calculated by the SQL expression
extract(object_value,''/page/revision/text/text()''); you can then use the Lucene
Field text as the default field in QueryParser syntax.
A.5.2 DefaultOperator
A.5.3 NormalizeScore
NormalizeScore is used during the Lucene Index scan to decide whether the maximum score
needs to be tracked; the maximum score is then used to normalize the result of the lscore()
operator so it returns only values between 0 and 1. If you don't need a normalized score
range you can avoid this computation and your query will be faster. Note that a
non-normalized score does not imply that the documents are out of relevance order.
A.5.4 PreserveDocIdOrder
This is the set of parameters which affect the lhighlight, phighlight and rhighlight functionality.
A.6.1 Formatter
Formatter defines a valid class name which implements Lucene Interface Formatter
and with a constructor with no arguments, default value
org.apache.lucene.search.highlight.SimpleHTMLFormatter.
A.6.2 MaxNumFragmentsRequired
A.6.3 FragmentSize
FragmentSize defines the size in characters of each fragment returned; the default value
is 100.
A.6.4 FragmentSeparator
FragmentSeparator defines the String used as fragment separator; the default value is "...".
Note that you cannot use ";" or ":" as fragment separator because they are used as parameter
and value delimiters in the alter index ... parameters(..) DDL statement.
There is also an index based on the IDX_NAME$T.DELETED column to speed up purge operations.
To enqueue operations for the index, a DBMS_AQ queue IDX_NAME$Q is defined with its storage
table IDX_NAME$QT.
The IDX_NAME$Q queue has its payload defined as the LUCENE_MSG_TYP object type. This object
type is defined as:
Name Null? Type
RIDLIST SYS.ODCIRIDLIST
OPERATION VARCHAR2(32)
SYS.ODCIRIDLIST is a special structure defined by the ODCI API to hold the list of rowids changed by
a DML operation. OPERATION is one of the insert, delete, update, rebuild or optimize reserved
keywords. The rebuild and optimize operations are used with SyncMode:OnLine to perform these tasks
automatically using a background process.
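Tying this together, a hedged sketch of creating an index that uses this queue in background mode (the table and column names are illustrative; SyncMode and BatchCount are described in the release notes at the end of this document):

```sql
-- DML changes are enqueued on IDX_NAME$Q and applied by an
-- AQ PL/SQL callback in batches of BatchCount rowids.
create index emails_lidx on eMails(bodytext)
indextype is lucene.LuceneIndex
parameters('SyncMode:OnLine;BatchCount:200');
```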
create table T1 (
f1 number primary key,
f2 varchar2(200),
f3 varchar2(200),
f4 number)
parameters('LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.StopAnalyzer;MergeFactor:500;ExtraCols:F1;Forma
C.2 TestDBIndex
A simple test which creates a table and its index, then performs insertions, sync, optimize and deletions,
and finally drops the index and table. Its output looks like:
C.3 TestDBIndexAddDoc
Performs several insertions and syncs, starting with 10 rows, then 90, and so on, ending with
3,000 insertions using the insertRow method of the DBTestCase base class. After each batch of insertions
it calls the syncIndex method, calculating the average sync time for each row inserted. Its
output looks like:
C.4 TestDBIndexDelDoc
In its setup method this test case creates a table and fills it with 500 rows. It then performs deletion
batches of 10, 90 and 400 rows, calculating the average time for each row deleted. Its output
looks like:
C.5 TestDBIndexParallel
This is a more complex test case which checks concurrent access to the Lucene Domain Index. To do this
it creates several threads: some simulating batch insertions of 10 rows, others simulating
batch deletions of 10 rows, others simulating batch updates of 10 rows, and finally many
threads searching for rows every 0.5 seconds.
By default it creates 3 threads for each kind of operation, and each thread performs:
• 20 inserts
• 5 deletes
• 5 updates
• 100 searches
Each thread takes its own connection from the connection pool and does its job. If the fastSync
constant is true, after each successful insert and update it calls the syncIndex method to update
the Lucene index; if fastSync is false, another thread is started which syncs the index every 1 second.
The test ends when all threads (inserts, deletes, updates) finish.
Here is part of its output:
C.6 TestDBIndexSearchDoc
This test checks some special features of the lcontains operator such as in-line pagination, sort-by
and filter-by expressions.
It first creates a table with 200 rows and then queries them; its output looks like:
C.7 TestQueryHits
This test is not autonomous because it requires an additional step to run. Before running it, create a
table and its Lucene index with:
create table test_source_big as (select * from all_source);
create index source_big_lidx on test_source_big(text)
indextype is lucene.LuceneIndex
parameters('AutoTuneMemory:true;MergeFactor:500;FormatCols:line(0000);ExtraCols:line "line"');
For 11g databases you can create a better-optimized Lucene index using some of the new SecureFiles
LOB features:
On 10g, running it as SCOTT, the TEST_SOURCE_BIG table will have 220731 rows using a typical
installation based on the database templates.
Using the above table, two tests check performance with a query which returns 18387 hits: one
calls the LuceneDomainIndex.countHits function and the other iterates over the result in pages of ten
rows, a typical scenario for web applications. Their output looks like:
Note that the first iteration takes more time because it includes parsing time and caching; also, to
simulate a real-world web application, a SQLConnection is taken from and returned to the pool on each
iteration.
D Functions, operators and utilities
The lcontains operator is similar to the Oracle Text CONTAINS operator, but differs in its query argument and
supports an additional argument to define in-line sorting.
Syntax
LCONTAINS(
[schema.]column,
text_query VARCHAR2
[,sort VARCHAR2]
[,label NUMBER])
RETURN NUMBER;
[schema.]column
Specify the Lucene text column to be searched on. This column must have a Lucene
Domain Index associated with it.
text_query
Specify an argument in Lucene Query Parser syntax. In addition to the Lucene Query Parser
syntax, Lucene Domain Index supports in-line pagination in lcontains; to use it, the
query must start with rownum:[nn TO mm] AND, where nn and mm are rownum values
of the query result which will be returned. In Oracle syntax rownum starts with 1, and the
boundaries are inclusive, which means that for 20 to 30 we get 11 rows.
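For example, a sketch of in-line pagination against the eMails table used elsewhere in this document:

```sql
-- Returns rows 20..30 (11 rows, bounds inclusive) of the hits
-- for the term query 'security'.
select subject from eMails
where lcontains(bodytext, 'rownum:[20 TO 30] AND security', 1) > 0;
```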
The following is an excerpt of the Lucene Query Parser syntax.
Terms
A query is broken up into terms and operators. There are two types of terms:
Single Terms and Phrases.
A Single Term is a single word such as "test" or "hello".
A Phrase is a group of words surrounded by double quotes such as "hello dolly".
Multiple terms can be combined together with Boolean operators to form a more
complex query (see below).
Note: The analyzer used to create the index will be used on the terms and phrases
in the query string. So it is important to choose an analyzer that will not interfere
with the terms used in the query string.
Fields
Lucene supports fielded data. When performing a search you can either specify a
field, or use the default field. The field names and default field is implementation
specific.
You can search any field by typing the field name followed by a colon ":" and then
the term you are looking for.
As an example, let's assume a Lucene index contains two fields, title and text, and
text is the default field. If you want to find the document entitled "The Right Way"
which contains the text "don't go this way", you can enter:
title:"The Right Way" AND text:go
or
title:"The Right Way" AND go
Since text is the default field, the field indicator is not required.
Note: The field is only valid for the term that it directly precedes, so the query
title:Do it right
Will only find "Do" in the title field. It will find "it" and "right" in the default field (in
this case the text field).
Term Modifiers
Lucene supports modifying query terms to provide a wide range of searching
options.
Wildcard Searches
Lucene supports single and multiple character wildcard searches within single terms
(not within phrase queries).
To perform a single character wildcard search use the "?" symbol.
To perform a multiple character wildcard search use the "*" symbol.
The single character wildcard search looks for terms that match that with the single
character replaced. For example, to search for "text" or "test" you can use the
search:
te?t
Multiple character wildcard searches look for 0 or more characters. For example, to
search for test, tests or tester, you can use the search:
test*
You can also use the wildcard searches in the middle of a term.
te*t
Fuzzy Searches
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance,
algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word
Term. For example to search for a term similar in spelling to "roam" use the fuzzy
search:
roam~
This search will find terms like foam and roams.
An additional (optional) parameter can specify the required similarity. The value is
between 0 and 1; the closer to 1, the higher the similarity required. For example:
roam~0.8
Range Searches
Range Queries allow one to match documents whose field(s) values are between
the lower and upper bound specified by the Range Query. Range Queries can be
inclusive or exclusive of the upper and lower bounds. Sorting is done
lexicographically.
mod_date:[20020101 TO 20030101]
This will find documents whose mod_date fields have values between 20020101
and 20030101, inclusive. Note that Range Queries are not reserved for date fields.
You could also use range queries with non-date fields:
title:{Aida TO Carmen}
This will find all documents whose titles are between Aida and Carmen, but not
including Aida and Carmen.
Inclusive range queries are denoted by square brackets. Exclusive range queries
are denoted by curly brackets.
Boosting a Term
Lucene provides the relevance level of matching documents based on the terms
found. To boost a term use the caret, "^", symbol with a boost factor (a number) at
the end of the term you are searching. The higher the boost factor, the more
relevant the term will be.
Boosting allows you to control the relevance of a document by boosting its term.
For example, if you are searching for
jakarta apache
and you want the term "jakarta" to be more relevant boost it using the ^ symbol
along with the boost factor next to the term. You would type:
jakarta^4 apache
This will make documents with the term jakarta appear more relevant. You can also
boost Phrase Terms as in the example:
"jakarta apache"^4 "Apache Lucene"
By default, the boost factor is 1. Although the boost factor must be positive, it can
be less than 1 (e.g. 0.2).
Boolean Operators
Boolean operators allow terms to be combined through logic operators. Lucene
supports AND, "+", OR, NOT and "-" as Boolean operators (Note: Boolean operators
must be ALL CAPS).
The OR operator is the default conjunction operator. This means that if there is no
Boolean operator between two terms, the OR operator is used. The OR operator
links two terms and finds a matching document if either of the terms exist in a
document. This is equivalent to a union using sets. The symbol || can be used in
place of the word OR.
To search for documents that contain either "jakarta apache" or just "jakarta" use
the query:
"jakarta apache" jakarta
or
"jakarta apache" OR jakarta
AND
The AND operator matches documents where both terms exist anywhere in the text
of a single document. This is equivalent to an intersection using sets. The symbol
&& can be used in place of the word AND.
To search for documents that contain "jakarta apache" and "Apache Lucene" use
the query:
"jakarta apache" AND "Apache Lucene"
+
The "+" or required operator requires that the term after the "+" symbol exist
somewhere in the field of a single document.
To search for documents that must contain "jakarta" and may contain "lucene" use
the query:
+jakarta lucene
NOT
The NOT operator excludes documents that contain the term after NOT. This is
equivalent to a difference using sets. The symbol ! can be used in place of the word
NOT.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use
the query:
"jakarta apache" NOT "Apache Lucene"
Note: The NOT operator cannot be used with just one term. For example, the
following search will return no results:
NOT "jakarta apache"
-
The "-" or prohibit operator excludes documents that contain the term after the "-"
symbol.
To search for documents that contain "jakarta apache" but not "Apache Lucene" use
the query:
"jakarta apache" -"Apache Lucene"
Grouping
Lucene supports using parentheses to group clauses to form sub queries. This can
be very useful if you want to control the boolean logic for a query.
To search for either "jakarta" or "apache" and "website" use the query:
(jakarta OR apache) AND website
This eliminates any confusion and makes sure that website must exist and
either term jakarta or apache may exist.
Field Grouping
Lucene supports using parentheses to group multiple clauses to a single field.
To search for a title that contains both the word "return" and the phrase "pink
panther" use the query:
title:(+return +"pink panther")
Escaping Special Characters
Lucene supports escaping special characters that are part of the query syntax. The
current list of special characters is:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape these characters use the \ before the character. For example to search for
(1+1):2 use the query:
\(1\+1\)\:2
sort
The sort string has the syntax sortField1[:(ASC|DESC)][:type], for example
revisionDate:DESC:string. ASC or DESC is optional, as is the type, which is either string, int or
float. Multiple fields can be used for sorting; the sort strings must be separated by commas, for
example revisionDate:DESC:string,title:ASC.
If you do not include the sort argument in the lcontains operator, Lucene natural order, which is
score descending, will be used. For any other field ASC is the default sort order.
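A sketch of a multi-field sort string in lcontains (the sorted field names sentDate and subject are illustrative and assume those columns were indexed, e.g. via ExtraCols):

```sql
-- Sort by sentDate descending (as string), then subject ascending.
select subject from eMails
where lcontains(bodytext, 'security',
                'sentDate:DESC:string,subject:ASC', 1) > 0;
```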
label
A number used in conjunction with the lscore operator to identify which lcontains
operator is used for each lscore.
Use the LSCORE operator in a SELECT statement to return the score values produced by a
LCONTAINS query. The LSCORE operator can be used in a SELECT, ORDER BY, or GROUP BY
clause.
Syntax
LSCORE(label NUMBER)
label
Specify a number to identify the score produced by the query. Use this number to identify
the LCONTAINS clause which returns this score.
Example
SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1),subject FROM emails
where lcontains(bodytext,'security',1)>0;
Use the LHIGHLIGHT operator in a select statement to return a highlighted version of the
master column of the index associated with the LCONTAINS query. For now, highlighting
is only supported for the master column of the index, and the return value of this
function is a VARCHAR2 data type with the text highlighted. The VARCHAR2 limitation is not a big
problem because the highlighted text is usually a small part of the original column text,
shown to the user as a preview of the original document.
Syntax
LHIGHLIGHT(label NUMBER):VARCHAR2
label
Specify a number to identify the score produced by the query. Use this number to identify
the LCONTAINS clause which returns this score.
Example
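A minimal sketch of lhighlight usage, modeled on the lscore example above (an assumption, not taken verbatim from a tested script):

```sql
-- Returns the master column (bodytext) with matched terms
-- highlighted, as a VARCHAR2 preview of the document.
select lhighlight(1), subject from eMails
where lcontains(bodytext, 'security', 1) > 0;
```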
The PHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2
or CLOB of the input query. Columns not included in the cols argument will not be affected and
will be returned as is.
Syntax
index_name
Specify a Lucene Index to use.
qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the
Lucene Domain Index extension for pagination.
cols
A comma-separated list of columns to highlight; note that column names are capitalized
if you do not use a column alias.
stmt
Any SQL text of the query to execute via the DBMS_SQL package. Remember to use two
single quotes to represent a SQL single quote inside the string. Columns returned by this
query should map to the String, BigDecimal, Timestamp, CLOB, TIMESTAMP,
TIMESTAMPTZ and TIMESTAMPLTZ Java types; this means, for example, that for a table
with a VARCHAR2(40) column the associated Java type inside the OJVM is String,
so it can be highlighted or returned by this pipeline table function.
Example
SELECT * FROM
TABLE(phighlight(
'EMAILBODYTEXT',
'lucene OR mysql',
'SUBJECT,BODYTEXT',
'select lscore(1) sc,e.* from eMails e where lcontains(bodytext,''rownum:[1 TO 10]
AND (security OR mysql)'',''subject:ASC'',1)>0'
));
The RHIGHLIGHT pipeline table function performs highlighting on any column of type VARCHAR2
or CLOB of the input query. Columns not included in the cols argument will not be affected and
will be returned as is. This is a variant of PHighlight which requires an additional
argument (rType) telling the function the type that will be returned. This version is free of
any kind of SQL injection, and several invocations can be started in parallel by the RDBMS based on
the information in the last argument.
Syntax
index_name
Specify a Lucene Index to use.
qry
Lucene Query Parser syntax, same as the second argument of lcontains, except for the
Lucene Domain Index extension for pagination.
cols
A comma-separated list of columns to highlight; note that column names are capitalized
if you do not use a column alias.
rType
A collection type to be returned by the RHighlight table function, usually "colType TABLE OF
aRowType".
rws
Any SQL query wrapped by the CURSOR function if you are using SQL*Plus, for example,
or a JDBC ResultSet passed as setObject(n,rs) if you are using a Java application.
Columns returned by this query should map to the String, BigDecimal, Timestamp,
CLOB, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Java types; this means, for
example, that for a table with a VARCHAR2(40) column the associated Java type inside
the OJVM is String, so it can be highlighted or returned by this pipeline table function.
Example
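A hedged sketch of an rhighlight call from SQL*Plus, following the argument list above; the collection type EMAIL_TBL is hypothetical and would need to be created beforehand as a TABLE OF an object type matching the cursor's columns:

```sql
-- rws is wrapped with CURSOR; rType names the returned collection.
select * from
table(rhighlight(
  'EMAILBODYTEXT',
  'lucene OR mysql',
  'SUBJECT,BODYTEXT',
  'EMAIL_TBL',
  CURSOR(select lscore(1) sc, e.* from eMails e
         where lcontains(bodytext, 'security OR mysql', 1) > 0)
));
```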
The MoreLike.this function has two declarations: one using an index_name argument, which
assumes the currently connected user is the owner, and another using an owner,index_name pair
for using an index in another database schema.
Syntax
index_name
Specify a Lucene Index to use.
x
The ROWID used as pivot; it defines which row is used to extract the text with the terms used for
the More Like This Lucene functionality. The DefaultColumn parameter of the index defines
the column used to get the text; only columns of type VARCHAR2, CLOB or
XMLType are supported.
f,t
From/to pagination information; default values are 1 and 10.
minTermFreq,minDocFreq
minTermFreq is the frequency below which terms will be ignored in the source doc;
minDocFreq is the frequency at which words that do not occur in at least this many docs
will be ignored. Default values are 2 and 5.
SYS.odciridlist
An array of ROWIDs which can be wrapped with the pipeline table function ridlist_table
for selecting its values, for example (select * from table(ridlist_table(ridlist))).
The lfacets() aggregate function has one argument, which is a comma-separated list of the index
name and the category and sub-category to be queried.
Syntax
input
A comma-separated list including the index name, a category and optionally a sub-category.
The category and sub-category are in Lucene Query Parser syntax including the indexed
column name (Lucene Field), for example text:(Ciencias naturales y formales), line:[1 TO 10]
and so on. When a category and sub-category are present, the ODCI API starts the
computation by calculating the bit set of the main category and then iterates over each
sub-category doing a bitwise AND operation between the two bit sets.
LUCENE.AGG_TBL
A TABLE OF agg_attributes, where AGG_ATTRIBUTES is an object type with two fields,
qryText VARCHAR2(4000) and hits NUMBER. When a category and sub-category are
passed as arguments, the return value will be a table with each row representing the
cardinality of the intersection between the category and the sub-category, that is, a table
with a number of rows equal to the number of sub-categories.
To help format the output in a traditional query there is a function ljoin() which
receives as input an agg_tbl type plus a character separator and returns a string with all the
rows. Here is the syntax:
i_tbl
A LUCENE.agg_tbl table to scan.
i_glue
A VARCHAR2 string to use as separator, default value ",".
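A hedged sketch of lfacets and ljoin used together; the exact invocation is inferred from the argument descriptions above, and the facet categories (built on the SOURCE_BIG_LIDX index from the test section) are illustrative:

```sql
-- Cardinality of the intersection between the main category and
-- each sub-category, joined into one string with '; '.
select ljoin(lfacets('SOURCE_BIG_LIDX,text:(function),line:[1 TO 10]'), '; ')
from dual;
```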
The index_terms() pipeline table function returns a list of Lucene term values and their
frequency. It has two arguments: the first is the Lucene Domain Index name and the
second a Lucene term name.
Syntax
index_name
Lucene index name with the syntax SCHEMA.IDX_NAME, or IDX_NAME if the current user is the
owner.
term_name
Lucene term name; if this argument is NULL, the information for all Lucene index terms will
be returned.
LUCENE.term_info_set
A TABLE OF term_info, where TERM_INFO is an object type with two fields, term
VARCHAR2(4000) and docFreq NUMBER(10). This table can be easily iterated with a
traditional SELECT FROM construction, for example:
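A sketch of iterating index_terms(), using the SOURCE_BIG_LIDX index created in the TestQueryHits section (the term/docFreq column names come from the TERM_INFO type described above):

```sql
-- List every indexed term and its document frequency.
select term, docFreq
from table(index_terms('SOURCE_BIG_LIDX', NULL));
```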
The high_freq_terms() pipeline table function returns the Top-N most frequent Lucene term
values and their frequency. It has three arguments: the first is the Lucene Domain
Index name, the second a Lucene term name, and the last how many Top-N
terms should be returned.
Syntax
index_name
Lucene index name with the syntax SCHEMA.IDX_NAME, or IDX_NAME if the current user is the
owner.
term_name
Lucene term name; if this argument is NULL, the information for all Lucene index terms will
be returned.
num_terms
How many Top-N high frequency terms should be returned.
LUCENE.term_info_set
A TABLE OF term_info, where TERM_INFO is an object type with two fields, term
VARCHAR2(4000) and docFreq NUMBER(10). This table can be easily iterated with a
traditional SELECT FROM construction, for example:
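A sketch of high_freq_terms(), again using the SOURCE_BIG_LIDX index as an assumed example:

```sql
-- Top 10 most frequent terms for the 'text' field.
select term, docFreq
from table(high_freq_terms('SOURCE_BIG_LIDX', 'text', 10));
```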
Syntax
owner
Lucene index owner.
index_name
Lucene index name.
spellColumns
Lucene Domain Index columns to be included in Did You Mean dictionary.
distanceAlg
Distance algorithm used when creating the dictionary; possible values are Levenstein,
NGram or Jaro, default Levenstein.
This procedure updates the Lucene Domain Index structure, adding a new Field which stores
the information required for the Did You Mean functionality.
FUNCTION suggestwords
(
owner IN VARCHAR2,
index_name IN VARCHAR2,
cmpval IN VARCHAR2,
highlight IN VARCHAR2 DEFAULT null,
distancealg IN VARCHAR2 DEFAULT 'Levenstein'
) RETURN VARCHAR2
owner
Lucene index owner.
index_name
Lucene index name.
cmpval
String with values to be replaced by Did You Mean algorithm.
highlight
Tag used for highlighting if it is not null; for example, i will be used to return the
tag <i>text</i>.
distanceAlg
Distance algorithm used when creating the dictionary; possible values are Levenstein,
NGram or Jaro, default Levenstein.
This function queries the Lucene Domain Index to compute Did You Mean suggestions for the
input string.
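Given the signature above, a hedged sketch of a call; the misspelled input, and the assumption that suggestwords is directly callable from SQL, are both illustrative:

```sql
-- Suggest a corrected word list, highlighting replacements with <i>.
select suggestwords(USER, 'SOURCE_BIG_LIDX', 'funtcion', 'i', 'Levenstein')
from dual;
```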
ant upgrade-domain-index
ant ncomp-lucene-ojvm (10g only)
ant jit-lucene-classes (11g only)
• Do not store internal parameters into system's views and force to PopulateIndex:false
• After every sync, files marked as deleted are now purged to free BLOB storage
• Added the lfacets aggregate function for computing facets
• CountHits function no longer requires sort argument
• Filters are stored/retrieved using only the QueryParser.toString() key
• UN_TOKENIZED format string at DefaultUserDataStore class was replaced by NOT_ANALYZED
or NOT_ANALYZED_STORED according to new Lucene definitions.
• Fixed a bug when sync tries to process more than 32767 enqueued rowids.
• Added parameters for highlighting functions Formatter, MaxNumFragmentsRequired,
FragmentSeparator and FragmentSize.
• Added a PerFieldAnalyzer parameter to use an independent Analyzer for each column.
• Added sample of a custom Formatter org.apache.lucene.search.highlight.MyHTMLFormatter
• Fixed a compatibility problem between the 10g/11g SQL Date representation in pipeline table
functions.
• DefaultUserDataStore requires usage of an XPath text() expression to get only the textual value
• Added logging of the SQL being executed at the table indexer
• Changed document logging to FINER level
• More pre-defined mappings in DefaultUserDataStore for the NUMBER, BINARY_FLOAT,
BINARY_DOUBLE, TIMESTAMP, TIMESTAMPTZ and TIMESTAMPLTZ Oracle types.
• New parameter PopulateIndex:[true|false], controlling whether the Lucene index is populated at
creation time.
• New parameter IncludeMasterColumn:[true|false], to choose whether or not to index the master
column; useful with Virtual Columns and XMLType.
• New parameter BatchCount:integer, to choose how many rows are enqueued for
indexing when using create index ... parameters('SyncMode:OnLine');
• Creating an index with SyncMode:OnLine causes the Lucene Domain Index to enqueue
batches of BatchCount rows to be indexed by an AQ PL/SQL callback in the background. The Lucene
Domain Index is immediately ready for querying after creation.
• Batch rowid indexing is done using a pipeline function.
Binary download:
https://issues.apache.org/jira/secure/attachment/12366661/ojvm-09-27-07.tar.gz
CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm
CVS access:
cvs -d:pserver:[email protected]:/cvsroot/dbprism login
cvs -z3 -d:pserver:[email protected]:/cvsroot/dbprism co -P ojvm
https://issues.apache.org/jira/secure/attachment/12348574/ojvm-01-09-07.tar.gz
• The Data Cartridge API is used without column data to reduce the data stored in the queue
of changes and speed up the synchronize method.
• Query hits are cached, associated with the index search and the string returned by the
QueryParser.toString() method.
• If no ancillary operator is used in the select, the score list is not stored.
• The "Stemmer" argument is recognized as a parameter, giving the language argument for the
Snowball analyzer, for example:
create index it1 on t1(f2) indextype is lucene.LuceneIndex
parameters('Stemmer:English');.
• Before installing the ojvm extension it is necessary to execute "ant jar-core" in the snowball
directory.
• IndexWriter.setUseCompoundFile(false) is called to use multi-file storage (faster than the
compound file) because there is no file descriptor limitation inside the OJVM; BLOBs are used
instead of Files.
• Files are marked for deletion and purged when the Sync or Optimize methods are called.
• BLOBs are created and populated in one call using the Oracle SQL RETURNING clause.
• A testing script using the OE sample schema, with query comparisons against an Oracle Text
ctxsys.context index.
https://issues.apache.org/jira/secure/attachment/12347614/ojvm-12-20-06.tar.gz
This new release of the OJVMDirectory Lucene Store includes a fully functional Oracle Domain Index
with a queue for massive update/insert operations and many performance improvements.
https://issues.apache.org/jira/secure/attachment/12345967/ojvm-11-28-06.tar.gz
• The API for the Oracle Domain Index was completed, but the solution for using the
contains operator outside the where clause is not good.
• I will implement a singleton solution for the OJVMDirectory object when it is used in read-only
mode, typically when a user performs select operations against tables which have columns
indexed with Lucene. This implementation will greatly increase performance because
the index reader will be ready for each select operation. Obviously, I will check whether another user
or thread has made a write operation on the index in order to reload the read-only singleton.
• The queue for storing the changes on the index is not implemented yet; I'll add it
shortly.
https://issues.apache.org/jira/secure/attachment/12345516/ojvm.tar.gz