Understanding OpenText Search Engine 21
White Paper
Understanding Search Engine 21
Patrick Pidduck, Director, Product Management
Foreword
Since 2009, I have had the honor of working with the extraordinary software
development teams at OpenText responsible for the OpenText Search Engine.
Search has always been a fundamental component of the OpenText Content Suite
Platform, and OpenText pioneered several key technologies that serve as the
foundation for modern search engines. Our team has built upon more than 25 years
of search innovation, and contributed to external research initiatives such as TREC
for years.
OpenText knows search.
In the last few years, our customers have pushed scalability and reliability
requirements to new levels. The OpenText Search Engine has met these goals, and
continues to improve with each quarterly product update. Several billion documents
in a single search index, unthinkable just a few years ago, is a reality today at
customer sites.
This edition of the “Understanding Search” document covers the capabilities of
Search Engine 21. Search Engine 21 is the most recent version, superseding 16.2,
16, 10.5, 10.0 and versions reaching back to Content Server 9.7. We understand our
enterprise customer needs, and this latest search engine provides seamless upgrade
paths from all supported versions of Content Server. While protecting your existing
investments, we continue to add incredible new capabilities, such as efficient search
methods optimized for eDiscovery and Classification applications, enhanced
backups, and integrated performance monitoring.
This document would not be possible without the help of our resident search experts.
As always, you have my thanks: Alex and Alex, Ann, Annie, Christine, Dave, Dave,
Dave, Hiral, Jody, Johan, Kyle, Laura, Mariana, Michelle, Mike, Ming, Parmis, Paul,
Ray, Rick, Riston, Ryan, Scott and Stephen.
Patrick.
Contents

Basics
  Overview
    Introduction
    Disclaimer
    Relative Strengths
      Upgrade Migration
      Transactional Capability
      Metadata Updates
      Search-Driven Update
      Maintenance Commitment
      Data Integrity
      Scaling
      Advanced Queries
    Related Components
      Admin Server
      Document Conversion Server
      IPool Library
      Content Server Search Administration
      Query Languages
      Remote Search
    Backwards Compatibility
    Installation with Content Server
  Search Engine Components
    Update Distributor
    Index Engines
    Search Federator
    Search Engines
  Inter-Process Communication
    External Socket Connections
    Internal Socket Connections
    Search Federator Connections
      Search Queues
      Queue Servicing
      Search Timeouts
      Testing Timeouts
    File System
    Server Names
  Partitions
    Basic Concepts
    Large Object Partitions
  OTMetadataChecksum
  OTContentStatus
  OTTextSize
  OTContentLanguage
  OTPartitionName
  OTPartitionMode
  OTIndexError
  OTScore
  TimeStamp Regions
    OTObjectIndexTime
    OTContentUpdateTime
    OTMetadataUpdateTime
    OTObjectUpdateTime
  _OTDomain
  _OTShadow
  Regions and Content Server
    MIME and File Types
    Extracted Document Properties
    Workflow
    Categories and Attributes
    Forms
    Custom Applications
  Default Search Settings
Indexing and Query
  Indexing
    Indexing using IPools
    AddOrReplace
    AddOrModify
    Modify
    Delete
    DeleteByQuery
    ModifyByQuery
    Transactional Indexing
      IPool Quarantine
  Query Interface
    Select Command
    Set Cursor Command
    Get Results Command
    Get Facets Command
    Date Facets
    FileSize Facets
    Expand Command
    Hit Highlight Command
    Get Time
    Set Command
    Get Regions Command
  OTSQL Query Language
    SELECT Syntax
    FACETS Statement
    WHERE Clause
    WHERE Relationships
    WHERE Terms
    WHERE Operators
    Proximity - prox operator
    Proximity - span operator
    Proximity – practical considerations
    WHERE Regions
    Priority Region Chains
    Minimum and Maximum Regions
    Any or All Regions
    Regular Expressions
    Relative Date Queries
    Matching Lists of Terms
    ORDEREDBY
      ORDEREDBY Default
      ORDEREDBY Nothing
      ORDEREDBY Relevancy
      ORDEREDBY RankingExpression
      ORDEREDBY Region
      ORDEREDBY Existence
      ORDEREDBY Rawcount
      ORDEREDBY Score[N]
      Performance Considerations for Sort Order
      Text Locale Sensitivity
  Facets
    Purpose of Facets
    Requesting Facets
    Facet Caching
    Text Region Facets
    Date Facets
    FileSize Facets
    Facet Security Considerations
    Facet Configuration Settings
  Reserving Facet Memory
    Facet Performance Considerations
    Protected Facets
Basics
This section is an overview of Search Engine 21, and introduces fundamental
concepts needed to understand some of the later topics.
Overview
Introduction
Search Engine 21 (“OTSE” – OpenText Search Engine) is the search engine
provided as part of OpenText Content Server. This document provides information
about the most common Search Engine 21 features and configuration, suitable for
administrators, application integrators and support staff tasked with maintaining and
tuning a search grid. If you are looking for information on the internal details of the
data structures and algorithms, you won’t find it here.
This document is based upon the features and capabilities of Search Engine 21.1,
which has a release date of January 2021.
Disclaimer
DISCLAIMER:
This document is not official OpenText product
documentation. Any procedures or sample code are specific to the
scenario presented in this White Paper, are delivered as-is, and
are for educational purposes only. It is presented as a guide to
supplement official OpenText product documentation.
While efforts have been made to ensure correctness, the
information here is supplementary to the product documentation
and release notes.
Relative Strengths
There are many search engines available on the market, each of which has relative
merits. Search Engine 21 is a product of the ECM market space, developed by
OpenText, with a proven record as part of OpenText Content Server. This search
engine has been in active use and development for many years, and was previously
known by names such as “OT7” and “Search Engine 10”.
Because of the nature of OpenText ECM solutions, OTSE has a feature set oriented
towards enterprise-grade ECM applications. Some of the pertinent features which
make OTSE a preferred solution for these applications include:
Upgrade Migration:
As new features and capabilities are added, you are not required to re-index your
data. OTSE includes transparent conversion of older indexes to newer versions.
Our experience is that customers with large data sets often do not have the time or
infrastructure to re-index their data, so this is a key requirement.
Transactional Capability:
During indexing, objects are committed to the index in much the same way that
databases perform updates. If a catastrophic outage happens in the midst of a
transaction, the system can recover without data corruption. Additionally, logical
groups of objects for indexing can be treated as a single transaction, and the entire
transaction can be rolled back in the event that one object cannot be handled
properly.
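As a rough illustration of this all-or-nothing behavior, the following Python sketch models a batch of indexing requests that is rolled back when any object in the batch fails. The index structure, failure condition, and function names here are invented for the example; OTSE's actual transaction mechanism is internal to the engine and is not exposed as an API like this.

```python
# Hypothetical sketch of all-or-nothing batch indexing. The in-memory
# dict stands in for the on-disk index; real recovery uses checkpoints.

class IndexTransactionError(Exception):
    pass

def index_batch(index, batch):
    """Apply a batch of (object_id, metadata) adds as one transaction.

    If any object fails validation, roll the whole batch back,
    mirroring how the engine sets aside a failed group of objects.
    """
    snapshot = dict(index)          # cheap "checkpoint" for the sketch
    try:
        for object_id, metadata in batch:
            if not object_id or metadata is None:
                raise IndexTransactionError(f"bad object: {object_id!r}")
            index[object_id] = metadata
    except IndexTransactionError:
        index.clear()
        index.update(snapshot)      # roll back to the pre-batch state
        return False
    return True
```

The key property, shown in the rollback branch, is that a partially applied batch never becomes visible: either every object in the group is committed, or none are.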
Metadata Updates:
The OpenText search technology has the ability to make in-place updates of some or
all of the metadata for an object. This represents a significant performance
improvement over search technology that must delete and add complete objects,
particularly for ECM applications where metadata may be changing frequently.
Search-Driven Update:
OTSE has the ability to perform bulk operations, such as modification and deletion,
on sets of data that match search criteria. This allows for very efficient index updates
for specific types of transactions.
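The idea behind these bulk operations can be sketched in a few lines of Python. The predicate-based matching and the dictionary index below are invented for illustration; the real operations (DeleteByQuery, ModifyByQuery) are described later in this document and operate on the search index itself.

```python
# Illustrative sketch of search-driven bulk updates, in the spirit of
# OTSE's ModifyByQuery and DeleteByQuery operations.

def modify_by_query(index, predicate, changes):
    """Apply 'changes' to the metadata of every matching object."""
    matched = 0
    for object_id, metadata in index.items():
        if predicate(metadata):
            metadata.update(changes)
            matched += 1
    return matched

def delete_by_query(index, predicate):
    """Remove every object whose metadata matches the predicate."""
    doomed = [oid for oid, md in index.items() if predicate(md)]
    for oid in doomed:
        del index[oid]
    return len(doomed)
```

The efficiency gain comes from the engine locating the affected objects with one query, rather than the application enumerating and updating them one at a time.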
Maintenance Commitment:
OpenText controls the code and release schedules. This way, we can ensure that our
ECM solutions customers will have a supported search solution throughout the life of
their ECM application.
Data Integrity:
OTSE contains a number of features that allow the quality, consistency and integrity
of the search index and the data to be assessed. These features give system
administrators the tools they need to ensure that mission critical applications are
operating within specification.
Scaling:
Not only can OTSE support very large indices (more than 1 billion objects), it can
also be restructured to add capacity and rebalance the distribution of objects
across servers.
Related Components
The scope of this document is constrained to the core OTSE components which are
located within the search JAR file (OTSEARCH.JAR).
There are a number of other components of both the overall search solution and
Content Server which are strongly related to OTSE but are not covered in this
document. In some instances, because of the tight relationship with other
components, references may be made in this document to these other components.
For a complete understanding of the search technology, you may wish to also learn
about the following products and technologies:
Admin Server
The Admin Server is a middleware application which provides control, monitoring and
management of processes for Content Server. The Admin Server performs a number
of services, and is critical to the operation of the search grid when used with Content
Server. As a rule of thumb, there is generally one Admin Server installed on each
physical computer hosting OTSE components.
Document Conversion Server
DCS is a set of processes and services responsible for preparing data prior to
indexing. DCS performs tasks such as managing the data flows and IPools during
ingestion, extracting text and metadata from content, generating hot phrases and
summaries, performing language identification, and more. You should ensure that
DCS is optimally configured for use with your application before indexing objects.
IPool Library
Interchange Pools (IPools) are a mechanism for managing batch-oriented Data Flows
within Content Server. IPools are used to encapsulate data for indexing. OTSE uses
the Java Native Interface (JNI) to leverage OpenText libraries for reading and writing
IPools.
Content Server Search Administration
While most OTSE setup is managed using configuration files, in practice many of
these files are generated and controlled by Content Server. Many of the concepts
and settings described in this document have analogous settings within Content
Server Search Administration pages, and should be managed from those pages
wherever possible.
Query Languages
This document describes the search query language implemented by OTSE. It is
common for applications to hide the OTSE query language and provide an alternative
query language to end users. The Content Server query language – LQL – is NOT
described in this document.
Remote Search
Content Server Remote Search currently uses code within OTSE to facilitate
obtaining search results from remote instances of Content Server.
Backwards Compatibility
OTSE is capable of reading all indexes and index configuration files from all released
versions of OpenText Search Engine 20, Search Engine 16.2, Search Engine 16,
Search Engine 10.5, Search Engine 10, and OT7. OT7 is the predecessor to SE10.0
that was part of Content Server 9.6 and 9.7. For most of these, an index conversion
will take place. The new index will not be readable by older versions of the search
engines.
Indexes created with OT6 are not directly readable. Search Engine 10 can convert
an OT6 index into its own format, which can then be
upgraded in a second step using OTSE. In practice, given the improvements and
fixes since OT6, you would be best advised to re-index extremely old data sets. You
should consult with OpenText Customer Support if you are considering a migration
from these older search indices.
• otbackup.exe
• otrestore.exe
• otsumlog.exe
• otcheckndx.exe
• llremotesearch.exe
The llremotesearch.exe file is specifically for Content Server Remote Search, and is
not a requirement for other OTSE installations.
Update Distributor
The Update Distributor is the front end for indexing. The Update Distributor performs
the following tasks, not necessarily in this order:
• Monitors an input IPool directory to check for indexing requests.
• Reads IPools, unpacks the indexing requests.
• Breaks larger IPools into smaller batches if necessary.
• Determines which Index Engines should service an indexing request.
• Sends indexing requests to Index Engines.
• Rolls back transactions and sets aside the IPool message if indexing of an
object fails.
• Rebalances objects to a new Index Engine during update operations if a
partition is too full or retired.
• Manages which Index Engines can write Checkpoints.
• Grants merge tokens to Index Engines that have insufficient disk space.
• Controls sequence of operations for Index Engines writing backups.
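Two of the decisions listed above, splitting large IPools into smaller batches and choosing which Index Engine services a request, can be sketched as follows. The batch size, the partition records, and the "least-full writable partition" routing rule are all assumptions made for the example, not the Update Distributor's actual algorithm.

```python
# Simplified sketch of Update Distributor batching and routing choices.

def split_into_batches(requests, max_batch=3):
    """Break a large list of indexing requests into smaller batches."""
    return [requests[i:i + max_batch] for i in range(0, len(requests), max_batch)]

def choose_partition(partitions):
    """Route to the least-full partition that is still accepting objects."""
    open_parts = [p for p in partitions if p["mode"] == "read-write"]
    if not open_parts:
        raise RuntimeError("no writable partition available")
    return min(open_parts, key=lambda p: p["object_count"])
```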
Index Engines
An Index Engine is responsible for adding, removing and updating objects in the
search index. The Index Engines accept requests from the Update Distributor, and
update the index as appropriate. Multiple Index Engines in a system are common,
each one representing a portion of the overall index known as a “partition”.
The search index itself is stored on disk. In operation, portions of the search index
are loaded into memory for performance reasons.
Index Engines are also responsible for tasks such as:
• Converting older versions of the index to newer formats.
• Converting metadata from one type to another.
• Converting metadata between different storage modes.
• Background operations to merge (compact) index files.
Search Federator
The Search Federator is the entry point for search queries. The Search Federator
receives queries from Content Server, sends queries to Search Engines, gathers the
results from all Search Engines together, and responds to the Content Server with
the search results.
The Search Federator performs tasks such as:
• Maintaining the queues for search requests.
• Issuing search queries to the Search Engines.
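The "gather results from all Search Engines" step is essentially a scatter-gather merge. The sketch below shows the gather half: each partition returns hits already sorted by descending score, and the federator merges them into one ranked list. The `(score, object_id)` tuple shape is an assumption for the example; the real wire format is richer.

```python
# Sketch of the Search Federator's merge step over per-partition results.
import heapq

def federate_results(per_partition_hits, limit):
    """Merge per-partition result lists (each sorted by descending score)
    into a single list of the top 'limit' hits overall."""
    merged = heapq.merge(*per_partition_hits, key=lambda hit: -hit[0])
    return list(merged)[:limit]
```

Because each partition's list is already sorted, the merge is linear in the number of hits consumed, which is why result sorting is pushed down to the Search Engines.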
Search Engines
The Search Engines perform the heavy lifting for search queries. They are
responsible for performing searches on a single partition, computing relevance score,
sorting results, and retrieving metadata regions to return in a query. Every partition
requires a Search Engine to support queries.
The Search Engines keep an in-memory representation of key data that replicates
the memory in the Index Engines. The files on disk are shared with the Index
Engines.
Search Engines read Checkpoint files at startup and incremental Metalog and
AccumLog files during operation to keep their view of the index data current. These
Metalog and AccumLog files are checked every 10 seconds by default, and any time
a search query is run.
Search Engines also perform tasks such as building facets, and computing position
information used for highlighting search results.
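The "check every 10 seconds, and also before any query" refresh pattern described above can be sketched generically. The `LogFollower` class and its callback are invented names; only the polling interval and the query-triggered check come from the behavior described in this section.

```python
# Sketch of the refresh-if-stale pattern used for Metalog/AccumLog files.
import time

class LogFollower:
    def __init__(self, read_updates, interval_s=10.0, clock=time.monotonic):
        self.read_updates = read_updates   # callback applying new log entries
        self.interval_s = interval_s
        self.clock = clock
        self.last_check = float("-inf")

    def maybe_refresh(self, force=False):
        """Refresh when forced (e.g. a query is about to run) or when the
        polling interval has elapsed since the last check."""
        now = self.clock()
        if force or now - self.last_check >= self.interval_s:
            self.read_updates()
            self.last_check = now
            return True
        return False
```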
Inter-Process Communication
Each component of the search engine exposes APIs for a variety of purposes. This
section outlines the various communication methods used.
Sockets

  Threads:     210
  Connections: 200
  Ports:       21
The socket connections allocate and hold the threads and connections. Although this
uses the maximum number of resources, there are performance benefits and
predictability since it avoids allocation and re-use overhead that may exist within Java
or the operating system.
There are two search queues that may be used, the “normal” queue, and the “low
priority” queue. The low priority queue was first introduced in version 16.2.6, prior
versions supported only a single queue. The motivation for the low priority queue is
based on usage patterns in Content Server. There are background programmatic
operations that perform searches, and there are interactive user searches. The
programmatic searches have the potential to consume all available search capacity,
blocking users from having access. The purpose of having two queues is to allow
specific search capacities to be independently reserved for background searches and
user searches.
Use of both queues is optional. By convention, the “normal” queue is always used.
[SearchFederator_xxx]
SearchPort=8500
WorkerThreads=5
QueueSize=25
The low priority queue is disabled by default; its settings and default values are
shown below. Setting LowPrioritySearchPort to a valid port number (rather than -1)
activates the queue:
[SearchFederator_xxx]
LowPrioritySearchPort=-1
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25
Note that using the low priority queue requires an additional port. As a general
recommendation, small values (perhaps 2 or 3) should be used for the threads to
prevent the low priority searches from consuming too many resources.
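A toy model of the two queues makes the capacity reservation concrete: each queue has its own bounded size, so a flood of background searches fills only the low priority queue and cannot crowd out user searches. The class and method names below are invented; the queue sizes mirror the INI settings above.

```python
# Toy model of independently sized "normal" and "low priority" queues.
from queue import Queue, Full

class FederatorQueues:
    def __init__(self, normal_size=25, low_size=25):
        self.queues = {"normal": Queue(maxsize=normal_size),
                       "low": Queue(maxsize=low_size)}

    def submit(self, query, priority="normal"):
        """Queue a search request; a full queue rejects rather than blocks."""
        try:
            self.queues[priority].put_nowait(query)
            return True
        except Full:
            return False
```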
Queue Servicing
There are three phases to servicing a request to the Search Federator.
Phase 1 – Content Server indicates a desire to start a search query by
opening a connection to the Search Federator. The connection is put on an
operating system / Java queue (not in the search code).
Phase 2 – a dedicated thread takes the connection from Java, and places it
in an internal queue. If the internal queue is full, the request is discarded and
the connection is closed.
Phase 3 – when a search worker thread becomes available, the connection
is removed from the queue and given to the worker. At this point, the worker
responds to Content Server to indicate it is ready to receive the search
request, and Content Server sends the search query for processing.
Note that in versions prior to 20.2, the process around Phases 1 and 2 was
different – the pending requests were left on the operating system queue, and the
internal queue had an effective size of 1.
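The three phases can be sketched as a bounded hand-off between an acceptor and a pool of workers. Connections are stand-in strings here, and the function shape is invented; only the structure (OS backlog, bounded internal queue, worker acknowledgement) comes from the description above.

```python
# Minimal sketch of the three-phase hand-off: connections move from the
# caller (standing in for the OS backlog, Phase 1) into a bounded
# internal queue (Phase 2); workers drain it and acknowledge readiness
# before reading the query (Phase 3).
import queue
import threading

def serve(connections, queue_size=2, workers=1):
    internal = queue.Queue(maxsize=queue_size)
    served, discarded = [], []

    def worker():
        while True:
            conn = internal.get()
            if conn is None:          # shutdown sentinel
                return
            served.append(f"ready:{conn}")   # Phase 3: ack, then read query

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for conn in connections:          # Phase 2: move into internal queue
        try:
            internal.put(conn, timeout=1)
        except queue.Full:
            discarded.append(conn)    # full queue: connection is closed
    for _ in threads:
        internal.put(None)
    for t in threads:
        t.join()
    return served, discarded
```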
Search Timeouts
The Search Federator places a limit on how long it will wait for an application which
has opened a search transaction. If the application does not send a message within
the available time, then the Search Federator will close the connection and terminate
the transaction.
Keeping a connection and transaction open is expensive from a resource
perspective, and applications that leave connections open and idle can block search
activity by consuming all available threads from the search query pool.
There are two timeout values. The first is the time between the acknowledgement that
a worker is ready to receive a query and the first message arriving. This is expected
to be a short time, and the default is 10 seconds. The second is the time between
messages – for instance between consecutive “GET RESULTS” messages. This is
longer, with a default of 120 seconds. Both times can be adjusted or disabled in the
search.ini_override file. Bear in mind these are timeouts from the server perspective
– Content Server will also have timeout values from the client perspective.
Within the [SearchFederator] section of the search.ini file, you may specify the time
the Federator will wait between a connection being created and the first command
arriving (10 second default):
FirstCommandReadTimeoutInMS=10000
Time the Search Federator will wait between commands (2 minute default):
SubsequentCommandReadTimeoutInMS=120000
In either case, the timeouts can be completely disabled with a value of 0.
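A small helper makes the timeout selection explicit: the first message on a connection gets the short timeout, every subsequent message gets the longer one, and a configured value of 0 means "wait forever". The function name is invented; the defaults are the ones documented above.

```python
# Sketch of per-message read-timeout selection from the search.ini values.

def read_timeout_seconds(message_index,
                         first_ms=10_000,
                         subsequent_ms=120_000):
    """Return the socket timeout (in seconds) to apply before reading
    message number 'message_index' on a search connection, or None
    ("wait forever") when the configured value is 0."""
    ms = first_ms if message_index == 0 else subsequent_ms
    return None if ms == 0 else ms / 1000.0
```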
The Search Federator also places a limit on how long it will wait for a response from
a Search Engine with an open search session. If the Search Engine does not reply
within the available time, then the Search Federator will terminate the search
session. For example, if the Search Federator has issued a “SELECT” to a Search
Engine, it will wait a limited amount of time for the reply. This timeout value, in the
[DataFlow] section of the search.ini file, has a default value of 2 minutes:
QueryTimeOutInMS=120000
The search session on a Search Engine will regularly ping the Search Federator to
ensure that it is still responding. If the Search Federator does not answer, then the
Search Engine will terminate its search session to recover resources. In addition,
there is a failsafe timeout which is the maximum time that a Search Engine will leave
a session active. In normal operation, even if the Search Federator fails, this is not
typically encountered. Located in the [DataFlow] section of the search.ini file, the
failsafe timeout value is 6 hours:
SessionTimeOutInMS=21600000
Testing Timeouts
In a test environment, search results are often completed too quickly to permit testing
of system behavior for long searches and search timeouts. For test purposes, there
is a configuration setting that will cause all searches to take at least a defined period
of time. In production environments, this value should be 0.
MinSearchTimeInMS=0
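The padding behavior can be modeled as a wrapper that sleeps out any remaining time after a fast search completes. This is a test-aid sketch only, with invented names; it is not the engine's implementation of MinSearchTimeInMS.

```python
# Test-only sketch of MinSearchTimeInMS-style padding: if a search
# finishes early, sleep out the remainder so long-search and timeout
# behavior can be exercised in a small environment.
import time

def run_with_min_time(search_fn, min_time_ms=0):
    start = time.monotonic()
    result = search_fn()
    remaining = min_time_ms / 1000.0 - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)
    return result
```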
File System
The Index Engines communicate updates to the Search Engines using a shared file
system. At various times, files may be locked to ensure data integrity during updates.
It is important that the Search and Index Engines have accurate file information for
this to work correctly. Some file systems use aggressive caching techniques that can
break this communication method. Microsoft SMB2 caching is one example, and
it must be disabled for correct operation of OTSE. Microsoft SMB3 reverts to using
the SMB2 protocol in many situations, and so should also be avoided. You must
disable SMB2 caching on the servers running the search processes and on the file
server. Similarly, Microsoft Distributed File System (DFS) is known to have
unpredictable file locking behavior and must not be used.
Some customers have also experienced locking issues with NFS, and have needed
to use the NOLOCK or NOAC parameter in their NFS configuration to ensure correct
operation.
Server Names
Java enforces strict adherence to the various IETF standards for URIs and server
naming conventions. RFC 952, RFC 2396 and RFC 2373 are examples. Some
operating systems allow server names that do not meet the criteria for these
standards. When this happens, OTSE will likely fail with exceptions at startup. One
example we have seen is violation of this rule in RFC 952: “The rightmost label of a
domain name consisting of two or more labels, begins with an alpha character”. This
means a domain name such as “zulu.server3.7up” is invalid because the “7” must
instead be an alpha character.
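The rightmost-label rule quoted above can be sketched as a quick validity check. This is a simplified illustration of that one rule only, not the full RFC 952 or RFC 2396 grammar, and the function name is hypothetical:

```python
def rfc952_tld_ok(hostname):
    """Check the rule that the rightmost label of a domain name
    with two or more labels must begin with an alpha character."""
    labels = hostname.split(".")
    if len(labels) < 2:
        return True  # the rule only applies to multi-label names
    return labels[-1][:1].isalpha()
```

For example, `rfc952_tld_ok("zulu.server3.7up")` is False because the rightmost label begins with "7" rather than an alpha character.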
Partitions
Basic Concepts
The concept of partitions is central to how OTSE scales and manages search
indexes. A search index may be broken horizontally into a number of pieces. These
pieces are known as “partitions” in OTSE terminology. The sum of all the partitions
together represents the search index.
Splitting an index into partitions is needed for a number of possible reasons:
• For best query performance, some metadata can be stored in memory.
There are practical limits on the amount of memory that can or should be
used by a single Java process. Using partitions allows these limits to be
overcome.
• OTSE can often provide better indexing or searching performance by
allowing operations to be distributed to multiple partitions.
Update-Only Partitions
It is possible to place a partition in “Update-Only” mode. In this mode, the partition
will not accept new objects to index, but it will update existing objects or delete
existing objects. If a partition is marked as Update-Only, then the Update Distributor
will not send it new objects.
Update-Only behavior is a legacy feature inherited from OT7, and is still supported
for backwards compatibility. However, it is recommended that you do not use
Update-Only mode for future applications. In normal Read-Write mode, OTSE
contains a dynamic “soft” update-only feature which is generally superior. The use
and configuration of dynamic update-only mode is covered elsewhere in this
document. Beginning with Content Server 16, Update-Only mode is not available as
a configuration option from within Content Server.
The default storage mechanism for text metadata is independently configured for
Update-Only partitions. If your default configuration for Update-Only mode differs
from Read-Write mode, then the Index Engines will convert the index data structures
the first time they restart after the configuration is changed. This default
configuration setting is found in the FieldModeDefinitions.ini file.
Read-Only Partitions
OTSE allows partitions to be placed in a “Read-Only” mode. In this mode, the
partition will respond to search queries, but will not process any indexing requests.
Objects cannot be added to, removed from, or modified within the partition.
In operation, when started, the Index Engines for Read-Only partitions will shut down
once they have verified the index integrity. This means that fewer system resources
are being consumed. It also means that, since there is no Index Engine to respond
to the Update Distributor, a new instance of an object will be created in another
partition if you attempt to replace or update an object in a Read-Only partition.
You should only use Read-Only partitions in very specific cases. Customers will
occasionally get into trouble because they use Read-Only partitions when their
applications are still updating objects. This would happen in an application such as
Records Management – a “hold” is put on an object in a Read-Only partition, and a
duplicate entry is inadvertently created in another partition. Similarly, moving items to
another folder, updating classifications, updating category attributes and other
operations will cause this type of behavior. The search engines then respond to
search queries with multiple copies of objects.
The use of “Retired” mode for partitions avoids these issues, and should be
considered instead of Read-Only mode. Beginning with Content Server 16, Read-
Only mode will no longer be provided as a configuration option in the Content Server
administration interface.
Read-Only partitions also have a distinct default configuration for text metadata
storage in the FieldModeDefinitions.ini file, and changing to or from Read-Only mode
may trigger data conversion on startup.
Retired Partitions
OTSE allows partitions to be placed in a “Retired” mode. This mode of operation is
intended for use when a partition is being replaced. The behavior is close to
partitions in Update-Only mode. It will not accept new items, but it will update
existing objects or delete existing objects. If a partition is marked as Retired, then the
Update Distributor will not send it new objects. The key difference is that when an
object in a Retired partition is re-indexed, it will be deleted from the Retired partition
and added to a Read-Write partition.
Support for Retired Partitions is new starting with Search Engine 10.5. Retired mode
is strongly preferred over Read-Only mode, since Retired mode avoids problems
related to creating duplicate copies of objects in the Index.
Retired partitions are also a key feature for merging many small partitions into a set
of larger partitions. This is typical for customers upgrading older systems that use
RAM mode, and are switching to Low Memory mode. In this case, approximately
65% of the partitions can be marked as “Retired”, and incremental re-indexing will
move all the objects out of the Retired partitions. When
empty, the partitions can be removed from the search grid.
One common strategy for moving items from one partition to another is to place a
partition into Retired Mode, perform a search for all items in the Retired partition, add
them to a Collection, and re-index the Collection. This moves all the items that are
re-indexed from the Retired partition into other partitions. In practice, there are often
items left behind in the Retired partition after this is done. Typically, this is to be
expected. Occasionally, a Content Server object will be deleted but not removed
from the index. When this happens, it cannot be Collected. In other cases, the
Extractor may be set to re-index only recent versions of objects, and will not re-index
older versions. In some cases, when a document was deleted, an associated
Rendition may not have been removed from the index. If unsure about whether a re-
indexed Retired partition can be deleted, the OpenText customer support
organization may be able to provide some guidance.
Note that when objects are deleted from a partition, some of the data structures
remain in place. For example, a dictionary entry for a word may exist, even though
no objects now contain that word. It is normal for a retired partition that has had all
objects removed to show a small non-zero size. The search engine will also mark
items as deleted, but leave them in place until scheduled processes compact and
refresh the data – which may take days depending on the situation.
Read-Write Partitions
For completeness, the normal mode of operation for a partition is “Read-Write” mode.
In this mode, the partition will accept new objects, can delete objects and update
objects.
Read-Write partitions can be configured to automatically behave as Update-Only
partitions as they become full. More information on soft Update-Only configuration is
available in the optimization section.
To set the size threshold for determining if an object should be sent to a large object
partition:
[DataFlow_yyyy]
ObjectSizeThresholdInBytes=1000000
Metadata Regions
A region is OTSE terminology for a metadata field. Using a database analogy, you
can think of a region as being roughly equivalent to a column in a database.
Understanding and optimizing how metadata regions are defined and stored has a
big impact on performance, sizing, usability and search relevance. This section
provides background on the administration of regions to optimize the search
experience.
Defining a Region
Regions are defined in the configuration file “LLFieldDefinitions.txt”. This file is edited
to define the desired regions and their behaviors, and interpreted by the Index
Engines when they start. Currently, Content Server does not provide an interface for
editing and managing this file, so you must do this with a text editor.
Once a region is defined, it is recorded in the search index. Changing the definition
for an existing region in the LLFieldDefinitions.txt file or attempting to index a
metadata value that is incompatible with the defined region type will usually result in
an error. It is possible to redefine the type for existing metadata regions in many
cases as explained under the heading “Changing Region Types”.
Region Names
There are limitations on the labels which can be used for a metadata region. The
rules for acceptable region names are approximately the same as the rules for valid
XML labels.
The simplified explanation is that almost any valid UTF-8 character can be used in
the name, with some exceptions. White-space characters (various forms of spaces,
nulls and control characters) are not permitted. To remain compliant with XML
naming conventions, use of a hyphen ( “-” ), period ( “.” ), a number ( 0-9 ) or various
diacritical marks as the first character is discouraged.
The DCS filters often create region names from extracted document properties. In
some cases, DCS will strip white space and punctuation from the property names to
ensure that the region names are composed of valid characters.
Region names are case sensitive. The region “author” is different from the region
“Author”.
For example, suppose the following XML fragment is presented for indexing:
<customerName>
<firstName>bob</firstName>
<lastName>smith</lastName>
</customerName>
Then the region “customerName” is indexed, and it will have the value:
<firstName>bob</firstName><lastName>smith</lastName>.
Within the definitions file, you can define hierarchy structures that should be ignored
and flattened when looking for regions to index. In the case above, by declaring
“customerName” as a nested region, the field customerName is ignored and the
regions firstName and lastName would be recognized and indexed. This is not
intended to handle arbitrarily complex nesting structures, but was designed to
accommodate a few specific instances in data presented for indexing by Content
Server. In particular, indexing of Workflow objects within Content Server prior to
Content Server 10 SP2 Update 10 is the only known requirement for the use of
nested region names. Using the above example, a nested value is expressed within
the definitions file like this:
NESTED customerName
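The flattening behavior can be illustrated with a rough sketch. This is an assumption-laden model for illustration only: the real Index Engines consume IPool data and the LLFieldDefinitions.txt declarations, not raw XML through code like this.

```python
import xml.etree.ElementTree as ET

# Regions declared NESTED in LLFieldDefinitions.txt (illustrative)
NESTED = {"customerName"}

def extract_regions(xml_text):
    """Collect region values, ignoring declared nested wrappers so
    that their children are recognized and indexed as regions."""
    root = ET.fromstring(xml_text)
    regions = {}
    def walk(elem):
        for child in elem:
            if child.tag in NESTED:
                walk(child)  # skip the wrapper, descend into it
            else:
                regions[child.tag] = (child.text or "").strip()
    walk(root)
    return regions
```

With the customerName example above, the wrapper is ignored and firstName and lastName are indexed as separate regions.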
enable indexing for that region. For most applications, use of REMOVE is
recommended instead of DROP. In the definitions file:
DROP regionName
REMOVE someRegionName
Special considerations exist for the compound region types DATETIME and USER.
USER regions must be removed together in the same way they were defined, with 3
regions removed:
DATETIME regions can also be removed in their entirety by specifying both regions:
There is a special case supported for removing the TIME portion of a DATETIME pair
to leave only the DATE field behind. Ensure that you also add a DATE field to
prevent conversion of the DATE field to TEXT. There is no method available to
remove just the date portion of a DATETIME field to leave the time intact.
REMOVE OTVerMTime
DATE OTVerMDate
“noise” of empty regions. This feature can be disabled by adding the following entry
in the [Dataflow_] section of the search.ini file:
RemoveEmptyRegionsOnStartup=false
Renaming Regions
Consider the case where you need to change the name of a metadata field in
Content Server or a custom application. You are now confronted with the problem
that data which is already indexed is using an older name for the region.
OTSE provides a mechanism for handling these situations. Within the region
definitions file, you can rename an existing region like this:
Merging Regions
The merge capability of OTSE is similar to the RENAME capability, but is instead
used to combine two existing regions. Within the definitions file:
Once an index is running, any new values for sourceRegion will instead be indexed
within the targetRegion.
If targetRegion does not exist, the effective behavior of the MERGE command is the
same as a RENAME command.
There are limitations. The MERGE operation is NOT capable of merging text
metadata values that contain attributes. For Content Server, this includes the
OTName, OTDescription and OTGUID regions. The attributes will be silently lost
during the merge operation. You must check to ensure that regions being merged do
not incorporate attributes.
If conversions are required for MERGE at startup, this will trigger writing new
checkpoint files.
[Table: region type conversion matrix — rows (“From”): Boolean, Integer, Long,
Enum, Text, Date, TimeStamp; columns (“To”): Boolean, Integer, Long, Enum,
Text, Date.]
You cannot change the type of a Text region that has multiple values or uses
attribute/value pairs, since these concepts are only available for Text regions.
The procedure is as follows:
Edit the search.ini (or search.ini_override) file to include the following entry in the
[Dataflow] section: EnableRegionTypeConversionAsADate=YYYYMMDD, where
YYYYMMDD is today’s date. This informs OTSE that type conversion is allowed
today. This is a safety feature to prevent inadvertent region type conversion.
Edit the LLFieldDefinitions.txt file to have the desired region type definitions.
Restart the search processes. On startup, the Index Engines will determine that a
conversion is required, and use the stored values to rebuild the metadata indexes for
the changed regions. This process may require several minutes per partition, longer
if many region types are being defined.
In the event that a given value cannot be converted, the failure is recorded in the log
files and the OTIndexError count for metadata errors is incremented for the affected
object in the index.
You are strongly encouraged to back up an index before converting region types and
ensure that conversion has succeeded, reverting to the backups if there are
problems. In the log files, each failed conversion has an entry along these lines:
Couldn't set field OTFilterMIMEType for object
DataId=254417&Version=1 to text/plain:
…
</OTMeta>
This would create 3 separate values for the region OfficeLocation attached to this
object. A search for any of “Chicago”, “Ontario” or “New York” would match this
object. Similarly, if the region OfficeLocation is selected for retrieval, the results
would return all three values.
When updating values in regions, you cannot selectively update one specific value of
a multi-value region. If a new value is provided for OfficeLocation for this object, all 3
existing values would be replaced with the new data – which may be a single value or
multiple values.
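The replace-all semantics of multi-value updates can be sketched as follows. The dictionary-based index model here is purely illustrative, not how OTSE stores data internally:

```python
# Illustrative in-memory model of one object's metadata
index = {"obj1": {"OfficeLocation": ["Chicago", "Ontario", "New York"]}}

def update_region(obj_id, region, new_values):
    # An update replaces ALL existing values of a multi-value
    # region; there is no way to update one value selectively.
    index[obj_id][region] = list(new_values)

update_region("obj1", "OfficeLocation", ["Toronto"])
# All three previous values are gone; only "Toronto" remains.
```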
<OTMeta>
…
<OTName lang="en">My red car</OTName>
<OTName lang="fr">Mon voiture rouge</OTName>
…
</OTMeta>
In addition to using the multiple value capabilities of OTSE, region attributes are used
by Content Server to tag each metadata value with attribute key/value pairs. In this
example, the key is “lang”, and the values are “en” and “fr”.
When constructing a search query, use of the region attributes is optional. A search
for “red car” or a search for “rouge” will find this object and return the values. When
values are returned, the attributes are included in the results only on request.
It is possible to construct a search query against regions that have specific region
attributes. If you only want to locate objects that contain the term “rouge” in the
French language value for OTName, the where clause would look like this:
where [region "OTName"][attribute "lang"="fr"] "rouge"
The query language has also been extended to permit sorting of results using an
attribute. Consider the case where there are values for both French and English, but
the user preference is French. Sorting based on the French values is therefore
desired. Within the “ORDEREDBY” portion of a SELECT statement, the SEQ
keyword is used to specify the attribute to be used for sort preferences:
The use of attributes with text values for specifying language values is a relatively
simple example. You may index multiple attributes within a single region. You may
also have different attributes for each value. The following example illustrates this
concept for indexing:
serious data corruption or potential data injection is assumed and the metadata for
the object is discarded and an OTContentStatus code is used to capture the error.
For example, if an attacker provided metadata in the Description field of an object
that looked like this:
Silly stuff</Description><fakeRegion>Certified
Paid</fakeRegion><Description>nothing to see here
Then this data could be wrapped in a legitimate Description region when extracted for
indexing, resulting in:
<Description>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>
Which effectively forges a value for fakeRegion. By using the otb attribute,
<Description otb=94>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>
The Index Engine would notice that the Description region ended after only 11 bytes
instead of 94 bytes, and would prevent the injection of the fakeRegion by flagging the
object metadata as unacceptable. Content Server first began using this otb
protection for regions generated by Document Conversion Server in September
2016, and for regions provided by Content Server metadata in December 2016.
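The essence of the otb check is a declared-length comparison, sketched below. The function name is hypothetical; the Index Engine's actual validation logic is internal to the product.

```python
def otb_guard(declared_len, value):
    """Accept a region value only if its body is exactly the number
    of UTF-8 bytes the producer declared in the otb attribute; a
    mismatch indicates the region was terminated early, i.e. tag
    injection."""
    return len(value.encode("utf-8")) == declared_len
```

In the example above, "Silly stuff" is 11 bytes, so a declared otb value of 94 fails the check and the object's metadata is flagged as unacceptable.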
The otb attribute is never stored in the index. There is a search.ini setting that will
disable this capability, which will ignore the otb value. In the [Dataflow_] section:
IgnoreOTBAttribute=true
Key
Each object in the index must have a unique identifier, or key. The KEY entry in the
region definitions file identifies which region will be used as this unique identifier. It is
of type text and may not have multiple values. Exactly one must be defined. The
default Key name is OTObject. During indexing, the Key is typically represented by
the entry OTURN within an IPool. To paraphrase, in a default Content Server
installation, the OTURN entry in an IPool is treated as the Key, and populates the
region OTObject.
KEY OTObject
Text
Text, or character strings. Text strings must be defined in UTF-8 encoding. Text
strings can potentially be very large. Because of this, many customers find that the
available space in their search index is consumed quickly by text regions. To help
manage the large potential sizes, there are several methods available for storing text
metadata. This is covered in a separate section.
Text values may contain spaces and special punctuation. When represented in the
input IPools, certain characters may need to be ‘escaped’ to allow them to be
expressed in the IPools. In general, this means placing a backslash (‘\’) character
before “greater than” and “less than” characters (‘<’ and ‘>’).
There are some features available for TEXT regions which are not available for other
data types, and these may affect the decision about which type of region is suitable
for a given metadata field. TEXT regions support multiple values for an object, and
TEXT regions also support attribute keys and values.
It is possible to index numeric information in a text region, but they
are indexed as strings. When using comparison operations – such
as greater than, less than, ranges and sorting – remember that
strings sort differently than numbers. Intuitively, you expect the
number 123 to be greater than the number 50. But text
comparisons consider 123 to be less than 50. For example, in a
TEXT region, a clause of WHERE [region "partnum"] range
"100~200" will match a value of 1245872. If numeric comparisons
are important, a TEXT region is not a good choice.
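The lexicographic behavior described in the note is easy to demonstrate in plain Python (this is an illustration of string ordering in general, not OTSE code):

```python
# String comparison proceeds character by character, so numeric
# intuition does not apply:
assert "123" < "50"                 # '1' sorts before '5'
assert "100" <= "1245872" <= "200"  # "falls inside" the range
print(sorted(["50", "123", "1245872"]))  # ['123', '1245872', '50']
```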
TEXT textRegionName
There are default limits on the size and number of values you can place in a text
region. It is possible to configure these limits on a per-region basis. Size is
expressed in Kbytes. These parameters are optional. More details are available in
the “Protection” section of this document.
Rank
The rank type region is a special case for modifiers used in computing the relevance
of an object to boost its position in the result list. For example, frequently used
objects may be given a rank of 50. The default is 0. Values in this region must be
between 0 and 100 inclusive. Only 1 region may be defined with type of rank. In the
definitions file:
RANK rankRegionName
Integer
An integer is a 32 bit signed value, which can represent an integer value between -
2,147,483,648 and 2,147,483,647. Integer values are stored in memory. Search
results can be sorted on an integer field. In the definitions file:
INT integerRegionName
Long Integer
A long integer is a 64 bit signed value, which can represent a number between
−9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 inclusive. LONG
integer values are stored in memory. Existing Integer fields in an index can be
converted to LONG Integer values by changing their definition. Search results can
be sorted on a LONG integer field. In the definitions file:
LONG longRegionName
Timestamp
A TIMESTAMP region encodes a date and time value. TIMESTAMP values are
expressed in a string format that is compatible with the standard ISO 8601 format.
The milliseconds and time zone are optional, but time up to the seconds is
mandatory:
2011-10-21T14:24:17.354+05:00
2011-10-21T14:24:17
Where
The time zone is always optional. If omitted, the local system time zone will be
assumed. The local system time zone is determined from the operating system, but
can also be explicitly set by means of a search.ini file setting. Internally, timestamp
values are converted to UTC time before being indexed.
During search queries, lower significance time elements can be omitted. For
instance, the following will all be accepted:
2011-05-30T13:20:00
2011-05-30T13:20
2011-05-30-2:30
2011
If not fully specified, during indexing the earliest possible time for a value will be
used. For example:
2011-05
Would be interpreted as:
2011-05-01T00:00:00.000
TIMESTAMP values are kept in memory, stored as 64 bit integers. In the definitions
file:
TIMESTAMP timestampRegionName
There are special behaviors for several reserved metadata regions that use
TIMESTAMP definitions for tracking the time when objects are indexed or modified.
See the section on Reserved Regions for more information.
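The earliest-possible-time rule for partial values can be sketched by padding the missing low-significance elements from a template. This is a simplified illustration: time zone suffixes and the engine's full parsing rules are not handled here.

```python
from datetime import datetime

# Earliest instant template: pad whatever the partial value omits
TEMPLATE = "0001-01-01T00:00:00.000"

def earliest(partial):
    """Interpret a partial timestamp as the earliest possible time,
    e.g. '2011-05' becomes 2011-05-01T00:00:00.000."""
    padded = partial + TEMPLATE[len(partial):]
    return datetime.strptime(padded, "%Y-%m-%dT%H:%M:%S.%f")
```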
Enumerated List
The enumerated type is ideal for metadata regions which will have one of a defined
set of values. For example, file type identifiers (Word, Excel, etc.) are members of a
set of file types. Enumerated lists use less memory than text if RAM storage is being
used. In the definitions file:
ENUM enumerableRegionName
Boolean
The BOOLEAN type is used for objects which can have a value of true or false.
Fields of type BOOLEAN use memory very efficiently. In order to accommodate the
reality that different applications represent BOOLEAN values in different ways, the
indexing processes will accept BOOLEAN values in any of the following alternate
forms:
true false
yes no
1 0
on off
y n
t f
Boolean values are not case sensitive, so that False, FALSE and false are
equivalent. When retrieved, the values are always presented as true or false,
regardless of which form was used for indexing. If building a new indexing
application, the use of true and false is the preferred form.
BOOLEAN booleanRegionName
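The accepted alternate forms and their case-insensitive matching can be sketched like this (an illustrative normalizer, not the indexing code itself):

```python
TRUE_FORMS = {"true", "yes", "1", "on", "y", "t"}
FALSE_FORMS = {"false", "no", "0", "off", "n", "f"}

def parse_boolean(raw):
    """Normalize any accepted BOOLEAN form to True/False;
    matching is case-insensitive."""
    v = raw.strip().lower()
    if v in TRUE_FORMS:
        return True
    if v in FALSE_FORMS:
        return False
    raise ValueError("not a recognized BOOLEAN form: %r" % raw)
```

However a value was indexed, retrieval always presents it as true or false.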
Date
A Date region accepts a string that represents a date in the form ‘YYYYMMDD’,
where YYYY is the year, MM the month, and DD the day. For example, 20130208
would represent February 8th 2013. Date values can be presented in search
facets, and used in relevance scoring computation. This form of a Date matches the
format for dates used in Content Server. The date portion of a DateTime region is
effectively a Date region. The Date region type is first available in Search Engine 10
Update 10.
DATE dateRegionName
Currency
A region can be defined as a currency, a feature first available with Update 2015-09.
When so declared, the input data will be assumed to be in one of several common
forms that are used to represent currency values. The data is stored internally as a
long integer, with an implied 2 decimal digits. Character strings preceding or trailing
the currency value are discarded, which would typically be a symbol or a country
currency designation. Although some tolerance of poorly formed currency values is
built in, the expectation is that well formed data with 0 or 2 digits after the decimal will
be present. Examples of valid currency representations are:
$1,376,378 → 1376378.00
1456.87 AUD → 1456.87
€ 8.447,75 → 8447.75
$ 4000US → 4000.00
CURRENCY2 ListPrice
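A rough sketch of this normalization follows. This is a hypothetical approximation of the behavior described above, not the product's parser, and it covers only the well-formed cases listed in the examples:

```python
import re

def parse_currency(raw):
    """Normalize a currency string to a long integer with an
    implied 2 decimal digits, discarding leading or trailing
    symbols and currency codes."""
    m = re.search(r"\d[\d.,]*", raw)
    if m is None:
        raise ValueError("no currency value in %r" % raw)
    s = m.group()
    # Treat the last '.' or ',' as the decimal mark only when
    # exactly 2 digits follow it; otherwise all separators are
    # thousands separators.
    dec = max(s.rfind("."), s.rfind(","))
    if dec != -1 and len(s) - dec - 1 == 2:
        whole, frac = re.sub(r"[.,]", "", s[:dec]), s[dec + 1:]
    else:
        whole, frac = re.sub(r"[.,]", "", s), "00"
    return int(whole) * 100 + int(frac)
```

For instance, "€ 8.447,75" and "$ 4000US" normalize to the stored values 844775 and 400000 (i.e. 8447.75 and 4000.00 with the implied decimals).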
Aggregate-Text Regions
An AGGREGATE-TEXT region has a search index which is the sum of all the regions
it aggregates, but does not store a copy of the values. The values remain within the
original regions. Aggregation only applies to TEXT regions.
Judicious use of AGGREGATE-TEXT regions can improve search performance and
simplify the user experience. Searching many text regions is slower than searching
against an equivalent AGGREGATE-TEXT region. When the AGGREGATE-TEXT
feature is combined with the DISK_RET storage mode for text regions, a significant
reduction in the total memory used to store the index and metadata of the aggregate
is possible if not using Low Memory mode.
AGGREGATE-TEXT regions are constructed using the LLFieldDefinitions.txt file.
Create an entry along these lines:
CHAIN Regions
The CHAIN definition can be used to define a synthetic region which is used for
constructing queries against lists of regions. The list is prioritized. The value of the
first region that is defined (not null) is used for evaluating the query. There is no
additional storage or index penalty since the definition is an instruction used at query
execution that directs how the CHAIN region should be evaluated.
CHAIN UserHandle UserID FacebookID TwitterID
CHAIN regions can be used with any region type. Using different region types within
a single CHAIN region is not recommended, since not all search operators are
consistently available or applied to all region types.
The [first "UserID","FacebookID","TwitterID"] syntax in a query is equivalent to a
CHAIN region for queries. However, when a CHAIN region is predefined, the value
of the CHAIN region can also be requested in the search results using the SELECT
statement.
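The first-defined-value evaluation of a CHAIN region can be sketched as follows (an illustrative model of the query-time behavior, not OTSE code):

```python
def chain_value(doc, regions):
    """Return the value of the first region in the prioritized
    list that is defined (not null) for this document."""
    for region in regions:
        value = doc.get(region)
        if value is not None:
            return value
    return None

# CHAIN UserHandle UserID FacebookID TwitterID
# A document with no UserID falls through to FacebookID:
user_handle = chain_value({"FacebookID": "fb123", "TwitterID": "tw9"},
                          ["UserID", "FacebookID", "TwitterID"])
```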
It is possible to change the text metadata storage modes for an existing index without
re-indexing the content. The Index Engines can perform any necessary storage
mode conversions when they are started.
Content Server exposes control over the storage modes in the search administration
pages. Beginning with Content Server 16, support for several legacy configuration
modes has been removed, forcing indexes to use DISK + Low Memory + Merge
Files as the proven best overall configuration. For most applications, the
configuration file settings described here will not need to be directly manipulated.
[ReadWrite]
SomeRegionName=DISK
OtherRegionName=DISK_RET
[ReadOnly]
ImportantRegionName=RAM
[NoAdd]
HugeRegionName=DISK
[Retired]
HugeRegionName=DISK
The [General] section of this file specifies the default storage mode for text metadata.
The ‘NoAdd’ value is the setting for Update-Only partitions.
You can also specify storage modes for regions which differ from the default settings.
Each partition mode has a section, and a list of regions and their storage modes can
be provided. Note that Low Memory and Merge File storage modes require DISK
configuration as a pre-requisite.
The FieldModeDefinitions.ini file is generated dynamically by administration
interfaces within Content Server. Normally, you should not edit this file.
Beginning with Content Server 16, RAM based storage, ReadOnly mode and NoAdd
mode are no longer available through the administrative interfaces.
typically see a 30% reduction in the indexing performance with Disk storage relative
to Memory storage. For example, in one of the OpenText test cases using a 4-
partition system performing a 1 million+ objects indexing test: 7 hours 24 minutes
with Disk mode versus 5 hours 9 minutes in RAM mode.
Switching between Value Storage and Low Memory disk modes will trigger a
conversion of the index format when the Index Engines are next started. Typically,
conversion of a partition should take less than 20 minutes. Value Storage mode is
backwards compatible with versions of Search Engine 10.0 back to Update 2. Low
Memory mode is new beginning with Update 9, and partitions in Low Memory mode
cannot be read by earlier versions of Search Engine 10.5.
The Merge File storage mode is first available in Content Server 10.5 Update 2015-
03.
Retrieval Storage
This mode of storage is optimized for text metadata regions which need to be
retrieved and displayed, but do not need to be searchable. In this mode, the text
values are stored on disk within the Checkpoint file, and there is no dictionary or
index at all. This mode of operation is recommended for regions such as Hot
Phrases and Summaries. These regions do not need to be searchable since they
are subsets of the full text content (you can search the full body text instead). Typical
ECM applications see a savings of 25% of metadata memory using Retrieval Storage
mode instead of Memory Storage for these two fields.
Retrieval Storage mode can be configured in the FieldModeDefinitions.ini file using
the value DISK_RET.
[DataFlow_DFname0]
DiskRetSection=DISK_RET
[DISK_RET]
RegionsOnReadWritePartitions=OTSummary,OTHP
RegionsOnNoAddPartitions=OTSummary,OTHP
RegionsOnReadOnlyPartitions=OTSummary,OTHP
are uncertain about whether you will need to convert regions, using
a more conservative partition memory setting may be advisable in
order to ensure you have memory available for future metadata
region tuning.
Reserved Regions
There are a number of region names which are reserved by OTSE, and application
developers must be aware of the restrictions on their use. In most scenarios, the
Document Conversion Server is part of the indexing process, and DCS will also add
a number of metadata regions that are not described here.
OTMeta
The OTMeta region is reserved for use in two ways. In the first case, the region
OTMeta is reserved to indicate the collection of all metadata regions defined in the
Default Metadata List. This list is described in the search.ini file by the entry
DefaultMetadataFieldNamesCSL. A query against the OTMeta region will
search this entire list of regions. Where possible, queries of this form should be
avoided, since they may be relatively slow compared to searching a specific region,
particularly if there are many regions included in the default search region list.
The second application is using OTMeta as the prefix for a region in a search query.
A query with a WHERE clause of [region "someRegion"] "term" is
equivalent to [region "OTMeta": "someRegion"] "term".
<furniture>
<chairs>
4
<chairColor>red</chairColor>
</chairs>
</furniture>
You can construct a query to locate objects where the chair color is red. The
WHERE clause of the search query would look something like this:
The XML search capability does not require a complete XML path specification. The
following WHERE clauses would also match this result, but would potentially also
match other results that are less specific:
[region "OTData":"chairs":"chairColor"] "red"
[region "OTData":"chairs"] "red"
To be a candidate for XML search matching, the XML document must have been
assigned the value text/xml in the OTFilterMIMEType region, which is typically
the responsibility of the Document Conversion Server. The metadata region and the
value for allowing XML content search are configurable in the DataFlow section of the
search.ini file:
ContentRegionFieldName=OTFilterMIMEType
ContentRegionFieldValue=text/xml
OTObject
Each index must specify a unique key region which functions as the master reference
identifier for an object. The region which represents the key is declared in the region
definitions file, but by convention and by default, the region OTObject is almost
always used as the key. During indexing, the unique key is defined in the OTURN
entry for an IPool object.
In practice, Content Server uses strings that begin with “DataId=” for the unique
identifier of managed objects. There are special cases in the code that rely on this
form of the OTObject field to determine when certain optimizations can be applied,
such as Bloom Filters for membership within a partition. If you are creating
alternative or custom unique object identifiers, ensure that the string “DataId” is not
present in the identifier to avoid unexpected behaviors.
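A custom indexing pipeline can guard against this restriction up front. The following is a minimal sketch; the helper name is ours and no such check exists in OTSE itself:

```python
def validate_custom_key(identifier: str) -> str:
    """Reject custom OTObject keys containing the string "DataId",
    which is reserved for Content Server's "DataId=..." identifiers.
    (A hypothetical helper for custom applications, not an OTSE API.)"""
    if "DataId" in identifier:
        raise ValueError(
            "custom OTObject keys must not contain the string 'DataId'")
    return identifier

# An acceptable custom key passes through unchanged:
print(validate_custom_key("myapp:invoice:8842;ver=2"))
```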
OTCheckSum
This region contains a checksum for the full text content indexed for an object. The
value is generated by the Index Engines. If an OTCheckSum value is provided when
indexing an object, the value is ignored and the metadata error count for the
object is incremented. You can search and retrieve this region.
Internally, the Index Engines use this field to optimize re-indexing operations by
skipping content that is unchanged. This value is also used by index verification
utilities to verify that data has not been corrupted.
OTMetadataChecksum
This region has several purposes related to checksums for metadata. You cannot
index this region, but you can query against it and retrieve the values. Internally, this
value is used to verify the correctness of the metadata. Errors in the checksum
generally indicate severe hardware errors.
When a new object is indexed, a checksum of each metadata value is made. These
values are combined to create an aggregate checksum value, and the checksum is
stored in the region OTMetadataChecksum.
A background process is then scheduled which runs at a low priority. This process
traverses all objects in the index and recalculates the metadata checksum. If the
recalculated value does not match the stored value, a message is logged, and an
error code (-1) is placed in the OTMetadataChecksum region for that object.
Applications can find objects with metadata checksum errors by searching for a value
of -1 in this region.
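The per-region and aggregate checksum scheme can be illustrated with a short sketch. The actual checksum algorithm used by the Index Engines is not documented here; CRC-32 is used purely for illustration and the function names are ours. Only the -1 error sentinel comes from the description above:

```python
import zlib

CHECKSUM_ERROR = -1  # sentinel placed in OTMetadataChecksum on mismatch

def region_checksum(value: str) -> int:
    # Per-region checksum (illustrative algorithm only).
    return zlib.crc32(value.encode("utf-8"))

def aggregate_checksum(regions: dict) -> int:
    # Combine per-region checksums into one aggregate value.
    agg = 0
    for name in sorted(regions):
        agg = zlib.crc32(repr((name, region_checksum(regions[name]))).encode(), agg)
    return agg

def verify(regions: dict, stored: int) -> int:
    # Background-style revalidation: keep the stored value if intact,
    # otherwise record the -1 error sentinel.
    return stored if aggregate_checksum(regions) == stored else CHECKSUM_ERROR

meta = {"OTName": "Cars", "OTCurrentVersion": "true"}
stored = aggregate_checksum(meta)
print(verify(meta, stored) == stored)           # True: unchanged metadata verifies
meta["OTName"] = "C\x00rs"                      # simulate corruption
print(verify(meta, stored) == CHECKSUM_ERROR)   # True: mismatch flagged as -1
```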
If an existing index does NOT have checksums computed, then the background
process will populate checksum values. When objects are re-indexed, changes to
the metadata will be reflected in the new checksum. Transactional integrity for
metadata regions that were not changed is preserved.
A configuration setting in the Index Engine section of the search.ini file controls
this feature; acceptable values are ON, OFF and IDLE. When IDLE, new indexing
operations will still create checksums, but the background process will not validate
them. The default value is OFF for backwards compatibility.
MetadataIntegrityMode=OFF (IDLE | ON)
By default, the engines will wake up once every two seconds and verify 100 objects:
MetadataIntegrityBatchSize=100
MetadataIntegrityBatchIntervalinMS=2000
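These defaults make it straightforward to estimate how long a full background verification pass takes. A rough calculation, assuming a single Index Engine running at exactly the configured rate:

```python
batch_size = 100      # MetadataIntegrityBatchSize
interval_s = 2.0      # MetadataIntegrityBatchIntervalinMS / 1000

# Objects verified per day at the default rate (one engine):
objects_per_day = batch_size / interval_s * 86_400
print(f"{objects_per_day:,.0f} objects/day")        # 4,320,000 objects/day

# Approximate time for a full pass over 100 million objects:
print(f"{100_000_000 / objects_per_day:.0f} days")  # 23 days
```

In a multi-partition index, each Index Engine verifies its own objects, so a full pass would likely complete sooner than this single-engine estimate suggests.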
Metadata regions stored on disk are excluded from this processing by default, since
disk files have other checksum validation mechanisms. It is possible to include
checksum validation for regions stored on disk, as indicated below, but the
processing is considerably slower in this mode:
TestMetadataIntegrityOnDisk=OFF (ON)
OTContentStatus
This region is used to record an indicator of the quality of the full text index for each
object. This data can assist applications with assessing the quality of the indexed
data, and taking corrective action when necessary. The status codes are roughly
grouped into 4 levels of severity – level 100, 200, 300 and 400 codes, where 100
level codes indicate good indexed content, and level 400 codes represent significant
problems with the content.
Applications can provide a status code as part of the indexing process. If the
Indexing Engines encounter a more serious content quality condition (a higher
number code) then the higher value is used. In other words, the most serious code is
recorded if multiple status conditions exist.
The majority of the codes are generated within DCS. Based upon Content Server 16,
the defined codes are:
100 There is no content indexed, only metadata. This is expected behavior, since
no content was provided as part of the indexing request.
103 This is the value for a normal, successful extraction and indexing of a single
document, both text and metadata.
104 One or more metadata regions contained non-UTF8 data. The non-UTF8 bytes
were removed and best-attempt indexing of the region performed. This
behavior only exists when region forgery detection is disabled.
120 The full text content of the indexing request was correctly processed, and is
comprised of multiple objects. The metadata of only the top or parent object
was extracted. The full text content of all objects is concatenated together. An
example is when multiple documents within a single ZIP file are indexed.
125 There were multiple objects provided for indexing, but some of them were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. The metadata of only the top or parent object was extracted. The
full text content of all objects that were not discarded is concatenated
together. A typical example would be when a Word document and JPEG photo
are attached to an email object, and the JPEG was discarded as an excluded
file type.
130 There were one or more content objects provided for indexing, but all were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. There is no full text content.
150 During indexing, the statistical analyzer in the Index Engine identified that the
content has a relatively high degree of randomness. This is a warning, the data
was accepted and indexed.
300 During indexing, the text required more memory than is allowed by the
Accumulator memory settings that are currently configured. The text has been
truncated, and only the first portion of the text that fit in the available memory
has been indexed.
305 Multiple content objects were provided, and at least one but not all of them are
an unsupported file format. There is some full text content, but the content of
the unsupported files has not been indexed.
310 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects consists of an unsupported file
format.
320 Multiple content objects were provided, and at least one but not all of them
timed out while trying to extract the full text content. There is some full text
content, but the content of the objects which timed out has not been indexed.
360 Multiple content objects were provided, and at least one but not all of them
could not be read. There is some full text content, but the content of the objects
exhibiting read problems has not been indexed.
365 One or more content objects were provided, and the full text of at least one but
not all of them could be indexed. At least one of these objects was rejected
because of a serious internal or code error while preparing the content. This
error may or may not recur if you re-index this object.
401 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of
unsupported character encoding.
405 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because the
process timed out while trying to extract the full text content from a file.
406 Non-UTF8 data was found in metadata regions with region forgery detection
enabled. The metadata was discarded.
408 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of a
serious internal or code error while preparing the content. This error may or
may not recur if you re-index this object.
410 DCS was unable to read the contents of the IPool message or the file
containing the content. No full text content has been indexed.
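Because the codes are banded by hundreds and the most serious code wins, applications can reason about OTContentStatus numerically. A sketch (the banding and the highest-value rule come from the description above; the function names are ours):

```python
def severity_level(code: int) -> int:
    """Map an OTContentStatus code to its severity band (100, 200, 300 or 400)."""
    return (code // 100) * 100

def combined_status(codes: list) -> int:
    """When multiple status conditions exist, the most serious
    (highest-numbered) code is recorded."""
    return max(codes)

print(severity_level(125))               # 100: good indexed content
print(severity_level(310))               # 300
print(combined_status([103, 150, 305]))  # 305
```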
OTTextSize
This region captures the size of the indexed full text content in bytes. Note that for
many languages there may be fewer characters than bytes. Note that this reflects
the size of the text extracted by DCS and filters, and can be significantly different
from the OTFileSize region defined by Content Server. This region should be
declared as type INTEGER, and was first available in update 21.1.
OTContentLanguage
This region is optionally generated by the Document Conversion Server. DCS can
assess the full text content of an object to determine the language in which the
content is written. The language code is then typically represented in this region.
OTPartitionName
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the name of the partition which contains
the object. In a search query, OTPartitionName supports equals and not equals, for
either an exact value or a specific list of range values. Operations like regular
expressions or wildcards are not supported. This limited query set is intended to help
administrators with system management tasks, such as locating all the objects in a
given partition. In Content Server, partition names usually start with the text
“Partition_”.
OTPartitionMode
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the operating mode of the partition which
contains the object. In a search query, OTPartitionMode supports equals and not
equals, for either an exact value or a specific list of range values. Operations like
regular expressions or wildcards are not supported. This limited query set is
intended to help administrators with system management tasks, such as locating all
the objects in a retired partition. The mode will be one of Read-Write,
Update-Only, Read-Only or Retired.
OTIndexError
This field is used to contain a count of metadata indexing errors associated with an
object. Metadata indexing errors occur for situations such as:
• An improperly formatted metadata object. A string value within an integer or
date field would be examples of this.
• An improperly formed region name.
• Attempts to provide values for reserved and protected region names.
For each such instance, the OTIndexError count region is incremented. Applications
providing objects for indexing may provide an initial value. For example, DCS may
have found that a date or integer value it attempted to extract was incorrect, and
therefore could determine that there is already a metadata error before the Index
Engine is provided with the object.
The error counts are incremental. Updates to objects which contain metadata errors
can cause this value to become artificially inflated. For example, if an object is added
with a date error, and then 10 updates include the same date error, then the error
count may be 11.
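The inflation effect is easy to model. A sketch of the counter behavior described above (the increment-per-occurrence rule is from the text; the rest is illustrative):

```python
ot_index_error = 0  # OTIndexError count for one object

def index_with_bad_date():
    # Each add or update that carries the same malformed date value
    # increments the counter again.
    global ot_index_error
    ot_index_error += 1

index_with_bad_date()        # initial add with one date error
for _ in range(10):          # ten updates repeating the same error
    index_with_bad_date()
print(ot_index_error)        # 11
```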
Applications can query and retrieve this field to help assess the quality of the search
index.
OTScore
This synthetic region usually contains the computed relevance score for a search
result as an integer value. With the default configurations, a relevance score is
between 0 and 100. It is important to understand that the relevance score as
computed does NOT have any measurable correlation with the relevance of an object
as assessed by a user. These scores at best must be considered relative. For most
applications, displaying the OTScore (or computed relevance) is not normally
appropriate.
TimeStamp Regions
During indexing operations, the Index Engine can mark objects with the time that
objects are created or updated. This behavior is enabled by including the appropriate
definitions in the LLFieldDefinitions.txt file as described below. When enabled, by
default these timestamps are added on all objects. If trying to minimize the index
size, you might want to add timestamps to only a subset of objects. For example,
with Content Server, you might want to add timestamps to only the Content Server
“Index Tracer” objects. For stamping only limited object types, ensure the
TimeStamp fields are defined in LLFieldDefinitions.txt, and add the list of object types
to the [DataFlow_] section of the search.ini file. Only objects that contain an
OTSubType value in the list will have the time stamp values added:
IndexTimestampOnlyCSL=147
OTObjectIndexTime
When an object is created, this field will be populated with the current time, as
determined by the system clock. This field has the type TIMESTAMP, and must
be declared in the LLFieldDefinitions.txt file to function.
OTContentUpdateTime
When the text content of an object is updated, this value records the current time
for the update. Only actual changes to the content will trigger a change. If an
object is re-indexed, but the text content is identical, then this value will not be
updated. This region has the type TIMESTAMP, and must be declared in the
LLFieldDefinitions.txt file to function.
The definition of “identical” is based upon the text as interpreted by the index
engine. Changes in the tokenizer or file format filters may result in the text being
declared “different”, even if the master object content is unchanged.
OTMetadataUpdateTime
This field records the time at which the metadata for an object was last modified.
If an object is re-indexed and no metadata changes, then this value is not
updated. This region has the type TIMESTAMP, and must be declared in the
LLFieldDefinitions.txt file to function.
OTObjectUpdateTime
This field is updated any time the metadata OR the content is changed. You
should normally not remove this field, since it is required for correct operation of
Search Agents.
_OTDomain
The searchable email domain feature generates synthetic regions by appending this
suffix to the email region name. For instance, if your region that contains email is
OTEmailSender, then the region OTEmailSender_OTDomain will be created to
support the email domain search capability.
_OTShadow
Regions ending with the string _OTShadow are created when the LIKE operator is
configured. If the Content Server region OTName is configured for use with LIKE,
then the region OTName_OTShadow contains the extended indexing information
required by the LIKE feature.
In the index, regions that DCS extracts from document properties are typically
prefixed with OTDocXXXX or
OTXMP_XXXX. Be careful if you choose to remove these, since it is possible that
region names from other sources might match this naming convention. For example,
the Content Server ‘User Rating’ metadata fields OTDocSynopsis and
OTDocUserRating also have this form.
Workflow
Indexing of Workflow metadata from Content Server has been problematic
historically, but is considerably better since Content Server 10.0 Update 10.
Firstly, the default Workflow configuration indexes all the internal Workflow metadata
to the search engine. In most applications, many of these regions have no value for
user search. The default region definitions file has DROP or REMOVE instructions in
place to prevent this data from being indexed. If you need to make these metadata
fields searchable, edit the definitions file appropriately.
The other aspect is Workflow Map attributes. These are presented as regions for
indexing in the form WFAttr_xxxx, where xxxx is text that represents the name of the
Workflow attribute. It is possible for a very large number of these WFAttr_ regions to
exist, especially in older versions of Content Server where the default setting was to
always index these regions. This increases the size of the index. If you do not need
to search on these fields, you might consider DROP or REMOVE in the definitions
file.
If searching the aggregate value of these fields is sufficient, you might also want to
consider using AGGREGATE-TEXT for queries against these regions, in conjunction
with DISK_RET for storing the values.
Categories and Attributes
If you want Category and Attribute values to be indexed as INTEGER rather than
TEXT, you must declare them in the region definitions file and restart the search
grid BEFORE these values are indexed.
Once indexed, they will be marked as type ‘TEXT’, and cannot be changed short of
removing the entire region and re-indexing the objects, or using the region type
conversion features.
This is an optimization consideration only. Leaving the Category and Attribute values
as TEXT within the index does not affect feature availability, although differences in
behavior between integer and text values may be a concern.
Forms
Within Content Server 10, the Forms module permits users to create arbitrary labels
for form fields. The region names are generated directly from these labels.
Unfortunately, this can result in conflicts with other search regions in the index. It is
recommended that you enforce a business practice of prefixing all form names with a
unique value, such as OTForm_. This will provide two major benefits: it will minimize
the chance of name conflicts, and it allows use of AGGREGATE-TEXT regions to
improve search usability.
Content Server 10.5 or later will generate region names that follow a well defined
syntax, along the lines of OTForm_1234_5678. This change makes it much easier to
identify regions associated with forms, and simplifies selecting them for REMOVE or
aggregation purposes.
Custom Applications
It is common for OpenText customers to create their own solutions using Content
Server as a platform. Often, the considerations for metadata indexing and search are
overlooked. If you have custom applications that index metadata fields, you should
consider the impact on search index size and performance.
• Only index object subtypes that are of interest to users
• Only extract metadata fields that are useful for search
• Ensure that the region definition file has optimal configuration for each region
• Provide a unique prefix so that the custom metadata will not conflict with
other region names
• If appropriate, add the custom regions to the default Content Server search
regions.
Indexing
Updating the search index is performed by preparing files containing indexing
commands in a defined location. The input files and structures are in the OpenText
“IPool” format. The Update Distributor watches for these files, and initiates indexing
when IPools arrive.
A single IPool may contain many indexing commands and objects. Updates to the
index from an IPool are only “committed” once all of the messages within the IPool
are successfully handled. If either the Update Distributor or one of the Index Engines
is unable to process a message, then the indexing process halts and all the
changes from the IPool are rolled back when the Index Engines are restarted. This
behavior applies to serious IPool errors, such as malformed IPool messages. An
object that is too large, for example, is not an IPool error.
If multiple partitions exist for an index, the Update Distributor chooses which partition
will index an object. Some operations, such as Modify By Query, are broadcast to all
the Index Engines. Most operations are specific to a single partition, and the first step
in deciding which partition to use is to ask if any of the existing Index Engines already
have an entry with the same object identifier (the “Key” value). If one of the Index
Engines responds affirmatively, then the object is given to that Index Engine to add,
modify or remove.
If no partition already has the object, the Update Distributor will make a selection based
upon the Read-Write or Update-Only mode of the partitions, and whether they are
full.
Partitions which are in “Update-Only” or “Retired” mode are never given new objects
to index. Partitions which are in “Read-Only” mode do not have Index Engines
running, and are not given any indexing tasks.
The following example illustrates the structure of an IPool indexing message:
<Object>
<Entry>
<Key>OTURN</Key>
<Value>
<Size>16</Size>
<Raw>8273908620;ver=1</Raw>
</Value>
</Entry>
<Entry>
<Key>Operation</Key>
<Value>
<Size>12</Size>
<Raw>AddOrReplace</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>187</Size>
<Raw>
<FileName>/MyContentInstances/testhtml.html</FileName>
<ObjectTitle>Things that go bump</ObjectTitle>
<OTName>Cars</OTName>
<OTName lang="fr">Voitures</OTName>
<OTCurrentVersion>true</OTCurrentVersion>
</Raw>
</Value>
</Entry>
<Entry>
<Key>ContentReferenceTemp</Key>
<Value>
<Size>20</Size>
<Raw>C:/dev/testhtml.html</Raw>
</Value>
</Entry>
<Entry>
<Key>Content</Key>
<Value>
<Size>28</Size>
<Raw>full text to be indexed here</Raw>
</Value>
</Entry>
</Object>
The <Size> value reports the number of characters contained within a <Raw>
section. The <Raw> section contains the actual values. The <Raw> section can
contain arbitrary data expressed in UTF-8 encoding, and does not require character
escaping because the <Size> is known, although for metadata regions this data is
expected to be structured much like XML. The <Key> value specifies the top level
purpose for each entry, sometimes processed by DCS, sometimes by the Index
Engines. This object contains five entries: the OTURN, the Operation, the Metadata,
and the content referenced in two different ways (ContentReferenceTemp and
Content).
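An entry of this shape can be rendered programmatically. A sketch that computes <Size> as the character count of the <Raw> payload, consistent with the example entries above (the helper is ours, not the OpenText IPool library):

```python
def ipool_entry(key: str, raw: str) -> str:
    """Render one IPool <Entry>. <Size> is the character count of the
    <Raw> payload; because the size is known, the payload needs no
    character escaping. (A sketch of the structure, not the IPool API.)"""
    return (
        "<Entry>\n"
        f"  <Key>{key}</Key>\n"
        "  <Value>\n"
        f"    <Size>{len(raw)}</Size>\n"
        f"    <Raw>{raw}</Raw>\n"
        "  </Value>\n"
        "</Entry>"
    )

# "AddOrReplace" is 12 characters, matching the example above:
print(ipool_entry("Operation", "AddOrReplace"))
```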
Every object to be indexed requires a unique identifier. For typical Content Server
applications, the unique identifier is provided in the region “OTURN”, as shown in this
example. The value for the OTURN is “8273908620;ver=1” – different Content
Server modules may provide OTURN values in different forms. Operations such as
ModifyByQuery would use a query “where clause” as the OTURN.
The Operation entry instructs the Index Engines how the object should be interpreted
as explained in the sections below.
The Metadata entry is used to provide the regions names and values that are
provided for indexing. In the example above, metadata for the regions FileName,
ObjectTitle, OTName and OTCurrentVersion are provided. You can specify multiple
values for one region. The OTName region, for example, has two values, and one of
them also uses the attribute key/value feature of OTSE to specify that “Voitures” is
the French language value.
The entry for ContentReferenceTemp is used to identify that the content data is
located at the specified file location. The IPool libraries would normally delete the file
after processing, since by convention ContentReferenceTemp is used when a
temporary copy of a file was made. A permanent copy can also be specified using
ContentReference as the key, which does not delete the original. IPools given to the
Index Engines normally should NOT have either ContentReferenceTemp or
ContentReference entries, since extraction and preprocessing of files should already
have occurred to extract the raw text data. These modes are common for earlier
steps in the DCS process.
The entry for Content in the example indicates that the data in question is contained
within the IPool, in the <Raw> section. This is the normal expected use case for
IPools being consumed by the Update Distributor. Unlike this artificial example,
having both Content and ContentReferenceTemp values is atypical.
AddOrReplace
This is the primary indexing operation used to create new objects in the index. If the
object does not exist, it will be created. If an entry with the same OTURN exists in
either a Read-Write or Update-Only partition, then it will be completely replaced with
the new data, equivalent to a delete and add.
The AddOrReplace function distinguishes between content and metadata. If an
object already exists, and metadata only is provided, the existing full text content is
retained. However, the line between content and metadata is somewhat blurred.
The DCS processes will typically extract metadata from content and insert this
metadata into regions for indexing. There is a list of metadata regions which are
therefore considered to be “content”, and not replaced or deleted if content is not
provided in a replace operation.
The list of metadata considered to be content for this purpose is defined in the
[DataFlow_] section of the search.ini file by:
ExtraDCSRegionNames=OTSummary,OTHP,OTFilterMIMEType,
OTContentLanguage,OTConversionError,OTFileName,OTFileType
ExtraDCSStartsWithNames=OTDoc,OTCA,OTXMP_,OTCount_,OTMeta_
DCSStartsWithNameExemptions=OTDocumentUserComment,
OTDocumentUserExplanation
ExtrasWillOverride=false
Setting ExtrasWillOverride to true disables this feature, causing these regions to be
deleted if content is not indexed in an AddOrReplace operation. The
DCSStartsWith entry is used to capture the dynamic regions that DCS extracts from
document properties.
The Exemptions list identifies regions that should not be treated as part of the full text
content, despite matching the DCS “starts with” pattern.
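Taken together, these settings define a simple predicate for whether a region is treated as part of the full text content. A sketch using the default lists (the function name is ours, we assume the ExtraDCSStartsWithNames list is OTDoc, OTCA, OTXMP_, OTCount_, OTMeta_, and the real matching logic in OTSE may differ in detail):

```python
# Default lists from the [DataFlow_] section (with ExtrasWillOverride=false).
extra_names = {"OTSummary", "OTHP", "OTFilterMIMEType", "OTContentLanguage",
               "OTConversionError", "OTFileName", "OTFileType"}
starts_with = ("OTDoc", "OTCA", "OTXMP_", "OTCount_", "OTMeta_")
exemptions = {"OTDocumentUserComment", "OTDocumentUserExplanation"}

def treated_as_content(region: str) -> bool:
    """True if the region is preserved like full text content when an
    AddOrReplace arrives without content (a sketch of the rule above)."""
    if region in exemptions:
        return False
    return region in extra_names or region.startswith(starts_with)

print(treated_as_content("OTDocAuthor"))            # True: matches the OTDoc prefix
print(treated_as_content("OTDocumentUserComment"))  # False: exempted
print(treated_as_content("MyCustomRegion"))         # False
```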
The AddOrReplace function can also trigger “rebalancing” operations. If the target
partition is Retired or has exceeded its rebalancing threshold, the Update Distributor
will instead delete the object from the partition where it currently resides, and redirect
the AddOrReplace operation to a partition with available space.
AddOrModify
The intended use of AddOrModify is to update selected metadata regions for an item
thought to already exist in the index. The AddOrModify function will update an
existing object, or create a new object if it does not already exist. When modifying an
existing object, only the provided content and metadata is updated. Any metadata
regions that already exist which are not specified in the AddOrModify command will
be left intact.
There is no mechanism to delete a region which has already been defined for an
object, but you can delete the values by providing an empty string as the value for
the region ("").
One potential downside of the AddOrModify operation is that if you selectively modify
metadata regions and the target object is not already correctly indexed, you will
create a new object that only has the metadata regions or content which was defined
in the modify operation. This will effectively create an object which only has partial
data indexed. If you provide all metadata region values in a modify operation, this
situation will not arise. New applications may want to consider using the
“ModifyByQuery” or “Modify” indexing operators instead of AddOrModify, since these
operators do not create an object if it does not already exist.
Modify
The Modify operation is used to update specific metadata in an object. Unlike the
AddOrModify operation, Modify will never create a new object. If the OTURN
specified in a Modify operation does not exist, the transaction is simply discarded.
Modify can add new metadata, or replace existing metadata. Metadata for regions
not included in the IPool message are unaffected.
Delete
The Delete function will remove an object from the index, including both the metadata
and the content.
Note that if an object exists in multiple partitions, it will only be removed from the
partition to which the Update Distributor sent the Delete operation. This is a very rare
case, and would likely only arise if partitions were marked as Read-Only, then
updates to objects in the Read-Only partition were performed.
DeleteByQuery
The DeleteByQuery operator deletes objects which meet the provided search criteria.
A standard “WHERE” clause is provided in OTURN. This operator can be used to
delete many objects at once. Since the Update Distributor broadcasts the function to
all active partitions, duplicate objects can also be removed.
DeleteByQuery is of particular usefulness for applications that no longer track the
unique identifier for an object.
Applications which need to perform bulk deletes on a project will also find this far
more efficient. Instead of issuing 25,432 delete requests for every object in a project,
a single DeleteByQuery operation with an OTURN of
[region "ProjectName"] "old project"
would delete all objects marked as belonging to the project in a single transaction.
ModifyByQuery
This operation is used to selectively modify the content or specific metadata regions
for objects in the index. The affected objects are specified by search parameters – a
valid “WHERE” clause within the OTURN entry of the IPool. If no objects match the
query, then no updates are performed. Every object in the index which matches the
query will have the provided regions updated. Other regions for objects are not
affected; for example, you could change the value in the region “CurrentVersion” to
“false” without modifying values in other regions.
The Update Distributor will send ModifyByQuery operations to every active partition.
To modify a specific known object, you can place an object ID in the OTURN field:
[region "OTURN"] "ObjectID=1833746;ver=3"
You can also quickly perform bulk operations, such as marking all the objects
associated with a specific project as “released”. The IPool would contain region
values such as:
<ProjectStatus>released</ProjectStatus>
And the Key field in the IPool would contain a ‘WHERE’ clause such as:
[region "ProjectName"] "Great Scott"
All objects with the value of “Great Scott” in a region labeled “ProjectName” will then
have their ProjectStatus region populated with the value “released”.
A value for a region cannot be completely removed, but it can be replaced with an
empty string by providing a region definition in the IPool that has an empty string:
<ProjectStatus></ProjectStatus>
The full text content of an object cannot be updated using ModifyByQuery.
Transactional Indexing
The indexing process with OTSE is transactional in nature. This essentially means
that the indexing request is not deleted until the index updates have been committed
to disk.
Transactional indexing ensures that no indexing requests are lost in the event of a
power loss or similar problem while indexing is taking place.
OTSE treats all of the indexing requests within an input IPool as a single transaction.
The input IPool is not considered complete until every request in the IPool is serviced
and committed to disk. Only then is the IPool deleted.
There are performance considerations related to transactional indexing. The more
objects there are within an IPool indexing transaction, the more efficient the indexing
process is. This is because a new index fragment is created each time a transaction
completes. Many objects in a transaction therefore generate fewer new index
fragments, and use the disk bandwidth more efficiently.
The converse of this is the time to index. By collecting index updates and packaging
them into transactions, for low-load systems, the average time for an object to be
indexed is somewhat slower. The majority of applications do not have a requirement
to minimize the lag time between an object update and the moment the changes are
reflected in the index, so packaging large numbers of objects in each indexing IPool
is generally the best approach.
OTSE does not collect objects to create transactions. The number of objects in a
transaction is set by the upstream applications which are generating the indexing
updates. By default, Content Server 16 will attempt to package up to 1000 objects
within a single indexing transaction.
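The fragment tradeoff can be quantified. Assuming one index fragment per committed transaction, as described above:

```python
import math

def fragments_created(total_objects: int, batch_size: int) -> int:
    # One index fragment is created each time a transaction commits.
    return math.ceil(total_objects / batch_size)

total = 1_000_000
print(fragments_created(total, 1))     # 1000000: one fragment per object
print(fragments_created(total, 1000))  # 1000: at Content Server 16's default batching
```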
IPool Quarantine
In the event that an object in an IPool cannot be indexed because of severe errors,
the affected indexing component will halt. Upon restart, all of the indexing operations
for the IPool will be rolled back. Depending on the error code and configuration
settings, the Admin Server might automatically restart the component. If an IPool
fails in this way 3 times, it is moved into quarantine and the next IPool is processed.
The quarantine location is a sub-directory named \failure in the IPool input directory.
If there are too many quarantined items, the IPool libraries can be configured to
either halt or discard the oldest IPool. Quarantine behavior is configured in
Content Server, not in OTSE.
Query Interface
Queries to OTSE are submitted to the Search Federator over a socket connection
using a language known as OpenText Search Query Language (OTSQL).
Applications communicating directly with the Search Federator will need to
understand and implement this wire-level protocol exposed by the Search Federator.
Content Server implements this protocol, as does the Admin Server component of
Content Server and the search client built into OTSE.
Connection to the Search Federator requires knowledge of the computer IP address
and the port number on which the Search Federator is listening, which is configurable
within the search.ini file. The search client needs to establish a basic text socket
to engage in a query conversation; this is a generic network function available in
most programming languages. The OTSQL commands and responses described here are
conveyed across the socket connection.
A conversation with the Search Federator consists of opening a socket connection,
issuing commands, receiving responses, and closing the socket connection.
Managing the number of open connections can be important in optimizing the overall
resource use in OTSE. There are two settings: the number of queries that can be
simultaneously active (being serviced by the Search Engines); and the queue size
(maximum number of queries waiting for service). By default, the queue size is 25
and the active query limit is 10. When the queue is full, the Search Federator simply
does not accept any additional socket connections.
A typical query conversation between an application and the Search Federator
consists of a select command to initiate the query, one or more get results (and
optionally get facets) commands to retrieve data, and closing the connection.
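Such a conversation can be sketched as a plain text socket exchange. This is only a sketch: the host, port, region names, and search term are illustrative assumptions, and the response handling is simplified (a real client must parse the length-prefixed DATA format described later in this section).

```python
import socket

def build_commands(regions, where_term):
    # Compose the OTSQL commands for one query conversation, following the
    # select syntax shown later in this document.
    select = "select " + ",".join(f'"{r}"' for r in regions) + f' where "{where_term}"'
    return [select, "get results"]

def run_query(host, port, regions, where_term):
    # Open a basic text socket to the Search Federator, send each command,
    # and return the raw response text for the caller to parse.
    with socket.create_connection((host, port)) as sock:
        for command in build_commands(regions, where_term):
            sock.sendall((command + "\n").encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```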
Responses from the Search Federator are expressed in a clear text data stream
which explicitly includes data size information to allow parsing values without needing
to escape special characters.
The available commands are described below. The commands themselves are not
case sensitive, although parameters to the commands such as region names may be
case sensitive.
Select Command
The select command is used to initiate a query. This command is essentially the
OpenText “OTSTARTS” query language, which is described in more detail in the
OTSQL section of this document. The basic form, shown here with a sample response, is:
select "OTObjectID" where "dogs"
<OTResult>
Cursor 0
DocSetSize 1012
</OTResult>
<OTResult>
cursor 100
</OTResult>
The cursor is automatically advanced after a get results command, so using
set cursor between get results commands is optional when retrieving
consecutive sets of results. Note that moving the cursor forward is relatively
efficient, while moving it backwards internally requires a reset to the start of
the results followed by a move forward to the desired location. If you are
performing multiple get results operations, structuring them to move strictly
forward through the results is much faster. This observation only applies within
a search transaction (between open and close operations), and has no impact on
distinct queries.
There is an alternative method for managing the cursor location. The general form of
a query is:
Select … where … orderedby … starting at N for M
Where N is the number of the first desired result, and M is the number of results to
return in the Get Results command. The first result has a number of 0.
Select "OTObjectID" where "dogs" starting at 1000 for 250
Would return results number 1000 through 1249 when Get Results is called. This
method is not generally used or recommended, and is noted here for completeness.
Using Set Cursor with Get Results is the recommended usage pattern.
<OTResult>
ROWS 4
ROW 0
COLUMN 0 "OTObject"
DATA 25
DataId=41280133&Version=1DATA END
COLUMN 1 "OTName"
DATA 29
Approval Handilist Poothe.pdfDATA END
ROW 1
COLUMN 0
DATA 25
DataId=41280094&Version=1DATA END
COLUMN 1
DATA 18
P&L Jun to Nov.xlsDATA END
ROW 2
COLUMN 0
DATA 25
DataId=41280131&Version=1DATA END
COLUMN 1
DATA 0
DATA END
ROW 3
COLUMN 0
DATA 25
DataId=41280093&Version=1DATA END
COLUMN 1
DATA 10
Mar TB.XLSDATA END
</OTResult>
In this example, there are 4 results, indicated by the “ROW” values. ROW values are
numbered starting at 0.
Each result contains 2 returned regions, identified by the COLUMN values. In the first
ROW, the COLUMN labels are provided. To save bandwidth, the COLUMN values are
not labeled in subsequent ROWS.
The COLUMN values are numbered starting at 0, in the same order in which the
regions were requested in the SELECT statement for the query. Note that the
DataId= portion of the COLUMN 0 results is typical of how Content Server provides
the data for indexing; it is not an artifact of the search technology.
If a value is not defined for a region, the region is still returned in the results with an
empty value. ROW 2 COLUMN 1 illustrates this case.
If ATTRIBUTES were requested in the select statement, then the requested attribute
information will be appended to the get results data. In the example below, the data
element for the region “TestSplit” has 3 values. The first value has one attribute,
the language (English); the second has two attributes; and the third value has no
attributes, indicated by the empty placeholder.
COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr"
translated="true"</><></>ATTRIBUTES END
If HIT LOCATIONS were requested in the select statement, the locations are added
to the results:
COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr" translated="true"</>
<></>ATTRIBUTES END
LOCATIONS 17
0 4 6 1; 2 10 7 3 LOCATIONS END
Each group of numbers gives the cell index followed by a position, length, term
triplet, all counted from 0. Here the first cell (0) has a hit at location 4,
length 6, matching term 1. The third cell (2) has a hit starting at character 10
with length 7, matching query term 3.
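Splitting the LOCATIONS payload is straightforward. A minimal sketch (the function name is illustrative; the input is the text between LOCATIONS and its END marker):

```python
def parse_locations(payload: str):
    # "0 4 6 1; 2 10 7 3" -> [(cell, position, length, term), ...]
    groups = [g.split() for g in payload.split(";") if g.strip()]
    return [tuple(int(n) for n in g) for g in groups]
```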
If you are retrieving large numbers of search results, it can be more efficient to break
the operation into multiple get results operations. Typically, these “gulp” sizes are
optimal in the 500 to 2000 results range. The performance benefit of using an
optimal size is typically only about 10 percent, so this is not a critical adjustment.
The next line contains the text FACETS with the facetLength value. This is the total
length of the string in bytes on the next line including the FACETS END statement.
The next line contains the actual facet data. The first integer, nFacets, is the number
of key/value pairs that are included in the facet results for this column. The key/value
pairs are represented by data triplets of keyLength, key and count. The key is the
text of the value. The count is an integer. The keyLength is the number of bytes in
the key – using a length simplifies parsing.
Note that there is a special case for nFacets, where it may be appended with a plus
(+) character. This indicates that building of the facet data structures terminated
because of size restrictions. This means that there are facet values in the index for
this region that have not been considered in computing these facet results.
The facet data is terminated with the FACETS END text.
A simple example of output from a get facets command is included below. Note the
special case where a facet has no values, as illustrated in the COLUMN 1 values.
get facets
<OTResult>
ROWS 1
ROW 0
COLUMN 0 "OTModifyDate","Date"
FACETS 45
3,9,d20120605,14;9,d20120528,4;9,d20120514,1;
FACETS END
COLUMN 1 "OTUserName","UserLogin"
FACETS 3
1,;FACETS END
</OTResult>
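The facet payload can be parsed mechanically from this description. A minimal sketch (the function name is illustrative; the input is the text between FACETS and FACETS END):

```python
def parse_facets(payload: str):
    # payload example: "3,9,d20120605,14;9,d20120528,4;9,d20120514,1;"
    # Format: nFacets, then keyLength,key,count triplets each ending with ';'.
    head, _, rest = payload.partition(",")
    truncated = head.endswith("+")   # '+' means facet building was cut short
    n = int(head.rstrip("+"))
    pairs, pos = [], 0
    for _ in range(n):
        if pos < len(rest) and rest[pos] == ";":
            pos += 1                 # special case: a facet entry with no value
            continue
        comma = rest.index(",", pos)
        key_len = int(rest[pos:comma])
        key = rest[comma + 1:comma + 1 + key_len]
        pos = comma + 1 + key_len    # pos is now at the ',' before the count
        semi = rest.index(";", pos)
        count = int(rest[pos + 1:semi])
        pairs.append((key, count))
        pos = semi + 1
    return n, truncated, pairs
```

The keyLength-driven slicing is why no escaping of the key text is needed.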
Date Facets
Facets for regions that are defined as type DATE in the LLFieldDefinitions.txt file
have a special presentation in the facet results.
Each date value is placed into buckets representing days, weeks, quarters, months
and years. Instead of the most frequent values being returned in facets, the most
recent values are returned instead. For most search-based applications, the
“recentness” of an object is a key consideration, and the implementation of date
facets reflects this requirement.
A single date value may be represented in multiple buckets. For example, if today is
July 1st 2012, an object with an OTCreateDate of June 30 2012 may be represented
in the facet values for yesterday, for this week, for last month, last quarter and this
year. Each date bucket type has a distinct naming convention to help parsers
discriminate between the buckets.
• Years have the form y2012. Years are aligned to the calendar. The current year
will include dates from the start of the year to today.
• Quarters have the form q201204, which represent the year and the month in
which the quarter starts. Quarters start in January, April, July and October. The
current quarter will include dates from the start of the quarter to today.
• Months have the form m201206, which represent the year and the month. Month
facets are aligned to the calendar month. The current month will include dates
from the start of the month to today.
• Weeks have the form w20120624, which represents the year, month and first
day of the week. Weeks are always aligned to start on Sundays. The current
week will include dates from the start of the week to today.
• Days have the form d20120630, which represents the year, month and day.
If the contents of a date bucket are empty (count of zero), then no result is returned
for that bucket.
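The bucket naming conventions above can be reproduced directly from a date. A minimal sketch (the function name is an illustrative assumption):

```python
from datetime import date, timedelta

def date_buckets(d: date) -> list[str]:
    # Compute the year, quarter, month, week, and day bucket labels for a
    # date, following the naming conventions listed above.
    quarter_start_month = ((d.month - 1) // 3) * 3 + 1
    # Weeks are aligned to start on Sundays; weekday() is 0 for Monday.
    week_start = d - timedelta(days=(d.weekday() + 1) % 7)
    return [
        f"y{d.year}",
        f"q{d.year}{quarter_start_month:02d}",
        f"m{d.year}{d.month:02d}",
        f"w{week_start.year}{week_start.month:02d}{week_start.day:02d}",
        f"d{d.year}{d.month:02d}{d.day:02d}",
    ]
```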
Refer to the FACETS portion of the SELECT statement for information on requesting
the number of facet values for each of years, quarters, months, weeks and days.
FileSize Facets
The search.ini file can be used to identify integer or long regions that should be
treated as FileSize facets. Size facets are optimized for values that represent file
sizes. Discrete file size facets are not useful, because few objects share exactly
the same size. File sizes range from zero to gigabytes, but users think of them in
geometric steps. The FileSize facet places integers into ranges that follow this
geometric pattern. The entire set of ranges is returned, rather than only the most
frequent counts.
Applications presenting facets may choose to combine these ranges into larger
ranges.
The buckets for FileSize facets and the corresponding labels for those buckets are
captured in the table below:
Label      Integer Range
1k         1,000 to 1,999
2k         2,000 to 4,999
5k         5,000 to 9,999
10k        10,000 to 19,999
20k        20,000 to 49,999
50k        50,000 to 99,999
100k       100,000 to 199,999
200k       200,000 to 499,999
500k       500,000 to 999,999
1m         1,000,000 to 1,999,999
2m         2,000,000 to 4,999,999
5m         5,000,000 to 9,999,999
10m        10,000,000 to 19,999,999
20m        20,000,000 to 49,999,999
50m        50,000,000 to 99,999,999
100m       100,000,000 to 199,999,999
200m       200,000,000 to 499,999,999
500m       500,000,000 to 999,999,999
1g         1,000,000,000 to 1,999,999,999
2g         2,000,000,000 to 4,999,999,999
5g         5,000,000,000 to 9,999,999,999
10g        10,000,000,000 to 19,999,999,999
20g        20,000,000,000 to 49,999,999,999
50g        50,000,000,000 to 99,999,999,999
100g       100,000,000,000 to 199,999,999,999
big        >= 200,000,000,000
negative   < 0
undefined  No value for field
The list of integer regions to be presented as FileSize facets is within the search.ini
file in the [Dataflow_] section. The default regions shown here are tailored for typical
Content Server installations:
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
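The bucket boundaries follow a regular 1-2-5 pattern, so the label for a given size can be computed rather than tabulated. A minimal sketch (the function name and the sub-1k handling are assumptions, since the table above begins at the 1k bucket):

```python
def filesize_label(size):
    # Map a byte count to its FileSize facet label, per the table above.
    if size is None:
        return "undefined"
    if size < 0:
        return "negative"
    if size >= 200_000_000_000:
        return "big"
    # Bucket lower bounds follow the 1-2-5 geometric pattern per decade.
    for unit, suffix in ((1_000_000_000, "g"), (1_000_000, "m"), (1_000, "k")):
        for mult in (500, 200, 100, 50, 20, 10, 5, 2, 1):
            if size >= mult * unit:
                return f"{mult}{suffix}"
    return "<1k"  # sub-1k label is not specified in the table above
```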
Expand Command
This command is used to determine the list of words that are used in a search query
for a given term expansion operation. Term expansions occur when features such as
stemming, regular expressions or a thesaurus are used in a term. The simple case
of stemming to match boat and boats is illustrated below.
> HH
> DATA 61
> The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the
plain
> TERMS 2
> the
> spain falls
<OTResult>
HITS 3
0,3,0
52,3,0
24,17,1
</OTResult>
After the TERMS element, each keyword to be matched is entered on a separate line.
If there are multiple words in the line, it is considered to be a phrase to be matched.
This example requests hit highlighting for the terms “the” and “spain falls”.
The results consist of numeric triplets, where each triplet is of the form
POSITION,LENGTH,TERM. Both the position and the term numbering start at 0.
The hit highlighting code strips common HTML formatting characters out of the data.
In this example, the </Tag> is ignored when matching the phrase “spain falls”,
although these formatting tags are counted in the character positions.
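The POSITION,LENGTH,TERM triplets can be applied to the original text to produce marked-up output. A minimal sketch (the bracket markers are illustrative; applying hits from the end keeps earlier offsets valid):

```python
def apply_highlights(text, hits):
    # hits are (position, length, term) triplets, with positions counted in
    # the original text including any formatting tags.
    for pos, length, _term in sorted(hits, reverse=True):
        text = text[:pos] + "[" + text[pos:pos + length] + "]" + text[pos + length:]
    return text
```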
You may need to use the EXPAND command to obtain a list of terms that should be
tested in hit highlighting.
Get Time
While a query is executing, detailed timing information for each element of the query
is tracked. The Get Time command will return this data, including total time, wait
time, execution time, and execution time broken down by each command execution
within the connection. To obtain accurate information about the entire search query,
this should be the last command executed before closing the connection.
<OTResult>
<TIME>
<ELAPSED>68638</ELAPSED>
<SELECT>21329</SELECT>
<GET RESULTS>610</GET RESULTS>
<GET FACETS>187</GET FACETS>
<HH>0</HH>
<GET STATS>31</GET STATS>
<EXECUTION>22157</EXECUTION>
<WAIT>46481</WAIT>
</TIME>
</OTResult>
Set Command
The set command is used to specify values for variables that apply to the subsequent
operations. The supported set operations include:
<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 0
DATA END
ROW 2
COLUMN 0
DATA 16
OTWFSubWorkMapIDDATA END
COLUMN 1
DATA 0
DATA END
…
</OTResult>
The Get Regions command can take an optional parameter, “types”.
get regions types
When the types parameter is present, this function will include the type definition for
the region in the response. This type definition can be used to provide optimized
interfaces for users (for example, integer comparisons instead of text modifiers). If
multiple partitions report different types, then the Search Federator will respond with
the value “inconsistent” as the type. Note that differences in region types for partitions
in Retired mode are allowed; the assessment of inconsistency is based only on
partitions that are not Retired. The possible types are: Integer, Long, Enum, Date,
Text, Boolean, Timestamp.
<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "RegionType"
DATA 4
DateDATA END
COLUMN 2 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 4
EnumDATA END
COLUMN 2
DATA 0
DATA END
…
</OTResult>
When the facets parameter is present, then the type definition of generated facets is
included in the response. Normally, the facet types are the same as the region types,
but the special handling of integers that represent file sizes is an exception, returning
the value ‘FileSize’.
get regions types facets
<OTResult>
…
ROW 98
COLUMN 0
DATA 12
OTObjectSizeDATA END
COLUMN 1
DATA 4
LongDATA END
COLUMN 2
DATA 8
FileSizeDATA END
COLUMN 3
DATA 0
DATA END
…
</OTResult>
• SELECT parameters
• FACETS parameters
• WHERE clauses
• ORDEREDBY parameters
Content Server users do not directly use OTSQL. The Content Server search query
language is known as LQL (historically, the Livelink Query Language). LQL is similar
to OTSQL in most respects, but provides some convenience operators and generally
uses different keywords. LQL in Content Server represents only the subset of
OTSQL that defines the WHERE clauses. Some of the differences between LQL and
OTSQL include:
LQL OTSQL
termset termset
stemset stemset
qlprox prox
qlregion region
qlleft-truncation left-truncation
qlright-truncation right-truncation
qlthesaurus thesaurus
qlstem stem
qlphonetic phonetic
qlregex regex
qlrange range
qllike like
in in
any any
text text
” « » ‟ ″ “ „ "
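The operator mapping in the table above can be captured as a simple lookup. A minimal sketch (names are illustrative; operators shared by both languages pass through unchanged):

```python
# LQL-to-OTSQL operator mapping from the table above. Quote normalization
# (the various typographic quote characters collapsing to a straight quote)
# is handled separately by the tokenizer.
LQL_TO_OTSQL = {
    "qlprox": "prox",
    "qlregion": "region",
    "qlleft-truncation": "left-truncation",
    "qlright-truncation": "right-truncation",
    "qlthesaurus": "thesaurus",
    "qlstem": "stem",
    "qlphonetic": "phonetic",
    "qlregex": "regex",
    "qlrange": "range",
    "qllike": "like",
}

def lql_operator_to_otsql(op: str) -> str:
    # termset, stemset, in, any, text, etc. are the same in both languages.
    return LQL_TO_OTSQL.get(op, op)
```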
SELECT Syntax
The SELECT section is used to specify which regions in the index should be included
in the returned results. The more regions that are requested, the longer the ‘get
results’ operations will take, but this does not impact the query time.
SELECT "region1","region2","region3"
To return all of the regions use the * keyword. For a Content Server installation, this
is not recommended, since there may be hundreds of regions. Requesting the
minimum necessary regions is suggested for optimal performance.
If you want to return information about the key/value attributes within text regions,
you can use the ATTRIBUTES modifier in the SELECT statement.
FACETS Statement
The FACETS section specifies whether facets are desired, and if so, for which
regions. This is optional, with the default being no facets returned. Refer to the next
major section of this document entitled “Facets” for a complete description of the
FACETS statement.
Sample facet requests:
FACETS "regionX"[10],"regionY"
FACETS "OTCreateDate"[d100,m24]
The ‘get facets’ command is used to retrieve the results. See the commands section
for additional details.
WHERE Clause
The WHERE clause defines the rules by which an object satisfies the search query.
The basic form is:
where "red"
where "red riding hood"
where [region "name"] "red riding hood"
where [region "FileSize"] >= "1000" and [region "FileSize"]
< "10000"
WHERE Relationships
Each WHERE clause in a query is evaluated relative to other WHERE clauses by a
logical relationship. The supported relationships are:
Relationships are evaluated from left to right. Brackets can be used to clarify and
modify the order of evaluation of clauses. For example, using single letters a through
d to represent entire clauses, the query a or b and c and d is evaluated as
(((a or b) and c) and d), while a or (b and (c and d)) forces the bracketed clauses
to be evaluated first.
WHERE Terms
The search terms in a WHERE clause should normally be enclosed in quotes.
Although there are some specific cases where the lack of quotes is tolerated, if you
are writing a query application, quotes are recommended in all cases.
The first form of a search term is the simple token. This is a value which is normally
expected to pass through the tokenizer and be recognized in its entirety as a single
token. All operators work on simple terms.
"hello"
"pottery123"
"3.1415926"
The second form is an exact phrase. Not all operators are compatible with phrases.
Phrases should normally only be used in string comparison operations.
"the quick brown fox"
"1334.8556/995-x"
You can also request that matches are only returned when the entire value is an
exact match for the phrase. For example, if there is a search region “ProjectName”,
and possible values are “Plan A” and “Plan A Extended”, searching for “Plan A” will
match both of these cases. Preceding the phrase with an equality operator ( = ) can
differentiate these, and match only the values that do not include the “Extended”
term:
[region "ProjectName"] = "Plan A"
Finally, there is a special case for search terms: the * character (asterisk or star)
or the keyword all, with no quotation marks. This value is interpreted by the search
engine to match any object which has a value for the specified region; objects that
have no value defined for the region will not match.
[region "name"] *
WHERE Operators
Each WHERE clause is comprised of a region specification, a comparison operation,
and a term. The region is optional, and if missing is assumed to be the default
search region list. The operation is optional, and if absent is assumed to match any
token within the region.
The following operators function with either simple tokens or phrases:
The next set of operators is available for use with integers, dates and text metadata
values. They are disabled by default for full text query, since comparison queries in
full text are generally misleading and perform very slowly, although this behavior can
be changed by setting AllowFullTextComparison=true in the search.ini file.
These operators also have special capabilities for Date regions described later.
<    Will match all values which exist and are less than the specified term. If a
     phrase is provided, only the first term in the phrase is used.
<=   Will match all values which exist and are less than or equal to the specified
     term. If a phrase is provided, only the first term in the phrase is used.
>    Will match all values which exist and are greater than the specified term. If a
     phrase is provided, only the first term in the phrase is used.
>=   Will match all values which exist and are greater than or equal to the specified
     term. If a phrase is provided, only the first term in the phrase is used.
A query that applies multiple comparison operators to a single region, such as
[region "x"] >= "20150621" and [region "x"] < "20160101", is not efficient. To
improve performance, the query syntax parser will attempt to identify usage patterns
where multiple comparisons are made to a single region, and convert them to the more
efficient form of
[region "x"] range "20150621~20160101"
The following operators are designed for use with single tokens, not phrases. Some
limited phrase support is available with some of the operators as noted in the
explanations.
range "start~to" Will match any value between the start term
and the end term, inclusive. Note that the
start term must be less than the end term.
range "value1|value2|value3" The range operator can be provided with a
list of terms or phrases. This is equivalent to
value1 OR value2 OR value3. This operator
matches any value in a region; it is not
restricted to matching entire values.
thesaurus Will match the exact term or synonyms for
the term using the currently defined
thesaurus.
phonetic Will match phonetic equivalents for the term.
If applied to a phrase, phonetic matching for
each word in the phrase will be performed.
Refer to the Phonetic matching section for
more information.
regex Will interpret the term as a regular
expression. Values which satisfy the regular
expression match the term. Regular
expressions apply only to a single token.
Regular expressions are more fully described
later.
stem Will match values that meet the stemming
rules. Refer to the Stemming section for
more information. If stemming is applied to a
phrase, then the last word in the phrase is
stemmed.
right-truncation Right truncation matches terms which begin
with the provided search term. The user
would typically consider this as term*. If
used with a phrase, then the last word in the
phrase is truncated.
left-truncation Left truncation matches terms which end with
the provided search term. The user would
typically consider this to be of the form *term.
This operator is valid only for single tokens.
like String matching optimized for part number
and file names. Only valid with “Likable”
regions.
any (term,"search phrase") Match any term or phrase in the list. Unlike
the IN operator, partial matches within a
metadata region are acceptable. Equivalent
to (term SOR "search phrase").
in (term, "search phrase") Match any term or phrase in the list. Within a
region, only matches complete values.
Equivalent to (=term SOR ="search phrase").
not in (term, "search phrase") Excludes any objects containing the term or
phrase. For regions, equivalent to (and-not
[region "xx"] in (term,"search phrase")).
termset (N, term, term, "search phrase")    Matches objects where the full text
contains N or more of the terms and phrases. N% may also be used.
stemset (N, term, term, "search phrase")    Matches objects where the full text
contains N or more of the stems (singular/plural) of the terms and phrases. N% may
also be used.
text (something to search)    For large blocks of text, finds objects with similar
common terms. Refer to the Advanced Concepts section for more details.
span (distance, query)
Match query within distance number of terms.
Will match “big truck” or “big red truck” but not “truck is big”. The second
parameter is a single letter indicating whether term order needs to match: use
‘t’ (true) or ‘f’ (false). In the example above, using ‘f’ would match “truck is big”.
The first parameter of the span operator is the maximum distance between terms that
will satisfy the query. These fragments would meet the distance of 4 requirement:
Mike smith
A smith named Michael
Michael Herbert James Smith
The span operator supports query fragments for any combination of AND, OR, and
nesting (brackets) for single search terms.
"space" and span(10, ((Yellow and sun) or (blue and moon))
and (earth or planet))
The span operator can be used with full text, but not with text metadata.
A span query is a relatively expensive operation and can be very expensive when
used with wildcards (left-truncation and right-truncation) or regular expressions. By
default, the engine is configured to disable support for these types of term
expansions within the span operator. If term expansion is enabled, the search
engines will store temporary working data on disk files during the evaluation of the
span. Temporary files are stored by each Search Engine in their corresponding
index\tmp directory, and files are named matchingWordsNNNNN and
spanValuesNNNNN, where NNNNN is a dynamically generated unique value. The
temporary files are deleted when the query completes, and also by the general
purpose cleanup thread which runs from time to time.
If abused, the span operator has the potential to require large amounts of disk space
and will take a long time to execute. There are a number of limits set by default in
the search.ini configuration file, which can be adjusted if more complex queries must
be run. When a limit is reached, the search will be terminated as unsuccessful. The
limits apply to a single partition (not the entire query for the entire index) and are
located in the [Dataflow_] section of the configuration file, with the defaults shown
below.
SpanScanning=false
By default, use of term expansion (regex and wildcards) is not permitted with the
span operator. Set true to enable.
SpanMaxNumOfWords=20000
The upper limit on the number of terms that will be considered when wildcards and
regular expressions are expanded.
SpanMaxNumOfOffsets=1000000
Each term in the span expression may exist multiple times in documents. This file
stores the locations of the terms being evaluated. This is the upper limit for the
number of instances of matching terms.
SpanMaxTmpDirSizeInMB=1000
Limits the temporary disk space the partition can use for storing temporary data
during span operation evaluation.
SpanDiskModeSizeOfOr=30
The cost of executing a span is directly related to the number of “OR” operations in
the span query. This setting is an upper limit on the number of “OR” Boolean
operators that can be assessed.
WHERE Regions
A region is specified within square brackets with a region keyword, and enclosed in
quotation marks. The search term is likewise enclosed in quotation marks. There
are specific unambiguous cases where quotation marks are not required, but for
consistency your application should always use quotation marks. Region names
are case sensitive!
If the region portion of a WHERE clause is absent then the default search list is used
to determine the regions.
The following are examples of WHERE clauses using regions:
[region "OTNAME"] "cars"
[region "OTNAME"] all
[region "OTDate"] > "20100602"
[region "abc”] <= "string1"
Regions are grouped by OTSE into content and metadata regions, which are
internally represented by OTData and OTMeta. The representation of the “OTNAME”
in the example above is actually an abbreviated form of:
[region "OTMeta":"OTNAME"]
You can use OTMeta without a region name to examine all of the metadata regions.
However, this is relatively slow (depending on the number of regions) and in many
cases is not logical because of the different type definitions for regions.
You can also use OTMeta with some surrounding syntax to search within metadata
regions. For example, the clause:
[region "OTMeta"] "<someRegion>123 ABC</someRegion>"
Will find the exact value ‘123 ABC’ within the region “someRegion”. This is a much
slower way to locate the value, but there may be special cases where matching a
phrase anchored to the start or end of a region is needed.
You can specify searching in the full text using the OTData region:
[region "OTData"] "looking for this"
If you have indexed XML content, you can also search within specific XML regions of
the full text content using the XML structure, refer the section on indexing XML data
for more information.
The WHERE clause can also be used to set restrictions on attribute/value tags for
text metadata. For example, to restrict a search to looking at French language
values of the OTName field, you might use the syntax:
[region "OTName"][attribute "lang"="fr"] "voiture"
This presumes that “lang” is the attribute name, and “fr” is the value for that attribute.
Multiple attribute fields are possible, which effectively operates as a Boolean “and”,
requiring that both attributes must match:
[region "OTName"][attribute "lang"="fr"][attribute
"size"="med"] all
This syntax can be used to dynamically define the regions and their priority as part
of the query. However, this approach does not allow the value that matched the query
to be returned. If retrieving the prioritized value is necessary, then a synthetic
region declaration must be made in the LLFieldDefinitions.txt file:
CHAIN GoodDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate OTCreateDate
A query can then be made using the pre-defined date, and the GoodDate field can
also be returned as a target of the SELECT:
[region "GoodDate"] < "-5y"
For those interested in trying to construct the equivalent query using standard
Boolean operators, an example is shown below. Note that using the ‘first’ feature is
not only more convenient, but the implementation is more efficient. Internally, a new
operator performs the necessary logic with fewer operations; the query is not simply
converted to this Boolean equivalent:
[region "OTExternalCreateDate"] < "-5y" or ([region
"OTExternalCreateDate"] != all and ([region
"OTExternalModifyDate"] < "-5y" or ([region
"OTExternalModifyDate"] != all and ([region
"OTDocCreatedDate"] < "-5y" or ([region "OTDocCreatedDate"]
!= all and ([region "OTCreateDate"] < "-5y"))))))
The ‘first’ region method can be used with all region types and most operators.
However, search within a specific text metadata attribute value with the CHAIN / first
operator is not supported.
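The CHAIN / ‘first’ evaluation described above can be modeled as taking the first defined value. A minimal sketch of the semantics (not the actual implementation):

```python
def first_defined(values):
    # CHAIN / 'first' semantics: the object is evaluated against the first
    # region in the chain that has a value; later regions are ignored.
    for v in values:
        if v is not None:
            return v
    return None  # no region has a value: the object cannot match
```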
The min and max operators will skip assessment when an object lacks a value. For
example, if an object had only Attr2 defined in the example above, then it would
automatically be evaluated as the minimum value. If none of the regions has a value,
the object does not match.
Min and max region assessments work for all data types, although not all operations
are supported. Supported operations include comparisons against a value (<,=, >,
etc.), basic term and phrase matching, IN, ranges, etc. However, operators that
expand to multiple elements are not available, such as termset, stemset, thesaurus,
wildcards and regular expressions.
For multi-value TEXT metadata regions, the smallest value in a set of values for a
region will be used when assessing a minimum region, and the largest value will be
used when assessing a maximum region.
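The min and max handling of undefined values can be modeled the same way. A minimal sketch of the semantics (not the actual implementation):

```python
def min_region_value(values):
    # 'min' semantics: undefined (None) regions are skipped; if no region
    # has a value, the object cannot match (signalled here with None).
    defined = [v for v in values if v is not None]
    return min(defined) if defined else None

def max_region_value(values):
    # 'max' semantics: the largest defined value is the one assessed.
    defined = [v for v in values if v is not None]
    return max(defined) if defined else None
```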
In addition to specifying ad-hoc minimum and maximum region evaluations in a
query, a synthetic region may be defined as a convenience using the
LLFieldDefinitions.txt file:
MIN SmallAttr Attr1 Attr2 Attr3
MAX BigDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate
A predefined region has the additional property that the tested value can also be
returned in a SELECT statement. Note that no additional storage or indexes are
created; this region definition is a directive to the query constructor. Both the
dynamic and predefined approaches execute identically.
As a point of interest, it is usually possible to construct an equivalent query using
standard Boolean logic, although the min and max forms are computationally more
efficient. The equivalent query is quite complex, and varies depending on the nature
of the comparison (greater than, equal, less than) and whether a minimum or
maximum is required. Where multi-value text is present, there is no Boolean logic
equivalent. As one example,
[min created,modified,record,system] >= "20150403"
is equivalent to:
Similarly, the all region designation is a syntax shortcut for using the AND operator.
The convenience form:
[all "r4", "r5", "r6"] "sue"
Regular Expressions
OTSE supports the use of regular expressions for matching tokens. A regular
expression is a pattern of characters. In the OTSE query language, a term preceded
by the operator regex is interpreted as a regular expression. Patterns are defined
using the following rules:
+ The plus character matches the smallest preceding range one or more
times. For example,
"tr[eay]+ " will match words like try, tree, trey, treayaaa or country. It
will not match tr.
? The question mark character matches the smallest preceding range
exactly zero or one time. Reusing the previous example:
"tr[eay]? " will match try or pictr. However, it will not match tree.
| The vertical bar functions as an OR operation between patterns.
"go|stay" will match cargo or stay.
The range "[a-c]" could be represented as "a|b|c".
( ) Parentheses are used to group patterns together. This allows complex patterns to be constructed.
"^....s?$" Match five-letter words that end with the letter s, or four-letter words.
"^en[a-z]+p[eaid]+$" Not sure how you spell encyclopedia? Starts with ‘en’, has some letters, then a ‘p’, then some combination of e, a, i and d. Mind you, this also matches envelope.
"(0?[1-9])|(1[0-2]):[0-5][0-9]" Find words that contain a string that might be a time in 12-hour format, such as 1:30, 03:26, 12:59.
"^s(ch)?m[iy](th|dt|tt)e?$" Match words like smith, smyth, Schmidt, smitte.
"^ope.+ext$" Matches the common user expectation of a wildcard in the middle of a word: ope*ext.
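Several of these patterns can be tried out with an ordinary regex engine. The sketch below uses Python's re module; note that Python's dialect differs in details from the OTSE query language (in particular, OTSE's ^ and $ anchor to a token, which fullmatch on an individual word approximates here):

```python
import re

# The "smith" example: optional "ch", i or y, one of several endings.
smith_like = re.compile(r"s(ch)?m[iy](th|dt|tt)e?", re.IGNORECASE)
for word in ["smith", "smyth", "Schmidt", "smitte"]:
    assert smith_like.fullmatch(word)

# "ope.+ext" as a middle-of-word wildcard, i.e. ope*ext.
assert re.fullmatch(r"ope.+ext", "opentext")

# Unanchored patterns match anywhere in a token: tr[eay]+ matches "country"
# (via the substring "try") but not the bare token "tr".
assert re.search(r"tr[eay]+", "country")
assert not re.search(r"tr[eay]+", "tr")
```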
Using the SOR operator ensures that multiple matches won’t rank the result higher. Note the use of the = modifier; the IN operator will only match entire values in metadata regions. The behavior in full-text content is slightly different, in that whole-value matching no longer applies.
in(superior,erie, "Lake of the Woods")
The TERMSET feature allows you to locate objects that have at least N matching
values from the provided list. For example, the clause:
termset(5,Water, river, lake, pond, stream, creek, rain,
rainfall, dam)
will match an object that contains 5 or more of the terms and phrases. This is a very
powerful construct for discovery and classification applications. There is no simple
equivalent representation. The example above could be expressed like…
SELECT ... WHERE
(stream AND pond AND lake AND river AND water) OR
(creek AND pond AND lake AND river AND water) OR
(creek AND stream AND lake AND river AND water) OR
(creek AND stream AND pond AND river AND water) OR
(creek AND stream AND pond AND lake AND water) OR
(creek AND stream AND pond AND lake AND river) OR
(rain AND pond AND lake AND river AND water) OR
(rain AND stream AND lake AND river AND water) OR …
Fully written out, this query comprises 126 lines with 629 operators. The TERMSET operator is powerful, concise, and eliminates errors in constructing complex
queries. The implementation of TERMSET and STEMSET is also internally
optimized for these cases. Queries may operate considerably faster with less
memory using TERMSET/STEMSET compared to executing the fully expanded
equivalent queries constructed of AND / OR terms.
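The size of the expanded form follows directly from combinatorics: requiring any 5 of 9 terms yields C(9, 5) AND-clauses, each containing 4 ANDs, joined by ORs. A quick check of the figures quoted above:

```python
from math import comb

# 9 terms, any 5 must match: one AND-clause per 5-term combination.
clauses = comb(9, 5)                       # 126 AND-clauses
operators = clauses * 4 + (clauses - 1)    # 504 ANDs plus 125 ORs
print(clauses, operators)                  # 126 629
```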
The value of N can also be a percentage, meaning that it must match at least the
specified percentage of terms. 50% of 4 terms means that 2 or more matching terms
are needed. 51% means that 3 or more must match, since the percentage is a
minimum requirement. Using percentages is typically useful when there are longer
lists of candidate matching terms. These are equivalent:
Termset( 3, Water, river, lake, "duck pond", "stream")
Termset( 50%, Water, river, lake, "duck pond", "stream")
Negative values for N are interpreted to mean a threshold of M-N, where M is the number of terms. For example, if there are 10 terms, a value of -2 is equivalent to a value of 8 for N. It may be of
interest to note that at the endpoints for a list of N terms, TERMSET 1 is an effective
OR, and TERMSET N is an effective AND.
Termset (1, red, blue, green) red OR blue OR green
Termset (3, red, blue, green) red AND blue AND green
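The threshold rules described above (absolute counts, percentages as a minimum rounded up, and negative values meaning all-but-|N|) can be modeled in a short sketch. This illustrates the semantics only, not the OTSE implementation:

```python
from math import ceil

def threshold(n, m):
    """Resolve a TERMSET threshold for a list of m terms."""
    if isinstance(n, str) and n.endswith("%"):
        return ceil(m * int(n[:-1]) / 100)   # a percentage is a minimum
    if n < 0:
        return m + n                         # -2 of 10 terms means 8
    return n

def termset(n, terms, document_words):
    matched = sum(1 for t in terms if t.lower() in document_words)
    return matched >= threshold(n, len(terms))

doc = {"water", "river", "lake", "pond", "mud"}
terms = ["Water", "river", "lake", "pond", "stream"]
print(termset(3, terms, doc))       # True: 4 of 5 terms match
print(termset("50%", terms, doc))   # True: needs ceil(2.5) = 3
print(termset(-1, terms, doc))      # True: needs 5 - 1 = 4
```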
The STEMSET operator is similar to TERMSET, except that it matches stems of the
values (that is, singular and plural variations).
stemset(5, Water, river, lake, pond, stream, creek, rain,
rainfall, dam)
Being singular/plural aware means that a document that had only the words:
Water, river, rivers, pond, ponds
will not match, since STEMSET considers the singular and plural forms of river and
pond to be the same term. This document therefore only has 3 matching terms,
instead of the desired 5. Essentially,
stemset(2,water,river,pond)
can be thought of as
((stem(water) and stem(river)) or (stem(water) and
stem(pond)) or (stem(river) and stem(pond)))
or, in a somewhat simplified form which doesn’t really cover all the variations of
stemming,
((water or waters) and (river or rivers)) or ((water or
waters) and (pond or ponds)) or ((river or rivers) and
(pond or ponds))
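The distinct-stem counting behavior can be illustrated with a deliberately naive stemmer that only folds a trailing “s” (real stemming handles far more than this; the helper names are invented for the example):

```python
def naive_stem(word):
    """Toy stemmer: fold simple plurals only. For illustration only."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 2 else w

def stemset_count(terms, document_words):
    """Count how many of the terms match the document, by stem."""
    doc_stems = {naive_stem(w) for w in document_words}
    return sum(1 for t in terms if naive_stem(t) in doc_stems)

doc = ["Water", "river", "rivers", "pond", "ponds"]
terms = ["Water", "river", "lake", "pond", "stream", "creek",
         "rain", "rainfall", "dam"]
print(stemset_count(terms, doc))   # 3 distinct stems: water, river, pond
```

With only 3 distinct stems matched, a stemset(5, …) clause over these terms would not match this document, as described above.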
Unlike the IN operator, STEMSET and TERMSET are not constrained to matching
only full values in text metadata regions. The negation of these operators is possible
using NOT, and can be interpreted as follows:
(m or n) not termset(2,a,b,c)
(m or n) and-not (termset(2,a,b,c))
The TERMSET and STEMSET operators were first introduced in version 16.0.1
(June 2016).
ORDEREDBY
The ORDEREDBY portion of a query is optional. Its purpose is to give you control
over how the search results should be sorted (ranked) and returned in the get results
command. If omitted from the query, the result ranking is sorted by the relevance
score in descending order. This means that the most “relevant” results are returned
first.
language, then the language with the smallest value is used, otherwise use the
standard “no attribute” sorting.
ORDEREDBY Existence
Rank the search results by the number of matching terms in an object. This modifies
the standard relevance computation slightly, so that the number of times a term
appears is not important, only the number of terms which exist in the document.
ORDEREDBY Rawcount
Rank the search results by the number of instances of terms in an object. This
modifies the standard relevance computation slightly, so that the number of times a
term appears is highly rated. The default scoring algorithm considers the number of
times a word appears, but it is only a modifier. Using Rawcount will make the
number of times words appear a major factor in the score.
ORDEREDBY Score[N]
Rank the search results using a combination of the ranking computation (global
settings) and boost values specified as parameters in the query. Refer to the
Relevance section of this document for details.
Performance Considerations for Sort Order
In some cases, the sorting requested for results can be a factor in search
performance. Sorting is performed in the search engines, and each search engine
requires temporary memory allocation and time to perform the sorting. For both time
and memory, the key variables are the type of sort, and the cursor position of the
requested results.
Orderedby Nothing is the fastest performer, and uses the least memory – since it
skips the sorting step entirely. If your application needs to gather all the results from
a query, the use of Nothing as the sort order is strongly recommended, especially if
you are dealing with large data sets. Sorting and retrieving 1 million results may
require on the order of 100 Mbytes of temporary memory. Sorting by Nothing will
avoid this penalty.
Sorting by primitive data types such as floats (relevance), integers, or dates is the
next best performing configuration. Roughly speaking, primitive types require about
4 Mbytes of RAM for each 100,000 results the cursor is advanced.
Sorting by string values is slower and uses more memory. The performance impact may become material when moving the cursor past about 20,000 results. The memory requirement
varies depending on the lengths of the strings, but typically runs about 15 Mbytes of
temporary memory per 100,000 results the cursor is advanced.
Sorting on multiple fields is slower, and uses more memory. The performance
penalties are difficult to predict, since they depend on the numbers and types of
sorts. The order also matters – a sort on a number first, then on a string uses about
8 Mbytes per 100,000 results the cursor advances. Reversing it to sort on the string
first, then a number, would use more memory than just a string sort.
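The rough figures above can be turned into a back-of-envelope estimator of per-engine sort memory. The numbers are the rule-of-thumb values quoted in this section, not guarantees:

```python
# Approximate temporary memory (MB) per search engine for sorting,
# based on the rule-of-thumb figures in the text.
MB_PER_100K = {
    "primitive": 4,            # floats, integers, dates
    "string": 15,              # typical string sort
    "number_then_string": 8,   # multi-field: number first, then string
}

def sort_memory_mb(cursor_position, sort_kind):
    return MB_PER_100K[sort_kind] * cursor_position / 100_000

print(sort_memory_mb(1_000_000, "primitive"))   # 40.0
print(sort_memory_mb(100_000, "string"))        # 15.0
```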
What does it mean when we talk about advancing the cursor position? Regardless of
how many search results there are, if you are only retrieving the first few hundred, the
sort time and memory required will be low. However, if you want sorted results
numbers 99,900 to 100,000 – then the cursor must be advanced to at least position
100,000. The search engines must sort at least that number of results, requiring
significant resources. When asking for results 1 to 100, the search engines can optimize their sorting implementation to focus on ensuring that just the minimum set of values is properly sorted.
The memory resources required for sorting are per search engine, per concurrent
search query. If you want to support up to 10 concurrent queries, each asking for
100,000 results, then each search engine may need over 150 Mbytes of working
space available. In normal types of applications this pattern is rarely observed, and
in practice most applications use relatively small amounts of memory to retrieve less
than 10,000 results from a few concurrent queries.
Text Locale Sensitivity
When ordering results by a text region, locale-sensitive sorting is used by default. As
a result, sorting can differ somewhat depending upon the locale. Locale-sensitive
collation generally groups accented characters near their unaccented equivalents.
Depending on the locale, multiple characters may be considered as a single logical
character, and some punctuation may be ignored.
The locale for a system is determined from the operating system by Java, and uses
the Java system variables user.language, user.country and user.variant. For
debugging, these values are logged during startup. In Java, the locale can be explicitly set as command-line parameters to override the system defaults. For example:
java -Duser.country=CA -Duser.language=fr …
Locale sensitive sorting was first added in 20.4, and can be disabled in the
[Dataflow_] section of the search.ini file by requesting the older behavior:
OrderedbyRegionOld=true
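OTSE delegates collation to Java's locale support. As a rough illustration of the concept in Python, the key below strips combining accents so that accented characters group beside their unaccented equivalents; real locale collation is considerably more sophisticated:

```python
import unicodedata

def accent_insensitive_key(s):
    """Sort key that groups accented characters with their base letters."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.lower()

words = ["côte", "Coat", "cote"]
print(sorted(words, key=accent_insensitive_key))  # ['Coat', 'côte', 'cote']
```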
Facets
Purpose of Facets
Facets allow metadata statistics about a search query to be retrieved. For example,
if facets are built for the region “Author”, and there were 300 results, facets might
supply the following information from the “Author” region:
Mike 121
Alexandra 72
David 32
Michelle 21
Stephen 19
Alex 11
Paul 6
The interpretation would be that of the 300 results, 121 of them had the value “Mike”
in the “Author” region, 72 had the value “Alexandra”, and so forth. As an application
developer, you can present this information to the user to help them understand more
about their search results. It is also common to allow the user to “drill down” into the
results based on facets. For example, the user might determine they only want
results authored by Ferdinand. They select Ferdinand, which re-issues the same
search, this time with an additional clause in the query along the lines of AND
[region "Author"] "Ferdinand" (require “Ferdinand” in the region “Author”).
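The counting and drill-down steps can be modeled with a simple in-memory sketch. OTSE performs this work inside the engines; the result data here is invented for illustration:

```python
from collections import Counter

# A toy result set with an "Author" metadata region.
results = (
    [{"Author": "Mike"}] * 3 +
    [{"Author": "Alexandra"}] * 2 +
    [{"Author": "Ferdinand"}] * 1
)

# Facet counting: frequency of each value across the results.
facet = Counter(r["Author"] for r in results)
print(facet.most_common(2))   # [('Mike', 3), ('Alexandra', 2)]

# Drill-down: the re-issued query adds AND [region "Author"] "Ferdinand",
# which is equivalent to filtering the result set.
drilled = [r for r in results if r["Author"] == "Ferdinand"]
print(len(drilled))           # 1
```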
Requesting Facets
OTSE generates facet results when requested within the search queries. There are
no special configuration settings necessary to use facets, although optimization by
protecting commonly required facets may be a good idea. To request facets, in the
‘SELECT’ portion of the query, you add text along these lines:
SELECT "OTObject","OTSummary" FACETS "Author","CreationDate" WHERE …
OTSE would then generate facets for two regions: Author and CreationDate. There
is no defined limit to the number of facets that can be requested for a query, but
memory or performance limitations will become a factor for large numbers of facets.
The design optimizations selected for OTSE are based on expectations of 100 or
fewer distinct facets in use at any time.
Once the query completes, you retrieve the results from the search engines with the
command:
GET FACETS
The output from the GET FACETS command is described in more detail in the Query
Interface section.
Like the search results, the facets for the query are retained until the query is
terminated or times out. Except for date facets, the values are returned sorted from
highest frequency to lowest frequency.
When facet values are returned, there are a couple of additional values provided.
The number of facet values identifies the total number of facet values found. The
returned count is the number of facet values actually returned, which is usually
smaller. There is also an overflow indicator, which identifies whether the number of
facet values exceeded the configurable limit – meaning that the facet results are not
exact since they are incomplete.
In most applications, a user is not interested in reviewing thousands of possible
metadata values in a facet. Usually, only the most common values are of interest.
The facets implementation allows you to place a limit on the number of values for
each facet you want to see. Using syntax such as:
SELECT "OTObject" FACETS "Author"[5], "DocType"[15]
This would return only the 5 highest frequency values in the field “Author” and the 15
highest frequency values in the field “DocType”. By default, the first 20 values are
returned. This default can be overridden by a configuration setting. You are strongly
advised to limit the number of values returned, especially with facets that may contain
arbitrary values, since they can potentially contain millions of values which would
significantly impact search performance.
Facet Caching
Facets data structures are built on demand. Once created for a given facet, the
structure is retained in memory so that subsequent queries using the facet are very
fast. In order to keep memory use constrained, there is a maximum number of facets
that the search engine will retain. If a query requests new facets that are not in
memory and the maximum number of facets is exceeded, then the search engine will
delete the facet structure that has not been used for the longest time. The default is
to retain up to 25 facet structures in memory. There is a 10 minute “safety margin” –
meaning that even if 25 facets are exceeded, a facet that was used in the last 10
minutes will not be deleted. A facet that is included in an active query also cannot be deleted. The limit is therefore a guideline rather than an absolute maximum.
If your applications use more than 25 facets regularly, then search query
performance may suffer as facet data structures are regularly created and deleted.
You can adjust the number of facets to retain in memory in the [Dataflow_] section of
the search.ini file:
MaximumNumberOfCachedFacets=25
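The recycling policy can be sketched as a small cache model: keep up to a maximum number of facet structures, never evict protected facets or any facet used within the 10-minute safety margin, and otherwise evict the least recently used. The class and field names here are illustrative, not OTSE internals:

```python
import time

SAFETY_MARGIN_S = 600   # the 10 minute safety margin

class FacetCache:
    def __init__(self, maximum, protected=()):
        self.maximum = maximum
        self.protected = set(protected)
        self.last_used = {}   # facet name -> last-use timestamp

    def touch(self, name, now=None):
        """Record use of a facet, evicting stale facets if over the limit."""
        now = time.time() if now is None else now
        self.last_used[name] = now
        self._evict(now)

    def _evict(self, now):
        # Only unprotected facets outside the safety margin are evictable.
        evictable = sorted(
            (t, n) for n, t in self.last_used.items()
            if n not in self.protected and now - t > SAFETY_MARGIN_S
        )
        while len(self.last_used) > self.maximum and evictable:
            _, victim = evictable.pop(0)   # least recently used first
            del self.last_used[victim]

cache = FacetCache(maximum=2, protected={"Author"})
cache.touch("Author", now=0)
cache.touch("DocType", now=0)
cache.touch("Region", now=10_000)      # over limit: LRU "DocType" evicted
print(sorted(cache.last_used))         # ['Author', 'Region']
```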
Date Facets
Date facets represent a special case, which has been constructed specifically to
address a very common and important requirement, namely presenting facets that
represent the “recentness” of an object in the index. Date facets are not designed to
handle arbitrary dates or future dates.
If facets are requested for regions of type DATE, special handling occurs. Each day
within the supported time range is counted multiple times – as a day, within a week,
within a month, within a calendar quarter, and within a calendar year.
Date facets are not sorted by frequency. Instead they are ordered by recentness. If
you have requested facets for 8 months, you will always get the most recent 8
months returned. When constructing a query for date facets, the syntax within the
SELECT statement is:
… FACETS "CreateDate"[d30,w0,m12,q0,y10] …
The facet counts are optionally specified as a letter followed by the number of facet
values desired, where:
d – number of days, including today
w – number of weeks starting on Sunday, including today
m – number of months, including the current month
q – number of calendar quarters (Jan, Apr, Jul, Oct), including the current quarter
y – number of calendar years, including the current year
The example above would request the last 30 days, the last 12 months, the last 10
years, and no facets for weeks or quarters. To obtain no values for a category,
specify zero. Omitting the category will result in the default number of values being
returned. If the count for a value is zero, then no facet value will be returned.
The default number of date values to be returned is defined in the search.ini file. In
the [DataFlow_] section:
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10
The values returned for date facets are formatted to easily identify their type and date
range.
Days: d20120126 (dYYYYMMDD) 26 Jan 2012
Weeks: w20120108 (wYYYYMMDD) week starting 8 Jan 2012
Months: m201202 (mYYYYMM) Feb 2012
Quarters: q201204 (qYYYYMM) quarter starting Apr 2012
Years: y2012 (yYYYY) year 2012
Date facets can only be built for dates where the day is within range of the
maximum number of facet values, per the settings described later. The default is
32767, or about 90 years.
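A sketch of how the facet value labels for a given day could be derived, following the formats listed above (weeks start on Sunday; quarters start in Jan, Apr, Jul and Oct). The function name is hypothetical, not an OTSE API:

```python
from datetime import date, timedelta

def date_facet_labels(d):
    """Build the d/w/m/q/y facet labels for a calendar day."""
    # weekday(): Monday=0 ... Sunday=6; step back to the week's Sunday.
    sunday = d - timedelta(days=(d.weekday() + 1) % 7)
    quarter_month = ((d.month - 1) // 3) * 3 + 1   # Jan, Apr, Jul or Oct
    return {
        "day": d.strftime("d%Y%m%d"),
        "week": sunday.strftime("w%Y%m%d"),
        "month": d.strftime("m%Y%m"),
        "quarter": f"q{d.year}{quarter_month:02d}",
        "year": f"y{d.year}",
    }

print(date_facet_labels(date(2012, 1, 26)))
# {'day': 'd20120126', 'week': 'w20120122', 'month': 'm201201',
#  'quarter': 'q201201', 'year': 'y2012'}
```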
FileSize Facets
Integer regions may be marked in the search.ini file to have their facets presented as
FileSize facets. This mode groups file sizes into a set of about 30 pre-defined
ranges. This mode ignores the number of facet values requested, and always returns a fixed number of facet values representing the buckets (or ranges). Details of these facet
values are described in the get facets command section.
Typically, search results are post-processed by the application to filter out results that a particular user is not entitled to see. It is more difficult to do this with facet values.
For applications in which the security requirements are high, you must ensure that
facets which contain sensitive information are not made available to users without
suitable clearance. In many cases, it is considered acceptable to display facets
which do not contain sensitive data, such as file sizes, object types, or dates. It might
also be possible to achieve acceptable security by reducing the exactness of the
object counts – displaying a more generic frequency count (eg: 1 to 4 “bars”, or labels
such as “many” or “few”) instead of the precise counts from the search engine.
Ultimately, you will need to choose an appropriate tradeoff between user convenience and an improved search experience versus the risk that a user might glean harmful information from facet values.
The maximum number of values per facet sets the upper limit on how many distinct
facet values are possible. This limitation is present as a failsafe against abuse, and presumes that the typical facet application is intended for much smaller data sets.
Increasing this value will increase the amount of memory required to store facet
information. Because the internal data structures use bit-fields, the optimal setting for this value is 1 less than a power of 2 (eg: 2**N - 1). It should be noted that multi-value text fields consume a facet value for every combination of text values contained
in the field. For example, if the region “Colors” can contain combinations of “red”,
“blue”, “green” and “black”, then 15 combinations are possible and 15 of the facet
values could potentially be used. If you expect to create facets for regions that may
have many combinations (such as email distribution lists) then this number may need
to be very large, and you may be limited by usable memory.
MaximumNumberOfValuesPerFacet=32767
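The “15 combinations” figure from the Colors example is simply the number of non-empty subsets of the four values, which can be confirmed quickly:

```python
from itertools import combinations

# A multi-value field over 4 colors can hold any non-empty subset of them.
colors = ["red", "blue", "green", "black"]
subsets = [c for k in range(1, len(colors) + 1)
           for c in combinations(colors, k)]
print(len(subsets))   # 15, i.e. 2**4 - 1
```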
The number of desired facets is the default number for the “most common N” facet
values to be returned if the number of desired facets is not specified in the query.
This ini setting does not affect the special return values for Date type facets.
NumberOfDesiredFacetValues=20
During normal operation, after the facet data structures are generated, computing
facet information for a single metadata region is relatively fast, typically less than 50
milliseconds. This time varies primarily depending on the number of search results,
since the facet values for every result need to be added together. If your typical
queries return many millions of results, average times would be closer to 1 second. As more facets are requested for a query, these times are additive. Experience has been
that facet computation is not a material consideration for performance in most
scenarios.
Conversely, initial generation of facet data structures can be relatively expensive.
Each potential metadata value must be examined, and a new facet value created or
the data structures updated if it already exists. The time to perform this task varies
widely based on the number of items in the partition, the data type, the number of
possible unique values, and for text metadata – whether the values are stored in
memory or on disk.
For example, if there is an enumerated data type with less than 100 possible values
in a partition containing just 1 million items, generation of the facet data structures is
likely less than 1 second.
At the other extreme, generating facet data structures on a text region that has high
cardinality (e.g. 200,000 possible values, such as a folder location or keywords/hot
phrases), in a large partition containing 10 million items that is configured for storage
on disk will take considerably longer, potentially many minutes.
For larger systems in particular, limiting the use of facets for regions with high
cardinality may be necessary to meet performance objectives.
Protected Facets
As noted above, the time required to generate facet data structures can be material.
In addition to building search facets on demand, it is possible to specify facets that
are known to be commonly used. On startup, the data structures for these facets will
be built if they are not in the Checkpoint; they are excluded from facet recycling
(never destroyed); and they are optionally saved in the Checkpoint file for faster
loading on next startup. Content Server uses this feature. To build protected facets at
startup, in the search.ini file, specify the regions in the [Dataflow_] section:
PrecomputeFacetsCSL=region1,region2,region3
As an option, the protected facets may be stored in the Checkpoint file. This also
means a copy of the facet data is maintained in the Index Engines, which requires
additional memory. To enable persisting facets in the Checkpoints, in the [Dataflow_]
section of the search.ini file add:
PersistFacetDataStructure=true
When specifying protected regions, you should also ensure that the desired number
of cached facets is greater than or equal to the number of protected facets specified
in this list. The desired number represents the point at which the search engine will
begin recycling non-protected facets to make room for new facets requested in
queries. In addition, the maximum number of facets should be higher still. The
maximum number of facets is the limit, which may be higher than the desired number
if there are many facets requested in a single query. Beyond this maximum number,
the facet requests are discarded.
DesiredNumberOfCachedFacets=16
MaximumNumberOfCachedFacets=25
Search Agents
Search Agents are stored queries that are tested against new and changed objects
as part of the indexing process. The two most common uses of Search Agents are to
stay up to date on topics of interest, and for assigning classifications.
The monitoring case is illustrated by the Content Server concept of Prospectors.
Consider a situation where you want to know everything about a particular customer.
You construct a query to match the name of the customer or a few of the known key
contacts at that customer. By adding this as a Prospector, you are notified any time
new data is indexed that matches this query.
For classification, you construct a set of queries that define a specific classification
profile. For example, if all customer service requests use a form that contains the
text “customer support ticket”, then this query is attached to the classification agent,
and any object containing this phrase is marked with the classification. By using
many queries, you can build a complete set of classification categories. One object
may match several possible queries, and be tagged with multiple classifications this
way. In Content Server, this is known as Intelligent Classification.
In operation, the queries to be tested against new data are contained in a file.
Matches to the search agent queries are placed in iPools which are monitored by the
parent application, typically Content Server.
Search Agents can be run after every indexing transaction, or on a timed interval, as configured in the search.ini file:
[UpdateDistributor_xxx]
RunAgentsEveryTransaction=false
RunAgentIntervalInMS=30000
If the interval is set to a value of -1, the agent execution will pause. There is no loss
of activity – when the interval is restored to a positive value, the agent queries will
include all objects that were indexed while paused. Pausing may be desirable if
there is a temporary need to maximize indexing performance.
The Update Distributor keeps track of the agent execution in files that are stored in a
subdirectory of the search index:
index/enterprise/controls
The files are named upDist.N and contain the timestamp of each of the last agent runs, expressed as milliseconds since the Unix epoch (Jan 1, 1970). A sample file is shown below.
UpDistVersion 1
SearchAgentBaseTimestamp 1571261889130 "MySA0"
SearchAgentBaseTimestamp 1571261889130 "MySA1"
EndOfUpDistState
The timestamp field used by default is the OTObjectUpdateTime. The field can be
changed, but there are currently no known scenarios where the default value should
not be used.
[Dataflow_xxx]
AgentTimestampField=OTObjectUpdateTime
When using interval agent execution, the Update Distributor timing summaries will
include the time spent running agent queries, identified with the label SAgents.
[SearchAgent_agent1]
operation=OTProspector
readArea=d:\\locationpath
readIpool=334
queryFile=d:\\someDirectory\\prosp1.in
The readArea and readIpool parameters specify the directory and iPool number to which results from the Search Agent should be written. These are then consumed by the controlling application.
The queryFile contains the search queries to be applied during indexing. You can
have many search queries within each queryFile.
The operation can be one of OTProspector or OTClassify. This value does not
change the operation of the search agents, but is recorded in the output iPools, and
is used to help the application (typically Content Server) determine how the iPool
should be processed.
For example, assume that your application creates a Search Agent file named
prosp1.new. The Update Distributor will delete any existing prosp1.in file and rename
prosp1.new to prosp1.in. This approach allows Search Agent queries to be modified
without changing the search.ini file and restarting the Update Distributor.
<Q2R0C0>
<OTObject>DataId=16388&Version=0</OTObject>
</Q2R0C0>
<Q2N1>OTScore</Q2N1>
<Q2R0C1>71</Q2R0C1>
<Q2R1C0>
<OTObject>DataId=16398&Version=0</OTObject>
</Q2R1C0>
<Q2R1C1>71</Q2R1C1>
<Q2R2C0>
<OTObject>DataId=16409&Version=0</OTObject>
The Search Agent type, in this case OTClassify, is the first entry in the iPool. This value is drawn from the search.ini file in the Search Agent configuration setting.
The search results themselves are presented with a naming convention that reflects
a QUERY, ROW, COLUMN numbering convention. For instance, the value
<Q2R0C1> is used for Query 2, Row 0 (the first result), Column 1 (the second region
in the select clause). Likewise, the value <Q1N0> is used to label the Name of Column 0 for Query 1 (in this case “OTObject”). Note that the names of the regions
are only provided in the first row for a given query.
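A small parser for this naming convention, as a client application might implement it (the function is hypothetical, not part of any OTSE API):

```python
import re

# Q<query>R<row>C<column> tags carry result cells; Q<query>N<column> tags
# carry the column names for that query's select clause.
TAG = re.compile(r"Q(\d+)(?:R(\d+)C(\d+)|N(\d+))")

def parse_tag(tag):
    m = TAG.fullmatch(tag)
    if not m:
        raise ValueError(f"not a result tag: {tag}")
    query = int(m.group(1))
    if m.group(4) is not None:                     # Q<q>N<col>: column name
        return ("name", query, int(m.group(4)))
    return ("cell", query, int(m.group(2)), int(m.group(3)))

print(parse_tag("Q2R0C1"))   # ('cell', 2, 0, 1)
print(parse_tag("Q2N1"))     # ('name', 2, 1)
```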
Performance Considerations
Search Agents are not free. Although the Agents are only applied to newly added
objects, the frequency, complexity and number of queries run as agents can have a
noticeable impact on indexing performance. For applications with high indexing
rates, Search Agents may not be an appropriate feature.
If you require these types of features for high indexing volumes, you can consider
implementing your solution using standard search queries, serviced by the Search
Engines. By enabling the TIMESTAMP feature for objects, the exact indexing time of
objects can be determined, and a pure search application can provide similar
features, running on a scheduled interval.
Relevance Computation
Relevance is a measurement of how well actual search results meet the user
expectations for search result ranking. Relevance is a subjective area, based upon
user judgments and perception, and often requires experimentation and tuning to
optimize. This is one of the fundamental challenges with relevance tuning: if you
improve relevance for one type of user, you may well be reducing relevance for other
users who have different expectations.
Relevance is a method for determining how close to the top of the list a search result
should be placed. However, relevance has NO IMPACT on whether an object
actually satisfies a query. If a query matches 100,000 items, tuning relevance only
affects the ordering of the items, not which items are matched.
Search relevance is not entirely the responsibility of the search engine. Relevance
scoring is a function of many parameters, most of which are provided by the
application, such as Content Server. Tuning Content Server is also required to
optimize search relevance, but this document will focus more on the OTSE
contributions to relevance.
For typical users trying to find objects, relevance is an important consideration, and
the search results are usually presented sorted by the relevance score. However,
relevance is not a consideration for certain types of applications. For example, Legal
Discovery search is concerned with locating all objects, but does not care about the
order of presentation. Likewise, when using search to browse, results are often
sorted by date or object name.
Components of Relevance
There are two different types of computations that are applied to objects in the index
to determine their relevance. The first is “ranking”, which is a computation applied in
the same way on every search query. Ranking typically adjusts relevance by giving
higher weights to recently created objects, office documents, or known important
locations. Before Search Engine 16, Ranking was the only available relevance
scoring method, and ranking and relevance were often used interchangeably.
Beginning with Search Engine 16, a second type of relevance computation is
available, known as “boost”. Unlike Ranking, the Boosting parameters are dynamic,
and are provided on each query. This permits the application to add relevance
adjustments based on context, such as the user identity or current folder location.
The remainder of this section will cover the Ranking capabilities, with Boost features
detailed later. You can mix and match both Ranking and Boost, although each
additional relevance feature slightly increases the overall search query time.
In most cases, the ranking configuration is comprised of weights and regions. The
weights indicate how important the parameter is in scoring. Note that these weight
values are relative. Setting all the weights high is the same as setting all the weights
to a medium value. The difference in weights is ultimately what matters.
Some of the explanations below contain simplified versions of the equations used to
compute the scores. They are simplified to the extent that a number of additional
computations are performed to adjust the results from each computation to a
normalized range. The equations presented here are only intended to clarify the
impact that adjustments to the parameters make on the ranking computations.
Date Ranking
The date an object was created or updated is typically an important aspect of
relevance, especially for a dynamic or social application. In these cases, users tend
to favor objects that are recent. Applications such as archival on the other hand
typically do not care about recentness, and different settings might be appropriate.
The date ranking parameter allows you to identify metadata regions which contain
date values that reflect the recentness of an object, and configure their scoring
parameters.
Date ranking is computed using a decay rate from the current date. The decay rate
is one of the configurable values. Small values for decay rates will reduce the score
of older items more rapidly. A simplified approximation of the algorithm is:
Date Relevance = decay / (recentness + decay)
In practice, a very aggressive value that strongly favors recent objects would be a
decay rate of 20 days. Consider this chart of some representative values. The
decimal values in the body of the table represent the contribution to ranking, with
higher values representing higher ranking.
[Table omitted: date relevance contribution by age in days for several decay rates.]
Clearly, small values of decay rates generate small ranking contributions for older
items. Remember that the date ranking value is only one component of the ranking
score, and you also control the weight to be applied to this computed value.
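The simplified formula can be worked through directly. The following sketch computes the date relevance contribution for a few representative ages and decay rates, using only the decay / (recentness + decay) approximation given above (the production algorithm includes additional normalization steps):

```python
def date_relevance(age_days: float, decay: float) -> float:
    """Simplified date-ranking contribution: decay / (age + decay)."""
    return decay / (age_days + decay)

# Representative values: each row is a decay rate, each column an object age.
for decay in (20, 90, 365):
    row = [round(date_relevance(age, decay), 2) for age in (0, 30, 90, 365)]
    print(decay, row)
```

With the aggressive decay of 20 days, a year-old object contributes only about 0.05, versus 0.5 with a decay of 365 days.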
The syntax for the date ranking configuration in the search.ini file is:
DateFieldRankers="dateRegion",decay,weight
For example, the following would use the last modified date on an object to compute
date ranking, with a moderately aggressive decay of 45 – but then make the overall
contribution of date to the ranking score small by giving it a weight of 2:
DateFieldRankers="OTModifiedDate",45,2
The date scoring algorithm supports multiple elements. For example, if you had two
different metadata regions that commonly contain important dates that reflect object
recentness, you can specify both, and each is independently computed and added to
the overall ranking score:
DateFieldRankers="OTCreateDate",45,50;"OTVerCDate",30,30
The DateFieldRankers setting is recorded in the search.ini file, and Content Server
exposes this configuration setting in the search administration pages.
Relative frequency
The relative ratio of matched search terms to the overall content size is a factor. The
higher this ratio, the higher the relevance. An obvious example… assume you
search for “combustible”. If document ROMEO has the word combustible 30 times in
1000 words (3%) and document JULIETTE has 50 instances of combustible in 2000
words (2.5%), then document ROMEO will be ranked higher.
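The ROMEO and JULIETTE comparison reduces to a simple ratio. This sketch reproduces the arithmetic (the real relevance computation normalizes and weights this value alongside the other components):

```python
def term_ratio(term_count: int, total_words: int) -> float:
    """Relative frequency: matched term occurrences over content size."""
    return term_count / total_words

romeo = term_ratio(30, 1000)     # "combustible" 30 times in 1000 words: 3%
juliette = term_ratio(50, 2000)  # 50 times in 2000 words: 2.5%
assert romeo > juliette          # ROMEO ranks higher despite fewer hits
```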
Frequency
The more often the search terms occur in the text for an object, the higher the
ranking score.
Commonality
The more common a search term is in the dictionary for this partition, the less weight
it is given in computing the text score. For example, with typical English language
data, if you search for keywords “the” AND “scooter” – the value given to matches for
“scooter” will be considerably higher than matches for “the”, since “the” is overly
common.
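OTSE's exact commonality formula is not published; the standard inverse document frequency (IDF) weighting illustrates the principle that a term appearing in nearly every document contributes little to the score:

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: rarer terms get higher weight."""
    return math.log(total_docs / docs_with_term)

# In a 1,000,000-document partition, "the" is nearly everywhere while
# "scooter" is rare, so "scooter" matches carry far more weight.
assert idf(1_000_000, 900_000) < idf(1_000_000, 200)
```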
The full text search ranking algorithm is applied to the indexed content, plus any
metadata regions defined in the default search list. The relative weight of the full text
search is also configurable. Both values are specified in the search.ini file.
The default region search list is defined in the search INI file as:
DefaultMetadataFieldNamesCSL="OTName,OTDComment"
ExpressionWeight=100
Content Server exposes the list of default regions to search in the administration
pages for search, and the values are stored in the search.ini file. Remember to
ensure that any metadata text regions given an adjusted score are included in this
default region search list.
Object Ranking
The search ranking algorithm also allows external applications to provide ranking
hints for objects. In a defined metadata field, the application can provide a numeric
ranking score – an integer between 0 and 100. The search ranking algorithm can
incorporate this ranking value into the overall rank. You have the ability to set a
ranking value for each object, define the field to be used for object ranking, and
assign an overall weight to Object Ranking relative to other elements of the ranking
algorithm. If there is no Object Ranking value for an object, it gets a ranking
adjustment of zero.
The Object Ranking settings are kept in the search.ini file. In the example below,
OTObjectScore is the metadata region that contains the ranking value, and 80 is the
relative weight attached to the Object Ranking component of the ranking calculation.
ObjectRankRanker="OTObjectScore",80
If you are developing applications around search, using the Object Ranking feature
can improve the overall user experience. Some of the common events used to
modify the ranking include tracking objects that are popular for download, objects
placed in particular “important” folders, how frequently objects are bookmarked, or
other situations which are appropriate to the application. As a developer, you also
need to remember to degrade the object ranking over time – an object which is
important now may well lose its relevance later.
One other observation for developers setting Object Ranking values: as described
elsewhere in this document, OTSE supports indexing select metadata regions for
objects. You do not need to re-index the entire object in order to set the Object Rank
value; using the ModifyByQuery indexing operation is usually a good choice. Re-
indexing the entire object each time a ranking value changes would likely have a
material negative impact on overall system performance – both on the application
and OTSE.
Within Content Server, Object Ranking is leveraged by the Recommender module.
Query Boost
This boost method is used to adjust the relevance based on whether an object
matches query clauses. For illustration, consider the following example…
SELECT "OTObject" where "animal" ORDEREDBY Score[100] "dog"
BOOST[-10] "cat" BOOST[+15] ("t-rex" and "evolution")
BOOST[+%40]
The query will match items containing the text “animal”. However, we are less
interested in objects that also contain the text “dog”, so 10 is subtracted from the
relevance score. The user likes cats, so if the result contains the text “cat”, then we
add 15 to the score. If the result contained both “dog” and “cat”, then the net
adjustment would be +5. The full text clauses do not need to be simple, as shown
with the dinosaur adjustment. The dinosaur adjustment also illustrates that the
relevance can be boosted by a relative percentage. The text clause can also specify
text metadata regions and include complex parameters…
SELECT "OTObject" where "accident" ORDEREDBY Score[100]
([region "model"] in ("ford","Toyota","gm") and [region
"Date"] > "-2m") BOOST[+15]
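The net effect of the animal example can be modeled as plain arithmetic. In this sketch, the dinosaur clause is assumed to apply a relative boost of 40 percent; the flags stand in for whether each boost clause matched the result:

```python
def apply_boosts(base: float, has_dog: bool, has_cat: bool,
                 has_dino: bool) -> float:
    """Additive and relative boosts from the example 'animal' query."""
    score = base
    if has_dog:
        score -= 10        # BOOST[-10] for "dog"
    if has_cat:
        score += 15        # BOOST[+15] for "cat"
    if has_dino:
        score *= 1.40      # assumed 40% relative boost
    return score

# Both "dog" and "cat" present: the net additive adjustment is +5.
assert apply_boosts(50, True, True, False) == 55
```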
Date Boost
This boost method is used to adjust the relevance based on how closely the value in
a Date region matches a target date. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Date,"region","target",range,adjust]
Region is the metadata field in the search index that should be tested.
Target is the date we are comparing against.
Range is an integer number of days on either side of the target for which a
boost adjustment should be applied.
Adjust is an integer value that specifies the maximum adjustment to be
applied if the value in the region is an exact match for the target. The
adjustment is reduced in a linear fashion based on distance from the target.
An example is in order.
SELECT … ORDEREDBY Score[100]
BOOST[Date,"OTCreateDate","20140415",60,40]
This boost essentially states: Examine the value in OTCreateDate for each matching
search result. If the value is April 15 2014, then add 40 to the relevance score. If the
value in the OTCreateDate field is within 60 days of April 15, then add a pro-rated
value. For example, if the value in OTCreateDate was May 30 (45 days away), then
adjust the relevance score by 10 (which is 40 * (60-45)/60).
The intent of this type of boost is to help users find items based on dates. A typical
use case might be “I am trying to find a document that I think was issued June of
2000, but maybe I am off by 6 months”. Any document in that +/- 6 month range gets
a boosted relevance, with a higher adjustment the closer to the target date.
Another common application would be adjusting for recentness, where the target
date is today, and all objects with dates within 90 days receive an adjustment.
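The linear pro-rating described above is easy to verify. This sketch mirrors the BOOST[Date,...] arithmetic, taking the distance in days from the target as input:

```python
def date_boost(days_from_target: int, range_days: int, adjust: int) -> float:
    """Linear date boost: full adjustment at the target date, falling
    to zero at the edge of the range."""
    distance = abs(days_from_target)
    if distance > range_days:
        return 0.0
    return adjust * (range_days - distance) / range_days

# BOOST[Date,"OTCreateDate","20140415",60,40]
assert date_boost(0, 60, 40) == 40.0    # exact match on April 15
assert date_boost(45, 60, 40) == 10.0   # May 30: 40 * (60-45)/60
assert date_boost(61, 60, 40) == 0.0    # outside the window
```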
Integer Boost
This boost method is designed to allow a range of values to be mapped to a
relevance contribution. For example, if there was a “usefulness” rating for a
document on a scale of 1 to 10, you could use that range to boost relevance on the
objects. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"region",lower,upper,adjust]
For example:
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"Popularity",100,200,30]
This boost essentially states: Items with a Popularity value greater than 100 and less
than or equal to 200 will receive a relevance boost of up to 30. A value of 200 gets
the maximum adjustment of 30. A value of 120 would get a boost of 6 [ =30*(120-
100)/(200-100) ].
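The Popularity example follows the same linear pro-rating as the date boost. This sketch reproduces the calculation, treating the range as exclusive of the lower bound and inclusive of the upper, as the description states:

```python
def integer_boost(value: int, lower: int, upper: int, adjust: int) -> float:
    """Linear integer boost: values in (lower, upper] receive a
    pro-rated share of the maximum adjustment."""
    if value <= lower or value > upper:
        return 0.0
    return adjust * (value - lower) / (upper - lower)

# BOOST[Integer,"Popularity",100,200,30]
assert integer_boost(200, 100, 200, 30) == 30.0  # top of range: full boost
assert integer_boost(120, 100, 200, 30) == 6.0   # 30 * (120-100)/(200-100)
assert integer_boost(100, 100, 200, 30) == 0.0   # must exceed the lower bound
```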
So why are there separate methods for Dates and Integers? The Date and Integer
boost features allow the boost adjustment to be varied depending on how close the
values are to a target, versus the all or nothing adjustment that occurs with Query
Boosting. If you have applications where getting close is useful, rather than only
matching exactly, then Date or Integer Boosting is superior.
Tuning Relevance for Content Server
The first step is to consider your application and user expectations. In some cases,
search relevance won’t be an issue. For example, if you always sort results by date
or a metadata region, then search relevance scores are immaterial. If your primary
objective is building collections for eDiscovery applications, then gathering all search
results is far more important than which ones show up at the top of the list.
For most customers however, a review of their search expectations and some
Content Server 16 considerations are in order.
Date Relevance
This is usually an important factor. Content Server has many ‘Date’ fields, where the
date represents specific information. Consider some of the following:
Creation Date – usually refers to the date an object was added to the system. Often
this is a good value for relevance, but the creation date only refers to the first version.
Versioned objects which are updated will not change this date, which reduces its
value for these data types.
Version Creation Date – for versioned objects, such as documents, this is a good
choice. Each version of the object gets an updated version creation date. On the
other hand, many objects do not have the concept of a version creation date.
Modified Date – for some types of objects, such as folders, the modified date clearly
identifies when the folder has been created or updated. However, for other types of
objects, the modified date is too volatile. Depending upon other settings in Content
Server, the modified date may change for many reasons, and therefore does not
reflect the user expectation for when an object has truly changed.
Understanding which types of objects are most important in your application for
search relevance will help you determine which Content Server date values should
be used for date relevance scoring.
There are several other date fields in Content Server that may also be used. Review
the types of objects that are most important for your application, and choose dates
that best reflect creation or change that users would consider material to search
relevance. Recent experiments suggest that new default values for Content Server
using both the Creation Date and the Version Creation Date, with relatively high
weights, may be a good choice for typical document management and workflow
applications.
There are now more than 20 MIME types officially used to represent Microsoft
Office 2007 files alone, as documented on the Microsoft TechNet web site.
For new installations of Content Server, the use of MIME types and OTSubTypes for
Type Ranking is discouraged in favor of using OTFileType instead. OTFileType is
generated by the Document Conversion Server during indexing, and gives every
object a type such as “Microsoft Word”, “Adobe PDF” or “Audio”. This greatly
simplifies constructing the Type Rank, and improves accuracy.
Note that OTFileType was introduced in Content Server 10 Update 5, with some
minor tuning since then. If you have older data, then you may need to re-index the
objects. Details about the values for OTFileType are not included in this document.
Some of the more common values you may want to configure for Type Ranking using
the OTFileType region might be:
Word, Excel, PowerPoint, PDF, Folder, “Web Page”, Text, Audio, Video or Email.
Are HTML pages a key part of your data? Consider adding the HTML keywords
region to the default search regions.
Some applications, such as eDiscovery, are biased towards searching all possible
regions. The challenge is this: more default search regions results in slower query
performance. For small numbers of regions, this is not an issue. For eDiscovery,
with thousands of potential Microsoft Office document properties, this performance
degradation can be material. The “Aggregate-Text” features of the search engine
may be helpful for these situations.
Using Recommender
Recommender is a feature of Content Server which monitors user activity, and
leverages the Object Ranking feature of the search engine to boost the relevance
scores of certain objects. Specifically, the feature of Recommender known as
“Object Ranker” is responsible for computing relevance adjustments and triggering
the appropriate indexing updates. You can review the use of Recommender in the
Content Server documentation.
User Context
Statistically, a user is more likely to be searching for objects that meet one or more of
these types of criteria…
• It is located in my personal work area;
• It was created by me;
• It is located in the folder in which I am currently working;
• It is located in a sub-folder of my current location;
• It is in a location where I was recently working.
OTSE has no knowledge of the user performing a search. Content Server, however,
is aware of the user identity and location. New to Content Server 16, the relevance
boost features allow user context to be incorporated in relevance computation. For
example, each query could specify that items with the current user in the “created by”
metadata fields are emphasized, or that objects in specific locations and folders have
their relevance score enhanced. You should review these configuration settings in
Content Server, and adjust them to reflect your expected user behaviors.
Enforcing Relevancy
Adding Ranking Expressions to a search query results in more work for the Search
Engines. If the default relevance computation is performed (based on the WHERE
clause), then no material penalty occurs since the values are already retrieved as
part of the query evaluation. The Search Engines have an optimization that will
determine if the Ranking Expression is the same as the WHERE clause, in which
case the Ranking Expression computation is skipped. In updates of Content Server
prior to December 2015, the Ranking Expression differed from the WHERE clause,
which reduced query performance.
There is a configuration setting that will ignore the Ranking Expression and enforce
use of the default WHERE clause ranking. Effectively, this is the same as using
ORDEREDBY RELEVANCY in the query. For older updates of Content Server running
the 2015-12 or later search update, this setting can be used to achieve a modest
search query performance gain. In the search.ini file [Dataflow_] section, add:
ConvertREtoRelevancy=true
Thesaurus
OTSE has the ability to search not only for keywords, but for synonyms of keywords,
using a thesaurus system. This section of the document explores the use of a
thesaurus with OTSE.
Overview
Searching with a thesaurus specified allows a query to match synonyms of words.
For example, the English thesaurus might have an entry for house which includes
“home”, “residence” and “dwelling”. A search for the keyword “house” would also
match any of those words if the thesaurus is enabled.
The list of synonyms to be used is contained within a thesaurus file. You can have
many thesaurus files, and each query can specify which thesaurus file should be
used. In practice, this flexibility is generally used to select a thesaurus containing
synonyms for a particular language. OTSE ships with a number of standard
thesaurus files: English, French, German, Spanish, and Multilingual.
It is also possible to use a thesaurus to help find specialized words in specific
applications. For example, a medical thesaurus file could contain alternate names for
drugs, symptoms or other medical terminology. A custom corporate thesaurus could
contain synonyms for products, part numbers, customer names or departments.
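Conceptually, thesaurus search is query expansion: the engine matches the keyword or any of its synonyms. This sketch models the behavior with a hypothetical in-memory synonym table (real OTSE thesauri are compiled binary files built from WordNet data):

```python
# Hypothetical synonym table for illustration.
THESAURUS = {"house": ["home", "residence", "dwelling"]}

def expand(term: str) -> list[str]:
    """Expand a query keyword to itself plus its synonyms, i.e. the set
    of words a thesaurus-enabled search for the term would match."""
    return [term] + THESAURUS.get(term, [])

assert expand("house") == ["house", "home", "residence", "dwelling"]
assert expand("scooter") == ["scooter"]  # no entry: term passes through
```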
Thesaurus Files
Thesaurus files should be placed in the “config” directory. They should follow a
naming convention of “thesaurus.xxx”, where xxx defines the language and identifies
the thesaurus file as provided in the search query. By convention, OpenText default
thesaurus files are provided for English, French, German, Spanish and Euro
(multilingual) as follows:
thesaurus.eng
thesaurus.fre
thesaurus.ger
thesaurus.spn
thesaurus.eur
Thesaurus files are stored in a proprietary file format which is optimized for
performance and size. These files are created using a thesaurus builder utility, which
converts a thesaurus from the Princeton WordNet format to the OpenText thesaurus
format.
Thesaurus Queries
In order to leverage a thesaurus in a search query, you choose the thesaurus using
the “SET” command, and specify thesaurus use for a search term using the
“thesaurus” operator in the query select statement.
set thesaurus eng
select “OTName” where thesaurus “home”
The value for the language (in this case “eng”) must match the extension of the
thesaurus file. The SET statement is optional; the default language setting for the
thesaurus is English.
The “thesaurus” operator in the select statement only applies to simple single terms –
it cannot be combined with other features such as proximity, stemming, wildcards or
phrase search.
Stemming
Stemming is a method used to find words which have similar root forms, called
“stems”. The easiest way to explain stemming is by example.
The words flowers, flowering and flowered all have the same stem: flower. When
stemming is applied during a search, then a search for one of these words would
match any of these words.
The special terminology “stem” is used since the common element is not always a
word. For instance, for algorithmic reasons, the stem for “baby” might be “babi”,
which facilitates matching words such as babied or babies.
Stemming algorithms are not foolproof. In our example of “flower”, the stemming
algorithm might identify that “flow” is the stem – and try to find matches such as
flows, flowing or flowed. Stemming is a useful tool, but cannot always be relied upon
to behave as a user expects.
The concepts that make stemming possible are not applicable to all languages. In
general, Western European languages can use stemming, since plurals, tenses and
gender are typically formulated in terms of appending different endings to root forms
of words. Accordingly, the algorithms for stemming are different for each language.
There are many languages, such as East Asian languages, where the concept of
stemming does not apply.
Because of the language-specific aspects of stemming, a search engine has many
options available for how stemming should be implemented. One approach is to
stem words during indexing, and create an index of word stems. This can result in
very fast searches (since the stems are all pre-computed), but requires that you know
the language at index time. If only one language will ever be used, this is
acceptable. In multi-language environments, it is less useful. Some search
implementations will guess at the language during indexing and stem accordingly,
which is statistically useful but not always correct.
OTSE applies stemming rules at query time. This reduces the size of the index
(since word stems are not stored), but has a query performance penalty since the
stems for candidate words must be computed for each query.
The other key advantage of query-time stemming is that true multi-lingual stemming
can be used. Consider an index containing the following words:
Arrives (in English documents)
Arrivons (in French documents)
Arriva (in Spanish documents)
Each of these words might have the same stem (“Arriv”). By applying the stemming
algorithm at query time, the search system can differentiate between the English,
French and Spanish forms of the word based on the language preferences used for
stemming, since the English algorithms would not generate query expansions for the
words arrivons or arriva. This approach is not perfect, since in many cases similar
languages have common rules. For example, the French word “arriver” would match
the English stem for “Arrived”, since the postfix “er” is also common in the English
language.
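The arrives/arrivons/arriva behavior can be sketched with naive per-language suffix stripping. This is for illustration only; real stemmers apply far more elaborate rules, and the suffix lists here are invented:

```python
# Naive, invented suffix lists; real English/French stemmers are much richer.
SUFFIXES = {
    "eng": ["ing", "ed", "es", "s", "e"],
    "fre": ["ons", "ez", "er", "es", "e"],
}

def stem(word: str, lang: str) -> str:
    """Strip the longest matching suffix for the chosen language."""
    for suffix in sorted(SUFFIXES[lang], key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_matches(query: str, lang: str, dictionary: list[str]) -> list[str]:
    """Query-time stemming: compare the query's stem against the stem of
    each candidate dictionary word under the same language rules."""
    target = stem(query, lang)
    return [w for w in dictionary if stem(w, lang) == target]

words = ["arrives", "arrivons", "arriva"]
assert stem_matches("arrive", "eng", words) == ["arrives"]
assert "arrivons" in stem_matches("arrive", "fre", words)
```

Under the English rules the French form arrivons is not expanded, which is exactly the differentiation query-time stemming provides.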
OTSE supplies stemming rules for 5 languages: English, French, German, Spanish
and Italian. When building a search query, you request the stemming rules in the
“SET” command, using the language preference. To request a match for keyword
stems, use the “stem” operator on a keyword in the select statement:
SET language fre
select “OTName” where stem “arrive”
The stem operator does not work in conjunction with other operators, such as
proximity, wildcards and exact phrase searches.
Phonetic Matching
Phonetic matching, or “sounds like” algorithms, are used to match words that have
similarities when spoken aloud. There are many possible algorithms that can be
used for phonetic matching, and OTSE contains a phonetic matching algorithm which
is a variation of the classic US Government ‘Soundex’ algorithm.
Phonetic algorithms are primarily designed to help match surnames, particularly
where the names have been transcribed with potential errors. Matching surnames is
of particular interest for a number of reasons:
• Many surnames were recorded as phonetic equivalents from other languages, often
with variations in spelling.
• A name which sounds generally similar may in fact have different spelling, particularly
with language variations. Consider the dozens of variations of the name “Stephen”
that exist, including Steven, Steffen, Steffan, Stephan, Steafán, and Esteban.
• There is no master dictionary that contains a “right” way to spell a surname, so it is
common for people hearing a name to write it as they think it should be spelled.
Smith, Smithe, and Smyth are all legitimate surnames – you cannot perform spelling
correction, since they are all correct.
• In many applications, names are recorded over a poor quality phone connection,
which can introduce errors. I say ‘Pidduck’, the recipient hears and records ‘Pittock’.
All phonetic matching algorithms share some common attributes. They relax the
rules for matching search terms in certain ways. The result is more terms matching,
but with a decrease in accuracy. This decrease occurs because the algorithms can
match words which are clearly not related, despite having similar phonetic properties.
Matching “Schmidt” when querying for “Smith” makes sense. But you also need to
be prepared for false matches, such as finding “Country” when searching for
“Ghandi”.
Phonetic matching is generally NOT recommended for general keyword searching. It
is intended for use with names, and works best when applied against a metadata
region which is known to contain names. Otherwise, the number of false positives
will almost certainly be frustrating to a user.
There is one phonetic matching algorithm within OTSE, a modified “Soundex”
implementation. This algorithm is optimized for English. However, the algorithm is
sufficiently generic that it does provide useful results for many Western European
languages. The phonetic matching does not work for non-European languages.
To request a phonetic match for a keyword in a query, use the modifier ‘phonetic’:
Select X where [region "UserName"] phonetic "smith"
A phonetic modifier can only be applied to a simple keyword, and cannot be
combined with other features such as proximity, wildcards, regular expressions or
exact phrase searches.
There are two dictionaries of terms within the search engine, the primary dictionary
for terms that are “typical” western language words (Western European characters,
no punctuation or numbers), and the secondary dictionary for everything else.
Phonetic matching searches only for terms that meet the criteria for inclusion in the
primary dictionary.
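The details of OTSE's modified algorithm are not published, but the classic American Soundex it derives from is well documented and shows the general idea: keep the first letter, map similar-sounding consonants to the same digit, and collapse adjacent duplicates. A sketch:

```python
def soundex(name: str) -> str:
    """Classic four-character American Soundex code (first letter plus
    three digits), not OTSE's modified variant."""
    mapping = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            mapping[ch] = digit
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    code = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":   # 'h' and 'w' do not separate duplicate codes
            prev = digit
    return (code + "000")[:4]

# Spelling variants collapse to the same code.
assert soundex("Smith") == soundex("Smyth") == soundex("Schmidt") == "S530"
assert soundex("Pidduck") == soundex("Pittock") == "P320"
```

Note that Pidduck and Pittock, the transcription example above, receive identical codes, which is precisely how a phonetic search recovers such errors.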
Users might be accustomed to working only with a subset of the complete value, and
expect to find matches using arbitrary substrings of the value, such as:
Acme:SSU 87
ACF/24
F/24 3.5inches
The traditional searches using tokens, regular expressions and Like operators are
not sufficient.
Configuration
The implementation of exact substring matching is configured on a per-region basis,
and is valid only for text metadata regions. A custom tokenizer ID is configured for
the region in the LLFieldDefinitions.txt file; the custom tokenizer is specified in the
search.ini file; the custom tokenizer is constructed to encode the entire value using 4-
grams.
For example, in search.ini file [DataFlow_xxx] section:
RegExTokenizerFile2=c:/config/otsubstringExact1.txt
In LLFieldDefinitions.txt:
TEXT MyRegion FieldTokenizer=RegExTokenizerFile2
Note that there is an alternative mechanism available for specifying the entry in the
LLFieldDefinitions.txt file. The search.ini file can be used to logically append lines to
the field definitions file at startup (the file is not actually modified). This alternative
can be used by Content Server to control the configuration, since Content Server
does write the search.ini file.
ExtraLLFieldDefinitionsLine0=TEXT MyRegion
FieldTokenizer=RegExTokenizerFile2
Re-indexing is not required. When the Index Engines are next started, a conversion
of the index for the region will be performed. You can apply or remove a custom
tokenizer this way for existing data.
By convention, tokenizers should be located in the config\tokenizers directory.
Content Server uses this location to present a list of available tokenizers to
administrators.
Substring Performance
A region indexed for exact substring matching will require about 8 times as much
space for storing the index for that region. In a typical situation, with only a few
regions configured this way, the storage requirement difference will be minimal.
Exact substring configuration is only possible when the “Low Memory” mode
configuration is enabled for text metadata.
When a region is configured for exact substring matching, every query is equivalent
to having wildcards on either side of the query string. In the example above, a
search for “SSU 87” is effectively a search for “*SSU 87*”. No other operators
(comparisons, regular expressions, etc.) are allowed with regions configured for
exact substring searches.
The exact substring is usually much faster than a regular expression because of the
way the indexing is performed. By way of example, assume the indexed value is:
abcdefghijk. Using 4-grams, the following tokens are added to the dictionary: abcd,
bcde, cdef, defg, efgh, fghi, ghij and hijk. Suppose you want to search for cdefgh. The
query engine will first look for the first 4-gram, “cdef”, which is fast because it is in the
dictionary. It then looks for all 4-grams starting with “gh**”, and finds values with
adjacent “cdef + gh**” 4-grams. While there may be a number of 4-grams for the
regions beginning with “gh”, this is much more efficient than scanning the entire
dictionary with a regular expression to find matches.
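The abcdefghijk example can be sketched directly. This simplified model builds the 4-gram dictionary entries and checks that the query's grams occur at consecutive positions (it elides the dictionary data structures and the gh** prefix scan):

```python
def four_grams(value: str) -> list[str]:
    """All overlapping 4-grams of a string."""
    return [value[i:i + 4] for i in range(len(value) - 3)]

# Indexing "abcdefghijk" produces these dictionary entries:
grams = four_grams("abcdefghijk")
assert grams == ["abcd", "bcde", "cdef", "defg", "efgh", "fghi",
                 "ghij", "hijk"]

def substring_match(indexed: str, query: str) -> bool:
    """Simplified lookup: locate the query's first 4-gram in the indexed
    value, then confirm the rest of the query follows contiguously."""
    positions = [i for i, g in enumerate(four_grams(indexed))
                 if g == query[:4]]
    return any(indexed[p:p + len(query)] == query for p in positions)

assert substring_match("abcdefghijk", "cdefgh")
assert not substring_match("abcdefghijk", "cdxfgh")
```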
Substring Variations
The choice of Tokenizer determines the behavior of substring matching. The usual
suggested tokenizer would make the data case-insensitive, but otherwise leave all
other characters unchanged, including whitespace and punctuation.
Case sensitivity requires additional mappings in the tokenizer file. By default, the
tokenizer performs upper to lower case conversion. To preserve case sensitivity, add
a section to the start of a tokenizer file:
mappings {
0x41=0x41
0x42=0x42
0x43=0x43
…
}
Include a mapping to itself for every character that requires case preservation.
Ensure that suitable mappings for non-ASCII characters are included if those are
important for your application.
The other use case to be aware of is punctuation normalization or elimination.
Consider the example which includes ACF/24 in the value. If users are not expected
to use the slash character “/” correctly, there are a couple of variations that
may be used. Normalization would convert all (or a desired set) of punctuation to a
standard value, perhaps Underscore. The string would be indexed as if it had the
value:
“Vendor_Acme_SSU_876MJACF_24_3_5inchesus”
If the user searches for Acme-SSU or ACF:24, the engine would similarly convert the
queries to “Acme_SSU” and “ACF_24”, which would then match.
Similarly, elimination strips all whitespace and punctuation from index and query
values. The index is built from:
“VendorAcmeSSU876MJACF2435inchesus”
With elimination, the test queries “Acme-SSU” or “ACF:24” are handled as if they
were “AcmeSSU” or “ACF24”, again generating a match. Eliminating punctuation is
generally better at finding a match (since it also handles extraneous whitespace), but
is not as precise – potentially returning some false positives.
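The two variations amount to different character mappings applied identically to indexed values and queries. A minimal sketch, using a regular expression in place of the tokenizer mapping tables:

```python
import re

def normalize(value: str) -> str:
    """Map every character outside [0-9a-z] to an underscore."""
    return re.sub(r"[^0-9a-z]", "_", value.lower())

def eliminate(value: str) -> str:
    """Strip punctuation and whitespace entirely."""
    return re.sub(r"[^0-9a-z]", "", value.lower())

# Either variant lets queries with the "wrong" separator still match.
assert normalize("Acme-SSU") == normalize("Acme SSU") == "acme_ssu"
assert eliminate("ACF:24") == eliminate("ACF/24") == "acf24"
```

Because elimination also discards whitespace, it matches more freely than normalization, at the cost of occasional false positives, as noted above.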
Included Tokenizers
Customizing a tokenizer can be a challenge. To facilitate substring matching, there
are 3 tokenizers provided with OTSE that cover the most common exact substring
requirements, in addition to the default tokenizer.
ExactSubstringTokenizer.txt
This tokenizer is case insensitive, but otherwise preserves all punctuation and
spaces.
TolerantSubstringTokenizer.txt
This tokenizer eliminates all punctuation and whitespace. The strings “12-3.MY
name” and “123-m&n_amE” are equivalent, being interpreted as “123myname” in
both queries and indexed values.
EmailAddressTokenizer.txt
This tokenizer treats email addresses in common forms as a single token. With
the traditional tokenizer, [email protected] would be 5 tokens, as the
punctuation would be interpreted as white space. The email address tokenizer
would leave the email address intact as a single token. Searching on a single
token for email is faster and more accurate than a phrase search for multiple
tokens.
Problem
Many metadata values are not natural human language; part numbers and file names are primary examples. A human might describe a part
for a machine as: “the 14 centimeter widget that fits jx27 engine”. Instead, we create
names along the lines of “PN3004/widget-14JX27”. Search technology that is
trying to formulate tokens and patterns based on regular sentence structure and
grammar rules will struggle to match these types of values.
Similarly, we create file names such as “SalesForecast2013-europeFRANCE
Rene&Gina1.doc”. With file uploads and Internet encoding, this can even inject
strings such as %20 or & into the metadata values. Again, algorithms designed
to parse human language have difficulty succeeding with these metadata fields.
Like Operator
To accommodate these types of metadata search requirements, OTSE includes the
concept of a “Likable” region. If you have metadata that fits the problem profile, a list
of the appropriate metadata regions can be declared as Likable in the search.ini file:
OverTokenizedRegions=OTFileName,MyParts
This instructs the Index Engines to build a “shadow” region derived from the original
metadata region, but using a very different set of rules for interpreting the metadata
and building tokens. For example, the traditional indexed tokens for our sample part
number and file name values might be:
pn3004 widget 14jx27
salesforecast2013 europefrance rene gina1 doc
When a query using the like operator is processed, the query is also tokenized
using the alternate rules, and is tested against the shadow region instead of the
original region. In this case, the following queries would succeed that would typically
fail using normal human language tokenizing rules:
where [region "OTFileName"] like "gina 2013 sales forecast"
where [region "MyParts"] like "JX27 widget 3004"
If the like operator is requested for a region that does not support it, then the
operator is treated as an “AND” between the provided terms, and applied against the
original region instead of the shadow region.
Like Defaults
Since many users will not understand the requirement to specify the “like” operator
in a query, a configuration option is provided in the search.ini that allows the use of
Like as the default operator.
UseLikeForTheseRegions=OTName,OTFileName
If a query for a token or phrase is requested against one of these regions and there is
no explicit term operator provided, then Like will be assumed. This also works if the
region is in the list of default search regions. For example, the common Content
Server region OTName can be both a default search region and have the Like
operator applied by default. Note that Content Server can be configured to inject a
default operator like stem into a query term, which would negate using like by
default.
There is also a configuration setting that controls whether stemming should be used
when searching with Like queries. By default, this feature is active. If there is a term
component in a query that is 3 letters or longer, then either the singular or plural form
will match. To disable this feature and only match the entered values, in the
[Dataflow_] section of the search.ini file:
LikeUsesStemming=false
Shadow Regions
The synthetic shadow regions built to support the Like operator have some
properties of interest. They are created when the Index Engines start based upon
the search.ini settings. This adds some time to startup, but allows the Like feature
to be applied to existing data sets without re-indexing. The shadow regions are
saved on disk as part of the index until removed from the list of over-tokenized
regions, which also occurs on Index Engine restart.
The shadow regions have the same names as their masters, appended by
_OTShadow. If the region OTName is configured as likable, then the synthetic region
OTName_OTShadow is created. These regions consume space in the index. Due
to the extra tokenization, the space requirements for shadow regions are higher than
for equivalent normal text regions.
The shadow regions will show up in lists of regions, and are also directly queryable or
retrievable. Although not the intended use, it is valid to perform other queries on
these regions.
Limitations
Multi-value text regions have some limitations in behavior. The aggregate strings
from all the values are gathered together to create a single region that is tokenized
for the like operator. This means that there is no ability to combine the like
operator with specific values, such as might be expected when using attributes to
represent languages.
For example, a multi-value text region might have the English value “RedCar88” and
the French value “VoitureRouge88”. The like operator does not support examining
only one language. A search for “like RedVoiture” would match this object.
A second limitation is hit highlighting. Hit highlighting operations are not processed
using the like operation, which means a likely mismatch between tokens in the
original metadata value and the tokens matched in the shadow region during a query.
It is unclear what the correct operation should be, given the existence of the shadow
regions and one-to-many relationships between tokens and the original values. At
this time, hit highlighting ignores the like aspect of a query.
Most importantly, the like operator may generate many more search results than
expected. Due to the nature of part numbers and the behavior of the tokenization,
many small and common tokens can be generated. The like operator is biased
towards finding candidate search results, not towards filtering results to a most
probable match.
At this time there is no relevance adjustment based on the quality of the match in a
shadow region.
User Guidance
The description of the like operator so far may be good background into
configuration and applications, but does not provide much practical advice for an end
user. The normal warning that this guidance is not applicable in all situations applies.
Suggestions for a user trying to maximize success using a metadata region with the
like operator may include:
• Select fragments that appear to be logically distinct.
• Use spaces in place of punctuation.
• Do not enter a fragment of a longer numeric sequence as a search term.
• Do not enter a fragment of a text sequence as a search term.
• Do not use wild card operators.
• More terms or portions of the part number will be more precise.
An example using a fictitious part number string in a metadata region:
PN4556-WidgetRED01395b/v5.68.99 $2,867
Email Domains
Where multiple identical email domain values exist for an object in the email region,
duplicates will be removed. This behavior is important given that many recipients of
an email message are often in the same organization or email domain.
In the search.ini file there are several configuration settings for tuning and enabling
email domain search capabilities. The main setting to enable or disable the feature is
a comma-separated list of text metadata regions that should be treated as email
regions. By default, this list is empty:
EmailDomainSourcesCSL=OTEmailSender,OTEmailRecipient
When you add or remove regions from the email domain list, the changes take effect
the next time the Index Engines are started. At startup, any new email domain
regions will be created and the values populated. This may add 10 or so minutes to
the first startup process. Likewise, if any regions were removed, they will be deleted
from the index at next startup.
Tuning of the behavior is possible with remaining configuration settings. By default,
_OTDomain is used as the suffix for the email domain regions, but this can be
adjusted. There is an upper limit on the number of distinct email domain values that
will be retained for a given email value, which defaults to 50. If you anticipate longer
lists of email domains, this value can be adjusted upwards. Finally, the separators
used to delimit an email domain can be defined. When indexing, a simple rule is
used that text after the @ symbol up to a separator character represents the email
domain. The separators are defined in the search.ini file, and default to comma,
colon, semicolon, various brackets and whitespace. The separator string must be
compatible with a Java regular expression.
EmailDomainFieldSuffix=_OTDomain
MaxNumberEmailDomains=50
EmailDomainSeparators=[,:;<>\\[\\]\\(\\)\\s]
An example: if a multi-value email region for an indexed object has the values:
<OTEmailSender>[email protected]</OTEmailSender>
<OTEmailSender>[email protected]</OTEmailSender>
<OTEmailSender>[email protected]</OTEmailSender>
<OTEmailSender>[email protected]</OTEmailSender>
<OTEmailSender>[email protected]</OTEmailSender>
The OTEmailSender_OTDomain for that object will have effective values of:
<OTEmailSender_OTDomain>gmail.com</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>acme.com</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>smith.com</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>other.co.uk</OTEmailSender_OTDomain>
The same _OTDomain values would exist if a single value email region contains the
string:
<OTEmailSender>[email protected], [email protected][Robert]
[email protected];[email protected](“MightyBob”);
[email protected]</OTEmailSender>
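The extraction rule described above, "text after the @ symbol up to a separator character", plus de-duplication, can be sketched as follows. The addresses in the sample string are invented for illustration; only the domains come from the example above:

```python
import re

# The default separator set from the search.ini example above.
SEPARATORS = r"[,:;<>\[\]\(\)\s]"
DOMAIN_RE = re.compile(r"@(.+?)(?:" + SEPARATORS + "|$)")

def email_domains(value: str, limit: int = 50) -> list:
    """Text after each @ up to a separator is the domain; duplicates
    are removed, keeping the first occurrence, capped at `limit`."""
    seen = []
    for d in DOMAIN_RE.findall(value):
        d = d.lower()
        if d not in seen:
            seen.append(d)
    return seen[:limit]

# Hypothetical single-value sender string (local parts invented):
value = ("alice@gmail.com, bob@acme.com[Robert] carol@acme.com;"
         "dan@smith.com(\"MightyBob\");eve@other.co.uk")
assert email_domains(value) == ["gmail.com", "acme.com",
                                "smith.com", "other.co.uk"]
```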
A first observation: to be compatible with the TEXT operator, the paragraph-end
“CRLF” characters and closing braces “)” in the source were removed.
The Text operator would then analyze the text, discarding short words and top words.
Statistical analysis would select notable words (and phrases). Although not in this
text example, overly long words or lists of numbers would be ignored. The resulting
set of 8 to 15 terms would then be used internally with stemset, with an effective
internal query something like:
stemset(80%,alice,sister,book,"pictures or
conversations",rabbit,considering,trouble,picking,pleasure,
sleepy,stupid,daisies)
This would in turn match all items that have 80% or more of those terms and
phrases in the full text of the object. In general, numbers are dropped from
consideration in TEXT queries. However, if the provided block of TEXT is relatively
short (less than about 250 characters), numbers will be included if necessary to meet
the minimum number of terms.
The TEXT operator has a number of configuration settings. See the Top Words
section below for more settings.
Performance degrades with more words used in stemset, while accuracy drops with
too few words. The upper limit on the number of terms and phrases to use is:
TextNumberOfWordsInSet=15
For accuracy, such as trying to match exact documents, termset is a better choice.
Otherwise, stemset is used to find more objects with singular/plural variations but
runs slightly slower:
TextUseTermSet=true
The percentage of matches with termset and stemset can be adjusted. Low values
find more objects with less similarity (e.g., 40%). Higher values, such as 80%, require
better matches with the source material:
TextPercentage=80
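The percentage matching can be modeled simply: an object qualifies when at least the configured fraction of the query's terms appear in it. This is a sketch of the semantics only; the engine evaluates this against its inverted index, and the sample tokens below are drawn from the stemset example above:

```python
def set_match(doc_tokens, terms, percentage=80):
    # An object matches when at least `percentage` percent of the
    # query terms appear in its full text.
    present = sum(1 for t in terms if t in doc_tokens)
    return present * 100 >= percentage * len(terms)

doc = {"alice", "sister", "book", "rabbit", "daisies",
       "sleepy", "stupid", "trouble", "picking", "pleasure"}
terms = ["alice", "sister", "book", "rabbit", "considering",
         "trouble", "picking", "pleasure", "sleepy", "stupid"]
# 9 of 10 terms are present: 90% >= 80%, so the object matches.
assert set_match(doc, terms, 80)
assert not set_match(doc, terms, 95)
```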
Top Words
The TEXT query operator is specifically designed to efficiently locate good quality
results when provided with large blocks of text. In this particular scenario, overly
common words are of little value, and need to be discarded. In OTSE, the Top
Words feature is used for this purpose.
Top Words are those words which are found within a large percentage of the
documents. For example, the OpenText corporate document management system
has the word OpenText in many documents, and hence it is eliminated from TEXT
queries. Top Words are determined based upon the percentage of objects containing
a word. For example, if more than 30% of objects contain the word ‘date’, then ‘date’
is added to the Top Words list.
Top Words are computed independently for each search partition. Usually, more
partitions are added over a prolonged period. If the frequency of words changes over
time, then newer partitions will have slightly different Top Words than older partitions.
This also means that TEXT queries which eliminate Top Words might construct
slightly different queries on each partition.
The Top Words are first computed for a partition once it contains approximately
10,000 objects. On reaching 100,000 and 1,000,000 objects, the list is discarded
and recomputed. This helps to ensure that the Top Words properly reflect the
contents of the partition. The Top Words are stored in a file that is not human
readable, and has the name topwords.10000, with the number changing to reflect the
size. If the topwords.n file is missing, it will be generated during next startup or
checkpoint write.
The threshold for selecting Top Words is a real number that should be between 0.01
and 0.99, representing the fraction of objects in the partition that contain the word.
The default value is 33%, which in some typical partitions larger than 1 million
objects generated a Top Words list of about 750 words. Larger fractions result
in fewer Top Words. In the [Dataflow_] section:
TextCutOff=0.33
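The selection rule is a document-frequency cutoff, which can be sketched as follows (a simplified model; the engine computes this per partition from its index, and the sample corpus is invented):

```python
from collections import Counter

def top_words(documents, cutoff=0.33):
    # A word is a Top Word when the fraction of documents containing
    # it exceeds the cutoff (document frequency, not raw word count).
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))
    n = len(documents)
    return {w for w, c in df.items() if c / n > cutoff}

docs = ["date of meeting", "review date", "date payment due",
        "unrelated note", "final summary", "quarterly report"]
# "date" appears in 3 of 6 documents (50% > 33%), so it is a Top Word:
assert top_words(docs) == {"date"}
```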
If the Top Words features are not required, generation and use can be disabled by
setting:
TextAllowTopwordsBuild=false
Stop Words
Stop words are words which are considered too common to be relevant, or do not
convey any meaning, and are therefore stripped from search queries, or potentially
not even indexed. For English, a typical list of stop words would contain words such
as:
a, about, above, after, again, against, all, am, an,
and, any, are, aren't, as, at, be, because, been,
before, being, below, between, both, but…
The potential advantage of stop words is a reduction in the size of the search index.
However, use of stop words introduces several limitations for search.
If stop words are applied at indexing time, certain types of queries become
impossible. A Shakespearean scholar could never find Hamlet’s soliloquy “to be or
not to be”, since all of those words are considered stop words, and would not be in
the index.
Another reason to not apply stop words during indexing is the multi-lingual capability
of OTSE. The Spanish word “ante” is very common, so it should be a stop word, and
not indexed. However, in English, this is an uncommon word, so it clearly should be
indexed.
As a result, the search engine does not use stop words during indexing, nor are they
applied as a general rule during search queries. However, there is a closely related
capability known as Top Words that is used under special circumstances.
Accumulator
The Accumulator is an internal component of the Index Engines which is responsible
for gathering the tokens (or words) that are to be added to the full text search index.
A basic understanding of the Accumulator is useful when considering how to tune
and optimize an OTSE installation.
As objects are provided to the Index Engine, the full text objects are broken into
words using the Tokenizer, and added to the Accumulator. When the Accumulator is
full, this event triggers creation of a new full text search fragment. In a process
known as “dumping” the Accumulator, a fragment containing the objects stored within
the Accumulator is written to disk.
The transactional correctness of indexing is possible in part because of how the
Accumulator works. As objects are added to the accumulator, they are also written to
disk in the accumlog file. These files are monitored by the search engines to keep
the search index incrementally updated. When the Accumulator dumps, a new index
fragment is created, and the accumlog files are available for cleanup.
The size of the accumulator has an impact on system performance, and on the
maximum size of an object that can be indexed. A small Accumulator is forced to
dump frequently, which can reduce indexing performance. A large Accumulator
consumes more memory. The default size value for the Accumulator is 30 Mbytes
(which is a nominal allocation target – Java overhead results in the actual memory
consumption being higher), and can be set from within the Content Server search
administration pages, which sets the [Dataflow_] value in the search.ini file:
AccumulatorSizeInMBytes=30
If a single object is too large to fit within the Accumulator, it will be truncated –
discarding the excess text content. You cannot always predict whether an object will
exceed this size limit, since this is a measurement of internal memory use including
data structures, and not a measurement of the length of the strings being indexed.
The Accumulator will dump if it contains data and indexing has been idle. The idle
time before dumping is configurable:
DumpOnInactiveIntervalInMS=3600000
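The fill-then-dump cycle described above can be modeled with a toy class. This is a conceptual sketch only; the real Accumulator tracks internal memory use (including data structures), not string length, and also writes accumlog files for transactional correctness:

```python
class Accumulator:
    """Toy model: buffer object text until a size threshold is
    reached, then 'dump' the buffer as a new index fragment."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.buffer = []
        self.size = 0
        self.fragments = []

    def add(self, obj_text):
        self.buffer.append(obj_text)
        self.size += len(obj_text)
        if self.size >= self.limit:
            self.dump()

    def dump(self):
        # Write a fragment containing the buffered objects.
        if self.buffer:
            self.fragments.append(list(self.buffer))
            self.buffer, self.size = [], 0

acc = Accumulator(limit_bytes=20)
for text in ["alpha beta", "gamma delta", "epsilon"]:
    acc.add(text)
acc.dump()  # an idle timeout would also trigger this final dump
assert len(acc.fragments) == 2
```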
During indexing of an object, the accumulator also makes an assessment of the
quality of the data it is given to index. If the data is too “random” from a statistical
perspective, then the accumulator will reject it with a “BadObjectHeuristics” error.
The randomness configuration settings in the [Dataflow_] section are:
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
The heuristics are relatively lax, and essentially designed to try and protect the index
from situations where random data or binary data was provided. It is rare that these
values need to be adjusted, and some experimenting will be needed to find values
that meet special needs. There is a minimum size of about 16,384 tokens before
these heuristics are applied, since small objects would otherwise fail the uniqueness
requirement.
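A simplified version of these checks can be sketched as follows. This combines the paired Heuristic1/Heuristic2 thresholds into a single test per measure for clarity; the engine's actual logic and interplay between the thresholds are internal:

```python
def looks_random(tokens, max_unique_ratio=0.5,
                 max_avg_len=15.0, min_tokens=16384):
    # Only large objects are tested; small ones would otherwise
    # fail the uniqueness check simply for being short.
    if len(tokens) < min_tokens:
        return False
    unique_ratio = len(set(tokens)) / len(tokens)
    avg_len = sum(len(t) for t in tokens) / len(tokens)
    return unique_ratio > max_unique_ratio or avg_len > max_avg_len

# Ordinary prose repeats words heavily, so the unique ratio stays low:
prose = ("the quick brown fox jumps over the lazy dog " * 2000).split()
assert not looks_random(prose)
# Data where every token is unique trips the heuristic:
noise = ["tok%d" % i for i in range(20000)]
assert looks_random(noise)
```

This also illustrates the spreadsheet caveat below: a large list of distinct names and numbers pushes the unique-token ratio high, even though the content is legitimate.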
There is one situation where this safety feature is known to occasionally discard good
objects. If a spreadsheet is indexed that contains lists of names, numbers and
addresses, the uniqueness of the tokens may be very high, and it may be rejected as
random.
A related configuration setting is an upper limit on the size of a single object. Objects
are truncated at this limit, meaning that only the first part of the object is indexed.
Note that this size limit is applied to the text given to the Index Engine, not the size of
an original document file. For example, a 15 MB Microsoft PowerPoint file might only
have a filtered size of 100 Kbytes. Conversely, an archive file (ZIP file) with a size of
1 MB might expand to more than 10 MB after filtering.
ContentTruncSizeInMBytes=10
From an indexing perspective, 10 Mbytes is a lot of information. For English
language documents, this would normally be more than 1 million words. By way of
comparison, this entire document in UTF-8 form is well under 1 MByte.
Accumulator Chunking
Starting with Search Engine 10 Update 7, the Accumulator also has the ability to limit
the amount of memory consumed by “chunking” data during the indexing process.
Essentially, if the size of the accumulator exceeds a certain threshold, the input is
broken into smaller pieces, or chunks. Each chunk is separately prepared and
written to disk. When all the chunks are completed, a “merge” operation combines
the chunks into the index.
Chunking is a very disk-intensive process. When chunking occurs, there is a
noticeable impact on the indexing performance. Fortunately, chunking is only
required when indexing very large objects. Using the default settings, we noted while
indexing our own typical “document management” data set that chunking occurs with
hundreds of documents per million indexed, and showed an overall indexing
performance hit of about 15% in a development environment. If indexing
performance must be optimized, you can disable chunking or even reduce the
Content Truncation size described above to a small value (perhaps 1 MByte) such
that chunking may never happen.
There are configuration settings in the [Dataflow_] section of the search.ini file for
tuning the chunking process. The number of bytes in an object before chunking will
occur has a default of 5 MBytes. The feature can be disabled with a large value, say
100,000,000.
AccumulatorBigDocumentThresholdInBytes=5000000
An additional amount of memory for related data such as the dictionary is reserved
as working space, expressed as a percentage of the Accumulator size (typically 30
Mbytes), with a default of 10 percent.
AccumulatorBigDocumentOverhead=10
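The splitting step can be illustrated with a small sketch. Only the size arithmetic is modeled here; the separate preparation and merge of each chunk are internal to the Index Engine:

```python
def chunk_sizes(total_bytes, threshold=5_000_000):
    """If an object exceeds the threshold, break it into chunks no
    larger than the threshold; each chunk is prepared separately
    and later merged into the index."""
    if total_bytes <= threshold:
        return [total_bytes]
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        chunks.append(min(threshold, remaining))
        remaining -= threshold
    return chunks

assert chunk_sizes(3_000_000) == [3_000_000]          # no chunking
assert chunk_sizes(12_000_000) == [5_000_000, 5_000_000, 2_000_000]
```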
As a result of this change, it will no longer be possible to search within XML regions
in the body of text for large XML objects where chunking occurs. Chunking can be
disabled for XML documents by setting the configuration to true, but this will negate
the memory savings from chunking.
CompleteXML=false
Reverse Dictionary
The search engine maintains dictionaries of words in the index. The dictionary is
sorted to be efficient for matching words, and for matching portions of words where
the beginning of the word is known (right-truncation, such as washin*). However, for
matching terms that start with wildcards (left-truncation), the dictionary is not optimal.
The search engine can optionally store a second dictionary, known as the Reverse
Dictionary. This is a dictionary of each term spelled backwards. For instance, the
term “reverse” is stored as “esrever”. This Reverse Dictionary allows high
performance matching of terms that begin with a wildcard, and for certain types of
regular expressions that are right anchored (ending with a $).
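The trick is that a left-truncated wildcard becomes an ordinary prefix scan over the reversed terms. A minimal sketch of this idea, using an invented term list:

```python
import bisect

# A reverse dictionary stores each term spelled backwards, sorted,
# so "*ing" becomes a prefix scan for "gni".
terms = ["reverse", "running", "sing", "washing", "washington"]
reverse_dict = sorted(t[::-1] for t in terms)

def ends_with(suffix):
    key = suffix[::-1]
    lo = bisect.bisect_left(reverse_dict, key)
    hi = bisect.bisect_right(reverse_dict, key + "\uffff")
    return sorted(t[::-1] for t in reverse_dict[lo:hi])

assert ends_with("ing") == ["running", "sing", "washing"]
```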
There is an indexing performance penalty associated with building and maintaining
the Reverse Dictionary. The penalty varies due to many factors, but has been
observed to be over 10%. There is additional disk space required, typically about 1
GB for a partition with 10 million objects. As far as memory is concerned, another
Accumulator instance is used which consumes about 30 MB of RAM in the default
configuration, and space of about 15 MB is required for term sorting. The Reverse
Dictionary is enabled with a setting in the [Dataflow] section of the search.ini file:
ReverseDictionary=true
The Reverse Dictionary works with full text content and text metadata stored in “Low
Memory” mode. Older storage modes are not supported. The Reverse Dictionary is
not used with regions that are over-tokenized or configured for exact substring
matching.
Transaction Logs
In the event that an index or partition is corrupted or destroyed, OTSE provides
Transaction Logs to help rebuild and recover indexes with the least amount of re-
indexing. Transaction Logs are generated by the Index Engines with a minimal
record of the indexing operations that have been applied. A fragment of a
Transaction Log looks like this:
2018-03-15T14:19:22Z, replace - content, DataId=1009174&Version=1
2018-03-15T14:19:22Z, add, DataId=1036021&Version=1
2018-03-15T14:19:22Z, delete, DataId=1015932&Version=1
2018-03-15T14:19:22Z, add, DataId=1036022&Version=1
2018-03-15T14:19:22Z, add, DataId=1036023&Version=1
2018-03-15T14:19:22Z, Start writing new checkpoint
2018-03-15T14:19:23Z, Finish writing new checkpoint
2018-03-15T14:19:23Z, add, DataId=834715&Version=1
If an index is corrupted, it can be restored from the most recent backup. The
Transaction Log can then be used to determine which Content Server objects should
be re-indexed or deleted to bring the backup copy of the index up to date, based on
the date/time of the operations since the date of the backup.
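Recovery tooling for this is left to the administrator, but the replay logic can be sketched by parsing the log format shown above (checkpoint marker lines are omitted here for brevity):

```python
from datetime import datetime, timezone

LOG = """\
2018-03-15T14:19:22Z, replace - content, DataId=1009174&Version=1
2018-03-15T14:19:22Z, add, DataId=1036021&Version=1
2018-03-15T14:19:22Z, delete, DataId=1015932&Version=1
2018-03-15T14:19:23Z, add, DataId=834715&Version=1
"""

def operations_since(log_text, backup_time):
    """Return (operation, object) pairs newer than the backup, i.e.
    the work needed to bring a restored index up to date."""
    ops = []
    for line in log_text.splitlines():
        stamp, op, obj = (p.strip() for p in line.split(",", 2))
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        if when.replace(tzinfo=timezone.utc) > backup_time:
            ops.append((op, obj))
    return ops

backup = datetime(2018, 3, 15, 14, 19, 22, tzinfo=timezone.utc)
assert operations_since(LOG, backup) == [
    ("add", "DataId=834715&Version=1")]
```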
The transaction logs are set up to rotate 4 logs of size 100 MB each, which should
typically be able to record more than 50 million operations for a partition. At this time,
these values are not adjustable. In a typical system with regular backups, this should
be more than enough to recover all transactions. If your backups are less frequent,
you may wish to copy these logs on a regular basis.
Multiple copies of the Transaction Logs can be written. The idea here is that these
logs must survive a disk crash to be useful for recovery. If you are concerned about
system recovery, consider recording the Transaction Logs on two different physical
disks. In the [IndexEngine_] section of the search.ini file:
TransactionLogFile=c:\logs\p1\transaction.log,
f:\logs\p1trans.log
TransactionLogRequired=false
In this example, logs are written to two locations. By default, the list is empty, which
disables writing the Transaction logs. The Index Engine will append text to the
provided file name to differentiate between the rotating logs. A second setting
dictates whether a failure to write Transaction Logs should be considered a
transaction failure, or should be accepted and allow indexing to continue. By default,
this is false – meaning the Transaction Logs are “nice to have”.
Protection
Because Content Server is relatively open and allows many types of applications to
be built on top of it, the search system can be exposed to unexpected data and
applications. This section touches on some of the configurable protection features of
OTSE.
object. The full text checksum for the affected objects is always 485363284. There
is a configuration setting to allow objects with this checksum to be treated as if they
have no text:
EnableWeakContentCheck=true
Cleanup Thread
As the Index Engines update the index, they create new files and folders. The
Search Engines read these files to update their view of the index. Left alone, these
files will eventually fill the disk. The Cleanup Thread is the component of the Index
Engine that runs on a schedule to analyze the usage of the files, and delete those
which are no longer necessary.
A Cleanup Thread only examines and deletes files for a single partition; each Index
Engine therefore schedules a Cleanup Thread. The Cleanup Thread will delete
unused configuration files, as well as unused files listed in the configuration files,
such as accumlog, metalog, checkpoint and subindex fragment files. Search
Engines keep file handles open for config files currently in use, and this is the primary
mechanism used by the Cleanup Thread to determine if files can be deleted.
There is no specific process to monitor for the Cleanup Thread; it is part of the Index
Engine process. By default, the Cleanup Thread is scheduled to run every 10
minutes. You can adjust the interval in the search.ini file [Dataflow_] section:
FileCleanupIntervalInMS=600000
The Cleanup Thread also has a secure delete capability, disabled by default.
SecureDelete=false
When set to true, the Cleanup Thread will perform multiple overwrites of files with
patterns and random information before deleting them, making them unreadable by
most disk forensic tools. This also makes the file delete process considerably slower,
and uses significant I/O bandwidth. Some additional notes on this feature:
• The US Government has updated their guidelines to require physical
destruction of disk drives for highest security situations.
• Overwriting files is ineffective with journaling file systems.
• The algorithm is designed for use with magnetic media, and may not provide
any additional security with Solid State Disks.
• Optimizations by Storage Array Network storage systems may defeat this
feature.
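Conceptually, secure delete overwrites a file several times before removing it. The sketch below shows the general technique; the engine's actual pass count and overwrite patterns are internal, and the caveats above about journaling file systems and SSDs apply equally to this code:

```python
import os
import tempfile

def secure_delete(path, passes=3):
    """Overwrite a file with patterns and random data, then unlink
    it, making recovery with most disk forensic tools impractical."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for pattern in (b"\x00", b"\xff", None)[:passes]:
            f.seek(0)
            data = os.urandom(size) if pattern is None else pattern * size
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)

fd, path = tempfile.mkstemp()
os.write(fd, b"sensitive index fragment")
os.close(fd)
secure_delete(path)
assert not os.path.exists(path)
```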
The Cleanup Thread code has been enhanced starting with Search Engine 10
Update 4 to delete unused fragments more aggressively. If for some reason you
require the previous behavior, it can be requested in the search.ini file by setting
SubIndexCleanupMode=0. The default value is 1.
Merge Thread
The Merge Thread is a component of the Index Engine that consolidates full text
index fragments. As the Index Engines add or modify the index, they do not change
the existing files. Instead, they append new files, referred to as the “tail” fragments.
The Search Engines must search against all of the files that comprise the full text
index.
As the number of files containing index fragments grows, the performance of search
queries deteriorates. The purpose of the Merge Thread is to combine fragments to
create fewer files that the Search Engines need to use, ensuring that query
performance remains high. Merging also reduces the overall size of the index on
disk, since deleted objects are simply “marked” as deleted in the tail fragments, and
modified objects will have multiple representations until they are merged.
The Merge Thread will create new full-text index fragment files and then
communicate with the Search Engine using the Control File regarding which files now
comprise the index. Once the Search Engine switches over (locking the new files), the
Cleanup Thread will delete the older index files.
Merging is a disk-intensive process. The Merge Thread therefore tries to maintain a
balance between how frequently merges occur and how many index fragments exist.
In a typical index, there are frequent merges taking place within the tail index
fragments, which tend to be small and can be merged quickly. Eventually, older and
larger fragments must also be merged.
An optimal target for the number of fragments an index should have is about 5. In
practice, the number of smaller fragments can grow quite large depending upon the
characteristics of the index. As a safeguard, there is a configuration setting that
places an upper limit on the number of fragments that are permitted for a partition
index, and this will force merges to occur. Too many fragments can seriously affect
query performance due to the level of disk activity in a query and the number of file
handles needed.
Tail Fragments
The Merge Thread configuration settings are located in the [Dataflow_] section of the
search.ini file:
// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
CompactEveryNDays=30
NeighbouringIndexRatio=3
“Want Merges” would normally only be changed for debugging purposes. In most
installations, these settings do not need to be modified. One setting of note is the
Compact Every N Days value, which instructs the Merge Thread to make a more
aggressive attempt to merge indexes over the long term. This setting helps to merge
older index fragments which are relatively stable, and would otherwise not be
scheduled for compaction.
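The balance the Merge Thread strikes can be illustrated with a greedy sketch: while there are more fragments than desired, merge the cheapest adjacent pair. This is a simplification; the real scheduler also weighs neighbour size ratios, merge cost and the configured limits:

```python
def merge_plan(fragment_sizes, desired_max=5):
    """Greedy model: repeatedly merge the adjacent pair of
    fragments with the smallest combined size."""
    frags = list(fragment_sizes)
    while len(frags) > desired_max:
        i = min(range(len(frags) - 1),
                key=lambda k: frags[k] + frags[k + 1])
        frags[i:i + 2] = [frags[i] + frags[i + 1]]
    return frags

# Eight fragments (sizes in MB), newest and smallest at the tail;
# small tail fragments are merged first, as described above:
assert merge_plan([900, 400, 120, 30, 8, 4, 2, 1]) == \
    [900, 400, 120, 30, 15]
```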
Merge Tokens
Merging fragments temporarily requires additional disk space, nominally the size of
all the fragments being merged. If the temporary disk space needed causes the
partition to exceed the configured maximum size of the partition, then the merge will
fail. One way to address this is to increase the configured allowable disk space.
However, increasing the disk space for every partition can be a costly approach to
solving the problem.
The better approach is to enable Merge Tokens. Merge Tokens are managed by the
Update Distributor, and can be granted on an as-needed basis to Index Engines that
do not have sufficient space to perform merges. If given a Merge Token, the Index
Engine will proceed to perform a merge even if this exceeds the configured maximum
disk space. If the largest index fragments are 20 GB, then 100 GB of temporary
space would suffice for 4 or 5 Merge Tokens. Relatively few Merge Tokens are
needed. 3 tokens would likely suffice for 10 partitions, perhaps 10 tokens for 100
partitions.
The Merge Token capability was first added in Update 2015-03, and the default
setting is disabled for backwards compatibility. In the [UpdateDistributor_] section of
the search.ini file:
NumOfMergeTokens=0
Too Many Sub-Indexes
Although OTSE has a typical target of merging down to 5 or so index fragments,
there are situations when this may not be possible. There is a maximum number of
allowable index fragments (or sub-indexes), which by default is 512. There have
been scenarios, usually due to odd disk file locking, where this limit has been
reached or exceeded. In this case, a Java exception will occur, logging a message
along these lines:
MergeThread:2:Exception:Exception in
MergeThread:java.lang.ArrayIndexOutOfBoundsException; 512
To recover from this, you can edit the [Dataflow_] section of the search.ini file to
increase the number of allowable sub-indexes (perhaps 600), and restart the affected
engines. Once recovered, the lower number should be restored, since running with
larger values has a potential negative performance impact.
MaximumSubIndexArraySize=512
Tokenizer
The Tokenizer is the module within OTSE that breaks the input data into tokens. A
token is the basic element that is indexed and can be searched. The Tokenization
process is applied to both the input data to be indexed, and the search query terms
to be searched.
There is a default standard Tokenizer (Tokenizer1) built into OTSE that applies to
both the full text and all search regions. The system supports adding new tokenizers
that can be applied to specific metadata regions. In addition, Tokenizer1 can be
replaced and customized, or can be used with a number of configuration options.
Everything that follows until the section entitled “Metadata Tokenizers” describes the
use of the default Tokenizer1.
Language Support
OTSE is based upon the Unicode character set, specifically using the UTF-8
encoding method. This means that all indexing and query features can handle text
from most languages. If there are limitations in supported character sets, any
necessary changes would take place within the Tokenizer.
Case Sensitivity
By design, OTSE is not case sensitive. Text presented for indexing or terms provided
in a query are passed through the Tokenizer, which performs mapping to lower case.
This design decision provides a slight loss of potential feature capability in full text
search, but improves performance and reduces index size dramatically. Note that
text metadata values are stored in their original form, including accents and case, so
that retrieval of metadata has no accuracy loss. The mapping to lower case is not
applied to other aspects of the index, such as region names, which ARE case
sensitive.
recognize this as a special case, and keep a string in this form intact as a single
token instead of breaking it into 3 separate tokens.
words{
word_specifications
}
The comm|nocomm line is optional, and not recommended. This controls whether
text that meets the criteria for SGML or XML style comments should be retained or
discarded. The default value is nocomm (do not index comments). This line is
equivalent to setting the standard Tokenizer options in the search.ini file with a value
of TokenizerOptions=2.
Using the null character as the “to” value in a mapping is a special case. Null
characters are skipped during a subsequent Indexing step, so mapping a character
to 0x00 will effectively drop it from the string. This may be useful for removing
standalone diacritical marks or punctuation such as the single quote mark from the
word “shouldn’t”.
The following table illustrates the default character mappings for many of the
European languages.
From                              To
A-Z                               a-z
À Á Â Ã Å à á ã å Ā ā Ă ă Ą ą     a
Ä Æ ä æ                           ae
Ç ç Ć ć Ĉ ĉ Ċ ċ Č č               c
Ď ď Đ đ                           d
È É Ê Ë è é ê ë Ē                 e
Ì Í Î Ï ì í î ï                   i
Ð ð                               ð
Ñ ñ                               n
Ò Ó Ô Õ Ø ò ó ô õ ø               o
Ö ö                               oe
Ú Û ù ú û                         u
Ü ü                               ue
Ý ý ÿ                             y
Þ (large thorn)                   þ
þ (small thorn)                   þ
ß                                 ss
Note: prior to Update 2014-12, upper and lower case Ø characters were mapped to a zero.
The upper and lower case IJ ligatures are mapped to the two letters I J.
Upper and lower case Letter L with Middle Dot are preserved ( Ŀ and ŀ).
Upper and lower case Œ ligatures converted to oe.
Accented W and Y characters are preserved (Ŵ ŵ Ŷ ŷ Ÿ ).
The ſ character (small letter “long s”) is preserved.
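The effect of these mappings can be sketched with a short Python example. This is illustrative only: the CHAR_MAP dictionary below contains just a handful of the default mappings from the table above, and the helper name map_chars is invented for this sketch rather than taken from OTSE itself.

```python
# Illustrative sketch of the character-mapping step. Each input character
# is replaced by its mapped form, which may be several characters (ä -> ae),
# or empty when the "to" value is the null character (dropping it entirely).
# Only a few of the default mappings are shown; CHAR_MAP and map_chars are
# invented names for this example.
CHAR_MAP = {
    "Ä": "ae", "ä": "ae",   # A-umlaut and ae ligature family
    "Ö": "oe", "ö": "oe",   # O-umlaut
    "Ü": "ue", "ü": "ue",   # U-umlaut
    "ß": "ss",              # sharp s
    "é": "e",               # accented e
    "\u2019": "",           # right single quote mapped to null (dropped)
}

def map_chars(text):
    # Characters without an explicit mapping are simply lowercased,
    # mirroring the engine's case-insensitive design.
    return "".join(CHAR_MAP.get(ch, ch.lower()) for ch in text)
```

For example, map_chars("shouldn’t") yields "shouldnt", matching the single-quote example above, and map_chars("Müller") yields "mueller".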
Arabic Characters
There are special cases implemented for tokenization of Arabic character sets, which
improve the findability of Arabic words.
Step 1 is character mapping. The character mapping is extended to handle cases in
which multiple characters must be mapped as a group. These mappings are:
Step 3 is removal of WAW and ALEF-LAM prefixes, only if doing so leaves at least 2
characters remaining.
The final step is removal of HEH-ALEF and YEH-HEH suffixes, again only if at least 2
characters will remain in the token.
Note that Arabic tokenization was improved significantly starting with Update 2014-
12.
Tokenizer Ranges
Ranges define the primitive building blocks of characters, organizing them in logical
groups. Each range specification is comprised of Unicode characters and character
ranges, expressed in hexadecimal notation. For example, a range for the simple
numeric characters 0 through 9 would be:
number 0x30-0x39
In practice, there are multiple Unicode code points where numbers could be
represented, so a richer definition of a number might need to include Arabic numerals
(0x660-0x669), Devanagari numerals (0x966-0x96f) and similar representations from
other languages. You would probably also want to use the character mapping
feature to convert these all to the ASCII equivalents:
number 0x30-0x39 0x660-0x669 0x966-0x96f
• May or may not start with currency; currency would be a list of symbols such as
$ ¥ £ or €.
• May or may not start with a dash after the optional currency sign.
• Has one or more numbers (0-9) following the optional dash and currency.
• Has zero or more sets of separators (, and .) and numbers following the first
number.
In general, the regular expressions are greedy – matching the longest possible string.
The following operations on ranges are supported, and are applied following the
range:
? Zero or one instances of the range
- Token matching this pattern is not valid, advance start pointer one
character and continue
The Tokenizer begins at a specific character, and attempts to find the longest valid
regular expression match. Once found, it takes the matching value as a word,
advances to the character following the match, and repeats. If no match is found, it
advances one character and repeats.
In general, regular expressions that you construct should be relatively lax. In the
currency example above, for instance, we do not enforce 3 digits between commas.
Erring on the side of indexing information rather than rejecting it is a good guideline.
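As a rough illustration, the currency pattern described above can be approximated with a Python regular expression. This is only an analogy: OTSE expresses its patterns in the range-based tokenizer notation, not Python regex syntax.

```python
import re

# Approximation of the currency/number token pattern described above.
# OTSE uses its own range-based notation; this Python regex is an analogy.
currency_number = re.compile(r"""
    [$¥£€]?          # optional currency symbol
    -?               # optional dash after the currency sign
    [0-9]+           # one or more digits
    (?:[,.][0-9]+)*  # zero or more separator-plus-digits groups
""", re.VERBOSE)

# Deliberately lax, as the text recommends: 3 digits between commas
# are not enforced, so "1,23,4" is accepted as well as "$-1,234.56".
assert currency_number.fullmatch("$-1,234.56")
assert currency_number.fullmatch("1,23,4")
assert currency_number.fullmatch("abc") is None
```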
_NGRAM2 gram2+
}
Bigram indexing is the default behavior for these languages. Older versions of the
Search Engine indexed each East Asian character as a separate token. There is a
configuration setting in the search.ini file that can force use of the older method. This
may be useful if you have an older index that predates OTSE with significant East
Asian character content that you do not wish to re-index.
Tokenizer Options
If you are using the standard Tokenizer, the following options are available in
[Dataflow_xxx] section of the search.ini file:
TokenizerOptions=128
The default value is 0 (no options set). The options are a bit field, and can be added
together to combine values. The bit field values are:
1 : a dash character “-” is counted as a standard character for words. The string
“red-bananas-26” would be indexed as a single token, instead of as the 3
consecutive tokens “red”, “bananas”, “26”.
2 : XML comments are indexed. By default, strings which fit the pattern for an
XML comment are stripped from the input. XML comments have the form
<!--any text in comment-->
4 : treat underscore characters “_” as separators. This would cause input such
as “My_house” to be indexed as two tokens, “my” and “house”. The default
would preserve this as a single token.
8 : special case handling to look for software version numbers of the form v2.0
and treat them as a single token.
16: treat the “at symbol” @ as a character in a word.
32: treat the Euro symbol as a character in a word.
128 : used to request the “older” method of indexing East Asian character strings
with each character as a separate token. The default indexes these strings as 2-
character “bi-grams”.
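Because the options form a bit field, combined behaviors are configured by adding the flag values together. A small sketch (the constant names below are invented for readability; only the numeric sum ever appears in search.ini):

```python
# TokenizerOptions bit-field values from the list above.
# The constant names are invented for this sketch; search.ini
# only ever sees the numeric sum.
DASH_IN_WORDS      = 1    # "-" counts as a word character
INDEX_XML_COMMENTS = 2    # keep <!-- ... --> content in the index
UNDERSCORE_SPLITS  = 4    # "_" treated as a separator
VERSION_NUMBERS    = 8    # keep v2.0-style tokens whole
AT_IN_WORDS        = 16   # "@" counts as a word character
EURO_IN_WORDS      = 32   # the Euro symbol counts as a word character
LEGACY_EAST_ASIAN  = 128  # one token per East Asian character

# For example, to keep dashed words intact and index XML comments:
options = DASH_IN_WORDS | INDEX_XML_COMMENTS
assert options == 3  # i.e. TokenizerOptions=3 in [Dataflow_xxx]
```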
-inifile identifies that the tokenizer filename follows, in this case tok.ini.
inputfile is the name of the file containing the data you wish to tokenize.
If inputfile contains “THIS is a TEßT”, the output would be of the form:
|THIS|this
|is|is
|a|a
|TEßT|tesst
Where the first value on each line represents the word tokens accepted by the
regular expression parser, and the second value represents the results after the
character mappings are applied.
Sample Tokenizer
The following sample tokenizer file is similar to the default implementation. Indented
lines have been wrapped to fit the available space. In practice, lines should not be
broken.
ranges {
alpha 0x30-0x39 0x41-0x5a 0x5f 0x61-0x7a 0xc0-0xd6
0xd8-0xf6 0xf8-0x131 0x134-0x13e 0x141-0x148
0x14a-0x173 0x179-0x17e 0x384-0x386 0x388-0x38a
0x38c 0x38e-0x3a1 0x3a3-0x3ce 0x400-0x45f 0x5d0-0x5ea
0xFF10-0xFF19 0xFF21-0xFF3a 0xFF41-0xFF5a
number 0x30-0x39
numin 0x2c-0x2e
currency 0x24 0xfdfc
numstart 0x2d
alphain 0x5f
tagstart 0x3c
colon 0x3a
tagend 0x3e
slash 0x2f
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d 0xfa30-0xfa6a
0xfa70-0xfad9 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d
0x3041-0x3094 0x30a1-0x30fe 0xff66-0xff9d 0xff9e-0xff9f
arabic 0x621-0x63a 0x640-0x655 0x660-0x669 0x670-0x6d3
0x6f0-0x6f9 0x6fa-0x6fc 0xFB50-0xFD3D 0xFD50-0xFDFB
0xFE70-0xFEFC 0x6d5 0x66e 0x66f 0x6e5 0x6e6 0x6ee 0x6ef
0x6ff 0xFDFD
indic 0x900-0x939 0x93C-0x94E 0x950-0x955 0x958-0x972
0x979-0x97F 0xA8E0-0xA8FB 0xC01-0xC03 0xC05-0xC0C
0xC0E-0xC10 0xC12-0xC28 0xC2A-0xC33 0xC35-0xC39
0xC3D-0xC44 0xC46-0xC48 0xC4A-0xC4D 0xC55 0xC56
0xC58 0xC59 0xC60-0xC63 0xC66-0xC6F 0xC78-0xC7F
0xB82 0xB83 0xB85-0xB8A 0xB8E-0xB90 0xB92-0xB95
Metadata Tokenizers
The default configuration uses the full text tokenizer for text metadata regions. OTSE
supports the use of additional tokenizers for text metadata regions. There are 3
requirements to enable this: creating the tokenizer file; referencing the tokenizer file
in the search.ini file; and associating the tokenizer with a metadata region.
Adding or changing the tokenizer configuration for text metadata is possible. When
the search system is restarted, the text metadata stored values are used to rebuild
the text metadata index using the new tokenizer settings. This may require several
hours on large search grids. There are configuration settings that determine the
behavior of the rebuilding when the tokenizers are changed. The first setting is a
failsafe to prevent accidental conversion if the tokenizers are deleted or changed
unintentionally. It requires that today’s date be provided for the conversion to occur.
Use the value “any” to allow conversion any time the tokenizers are changed. The
second setting determines whether the conversion is applied to existing data, or
simply to new data. Usually, applying to new data only is not recommended due to
inconsistent results, so the default value is true. In the [Dataflow_] section:
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true
The search.ini file is used to define where the search tokenizer files are located. In
the search.ini file, to add two metadata tokenizer files:
[Dataflow_]
RegExTokenizerFile2=c:/config/tokenizers/partTKNZR.txt
RegExTokenizerFile3=c:/config/tokenizers/NoSpaceTokens.txt
Note that the additional tokenizer values start at the number 2. The first tokenizer
entry is always reserved for the full text tokenizer. The tokenizer definition files in this
example are located in the config/tokenizers directory, the recommended location
for tokenizer definition files.
The next step is to identify the text metadata regions which should use the
enumerated tokenizers. This is done as an optional extension to the text region
definition in the LLFieldDefinitions.txt file:
The search engine would then apply the rules defined in partTKNZR.txt to the region
OTPartNum, and the tokenizer rules in the file NoSpaceTokens.txt to RegionX. The
tokenizer files are constructed using the same rules as the default full text tokenizer.
0xfffb=0x0
0xfffc=0x0
0xfffd=0x0
}
ranges {
gram4 0x9-0xe00 0xe2f 0xe3b-0xe3f 0xe4e-0x3004
0x3007-0x3040 0x3095-0x30a0 0x30ff-0x33ff 0x9fa6-0xabff
0xd7a4-0xf8ff 0xfa2e-0xfa2f 0xfa6b-0xfa6f 0xfada-0xff60
0xffa0-0xfffd
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d 0x3041-0x3094
0x30a1-0x30fe 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d
0xfa30-0xfa6a 0xfa70-0xfad9 0xff66-0xff9d 0xff9e-0xff9f
}
words {
_NGRAM4 gram4+
onechar
_NGRAM2 gram2+
}
Partition Sizes
Search for a partition name in the OTPartitionName region to get a count of the
number of objects stored in a given partition.
Metadata Corruption
Search for -1 in the region OTMetadataChecksum to identify if the metadata for any
objects are corrupt. This is only valid if the metadata checksum feature is enabled.
Query time and throughput varies based on many factors. The first step in optimizing
search query behavior is understanding how time is being consumed during search
queries. To help with this, the Search Federator keeps statistical information about
query performance, which is written to the Search Federator log once per hour.
Using this data, you can assess whether changes to the system or configuration are
improving or degrading search performance.
The data is written in tabular form, such that you can copy it and paste it into a
spreadsheet as Comma Separated Values to make analysis easier. The log entries
have this form, with leading time stamps and thread data omitted:
The statistics are not persisted between restarts, so the data starts at zero after every
startup of the search grid. This information is written when the log level is set to
status level or higher. Data on a given query is collected when the query completes,
so queries that cross an hour or day boundary are reported for the time when the
query finished.
This data is also available on demand through the admin interface using the
command: getstatustext performance
Administration API
In addition to a socket-level interface to support search queries, the search
components have a socket-level interface that supports a number of administration
tasks. Each component honors a different set of commands, and in some cases
replies to the same command with different information. Commands that make sense
for an Index Engine may be irrelevant for the Search Federator.
This section outlines the most common commands and the components to which
they apply. The client making the requests is also responsible for establishing a
socket connection to the component. The configuration of the port numbers for the
sockets is controlled in the search.ini file.
You do not need to use this API directly for management and maintenance.
Applications such as Content Server leverage the Administration API to hide details of
administration and provide unified administration interfaces.
The examples below use a > (prompt) symbol to represent the command(s), followed
by the response. White space has been added in responses for readability.
stop
Stops the process as soon as possible. Applies to all processes.
> stop
true
getstatustext
In the Index Engine, this command returns information about uptime, memory use
and number index operations performed:
> getstatustext
With the Search Federator, a variation of getstatustext can be used to retrieve data
about search query performance. The interpretation of the values is outlined in the
section entitled “Query Time Analysis”.
> getstatustext performance
<performance>
<hours>
<hour>
<hourNumber>13</hourNumber>
<numQueries>1</numQueries>
<elapsed>71305</elapsed>
<execution>1149</execution>
<wait>70156</wait>
<SELECT>376</SELECT>
<RESULTS>773</RESULTS>
<FACETS>0</FACETS>
<HH>0</HH>
<STATS>0</STATS>
</hour>
<hour>
<hourNumber>12</hourNumber>
<numQueries>4</numQueries>
<elapsed>149954</elapsed>
<execution>100071</execution>
<wait>49883</wait>
<SELECT>99761</SELECT>
<RESULTS>201</RESULTS>
<FACETS>16</FACETS>
<HH>0</HH>
<STATS>93</STATS>
</hour>
</hours>
</performance>
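Output in this XML form is straightforward to post-process. A small sketch using Python's standard ElementTree module flattens the hourly entries into CSV rows for a spreadsheet; the field list mirrors the sample above and may differ between versions, and the helper name is invented for this example.

```python
import xml.etree.ElementTree as ET

def performance_rows(xml_text):
    """Flatten the <performance> XML from "getstatustext performance"
    into CSV-style rows. The field names mirror the sample above;
    this helper is illustrative, not part of OTSE."""
    fields = ["hourNumber", "numQueries", "elapsed", "execution",
              "wait", "SELECT", "RESULTS", "FACETS", "HH", "STATS"]
    root = ET.fromstring(xml_text)
    rows = [",".join(fields)]  # header row
    for hour in root.iter("hour"):
        rows.append(",".join(hour.findtext(f, default="0") for f in fields))
    return rows
```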
Similarly, the Update Distributor can provide accumulated statistics about indexing
throughput and errors with “getstatustext performance”. First introduced in 20.4,
the output is in XML form and includes the same data that is written to the logs on an
hourly basis.
<?xml version="1.0" encoding="UTF-8"?>
<performance>
<hours>
<hour>
<hourNumber>8</hourNumber>
<AddOrReplace>0</AddOrReplace>
<AddOrModify>0</AddOrModify>
<Delete>0</Delete>
<DeleteByQuery>0</DeleteByQuery>
<ModifyByQuery>0</ModifyByQuery>
<Modify>0</Modify>
…
Starting with the 2015-09 update, a new option for getstatustext returns a subset
of information more quickly. The “basic” variation reduces the time needed by Content
Server to display partition data. The subset of data was specifically selected to meet
the needs of the Content Server “partition map” administration page. When basic is
used, the status and size of partitions is retrieved from cached data, and only
updated during select indexing operations such as “end transaction”. While
technically the information could be slightly incorrect, it is accurate enough for
practical purposes. If there is no cached data, then the slower methods are used –
querying each index engine for data.
For the Index Engines, there is new data in this response. Percentage full is
presented in two different ways: one for text metadata, and one for usage of the
allocated space on disk for the index. The Behaviour value represents the “soft”
modes of a read/write partition, such as update-only or rebalancing. Sample
responses from the other search processes are shown below, returning the same
codes as a “getstatuscode” command.
<?xml version="1.0" encoding="UTF-8"?>
<stats>
<UpDist1>
<status>135</status>
</UpDist1>
</stats>
<stats>
<SEname0>
<status>12</status>
</SEname0>
</stats>
getstatuscode
This function is used to determine if a process is ready, in error, or starting up.
Starting up is generally the status while an index is being loaded.
> getstatuscode
12
All Processes
12 Ready
133 Done
registerWithRMIRegistry
For all processes, this command forces a reconnection with the RMI Registry, and
reloads the remote process dependencies. It is useful for resynchronizing after some
types of configuration changes without needing to restart the processes. If the search
grid is configured to not use RMI, this command is ignored.
> registerWithRMIRegistry
received ack
checkpoint
The checkpoint function is issued to the Update Distributor to force all partitions to
write a checkpoint file. This is especially useful as part of a graceful shutdown
process. If large metalogs are configured, the time to replay the metalogs during
startup can take a long time. Forcing checkpoints shortly before shutdown eliminates
metalogs and can dramatically improve startup time. After issuing the checkpoint
command, the Update Distributor waits for a number to be provided. The number is
a percentage, representing the threshold over which a checkpoint should be written.
For example, if a checkpoint is normally written when metalogs reach 200 Mbytes, a
value of 10 means that a checkpoint should be immediately forced if the metalog has
reached 20 Mbytes in size. The same logic applies for other checkpoint triggers,
such as number of new objects or number of objects modified. Any value other than
an integer from 0 to 99 will simply abort the command.
> checkpoint
> 10
true
reloadSettings
This command applies to all processes. Some, but not all, of the search.ini settings
can be applied while the processes are running, and some can only be applied when
the processes first start. This command requests that the process reload settings. A
list of reloadable settings is included near the end of this document.
> reloadSettings
received ack
getsystemvalue
Used to obtain specific values from the Index Engine. Currently, there are only two
keys defined. ConversionProgressPercent will return the percentage complete when
an index conversion is taking place. A “ping” operation to check that the process is
responding is also available. This command is different from the others in that it
requires two separate submissions, the first being the command and the second
being the key.
> getsystemvalue
> marco
polo
> getsystemvalue
> ConversionProgressPercent
36
addRegionsOrFields
This command applies to the Update Distributor only, and can be used to dynamically
add a region definition. Once added to an index, regions are generally sticky. The
LLFieldDefinitions.txt file is not updated, so note that using this command may cause
a drift between the index and the LLFieldDefinitions.txt file. This discrepancy is not a
problem, but should be kept in mind in support situations.
The syntax requires exactly one TAB character after the type and before the region
name. This command waits for additional lines of definitions until an empty line is
sent, which terminates the input mode. The function returns true on completion.
> addRegionsOrFields
> text flip
> integer flop
>
true
runSearchAgents
Update Distributor only. Instructs the Update Distributor to run all of the search
agents which are currently defined against the entire index. Results are sent to the
search agent IPool.
> runsearchagents
true
runSearchAgent
Update Distributor only. Instructs the Update Distributor to run a specific search
agent. The search agent named must be correctly defined in the search.ini file.
Results are sent to the search agent IPool. This command expects one line with the
search agent after the command.
> runsearchagent
> bob
true
runSearchAgentOnUpdated
Update Distributor only. Instructs the Update Distributor to run the specific search
agents listed. Time is based on the values in the upDist.N file, and the timestamp
is updated (see Search Agent Scheduling). Requests are added to a queue and may
require some time to complete. Results are sent to the search agent IPool.
> runsearchagentonupdated
> MyAgentName
> AnotherAgent
true
runSearchAgentsOnUpdated
Update Distributor only. Instructs the Update Distributor to run all the search agents.
Time is based on the values in upDist.N file, and the timestamp is updated (see
Search Agent Scheduling). Requests are added to a queue and may require some
time to complete. Results are sent to the search agent IPool.
> runsearchagentsonupdated
true
Server Optimization
There are many performance tuning parameters available with OTSE. There is no
single perfect configuration that meets all requirements. You can optimize for
indexing performance or query performance. There are tradeoffs between memory
and performance, and many external parameters can affect the OTSE behavior. In
this section we examine some of the most common options for system tuning. The
focus here is on administration and configuration tuning, not on application
optimization.
If your use of OTSE includes high volumes of indexing and metadata updates, then
fragmentation may occur more quickly. You can consider modifying the configuration
settings to run the defragmentation several times per day. While defragmentation is
happening, there will be short periods, typically a few seconds at a time, where
search query performance is degraded. In practice, we find that Low Memory Mode
without daily defragmentation provides the best indexing throughput.
The tuning parameters typically do not require adjustment unless you are
experiencing extraordinary levels of memory fragmentation. Within the
search.ini_override file, in the [DataFlow] section, the following settings can be
added to make adjustments if necessary:
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
Defragmentation times can be a list in 24 hour format (for example, 2:30;14:30) to run
multiple times per day. Space is the maximum temporary memory to consume while
defragmenting in MB; the larger the value, the faster defragmentation runs – up to a
limit based on the size of the largest region. To completely disable defragmentation,
set the DefragmentMemoryOptions value to 0. Setting the options value to 1 is not
recommended – it enables aggressive defragmentation, whereby all regions are
defragmented without relinquishing control to allow searches while defragmentation
occurs.
There are two other defragmentation settings that you will normally not need to
adjust:
DefragmentMaxStaggerMinutes=60
DefragmentStaggerSeedToAppend=SEED
If you have multiple search partitions, each partition will randomly select a
defragmentation start time up to “MaxStaggerMinutes” after the specified daily
defragmentation time. The purpose of this is to distribute CPU load randomly if you
have many partitions. The SEED value is a string used to seed the random number,
and is available to change if for some reason the default string “SEED” produces start
times which cluster too tightly. It is unlikely you will need to provide an alternative
string.
that a single process can consume to about 1.3 gigabytes. Once you factor out
memory needed for other purposes, the practical upper limit for memory that can be
reserved for metadata is about 1 gigabyte. Customers using Content Server on
Solaris, which uses a 64 bit JVM, have reported success using larger partition sizes,
up to 3 gigabytes.
Assuming a 64 bit Java environment, such as Content Server 10.5 or 16, you can set
the partition sizes larger. Because of the number of variables, there is no simple
optimal size which is always correct. For systems which cannot contain the entire
index within a single partition, larger partition sizes are synonymous with fewer
partitions. Here are some of the tradeoffs:
• The memory overhead for a partition is more or less constant, regardless of the
partition size. Larger partitions are therefore more efficient in terms of memory
use, which can reduce the overall cost of hardware.
• During indexing, the Update Distributor will balance the load over the available
index engines. If high indexing performance is a key requirement, more
partitions may be preferable.
• For search queries returning small numbers of results (typical user searches),
fewer partitions are more efficient. This is typical of most Content Server
installations.
• Some specific types of queries are slow, with performance based on the
number of text values in the partition dictionary, so smaller partitions are
faster. If regular expression (complex pattern) queries on text values stored in
memory are common for your application, then smaller partitions may be a better
choice.
• A small partition would reserve about 1 gigabyte of RAM for metadata. A very
large partition would be about 8 gigabytes in size. Experimenting with
intermediate sizes before configuring a large partition is strongly recommended.
Currently, very conservative default values are used: 80% full for rebalancing and
77% for the stop rebalancing threshold, which reflects the amount of memory
typically used by existing Content Server customers.
Selecting a suitable threshold for update-only mode requires a little more thought,
and depends upon your expected use of the search engine. The default value with
Content Server is a setting of 70%, which reserves 10% of the space for metadata
changes. Some considerations for adjusting this setting include:
• If your system has applications or custom modules known to add significant
new metadata to existing objects, you should allow more space for updates.
• Archival systems which rarely modify metadata can reduce the space
reserved for updates. Note that Content Server Records Management will
often update metadata when activities such as holds take place, even with
archive applications.
Note that these values are representative for traditional partitions with 1 GB of
memory for metadata. If you are using a larger partition, then reserving less space
for updates and rebalancing may be appropriate. The best practice is to periodically
review the percent full status of your partitions, and adjust the partition percent full
thresholds based upon your actual usage patterns.
The values in the search.ini file that define the various thresholds are:
MaxMetadataSizeInMBytes=1000
StartRebalancingAtMetadataPercentFull=99
StopRebalancingAtMetadataPercentFull=96
StopAddAtMetadataPercentFull=95
WarnAboutAddPercentFull=true
MetadataPercentFullWarnThreshold=90
usually indicated for the Hot Phrases and Summaries regions (OTHP and
OTSummary).
Note that if you fill a partition in a low memory mode, you may not have enough
space later to convert to a higher memory usage mode. For example, if the partition
memory is 80% full with text regions in DISK mode, it is unlikely that you will be able
to switch the default setting to RAM mode unless some regions are removed or the
partition size is increased.
The times are cumulative since the Update Distributor was started. Each entry has
the form:
Category N ms (count).
Total Time: Total uptime of the Update Distributor. This includes the start-up time
that is not included in any other category, so it will be larger than the sum of the
other categories.
Start Transaction: Time the Update Distributor spends waiting for the Index Engines
to be ready to start a transaction.
End Transaction: Time the Update Distributor spends waiting for a transaction to end,
excluding time to write checkpoint files. Too much time in this category may indicate
an excessive amount of time is spent running search agents (for Content Server,
usually Intelligent Classification or Prospectors).
Checkpoint: Time the Update Distributor waits for the Index Engines to write
checkpoint files. Large percentages of time here suggest that checkpoints are
created too frequently, or the storage system is under-powered. Metalog thresholds
can be adjusted to reduce the frequency of checkpoint writes.
Local Update: Time the Update Distributor is working with the Index Engines to
update the search index. This is useful time. It is common for this value to remain
below 15% of the time even when a system is performing well.
Global Update: Time in which the Update Distributor is interrogating the Index
Engines prior to initiating the local update steps. A typical purpose is to establish
which Index Engine should receive a given indexing operation. Long times here may
indicate that Update Distributor batch sizes are too small.
Idle: The amount of time the Update Distributor is idle; it has completed all the
indexing it can, and is waiting for new updates to ingest. A high percentage of idle
time indicates that OTSE has additional capacity. If indexing is slow and there is
sufficient idle time, the bottlenecks likely exist upstream in the indexing process
(DCS, Extractors or DataFlow processes). Note that you should always have some
idle time, since the demand on indexing throughput is not constant.
IPool Reading: The amount of time the Update Distributor spends reading indexing
instructions from the disk. In general, this should be relatively small compared to
measurements such as Local Update. If not, it may indicate poor disk performance
for the disk hosting the input IPools.
Batch Processing: The amount of time spent planning how to proceed with the local
update. This value should be very small as a percentage of global update time.
Start Transaction and Checkpoint: Older systems using RMI mode could not
differentiate between time spent writing checkpoints and time spent starting a
transaction, so on those systems the two operations are grouped into a single
category. A properly configured system should have a value of 0 in this field.
Search Agents: Time spent running search agent queries. Does not apply when
configured to use the older method of running agents after every index transaction.
Network Problems: The values NetIO1 through NetIO5 capture the number of times
1 to 5 retries were needed to read or write to network IO. NetIOFailed counts the
number of times IO failed after 5 retries.
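Since the category times are cumulative, a quick way to interpret them is as a percentage of total uptime. A hypothetical helper, with category names taken from the log format described above:

```python
def time_breakdown(times_ms):
    """Express each cumulative category time as a percentage of the
    Update Distributor's total uptime. A hypothetical helper for
    analysis, not part of OTSE itself."""
    total = times_ms["Total Time"]
    return {category: round(100.0 * ms / total, 1)
            for category, ms in times_ms.items()
            if category != "Total Time"}

# Example: a mostly-idle grid with a healthy Local Update share.
sample = {"Total Time": 100000, "Local Update": 12000,
          "Checkpoint": 3000, "Idle": 70000}
breakdown = time_breakdown(sample)
assert breakdown["Idle"] == 70.0  # plenty of spare indexing capacity
```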
Because the characteristics of Low Memory mode are different, these values can be
adjusted upwards significantly, perhaps to 100 MB, or 50,000 new objects or 10,000
objects modified. In order to maintain backwards compatibility and mixed mode
operation, OTSE has a separate set of Checkpoint Threshold configuration settings
for Low Memory Mode:
MetaLogSizeDumpPointInBytesLowMemoryMode=100000000
MetaLogSizeDumpPointInObjectsLowMemoryMode=50000
MetaLogSizeDumpPointInReplaceOpsLowMemoryMode=5000
Throughput normally increases with larger values because the number of times that
Checkpoints are created decreases. At the same time, this increases the likelihood
that many partitions will need to create checkpoint files at the same time. This may
place a high load on your disk system, and stall indexing for longer periods when
Checkpoint writes happen.
Larger values mean that more data is kept in the metalog and accumlog files instead
of in the Checkpoint. Larger metalog files require more time to consume during the
startup process for Index Engines or Search Engines. In most cases, this is a one-
time penalty and is acceptable.
When checkpoints are written, the Update Distributor writes lines to the log file that
indicate progress against each of the three configuration thresholds for each partition
that will write a checkpoint. Reviewing these lines can help you understand where
adjustments may be appropriate. The log lines look like this:
Set the CheckMode to 1 to enable the metadata Merge File mode. The LogSize
determines how large the CheckLog files may become before a merge operation is
triggered, and defaults to 512 MB. The MergeThreadInterval determines how often
the Index Engines check whether a merge should be performed, with a default of
10 seconds. The MemoryOptions default is optimized to minimize memory use;
setting this value to 1 uses perhaps 100 MB of additional RAM per partition for
a relatively small performance increase during merge operations.
Index Batch Sizes
The Update Distributor breaks input IPools into smaller batches for delivery to Index
Engines. The default is a batch size of 100. For Low Memory mode, this can be
higher, perhaps 500. Since the batch size is distributed across all the Index Engines
that are currently accepting new objects, the batch size can be further increased if
you have many partitions. A guideline might be 500 + 50 per partition. Larger
batches result in less transaction overhead.
[Update Distributor section]
MaxItemsInUpdateBatch=500
Note that the batch size is also limited by the number of items in an IPool. Often, the
default Content Server maximum size for IPools is about 1000, so this may also need
to be modified to take full advantage of increases in the Update Distributor batch
size.
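The batch-size guideline above is simple arithmetic; a throwaway helper (illustrative only) makes it concrete:

```python
def suggested_batch_size(partitions):
    # Guideline from the text above: 500, plus 50 per partition.
    return 500 + 50 * partitions

print(suggested_batch_size(8))  # 900
```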
Starting with 20.3, batches are also split when the total size of the metadata plus text
in the objects to be indexed exceeds a defined threshold. The default is 10 MB, but
can be set higher if indexing large objects is common. This has been seen when
indexing email that has distribution lists with thousands of recipients. In the
[Dataflow_] section:
MaxBatchSizeInBytes=20000000
Prior to 20.3, batches were split based on size using a different approach: the
total size of the metadata of the objects in a batch could not exceed half of
the content truncation size (typically 5 MB).
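Combining the two limits, batch splitting since 20.3 can be sketched as follows. The splitting function is a hypothetical model of the behavior described, not OTSE code, and the object sizes are illustrative:

```python
# Split a stream of objects into batches bounded by both an item-count limit
# (MaxItemsInUpdateBatch) and a total-bytes limit (MaxBatchSizeInBytes).

def split_batches(object_sizes, max_items=500, max_bytes=10_000_000):
    batches, current, current_bytes = [], [], 0
    for size in object_sizes:
        # Start a new batch when either limit would be exceeded.
        if current and (len(current) >= max_items
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Three 4 MB objects against the 10 MB default split into batches of 2 and 1:
print([len(b) for b in split_batches([4_000_000] * 3)])  # [2, 1]
```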
There is another configuration setting that enables an optimization added in 16.2.2
related to how batches are handled. When processing ModifyByQuery or
DeleteByQuery operations, each request is sent to every Index Engine separately. In
practice, there are often many such contiguous operations in an IPool. The
optimization bundles these contiguous operations into a single communication to
each Index Engine, reducing the coordination overhead. By default, this optimization
is enabled, and can be controlled in the [DataFlow] section of the search.ini file:
GroupLocalUpdates=true
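The bundling can be pictured with Python's itertools.groupby; this is a model of the described behavior, not the actual implementation, and the operation labels are illustrative:

```python
# Runs of contiguous by-query operations are bundled into a single
# communication per Index Engine; other operations pass through individually.
from itertools import groupby

ops = ["Add", "ModifyByQuery", "ModifyByQuery", "DeleteByQuery",
       "DeleteByQuery", "Add"]

def is_by_query(op):
    return op.endswith("ByQuery")

bundles = [list(run) for _, run in groupby(ops, key=is_by_query)]
print([len(b) for b in bundles])  # [1, 4, 1]: four by-query ops become one message
```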
Partition Biasing
Research has shown a strong correlation between the number of partitions used
for indexing and the typical indexing throughput rate. As expected, more
partitions improve parallelism and increase throughput. However, the transaction
overhead per partition is relatively fixed, and batches become fragmented into
small pieces when operations are distributed across many partitions. Depending
on hardware, the optimal indexing throughput is usually reached in the range of
4 to 8 partitions.
To enable indexing in this optimal range for large search grids, there is a
feature in OTSE that restricts indexing of new objects to a specified number of
partitions. For example, you may have 12 partitions, but want to fill only 5 at
a time for optimal throughput. This is called partition biasing, and is set in
the [Dataflow_] section:
NumActivePartitions=5
The default value is 0, which disables partition biasing. Biasing only applies to new
objects being indexed. Updates to existing objects are always sent to the partition
that contains the object, regardless of biasing. For biasing purposes, a partition is
considered “full” when it reaches its “update only” percent full setting. The algorithm
for distributing new objects across active partitions is based upon sending objects
with approximately similar total sizes of full text and text metadata.
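A plausible reading of that distribution algorithm is a greedy balance of accumulated size across the active partitions. The sketch below is an assumption based on the description above, not the actual OTSE algorithm:

```python
# Assign new objects (by their text + text metadata size) to the least-loaded
# of the active partitions, keeping accumulated sizes approximately equal.
import heapq

def assign(object_sizes, num_active_partitions=5):
    heap = [(0, p) for p in range(num_active_partitions)]  # (total_size, partition)
    heapq.heapify(heap)
    assignment = []
    for size in object_sizes:
        total, part = heapq.heappop(heap)      # least-loaded active partition
        assignment.append(part)
        heapq.heappush(heap, (total + size, part))
    return assignment

# One large object steers subsequent objects to the other active partition:
print(assign([50, 10, 10, 10, 30], num_active_partitions=2))
```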
During an indexing performance test at HP labs in the summer of 2013, a brief test of
indexing throughput versus the number of partitions was performed. At the time, the
index contained about 46 million objects. There was plenty of spare CPU capacity,
and a very fast SAN was used for the index. In this particular test, the throughput
peaked around 12 partitions.
Parallel Checkpoints
Another index throughput adjustment is control over parallel checkpoints. When a
partition completes an indexing batch, it checks whether the conditions for
writing a Checkpoint have been met. If so, all partitions are given the
opportunity to write Checkpoint files. The logic is that if at least one
partition is stalled, any partition that might need to write a Checkpoint soon
should do so now. However, if there are many Checkpoints, you may saturate disk or CPU
To completely disable Bloom Filters, AutoAdjust should be set to false, and the
Number of Hash Functions should be set to 0.
A further optimization was added in version 20.4, in which a quick single-token
search for the data ID is performed to get a short list of objects, which are
then tested for the phrase match. This is considerably faster, since phrase
searches are much slower than single-token searches. This fast lookup can be
disabled if necessary in the [Dataflow_] section of the search.ini file:
DisableDataIdPhraseOpt=true
Compressed Communications
There is a configurable option in OTSE that allows the content data sent from the
Update Distributor to the Index Engines to be compressed. For systems which have
excess CPU capacity and slow networking to the Index Engines, enabling this option
can improve indexing throughput. Most systems do not have this performance
profile, so the feature is disabled by default. The threshold setting determines the
minimum size of full text content that needs to be present before the compression is
triggered for a specific object. Note that compression also requires additional
memory. The memory requirement varies based upon the maximum size of the text
content, and for a system with a content truncation size of 10 MB an Index Engine
would consume another 12 MB of RAM. In the [Dataflow_] section:
CompressContentInLocalUpdate=false
CompressContentInLocalUpdateThresholdInBytes=65535
Scanning Long Lists
There is a specific optimization available for updates to text metadata in partitions not
using Low Memory mode. Low Memory mode uses different data structures and
does not exhibit this behavior.
If metadata updates are applied to metadata values where many objects have the
same value, the update operation can be extremely slow. For example, the
“OTCurrentVersion” region may have 1 million objects with the value “true”. Updates
to this field would be very slow.
The optimization makes these updates fast, but requires additional memory.
Because many customers with this configuration have full partitions, they cannot
tolerate extra memory requirements, so the default is for the optimization to be
disabled (a value of 0). The configuration setting specifies the distance
between known synchronization points in the data structure. Values of about 2000
perform well; values below 500 become memory-intensive. In the [Dataflow]
section:
TextIndexSynchronizationPointGap=2000
Ingestion versus Size
When measuring performance of search indexing, bear in mind that throughput
reduces as the number of objects in the partition increases. As data structures
become larger, extending and updating the index becomes slower. The single largest
contributing factor to the performance degradation is writing Checkpoints. A
Checkpoint is a complete snapshot of the search partition. As the partition gets
larger, the time to create the Checkpoint increases. As a guideline, the indexing
The test was seeded with an 8-partition index of about 14 million items. Initially, 12 to
16 partitions were enabled. After each batch of 2 million items was ingested, the
performance was reviewed and occasionally changes made to the configuration of
hardware or the index.
Below 50 million items in the index, an important observation is that the Update
Distributor does not appear to be a bottleneck, despite all data for all Index Engines
passing through the Update Distributor. We see many data points where the overall
throughput exceeds 100 items per second, which would be in the neighborhood of 8
million objects per day.
Once we had confirmed that performance with 16 partitions was relatively high, we
adjusted the number of partitions down to 8, to focus on building larger partitions in
the available lab time. As expected, the throughput with 8 partitions is significantly
lower. By the end of the test, the 8 partitions contained indexes of 10 million objects
each. At this size, the indexing throughput had decreased to just under 30
objects per second. This is nearly 2 million objects per day, without allowing
spare capacity for downtime or spikes.
Some interesting data points:
• At about 94 million objects, we enabled more active partitions and observed
that much higher ingestion rates were still possible.
• Around the 30 million object mark, a faulty network card was replaced,
resulting in a material jump in performance.
• During one interval we duplicated the exact same test on the same
hardware, running concurrently. Our indexing tests were not fully engaging
the capacity of the HP hardware, generally staying below 30% CPU use.
Doubling the indexing load on the hardware resulted in dropping the
throughput from about 40 to about 30 objects per second for the observed
test, although we did manage to get a peak CPU use above 60%. The
duplicate concurrent test had similar performance characteristics. It would
appear that the HP environment has capacity for a much larger index than
we tested, or could also be used for other purposes such as the Document
Conversion Server.
• We disabled CPU hyper-threading for two runs, which reduced throughput
again from about 40 to 30 new objects per second. Lesson learned: leave
hyper-threading enabled for Intel CPUs.
What about searching? Search load tests from within Content Server were
performed concurrently while indexing was occurring. As expected, search became
slower as the index size increased. By test end, with 100 million items and indexing
40 objects per second, simple keyword searches from the search bar averaged less
than 3 seconds, and advanced search queries about 6 seconds, including search
facets. This is not the search engine time, but the overall time including Content
Server.
Does this ingestion case study have relevance for even larger systems? Yes. The
indexing throughput we measured is based on the number of “active” partitions, using
partition biasing. Eventually, you may have many more partitions, but by biasing
indexing to a limited subset, the indexing throughput can be modeled along the lines
seen in this example.
As a final note, this test was performed using Search Engine 10.0 Update 11. A
number of performance improvements, in particular for high ingestion rates, have
been implemented since this test was performed. Consider these data points to be
conservative.
Re-Indexing
Although OTSE has many features that provide upgrade capability and in-place data
correction, there are times when you may want to completely re-index your data set.
If you have a small index, re-indexing is fast and easy. For larger indexes, there are
some performance considerations.
It is faster to rebuild from an empty index than to re-index over existing data. There
are several reasons for this. Firstly, the checkpoint writing process slows down as
the index becomes larger, since there is more data to write to disk. When starting
fresh, the early checkpoint writing overhead is very small. Modifying values is also
more expensive than adding values – searching for existing values, removing them,
and adding new values to the structure is slower than simply adding data to a
structure.
Another key factor is the metalog update rules. In particular, the default checkpoint
write threshold is lower for updates than it is for adding new items to the index. This
is a reasonable value during normal operation, but when a complete re-index is in
progress and all objects are being modified, this setting will result in a high
checkpoint overhead. A purge and re-index avoids this problem entirely. If
re-indexing very large data sets, increasing the threshold for replace
operations may be a useful strategy.
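If the base setting names follow the same pattern as the Low Memory mode variants shown earlier (an assumption; verify the exact name in your search.ini before use), temporarily raising the replace-operations threshold for a full re-index might look like:

```ini
; Assumed base setting name, inferred by analogy with
; MetaLogSizeDumpPointInReplaceOpsLowMemoryMode; verify against your search.ini.
; Raise during the re-index, then restore the original value afterwards.
MetaLogSizeDumpPointInReplaceOps=50000
```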
For maximizing indexing throughput, disk performance is a key parameter, since
disk I/O is usually the limiting factor. Using several sample test setups on
similar (but not identical) configurations in 2012, we measured indexing times
with 4 partitions of:
390 Minutes with a single good SCSI hard disk installed in the computer.
5000+ Minutes attached to a busy NFS storage array shared with other
applications over a 10 Gb network connection, running on VMware ESX.
Read that last one again. You really can configure disk storage that will reduce the
performance of OTSE by a factor of 20 or more. Disk fragmentation also has an
impact. On Windows, we typically see a 20% indexing performance drop between a
pristine disk and one with 60% file fragmentation.
Note that the caching features of some SANs are too aggressive, and can report
incorrect information about file locking and update times.
Customers using basic Network Attached Storage, such as file shares, generally
report poor search performance; storing the search index on a network file share
will give very poor results.
The incidence of network errors that customers experience when using either SAN or
NAS is surprisingly high. OTSE has relatively robust error detection and retries for
these cases, but failure of the search grid due to network errors is still possible.
When using any type of network storage for the index, monitoring the network for
errors is a good practice that may prevent a lot of frustration due to intermittent
errors.
A dedicated physical high performance disk system will usually outperform a network
attached disk system. However, a SAN with high bandwidth often has other benefits,
such as high availability, which make them attractive. If you are configuring a SAN
for use with search, treat the search engine like a database. The performance of the
disk system is almost always the limiting factor in performance.
Any type of network storage is acceptable for index backups. In fact, backing up the
index onto a different physical system is generally recommended.
Finally, a word about Solid State Disks (SSD). SSDs are gaining acceptance for high
performance enterprise storage. The characteristics of fast SSD are a good fit for
search engines. Given the large number of small random access reads that occur
when searching, SSD storage is an excellent choice for maximizing search query
performance. Indexing performance is not as dramatically affected, since the Index
Engines are generally optimized to read and write data in larger sequential blocks.
However, even with indexing, the highest indexing throughputs we have measured in
our labs occurred with local SSD storage for the index, around 1 million objects
indexed per hour. If you need to improve the query performance or indexing
throughput, investing in good SSD storage media for the index is likely the best
hardware investment you can make.
ideally be in the 0-2 millisecond bucket. If there are counts recorded for long periods,
this is a strong indicator that there are performance problems with the storage
system.
Disk IO Counters. Read Bytes 0. Write Bytes 154394096.:
Histogram of Disk Writes. Avg 0 ms (381/18979). 0-2 ms
(18979). 3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0).
51-100 ms (0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms
(0).:
Histogram of Disk Syncs. Avg 179 ms (37276/208). 0-2 ms (0).
3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (37). 51-100
ms (23). 101-200 ms (59). 201-500 ms (87). 501-Inf ms (2).:
Histogram of Disk Seeks. Avg 0 ms (1/78). 0-2 ms (78). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Closes. Avg 0 ms (0/2). 0-2 ms (2). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
In addition to the times, the number of disk errors that occur and the number of
retries needed to succeed are recorded. If errors exist, an additional line of this form
will be written:
Disk IO Retries Needed. 1 (7). 2 (6). 3 (8). 4 (2). 5+ (22).
failed (17).
For example, this entry indicates that on 7 occasions, 1 error/retry was required. On
22 occasions 5 or more retries were attempted, and 17 times the disk I/O failed even
with retries.
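When automating log review, a line in this format is easy to parse. The helper below is illustrative; the line format is taken from the example above:

```python
# Parse a "Disk IO Retries Needed" log line into a mapping of
# retry-count label -> number of occurrences.
import re

line = "Disk IO Retries Needed. 1 (7). 2 (6). 3 (8). 4 (2). 5+ (22). failed (17)."

def parse_retries(line):
    return {key: int(count)
            for key, count in re.findall(r"(\d\+?|failed) \((\d+)\)", line)}

counts = parse_retries(line)
print(counts["failed"])  # 17
```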
Similarly, the Search Engine reports performance for selected disk operations, writing
entries of this form:
Disk IO Counters. Read Bytes 112711231. Write Bytes 0.:
Histogram of Disk Reads. Avg 0 ms (122/12347). 0-2 ms (12347).
3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms
(0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Seeks. Avg 0 ms (3/360). 0-2 ms (359). 3-5
ms (1). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms
(0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Closes. Avg 0 ms (0/1). 0-2 ms (1). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
By default, reporting of this data is enabled and is written every 25 transactions. The
feature can be disabled and the frequency of reporting can be controlled in the
[Dataflow_] section of the search.ini file:
LogDiskIOTimings=true
LogDiskIOPeriod=25
Checkpoint Compression
There is an optional feature in OTSE that allows Checkpoint files to be compressed.
Checkpoint files can be large, over 1 GB as you exceed 1 million objects in a
partition. New Checkpoint files are written from time to time, usually by all partitions
at once, which can place a significant burden on the disk system.
The compression feature is disabled by default since, in a simple system with a
single spinning disk, compression makes Checkpoint writing CPU bound, and
indexing throughput may decrease by 10% to 15%. However, if you have a system
which is limited by disk bandwidth rather than CPU, then enabling Checkpoint
compression may be a good choice, and actually increase indexing performance.
The compression feature generally reduces the size of Checkpoint files by about
60%. Compression is enabled in the [Dataflow_] section of the search.ini file:
UseCompressedCheckpoints=true
The other parameter is the number of results a Search Engine fetches each time the
Search Federator asks for a set of results. The default value is 50. Larger values
are more efficient when the typical query is for many results. Smaller values are
more efficient for typical relevance-driven queries. In general, if using the preload
above, a value of 20 to 50 is likely optimal, and reduces the potential load on the disk
system.
MergeSortChunkSize=50
These values are multiplicative with the number of partitions. For example, if
you have 8 partitions and a MergeSortChunkSize of 250, then the MINIMUM number
of results that the Search Engines together will provide to the Search Federator
is 2000. Keeping the MergeSortChunkSize value low for systems with many
partitions is recommended.
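The multiplication is worth making explicit when sizing a system; a one-line helper (illustrative only) reproduces the example above:

```python
def min_results_per_fetch(partitions, merge_sort_chunk_size):
    # Each Search Engine returns a full chunk, so the Search Federator
    # receives at least partitions * chunk_size results per fetch.
    return partitions * merge_sort_chunk_size

print(min_results_per_fetch(8, 250))  # 2000, as in the example above
```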
Throttling Indexing
In some environments, it may be the case that indexing operations are creating
metalogs faster than they can be consumed by the search engines. There is an
upper limit on how many unprocessed metalog files are acceptable, which can be
adjusted if necessary should Search Engines chronically lag behind the Index
Engines. This can happen in environments in which long-running search queries tie
up the Search Engines at the same time that high indexing rates are occurring. In
some cases this problem can be resolved by configuring Search Federator caching.
When this limit is reached the indexing updates will pause to allow the Search
Engines to close the gap.
AllowedNumConfigs=200
In situations where queries are constantly running, it may be necessary to force a
pause in processing search queries in order to give the Search Engines an
opportunity to consume the index changes. There are two settings to control
this: one specifies the maximum time that queries are allowed to run
continuously (thus blocking updates), and the other the duration of the pause
that is injected into searching. By default, this feature is disabled.
[SearchFederator_xxx]
BlockNewSearchesAfterTimeInMS=0
PauseTimeForIndexUpdatingInMS=30000
use. To try and ensure correct operation in these environments, most file accesses
will detect errors and retry operations multiple times. The delay between retries is
about 2 seconds times the attempt number, so for N retries the total retry time
is up to N*(N+1) seconds (e.g. if N is 5, up to 30 seconds). In update 21.1, this setting was
extended to cover retries for reading the livelink.### files (aka livelink.ctl files). Using
these types of disk environments is strongly discouraged, and even if correct, can be
extremely slow. The number of retries is adjustable, and defaults to 5.
NumberOfFileRecoveryAttempts=5
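The retry timing above follows from simple arithmetic: with a delay of about 2 seconds times the attempt number, N attempts wait up to 2*(1 + 2 + ... + N) = N*(N+1) seconds in total:

```python
def total_retry_seconds(n):
    # Delay before attempt k is about 2 * k seconds; sum over all attempts.
    return sum(2 * attempt for attempt in range(1, n + 1))

print(total_retry_seconds(5))  # 30 seconds, matching the example above
```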
processes and threads such that there is no problem in a system with multiple NUMA
nodes.
In a NUMA system, memory is partitioned with fast access to one CPU, and much
slower access by the other CPUs. OTSE uses many threads for execution, and the
operating system could assign different threads for the same Search Partition to
different physical CPUs. Tasks undertaken by the threads on CPUs not attached to
the memory take about 5 times longer to execute, in part because of slower memory
access, but also because serial interconnects between the CPUs must be used to
synchronize caches.
One approach to resolving this issue is to use operating system tools to pin
applications to physical CPUs. In a Content Server environment, Search Engine
processes are started and ‘owned’ by an Admin Server. It may therefore be
necessary to set the affinity of an Admin Server and all of its attached processes to a
single CPU. This in turn may require changing the number of Admin Servers in use
and allocating Search Engine processes to the Admin Servers to meet your
performance goals. In the Content Server environment, the Document Conversion
Server may likewise need to be adjusted.
The tools used to analyze the allocation of applications to CPUs and to pin
applications to CPUs vary by operating system. You may wish to investigate the use
of some of the following operating system functions for optimizing execution on
NUMA nodes:
Linux: taskset, numactl
Solaris: priocntl, pbind
Windows: start /NODE (may require hotfix to cmd.exe)
If you are running OTSE in a Virtual Environment, the VM tools will often have
processor and NUMA node affinity controls that may also be used to set node affinity.
Note that these considerations only apply to servers with multiple physical CPUs.
There is no scalability performance issue associated with many cores on a single
CPU.
Virtual Machines
In principle, virtual machines should be indistinguishable from physical
computers from the perspective of the software. In practice, problems
occasionally arise from running software in a virtual environment. OTSE is known
to operate with VMware ESX, Microsoft Hyper-V, and Solaris Zones. However,
OpenText cannot realistically test and certify every possible combination of
hardware and virtual environment, and there may be configurations of these
virtual environments that OpenText has not encountered which might be
incompatible with the search grid.
The most important point is this: virtual machines do NOT reduce the size of the
hardware you need to successfully operate a search grid. If anything, operating a
search grid in a virtual environment will require MORE hardware to achieve the same
performance levels, when measured in terms of memory and CPU cores/speed.
For small installations of the search grid where performance issues are not a factor, a
virtual environment can be attractive. However, as your system increases in size to
require many partitions, be aware that a virtual environment may be more costly than
a physical environment for the search grid, which needs to be considered against VM
benefits such as simplified deployment and management. Consider a search engine
as being analogous to a database. For larger or performance-intensive database
applications, the database is often left on bare metal, even if the remainder of an
application is virtualized. The Search Engine has performance characteristics similar
to a database and it may make sense to leave the Search Engine on dedicated
hardware.
One example of a limitation we have seen is virtual machines in a Windows server
environment. In some cases, the I/O stack space is not sufficient once the extra VM
layers are introduced, and tuning of the Windows settings to increase I/O resources
becomes necessary.
As with most applications deployed in a virtual environment, the software runs slower.
The change in performance depends on many factors, but a 10% to 15%
performance penalty is not uncommon.
We have also seen instances in which the memory used by Java in a VM
environment is reported as much higher than the equivalent situation on bare
hardware. In practice, the actual memory in use is very similar, but the reported
values can differ wildly. Often, over a period of many hours, the reported VM
memory will decline and converge on memory consumption reported on a bare
hardware environment.
Garbage Collection
The Java Virtual Machine will generally try to optimize the number of threads it
allocates to Garbage Collection. However, it is not always correct. For example,
when running in a Solaris Zones environment, the “SmartSharing” feature of Zones
can trigger the Java Garbage Collector to allocate very large numbers of threads and
memory resources, which in Zones may be manifested as Solaris Light Weight
Processes (LWPs).
If the number of threads on a system allocated to Garbage Collection seems
unusually large, you likely need to place a limit on the number of Garbage
Collection threads, which can be done by adding -XX:ParallelGCThreads=N to the
Java command line, where N is the maximum number of threads. Selecting N may
require experimentation, but values on the order of 8 are typical for a system
with 8 partitions, and values over 16 may provide little or no incremental
value.
File Monitoring
Some tools that monitor file systems can cause contention for file access. One
known example of this is Windows Explorer. If you browse to a folder used by SE
10.5 to represent the search index using Windows Explorer, then you will likely cause
file I/O errors and a failure of the search system.
Virus Scanning
The performance impact of virus scanning applications on the search grid is
catastrophic because of the intense disk activity that the search grid performs. In
some cases, file lock contention can also cause failure or corruption of the index. You
must ensure that virus scanning applications are disabled on all search grid file I/O.
The search system only indexes data provided by other applications. If virus
scanning is necessary, then scanning the data as it is added to the controlling
application (such as Content Server) is the recommended approach.
Related to this, virus scanners now often offer port scanning features as well.
Like virus scanners, port scanners can significantly reduce performance or cause
failure of the software.
Thread Management
OTSE makes extensive use of the multi-threading capabilities of Java. In general,
this leads to performance improvements when the CPUs have threads available.
However, for very large search grids with over 100 search partitions, the number of
threads requested by OTSE may exceed the default configuration values for specific
operating systems. Depending upon the operating system, it is usually possible to
increase the limits for the number of usable threads. This problem is less likely to
occur when running with socket connections instead of RMI connections.
Configuring an operating system to permit more threads for a single Java application
is beyond the scope of this document, and may also include tuning memory
allocation parameters for the JRE. The objective here is simply to make you aware
that additional system tuning outside the parameters of OTSE may be necessary.
Scalability
This section explores various approaches to scaling OTSE for performance or high
availability. OTSE does not incorporate specific scalability features. Instead, by
leveraging standard methods for system scalability with an understanding of how the
search grid functions, we can illustrate some typical approaches to search scalability.
Query Availability
Most customers that desire high availability are concerned primarily with search
query performance and uptime. Usually, this is addressed by running parallel
sets of the Search Federators and Search Engines in ‘silos’, with a shared
search index stored on a high availability file system, as illustrated below:
To obtain the benefit of high availability, the search silos should be located on
separate physical hardware in order to tolerate equipment failure.
Search queries are not stateless transactions; they consist of a sequence of
operations – open a connection, issue a query, fetch results, and close the
connection. Because of this, simple load balancing solutions cannot easily be used
as a front end for multiple search federators. Instead, the application issuing search
queries should have the ability to direct entire query sequences to the appropriate
silo and Search Federator.
Content Server is one such application. If multiple silos are configured, search
queries will be issued to each one alternately. In the event that one silo stops
responding, Content Server will remove that target from the query rotation. Refer to
the Content Server search administration documentation for more information.
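The rotation and failover behavior described for Content Server can be modeled in a few lines. This is a hypothetical sketch of the behavior, not Content Server code:

```python
# Alternate queries across silos; drop a silo from the rotation when it
# stops responding, as the dispatching application is described to do.
class SiloRotation:
    def __init__(self, silos):
        self.silos = list(silos)
        self.next_index = 0

    def next_silo(self):
        silo = self.silos[self.next_index % len(self.silos)]
        self.next_index += 1
        return silo

    def mark_unresponsive(self, silo):
        self.silos.remove(silo)
        self.next_index = 0

rotation = SiloRotation(["silo-a", "silo-b"])
print(rotation.next_silo(), rotation.next_silo())  # silo-a silo-b
rotation.mark_unresponsive("silo-b")
print(rotation.next_silo())  # silo-a
```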
In this configuration, the Search Engines share access to a single search index. This
works because Search Engines are “read only” services which lock files that are in
use. All changes to the Search Index files are performed by the Index Engines.
When a Search Engine is using an index file, it keeps a file handle open – effectively
locking it. The Index Engines will not remove an index file until all Search Engines
remove their locks on a fragment. Because these locks are based on file handles in
the operating system, a Search Engine which crashes will not leave locks on files.
When Search Engines start, they load their status from the latest current checkpoint
and index files, and apply incremental changes from the accumlog and metalog files.
Because of this, no special steps are needed to ensure that Search Engines in each
silo are synchronized. They will automatically synchronize to the current version of
the index.
It is possible for an identical query sent to each silo at the same time to have minor
differences in the search results. The differences are rare, probably small, and short
lived – and would not be noticed or important for most applications. These potential
variances arise due to race conditions. The Search Engines in each silo update their
data autonomously. When an Index Engine updates the index files, perhaps adding
or modifying a number of objects, the Search Engines will independently detect the
change and update their data. For a short period of time, a given update to the
search index may be reflected in one of the search silos but not the other.
This approach to high availability for queries also allows many search grid
maintenance tasks to be performed on Search Federators or Search Engines without
disrupting search query availability. By stopping one silo, performing maintenance,
restarting the silo, and then repeating the process with the other silo, user queries are
not impacted throughout the process. Note that some administration tasks which
change fundamental configuration settings may not be possible without service
interruption.
An additional benefit of parallel silos is search throughput. Since applications such
as Content Server can distribute the query load across multiple silos, the overall
search performance might be higher. This will not be the case if the hardware on
which the search index is stored is a performance bottleneck, particularly the disk
which is shared by each silo.
For correct operation, each silo must have identical configuration settings. If you
have hand-edited any of the configuration files, you must ensure this is properly
reflected on both silos.
If you absolutely must have true high availability for indexing, this must be
implemented using technologies external to the search grid, with a combination of
configuration settings and external clustering hardware or software. The general
principle is that two completely separate search grids are created, the indexing
workflow is split and duplicated, and the indexes are independently created and
managed. This is an exercise pursued using products such as Microsoft Cluster
Server, and beyond the scope of this document.
Minimizing Metadata
Many Content Server applications index much more metadata than is actually used in
searches. Using the LLFieldDefinitions file to REMOVE metadata fields that will
never be used can minimize the RAM requirements.
Metadata Types
By default, metadata regions are created as type TEXT. Integer, ENUM and Boolean
types are more efficient, and using the LLFieldDefinitions file to pre-configure types
for these regions can reduce the RAM requirements.
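For illustration only – the region name OTObjectSize below is hypothetical, and the
one-line “TYPE RegionName” syntax follows the ExtraLLFieldDefinitionsLine examples
shown later in this document:

```
INT OTObjectSize
LONG OTBigNumber
```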
Redundancy
If you are building a high availability system with failover capabilities, the hardware
must be suitably duplicated.
Spare Capacity
In the event that there are maintenance outages, or a requirement to re-index
portions of your data, you will need spare CPU capacity to handle this situation.
Although OTSE is a solid product, indexing problems can happen – generally
incorrect configuration or network/disk errors, although (perish the thought) there are
occasionally bugs found. Sizing the hardware to meet the bare minimum operating
capacity won’t allow you any headroom to recover from problems.
Indexing Performance
As with all sizing exercises, making predictions is fraught with danger. Ignoring the
peril, our anecdotal experience is that the Index Engines can ingest more than 1
Gigabyte of IPool data per hour.
A specific example on a computer that we frequently use for performance testing:
• Windows 2008 operating system, 2 Intel X5660 CPUs, 16 Gbytes RAM
• Update Distributor
• 4 Index Engines / partitions
• Partition metadata size of 1000 Mbytes
• Index stored on a single SCSI local hard disk
• Predominantly English data flow
consumes more than 4 GB per hour, comprising nearly 200,000 objects added or
modified per hour. High-performance indexing is usually limited by disk I/O capacity.
Refer to the Hard Drive Storage section for more information.
Beyond about 4 partitions, the performance of the Update Distributor becomes a
factor, and you may need to ensure that the disk read capability for the indexing
IPools is adequate.
CPU Requirements
There is no single rule for the number of physical CPUs needed for a search grid.
Don’t rely on hyper-threading – physical CPUs are key. The requirement is directly
related to your performance expectations. Some of the variables you should bear in
mind are outlined here.
Most customers optimize for cost and have low CPU counts. This means that search
works, but user satisfaction with performance may be low.
Active searches are CPU intensive. If good search time performance is expected,
you should have at least 1 CPU per search engine. This is especially true if multiple
concurrent searches will be running.
Searches are bursty in nature. CPUs will sit idle until a search request arrives, then
saturate the system. Administrators will tend to look at the average CPU use over
time, and claim that utilization is low, therefore no additional CPUs are needed. They
are wrong. Check to see if CPU utilization hits high levels during active searches,
then plan your CPUs based on load during that period.
Search Agents (Intelligent Classification, Prospectors) place an additional load on the
Search Engines. If you are using these features heavily, you may need to allow for
some additional fractional CPU capacity. Search Agents run on a
schedule, so they have no impact most of the time, but a heavy potential impact
when run.
Indexing is expensive. If you need high indexing throughput, you should have at
least 1 CPU per active partition, plus 0.25 CPU per inactive partition, plus 1 CPU for
the Update Distributor. With low indexing throughput requirements, 1 CPU for 4
Index Engines may suffice.
In addition, spare capacity is needed on the Index Engines for events such as
running index backups, writing checkpoints, and performing background merge
operations. These operations are designed to limit activity to a subset of partitions
concurrently (by default, about 6). You can accept degraded indexing during these
periods or allocate additional CPUs.
As an example, suppose you want good search performance with many searches
being run (including searches for background RM disposition and hold), and expect
hundreds of thousands of indexing additions and updates every day, on a
medium-large system with 40 partitions (perhaps 500 million items) configured with 6
active partitions (the number of partitions that accept new data, write checkpoints,
and merge concurrently):
1 CPU – Update Distributor
6 CPUs – Active Index Engines
8 CPUs – Inactive Index Engines
40 CPUs – Search Engines with fast response
Assuming that indexing throughput can tolerate short slowdowns for background
operations, and with no extra headroom, over 50 CPUs is an appropriate size.
Conversely, the same system which can tolerate large backlogs for indexing
(perhaps catching up in the evenings), and is comfortable with users waiting 20
seconds on average for a search, can probably get by with 16 CPUs.
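The sizing rules above reduce to simple arithmetic. The sketch below follows the
worked example, including rounding the inactive-partition term down; actual sizing
depends on your workload:

```python
def estimate_cpus(total_partitions, active_partitions, search_engines):
    """Sketch of the sizing rule: 1 CPU per active partition, 0.25 CPU per
    inactive partition, 1 CPU for the Update Distributor, and 1 CPU per
    Search Engine. The inactive-partition term is rounded down, matching
    the worked example in the text."""
    inactive = total_partitions - active_partitions
    update_distributor = 1
    active_cpus = active_partitions        # 1 CPU per active partition
    inactive_cpus = int(inactive * 0.25)   # 0.25 CPU per inactive partition
    return update_distributor + active_cpus + inactive_cpus + search_engines
```

For the 40-partition example with 6 active partitions and 40 Search Engines, this
gives 1 + 6 + 8 + 40 = 55 CPUs.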
Maintenance
As with all sophisticated server software, there are a number of suggestions, best
practices and configurations that contribute to the long-term health and performance
of a search grid. This section outlines some of the considerations.
Log Files
Each OTSE component has the ability to generate log files. There are separate log
files for each instance of each component. The basic settings are:
Logfile=<SectionName>.log
RequestsPerLogFlush=1
IncludeConfigurationFilesInLogs=true
Logfile= specifies the path for logging (the file name is generated from the
component and the name of the partition). RequestsPerLogFlush specifies how
many logging events should be buffered before writing. A value of 1 is the least
performant, but does the best job of guaranteeing that log entries are written if a
process crashes unexpectedly.
At startup, information about the version of OTSE and the environment is recorded
in the form of copies of the main configuration files, and can be used to verify that the
correct versions of software are running. This can be disabled by setting
IncludeConfigurationFilesInLogs to false.
Log Levels
Each component writes its log files with a configurable level of detail. The log
level for each component of the search engine is separately configured in the
search.ini file:
DebugLevel=0
The available log levels are:
0 – Lowest level; “guaranteed logging” output still occurs
1 – Severe errors are logged
2 – All error conditions are logged
3 – Warnings are logged
4 – Significant status information is logged
5 – Information level; most detail
If you are experiencing problems that require diagnosis, setting the log level to 5 is
recommended. You do not need to restart the search engine processes to change
the DebugLevel, these are reloadable settings.
RMI Logging
The RMI logging section determines how the RMI Registry component performs
logging. It is defined in the General section, and the behavior is similar to the
descriptions above; however, the names of the settings in the search.ini file are different.
RMILogFile ---> Logfile
RMILogTreatment ---> CreationStatus
RMILogLevel ---> DebugLevel
Step 1
Ensure that the Index Engine and Search Engine for the partition are stopped. In
some cases, the processes might have started even though the index is corrupted.
For example, if only index offset files are corrupted, searching can still occur even
though further indexing is blocked.
Step 2
Check the IndexDirectory= setting in the search.ini file in the
[Partition_xxxx] section to be certain which directory you should work in.
Certain key files in the index partition directory need to be preserved and all other
files in the directory removed. The files that must be KEPT are:
Signature file (partition name with .txt extension, typically of the form
servernameX848474X999040X74657.txt)
ALL the .ini configuration files, which includes:
FieldModeDefinitions.ini
Backup process definition files
Step 3
Create an empty file in the partition index directory named createindex.ot. At this
point the directory should have only the INI files, the signature file, and
createindex.ot.
Step 4
Start the Index Engine. It will create a new, empty search index.
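Steps 2 and 3 can be sketched as a small script. This is illustrative only – it
assumes, per Step 2, that the signature .txt file and the .ini files are the only files to
keep, and it should be run only after the engines are stopped and a copy of the
directory has been taken:

```python
import os
import shutil

def reset_partition(index_dir):
    """Sketch only: keep the signature .txt and all .ini configuration files,
    remove everything else (including fragment sub-directories), then create
    the empty createindex.ot marker so the Index Engine builds a new index."""
    for name in os.listdir(index_dir):
        # Step 2: preserve the signature file and the .ini configuration files.
        if name.endswith(".ini") or name.endswith(".txt"):
            continue
        path = os.path.join(index_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)   # fragment directories such as "61", MODindex
        else:
            os.remove(path)       # checkpoint, metalog, accumlog, config files
    # Step 3: the empty marker file that requests a new, empty index.
    open(os.path.join(index_dir, "createindex.ot"), "w").close()
```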
Security Considerations
OTSE does not directly implement any application security measures. However, the
interfaces to the search components are well defined, and if necessary can be locked
down using standard computer and network security tools.
A quick checklist of security access points that should be considered if you are
contemplating securing access to OTSE and the index:
• Socket API ports
• RMI API ports
• Access to folders where OTSE stores the index on disk.
• Access to the configuration files – search.ini, search.ini_override,
llfielddefinitions.txt, fieldmodedefinitions.ini.
• Access to create indexing requests, written to an input IPool folder.
• Access to logging files or folders.
• Access to the search agents configuration file.
grant {
permission java.io.FilePermission "<<ALL FILES>>", "read, write, delete, execute";
};
In a large search grid, the Update Distributor will manage the number of Index
Engines that are creating backups concurrently, to ensure that CPU and disk
capacity are not abused.
The backup process does NOT require a search outage. Indexing and search
operations continue, subject to possible impacts of additional CPU and IO used by
the backup process. This method creates a complete backup of the grid. The
backup does not represent a single moment in time – each partition may have a
different capture time. The Index Transaction Logs can be used in conjunction with
the backups to reconstitute a current index from the backups.
There are several configuration settings that control the behavior of the backup
process. In the [UpdateDistributor_] section of the search.ini file:
BackupParentDir=c:/temp/backups
MaximumParallelBackups=4
BackupLabelPrefix=MyLabel
ControlDirectory=
KeepOldControlFiles=false
The BackupParentDir field specifies where the backups should be written. This must
be a drive mapping that is visible to all the admin servers running search indexing
processes. Within this directory, a sub-directory with the time the backup starts will
be created, and within that directory each Index Engine will create a directory using
the partition names to store the index. You must have enough space available to
capture a complete copy of the index. The MaximumParallelBackups setting
determines how many Index Engines can be running backups concurrently. This
number should reflect the CPU and disk capacity of your system. The
BackupLabelPrefix is optional and can be used by a controlling application to help
track status. The ControlDirectory is optional, allowing you to override the default
location for the control files used to manage the backup process. The
KeepOldControlFiles setting is included for completeness and is generally reserved
for running test scenarios. Except for the
ControlDirectory, these settings can be reloaded (changed without restart). However,
some of the settings are only used at the start of a backup, and best practice is to
make changes only when there is no backup running.
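As an illustration of the directory layout described above, the following sketch
composes a partition's backup path. The timestamp layout (YYYYMMDD_HHMMSS
plus milliseconds) is inferred from the sample directory names in this document, not
from a specification:

```python
import os
from datetime import datetime

def backup_partition_dir(parent_dir, started, partition_name):
    """Compose <BackupParentDir>/<start time>/<partition name>, using a
    timestamp layout inferred from the samples (e.g. 20190322_112519734)."""
    stamp = started.strftime("%Y%m%d_%H%M%S") + f"{started.microsecond // 1000:03d}"
    return os.path.join(parent_dir, stamp, partition_name)
```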
The admin port on the Update Distributor will listen for and respond to the following
commands related to creating backups:
backup
backup pause
backup resume
backup cancel
getstatustext
The backup command is used to start a new backup process. The cancel and pause
commands will complete writing backups for the partitions that have already been
instructed to create backup files. This may take several minutes, so status checks
include a “pausing” status (note that some partitions may still be writing their output
even though the status is “paused”). The resume command will continue a paused
backup. The response to backup commands is “true” if the command has been
accepted and acted upon, and “false” otherwise. The getstatustext command
returns an XML status report such as:
<BackupStatus>
<InBackup>InProgress</InBackup>
<BackupLabel>MyLabel_20190322_112519734</BackupLabel>
<TotalPartitionsToBackup>10</TotalPartitionsToBackup>
<PartitionsInBackup>4</PartitionsInBackup>
<PartitionsFinishedBackup>0</PartitionsFinishedBackup>
<BackupDir>C:\p4\search.ot7\main\obj\log\ot7testoutput\
BackupGridTest_testBackup5199Ten4\backups\
20190322_112519734</BackupDir>
<BackupMessage></BackupMessage>
</BackupStatus>
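A controlling application can consume this status XML with any XML parser. A
minimal Python sketch (the element names follow the sample above; the SAMPLE
value here uses a shortened, hypothetical BackupDir):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<BackupStatus>
  <InBackup>InProgress</InBackup>
  <BackupLabel>MyLabel_20190322_112519734</BackupLabel>
  <TotalPartitionsToBackup>10</TotalPartitionsToBackup>
  <PartitionsInBackup>4</PartitionsInBackup>
  <PartitionsFinishedBackup>0</PartitionsFinishedBackup>
  <BackupDir>C:/backups/20190322_112519734</BackupDir>
  <BackupMessage></BackupMessage>
</BackupStatus>"""

def parse_backup_status(xml_text):
    """Extract the fields a controlling application would poll on."""
    root = ET.fromstring(xml_text)
    text = lambda tag: (root.findtext(tag) or "").strip()
    return {
        "state": text("InBackup"),
        "label": text("BackupLabel"),
        "total": int(text("TotalPartitionsToBackup")),
        "in_progress": int(text("PartitionsInBackup")),
        "finished": int(text("PartitionsFinishedBackup")),
    }
```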
BackupStatus Completed
BackupTimestampString 20200528_123935489
BackupLabel SocketGridBase_BackupLabel_20200528_123935489
BackupDir
C:\p4\search.ot7\main\obj\log\ot7testoutput\BackupGridTest_testBackup5179a\b
ackups\20200528_123935489
NumPartitionsInThisBackup 2
NumBackupPartitionsCompleted 2
EndOfBackupRecord ----------------------------------------
Restoring Partitions
When restoring an index, the search partition(s) being restored must first be stopped.
Use file copy to restore the entire contents of the partition backup, then start the
Index Engine and Search Engine. The Transaction Logs can then be used to identify
missing transactions to bring the index up to date. Be sure you have Transaction
Logs enabled. As a convenience, entries are written to the Transaction Logs to mark
the point at which a backup occurred. The backup markers in the Transaction Log
have this form:
2018-06-11T20:49:57Z, Backup started,
backupDir="c:/temp/backups\20180608_132859489/partition1",
label="MyLabel_20180608_132859489-partition1", config="livelink.27"
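A restore script can locate the most recent marker with a simple scan. A hedged
sketch, assuming the marker format shown above:

```python
import re

def last_backup_marker(log_lines):
    """Return (backupDir, label) from the most recent 'Backup started' marker
    in a Transaction Log, or None if no marker is found."""
    result = None
    for line in log_lines:
        if "Backup started" not in line:
            continue
        backup_dir = re.search(r'backupDir="([^"]*)"', line)
        label = re.search(r'label="([^"]*)"', line)
        if backup_dir and label:
            result = (backup_dir.group(1), label.group(1))
    return result
```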
Backup – Method 2
Operating system file copy utilities can be used to back up the search index. All
search and index processes must be stopped for this approach to succeed. Ensure
that the entire contents of the index directories for each partition are copied. This
backup method is not supported with current updates of Content Server.
Differential Backup
The very first backup ever performed must necessarily be a full backup.
Subsequent backups can be differential backups or full backups depending on your
preference. A differential backup differs from a full backup in that it only makes a
copy of files that have changed from the last backup that was performed. These files
are:
• metaLog and accumLog: these change frequently. The backup always saves
these for both full and differential backups.
• checkpoint file: for some partitions this can be a large file (over a GB). It is
only copied if it has changed.
• sub-index fragments: new fragments are saved.
The differential backup reduces the amount of disk space required for the backup
and also reduces the time taken to make the backup. However, it does make the
restore process more complex, and requires that you keep a complete trail of
differential backups tracing back to a full backup.
Backup Process Overview
The backup and restore processes rely on special configuration files to control their
behavior and to record the status of the backups. As an administrator, you should
normally not modify these files. Content Server automatically generates these files
as needed for backups. This information is primarily for troubleshooting and as a
starting point for developers that are integrating index backup and restore into their
applications.
To run a full backup, a configuration file with the name ‘Full.ini’ must first be created
and placed in each partition folder. For a differential backup, a file with the name
‘Diff.ini’ must be created.
The backup utility is then run, which performs the backup operation on a single
partition.
On completion, the backup data is contained in a target folder, called FULL for a full
backup and DIFFx for a differential backup (where x is the order number of this
differential backup relative to the baseline full backup). The backup process also
creates a file called ‘backup.ini’, with copies in the source and backup target partition
folders.
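The FULL/DIFFx naming makes it straightforward to compute the order in which a
restore must apply backups, and to detect a broken chain. A sketch under that
assumption:

```python
import re

def restore_order(dirnames):
    """Given backup folder names, return the order a restore must apply them:
    FULL first, then DIFF1..DIFFn. Raises if the chain is incomplete."""
    diffs = sorted(int(m.group(1)) for d in dirnames
                   if (m := re.fullmatch(r"DIFF(\d+)", d)))
    if "FULL" not in dirnames:
        raise ValueError("no full backup in chain")
    if diffs != list(range(1, len(diffs) + 1)):
        raise ValueError("differential chain has gaps")
    return ["FULL"] + [f"DIFF{n}" for n in diffs]
```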
Sample Full.ini File
Note that the Diff.ini file is identical except for its name. The Full.ini file uses basic
Windows INI file syntax with a single section, [Backup]. Comments are injected here
for explanatory purposes (lines starting with a # symbol); these should not exist in
the actual file. In practice, the only values you may want to change are the log file
name and log level.
[Backup]
# AutoNew requests that a new folder is created if it
# does not already exist.
AutoNewDir=True
DelConfig=FALSE
# Index is the root location of the source index being backed up.
Index=F:/OpenText/cs1064main01/index/enterprise/index1
# Specify the names of regions that contain date and time values
# that can reasonably be expected to reflect object index dates.
IndexDateTag=OTCreateDate
IndexTimeTag=OTCreateTime
Related to this are the format codes that can be used in the label string. The codes
include:
%% – A percentage sign
%p – AM or PM
%P – AD or BC
Sample backup.ini File
[General]
# 0 status is good, other values are error codes
Status=0
DiffString=Differential
FullString=Full
[FULL]
CheckPointSize=624
MetaLogNumber=51
MetaLogOffset=0
AccumLogNumber=39
AccumLogOffset=0
I1=61
I1Size=447
I2=66
I2Size=39
TotalIndexSize=1109
Label=Enterprise_04_08_2011_Full_58863
Date=20110408 145139
MetaLogChkSum=524293
AccumLogChkSum=524293
CheckPointChkSum=206517074
I1ChkSum=15804739427
I2ChkSum=11071697352
ConfigChkSum=1160933350
Success=0
[DIFF2]
CheckPointSize=665
MetaLogNumber=53
MetaLogOffset=0
AccumLogNumber=41
AccumLogOffset=11785068
I1=69
I1Size=9
TotalIndexSize=674
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 150047
MetaLogChkSum=524293
AccumLogChkSum=258080884
CheckPointChkSum=1282731032
I1ChkSum=9500506248
ConfigChkSum=624209792
Success=0
[DIFF1]
CheckPointSize=664
MetaLogNumber=52
MetaLogOffset=4732284
AccumLogNumber=39
AccumLogOffset=5824292
TotalIndexSize=664
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 145644
MetaLogChkSum=1542696885
AccumLogChkSum=238343344
CheckPointChkSum=3018112456
ConfigChkSum=389190926
Success=0
Running the Backup Utility
Once the Full.ini or Diff.ini file is in place the backup utility can be run. The utility is
contained within the Search Engine, and documented in the Utilities section of this
document.
Validate
The final step is validation, in which the restored index is checked for integrity.
These stages do not automatically happen one after the other. The administrator or
the controlling application needs to initiate the steps sequentially after ensuring that
appropriate file preparation occurs.
The restore operation works on a single partition. Content Server provides a
mechanism to simplify the restore of the entire index, and prompts the administrator
to ensure the appropriate files and folders are available at each step. The syntax of
the restore utility is documented in the Utilities section of this document.
Restore.ini File
The restore.ini file is used for each stage of the restore procedure, and modified after
each stage. This file is the mechanism for transporting process information from one
phase to the next.
Before first running the analyze stage, a restore.ini file needs to be created that looks
like this:
[restore]
otbinpath=d:\opentext\bin
SourceDir=d:\llbackup\ent\incr18
destdir=d:\temprest
option=analyse
Once the analysis is complete, the restore.ini file will have been updated with
information about files that will be copied, and should look like this, without the added
comments and white space:
[restore]
OTBinPath=d:\opentext\bin
BackupIndexName=livelink
LogFilename=indexrestore.log
RestoreHistory=restore.ini
BackupHistory=backup.ini
DestDir=d:\temprest
SourceDir=d:\llbackup\ent\incr18
loglevel=1
# The insert option identifies that copy will take place next
option=insert
LastObjectSize=110750
LastObjectDate=20010426
Index Files
OTSE persists the search index on disk, in a specific hierarchy of folders and file
names. This section describes each of the folders and files and its purpose. A
typical listing for a search partition is shown below for reference. There is one such
folder for each partition.
servernameX848474X999040X74657.txt
accumlog.39
checkpoint.51
FieldModeDefinitions.ini
index.lck
livelink.280
livelink.ctl
metalog.51
topwords.100000
MODaccumlog.47
MODindex
\2
\\coreidx1.idx
\\coreidx2.idx
\\coreobj.dat
\\coreoff.dat
\\coreskip.idx
\\map
\\otheridx1.idx
\\otheridx2.idx
\\otherobj.dat
\\otheroff.dat
\\otherskip.idx
\\regionidx1.idx
\\regionobj.dat
\\regionoff.dat
\\regionskip.idx
\\updmask.dat
\3
\\ same
61
\coreidx1.idx
\coreidx2.idx
\coreobj.dat
\coreoff.dat
\coreskip.idx
\map
\otheridx1.idx
\otherobj.dat
\otheroff.dat
\otherskip.dat
\regionidx1.idx
\regionobj.dat
\regionoff.dat
\regionskip.dat
62
\ same
Signature File
The first file in the list, servernameXXXXX.txt, is technically not part of the search
index, and is not required for search or indexing operations. Content Server adds this
file so that the administration interfaces in Content Server can verify that related
Search Engines and Index Engines are referencing the same directories. After an
upgrade, older server names may persist in the file name; this is expected.
Checkpoint File
Upon startup, or upon resynchronization, Search Engines load their metadata image
from the checkpoint file, and then apply incremental changes from the metalogs.
It is possible for multiple checkpoint files to exist for a partition. Normally, this only
occurs for a short period, when a Search Engine is still using an older checkpoint file
after the Index Engine has created a new one. The Index Engines will reduce the
number of checkpoint files to one at the earliest safe opportunity.
Lock File
The Lock File is used by the Index Engine to indicate that this partition is in use. This
is a failsafe mechanism to ensure that multiple Index Engines will not attempt to use
the same data. In a properly configured system, this would not happen. The Lock
file provides additional insurance.
Control File
The Control File, named Livelink.ctl, is used by the Index Engines to record the name
of the current Config file. The Search Engines read this file to obtain the name of the
current Config file. To ensure atomic reads and writes, both the Index Engine and
Search Engine will lock this file when accessing it.
Top Words
This file is optional. Top Words are used to track which words in an index are candidates for
exclusion from TEXT queries because they are too common. The file is named
topwords.n, where n is one of 10000, 100000 or 1000000 – which reflects the
number of objects in the partition when the file was generated.
Config File
Named livelink.x, where x is an incrementing number. The config file contains
detailed information about the index fragments, working file offsets, file checksums,
and other parameters needed by the Index Engine and Search Engine to properly
interpret the index files.
A new Config file is written each time the Index Engine creates a new fragment or
generates a checkpoint. A Search Engine will place a non-exclusive lock on the
Config file which represents the accumlog and metalog files it is currently consuming.
The Index Engine will clean up older, unused Config files.
Metalogs
A metalog contains incremental updates to metadata. The Index Engine writes
updates to the metalogs, and occasionally creates a checkpoint file that rolls up all
the metalogs since the last checkpoint into a new checkpoint file.
Search engines consume updates from the metalog files to keep their copy of the
metadata current. When a metalog exceeds a configurable size, a new checkpoint is
created and a new metalog started. It is possible for multiple metalogs to exist for
short periods while the Search Engines consume older metalogs.
Object Files
The file coreobj.dat contains a list of all internal object IDs and pointers to the word
location lists in the offset file.
Offset File
The file coreoff.dat contains the lists of word offsets. These word offsets indicate to
the search engine the relative position of a word within an indexed object.
Skip File
The file coreskip.idx contains pointers into the offset file that allow the Search
Engine to quickly skip over large data sets.
Map File
The map file contains checksums that can be used to verify that the index fragment
files have not been corrupted. There is only one Map file per partition fragment.
MODCheck.x
This is the master file for the metadata values, and the target after a merge.
The value of x increments after each merge operation.
MODcheckLog.x
Changes to text values are recorded in this file until a merge operation
occurs.
MODpremerge.x+1
MODptrs.x+1
Files containing pointers used for recovery and playback during startup.
It is possible that multiple versions (values of .x) of these files may exist, especially if
a Search Engine is lagging in accepting updates from the Index Engine, or multiple
Search Engines exist.
Configuration Files
OTSE derives the bulk of its configuration settings from a number of files. In this
section, we review each of the files to convey the basic purpose of each.
Search.ini
Most settings for OTSE are contained within the search.ini file. There is one
search.ini file per Admin Server. In practice, this usually means one per physical
computer, although other permutations are possible.
When used with Content Server, the search.ini file is generated by Content Server.
Although Content Server may preserve some of the edit changes you might make to
the search.ini file, this is not guaranteed. In general, you should not edit this file.
Most of the entries are set by Content Server, and using the Content Server search
administration pages is the preferred method for interacting with this file.
If you must edit this file within a Content Server application, consider using the
search.ini_override file instead.
The search.ini file follows generally accepted conventions for the structure of a ‘.ini’
file.
The file consists of several configuration sections. Where sections contain settings
for a particular partition, the section name will include the partition name. Refer to
the Search.ini section of this document for detailed information on entries in the
Search.ini file.
Search.ini_override
This file is specifically designed to supplement or override any values set in the
search.ini file. Because the search.ini file is controlled by Content Server, editing the
search.ini file does not ensure that your changes will be preserved.
The override file is optional. When present, it need contain only those configuration
settings which you want to take precedence over the default settings or the settings
within the search.ini file.
There is a special value that can be used in override settings, the DELETE_OVERRIDE
value. When this value is encountered, it means that the explicit value for the setting
in the search.ini file should be ignored, and the default value used instead.
For example, the default value for CompactEveryNDays is 30. If the search.ini file
contains the setting:
CompactEveryNDays=100
But the search.ini_override file contains:
CompactEveryNDays=DELETE_OVERRIDE
Then the default value of 30 will be used.
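The precedence rules can be summarized in a few lines of Python. This is a sketch
of the documented behavior, not the actual implementation:

```python
DELETE_OVERRIDE = "DELETE_OVERRIDE"

def effective_setting(name, base, override, defaults):
    """Resolve a setting as the text describes: search.ini_override wins over
    search.ini, and DELETE_OVERRIDE falls back to the built-in default."""
    if name in override:
        if override[name] == DELETE_OVERRIDE:
            return defaults.get(name)
        return override[name]
    if name in base:
        return base[name]
    return defaults.get(name)
```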
Note that the override file may need to be edited any time the partition configuration
changes. The most common situation is that when you create new partitions, you will
need to add corresponding sections to the override file.
If you use automatic partition creation (such as date based partition creation) within
Content Server, you may have difficulty keeping the override file current with newly
created partitions, and the override file might not be a good choice for this type of
deployment.
Backup.ini
This is an optional configuration file which is used to set the parameters for index
backup operations and record the status of the last backup operation. You should
not normally modify this file. Refer to the section on index backup for more
information.
FieldModeDefinitions.ini
This file defines the storage modes for text metadata regions, and should be located
in the partition directory. There is one FieldModeDefinitions.ini file per partition.
Although each partition could have different settings, keeping them identical across
partitions is generally recommended, and within a Content Server environment this is
enforced. A FieldModeDefinitions.ini file has the following form:
[General]
NoAdd=DISK
ReadOnly=DISK
ReadWrite=RAM
[ReadWrite]
someRegion1=DISK
someRegion2=RAM
[ReadOnly]
someRegion1=RAM
someRegion3=DISK
[NoAdd]
someRegion1=DISK_RET
someRegion2=RAM
The General section defines the default storage mode for a text metadata region.
The ReadWrite, ReadOnly and NoAdd sections allow control over the storage of
specific regions; these entries take priority over the General section. The possible values
are DISK, RAM and DISK_RET. Refer to the section on text metadata storage for
details.
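The resolution order described above can be sketched as follows; this is an
illustration of the documented precedence, not OTSE's implementation:

```python
def storage_mode(region, access_mode, config):
    """Resolve the storage mode for a text region: a region-specific entry in
    the [ReadWrite]/[ReadOnly]/[NoAdd] section wins; otherwise the default
    for that access mode in [General] applies."""
    specific = config.get(access_mode, {})
    if region in specific:
        return specific[region]
    return config["General"][access_mode]
```

Using the sample FieldModeDefinitions.ini above, someRegion1 in ReadWrite mode
resolves to DISK, while someRegion3 (not listed under ReadWrite) falls back to the
General default of RAM.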
Within Content Server, the FieldModeDefinitions.ini file is created and managed by
Content Server, and should not be edited.
LLFieldDefinitions.txt
The field definitions file has several purposes. Experience indicates that most
customers do not understand or modify this file, which is unfortunate, since significant
performance and memory use benefits may be possible by reviewing and editing this
file BEFORE indexing your content. Once an index has been created, it is not
possible to change some of the settings in this file without generating startup errors.
One function of the file is to establish the type for each metadata region to be
indexed. Each region is tagged with a type such as:
• INT
• LONG
• TEXT
• DATETIME
• TIMESTAMP
• USER
• CHAIN
• AGGREGATE-TEXT
A second purpose for the field definitions file is to provide metadata parsing hints for
nested metadata regions. Using the NESTED operative, the input IPool parser can
ignore outer tags and extract and index the inner region elements.
The field definitions file also provides instructions for special handling of certain
region types. This includes dropping, removing, renaming and merging metadata
regions. You can also use the aggregate feature to create a new region comprised of
multiple text regions.
One field definitions file is required per Admin server. As a general rule, the field
definitions files should be identical across all Admin servers; differences will result in
inconsistent handling of regions between partitions.
Content Server does not edit, generate or manage this file. In general, changes to
this file must be done manually. There is one exception to this – the search.ini file
has a special setting for logically appending lines to the LLFieldDefinitions.txt file.
This allows limited control over the definitions from Content Server. For example, if
the search.ini file contained these two lines:
ExtraLLFieldDefinitionsLine0=CHAIN MyID UserID TwitterID FacebookID
ExtraLLFieldDefinitionsLine1=LONG OTBigNumber
Then at startup time, OTSE acts as if these lines existed at the end of the
LLFieldDefinitions.txt file:
CHAIN MyID UserID TwitterID FacebookID
LONG OTBigNumber
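The numbered-key convention can be modeled as a simple loop. A sketch, assuming
(as the example suggests) that numbering is consecutive from 0:

```python
def extra_field_definition_lines(settings):
    """Collect ExtraLLFieldDefinitionsLine0, Line1, ... in order; assumes
    consecutive numbering from 0, as in the example above."""
    lines = []
    i = 0
    while f"ExtraLLFieldDefinitionsLine{i}" in settings:
        lines.append(settings[f"ExtraLLFieldDefinitionsLine{i}"])
        i += 1
    return lines
```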
Content Server usually ships with two versions of this file – a standard version, and
one for use with Enterprise Library Services. Which version is used is determined by
a setting in the search.ini file:
FieldModeDefinitions=FieldModeDefinitions.ini
Detailed information about each of the functions and data types of the field mode
definitions file can be found in the section of this document which covers metadata
regions.
SEARCH.INI Summary
This section gathers together most of the accessible configuration values that can be
used in the search.ini file, or the search.ini_override file. There are a number of
additional values which are only used for specific debugging or testing purposes that
are not listed here. A number of these configuration values are covered in more
detail in relevant sections of this document.
Not all processes read all sections of the search.ini file. Content Server generates a
search.ini file for each process, and typically only includes the values needed by that
process. Note that Content Server-generated files do not include all possible entries;
default settings are common.
Default values are displayed in this section wherever possible. Annotations in this
section are indicated with a // at the beginning of the line – this is not syntax
supported in an actual search.ini file; it is used here as a documentation device.
The settings in the INI file are applied when the processes start. Changes to this file
may require a restart of some or all of the search grid in order to take effect. Some of
these values can be re-applied to a running process without a restart; refer to the
“Reloadable Settings” section for a list.
General Section
This section appears in every search.ini file. Its basic purpose is to share the
configuration settings for the RMI Grid Registry and the Admin Server with all
components. If RMI communication between grid components is not used, the
General section is ignored and is not required.
[General]
AdminServerHostName=localhost
// RMI Registry
RMIRegistryPort=1099
RMIPolicyFile=otrmi.policy
RMICodebase=../bin/otsearch.jar
RMIAdminPort=8997
Partition Section
The Partition section contains basic information about a partition, such as size,
memory usage preferences and mode of operation. The section name must include
the partition name after the underscore.
[Partition_]
AllowedNumConfigs=500 (-1 = none)
AccumulatorSizeInMBytes=30
PartitionMode=ReadWrite | ReadOnly | NoAdd | Retired
DataFlow Section
The DataFlow section contains the majority of configuration settings relating to how
data should be processed. The partition name must be appended to the section
name after the underscore.
[DataFlow_]
FieldDefinitionFile=LLFieldDefinitions.txt
FieldModeDefinitions=FieldModeDefinitions.ini
QueryTimeOutInMS=120000
SessionTimeOutInMS=216000
StatsTriggerThreshold=200
LastModifiedFieldName=OTModifyDate
// Time zone obtained from OS by default, you can set e.g +5 for EST
TimestampTimeZone=
// Accumulator configuration
ContentTruncSizeInMBytes=10
DumpOnInactiveIntervalInMS=3600000
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
DumpToDiskOnStart=false
AccumulatorBigDocumentThresholdInBytes=5000000
AccumulatorBigDocumentOverhead=10
CompleteXML=false
// Tokenizer
RegExTokenizerFile=otsearchtokenizer.txt
RegExTokenizerFileX=c:/config/tokenizers/partTKNZR.txt
TokenizerOptions=0
UseLikeForTheseRegions=
OverTokenizedRegions=
LikeUsesStemming=true
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true
// Facets
ExpectedNumberOfValuesPerFacet=16
ExpectedNumberOfFacetObjects=100000
MaximumFacetValueLength=32
UseFacetDataStructure=true
MaximumNumberOfValuesPerFacet=32767
NumberOfDesiredFacetValues=20
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
MaximumNumberOfCachedFacets=25
DesiredNumberOfCachedFacets=16
SubIndexCapSizeInMBytes=2147483647
// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
MaximumSubIndexArraySize=512
CompactEveryNDays=30
NeighbouringIndexRatio=3
ExtraDCSStartsWithNames=OTDoc,OTCA_,OTXMP_,OTCount_,OTMeta
DCSStartsWithNameExemptions=OTDocumentUserComment,OTDocumentUserExplanation
ExtrasWillOverride=false
// Handle bug where thumbnail requests were indexed as text
EnableWeakContentCheck=true
// Metadata defragmentation
DefragmentFirstSundayOfMonthOnly=0
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
DefragmentMaxStaggerInMinutes=60
DefragmentStaggerSeedToAppend=SEED
// Relevance tuning
ExpressionWeight=100
ObjectRankRanker=
ExtraWeightFieldRankers=
DateFieldRankers=
TypeFieldRankers=
DefaultMetadataFieldNamesCSL=
// Set true for minor query performance boost on older CS instances
ConvertREtoRelevancy=false
//
DiskRetSection=DISK_RET
FieldAliasSection=FAS_label
AutoAdjustDataIdBloomFilterSize=true
AutoAdjustDataIdBloomFilterMinAddsBetweenRebuilds=1048576
DisableDataIdPhraseOpt=false
[UpdateDistributor_]
// RMIServerPort not needed for direct socket connection mode
RMIServerPort=
AdminPort=
AllowRebalancingOfNoAddPartitions=false
IEUpdateTimeoutMilliSecs=3600000
MaxItemsInUpdateBatch=100
MaxBatchesPerIETransaction=1000
MaxBatchSizeInBytes=20000000
ReadOnlyConvertionBatchSize=1
// Retry and total wait time talking to UD, direct socket mode
WaitForTransactionMS=10000
MaxWaitForTransactionMS=600000
// logging
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1
[IndexEngine_]
AdminPort=
IndexDirectory=
// For direct (non RMI) a timeout between connection and first command
IEConnectionTimeoutInMS=10000
[SearchFederator_]
RMIServerPort=
AdminPort=
SearchPort=8500
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1
[SearchEngine_]
AdminPort=
IndexDirectory=
// Disk tuning values that you should leave alone unless you
// are having disk problems. Use cautiously.
UseSystemIOBuffers=true
MaximumNumberCachedIOBuffers=100
SizeInBytesIOBuffers=4096
DiskRet Section
This section allows use of the DISK_RET storage mode in older systems where
Content Server does not support DISK_RET configuration in the search
administration pages. Normally, this section should only be present in a
search.ini_override file. CS10 Update 3 and later put this information into the
FieldModeDefinitions.ini file instead.
[DiskRetSection]
RegionsOnReadWritePartitions=
RegionsOnNoAddPartitions=
RegionsOnReadOnlyPartitions=
[SearchAgent_]
operation=OTProspector | OTClassify
[FAS_label]
From=to
// example
Author=OTUserName
IndexMaker Section
These settings control low-level details of index construction, and should not be
changed unless you understand exactly what you are doing. In general, this section
is not present in a search.ini file, and the default values are used.
[IndexMaker]
ObjectSkip=32
ObjectUseRLE=true
ObjectUseNyble=true
OffsetSkip=16
OffsetUseRLE=true
OffsetUseNyble=true
SmallestIndexIndexSizeInBytes=1048576
IndexingPartitionFactor=256
Reloadable Settings
A subset of the search.ini settings can be applied to search processes that are
already running. This feature is triggered using the “reloadSettings” command over
the admin API port. The search.ini settings applied at reload are:
Common Values
These values are reloadable in the Update Distributor, Index Engines, Search
Federator and Search Engines.
Logfile
RequestsPerLogFlush
CreationStatus
DebugLevel
LogSizeLimitInMBytes
MaxLogFiles
MaxStartupLogFiles
IncludeConfigurationFilesInLogs
NumberOfFileRecoveryAttempts
LargeObjectPartition
ObjectSizeThresholdInBytes
BlockBackupIfThisFileExists
BlockStartTransactionIfThisFileExists
If using RMI…
RMIRegistryPort
RMIPolicyFile
RMICodebase
AdminServerHostName
PolicyFile
Search Engines
DefaultMetadataFieldNamesCSL
DefragmentMemoryOptions
DefragmentSpaceInMBytes
DefragmentDailyTimes
DefragmentMaxStaggerInMinutes
DefragmentStaggerSeedToAppend
SkipMetadataSetOfEqualValues
MetadataConversionOptions
ExpressionWeight
ObjectRankRanker
ExtraWeightFieldRankers
DateFieldRankers
TypeFieldRankers
UseOldStem
HitLocationRestrictionFields
FieldAliasSection
DefaultMetadataAttributeFieldNames
SystemDefaultSortLanguage
SortingSequences
PrecomputeFacetsCSL
MaximumNumberOfCachedFacets
DesiredNumberOfCachedFacets
TextNumberOfWordsInSet=15
TextUseTermSet=true
TextPercentage=80
Update Distributor
MaxItemsInUpdateBatch
MaxBatchSizeInBytes
MaxBatchesPerIETransaction
NumOfMergeTokens
RunAgentIntervalInMS
** The list of partitions is also reloaded from the section names in the Update
Distributor, allowing partitions to be added without restarts.
Although Search Agent definitions are not included in this list, changes to the Search
Agents do not require a restart. Search Agents use another mechanism for updates;
refer to the section on Search Agents for details.
Tokenizer Mapping
Earlier in this document, the Tokenizer section references various character
mappings. For reference, a detailed list of character mappings performed by the
tokenizer is included below. If a character is not included in this table, it is not
mapped – it is added to the index as itself.
The leftmost character in each row (and its hexadecimal Unicode value) represents
the output character(s) of the mapping. The remaining values following the colon
represent a list of source characters that are mapped to that output character. Each
of these source characters in the list is separated by a comma, with Unicode values
in parentheses.
ѝ (45d): Ѝ (40d)
ў (45e): Ў (40e)
џ (45f): Џ (40f)
а (430): А (410)
б (431): Б (411)
в (432): В (412)
г (433): Г (413)
д (434): Д (414)
е (435): Е (415)
ж (436): Ж (416)
з (437): З (417)
и (438): И (418)
й (439): Й (419)
к (43a): К (41a)
л (43b): Л (41b)
м (43c): М (41c)
н (43d): Н (41d)
о (43e): О (41e)
п (43f): П (41f)
р (440): Р (420)
с (441): С (421)
т (442): Т (422)
у (443): У (423)
ф (444): Ф (424)
х (445): Х (425)
ц (446): Ц (426)
ч (447): Ч (427)
ш (448): Ш (428)
щ (449): Щ (429)
ъ (44a): Ъ (42a)
ы (44b): Ы (42b)
ь (44c): Ь (42c)
э (44d): Э (42d)
ю (44e): Ю (42e)
я (44f): Я (42f)
ا (627): آ (622), أ (623), إ (625), ٵ (675), ﴼ (fd3c), ﴽ (fd3d), (fe75), ﺁ (fe81), ﺂ (fe82), ﺃ (fe83), ﺄ (fe84), ﺇ (fe87), ﺈ (fe88), ﺍ (fe8d), ﺎ (fe8e)
و (648): ؤ (624), ٶ (676), ﺅ (fe85), ﺆ (fe86), ﻭ (feed), ﻮ (feee)
ي (64a): ئ (626), ى (649), ٸ (678), ﯨ (fbe8), ﯩ (fbe9), ﱝ (fc5d), ﲐ (fc90), ﺉ (fe89), ﺊ (fe8a), ﺋ (fe8b), ﺌ (fe8c), ﻯ (feef), ﻰ (fef0), ﻱ (fef1), ﻲ (fef2), ﻳ (fef3), ﻴ (fef4)
ه (647): ة (629), ﳙ (fcd9), ﺓ (fe93), ﺔ (fe94), ﻩ (fee9), ﻪ (feea), ﻫ (feeb), ﻬ (feec)
0 (30): ٠ (660), ۰ (6f0), ０ (ff10)
1 (31): ١ (661), ۱ (6f1), １ (ff11)
2 (32): ٢ (662), ۲ (6f2), ２ (ff12)
3 (33): ٣ (663), ۳ (6f3), ３ (ff13)
4 (34): ٤ (664), ۴ (6f4), ４ (ff14)
5 (35): ٥ (665), ۵ (6f5), ５ (ff15)
6 (36): ٦ (666), ۶ (6f6), ６ (ff16)
7 (37): ٧ (667), ۷ (6f7), ７ (ff17)
8 (38): ٨ (668), ۸ (6f8), ８ (ff18)
9 (39): ٩ (669), ۹ (6f9), ９ (ff19)
ۇ (6c7): ٷ (677), (fbc7), ﯗ (fbd7), ﯘ (fbd8), ﯝ (fbdd)
ە (6d5): ۀ (6c0), ﮤ (fba4), ﮥ (fba5), ﯀ (fbc0)
ロ (30ed): ロ (ff9b)
ン (30f3): ン (ff9d)
゛ (309b): ゙ (ff9e)
゜ (309c): ゚ (ff9f)
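Applying a table of this kind amounts to a character-for-character translation step
before tokens reach the index. As an illustration only, using a few of the rows above
(OTSE's actual implementation is not published):

```python
# Each row: (output character, [source characters mapped to it]).
MAPPING_ROWS = [
    ("\u0430", ["\u0410"]),                      # а (430): А (410)
    ("\u0431", ["\u0411"]),                      # б (431): Б (411)
    ("0", ["\u0660", "\u06f0", "\uff10"]),       # 0 (30): Arabic-Indic and fullwidth zeros
]

# Flatten the rows into a str.translate() table keyed by code point.
TRANSLATION = {ord(src): out for out, sources in MAPPING_ROWS for src in sources}

def normalize(text: str) -> str:
    """Map source characters to their indexed form; characters not in the
    table pass through unchanged, as described above."""
    return text.translate(TRANSLATION)
```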
Additional Information
Version history and selected built-in utilities.
Version History
This section of the document identifies which updates of Search Engine 10 and 10.5
contain new features or material changes in behavior. It is not comprehensive, but
lists the more notable changes.
Search Engine 10
Released with Content Server 10, approximately September 2010. The versions of
the search engine prior to this release were generally referred to as OT7.
• Added support for key-value attributes in text metadata, used for multi-lingual
metadata indexing and search.
• Added Hindi, Tamil and Telugu to the standard tokenizer.
• New percent full model with “soft” update-only mode and rebalancing.
• Defragmentation of metadata storage.
• Added ModifyByQuery.
• Added DeleteByQuery.
• Added Disk Retrieval Storage mode.
• Bi-gram indexing of far-east character sets. May require re-indexing of existing
content with far-east character sets.
• Faster ‘stemming’ focused on noun plurals.
• Content Status feature added.
• Synthetic regions: partition name and mode.
• Change bad metadata to record error instead of halting.
• Search Federator closes connections from inactive clients.
• Rolling log file support added.
• Various bug fixes
• Support for Java 6 (Update 20)
Error Codes
Errors and warnings from OTSE may be exposed in multiple ways. Process Error
codes are responses to communications. Detailed information about errors is
normally contained in the log files. The chart below lists many of the possible
Process Error codes. This is not a comprehensive list.
Update Distributor
Code Description
129 Unable to load JNI library. To read or write IPools, OTSE leverages
Content Server libraries. This file is named jniipool.dll (Windows) or
jniipool.so and is expected to reside in the <OTHOME>\bin directory.
131 Insufficient memory. The memory can be adjusted using the -Xmx
parameter on the command line. Content Server exposes this
control in its administration pages.
173 Index is full. All Index Engines report they are unable to accept new
objects.
Index Engine
Code Description
180 Index failed to start. In some cases, this error is acceptable if the
Index Engine is already running.
181 Request to start the Index Engine has been ignored because an
index restore operation is in progress.
Search Federator
Code Description
Search Engine
Code Description
Utilities
OTSE contains a number of built-in utilities and diagnostic tools. These are often
used by OpenText support staff and developers when analyzing and testing an index.
Many of these will have limited value for customers, but may be of assistance when
diagnosing particular index problems. For convenience, basic documentation for
some of the more common utilities is included here.
Many of the utilities are NOT a supported feature of the product. They are not
guaranteed to work as described, and may be modified or removed at any time.
General Syntax
The utilities are invoked by launching the search JAR using appropriate parameters.
The general syntax is:
java [-Xmx#M] -classpath <othome>\bin\otsearch.jar
com.opentext.search.tools.<subclasspath>
[parameters]
Where:
<othome>\bin is the file path where the search JAR file is located.
Backup
The backup utility is used to create either differential or full backups of a partition.
Refer to the section on Backup and Restore for more information.
java -classpath otsearch.jar com.opentext.search.backup.Backup
-inifile J:\index\Diff.ini
Where the inifile identifies the backup configuration file to be used.
Restore
The restore utility is used to restore an index from a prior backup. Refer to the
section on Backup and Restore for more information.
Where the inifile identifies the restore.ini file to be used. You may need to run the
restore process many times. Using the utility directly is not for the faint of heart, and
you should probably let Content Server manage this for you.
DumpKeys
The DumpKeys utility attempts to generate a list of all the object IDs for objects in the
partition. This is often a tool of last resort for repairing a corrupted index. The
DumpKeys tool will sometimes be able to extract data from a partition which is
otherwise unreadable.
The input to dumpkeys is the search.ini file and partition information, and the output
is a file of object IDs. Sample output looks like this:
c DataId=41280133&Version=1
c DataId=41280132&Version=1
c DataId=41280131&Version=1
The first character details where the object ID was found. If in the checkpoint file, the
first character is a ‘c’ (as in the example above). If an object ID was found in the
metalog file (recently indexed), the first character reflects the operation type:
n: new
a: add
r: replace
m: modify
d: delete
Invoking DumpKeys:
java -Xmx2000M -Xss10M -cp .\otsearch.jar
com.opentext.search.tools.analysis.DumpKeys -inifile <path_to_search.ini>
-sectionName <IE_or_SE_Section_Name> -log <Path_to_log_file>
-output <Path_to_DumpKeys_Output>
Parameters:
path_to_search.ini: Path to the search.ini file, typically /config/search.ini
IE_or_SE_Section_Name: The full section name including the SearchEngine_ or
IndexEngine_ prefix.
Path_to_DumpKeys_Output: Path to where the output file should be created.
Path_to_log_file: Path to where the log file should be created.
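Each output line can be split into its marker and key/value pairs with a small helper.
This Python sketch is illustrative only, based on the sample output shown above:

```python
def parse_dumpkeys_line(line):
    """Split a DumpKeys output line into its marker (checkpoint 'c' or an
    operation type letter) and a dict of key/value pairs.
    Example input: "c DataId=41280133&Version=1"."""
    marker, _, rest = line.strip().partition(" ")
    fields = dict(pair.split("=", 1) for pair in rest.split("&"))
    return marker, fields
```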
VerifyIndex
This utility performs internal checks of the structure of the index. Levels 1 through 5
are cumulative, and level 10 is a distinct operation. Parameters are:
-level K -config SearchIniFile -indexengine IEName
[-outFile OutFile] [-html true] [-verbose true]
SubIndex Statistics
Index Statistics
RebuildIndex
This utility rebuilds the dictionary and index for metadata in a partition. This is
possible because an exact copy of the metadata is stored in the checkpoint files.
This does not affect the full text index. This utility can often be used to repair errors
detected by a Level 10 VerifyIndex.
Parameters:
Where
SearchIniFile is the location and name of the search.ini file which should be used.
IEName is the name of the partition which should be rebuilt.
Because this utility needs to build and load the entire index, you may need to ensure
an appropriate -Xmx (memory allocation) parameter is specified on the Java
command line.
LogInterleaver
Each component of the search grid – index and search engines, search federators
and the update distributor – creates its own log files. It can be difficult to trace a
single operation through multiple log files. The LogInterleaver function combines
multiple log files into a single log file, ordering entries according to their time stamps
to simplify interpretation. The output file has a slightly different syntax – each line of
output is prefixed by the original log file name.
Parameters:
-d logDir | -o outputFile
OR
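The interleaving can be sketched in Python. This illustration assumes every log line
begins with a fixed-width, sortable timestamp (the stream names below are
hypothetical):

```python
import heapq

def interleave_logs(named_streams):
    """Merge already time-ordered log streams into one sequence, prefixing
    each line with its source name. Assumes each line starts with a
    sortable (fixed-width) timestamp so string comparison orders them."""
    def tagged(name, stream):
        for line in stream:
            yield (line, name)
    # heapq.merge keeps the overall order without loading whole files.
    merged = heapq.merge(*(tagged(n, s) for n, s in named_streams))
    return [f"{name}: {line}" for line, name in merged]
```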
tools.analysis.ConvertDateFormat
Log files from the search components have a time stamp in milliseconds from a
reference date. This utility will convert a log file to have human-readable time/date
values instead, which can be helpful when interpreting the logs manually.
This utility is somewhat unusual in that it reads from console input and writes to
console output, so the typical usage is to “pipe” the source log file into the java
command line and redirect the output to a target file.
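The conversion the utility performs can be approximated in Python. This sketch
assumes the reference date is the Unix epoch and that each log line begins with the
millisecond value; both are assumptions for illustration:

```python
from datetime import datetime, timezone

def convert_line(line):
    """Replace a leading milliseconds timestamp with a readable UTC
    date/time. Treating the reference date as the Unix epoch is an
    assumption; lines without a leading number pass through unchanged."""
    ts, sep, rest = line.partition(" ")
    if not ts.isdigit():
        return line
    when = datetime.fromtimestamp(int(ts) / 1000.0, tz=timezone.utc)
    # Keep millisecond precision: trim microseconds to three digits.
    return when.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3] + sep + rest
```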
com.opentext.search.tokenizer.LivelinkTokenizer
This utility enters a console loop. You enter one line of text, and it responds by
printing out each search token generated on a separate line. Control-C will terminate
the loop.
Optional command line parameters:
-TokenizerOptions <Number> -tokenizerfile <RegExParserFile>
Where Number represents the bitwise controls for tokenizer options, as defined in
the Tokenizer section of this document. The tokenizerfile parameter specifies
an optional or custom tokenizer definition that may be used.
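Conceptually, the console loop behaves like the following Python stand-in. The
simple word-character rule here is only an illustration, not the rule set from
otsearchtokenizer.txt:

```python
import re

# Illustrative rule only; the real tokenizer is driven by a rich rule file.
TOKEN_PATTERN = re.compile(r"\w+")

def tokenize(line):
    """Return lowercased tokens for one input line, one per entry."""
    return [t.lower() for t in TOKEN_PATTERN.findall(line)]

def console_loop():
    """Read a line, print one token per line, until Control-C or EOF."""
    try:
        while True:
            for token in tokenize(input()):
                print(token)
    except (EOFError, KeyboardInterrupt):
        pass
```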
ProfileMetadata
This utility function loads a checkpoint file, and writes information about the metadata
in the checkpoint to the console. You may wish to redirect the console output to a file
to capture the data.
Parameters:
Where:
l: profile level, where 0=High Level, 1=Field Level (default), 2=Field Part Level
values: true requests the number of objects with values and the estimated total
memory requirement
checkpointFile: file name of the checkpoint file to be profiled
Refer to sample output fragments for the profile levels below.
Level 0:
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
Level 1:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
2060 Field(Text):OTDocCompany
1996 Field(Text):OTDocRevisionNumber
10668 Field(Text):OTVerCDate
0 Field(Text):OTReservedByName
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
Level 2:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
1376 Field(Text [RAM]):OTDocCompany dictionary (mappingEntries=0 wsEntries=1
tokenEntries=3)
256 Field(Text [RAM]):OTDocCompany content
428 Field(Text [RAM]):OTDocCompany index
2060 Field(Text [RAM]):OTDocCompany combined
1312 Field(Text [RAM]):OTDocRevisionNumber dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
256 Field(Text [RAM]):OTDocRevisionNumber content
428 Field(Text [RAM]):OTDocRevisionNumber index
1996 Field(Text [RAM]):OTDocRevisionNumber combined
10668 Field(Date):OTVerCDate combined
684 Field(Date):OTDateEffective combined
1312 Field(Text [RAM]):OTContentIsTruncated dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
33920 Field(Text [RAM]):OTContentIsTruncated content
428 Field(Text [RAM]):OTContentIsTruncated index
35660 Field(Text [RAM]):OTContentIsTruncated combined
Field(UserID):OTAssignedTo combined
…
Field(Integer):OTTimeCompleted combined
0 Field(UserLogin):OTReservedByName combined
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
If the parameter “values” is true, the information for each region is considerably more
detailed:
tools.index.DiskReadWriteSpeed
The search configuration files allow you to control several aspects of file I/O. Tuning
these for optimal performance can be difficult, since many factors are involved. The
DiskReadWriteSpeed utility can help by simulating disk performance using several of
the available configurations. For each mode, this utility performs 32678 iterations of
the test using an 8 KB block of data. Note that this information can help you tune disk
performance or identify system I/O bottlenecks, but it is not necessarily sufficient to
draw a firm conclusion regarding the optimal configuration.
Parameters:
(write|read|both) TestDirectory
The operations tested are:
SearchClient
The SearchClient is a console application that allows you to interactively issue
commands to the Search Federator. The SearchClient is useful for determining that
search is working as expected, or running queries without having an application such
as Content Server running. All console output is expressed in UTF-8 characters.
Note that you might need to raise the default Search Federator timeout values when
using the SearchClient.
It is possible to use the SearchClient with an index that is also being used in a live
production system. In this situation, an open SearchClient consumes a search
transaction from the available pool, reducing the transactions available to the
production system.
Parameters:
-host SFHost -port SearchPort [-adminport SFAdminPort] [-time
true] [-echo true] [-pretty true]
SFHost is the URI for the target Search Federator, connected on SearchPort. The
-time true parameter adds response time information to each response.
The -echo parameter will add the input command to the output. This is useful when
redirecting input from a file for batch operations, so you can associate the commands
with the responses. By default, echo is false.
The -pretty parameter will use an alternate formatting of GET RESULTS. The
alternate format does not adhere to the API spec, but is better formatted for human
readability when developing or debugging.
The -csv true parameter will output the results in a form that can be easily imported
into a spreadsheet (comma-separated values). This feature is most useful when
redirecting input and output from/to files. If -pretty is specified, it takes precedence
over -csv.
The -adminport setting enables specific commands to be interpreted and sent to the
administration port of the Search Federator. These admin commands are:
Problem Illustration
subindex1 has internal IDs = 1,2,3,4,5,6,8,9
subindex2 has internal IDs = 5,7,8,9,10
BaseOffset problem: subindex1 should only contain 1,2,3,4. Internal IDs 5,6,8 and 9
overlap with subindex2.
Fix: cut 5,6,8 and 9 from subindex1.
Items 5, 8, and 9 already exist as duplicates in subindex2. However, item 6 only
exists in subindex1, so the fix would remove the only instance of item 6 from the
index content.
After fix:
subindex1New: 1,2,3,4
subindex2: 5,7,8,9,10
Output of DumpSubIndexes before fix: ids for subindex1, subindex2 and deleteMask
Output of RepairSubIndexes: a file which lists the objects removed from subindex1
(5, 6, 8 and 9) along with their external IDs for re-indexing.
Output of Diff tool: a file which only lists object 6 along with its external ID for re-
indexing.
Output of DumpSubIndexes after fix: ids for subindex1New, subindex2 and
deleteMask
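The relationship between the RepairSubIndexes output and the smaller Diff tool
output amounts to a set difference, sketched here with a hypothetical helper that
mirrors the worked example above:

```python
def objects_needing_reindex(cut_ids, surviving_subindex_ids):
    """Return the IDs removed from the bad sub-index whose content is not
    present in any surviving sub-index; only these need re-indexing."""
    surviving = set().union(*surviving_subindex_ids)
    return sorted(set(cut_ids) - surviving)

# Worked example: 5, 6, 8, 9 were cut; subindex1New and subindex2 survive.
# Only item 6 exists nowhere else, so only item 6 needs re-indexing.
# objects_needing_reindex([5, 6, 8, 9], [[1, 2, 3, 4], [5, 7, 8, 9, 10]]) → [6]
```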
Repair Option 1
This approach requires about 30 to 60 minutes for a typical partition, and makes the
index usable as quickly as possible. However, there may be a lot of objects that
need to be reindexed.
Running the RepairSubIndexes utility
java -classpath otsearch.jar
com.opentext.search.tools.index.RepairSubIndexes
-level x -config search.ini -indexengine firstEngine
Steps
1. Back-up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.
If the partition is healthy, the utility will produce a report and exit.
If the utility detects a problem other than the “baseOffset” problem, it will
warn and exit.
Otherwise it will perform the repair. This can take 30-60 minutes depending
on the size of the sub-index that is being fixed. The utility will produce an
output file bearing the name of the sub-index that was fixed. This file contains
the internal-external objectID (OTObject region value) pairs that can be
utilized for re-indexing.
3. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.
Additional Comments:
While running the tools, it is strongly recommended that the output be
redirected out to a file for easier analysis (… > repairoutput.txt).
During the repair process, it is possible to navigate inside the directory where
the index under repair sits. It is possible to observe the new sub-index
fragment being written out, growing larger in size over time.
At the end of the process, the new sub-index will be slightly smaller than the
original sub-index.
The output file is written to the same directory as the index that is being
repaired (same location where new fragment is made)
Repair Option 2
This method typically requires about 45 minutes longer per partition, but minimizes
the number of objects which may require re-indexing.
Running the RepairSubIndexes utility
example (assuming that both the new otsearch.jar and otsearch-util.jar are in the
current directory):
• where dir is the index directory where all the output files were written out
• where deleteIDsFile is the output file made by the RepairSubIndexes utility for
the sub-index that was fixed
• where subIndexIDsFile is the appropriate output file made by
DumpSubIndexesIDs utility. It is crucial to use the correct file; if we have
subindex1 and subindex2 with overlap and subindex1 was cut out, then use the
DumpSubIndexesIDs file for subindex2.
example:
The minimum search.ini sections necessary to run this tool are the Index Engine
section, DataFlow section and Partition section. Any file paths mentioned in these
sections should be adjusted to point to the actual location of your index partition
directory in your environment.
Steps
1. Back-up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.
If the partition is healthy, the utility will produce a report and exit.
If the utility detects a problem other than the “baseOffset” problem, it will warn
and exit.
Otherwise it will perform the repair. This can take 30-60 minutes depending on
the size of the sub-index that is being fixed. The utility will produce an output file
bearing the name of the sub-index that was fixed. This file contains the
internal-external object ID (OTObject region value) pairs that can be utilized for
re-indexing.
3a. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
3b. Run the DumpSubIndexesIDs utility after repair. This will generate a date-
stamped file for each sub-index. The file contains all the internal-external IDs for
each sub-index.
3c. Run the DiffObjectIDFiles tool (this only takes a few minutes). This will produce a
smaller set of objects to re-index. This set contains objects whose content was
cut from the bad sub-index and whose content is NOT contained anywhere else
in the partition.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.
Additional Comments:
While running the tools, it is strongly recommended that the output be redirected
out to a file for easier analysis (… > repairoutput.txt).
During the repair process, it is possible to navigate inside the directory where the
index under repair sits. It is possible to observe the new sub-index fragment
being written out, growing larger in size over time.
At the end of the process, the new sub-index will be slightly smaller than the
original sub-index.
The output file is written to the same directory as the index that is being repaired
(same location where new fragment is made).
About OpenText
OpenText enables the digital world, creating a better way for organizations to work with information, on premises or in the
cloud. For more information about OpenText (NASDAQ: OTEX, TSX: OTC) visit opentext.com.
Connect with us:
www.opentext.com
Copyright © 2021 Open Text SA or Open Text ULC (in Canada).
All rights reserved. Trademarks owned by Open Text SA or Open Text ULC (in Canada).