Cloudera Search User Guide
Important Notice
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or
slogans contained in this document are trademarks of Cloudera and its suppliers or licensors,
and may not be copied, imitated or used, in whole or in part, without the prior written
permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
All other trademarks, registered trademarks, product names and company names or logos
mentioned in this document are the property of their respective owners. Reference to any
products, services, processes or other information, by trade name, trademark, manufacturer,
supplier or otherwise does not constitute or imply endorsement, sponsorship or
recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting
the rights under copyright, no part of this document may be reproduced, stored in or
introduced into a retrieval system, or transmitted in any form or by any means (electronic,
mechanical, photocopying, recording, or otherwise), or for any purpose, without the express
written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual
property rights covering subject matter in this document. Except as expressly provided in
any written license agreement from Cloudera, the furnishing of this document does not
give you any license to these patents, trademarks, copyrights, or other intellectual property.
The information in this document is subject to change without notice. Cloudera shall not
be liable for any damages resulting from technical errors or omissions which may be present
in this document, or from use of this document.
Cloudera, Inc.
1001 Page Mill Road, Building 2
Palo Alto, CA 94304-1008
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
This guide explains how to configure and use Cloudera Search. This includes topics such as extracting,
transforming, and loading data, establishing high availability, and troubleshooting.
Cloudera Search documentation also includes:
• Cloudera Search Installation Guide
Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical
users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or
programming skills to use Cloudera Search because it provides a simple, full-text interface for searching.
Another benefit of Cloudera Search, compared to stand-alone search solutions, is the fully integrated data
processing platform. Search uses the flexible, scalable, and robust storage system included with CDH. This
eliminates the need to move larger data sets across infrastructures to address business tasks.
Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell.
Cloudera Search 1.x is tightly integrated with Cloudera's Distribution, including Apache Hadoop (CDH) and is
included with CDH 5. Cloudera Search provides these key capabilities:
• Near-real-time indexing
• Batch indexing
• Simple, full-text data exploration and navigated drill down
Using Search with the CDH infrastructure provides:
• Simplified infrastructure
• Better production visibility
• Quicker insights across various data types
• Quicker problem resolution
• Simplified interaction with the ability to open the platform to more users and use cases
• Scalability, flexibility, and reliability of search services on the same platform where you run other types of
workloads on the same data
Typically, the time between data ingestion using the Flume sink and that content potentially appearing in search results is on the
order of seconds, though this duration is tunable. The Lily HBase Indexer uses Solr to index data stored in HBase.
As HBase applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the
HBase table contents, using standard HBase replication features. The indexer supports flexible custom
application-specific rules to extract, transform, and load HBase data into Solr. Solr search results can contain
columnFamily:qualifier links back to the data stored in HBase. This way applications can use the Search
result set to directly access matching raw HBase cells. Indexing and searching do not affect operational stability
or write throughput of HBase because the indexing and searching processes are separate and asynchronous
to HBase.
Cloudera Search provides built-in parsing and preprocessing of standard data formats using Morphlines. This built-in support simplifies index configuration for these formats, and the same configurations can be reused for
other applications such as MapReduce jobs.
HBase Search
Cloudera Search integrates with HBase, enabling full-text search of data stored in HBase. This functionality,
which does not affect HBase performance, is based on a listener that monitors the replication event stream.
The listener captures each replicated write or update event, enabling extraction and mapping. The event is
then sent directly to Solr indexers, deployed on HDFS, and written to indexes in HDFS, using the same process
as for other indexing workloads of Cloudera Search. The indexes can then immediately be served, enabling near
real time search of HBase data.
Cloudera Search opens CDH to full-text search and exploration of data in HDFS and Apache HBase. Cloudera
Search is powered by Apache Solr, enriching the industry standard open source search solution with Hadoop
platform integration, enabling a new generation of Big Data search. Cloudera Search makes it especially easy
to query large data sets.
Search Architecture
Search runs as a distributed service on a set of servers, and each server is responsible for some portion of the
entire set of content to be searched. The entire set of information to be searched is split into smaller pieces,
copies are made of these pieces, and the pieces are distributed among the servers. This provides two main
advantages:
• Dividing the content into smaller pieces distributes the task of indexing the content among the servers.
• Duplicating the pieces of the whole allows queries to be scaled more effectively and makes it possible
for the system to provide higher levels of availability.
Each Search server can handle requests for information. This means that a client can send requests to index
documents or to carry out searches to any Search server, and that server routes the request to the correct
Search server.
Ingestion
Content can be moved to CDH through techniques such as using:
• Flume, a flexible, agent-based data ingestion framework.
• A copy utility such as distcp for HDFS.
• Sqoop, a structured data ingestion connector.
• fuse-dfs.
In a typical environment, administrators establish systems for search. For example, HDFS is established to
provide storage, and Flume or distcp is established for content ingestion. Once administrators establish these
services, users can use ingestion technologies such as file copy utilities or Flume sinks.
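For example, one simple way to stage content for later indexing is to copy files into HDFS with the hdfs dfs utility. The paths below are placeholders; substitute directories appropriate for your environment:
$ hdfs dfs -mkdir -p /user/$USER/indir
$ hdfs dfs -put /local/path/to/documents/*.pdf /user/$USER/indir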
Indexing
Content must be indexed before it can be searched. Indexing comprises a set of steps:
• ETL Steps — Extraction, Transformation, and Loading (ETL) is handled using existing engines or frameworks
such as Apache Tika or Cloudera Morphlines.
– Content and metadata extraction.
– Schema mapping.
Indexes are typically stored on a local file system. Lucene supports additional index writers and readers. One
such index interface is HDFS-based and has been implemented as part of Apache Blur. This index interface has
been integrated with Cloudera Search and modified to perform well with CDH-stored indexes. All index data in
Cloudera Search is stored in HDFS and served from HDFS.
There are three ways to index content:
Cloudera Search includes a Flume sink that includes the option to directly write events to the indexer. This sink
provides a flexible, scalable, fault tolerant, near real time (NRT) system for processing continuous streams of
records, creating live-searchable, free-text search indexes. Typically it is a matter of seconds from data ingestion
using the Flume sink to that content potentially appearing in search results, though this duration is tunable.
The Flume sink has been designed to meet the needs of identified use cases that rely on NRT availability. Data
can flow from multiple sources through multiple Flume nodes. These nodes, which can be spread across a
network, route this information to one or more Flume indexing sinks. Optionally, you can split the data flow,
storing the data in HDFS while also writing it to be indexed by Lucene indexers on the cluster. In that scenario
data exists both as data and as indexed data in the same storage infrastructure. The indexing sink extracts
relevant data, transforms the material, and loads the results to live Solr search servers. These Solr servers are
then immediately ready to serve queries to end users or search applications.
This system is flexible and customizable, and provides a high level of scaling as parsing is moved from the Solr
server to the multiple Flume nodes for ingesting new content.
Search includes parsers for a set of standard data formats including Avro, CSV, Text, HTML, XML, PDF, Word, and
Excel. While many formats are supported, you can extend the system by adding additional custom parsers for
other file or data formats in the form of Tika plug-ins. Any type of data can be indexed: a record is a byte array
of any format, and parsers and custom ETL logic can be established for any data format.
In addition, Cloudera Search comes with a simplifying Extract-Transform-Load framework called Cloudera
Morphlines that can help adapt and pre-process data for indexing. This eliminates the need for specific parser
deployments, replacing them with simple commands.
Cloudera Search has been designed to efficiently handle a variety of use cases.
• Search supports routing to multiple Solr collections as a way of making a single set of servers support multiple
user groups (multi-tenancy).
• Search supports routing to multiple shards to improve scalability and reliability.
• Index servers can be either co-located with live Solr servers serving end user queries or they can be deployed
on separate commodity hardware, for improved scalability and reliability.
• Indexing load can be spread across a large number of index servers for improved scalability, and indexing
load can be replicated across multiple index servers for high availability.
This is a flexible, scalable, highly available system that provides low latency data acquisition and low latency
querying. Rather than replacing existing solutions, Search complements use-cases based on batch analysis of
HDFS data using MapReduce. In many use cases, data flows from the producer through Flume to both Solr and
HDFS. In this system, both NRT ingestion and batch analysis tools can be used.
NRT indexing using some other client that uses the NRT API
Documents written by a third party directly to HDFS can trigger indexing using the Solr REST API. This API can
be used to complete a number of steps:
1. Extract content from the document contained in HDFS where the document is referenced by a URL.
2. Map the content to fields in the search schema.
3. Create or update a Lucene index.
This could be useful if you do indexing as part of a larger workflow. For example, you might choose to trigger
indexing from an Oozie workflow.
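For example, the following sketch submits a single document to Solr's ExtractingRequestHandler (Solr Cell), which extracts its content, maps it to schema fields, and updates the index. The host, collection name, document ID, and file path are placeholders; adapt them to your deployment:
$ curl "http://$SOLRHOST:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@/tmp/sample-document.pdf"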
Querying
Once data has been made available as an index, you can execute queries directly against the query API provided
by the Search service, or through a third party such as a command line tool or graphical interface.
Cloudera Search provides a simple UI application that can be deployed with Hue, but it is just as easy to develop
a custom application that fits your needs, based on the standard Solr API. Because Solr is the core, any application
that works with Solr is compatible with Cloudera Search and can serve as a search-serving application for it.
The topics in this tutorial document assume you have completed the instructions in the Cloudera Search
Installation Guide.
This tutorial first describes the following preparatory steps:
• Validating the Deployment with the Solr REST API
• Preparing to Index Data
The two tutorial topics, which cover indexing strategies, are:
• Batch Indexing Using MapReduce
• Near Real Time (NRT) Indexing Using Flume and the Solr Sink
These tutorials use a modified schema.xml and solrconfig.xml file. In the versions of these files included
with the tutorial, unused fields have been removed for simplicity. Note that the original versions of these files
include many additional options. For information on all available options, including those that were not required
for the tutorial, see the Solr wiki:
• SchemaXml
• SolrConfigXml
Indexing Data
Begin by indexing some data to be queried later. Sample data is provided in the installed packages. Replace
$SOLRHOST in the example below with the name of any host running the Solr process.
$ cd /usr/share/doc/solr-doc*/example/exampledocs
$ java -Durl=http://$SOLRHOST:8983/solr/collection1/update -jar post.jar *.xml
Running Queries
Once you have indexed data, you can run a query.
To run a query:
1. Open the following link in a browser: http://$SOLRHOST:8983/solr.
Note: Replace $SOLRHOST with the name of any host running the Solr process.
Note: Set wt to json and select the indent option in the web GUI to see more human-readable
output.
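You can run the same query from the command line. For example, the following request, with the host and collection name as placeholders, returns all indexed documents as indented JSON:
$ curl "http://$SOLRHOST:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true"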
Next Steps
Consider indexing more data using the Solr REST API or move to batch indexing with MapReduce or NRT indexing
with Flume. To learn more about Solr capabilities, consider reviewing the Apache Solr Tutorial.
5. Verify the collection is live. For example, for the localhost, use http://localhost:8983/solr/#/~cloud.
6. Prepare the configuration layout for use with MapReduce:
$ cp -r $HOME/solr_configs3 $HOME/collection3
7. Locate input files suitable for indexing, and check that the directory exists. This example assumes you are
running the following commands as a user $USER with access to HDFS.
9. Collect HDFS/MapReduce configuration details. You can download these from Cloudera Manager or use
/etc/hadoop, depending on your installation mechanism for the Hadoop cluster. This example uses the
configuration found in /etc/hadoop/conf.cloudera.mapreduce1. Substitute the correct Hadoop
configuration path for your cluster.
2. Run the MapReduce job using the GoLive option. Be sure to replace $NNHOST and $ZKHOST in the command
with your NameNode and ZooKeeper hostnames and port numbers, as required. Note that you do not need
to specify --solr-home-dir because the job accesses it from ZooKeeper.
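The following sketch shows the general shape of such a command. The jar location, Hadoop configuration directory, morphline path, output directory, and collection name are assumptions; adjust them for your environment:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /path/to/log4j.properties \
  --morphline-file /path/to/morphline.conf \
  --output-dir hdfs://$NNHOST:8020/user/$USER/outdir \
  --zk-host $ZKHOST:2181/solr \
  --collection collection3 \
  --go-live \
  hdfs://$NNHOST:8020/user/$USER/indir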
For command line help on how to run a Hadoop MapReduce job, use the following command:
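Assuming the same jar location as in the sketch above (an assumption based on a package installation), the help text can be printed with:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool --help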
Note: For development purposes, use the MapReduceIndexerTool --dry-run option to run in
local mode and print documents to stdout, instead of loading them to Solr. Using this option
causes the morphline to execute in the client process without submitting a job to MapReduce.
Executing in the client process provides quicker turnaround during early trial and debug sessions.
Note: To print diagnostic information, such as the content of records as they pass through the
morphline commands, consider enabling TRACE log level. For example, you can enable TRACE log
level diagnostics by adding the following to your log4j.properties file:
log4j.logger.com.cloudera.cdk.morphline=TRACE
The log4j.properties file can be passed via the MapReduceIndexerTool --log4j command
line option.
2. Run the Hadoop MapReduce job. Be sure to replace $NNHOST in the command with your NameNode hostname
and port number, as required.
3. Check the job tracker status. For example, for the localhost, use http://localhost:50030/jobtracker.jsp.
4. Once the job completes, check the generated index files. Individual shards are written to the results directory
with names of the form part-00000, part-00001, part-00002. There are only two shards in this example.
6. List the host name folders used as part of the path to each index in the SolrCloud cluster.
Note: You are moving the index shards to the two servers you set up in Preparing to Index
Data on page 18.
Near Real Time (NRT) Indexing Using Flume and the Solr Sink
The following section describes how to use Flume to index tweets. Before beginning this process, you must
have:
• Completed the process of Preparing to Index Data.
• Installed the Flume Solr Sink for use with Cloudera Search as described in Installing Flume Solr Sink for use
with Cloudera Search.
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
zkHost : "127.0.0.1:2181/solr"
JAVA_OPTS="-Xmx500m"
6. (Optional) You can configure the location at which Flume finds Cloudera Search dependencies for Flume Solr
Sink using SEARCH_HOME. For example, if you installed Flume from a tarball package, you can configure it to
find required files by setting SEARCH_HOME. To set SEARCH_HOME use a command of the form:
$ export SEARCH_HOME=/usr/lib/search
agent.sources.twitterSrc.consumerKey = YOUR_TWITTER_CONSUMER_KEY
agent.sources.twitterSrc.consumerSecret = YOUR_TWITTER_CONSUMER_SECRET
agent.sources.twitterSrc.accessToken = YOUR_TWITTER_ACCESS_TOKEN
agent.sources.twitterSrc.accessTokenSecret = YOUR_TWITTER_ACCESS_TOKEN_SECRET
Generate these four codes using the Twitter developer site by completing the following steps:
1. Sign in to https://dev.twitter.com with a Twitter account.
2. Select My applications from the drop-down menu in the top-right corner, and Create a new application.
3. Fill in the form to represent the Search installation. This can represent multiple clusters, and does not require
the callback URL. Because this will not be a publicly distributed application, the name, description, and website
(required fields) do not matter much except to the owner.
4. Click Create my access token at the bottom of the page. You may have to refresh to see the access token.
Substitute the consumer key, consumer secret, access token, and access token secret into flume.conf. Consider
this information confidential, just like your regular Twitter credentials.
To enable authentication, ensure the system clock is set correctly on all nodes where Flume connects to Twitter.
Options for setting the system clock include installing NTP and keeping the host synchronized by running the
ntpd service or manually synchronizing using the command sudo ntpdate pool.ntp.org. Confirm time is
set correctly by ensuring the output of the command date --utc matches the time shown at
http://www.time.gov/timezone.cgi?UTC/s/0/java. You can also set the time manually using the date command.
3. Monitor progress in the Flume log file and watch for any errors:
$ tail -f /var/log/flume-ng/flume.log
After restarting the Flume agent, use the Cloudera Search GUI. For example, for the localhost, use
http://localhost:8983/solr/collection3/select?q=*%3A*&sort=created_at+desc&wt=json&indent=true
to verify that new tweets have been ingested into Solr. Note that the query sorts the result set such that the
most recently ingested tweets are at the top, based on the created_at timestamp. If you rerun the query, new
tweets show up at the top of the result set.
To print diagnostic information, such as the content of records as they pass through the morphline commands,
consider enabling TRACE log level. For example, you can enable TRACE log level diagnostics by adding the following
to your log4j.properties file:
log4j.logger.com.cloudera.cdk.morphline=TRACE
$ curl --data-binary
@/usr/share/doc/search-0.1.4/examples/test-documents/sample-statuses-20120906-141433-medium.avro
'http://127.0.0.1:5140?resourceName=sample-statuses-20120906-141433-medium.avro'
--header 'Content-Type:application/octet-stream' --verbose
$ cat /var/log/flume-ng/flume.log
3. Delete any old spool directory and create a new spool directory:
$ rm -fr /tmp/myspooldir
$ sudo -u flume mkdir /tmp/myspooldir
5. Send a file containing tweets to the SpoolingDirectorySource. Use the copy-then-atomic-move file system
trick to ensure no partial files are ingested:
$ sudo -u flume cp
/usr/share/doc/search*/examples/test-documents/sample-statuses-20120906-141433-medium.avro
/tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro
$ sudo -u flume mv /tmp/myspooldir/.sample-statuses-20120906-141433-medium.avro
/tmp/myspooldir/sample-statuses-20120906-141433-medium.avro
$ cat /var/log/flume-ng/flume.log
$ find /tmp/myspooldir
Use the Cloudera Search GUI. For example, for the localhost, use
http://localhost:8983/solr/collection3/select?q=*%3A*&wt=json&indent=true to verify that new
tweets have been ingested into Solr.
Solrctl Reference
Use the solrctl utility to manage a SolrCloud deployment, completing tasks such as manipulating SolrCloud
collections, SolrCloud collection instance directories, and individual cores.
A SolrCloud collection is the top level object for indexing documents and providing a query interface. Each
SolrCloud collection must be associated with an instance directory, though note that different collections can
use the same instance directory. Each SolrCloud collection is typically sharded and replicated among
several SolrCloud instances. Each shard replica is called a SolrCloud core and is assigned to an individual SolrCloud
node. The assignment process is managed automatically, though users can apply fine-grained control over each
individual core using the core command. A typical deployment workflow with solrctl consists of deploying
ZooKeeper coordination service, deploying solr-server daemons to each node, initializing the state of the
ZooKeeper coordination service using the init command, starting each solr-server daemon, generating an instance
directory, uploading it to ZooKeeper, and associating a new collection with the name of the instance directory.
In general, if an operation succeeds, solrctl exits silently with a success exit code. If an error occurs, solrctl
prints a diagnostic message along with a failure exit code.
You can execute solrctl on any node that is configured as part of the SolrCloud. To execute any solrctl
command on a node outside of SolrCloud deployment, ensure that SolrCloud nodes are reachable and provide
--zk and --solr command line options.
The solrctl commands init, instancedir, and collection affect the entire SolrCloud deployment and are
executed only once per required operation.
The solrctl core command affects a single SolrCloud node.
Syntax
You can initialize the state of the entire SolrCloud deployment and each individual node within the SolrCloud
deployment using solrctl. The general solrctl command syntax is of the form:
solrctl [options] command [command-arg] [command [command-arg]] ...
These elements and their possible values are described in the following sections.
Options
If options are provided, they must precede commands (an example follows the list):
• --solr solr_uri: Directs solrctl to a SolrCloud web API available at a given URI. This option is required
for nodes running outside of SolrCloud. A sample URI might be: http://node1.cluster.com:8983/solr.
• --zk zk_ensemble: Directs solrctl to a particular ZooKeeper coordination service ensemble. This option
is required for nodes running outside of SolrCloud. For example:
node1.cluster.com:2181,node2.cluster.com:2181/solr.
• --help: Prints help.
• --quiet: Suppresses most solrctl messages.
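For example, to run solrctl from a machine outside the SolrCloud deployment, point it at the cluster explicitly. The host names below are placeholders:
$ solrctl --zk node1.cluster.com:2181,node2.cluster.com:2181/solr \
  --solr http://node1.cluster.com:8983/solr \
  instancedir --list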
Commands
• init [--force]: The init command, which initializes the overall state of the SolrCloud deployment, must
be executed before starting solr-server daemons for the first time. Use this command cautiously as it is a
destructive command that erases all SolrCloud deployment state information. After a successful initialization,
it is impossible to recover any previous state.
• instancedir [--generate path] [--create name path] [--update name path] [--get name
path] [--delete name] [--list]: Manipulates the instance directories. The following options are
supported:
– --generate path: Generates a template of the instance directory. The template is
stored at the given path on the local filesystem and has its configuration files under conf/. See Solr's
README.txt for the complete layout.
– --create name path: Pushes a copy of the instance directory from the local filesystem to SolrCloud. If an
instance directory is already known to SolrCloud, this command fails. See --update for changing name
paths that already exist.
– --update name path: Updates the existing SolrCloud copy of an instance directory based on the files
present on the local filesystem. This can be thought of as first running --delete name followed by --create
name path.
– --get name path: Downloads the named collection instance directory to the given path on the local filesystem.
Once downloaded, the files can be edited further.
– --delete name: Deletes the instance directory name from SolrCloud.
– --list: Prints a list of all available instance directories known to SolrCloud.
• core [--create name [-p name=value]...] [--reload name] [--unload name] [--status name]:
Manipulates cores. This is one of the two commands that you can execute on a particular SolrCloud node.
Use this expert command with caution. The following options are supported:
– --create name [-p name=value]...: Creates a new core on a given SolrCloud node. The core is
configured using name=value pairs. For more details on configuration options, see the Solr documentation.
– --reload name: Reloads a core.
– --unload name: Unloads a core.
– --status name: Prints status of a core.
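As a minimal sketch of the deployment workflow described at the start of this reference, you might generate an instance directory, upload it, and create a collection backed by it. The paths, collection name, and the shard-count flag for collection creation are assumptions; check the command help for your release:
$ solrctl instancedir --generate $HOME/solr_configs
$ solrctl instancedir --create collection1 $HOME/solr_configs
$ solrctl collection --create collection1 -s 2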
Cloudera Search provides the ability to batch index documents using MapReduce jobs.
If you did not install the MapReduce tools required for Cloudera Search, do so now on nodes where you want
to submit a batch indexing job, as described in Installing MapReduce Tools for use
with Cloudera Search.
For information on tools related to batch indexing, see:
• MapReduceIndexerTool
• HdfsFindTool
MapReduceIndexerTool
MapReduceIndexerTool is a MapReduce batch job driver that takes a morphline, creates a set of Solr index
shards from a set of input files, and writes the indexes into HDFS in a flexible, scalable, and fault-tolerant manner.
It also supports merging the output shards into a set of live customer-facing Solr servers, typically a SolrCloud.
More details are available through the command line help:
Required arguments:
--output-dir HDFS_URI HDFS directory to write Solr indexes to. Inside
there one output directory per shard will be
generated. Example: hdfs://c2202.mycompany.
com/user/$USER/test
--morphline-file FILE Relative or absolute path to a local config file
that contains one or more morphlines. The file
must be UTF-8 encoded. Example:
/path/to/morphline.conf
Cluster arguments:
Arguments that provide information about your Solr cluster.
--zk-host STRING The address of a ZooKeeper ensemble being used
by a SolrCloud cluster. This ZooKeeper ensemble
will be examined to determine the number of
output shards to create as well as the Solr URLs
to merge the output shards into when using the --
go-live option. Requires that you also pass the
--collection to merge the shards into.
The --zk-host option implements the same
partitioning semantics as the standard SolrCloud
Near-Real-Time (NRT) API. This enables to mix
batch updates from MapReduce ingestion with
updates from standard Solr NRT ingestion on the
same SolrCloud cluster, using identical unique
document keys.
Format is: a list of comma separated host:port
pairs, each corresponding to a zk server.
Example: '127.0.0.1:2181,127.0.0.1:
2182,127.0.0.1:2183' If the optional chroot
suffix is used the example would look like:
'127.0.0.1:2181/solr,127.0.0.1:2182/solr,
127.0.0.1:2183/solr' where the client would be
rooted at '/solr' and all paths would be
relative to this root - i.e.
getting/setting/etc... '/foo/bar' would result
in operations being run on '/solr/foo/bar' (from
the server perspective).
Go live arguments:
Arguments for merging the shards that are built into a live Solr
cluster. Also see the Cluster arguments.
--go-live Allows you to optionally merge the final index
shards into a live Solr cluster after they are
built. You can pass the ZooKeeper address with --
zk-host and the relevant cluster information
will be auto detected. (default: false)
--collection STRING The SolrCloud collection to merge shards into
when using --go-live and --zk-host. Example:
collection1
--go-live-threads INTEGER
Tuning knob that indicates the maximum number of
live merges to run in parallel at one time.
(default: 1000)
Generic options supported are
--conf <configuration FILE>
specify an application configuration file
-D <property=value> use value for given property
--fs <local|namenode:port>
specify a namenode
--jt <local|jobtracker:port>
specify a job tracker
--files <comma separated list of files>
specify comma separated files to be copied to
the map reduce cluster
--libjars <comma separated list of jars>
specify comma separated jar files to
include in the classpath
-D 'mapred.child.java.opts=-Xmx500m' \
--log4j src/test/resources/log4j.properties \
--morphline-file
../search-core/src/test/resources/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://c2202.mycompany.com/user/$USER/test \
--zk-host zk01.mycompany.com:2181/solr \
--collection collection1 \
--go-live \
hdfs:///user/foo/indir
HdfsFindTool
HdfsFindTool is essentially the HDFS version of the Linux file system find command. The command walks one
or more HDFS directory trees, finds all HDFS files that match the specified expression, and applies selected
actions to them. By default, it simply prints the list of matching HDFS file paths to stdout, one path per line.
The output file list can be piped into MapReduceIndexerTool using its --inputlist
option.
More details are available through the command line help:
-group groupname
Evaluates as true if the file belongs to the specified
group.
-mtime n
-mmin n
Evaluates as true if the file modification time subtracted
from the start time is n days (or minutes if -mmin is used)
-name pattern
-iname pattern
Evaluates as true if the basename of the file matches the
pattern using standard file system globbing.
If -iname is used then the match is case insensitive.
-newer file
Evaluates as true if the modification time of the current
file is more recent than the modification time of the
specified file.
-nogroup
Evaluates as true if the file does not have a valid group.
-nouser
Evaluates as true if the file does not have a valid owner.
-perm [-]mode
-perm [-]onum
Evaluates as true if the file permissions match that
specified. If the hyphen is specified then the expression
shall evaluate as true if at least the bits specified
match, otherwise an exact match is required.
The mode may be specified using either symbolic notation,
eg 'u=rwx,g+x+w' or as an octal number.
-print
-print0
Always evaluates to true. Causes the current pathname to be
written to standard output. If the -print0 expression is
used then an ASCII NULL character is appended.
-prune
Always evaluates to true. Causes the find command to not
descend any further down this directory tree. Does not
have any affect if the -depth expression is specified.
-replicas n
Evaluates to true if the number of file replicas is n.
-size n[c]
Evaluates to true if the file size in 512 byte blocks is n.
If n is followed by the character 'c' then the size is in bytes.
-type filetype
Evaluates to true if the file type matches that specified.
The following file type values are supported:
'd' (directory), 'l' (symbolic link), 'f' (regular file).
-user username
Evaluates as true if the owner of the file matches the
specified user.
The following operators are recognised:
expression -a expression
expression -and expression
expression expression
Logical AND operator for joining two expressions. Returns
true if both child expressions return true. Implied by the
juxtaposition of two expressions and so does not need to be
explicitly specified. The second expression will not be
applied if the first fails.
! expression
-not expression
Evaluates as true if the expression evaluates as false and
vice-versa.
expression -o expression
expression -or expression
Logical OR operator for joining two expressions. Returns
true if one of the child expressions returns true. The
second expression will not be applied if the first returns
true.
-help [cmd ...]: Displays help for given command or all commands if none
is specified.
-usage [cmd ...]: Displays the usage for given command or all commands if none
is specified.
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied
to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to
include in the classpath.
-archives <comma separated list of archives> specify comma separated archives
to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
Example: Find all files that match all of the following conditions (an example command follows the list):
• The file is contained somewhere in the directory tree hdfs:///user/$USER/solrloadtest/twitter/tweets
• The file name matches the glob pattern 'sample-statuses*.gz'
• The file was last modified less than 1440 minutes (that is, 24 hours) ago
• The file size is between 1 MB and 1 GB
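A command matching these conditions might look like the following sketch; the jar location is an assumption based on a package installation:
$ hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool -find \
  hdfs:///user/$USER/solrloadtest/twitter/tweets -type f \
  -name 'sample-statuses*.gz' -mmin -1440 \
  -size -1000000000c -size +1000000c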
The Flume Solr Sink is a flexible, scalable, fault tolerant, transactional, Near Real Time (NRT) oriented system for
processing a continuous stream of records into live search indexes. Latency from the time of data arrival to the
time of data showing up in search query results is on the order of seconds and is tunable.
Completing Near Real-Time (NRT) indexing requires the Flume Solr Sink. If you did not install that earlier, do so
now, as described in Installing Flume Solr Sink for use with Cloudera Search.
Data flows from one or more sources through one or more Flume nodes across the network to one or more
Flume Solr Sinks. The Flume Solr Sinks extract the relevant data, transform it, and load it into a set of live Solr
search servers, which in turn serve queries to end users or search applications.
The ETL functionality is flexible and customizable using chains of arbitrary morphline commands that pipe
records from one transformation command to another. Commands to parse and transform a set of standard
data formats such as Avro, CSV, Text, HTML, XML, PDF, Word, or Excel are provided out of the box, and additional
custom commands and parsers for other file or data formats can be added as morphline plug-ins. This is
done by implementing a simple Java interface that consumes a record, such as a file in the form of an InputStream
along with some headers and contextual metadata. This record is used to generate zero or more output records.
Any kind of data format can be indexed and any Solr documents for any kind of Solr schema can be generated,
and any custom ETL logic can be registered and executed.
Routing to multiple Solr collections is supported to improve multi-tenancy. Routing to a SolrCloud cluster is
supported to improve scalability. Flume SolrSink servers can be either co-located with live Solr servers serving
end user queries, or Flume SolrSink servers can be deployed on separate industry standard hardware for improved
scalability and reliability. Indexing load can be spread across a large number of Flume SolrSink servers for
improved scalability. Indexing load can be replicated across multiple Flume SolrSink servers for high availability,
for example using Flume features such as Load balancing Sink Processor.
This system provides low latency data acquisition and low latency querying. It complements (rather than replaces)
use-cases based on batch analysis of HDFS data using MapReduce. In many use cases, data flows simultaneously
from the producer through Flume into both Solr and HDFS using Flume features such as optional replicating
channels to replicate an incoming flow into two output flows. Both near real time ingestion as well as batch
analysis tools are used in practice.
For a more comprehensive discussion of the Flume Architecture see Large Scale Data Ingestion using Flume.
Once Flume is configured, start Flume as detailed in Flume Installation.
See the Cloudera Search Tutorial for exercises that configure and run a Flume SolrSink to index documents.
Flume Morphline Solr Sink provides the following configuration options in the flume.conf file:
For example, here is a flume.conf section for a SolrSink for the agent named "agent":
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = memoryChannel
agent.sinks.solrSink.batchSize = 100
agent.sinks.solrSink.batchDurationMillis = 1000
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sinks.solrSink.morphlineId = morphline1
Note: The examples in this document use a Flume MemoryChannel to easily get started. For production
use it is often more appropriate to configure a Flume FileChannel instead, which is a high performance
transactional persistent queue.
morphlineId (default: null): Name used to identify a morphline when there are multiple morphlines in a
morphline config file.
For example, here is a flume.conf section for a MorphlineInterceptor for the agent named "agent":
agent.sources.avroSrc.interceptors = morphlineinterceptor
agent.sources.avroSrc.interceptors.morphlineinterceptor.type =
org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile =
/etc/flume-ng/conf/morphline.conf
agent.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1
Note: Currently a morphline interceptor cannot generate more than one output record for each input
event.
prefix "" The prefix string constant to prepend to each generated UUID.
For example, here is a flume.conf section for an HTTPSource with a BlobHandler for the agent named "agent":
agent.sources.httpSrc.type = org.apache.flume.source.http.HTTPSource
agent.sources.httpSrc.port = 5140
agent.sources.httpSrc.handler = org.apache.flume.sink.solr.morphline.BlobHandler
agent.sources.httpSrc.handler.maxBlobLength = 2000000000
agent.sources.httpSrc.interceptors = uuidinterceptor
agent.sources.httpSrc.interceptors.uuidinterceptor.type =
org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent.sources.httpSrc.interceptors.uuidinterceptor.headerName = id
#agent.sources.httpSrc.interceptors.uuidinterceptor.preserveExisting = false
#agent.sources.httpSrc.interceptors.uuidinterceptor.prefix = myhostname
agent.sources.httpSrc.channels = memoryChannel
For example, here is a flume.conf section for a SpoolDirectorySource with a BlobDeserializer for the agent
named "agent":
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /tmp/myspooldir
agent.sources.spoolSrc.ignorePattern = \.
agent.sources.spoolSrc.deserializer =
org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spoolSrc.deserializer.maxBlobLength = 2000000000
agent.sources.spoolSrc.batchSize = 1
agent.sources.spoolSrc.fileHeader = true
agent.sources.spoolSrc.fileHeaderKey = resourceName
agent.sources.spoolSrc.interceptors = uuidinterceptor
agent.sources.spoolSrc.interceptors.uuidinterceptor.type =
org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent.sources.spoolSrc.interceptors.uuidinterceptor.headerName = id
#agent.sources.spoolSrc.interceptors.uuidinterceptor.preserveExisting = false
#agent.sources.spoolSrc.interceptors.uuidinterceptor.prefix = myhostname
agent.sources.spoolSrc.channels = memoryChannel
Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build or change
Search indexing applications. A morphline is a rich configuration file that simplifies defining an ETL transformation
chain. These transformation chains support consuming any kind of data from any kind of data source, processing
the data, and loading the results into Cloudera Search. Executing in a small embeddable Java runtime system,
morphlines can be used for Near Real Time applications, as well as batch processing applications, for example
as outlined in the following flow diagram:
Morphlines can be seen as an evolution of Unix pipelines where the data model is generalized to work with
streams of generic records, including arbitrary binary payloads. Morphlines can be embedded into Hadoop
components such as Search, Flume, MapReduce, Pig, Hive, and Sqoop.
The framework ships with a set of frequently used high level transformation and I/O commands that can be
combined in application specific ways. The plug-in system allows you to add new transformation and I/O
commands and to integrate existing functionality and third-party systems in a straightforward manner.
This integration enables rapid Hadoop ETL application prototyping, complex stream and event processing in real
time, flexible log file analysis, integration of multiple heterogeneous input schemas and file formats, as well as
reuse of ETL logic building blocks across Search applications.
Cloudera ships a high performance runtime that compiles a morphline as required. The runtime processes all
commands of a given morphline in the same thread, adding no artificial overhead. For high scalability, you can
deploy many morphline instances on a cluster in many Flume agents and MapReduce tasks.
Currently there are three components that execute morphlines:
• MapReduceIndexerTool
• Flume Morphline Solr Sink and Flume MorphlineInterceptor
Cloudera also provides a corresponding Cloudera Search Tutorial.
Morphlines manipulate continuous or arbitrarily large streams of records. The data model can be described as
follows: A record is a set of named fields where each field has an ordered list of one or more values. A value can
be any Java Object. That is, a record is essentially a hash table where each hash table entry contains a String
key and a list of Java Objects as values. (The implementation uses Guava’s ArrayListMultimap, which is a
ListMultimap). Note that a field can have multiple values and any two records need not use common field
names. This flexible data model corresponds exactly to the characteristics of the Solr/Lucene data model,
meaning a record can be seen as a SolrInputDocument. A field with zero values is removed from the record -
fields with zero values effectively do not exist.
Not only structured data, but also arbitrary binary data can be passed into and processed by a morphline. By
convention, a record can contain an optional field named _attachment_body, which can be a Java
java.io.InputStream or Java byte[]. Optionally, such binary input data can be characterized in more detail
by setting the fields named _attachment_mimetype (such as application/pdf) and _attachment_charset
(such as UTF-8) and _attachment_name (such as cars.pdf), which assists in detecting and parsing the data
type.
$ wget http://archive.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar
$ java -jar avro-tools-1.7.4.jar tojson
/usr/share/doc/search*/examples/test-documents/sample-statuses-20120906-141433.avro
3. Extract the fields named id, user_screen_name, created_at and text from the given Avro records, then
store and index them in Solr, using the following Solr schema definition in schema.xml:
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true"
multiValued="false" />
<field name="username" type="text_en" indexed="true" stored="true" />
<field name="created_at" type="tdate" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored"/>
</fields>
Note that the Solr output schema omits some Avro input fields such as user_statuses_count. Suppose you
want to rename the input field user_screen_name to the output field username. Also suppose that the time
format for the created_at field is yyyy-MM-dd'T'HH:mm:ss'Z'. Finally, suppose any unknown fields present
are to be removed. Recall that Solr throws an exception on any attempt to load a document that contains a field
that is not specified in schema.xml.
1. These transformation rules can be expressed with morphline commands called readAvroContainer,
extractAvroPaths, convertTimestamp, sanitizeUnknownSolrFields and loadSolr, by editing a
morphline.conf file to read as follows:
SOLR_LOCATOR : {
  # ZooKeeper ensemble
  zkHost : "127.0.0.1:2181/solr"
}
# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on its way to
# Solr.
morphlines : [
{
# Name used to identify a morphline. For example, used if there are multiple
# morphlines in a morphline config file.
id : morphline1
# Import all morphline commands in these java packages and their subpackages.
# Other commands that may be present on the classpath are not visible to
# this morphline.
importCommands : ["com.cloudera.**", "org.apache.solr.**"]
commands : [
{
# Parse Avro container file and emit a record for each Avro object
readAvroContainer {
# Optionally, require the input to match one of these MIME types:
# supportedMimeTypes : [avro/binary]
# Optionally, use a custom Avro schema in JSON format inline:
# readerSchemaString : """<json can go here>"""
# Optionally, use a custom Avro schema file in JSON format:
# readerSchemaFile : /path/to/syslog.avsc
}
}
{
# Consume the output record of the previous command and pipe another
# record downstream.
#
# extractAvroPaths is a command that uses zero or more Avro path
# expressions to extract values from an Avro object. Each expression
# consists of a record output field name, which appears to the left of the
# colon ':' and zero or more path steps, which appear to the right.
# Each path step is separated by a '/' slash. Avro arrays are
# traversed with the '[]' notation.
#
# The result of a path expression is a list of objects, each of which
# is added to the given record output field.
#
# The path language supports all Avro concepts, including nested
# structures, records, arrays, maps, unions, and others, as well as a flatten
# option that collects the primitives in a subtree into a flat list. In the
# paths specification, entries on the left of the colon are the target Solr
# field and entries on the right specify the Avro source paths. Paths are read
# from the source that is named to the right of the colon and written to the
# field that is named on the left.
extractAvroPaths {
flatten : false
paths : {
id : /id
username : /user_screen_name
created_at : /created_at
text : /text
}
}
}
# Consume the output record of the previous command and pipe another
# record downstream.
#
# convert timestamp field to native Solr timestamp format
# such as 2012-09-06T07:14:34Z to 2012-09-06T07:14:34.000Z
{
convertTimestamp {
field : created_at
inputFormats : ["yyyy-MM-dd'T'HH:mm:ss'Z'", "yyyy-MM-dd"]
inputTimezone : America/Los_Angeles
outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
outputTimezone : UTC
}
}
# Consume the output record of the previous command and pipe another
# record downstream.
#
# This command deletes record fields that are unknown to Solr
# schema.xml.
#
# Recall that Solr throws an exception on any attempt to load a document
# that contains a field that is not specified in schema.xml.
{
sanitizeUnknownSolrFields {
# Location from which to fetch Solr schema
solrLocator : ${SOLR_LOCATOR}
}
}
# log the record at DEBUG level to SLF4J
{ logDebug { format : "output record: {}", args : ["@{}"] } }
# load the record into a Solr server or MapReduce Reducer
{
loadSolr {
solrLocator : ${SOLR_LOCATOR}
}
}
]
}
]
The program should extract the following record from the log line and load it into Solr:
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
The following rules can be used to create a chain of transformation commands, which are expressed with the
morphline commands readLine, grok, and logDebug, by editing a morphline.conf file to read as follows:
loadSolr {
solrLocator : ${SOLR_LOCATOR}
}
}
]
}
]
Next Steps
You can learn more about Morphlines and the CDK.
• Search 1.1.0 ships with CDK version 0.8.1. For more information on Morphlines included in CDK 0.8.1, see
the detailed description of all morphline commands that is included in the CDK Morphlines Reference Guide.
• More example morphlines can be found in the unit tests.
Cloudera Search provides the ability to batch index HBase tables using MapReduce jobs. Such batch indexing
does not use or require the HBase replication feature or the Lily HBase Indexer Service, and it does not require
registering a Lily HBase Indexer configuration with the Lily HBase Indexer
Service. The indexer supports flexible custom application-specific rules to extract, transform, and load HBase
data into Solr. Solr search results can contain columnFamily:qualifier links back to the data stored in HBase.
This way, applications can use the Search result set to directly access matching raw HBase cells.
Batch indexing column families of tables in an HBase cluster requires:
• Populating an HBase table
• Creating a corresponding SolrCloud collection
• Creating a Lily HBase Indexer configuration
• Creating a Morphline configuration file
• Running HBaseMapReduceIndexerTool
$ hbase shell
hbase(main):002:0> create 'record', {NAME => 'data'}
hbase(main):002:0> put 'record', 'row1', 'data', 'value'
hbase(main):001:0> put 'record', 'row2', 'data', 'value2'
-->
<param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
<!-- The optional morphlineId identifies a morphline if there are multiple morphlines
in morphlines.conf -->
<!-- <param name="morphlineId" value="morphline1"/> -->
</indexer>
The Lily HBase Indexer configuration file also supports the standard attributes of any HBase Lily Indexer on the
top-level <indexer> element, meaning the attributes table, mapping-type, read-row,
unique-key-formatter, unique-key-field, row-field, and column-family-field. It does not support
the <field> and <extract> elements.
Note: For proper functioning, the morphline must not contain a loadSolr command. The enclosing
Lily HBase Indexer must load documents into Solr, rather than the morphline itself.
• The extractHBaseCells morphline command extracts cells from an HBase Result and transforms the
values into a SolrInputDocument. The command consists of an array of zero or more mapping specifications.
• Each mapping has:
– The inputColumn parameter, which specifies the data to be used from HBase for populating a field in
Solr. It takes the form of a column family name and qualifier, separated by a colon. The qualifier portion
can end in an asterisk, which is interpreted as a wildcard. In this case, all matching column-family and
qualifier expressions are used. The following are examples of valid inputColumn values:
– mycolumnfamily:myqualifier
– mycolumnfamily:my*
– mycolumnfamily:*
– The outputField parameter specifies the morphline record field to which to add output values. The
morphline record field is also known as the Solr document field. Example: "first_name".
– Dynamic output fields are enabled by the outputField parameter ending with a * wildcard. For example:
inputColumn : "m:e:*"
outputField : "belongs_to_*"
belongs_to_1 : foo
belongs_to_9 : bar
– The type parameter defines the datatype of the content in HBase. All input data is stored in HBase as
byte arrays, but all content in Solr is indexed as text, so a method for converting from byte arrays to the
actual datatype is required. The type parameter can be the name of a type that is supported by
org.apache.hadoop.hbase.util.Bytes.toXXX (currently: "byte[]", "int", "long", "string", "boolean", "float",
"double", "short", bigdecimal"). Use type "byte[]" to pass the byte array through to the morphline without
any conversion.
– type:byte[] copies the byte array unmodified into the record output field
– type:int converts with org.apache.hadoop.hbase.util.Bytes.toInt
– type:long converts with org.apache.hadoop.hbase.util.Bytes.toLong
– type:string converts with org.apache.hadoop.hbase.util.Bytes.toString
– type:boolean converts with org.apache.hadoop.hbase.util.Bytes.toBoolean
– type:float converts with org.apache.hadoop.hbase.util.Bytes.toFloat
– type:double converts with org.apache.hadoop.hbase.util.Bytes.toDouble
– type:short converts with org.apache.hadoop.hbase.util.Bytes.toShort
– type:bigdecimal converts with org.apache.hadoop.hbase.util.Bytes.toBigDecimal
Alternately the type parameter can be the name of a Java class that implements the
com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.
– The source parameter determines what portion of an HBase KeyValue is used as indexing input. Valid
choices are "value" or "qualifier". When "value" is specified, then the HBase cell value is used as input for
indexing. When "qualifier" is specified, then the HBase column qualifier is used as input for indexing. The
default is "value".
Running HBaseMapReduceIndexerTool
Run the HBaseMapReduceIndexerTool to index the HBase table using a MapReduce job, as follows:
Note: For development purposes, use the --dry-run option to run in local mode and print documents
to stdout, instead of loading them to Solr. Using this option causes the morphline to execute in the
client process without submitting a job to MapReduce. Executing in the client process provides quicker
turnaround during early trial and debug sessions.
Note: To print diagnostic information, such as the content of records as they pass through the
morphline commands, consider enabling TRACE log level. For example, you can enable TRACE log level
diagnostics by adding the following to your log4j.properties file.
log4j.logger.com.cloudera.cdk.morphline=TRACE
log4j.logger.com.ngdata=TRACE
The log4j.properties file can be passed via the --log4j command line option.
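A sketch of such an invocation, based on the example commands shown later in this section, follows; the indexer file, ZooKeeper address, and collection name are placeholders:
$ hadoop --config /etc/hadoop/conf \
  jar hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file indexer.xml \
  --zk-host 127.0.0.1/solr \
  --collection collection1 \
  --go-live \
  --log4j src/test/resources/log4j.properties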
HBaseMapReduceIndexerTool
HBaseMapReduceIndexerTool is a MapReduce batch job driver that takes input data from an HBase table,
creates Solr index shards, and writes the indexes into HDFS, in a flexible, scalable, and fault-tolerant manner. It
also supports merging the output shards into a set of live customer-facing Solr servers in SolrCloud.
More details are available through the command line help:
Go live arguments:
Arguments for merging the shards that are built into a live Solr
cluster. Also see the Cluster arguments.
--go-live Allows you to optionally merge the final index
shards into a live Solr cluster after they are
built. You can pass the ZooKeeper address with --
zk-host and the relevant cluster information will
be auto detected. (default: false)
--collection STRING The SolrCloud collection to merge shards into
when using --go-live and --zk-host. Example:
collection1
--go-live-threads INTEGER
Tuning knob that indicates the maximum number of
live merges to run in parallel at one time.
(default: 1000)
Optional arguments:
--help, -help, -h Show this help message and exit
--output-dir HDFS_URI HDFS directory to write Solr indexes to. Inside
there one output directory per shard will be
generated. Example: hdfs://c2202.mycompany.
com/user/$USER/test
--overwrite-output-dir
Overwrite the directory specified by --output-dir
if it already exists. Using this parameter will
result in the output directory being recursively
deleted at job startup. (default: false)
--morphline-file FILE Relative or absolute path to a local config file
that contains one or more morphlines. The file
must be UTF-8 encoded. The file will be uploaded
to each MR task. If supplied, this overrides the
value from the --hbase-indexer-* options.
Example: /path/to/morphlines.conf
--morphline-id STRING The identifier of the morphline that shall be
executed within the morphline config file, e.g.
specified by --morphline-file. If the --morphline-
id option is ommitted the first (i.e. top-most)
morphline within the config file is used. If
supplied, this overrides the value from the --
hbase-indexer-* options. Example: morphline1
--update-conflict-resolver FQCN
Fully qualified class name of a Java class that
implements the UpdateConflictResolver interface.
This enables deduplication and ordering of a
series of document updates for the same unique
document key. For example, a MapReduce batch job
might index multiple files in the same job where
some of the files contain old and new versions of
the very same document, using the same unique
document key.
Typically, implementations of this interface
forbid collisions by throwing an exception, or
ignore all but the most recent document version,
or, in the general case, order colliding updates
ascending from least recent to most recent
(partial) update.

# (Re)index a table based on a local indexer config file
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
-D 'mapred.child.java.opts=-Xmx500m' \
--hbase-indexer-file indexer.xml \
--zk-host 127.0.0.1/solr \
--collection collection1 \
--reducers 0 \
--log4j src/test/resources/log4j.properties
# (Re)index a table based on an indexer config stored in ZK
hadoop --config /etc/hadoop/conf \
jar hbase-indexer-mr-*-job.jar \
--conf /etc/hbase/conf/hbase-site.xml \
-D 'mapred.child.java.opts=-Xmx500m' \
--hbase-indexer-zk zk01 \
--hbase-indexer-name docindexer \
--go-live \
--log4j src/test/resources/log4j.properties
Configuring Lily HBase NRT Indexer Service for Use with Cloudera
Search
The Lily HBase NRT Indexer Service is a flexible, scalable, fault-tolerant, transactional, Near Real Time (NRT)
oriented system for processing a continuous stream of HBase cell updates into live search indexes. Typically,
content ingested into HBase appears in search results within seconds, although this duration is tunable. The
Lily HBase Indexer uses SolrCloud to index data stored in HBase. As HBase
applies inserts, updates, and deletes to HBase table cells, the indexer keeps Solr consistent with the HBase table
contents, using standard HBase replication features. The indexer supports flexible custom application-specific
rules to extract, transform, and load HBase data into Solr. Solr search results can contain
columnFamily:qualifier links back to the data stored in HBase. This way, applications can use the Search
result set to directly access matching raw HBase cells. Indexing and searching do not affect operational stability
or write throughput of HBase because the indexing and searching processes are separate and asynchronous
to HBase.
The Lily HBase NRT Indexer Service must be deployed in an environment with a running HBase cluster, a running
SolrCloud cluster, and at least one ZooKeeper cluster. This can be done with or without Cloudera Manager. See
the HBase (Keystore) Indexer Service topic in Managing Clusters with Cloudera Manager for more information.
Pointing a Lily HBase NRT Indexer Service at an HBase cluster that needs to be indexed
Configure individual Lily HBase NRT Indexer Services with the location of a ZooKeeper ensemble that is used
for the target HBase cluster. This must be done before starting Lily HBase NRT Indexer Services. Add the following
property to /etc/hbase-solr/conf/hbase-indexer-site.xml. Remember to replace
hbase-cluster-zookeeper with the actual ensemble string found in the hbase-site.xml configuration file:
<property>
<name>hbase.zookeeper.quorum</name>
<value>hbase-cluster-zookeeper</value>
</property>
Configure all Lily HBase NRT Indexer Services to use a particular ZooKeeper ensemble to coordinate among each
other. Add the following property to /etc/hbase-solr/conf/hbase-indexer-site.xml. Remember to replace
hbase-cluster-zookeeper:2181 with the actual ensemble string:
<property>
<name>hbaseindexer.zookeeper.connectstring</name>
<value>hbase-cluster-zookeeper:2181</value>
</property>
After starting the Lily HBase NRT Indexer Services, you can verify that all daemons are running using the jps
tool from the Oracle JDK, which you can obtain from the Java SE Downloads page. If you are running a
pseudo-distributed HDFS installation and a Lily HBase NRT Indexer Service installation on the same machine,
jps lists all of these daemons in its output.
Before an existing HBase table can be indexed, replication must be enabled on each column family that needs
to be indexed. For example:
$ hbase shell
hbase shell> disable 'record'
hbase shell> alter 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
hbase shell> enable 'record'
For every new table, set the REPLICATION_SCOPE on every column family that needs to be indexed. Do this by
issuing a command of the form:
$ hbase shell
hbase shell> create 'record', {NAME => 'data', REPLICATION_SCOPE => 1}
Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service
Once the content of the Lily HBase Indexer configuration XML file is satisfactory, register it with the Lily HBase
Indexer Service. Do this by uploading the Lily HBase Indexer configuration XML file to ZooKeeper and associating
it with a SolrCloud collection. For example:
$ hbase-indexer add-indexer \
--name myIndexer \
--indexer-conf $HOME/morphline-hbase-mapper.xml \
--connection-param solr.zk=solr-cloude-zk1,solr-cloude-zk2/solr \
--connection-param solr.collection=hbase-collection1 \
--zookeeper hbase-cluster-zookeeper:2181
Verify that the indexer was successfully created:
$ hbase-indexer list-indexers
Number of indexes: 1
myIndexer
+ Lifecycle state: ACTIVE
+ Incremental indexing state: SUBSCRIBE_AND_CONSUME
+ Batch indexing state: INACTIVE
+ SEP subscription ID: Indexer_myIndexer
+ SEP subscription timestamp: 2013-06-12T11:23:35.635-07:00
+ Connection type: solr
+ Connection params:
+ solr.collection = hbase-collection1
+ solr.zk = localhost/solr
+ Indexer config:
110 bytes, use -dump to see content
+ Batch index config:
(none)
+ Default batch index config:
(none)
+ Processes
+ 1 running processes
+ 0 failed processes
Existing Lily HBase Indexers can be further manipulated by using the update-indexer and delete-indexer
command line options of the hbase-indexer utility.
For more help, use the help options of the hbase-indexer command-line utility.
Note: The morphlines.conf configuration file must be present on every node that runs an indexer.
Note: The morphlines.conf configuration file can be updated using the Cloudera Manager Admin
Console.
To update morphlines.conf using Cloudera Manager
1. On the Cloudera Manager Home page, click the Key-Value Store Indexer service, often named KS_INDEXER-1.
2. Click Configuration > View and Edit .
3. Expand Service-Wide and click Morphlines.
4. For the Morphlines File property, paste the new morphlines.conf content into the Value field.
Cloudera Manager automatically copies pasted configuration files to the current working directory of
all Lily HBase Indexer cluster processes on start and restart of the Lily HBase Indexer Service. In this
case the file location /etc/hbase-solr/conf/morphlines.conf is not applicable.
Note: Morphline configuration files can be changed without recreating the indexer itself. In such a
case, you must restart the Lily HBase Indexer service.
To verify that the indexing is working, add rows to the indexed HBase table. For example:
$ hbase shell
hbase(main):001:0> put 'record', 'row1', 'data', 'value'
hbase(main):002:0> put 'record', 'row2', 'data', 'value2'
If the put operation succeeds, wait a few seconds, then navigate to the SolrCloud's UI query page, and query the
data. Note the updated rows in Solr.
To print diagnostic information, such as the content of records as they pass through the morphline commands,
consider enabling TRACE log level. For example, you might add the following lines to your log4j.properties
file:
log4j.logger.com.cloudera.cdk.morphline=TRACE
log4j.logger.com.ngdata=TRACE
In Cloudera Manager 4, this can be done by navigating to Services > KEY_VALUE_STORE_INDEXER service >
Configuration > View and Edit > Lily HBase Indexer > Advanced > Lily HBase Indexer Logging Safety Valve ,
followed by a restart of the Lily HBase Indexer Service.
Note: Prior to Cloudera Manager 4.8, the service was referred to as Keystore Indexer service.
In Cloudera Manager 5, this can be done by navigating to Clusters > KEY_VALUE_STORE_INDEXER > Configuration
> View and Edit > Lily HBase Indexer > Advanced > Lily HBase Indexer Logging Safety Valve , followed by a
restart of the Lily HBase Indexer Service.
Note: The name of the particular KEY_VALUE_STORE_INDEXER you select varies. With Cloudera
Manager 4, the names are of the form ks_indexer1. With Cloudera Manager 5, the names are of the
form ks_indexer.
Using Kerberos
The process of enabling Solr clients to authenticate with a secure Solr is specific to the client. This section will
demonstrate:
• Using Kerberos and curl on page 63
• Configuring SolrJ Library Usage on page 63
• Configuring Flume Morphline Solr Sink Usage on page 64
Secure Solr requires that the CDH components that it interacts with are also secure. Secure Solr interacts with
HDFS, ZooKeeper and optionally HBase, MapReduce, and Flume. See the CDH 5 Security Guide or the CDH4
Security Guide for more information.
The following instructions only apply to configuring Kerberos in an unmanaged environment. Kerberos
configuration is automatically handled by Cloudera Manager if you are using Search in a Cloudera managed
environment.
Note: Depending on the tool used to connect, additional arguments may be required. For example,
with curl, --negotiate and -u are required. The username and password specified with -u are not
actually checked because Kerberos is used. As a result, any value, such as foo:bar or even just :, is
acceptable. While any value can be provided for -u, note that the option is required. Omitting -u
results in a 401 Unauthorized error, even though the -u value is not actually used.
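For example, after obtaining a ticket with kinit, you might query a secure Solr server with curl along the following
lines (the host, port, and collection name are placeholders):
curl --negotiate -u : "http://solrserver.example.com:8983/solr/collection1/select?q=*%3A*&wt=json"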
Configuring SolrJ Library Usage
If your applications use the SolrJ library, enable Kerberos as follows.
1. Create a Java Authentication and Authorization Service (JAAS) configuration file. Its content depends on how
you want the client to authenticate:
• You want the client application to authenticate using a Kerberos ticket from the ticket cache (for example,
one obtained with kinit):
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true
principal="user/fully.qualified.domain.name@<YOUR-REALM>";
};
• You want the client application to authenticate using a keytab you specify:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/path/to/keytab/user.keytab"
storeKey=true
useTicketCache=false
principal="user/fully.qualified.domain.name@<YOUR-REALM>";
};
2. Set the Java system property java.security.auth.login.config. Let's say the JAAS configuration file
you created in step 1 is located on the filesystem as /home/user/jaas-client.conf. The Java system
property java.security.auth.login.config must be set to point to this file. Setting a Java system
property can be done programmatically, for example using a call such as:
System.setProperty("java.security.auth.login.config",
"/home/user/jaas-client.conf");
Alternatively, you can set the property when invoking the program. For example, if you were running your
application from a jar, an invocation along the following lines sets the property (yourapplication.jar and its
arguments are placeholders for your own program):
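java -Djava.security.auth.login.config=/home/user/jaas-client.conf -jar yourapplication.jar [arguments]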
Configuring Flume Morphline Solr Sink Usage
1. Create a keytab for the flume principal if you have not already done so (the example below expects it at
/etc/flume-ng/conf/flume.keytab).
2. Create the JAAS configuration file /etc/flume-ng/conf/jaas-client.conf with the following contents:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="/etc/flume-ng/conf/flume.keytab"
principal="flume/<fully.qualified.domain.name>@<YOUR-REALM>";
};
3. Add the flume JAAS configuration to the JAVA_OPTS in /etc/flume-ng/conf/flume-env.sh. For example,
you might change:
JAVA_OPTS="-Xmx500m"
to:
JAVA_OPTS="-Xmx500m
-Djava.security.auth.login.config=/etc/flume-ng/conf/jaas-client.conf"
Enabling Sentry Authorization for Cloudera Search
Sentry enables role-based, fine-grained authorization for Cloudera Search. Follow the instructions below to
configure Sentry under CDH 4.5. Sentry is included in the Search installation.
Note that this document is for configuring Sentry for Cloudera Search. To download or install other versions of
Sentry for other services, see:
• Setting Up Search Authorization with Sentry for instructions for using Cloudera Manager to install and
configure Hive Authorization with Sentry.
• Impala Security for instructions on using Impala with Sentry.
• Sentry Installation to install the version of Sentry that was provided with CDH 4.4 and earlier.
• Sentry Installation to install the version of Sentry that was provided with CDH 5.
Roles are granted privileges on collections. For example, a privilege that allows the Query action on the logs
collection is specified as:
collection=logs->action=Query
A role can contain multiple such rules, separated by commas. For example, the engineer_role might contain
the Query privilege for the hive_logs and hbase_logs collections, and the Update privilege for the current_bugs
collection. You would specify this as follows:
engineer_role = collection=hive_logs->action=Query, \
collection=hbase_logs->action=Query, \
collection=current_bugs->action=Update
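Roles are then granted to groups in a [groups] section of the policy file, using the same syntax as the sample
policy file shown later in this section. For example:
[groups]
dev_ops = dev_role, ops_role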
Here the group dev_ops is granted the roles dev_role and ops_role. The members of this group can perform
searches that are allowed by these roles.
Important: You can use either Hadoop groups or local groups, but not both at the same time. Use
local groups if you want to do a quick proof-of-concept. For production, use Hadoop groups.
Note: By default, this uses local shell groups. See the Group Mapping section of the HDFS
Permissions Guide for more information.
Alternatively, to configure local groups:
1. Define local groups in a [users] section of the Sentry Configuration File on page 67, sentry-site.xml. For
example:
[users]
user1 = group1, group2, group3
user2 = group2, group3
2. In sentry-site.xml, configure Sentry to use the local group provider by setting the following property:
<property>
<name>sentry.provider</name>
<value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
</property>
Policy file
The sections that follow contain notes on creating and maintaining the policy file.
1. Replication count - Because the file is read for each query, you should increase its HDFS replication count;
10 is a reasonable value.
2. Updating the file - Updates to the file are only reflected when the Solr process is restarted.
Defining Roles
Keep in mind that role definitions are not cumulative; the newer definition replaces the older one. For example,
the following results in role1 having privilege2, not privilege1 and privilege2.
role1 = privilege1
role1 = privilege2
Sample Configuration
This section provides a sample configuration.
Note: Sentry with CDH Search does not support multiple policy files. Other implementations of Sentry
such as Sentry for Hive do support different policy files for different databases, but Sentry for CDH
Search has no such support for multiple policies.
Policy File
The following is an example of a CDH Search policy file. The sentry-provider.ini would exist in an HDFS
location such as hdfs://ha-nn-uri/user/solr/sentry/sentry-provider.ini.
sentry-provider.ini
[groups]
# Assigns each Hadoop group to its set of roles
engineer = engineer_role
ops = ops_role
dev_ops = engineer_role, ops_role
[roles]
# The following grants all access to source_code.
# "collection = source_code" can also be used as syntactic
# sugar for "collection = source_code->action=*"
engineer_role = collection = source_code->action=*
# The following rules grant more restricted access.
ops_role = collection = hive_logs->action=Query
dev_ops_role = collection = hbase_logs->action=Query
Sentry Configuration File
The following is an example Sentry configuration file, sentry-site.xml:
<configuration>
<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>
<property>
<name>sentry.solr.provider.resource</name>
<value>/path/to/authz-provider.ini</value>
<!--
If the HDFS configuration files (core-site.xml, hdfs-site.xml)
pointed to by SOLR_HDFS_CONFIG in /etc/default/solr
point to HDFS, the path will be in HDFS;
alternatively you could specify a full path,
e.g.:hdfs://namenode:port/path/to/authz-provider.ini
-->
</property>
</configuration>
In an unmanaged environment, enable Sentry for Solr by adding settings such as the following to
/etc/default/solr:
SOLR_AUTHORIZATION_SENTRY_SITE=/location/to/sentry-site.xml
SOLR_AUTHORIZATION_SUPERUSER=solr
To enable Sentry index-authorization checking on a new collection, the instancedir for the collection must use
a modified version of solrconfig.xml with Sentry integration. The command solrctl instancedir
--generate generates two versions of solrconfig.xml: the standard solrconfig.xml without Sentry
integration, and the Sentry-integrated version called solrconfig.xml.secure. To use the Sentry-integrated
version, replace solrconfig.xml with solrconfig.xml.secure before creating the instancedir.
If you have an existing collection using the standard solrconfig.xml called "foo" and an instancedir of the
same name, perform the following steps:
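A reasonable solrctl sequence for this is sketched below. It assumes the instancedir contents are downloaded
to a local working directory named foo and that the generated solrconfig.xml.secure is present in its conf
directory:
solrctl instancedir --get foo foo
mv foo/conf/solrconfig.xml.secure foo/conf/solrconfig.xml
solrctl instancedir --update foo foo
solrctl collection --reload foo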
If you have an existing collection using a version of solrconfig.xml that you have modified, contact Support
for assistance.
In an unmanaged environment, configure secure impersonation by adding settings such as the following to
/etc/default/solr. In this example, hue can connect from any host and impersonate members of any group,
while foo can connect only from host1 and host2 and can impersonate only members of the bar group:
SOLR_SECURITY_ALLOWED_PROXYUSERS=hue,foo
SOLR_SECURITY_PROXYUSER_hue_HOSTS=*
SOLR_SECURITY_PROXYUSER_hue_GROUPS=*
SOLR_SECURITY_PROXYUSER_foo_HOSTS=host1,host2
SOLR_SECURITY_PROXYUSER_foo_GROUPS=bar
Note: Cloudera Manager has its own management of secure impersonation. To use Cloudera Manager,
go to the Configuration Menu for Solr rather than editing /etc/default/solr.
Mission critical, large-scale online production systems need to make progress without downtime despite some
issues. Cloudera Search provides two routes to configurable, highly available, and fault-tolerant data ingestion:
• Near Real Time (NRT) ingestion using the Flume Solr Sink
• MapReduce based batch ingestion using the MapReduceIndexerTool
If Cloudera Search throws an exception according to the rules described above, the caller, meaning Flume Solr
Sink or MapReduceIndexerTool, can catch the exception and retry the task if it meets the criteria for such retries.
For example, the following Flume Solr Sink settings enable production mode and cause recoverable exceptions
to be ignored:
agent.sinks.solrSink.isProductionMode = true
agent.sinks.solrSink.isIgnoringRecoverableExceptions = true
In addition, Flume SolrSink automatically attempts to load balance and fail over among the hosts of a SolrCloud
before it considers rolling back and retrying the transaction. Load balancing and failover are done with the help
of ZooKeeper, which itself can be configured to be highly available.
Further, Cloudera Manager can configure Flume so it automatically restarts if its process crashes.
To tolerate extended periods of Solr downtime, you can configure Flume to use a high-performance transactional
persistent queue in the form of a FileChannel. A FileChannel can use any number of local disk drives to buffer
significant amounts of data. For example, you might buffer many terabytes of events corresponding to a week
of data. Further, using the optional replicating channels Flume feature, you can configure Flume to replicate the
same data both into HDFS and into Solr. Doing so ensures that if the Flume SolrSink channel runs out of
disk space, data is still delivered to HDFS, and this data can later be ingested from HDFS into Solr using
MapReduce.
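A minimal sketch of such a FileChannel definition in a Flume agent configuration follows; the agent and channel
names, directories, and capacity are illustrative and should be sized for your environment:
agent.channels = fileChannel
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /var/lib/flume-ng/checkpoint
agent.channels.fileChannel.dataDirs = /data1/flume/data,/data2/flume/data
agent.channels.fileChannel.capacity = 100000000
agent.sinks.solrSink.channel = fileChannel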
Many machines with many Flume Solr Sinks and FileChannels can be used in a failover and load balancing
configuration to improve high availability and scalability. Flume SolrSink servers can be either co-located with
live Solr servers serving end user queries, or Flume SolrSink servers can be deployed on separate industry
standard hardware for improved scalability and reliability. By spreading indexing load across a large number of
Flume SolrSink servers you can improve scalability. Indexing load can be replicated across multiple Flume SolrSink
servers for high availability, for example using Flume features such as Load balancing Sink Processor.
Solr performance tuning is a complex task. The following sections provide more details.
General information on Solr caching is available on the SolrCaching page on the Solr Wiki.
Information on issues that influence performance is available on the SolrPerformanceFactors page on the Solr
Wiki.
Configuration
The following parameters control caching. They can be configured at the Solr process level by setting the respective
system property or by editing the solrconfig.xml directly.
Note:
Increasing the direct memory cache size may make it necessary to increase the maximum direct
memory size allowed by the JVM. Add the following to /etc/default/solr to do so. You must also
replace MAXMEM with a reasonable upper limit. A typical default JVM value for this is 64 MB.
CATALINA_OPTS="-XX:MaxDirectMemorySize=MAXMEMg -XX:+UseLargePages"
When performing NRT indexing, Solr over HDFS optimizes caching using Lucene's NRTCachingDirectory.
Lucene caches a newly created segment if both of the following conditions are true:
• The segment is the result of a flush or a merge and the estimated size of the merged segment is <=
solr.hdfs.nrtcachingdirectory.maxmergesizemb.
• The total cached bytes is <= solr.hdfs.nrtcachingdirectory.maxcachedmb.
The following parameters control NRT caching behavior:
<directoryFactory name="DirectoryFactory"
    class="org.apache.solr.core.HdfsDirectoryFactory">
  <bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
  <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
  <bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:true}</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>
</directoryFactory>
The following example illustrates passing Java options by editing the /etc/default/solr configuration file.
The values shown are illustrative and should be sized for your hardware; the -D option shows how one of the
caching parameters described above can also be set as a system property:
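CATALINA_OPTS="-XX:MaxDirectMemorySize=20g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=100"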
For better performance, Cloudera recommends disabling the Linux swap space on all Solr server nodes as shown
below:
# minimize swappiness
sudo sysctl vm.swappiness=0
sudo bash -c 'echo "vm.swappiness=0">> /etc/sysctl.conf'
# disable swap space until next reboot:
sudo /sbin/swapoff -a
MapReduceIndexerTool Metadata
The MapReduceIndexerTool generates metadata fields for each input file when indexing. These fields can be
used in morphline commands. These fields can also be stored in Solr, by adding definitions like the following to
your Solr schema.xml file. After the MapReduce indexing process completes, the fields are searchable through
Solr.
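The exact definitions depend on your schema conventions. An illustrative sketch, using the metadata field names
from the example output below and standard Solr field types, might look like the following (the remaining file_*
fields follow the same pattern):
<field name="file_download_url" type="string" indexed="false" stored="true"/>
<field name="file_upload_url" type="string" indexed="false" stored="true"/>
<field name="file_scheme" type="string" indexed="true" stored="true"/>
<field name="file_host" type="string" indexed="true" stored="true"/>
<field name="file_port" type="int" indexed="true" stored="true"/>
<field name="file_name" type="string" indexed="true" stored="true"/>
<field name="file_path" type="string" indexed="true" stored="true"/>
<field name="file_length" type="tlong" indexed="true" stored="true"/>
<field name="file_last_modified" type="tlong" indexed="true" stored="true"/>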
Example output:
"file_upload_url":"foo/test-documents/sample-statuses-20120906-141433.avro",
"file_download_url":"hdfs://host1.mycompany.com:8020/user/foo/ \
test-documents/sample-statuses-20120906-141433.avro",
"file_scheme":"hdfs",
"file_host":"host1.mycompany.com",
"file_port":8020,
"file_name":"sample-statuses-20120906-141433.avro",
"file_path":"/user/foo/test-documents/sample-statuses-20120906-141433.avro",
"file_last_modified":1357193447106,
"file_length":1512,
"file_owner":"foo",
"file_group":"foo",
"file_permissions_user":"rw-",
"file_permissions_group":"r--",
"file_permissions_other":"r--",
"file_permissions_stickybit":false,
After installing and deploying Cloudera Search, use the information in this section to troubleshoot problems.
Troubleshooting
The following are some common troubleshooting techniques.
Note: In the URLs below, replace entries such as <server:port> with values from your environment. The port
defaults to 8983, but check /etc/default/solr if you are in doubt.
Symptom: All
Explanation: Varied
Recommendation: Examine the Solr log. By default, the log can be found at /var/log/solr/solr.out.
Symptom: No documents found
Explanation: The server may not be running.
Recommendation: Browse to http://server:port/solr to see if the server responds. Check that cores are present,
and check the contents of the cores to ensure that numDocs is more than 0.
• Almost anything available on the admin page. Note that drilling down into the “schema browser” can be
expensive.
• Testing with unrealistic data sets. For example, a user may test a prototype that uses faceting, grouping,
sorting, and complex schemas against a small data set. When this same system is used to load real data,
performance issues occur. Using realistic data and use cases is essential to getting accurate results.
• If the scenario seems to be that the system is slow to ingest data, consider:
– Upstream speed. If you have a SolrJ program pumping data to your cluster and ingesting documents at
a rate of 100 docs/second, the gating factor may be upstream speed. To test for limitations due to
upstream speed, comment out only the code that sends the data to the server (for example,
SolrHttpServer.add(doclist)) and time the program. If you see a throughput bump of less than around 10%,
this may indicate that your system is spending most or all of the time getting the data from the
system-of-record.
– This may require pre-processing.
– Indexing with a single thread from the client. ConcurrentUpdateSolrServer can use multiple threads to
avoid I/O waits.
– Too-frequent commits. This was historically an attempt to get NRT processing, but with SolrCloud hard
commits this should be quite rare.
– The complexity of the analysis chain. Note that this is rarely the core issue. A simple test is to change
the schema definitions to use trivial analysis chains and then measure performance.
– When the simple approaches fail to identify an issue, consider using profilers.
• Exceptions. The Solr log file contains a record of all exceptions thrown. Some exceptions, such as exceptions
resulting from invalid query syntax, are benign, but others, such as Out of Memory errors, require attention.
• Excessively large caches. Each entry in caches such as the filterCache can require up to maxDoc/8 bytes, so
a filterCache with 10,000 entries is likely to result in Out of Memory errors. Large caches are normal and
expected when there are many documents to index.
• Caches with low hit ratios, particularly filterCache. Each cache takes up some space, consuming resources.
There are several caches, each with its own hit rate.
– filterCache. This cache should have a relatively high hit ratio, typically around 80%.
– queryResultCache. This cache is primarily used for paging, so it can have a very low hit ratio. Each entry is
quite small, as it is basically composed of the raw query string as a key and perhaps 20-40 ints. While useful,
unless users are routinely paging through results, this cache requires relatively little attention.
– documentCache. This cache is a bit tricky. It’s used to cache the document data (stored fields) so various
components in a request handler don’t have to re-read the data from the disk. It’s an open question how
useful it is when using MMapDirectory to access the index.
• Very deep paging. It is uncommon for users to go beyond the first page and very rare to go through 100 pages
of results. A "&start=<pick your number>" query indicates unusual usage that should be identified. Deep
paging may indicate that some agent is scraping results.
Note: Solr is not built to return full result sets no matter how deep. If returning the full result set
is required, explore alternatives to paging through the entire result set.
• Range queries should work on trie fields. Trie fields (numeric types) store extra information in the index to
aid in range queries. If range queries are used, it’s almost always a good idea to be using trie fields.
• "fq" clauses that use bare NOW. “fq” clauses are kept in a cache. The cache is a map from the "fq" clause to
the documents in your collection that satisfy that clause. Using bare NOW clauses virtually guarantees that
the entry in the filter cache is not be re-used.
• Multiple simultaneous searchers warming. This is an indication that there are excessively frequent commits
or that autowarming is taking too long. It usually indicates a misunderstanding of when commits should be
issued, often in an attempt to simulate Near Real Time (NRT) processing, or an indexing client that is issuing
commits improperly. With NRT, commits should be quite rare, and having more than one simultaneous
autowarm should not happen.
• Stored fields that are never returned ("fl=" clauses). Examining the queries for “fl=” and correlating that with
the schema can reveal stored fields that are never used. Such fields mostly waste disk space, although "fl=*"
can make this analysis ambiguous. Nevertheless, it is worth examining.
• Indexed fields that are never searched. This is the opposite of the case where stored fields are never returned,
and it is more important because unused indexed fields have real RAM consequences. Examine the request
handlers, particularly “edismax”-style parsers, to determine which indexed fields are actually searched.
• Queried but not analyzed fields. It’s rare for a field to be queried but not analyzed in any way. Usually this is
only valuable for “string”-type fields, which are suitable for machine-entered data such as part numbers
chosen from a pick-list. Unanalyzed fields should not be used for data that humans enter.
• String fields. String fields are completely unanalyzed. Unfortunately, some people confuse “string” with Java’s
“String” type and use them for text that should be tokenized. The general expectation is that string fields
should be used sparingly. More than just a few string fields indicates a design flaw.
• Whenever the schema is changed, re-index the entire data set. Solr uses the schema to set expectations
about the index. When schemas are changed, there’s no attempt to retrofit the changes to documents that
are currently indexed, but any new documents are indexed with the new schema definition. So old and new
documents can have the same field stored in vastly different formats (for example, String and TrieDate)
making your index inconsistent. This can be detected by examining the raw index.
• Query statistics can be extracted from the logs. Statistics can be monitored on live systems, but it is more
common to work from log files.
• Too-frequent commits have historically been the cause of unsatisfactory performance. This is not so important
with NRT processing, but it is valuable to consider.
• Optimizing an index, which could previously improve search performance, is much less necessary now.
Anecdotal evidence indicates optimizing may help in some cases, but the general recommendation is to use
“expungeDeletes” instead of optimizing.
– Modern Lucene code does what “optimize” used to do to remove deleted data from the index when
segments are merged. Think of this process as a background optimize. Note that merge policies based
on segment size can make this characterization inaccurate.
– It still may make sense to optimize a read-only index.
– “Optimize” is now renamed “forceMerge”.
commit
An operation that forces documents to be made searchable.
• hard - A commit that starts the autowarm process, closes old searchers and opens new ones. It may also
trigger replication.
• soft - New functionality with NRT and SolrCloud that makes documents searchable without requiring the
work of hard commits.
embedded Solr
The ability to execute Solr commands without having a separate servlet container. Generally, use of embedded
Solr is discouraged because it is often used due to the mistaken belief that HTTP is inherently too expensive to
go fast. With Cloudera Search, and especially if the idea of some kind of MapReduce process is adopted, embedded
Solr is probably advisable.
faceting
“Counting buckets” for a query. For example, suppose the search is for the term “shoes”. You might want the
result to report how many of the matching shoes were in each category, such as “X brown, Y red, and Z blue
shoes”, in addition to the results themselves.
replica
In SolrCloud, a complete copy of a shard. Each replica is identical, so only one replica has to be queried (per shard)
for searches.
sharding
Splitting a single logical index up into some number of sub-indexes, each of which can be hosted on a separate
machine. Solr (and especially SolrCloud) handles querying each shard and assembling the response into a single,
coherent list.
SolrCloud
ZooKeeper-enabled, fault-tolerant, distributed Solr. This is new in Solr 4.0.
SolrJ
A Java API for interacting with a Solr instance.