Elasticsearch is renowned for its powerful full-text search capabilities. At the heart of this functionality are analyzers and tokenizers, which play a crucial role in how text is processed and indexed. This guide will help you understand how analyzers and tokenizers work in Elasticsearch, with detailed examples and outputs to make these concepts easy to grasp.
Introduction to Full-Text Search
Full-text search allows you to search for documents that contain specific words or phrases. Unlike keyword search, which looks for exact matches, full-text search involves breaking down the text into individual terms (or tokens) and processing them to enable more flexible and comprehensive search capabilities. This is where analyzers and tokenizers come into play.
What are Analyzers?
An analyzer in Elasticsearch is a component that converts text into tokens (terms) and normalizes them. An analyzer is built from three kinds of components, applied in this order:
- Character Filters: Preprocess the raw text by adding, modifying, or removing characters. An analyzer can have zero or more.
- Tokenizer: Splits the text into individual terms (tokens). An analyzer has exactly one.
- Token Filters: Apply additional processing to the tokens, such as lowercasing or removing stop words. An analyzer can have zero or more.
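To make the order of the three stages concrete, here is a minimal Python sketch of an analysis pipeline. It is a toy model, not Elasticsearch's implementation: the helper names (analyze, strip_tags, split_on_whitespace, lowercase) are hypothetical stand-ins chosen for illustration.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run the three analysis stages in order and return the final terms."""
    for char_filter in char_filters:      # stage 1: rewrite the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # stage 2: split the string into tokens
    for token_filter in token_filters:    # stage 3: transform the token stream
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real components (illustrative only).
def strip_tags(text: str) -> str:
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def split_on_whitespace(text: str) -> List[str]:
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


terms = analyze("<b>Powerful</b> Search Engine",
                [strip_tags], split_on_whitespace, [lowercase])
print(terms)  # ['powerful', 'search', 'engine']
```

The character filter sees the whole string, while token filters only ever see tokens. That is why tag stripping has to happen before tokenization and lowercasing can happen after it.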
What are Tokenizers?
A tokenizer is a component of an analyzer that breaks down the text into a stream of tokens. Different tokenizers split text in different ways, depending on the specific use case.
Default Analyzer and Tokenizer
Elasticsearch uses the standard analyzer by default. It combines the standard tokenizer with a lowercase token filter; a stop word filter is also available but is disabled by default. Let's see an example to understand how the default analyzer works.
Example: Default Analyzer
Consider the following text: "Elasticsearch is a powerful search engine."
GET /_analyze
{
  "text": "Elasticsearch is a powerful search engine"
}
Output:
{
  "tokens": [
    { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "<ALPHANUM>", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "<ALPHANUM>", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "<ALPHANUM>", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "<ALPHANUM>", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "<ALPHANUM>", "position": 5 }
  ]
}
In this example:
- The text is tokenized into individual words, and each token is lowercased.
- Each token includes information such as the start and end offsets and the token's position.
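The offsets and positions can be reproduced with a few lines of Python. This is only a rough approximation of the standard analyzer, assuming that a regular-expression word match behaves like the standard tokenizer on simple English text. The function name standard_like_analyze is hypothetical.

```python
import re


def standard_like_analyze(text):
    """Approximate the standard analyzer: match word characters, then lowercase."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the token's first character
            "end_offset": match.end(),      # index one past the token's last character
            "position": position,           # ordinal of the token in the stream
        })
    return tokens


for tok in standard_like_analyze("Elasticsearch is a powerful search engine"):
    print(tok)
```

Offsets are character indexes into the original text, and end_offset is exclusive, so end_offset minus start_offset equals the token's length. Positions count tokens rather than characters.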
Custom Analyzers
Sometimes, the default analyzer might not fit your needs, and you may need to create a custom analyzer. Custom analyzers allow you to define specific character filters, tokenizers, and token filters to tailor the text analysis process.
Example: Custom Analyzer
Let's create a custom analyzer that:
- Uses a whitespace tokenizer.
- Converts tokens to lowercase.
- Removes English stop words.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "stop"
          ]
        }
      }
    }
  }
}
Now, let's analyze some text using this custom analyzer.
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Elasticsearch is a powerful search engine"
}
Output:
{
  "tokens": [
    { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5 }
  ]
}
In this example:
- The text is tokenized based on whitespace.
- The tokens are converted to lowercase.
- Stop words ("is", "a") are removed from the tokens. Their positions (1 and 2) are left as gaps, which is why "powerful" keeps position 3.
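One detail worth seeing in code is what happens to token positions when a stop word is dropped. The sketch below assumes the stop filter leaves a gap where the removed token was instead of renumbering the remaining tokens. The STOP_WORDS set is an abbreviated, illustrative list, and analyze_with_stop is a hypothetical name.

```python
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # abbreviated, illustrative list


def analyze_with_stop(text):
    """Whitespace split, lowercase, then drop stop words while keeping positions."""
    tokens = []
    for position, word in enumerate(text.split()):  # whitespace tokenizer
        term = word.lower()                          # lowercase filter
        if term in STOP_WORDS:                       # stop filter drops the token,
            continue                                 # but its position is not reused
        tokens.append((term, position))
    return tokens


print(analyze_with_stop("Elasticsearch is a powerful search engine"))
# [('elasticsearch', 0), ('powerful', 3), ('search', 4), ('engine', 5)]
```

Keeping the gap matters for position-aware queries such as phrase matching, because it records that "elasticsearch" and "powerful" were not adjacent in the original text.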
Commonly Used Tokenizers
Elasticsearch provides various built-in tokenizers, each suited for different purposes.
Standard Tokenizer
The standard tokenizer is the default tokenizer used by Elasticsearch. It splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm, and removes most punctuation.
Whitespace Tokenizer
The whitespace tokenizer splits text into terms whenever it encounters whitespace.
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "whitespace_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  }
}
Analyzing text:
GET /whitespace_example/_analyze
{
  "tokenizer": "whitespace_tokenizer",
  "text": "Elasticsearch is a powerful search engine"
}
Output:
{
  "tokens": [
    { "token": "Elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 },
    { "token": "is", "start_offset": 14, "end_offset": 16, "type": "word", "position": 1 },
    { "token": "a", "start_offset": 17, "end_offset": 18, "type": "word", "position": 2 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "word", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "word", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "word", "position": 5 }
  ]
}
In this example:
- The text is split into tokens based on whitespace.
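The difference between splitting on whitespace and splitting on word boundaries shows up as soon as the text contains punctuation. The following Python sketch compares the two. The regular expression is only a rough stand-in for the standard tokenizer, and the sample sentence is made up for illustration.

```python
import re

text = "Full-text search, powered by Elasticsearch."

# Whitespace splitting keeps punctuation and hyphens attached to the tokens.
whitespace_tokens = text.split()

# Word-boundary splitting (approximated here with \w+) discards punctuation.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)  # ['Full-text', 'search,', 'powered', 'by', 'Elasticsearch.']
print(word_tokens)        # ['Full', 'text', 'search', 'powered', 'by', 'Elasticsearch']
```

With whitespace splitting, a query for "search" would not match the indexed token "search," unless a later filter removes the comma. This is one reason the whitespace tokenizer is usually reserved for text whose punctuation is meaningful.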
NGram Tokenizer
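The ngram tokenizer breaks text into smaller chunks (n-grams) of specified lengths. It's useful for partial matching and autocomplete features. In Elasticsearch 7.x and later, the difference between max_gram and min_gram may not exceed the index.max_ngram_diff setting, which defaults to 1, so the example below raises it to 2.
PUT /ngram_example
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}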
Analyzing text:
GET /ngram_example/_analyze
{
  "tokenizer": "ngram_tokenizer",
  "text": "search"
}
Output:
{
  "tokens": [
    { "token": "sea", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 },
    { "token": "sear", "start_offset": 0, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "searc", "start_offset": 0, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "ear", "start_offset": 1, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "earc", "start_offset": 1, "end_offset": 5, "type": "word", "position": 4 },
    { "token": "earch", "start_offset": 1, "end_offset": 6, "type": "word", "position": 5 },
    { "token": "arc", "start_offset": 2, "end_offset": 5, "type": "word", "position": 6 },
    { "token": "arch", "start_offset": 2, "end_offset": 6, "type": "word", "position": 7 },
    { "token": "rch", "start_offset": 3, "end_offset": 6, "type": "word", "position": 8 }
  ]
}
In this example:
- The text "search" is broken into n-grams of lengths 3 to 5.
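The n-gram enumeration itself is simple to sketch in Python: for every start offset, emit every substring whose length lies between min_gram and max_gram. The function name ngrams is hypothetical, and the sketch ignores tokenizer options such as token_chars.

```python
def ngrams(text, min_gram, max_gram):
    """Yield (token, start_offset, end_offset) for every n-gram of the text."""
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):   # longer grams would run past the end of the text
                break
            yield text[start:end], start, end


tokens = list(ngrams("search", 3, 5))
for token in tokens:
    print(token)
```

For a six-character word with lengths 3 to 5 this yields nine n-grams, ordered by start offset and then by length. The number of tokens grows quickly with the input length and the size range, so wide ranges can inflate the index considerably.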
Custom Token Filters
Token filters are used to modify the tokens produced by a tokenizer. Common token filters include lowercasing, removing stop words, stemming, and more.
Example: Lowercase and Stop Filter
Let's create a custom analyzer that includes both lowercase and stop filters.
PUT /custom_filter_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
Analyzing text:
GET /custom_filter_example/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Elasticsearch is a powerful search engine"
}
Output:
{
  "tokens": [
    { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "<ALPHANUM>", "position": 0 },
    { "token": "powerful", "start_offset": 19, "end_offset": 27, "type": "<ALPHANUM>", "position": 3 },
    { "token": "search", "start_offset": 28, "end_offset": 34, "type": "<ALPHANUM>", "position": 4 },
    { "token": "engine", "start_offset": 35, "end_offset": 41, "type": "<ALPHANUM>", "position": 5 }
  ]
}
In this example:
- The text is tokenized using the standard tokenizer.
- Tokens are converted to lowercase.
- Stop words ("is", "a") are removed.
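Token filters run in the order they are listed, and the order can change the result. The sketch below assumes the stop filter compares tokens case-sensitively by default, so it only catches capitalized stop words if lowercasing has already happened. The filter functions and the short STOP_WORDS set are hypothetical stand-ins.

```python
STOP_WORDS = {"a", "is", "the"}  # abbreviated, illustrative list


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop(tokens):
    """Token filter: drop stop words using a case-sensitive comparison."""
    return [t for t in tokens if t not in STOP_WORDS]


def apply_filters(tokens, filters):
    """Apply each filter to the token list, in the order given."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["The", "engine", "is", "powerful"]
print(apply_filters(tokens, [lowercase, stop]))  # ['engine', 'powerful']
print(apply_filters(tokens, [stop, lowercase]))  # ['the', 'engine', 'powerful']
```

Listing "lowercase" before "stop", as the analyzer above does, ensures that "The" at the start of a sentence is treated the same as "the" elsewhere.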
Practical Application: Indexing and Searching
Now that we've covered the basics of analyzers and tokenizers, let's see how to apply them in a practical indexing and searching scenario.
Creating an Index with Custom Analyzer
PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "article_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "article_analyzer"
      },
      "content": {
        "type": "text",
        "analyzer": "article_analyzer"
      }
    }
  }
}
Indexing Documents
POST /articles/_doc/1
{
  "title": "Introduction to Elasticsearch",
  "content": "Elasticsearch is a powerful search engine based on Apache Lucene."
}
POST /articles/_doc/2
{
  "title": "Full-Text Search Techniques",
  "content": "Learn how to use full-text search with Elasticsearch to build powerful search applications."
}
Searching with Custom Analyzer
GET /articles/_search
{
  "query": {
    "match": {
      "content": "search powerful"
    }
  }
}
Output:
{
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "hits": [
      {
        "_index": "articles",
        "_id": "1",
        "_score": 0.75,
        "_source": {
          "title": "Introduction to Elasticsearch",
          "content": "Elasticsearch is a powerful search engine based on Apache Lucene."
        }
      },
      {
        "_index": "articles",
        "_id": "2",
        "_score": 0.65,
        "_source": {
          "title": "Full-Text Search Techniques",
          "content": "Learn how to use full-text search with Elasticsearch to build powerful search applications."
        }
      }
    ]
  }
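}
In this example:
- We search for documents containing "search" or "powerful" in the content field. By default, a match query returns documents that contain at least one of the query terms. The response above is abbreviated, and the _score values are illustrative; actual scores depend on the index statistics.
- The match query analyzes the query string with the same article_analyzer that was applied to the content field at index time, so the query terms are lowercased and stemmed the same way as the indexed text. For example, stemming lets a query for "searching" match documents that contain "search".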
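To see why the same analyzer has to run at both index time and search time, here is a toy inverted index in Python. It is a sketch under simplifying assumptions: the analyzer only splits on word characters, lowercases, and removes a few stop words (stemming and scoring are omitted), and the names analyze, inverted_index, and match are hypothetical.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "is", "on", "to", "with"}  # abbreviated, illustrative list


def analyze(text):
    """Toy analyzer: word split, lowercase, stop word removal (no stemming)."""
    words = (w.lower() for w in re.findall(r"\w+", text))
    return [t for t in words if t not in STOP_WORDS]


docs = {
    1: "Elasticsearch is a powerful search engine based on Apache Lucene.",
    2: "Learn how to use full-text search with Elasticsearch to build powerful search applications.",
}

# Index time: map each analyzed term to the set of documents that contain it.
inverted_index = defaultdict(set)
for doc_id, content in docs.items():
    for term in analyze(content):
        inverted_index[term].add(doc_id)


def match(query):
    """Return ids of documents containing at least one analyzed query term."""
    hits = set()
    for term in analyze(query):  # search time: same analyzer as at index time
        hits |= inverted_index.get(term, set())
    return sorted(hits)


print(match("search powerful"))  # [1, 2]
print(match("LUCENE"))           # [1]
```

The query "LUCENE" finds document 1 only because the query is lowercased the same way the indexed text was. If the two sides were analyzed differently, the terms would not line up in the index and the document would be missed.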
Conclusion
Understanding and utilizing analyzers and tokenizers in Elasticsearch is essential for leveraging its full-text search capabilities. By customizing these components, you can tailor the text processing to fit your specific requirements, resulting in more relevant and effective search results.
In this guide, we explored how analyzers and tokenizers work, created custom analyzers, and demonstrated their practical application in indexing and searching documents. With these concepts and examples, you are well-equipped to enhance your Elasticsearch implementation and build powerful search solutions.