I previously wrote about mappings in ElasticSearch, but my previous article was very simple…similar to most “hello world” type tutorials on the internet.

Much of what makes ElasticSearch fantastic comes from understanding and utilizing advanced mappings.  There is a serious dearth of quality guides online when it comes to constructing more advanced hierarchies.  

This is hopefully the first of several articles where I explore layouts that are more complicated than the traditional intro tutorials.

The Problem

I’m going to use an example from a recent project, since it is both “real world” and relatively complicated.

In this project, I needed to search a large number of product titles.  Getting the data into ElasticSearch is the easy part, and searching it is nearly as easy.  But returning good, relevant search results…that’s pretty hard.

The key is to analyze the titles appropriately so that search queries can find the correct terms, and then to boost those terms depending on their relevance to the query.

Let’s take a look at the data that I’ll be working with. The field we are interested in is `productName`:

{
  "properties":{
    "productName":{   },
    "productID":{   },
    "warehouse":{   },
    "vendor":{   },
    "productDescription":{   },
    "categories":{   },
    "stockLevel":{   },
    "cost":{   }
  }
}
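To make that concrete, here is a hypothetical document being indexed. The index name, document ID, and all field values below are invented for illustration; only the field names come from the real mapping:

curl -X PUT "http://localhost:9200/products/item/1" -d '{
  "productName": "Metal Servo Gear 23C",
  "productID": "SG-23C",
  "warehouse": "east-2",
  "vendor": "Acme Hobbies",
  "productDescription": "Hardened 23-tooth metal servo gear.",
  "categories": ["servos", "spare parts"],
  "stockLevel": 42,
  "cost": 7.99
}'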

Exact Term Matches

Setting up proper analyzers in ES is all about thinking about the search query. You have to provide instructions to ES about the appropriate transformations so you can search intelligently.

So, the first place to start with a general query search is exact term matching. If the user enters “Metal Servo Gear 23C” and there is a product in your database that matches…that is probably the most relevant result.

Our mapping will start with this:

{
  "mappings":{
    "item":{
      "properties":{
        "productName":{
          "fields":{
            "productName":{
              "type":"string",
              "analyzer":"full_name"
            }
          },
          "type":"multi_field"
        }
      }
    }
  },
  "settings":{
    "analysis":{
      "filter":{
 
      },
      "analyzer":{
        "full_name":{
          "filter":[
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type":"custom",
          "tokenizer":"standard"
        }
      }
    }
  }
}

So to start, we are defining `productName` as a string that is analyzed by `full_name`. This is specified by the “analyzer” field. If you look at the analysis section of the settings, you’ll see a corresponding “full_name” analyzer. This is the definition that will be used to parse the input into terms.
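As a quick aside, that body is exactly what you would send when creating the index, should you want to follow along. The index name (products) and the file name are my own choices:

# Assuming the mapping above is saved to product_mapping.json
curl -X PUT "http://localhost:9200/products" -d @product_mapping.json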

The first thing that happens to an input query is tokenization – breaking an input query into smaller chunks called tokens. There are several tokenizers available, which you should explore on your own when you get a chance.

The Standard tokenizer is being used in this example, which is a pretty good tokenizer for most English-language search problems. You can query ES to see how it tokenizes a sample sentence:

curl -X GET "http://localhost:9200/test/_analyze?tokenizer=standard&pretty=true" -d 'The quick brown fox is jumping over the lazy dog.'
{
  "tokens" : [ {
    "token" : "The",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "quick",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "brown",
    "start_offset" : 10,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "fox",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "is",
    "start_offset" : 20,
    "end_offset" : 22,
    "type" : "<ALPHANUM>",
    "position" : 5
  }, {
    "token" : "jumping",
    "start_offset" : 23,
    "end_offset" : 30,
    "type" : "<ALPHANUM>",
    "position" : 6
  }, {
    "token" : "over",
    "start_offset" : 31,
    "end_offset" : 35,
    "type" : "<ALPHANUM>",
    "position" : 7
  }, {
    "token" : "the",
    "start_offset" : 36,
    "end_offset" : 39,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "lazy",
    "start_offset" : 40,
    "end_offset" : 44,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "dog",
    "start_offset" : 45,
    "end_offset" : 48,
    "type" : "<ALPHANUM>",
    "position" : 10
  } ]
}

You can see that in this example, the Standard tokenizer basically strips punctuation and splits on whitespace.

Ok, so our input query has been turned into tokens. Referring back to the mapping, the next step is to apply filters to these tokens. In order, these filters are applied to each token: Standard Token Filter, Lowercase Filter, ASCII Folding Filter.

The Standard Token filter docs are sparse, but ES once again rescues us with an illustrative example:

curl -X GET "http://localhost:9200/test/_analyze?filter=standard&pretty=true" -d 'The quick brown fox is jumping over the lazy dog.'
{
  "tokens" : [ {
    "token" : "quick",
    "start_offset" : 4,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "brown",
    "start_offset" : 10,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "fox",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "jumping",
    "start_offset" : 23,
    "end_offset" : 30,
    "type" : "<ALPHANUM>",
    "position" : 6
  }, {
    "token" : "over",
    "start_offset" : 31,
    "end_offset" : 35,
    "type" : "<ALPHANUM>",
    "position" : 7
  }, {
    "token" : "lazy",
    "start_offset" : 40,
    "end_offset" : 44,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "dog",
    "start_offset" : 45,
    "end_offset" : 48,
    "type" : "<ALPHANUM>",
    "position" : 10
  } ]
}

As you can see, we basically lost all the “stop words” that are common in the English language (“the”, “is”). One note of caution: the Standard token filter itself does very little beyond normalizing the tokenizer’s output; the stop-word removal you see here is really the work of a stop filter (which older versions of ES included in the standard analyzer by default). If these kinds of words are important to your queries, make sure a stop filter doesn’t sneak into your chain. In my case (and many search applications) these words do not help search relevance at all, and usually hurt it.

Lowercase Filter and ASCII Folding are simple. Lowercase does exactly what it says on the label: it lowercases each token. Unless case sensitivity is critical to your application, you will almost always want this filter.

ASCII folding smooshes Unicode characters into their ASCII “equivalents”. An é becomes an e. While the folded letters may not be functionally equivalent, folding helps provide a coherent search experience, since most people don’t type accents into their queries.
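You can sanity-check both filters with the _analyze API against the same test index as before. The sample text is arbitrary, and depending on your ES version the parameter may be spelled filters or token_filters:

# Should return the tokens "cafe" and "resume"
curl -X GET "http://localhost:9200/test/_analyze?tokenizer=standard&filters=lowercase,asciifolding&pretty=true" -d 'Café RÉSUMÉ'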

Still with me so far? The mapping we’ve created will match exact term queries. Astute readers will notice that it matches exact terms, not exact queries (whole phrases). This was on purpose, since it is more practical.

If you want to match exact queries, however, you could use the Keyword Analyzer and skip everything else that I just talked about.
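For completeness, one way to sketch that would be an extra sub-field that uses the built-in keyword analyzer. The “exact” sub-field name here is my own invention, not part of the project’s actual mapping:

"productName":{
   "type":"multi_field",
   "fields":{
      "productName":{ "type":"string", "analyzer":"full_name" },
      "exact":{ "type":"string", "analyzer":"keyword" }
   }
}

Searching productName.exact would then only match when the entire query matches the entire field.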

Fuzzy matching with nGrams

This mapping will get you 60% of the way there, but it leaves a lot to be desired. The next step is to add a little bit of “fuzziness” to your search capability. Often, users don’t quite know what they are looking for. Or perhaps they misspelled their query.

There are several ways to do this in ES. I’m going to show you how to do it with nGram filters. nGrams take a token and create a number of smaller tokens by sliding a “window” across it. With the nGram tokenizer’s default window of one to two characters, “token” breaks up into these tokens:

curl -X GET "http://localhost:9200/test/_analyze?tokenizer=nGram&pretty=true" -d 'token'
{
  "tokens" : [ {
    "token" : "t",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "o",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "k",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "e",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "n",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "to",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "ok",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "ke",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 8
  }, {
    "token" : "en",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "word",
    "position" : 9
  } ]
}

Another variation of the nGram is the Edge-nGram. It’s the same idea, except the grams always start from a specified edge of the token (e.g. “t”, “to”, “tok”, and so on).
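You can see the difference by running the same probe through the edgeNGram tokenizer. With its default one-to-two character window it keeps only the front-anchored grams (the tokenizer name and defaults may vary by ES version):

# With default settings this returns just "t" and "to"
curl -X GET "http://localhost:9200/test/_analyze?tokenizer=edgeNGram&pretty=true" -d 'token'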

Each of those new tokens will match parts of an input query, and the more tokens that match, the higher the relevance. This allows for misspellings since only a handful of tokens will be incorrect, while the bulk of the query will match. It also allows for variations in wording or phrasing.

We implement it into our mapping like this:

{
   "mappings":{
      "item":{
         "properties":{
            "productName":{
               "fields":{
                  "productName":{
                     "type":"string",
                     "analyzer":"full_name"
                  },
                  "partial":{
                     "search_analyzer":"full_name",
                     "index_analyzer":"partial_name",
                     "type":"string"
                  }
               },
               "type":"multi_field"
            }
         }
      }
   },
   "settings":{
      "analysis":{
         "filter":{
            "name_ngrams":{
               "side":"front",
               "max_gram":20,
               "min_gram":2,
               "type":"edgeNGram"
            }
         },
         "analyzer":{
            "full_name":{
               "filter":[
                  "standard",
                  "lowercase",
                  "asciifolding"
               ],
               "type":"custom",
               "tokenizer":"standard"
            },
            "partial_name":{
               "filter":[
                  "standard",
                  "lowercase",
                  "asciifolding",
                  "name_ngrams"
               ],
               "type":"custom",
               "tokenizer":"standard"
            }
         }
      }
   }
}

Three things changed.

  • We added a new field under “productName” called “partial”.
  • We added a filter called “name_ngrams”.
  • We added a new analyzer called “partial_name”.

Let’s look at the “name_ngrams” filter first:

"name_ngrams":{
   "side":"front",
   "max_gram":20,
   "min_gram":2,
   "type":"edgeNGram"
 }

This defines an nGram filter that builds grams from the front edge of each token. min_gram and max_gram tell ES the shortest and longest grams to create for each term. My input strings were pretty long, so I made max_gram relatively large. A front edge-nGram with min_gram=2 always starts with the first two characters and grows from there (e.g. “ra”, “rac”, “race”). The filter is then given a name so we can reference it in our analyzers.
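Once an index with this filter registered exists, you can poke at the filter directly with _analyze. I’m assuming the index is named products here, and the filter must already be present in that index’s settings:

# Should return: ra, rac, race, racec, raceca, racecar
curl -X GET "http://localhost:9200/products/_analyze?tokenizer=standard&filters=lowercase,name_ngrams&pretty=true" -d 'racecar'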

So, looking at our new analyzer next:

"partial_name":{
      "filter":[
         "standard",
         "lowercase",
         "asciifolding",
         "name_ngrams"
      ],
      "type":"custom",
      "tokenizer":"standard"
   }

This analyzer is basically the same as our full_name analyzer, except we apply our new name_ngrams filter to each token at the end. Given the sentence “A quick Brown FOX”, we would generate this list (separated by token so you can see how everything gets broken up):

qu, qui, quic, quick
br, bro, brow, brown
fo, fox
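You can double-check that list against the real analyzer once the index exists (again assuming an index named products). Note that the lone “A” disappears entirely: after lowercasing it is only one character, which is shorter than min_gram, so the edge-nGram filter emits nothing for it.

curl -X GET "http://localhost:9200/products/_analyze?analyzer=partial_name&pretty=true" -d 'A quick Brown FOX'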

With me so far? Ok, so the last part that was added is another field property, named partial. This field is a little different than the productName field:

 "partial":{
      "search_analyzer":"full_name",
      "index_analyzer":"partial_name",
      "type":"string"
  }

As you can see, we specify both a search analyzer and an index analyzer. Huh? These two separate analyzers tell ES how to process text when it is indexing the field versus when it is searching it. But why are both needed?

The index analyzer is easy to understand. We want to break our input fields into many tokens so we can search them later. So we instruct ES to use the new partial_name analyzer that we built, so that it creates nGrams for us.

The search analyzer is a little trickier to understand, but crucial to getting good relevance. Imagine querying for “Race”. We want that query to match “race”, “races” and “racecar”. When searching, we want to make sure ES eventually searches with the token “race”. The full_name analyzer will give us the needed token to search with.

If, however, we used the partial_name nGram analyzer, we would generate a list of nGrams as our search query. The search query “Race” would turn into ["ra", "rac", "race"].

Those tokens are then used to search the index. As you might guess, the short grams will match a lot of things you don’t want: “ra” alone will hit “racket”, “ratify”, and “rapport”.

So specifying different index and search analyzers is critical when working with things like ngrams. Make sure you always double check what you expect to query ES with…and what is actually being passed to ES.
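To bring this back to the boosting I mentioned at the start, here is a sketch of the sort of query you would run against this mapping. The index name and boost values are my own choices, not anything canonical; the idea is simply that exact term matches on productName score higher than partial nGram matches:

curl -X GET "http://localhost:9200/products/item/_search?pretty=true" -d '{
  "query": {
    "bool": {
      "should": [
        { "match": { "productName":         { "query": "Metal Servo Gear 23C", "boost": 3 } } },
        { "match": { "productName.partial": { "query": "Metal Servo Gear 23C", "boost": 1 } } }
      ]
    }
  }
}'

The full_name-analyzed field rewards exact terms, while the partial field quietly picks up prefixes and near-misses at a lower weight.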

Wrapping things up

This tutorial is long enough already, so I’m just going to leave you with this Gist of a full example. The Gist expands on the example here in the tutorial by adding two more fields (nGram and another edge-nGram) as well as their associated filters/analyzers. It also shows a query at the bottom as an example of actually querying the mapping and filtering on some unrelated fields.

This was long, but I hope it was relatively clear. Feel free to drop me an email (contact form at the bottom of the page) or leave a comment if you have questions.