Constructing more complicated mappings in ElasticSearch
I previously wrote about mappings in ElasticSearch, but my previous article was very simple…similar to most “hello world” type tutorials on the internet.
Much of what makes ElasticSearch fantastic comes from understanding and utilizing advanced mappings. There is a serious dearth of quality guides online when it comes to constructing more advanced hierarchies.
This is hopefully the first of several articles where I explore layouts that are more complicated than the traditional intro tutorials.
The Problem
I’m going to use an example from a recent project, since it is both “real world” and relatively complicated.
In this project, I needed to search a number of product titles. Getting the data into ElasticSearch is the easiest part, and searching it is nearly as easy. But returning good, relevant search results…that’s pretty hard.
The key is to analyze the titles appropriately so that search queries can find the correct terms, and then to boost those terms depending on their relevance to the query.
Let’s take a look at the data that I’ll be working with. The field we are interested in is `productName`:
{
  "properties": {
    "productName": { },
    "productID": { },
    "warehouse": { },
    "vendor": { },
    "productDescription": { },
    "categories": { },
    "stockLevel": { },
    "cost": { }
  }
}
Exact Term Matches
Setting up proper analyzers in ES is all about thinking about the search query. You have to provide instructions to ES about the appropriate transformations so you can search intelligently.
So, the first place to start with a general query search is exact term matching. If the user enters “Metal Servo Gear 23C” and there is a product in your database that matches…that is probably the most relevant result.
Our mapping will start with this:
{
  "settings": {
    "analysis": {
      "filter": { },
      "analyzer": {
        "full_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "productName": {
          "type": "multi_field",
          "fields": {
            "productName": { "type": "string", "analyzer": "full_name" }
          }
        }
      }
    }
  }
}
So to start, we are defining `productName` as a string that is analyzed by `full_name`. This is specified by the “analyzer” field. If you look at the Analyzer section of the mapping, you’ll see a corresponding “full_name” analyzer. This is the definition that will be used to parse the input into terms.
The first thing that happens to an input query is tokenization – breaking an input query into smaller chunks called tokens. There are several tokenizers available, which you should explore on your own when you get a chance.
The Standard tokenizer is being used in this example, which is a pretty good tokenizer for most English-language search problems. You can query ES to see how it tokenizes a sample sentence:
curl -X GET "http://localhost:9200/test/_analyze?tokenizer=standard&pretty=true" -d 'The quick brown fox is jumping over the lazy dog.'

{
  "tokens" : [
    { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "quick", "start_offset" : 4, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "fox", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "is", "start_offset" : 20, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "jumping", "start_offset" : 23, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 8 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}
You can see that in this example, the Standard tokenizer basically strips punctuation and splits on whitespace.
Ok, so our input query has been turned into tokens. Referring back to the mapping, the next step is to apply filters to these tokens. In order, these filters are applied to each token: Standard Token Filter, Lowercase Filter, ASCII Folding Filter.
The Standard Token filter docs are sparse, but ES once again rescues us with an illustrative example:
curl -X GET "http://localhost:9200/test/_analyze?filter=standard&pretty=true" -d 'The quick brown fox is jumping over the lazy dog.'

{
  "tokens" : [
    { "token" : "quick", "start_offset" : 4, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "fox", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "jumping", "start_offset" : 23, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "over", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 7 },
    { "token" : "lazy", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "dog", "start_offset" : 45, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 10 }
  ]
}
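As you can see, we lost all the “stop words” that are common in the English language (“the”, “is”). Strictly speaking, the Standard Token Filter only normalizes tokens; because this request names no tokenizer, the output most likely reflects the default analyzer, whose stop filter does the removing. If you want the same behavior in a custom analyzer, add the `stop` filter explicitly, and if these kinds of words are important to your query, leave it out. In my case (and in many search applications) these words do not help search relevance at all, and usually hurt it.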
Lowercase Filter and ASCII Folding are simple. Lowercase does what it says on the label: it lowercases each token. Unless case-sensitivity is critical to your application, this is a sensible default.
ASCII folding smooshes unicode characters into their ASCII “equivalent”. An é becomes an e. While the folded letters may not be functionally equivalent, it helps provide a coherent search interface since most people don’t use accents in their queries.
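To make the chain concrete, here is a minimal Python sketch that approximates what the `full_name` analyzer does to a title. It is not how ES implements it: the regex tokenizer and the `unicodedata`-based folding are stand-ins I'm assuming for illustration, and the function names are hypothetical.

```python
import re
import unicodedata

def standard_tokenize(text):
    # Rough stand-in for the Standard tokenizer: keep runs of letters and digits.
    return re.findall(r"\w+", text)

def ascii_fold(token):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def full_name(text):
    # Tokenize first, then lowercase and fold each token, in that order.
    return [ascii_fold(tok.lower()) for tok in standard_tokenize(text)]

print(full_name("Metal Servo Gear 23C"))
print(full_name("Café Crème"))
```

Because the same chain runs on both the indexed title and the search query, “metal servo gear 23c” and “Metal Servo Gear 23C” end up as identical terms.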
Still with me so far? The mapping we’ve created will match exact term queries. Astute readers will notice that this mapping matches exact terms, not exact queries. This was on purpose, since it is more practical.
If you want to match exact queries, however, you could use the Keyword Analyzer and skip everything else that I just talked about.
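The difference between matching exact terms and matching exact queries can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `keyword_analyze` mimics the idea of the Keyword Analyzer (the whole input becomes one token), `term_analyze` is a rough stand-in for a tokenizing analyzer, and both names are hypothetical.

```python
import re

def keyword_analyze(text):
    # The whole input becomes a single, unchanged token.
    return [text]

def term_analyze(text):
    # Split into lowercase terms, roughly like a tokenizing analyzer.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

title = "Metal Servo Gear 23C"
query = "servo gear"

# Keyword-style: the query must equal the whole title to match.
keyword_match = bool(set(keyword_analyze(query)) & set(keyword_analyze(title)))

# Term-style: every query term only has to appear somewhere in the title.
term_match = set(term_analyze(query)) <= set(term_analyze(title))

print(keyword_match, term_match)
```

A partial query like “servo gear” finds the product with term matching but not with keyword-style matching, which is why term matching is usually the more practical choice for product search.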
Fuzzy matching with nGrams
This mapping will get you 60% of the way there, but it leaves a lot to be desired. The next step is to add a little bit of “fuzziness” to your search capability. Often, users don’t quite know what they are looking for. Or perhaps they misspelled their query.
There are several ways to do this in ES. I’m going to show you how to do it with nGram filters. nGrams take a token and create a number of smaller tokens by sliding a “window” across the token. With the nGram tokenizer’s default settings, windows one and two characters long create these tokens:
curl -X GET "http://localhost:9200/test/_analyze?tokenizer=nGram&pretty=true" -d 'token'

{
  "tokens" : [
    { "token" : "t", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 1 },
    { "token" : "o", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 2 },
    { "token" : "k", "start_offset" : 2, "end_offset" : 3, "type" : "word", "position" : 3 },
    { "token" : "e", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 4 },
    { "token" : "n", "start_offset" : 4, "end_offset" : 5, "type" : "word", "position" : 5 },
    { "token" : "to", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 6 },
    { "token" : "ok", "start_offset" : 1, "end_offset" : 3, "type" : "word", "position" : 7 },
    { "token" : "ke", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 8 },
    { "token" : "en", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 9 }
  ]
}
Another variation of the nGram is the Edge-nGram. It’s the same idea, except the string is always started at the specified edge (e.g. “t”, “to”, “tok”, etc).
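If the sliding window is hard to picture, this small Python sketch generates both kinds of grams. It only illustrates the idea, not ES internals; the function names and default sizes are my own choices for the example.

```python
def ngrams(token, min_gram=1, max_gram=2):
    # Slide a window of each size across the token, shortest size first.
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams

def edge_ngrams(token, min_gram=1, max_gram=5):
    # Only keep windows anchored at the front edge of the token.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]

print(ngrams("token"))       # every 1- and 2-character window
print(edge_ngrams("token"))  # prefixes only
```

The plain nGram version lets a query match the middle of a word, while the edge version only matches prefixes and produces far fewer tokens.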
Each of those new tokens will match parts of an input query, and the more tokens that match, the higher the relevance. This allows for misspellings since only a handful of tokens will be incorrect, while the bulk of the query will match. It also allows for variations in wording or phrasing.
We add it to our mapping like this:
{
  "settings": {
    "analysis": {
      "filter": {
        "name_ngrams": {
          "type": "edgeNGram",
          "side": "front",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "full_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "asciifolding" ]
        },
        "partial_name": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "asciifolding", "name_ngrams" ]
        }
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "productName": {
          "type": "multi_field",
          "fields": {
            "productName": { "type": "string", "analyzer": "full_name" },
            "partial": {
              "type": "string",
              "search_analyzer": "full_name",
              "index_analyzer": "partial_name"
            }
          }
        }
      }
    }
  }
}
Three things changed.
- We added a new field under “productName” called “partial”.
- We added a filter called “name_ngrams”
- We added a new analyzer called “partial_name”
Let’s start with the “name_ngrams” filter:
"name_ngrams": {
  "type": "edgeNGram",
  "side": "front",
  "min_gram": 2,
  "max_gram": 20
}
This defines an edge-nGram filter that produces tokens from the front edge. `min_gram` and `max_gram` set the shortest and longest token lengths that ES creates for each term. My input queries were pretty long, so I made `max_gram` relatively large. A front edge-nGram with min_gram=2 always starts with the first two characters and grows from there (e.g. “ra”, “rac”, “race”). This filter is then given a name so we can use it in analyzers.
So, looking at our new analyzer next:
"partial_name": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [ "standard", "lowercase", "asciifolding", "name_ngrams" ]
}
This analyzer is basically the same as our full_name analyzer, except we apply our new name_ngrams filter to each token at the end. Given the sentence “A quick Brown FOX”, we would generate this list (separated by token so you can see how everything gets broken up):
qu, qui, quic, quick
br, bro, brow, brown
fo, fox
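The list above can be reproduced with a rough Python approximation of the `partial_name` chain. This is a sketch under assumptions: the regex tokenizer stands in for the Standard tokenizer, ASCII folding is left out for brevity, and the function names are hypothetical.

```python
import re

def edge_ngrams(token, min_gram=2, max_gram=20):
    # Front-anchored prefixes between min_gram and max_gram characters long.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]

def partial_name(text):
    # Tokenize, lowercase, then expand each token into its edge nGrams.
    tokens = [tok.lower() for tok in re.findall(r"\w+", text)]
    return [edge_ngrams(tok) for tok in tokens]

for grams in partial_name("A quick Brown FOX"):
    print(grams)
```

Note that the first token, “a”, produces an empty list in this sketch: it is shorter than `min_gram`, so no gram can be built from it, which matches the list above having no entry for it.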
With me so far? Ok, so the last part that was added is another field property, named partial. This field is a little different than the productName field:
"partial": {
  "type": "string",
  "search_analyzer": "full_name",
  "index_analyzer": "partial_name"
}
As you can see, we specify both a search and index analyzer. Huh? These two separate analyzers instruct ES what to do when it is indexing a field, or searching a field. But why are these needed?
The index analyzer is easy to understand. We want to break up our input fields into various tokens so we can search them later. So we instruct ES to use the new partial_name analyzer that we built, so that it can create nGrams for us.
The search analyzer is a little trickier to understand, but crucial to getting good relevance. Imagine querying for “Race”. We want that query to match “race”, “races” and “racecar”. When searching, we want to make sure ES eventually searches with the token “race”. The full_name analyzer will give us the needed token to search with.
If, however, we used the partial_name nGram analyzer, we would generate a list of nGrams as our search query. The search query “Race” would turn into ["ra", "rac", "race"].
Those tokens are then used to search the index. As you might guess, “ra” and “rac” will match a lot of things you don’t want, such as “racket” or “ratify” or “rapport”.
So specifying different index and search analyzers is critical when working with things like ngrams. Make sure you always double check what you expect to query ES with…and what is actually being passed to ES.
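Here is a toy Python model of that difference. It builds a tiny inverted index from edge nGrams and then searches it two ways. The index structure, the sample terms, and the function names are all assumptions made for illustration; real ES scoring is far more involved than a set union.

```python
from collections import defaultdict

def edge_ngrams(token, min_gram=2, max_gram=20):
    # Front-anchored prefixes between min_gram and max_gram characters long.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]

def build_index(terms):
    # Index time: every edge nGram points back at the term it came from.
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term.lower()):
            index[gram].add(term)
    return index

def search(index, query_tokens):
    # Search time: collect every term reachable from any query token.
    hits = set()
    for token in query_tokens:
        hits |= index.get(token, set())
    return sorted(hits)

index = build_index(["race", "races", "racecar", "racket", "ratify", "rapport"])

# Searching with the single token "race", as the full_name analyzer would produce.
with_full_name = search(index, ["race"])

# Searching with the query itself broken into nGrams, as partial_name would produce.
with_partial_name = search(index, edge_ngrams("race"))

print(with_full_name)
print(with_partial_name)
```

With the single token “race” only the three wanted terms come back; once the query is also broken into nGrams, “ra” drags in “racket”, “ratify” and “rapport” as well.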
Wrapping things up
This tutorial is long enough already, so I’m just going to leave you with this Gist of a full example. The Gist expands on the example here in the tutorial by adding two more fields (nGram and another edge-nGram) as well as their associated filters/analyzers. It also shows a query at the bottom as an example of actually querying the mapping and filtering on some unrelated fields.
This was long, but I hope it was relatively clear. Feel free to drop me an email (contact form at the bottom of the page) or leave a comment if you have questions.
Great tutorial, very clearly written. Thank you so much!
Very nice post. Congrats!
Hi Zachary,
Thanks for the nice post about the mappings and analysis. Actually, I was wondering about implementing facet counts that take spaces into account and are not case-sensitive. Is there a way to do that? If so, how can we implement it?
Thanks for the great post! Just what I needed to get going with ES mappings.
Thanks for this excellent post! You have written about topics not found elsewhere.
Very nice article. Easy to understand for me. Thank you very much. Helped me a lot!
Thank you! This is presented so much more clearly than anything on elasticsearch.org – or anywhere else that I’ve looked for that matter. Elasticsearch is pretty cool – I really hope they improve their documentation to match.
Great tutorial, Zachary – probably the most helpful one I’ve read so far, actually. Thanks very much for taking the time. You’re right about there being a lack of good resources. Elasticsearch is an excellent product otherwise.
I implemented something like this for indexing filenames and I’m not getting the results I would expect. Here is the gist https://gist.github.com/4346189
I have a file name like neverland.zip but when I search “ever” I get no results. I was under the assumption that indexing with ngrams would allow searching for strings within a larger string.
I see what is going on. I had to search the `partial_middle` field to get the results I was expecting. Is there any way to encompass all of these fields in the search for just `filename`? I guess `multi_field` is only for the indexing part and not the search part?
Ok I answered my own question but hopefully this helps out others. Using query string you can encompass multiple fields for search with wildcards. With bool search it has to be separated out into multiple fields. Example:
{
  "query": {
    "query_string": {
      "fields": ["filename.*"],
      "query": "everla",
      "use_dis_max": true
    }
  }
}
`use_dis_max` will automatically union the results for you from all the sub queries. In this case 4 subqueries happen (filename, filename.partial_front, filename.partial_middle, filename.partial_back)
Hey Glenbot, I like your solution. Much more elegant than adding them all into a single boolean query. I’ll try to update my tutorial over the holidays and include your solution, as well as cleaning up some of the names (“partial_middle_ngram” etc drives me crazy, not sure why I published such silly names)
Glad you figured it out! Note of caution about query_string: it will automatically split tokens by whitespace *before* handing the query to the analyzer. So multi-word queries may behave differently than you expect (this bit me a while ago before I realized what was going on)
Great article. If I don’t create a mapping or create a basic mapping now, can I change it in the future, preserving my data? Thank You!
Glad to help! If you don’t create a mapping, ES will assign default mappings (numbers, strings, etc). You can append to mappings in the future, such as adding new analyzers or fields. However, ES will not allow you to update existing fields. So if you make a basic mapping (or use the defaults) and want to update in the future, you will have to create a new index with the appropriate mapping and then re-index your data. Irritating, I know =(
I had an issue with the examples if the settings were specified after the properties with the custom analyzer. Moving the settings up top, as in the current documentation, fixed this.
Great post, thanks.