My love for ElasticSearch grows every time I use it.  It’s a truly great piece of software. But it confuses me sometimes. Maybe it confuses you too.

ElasticSearch is built on Apache’s Lucene and borrows many of its concepts. The documentation for ES is passable…but there are many areas where it is shrouded in a “fog of knowledge”, assuming you already have a background in Lucene.

So if you’re like me and have never used either Solr or Lucene, some concepts may take some time to puzzle out. This usually involves digging through the (excellent) mailing lists and experimenting on your own.

In particular, mappings were an area where I spent a lot of time fiddling around. Eventually it clicked and I had an “Aha!” moment. I’m writing this article in hopes of speeding up your “Aha!” moment.

Default Mapping

Since ES does not require a schema a priori, it is tempting to just throw data at ES and hope for the best.

So if we do this:

curl -XPUT http://localhost:9200/test/item/1 -d '{"name":"zach", "description": "A Pretty cool guy."}'

ES is smart enough to realize that both “name” and “description” are strings. ES implicitly creates this mapping:

"mappings" : {
    "item" : {
        "properties" : {
            "description" : {
                "type" : "string"
            },
            "name" : {
                "type" : "string"
            }
        }
    }
}

Wait, hold up. What’s a mapping?

The simplest analogy for an ES mapping is a data-type in a statically-typed programming language. A variable can be typed as an integer…and then it can only hold integers. Similarly, a number mapping in ES means that a field can only hold numbers.

However, this analogy leaves out a very, very important aspect of mappings. A mapping not only tells ES what is in a field…it tells ES what terms are indexed and searchable.

This unassuming statement took me a while to figure out, but it is incredibly important when working with ElasticSearch. Is your query returning poor results? Your mapping probably sucks. Not returning any results? Check your mapping.

When in doubt: check your mapping.
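
Happily, checking is easy: ES will show you the mapping it is currently using via the Get Mapping API, over the same REST interface (“test” and “item” here are simply the index and type from our example):

curl -XGET "http://localhost:9200/test/item/_mapping?pretty=true"

For the document we indexed above, this returns the implicit mapping that ES created for us.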

Anatomy of a Mapping

A mapping is composed of one or more ‘analyzers’, which are in turn built from one or more ‘filters’. When ES indexes your document, it passes each field through the analyzer specified for it, which in turn passes the field through its filters.

Filters are easy to understand: a filter is a function that transforms data. Given a string, it returns another string (after some modifications). A function that converts strings to lowercase is a good example of a filter.

An analyzer is simply a group of filters that are executed in order. So an analyzer may first apply a lowercase filter, then apply a stop-word removal filter. Once the analyzer has run, the remaining terms are indexed and stored in ES.

Which means a mapping is simply a series of instructions on how to transform your input data into searchable, indexed terms.
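
To make this concrete, here is a sketch of building your own analyzer out of filters when creating an index. One detail the summary above glosses over: a custom analyzer also needs a ‘tokenizer’, which splits the text into individual terms before the filters run. (The index name “test2” and analyzer name “my_analyzer” are just placeholders I picked for this example.)

curl -XPUT "http://localhost:9200/test2" -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "stop"]
                }
            }
        }
    }
}'

This particular chain behaves much like the default described below: split on word boundaries, lowercase everything, then strip stop-words.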

Default Analyzer

Ok, so back to our example. Above, ES guessed “string” as the mapping for “description”. When ES implicitly creates a “string” mapping, it applies the default global analyzer. Unless you’ve changed it, the default analyzer is the Standard Analyzer. This analyzer will apply the standard token filter, lowercase filter and stop token filter to your field.

We can query ES with the “_analyze” endpoint to see how fields and queries are actually transformed. Using our above example, the “description” field will be transformed into:

curl -X GET "http://localhost:9200/test/_analyze?analyzer=standard&pretty=true" -d "A Pretty cool guy."
 
{
  "tokens" : [ {
    "token" : "pretty",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "cool",
    "start_offset" : 9,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "guy",
    "start_offset" : 14,
    "end_offset" : 17,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}

As you can see, our field was transformed into [pretty], [cool], [guy]. We lost a word (“A”), capitalization and punctuation. Importantly, even though ES continues to store the original document in its original form, the only parts that are searchable are the parts that have been run through an analyzer.

Here is an example of a search failing for exactly this reason:

$ curl -X GET "http://localhost:9200/test/_search?pretty=true" -d '{
    "query" : {
        "text" : { "description": "a" }
    }
}'
 
{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

“Text” queries run your search input through the same analyzer/filter chain that the field was indexed with. This means that our query for “a” returns nothing, since the term “a” was never indexed and stored in ES. Conversely, if we search for “cool”:

curl -X GET "http://localhost:9200/test/_search?pretty=true" -d '{
    "query" : {
        "text" : { "description": "cool" }
    }
}'
 
{
  "took" : 29,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "test",
      "_type" : "item",
      "_id" : "1",
      "_score" : 0.15342641, "_source" : {"name":"zach", "description": "A pretty cool guy"}
    } ]
  }
}

We now return the correct result. This was an admittedly simple example, but it highlights how mappings work. Don’t think of mappings as data-types – think of them as instructions on how you will eventually search your data. If you care about stop-words like “a”, you need to change the analyzer so that they aren’t removed, as sketched below.
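
For instance, here is one way to keep stop-words in the “description” field: define a custom analyzer with only the lowercase filter, so nothing gets removed, and map the field to it. (Again, the index name “test3” and analyzer name “keep_stopwords” are placeholders of my own choosing.)

curl -XPUT "http://localhost:9200/test3" -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "keep_stopwords" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["lowercase"]
                }
            }
        }
    },
    "mappings" : {
        "item" : {
            "properties" : {
                "description" : { "type" : "string", "analyzer" : "keep_stopwords" }
            }
        }
    }
}'

Index the same document into this index, and the failing query for “a” above will happily return it, because “a” is now indexed like any other term.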

In my next post, I’ll walk through constructing a more advanced, non-trivial mapping: a practical, real-world example instead of these toy ones.