Five things I learned from ElasticSearch Training
I recently attended the Boston ElasticSearch training seminar and had a great time. In the spirit of “Elasticsearch: Five Things I was Doing Wrong”, I thought I’d write up a few tips that I learned.
Before I get started, I just want to say how fantastic the training seminar was. I consider myself an intermediate ElasticSearch user – I have a good handle on most queries, understand basic cluster management and am learning more about performance.
The ES training is two days and covers everything from basic query syntax to advanced cluster management. We had two ES devs giving the training in Boston. It was awesome asking the devs themselves questions and getting knowledgeable answers…sometimes from the devs literally looking through the source code to find the answer.
If you are on the fence about taking the course…go buy your ticket now. It’s well worth it. I found myself taking notes even on simple things like the `match` query. I picked up dozens of little tidbits about topics I already understood well, and entire pages of notes about unfamiliar topics.
Ok, enough shilling the course, on to the tips.
1. Stop using Query_String
I have never really used query_string – as a non-Lucene developer I didn’t fully understand what it is and why it should be used. Query_string is basically a thin wrapper around the Lucene query parser. When you send text to query_string, it parses by whitespace and then lets Lucene attack it.
This makes query_string very powerful, since you can perform inline operations like negation and wildcards. E.g. “+fred -jones AND (Mr OR Mrs)”.
Unfortunately, this comes with a price…too much power. In many cases, your users do not know or care about Lucene syntax. Query_string will simply explode with an error if the syntax is incorrect. It is also possible to create some truly terrible queries using wildcards. Suddenly your simple searchbox is taking down your cluster because every user thinks they are being clever with wildcards.
Stop using query_string, use match instead. If your users are accustomed to Lucene syntax…re-educate them. There is not a good reason to use query_string in 99% of the situations.
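To make the difference concrete, here is a minimal sketch of the two request bodies built as Python dicts. The field name `name` and the helper function names are made up for illustration. Only the `match` and `query_string` query types themselves come from Elasticsearch.

```python
import json


def match_query(field, user_text):
    """Build a match query body.

    The user's text is analyzed as plain words, so stray operators or
    unbalanced parentheses are not interpreted as Lucene syntax.
    """
    return {"query": {"match": {field: user_text}}}


def query_string_query(user_text):
    """Build a query_string body.

    The user's text is handed to the Lucene query parser, so operators,
    wildcards and syntax errors all take effect.
    """
    return {"query": {"query_string": {"query": user_text}}}


# Hypothetical input from a search box: note the unbalanced parenthesis.
user_input = "+fred -jones AND (Mr OR Mrs"

safe_body = match_query("name", user_input)
risky_body = query_string_query(user_input)

print(json.dumps(safe_body, indent=2))
print(json.dumps(risky_body, indent=2))
```

Both bodies carry the same raw text. The only thing that changes is whether the server treats it as words to match or as a query language to parse, which is why switching a public search box to `match` is usually a small change.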
2. Prefer the Bool filter over And/Or/Not filters (usually)
This is a performance tip: when creating compound filters (e.g. filters within filters), the Bool filter is usually more performant than the And/Or/Not filters. The Bool filter can take advantage of term bitsets in the filter memory cache, increasing retrieval speed. The And/Or/Not filters do not use the filter cache and do not gain a speed boost.
However, if you are working with non-bitset operations like Geo functions or scripts, you should use the And/Or/Not set over Bool. In these situations, the filter data has to be loaded into the field memory cache anyway, obviating the need for a fast bitset cache. Here the And/Or/Not filters are more performant than Bool.
In some complex, compound situations, the best performance can be had by combining Bool and And/Or/Not.
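As a rough sketch of one such combination, the snippet below builds a filter body in the filter syntax of the Elasticsearch versions discussed here (later releases dropped the And/Or/Not filters and the `filtered` query, so treat this as era-specific). The field names `status`, `lang`, `flagged` and `location`, the coordinates, and the helper functions are all hypothetical. The ordering, with the bitset-friendly Bool filter first and the geo filter second, is one plausible arrangement rather than a rule from the training.

```python
import json


def term(field, value):
    """A term filter: cheap and bitset-friendly."""
    return {"term": {field: value}}


def bool_filter(must=(), must_not=()):
    """Combine bitset-friendly filters with a Bool filter."""
    clauses = {}
    if must:
        clauses["must"] = list(must)
    if must_not:
        clauses["must_not"] = list(must_not)
    return {"bool": clauses}


def and_filter(*filters):
    """Chain filters with an And filter, evaluated in the order given."""
    return {"and": list(filters)}


# Bitset-friendly part: plain term filters wrapped in Bool.
cheap = bool_filter(
    must=[term("status", "published"), term("lang", "en")],
    must_not=[term("flagged", True)],
)

# Non-bitset part: a geo filter (hypothetical field and coordinates).
geo = {
    "geo_distance": {
        "distance": "10km",
        "location": {"lat": 42.36, "lon": -71.06},
    }
}

# Put the Bool filter first so the geo check sees fewer documents.
combined = and_filter(cheap, geo)

body = {"query": {"filtered": {"query": {"match_all": {}}, "filter": combined}}}
print(json.dumps(body, indent=2))
```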
3. ElasticSearch + SSD = match made in heaven
I think everyone realizes SSDs are pretty fantastic, but also fairly expensive and limited in size. The ES devs, however, illustrated just how fantastic SSDs are when it comes to ElasticSearch.
Basically, SSDs have a limited lifetime because the data blocks can only be erased so many times before failing. To make matters worse, SSDs can only erase large sections of the drive at a time. Random-access writes and deletes drastically shorten the life of an SSD, to around 1-2 years. This is caused by a phenomenon known as “Write Amplification”, which can cause performance problems in addition to reducing lifetime.
In contrast to “normal” usage, ElasticSearch plays very nicely with SSDs because of the underlying data model that Lucene uses. Lucene segments are immutable – they are written in a single block and will never change. Because of this, Lucene has a write amplification of 1 and does not produce much SSD wear.
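Write amplification is usually defined as the bytes physically written to flash divided by the bytes the host asked to write. The toy calculation below illustrates the idea with a deliberately simplified worst-case model. The 4 KiB page and 256 KiB erase block sizes are assumptions chosen for the example, not figures from the training, and real drives behave better than this worst case thanks to their controllers.

```python
def write_amplification(host_bytes, flash_bytes):
    """Ratio of bytes physically written to flash vs. bytes the host wrote."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


KIB = 1024
PAGE = 4 * KIB            # assumed size of one small random write
ERASE_BLOCK = 256 * KIB   # assumed erase block size

# Worst case for an in-place update: changing one page forces the
# drive to rewrite the whole erase block that contains it.
random_update = write_amplification(PAGE, ERASE_BLOCK)

# Append-only write that fills a whole block, like an immutable
# segment written once and never modified.
sequential = write_amplification(ERASE_BLOCK, ERASE_BLOCK)

print(random_update)  # 64.0
print(sequential)     # 1.0
```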
What does this mean in practice? You can exchange your 120 IOPS spinning disks for 20,000-40,000 IOPS SSD drives without fearing an extremely short life. On a per-IOPS basis, SSDs are waaaay cheaper than spinning disks ($0.02 vs $1.25).
4. Design filters that are cacheable
Maybe it’s just me, but I found this tip super cool. The ES devs encouraged us to think about designing filters that are cacheable. For example, if you do a time-based range filter from 12am – Now(), you won’t gain any caching benefit. The filter is invalidated as soon as you build it, since Now() is always changing.
Instead, you should build a 12am – 12am filter, even if the time right now is only 6pm. ElasticSearch will happily build the filter bitset in memory up to the latest available time (6pm), and continually add new segments to the bitset as they become available. In this way, you’ve created a single filter bitset that is applicable to many different circumstances.
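A small sketch of the idea: round the range to day boundaries on the client so that every request made during the same day sends an identical filter. The field name `timestamp` and the helper `day_range_filter` are hypothetical, and the dates are arbitrary examples.

```python
from datetime import datetime, timedelta


def day_range_filter(field, now):
    """Build a midnight-to-midnight range filter for the day containing `now`.

    The bounds do not depend on the time of day, so the filter body is
    identical for every request made on the same date.
    """
    start = now.replace(hour=0, minute=0, second=0, microsecond=0)
    end = start + timedelta(days=1)
    return {
        "range": {
            field: {"gte": start.isoformat(), "lt": end.isoformat()}
        }
    }


# Two requests on the same (arbitrary) day, hours apart.
morning = day_range_filter("timestamp", datetime(2013, 5, 1, 9, 15))
evening = day_range_filter("timestamp", datetime(2013, 5, 1, 18, 0))

print(morning)
print(morning == evening)  # True
```

Because the two bodies are equal, the second request can reuse whatever was cached for the first. A filter ending at the current second would differ on every call.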
5. Scale out, not up
So this tip wasn’t explicitly stated, but it was a general feeling that I took away from the training. I’ve personally played with benchmarks and fiddled with the various knobs that ElasticSearch provides, hoping to tune individual machines for better performance.
After the training, I feel this is ultimately not a worthy use of time. ElasticSearch has extremely sane defaults which are hard to beat. There are a few knobs you can twiddle (segments in a merge tier, or refresh interval, for example) but on the whole ES runs just about as well as it can.
Instead of mucking with settings and potentially making ElasticSearch less performant, your time is probably better spent optimizing your data (routing, index management, analyzers and queries) and simply provisioning a new node when your cluster starts to struggle. This will boost your performance far more than trying to determine the optimal Lucene segment size.
Conclusion – Go Take The Class
I took about 40 pages of notes at the ES seminar. If you are trying to learn ElasticSearch, or improve your current skills, the training is easily the best time:knowledge ratio available. Having devs available to answer questions was great, and listening to the particular problems of other trainees was equally useful. It seems everyone is using ElasticSearch for a different purpose, which makes for some great learning questions.
Great summary of points.
I think the query_string query is a double-edged sword. It’s true that it can cause a lot of exceptions when users make mistakes, but in practice I have found it more helpful than not, particularly for internal search tools where users want to build some pretty complex queries.
Mostly it comes down to UI. Providing Lucene syntax is a simple way to expose access to different document fields, and some sophisticated users want to construct complex queries. But the syntax is still pretty obscure, so I don’t think as many users think to use those fields as would benefit from them.
This UI problem is what started me down the path of building es-backbone. Faceted search combined with easy prompts for filtering different fields seems like a good way to provide the power of Lucene query syntax in a safer way.
Still a work in progress though. I shudder when I look at that design.
Very good point about UI, and ultimately, intended audience of the search. If you are targeting your own developers or other people within your own organization, it may make sense to utilize the power of a query_string. Especially for situations like internal dashboards where quick, concise querying is important.
For a public-facing searchbar, on the other hand, that is probably overkill and dangerous.
ES-backbone looks interesting…I’ll definitely take a look this weekend and play around. Thanks for stopping by to comment, I really enjoyed your post on the ES training!
Actually, I think query_string is very useful even in situations where you have to expose its full power to uneducated (or overeducated) users. In this case the real ElasticSearch query_string should be created on the backend after pre-processing the users’ queries with custom middleware, which wraps and filters them before execution.