Optimizing Elasticsearch for reads

Shubham Aggarwal
6 min read · Oct 30, 2020

This blog post is an attempt to jot down the scaling challenges we faced during the recent traffic surge and how we tried to solve them. You can skip straight to the tech part of the blog, but hey, what fun is a climax without the story, right?

Prelims

COVID-19 times have been a roller coaster ride for everyone. Take a look below for instance.

Nifty50 Index

What the graph above conveys is that while this pandemic wreaked havoc, it also compelled the world to think of novel ways of doing things (think remote-everything: school classes, shopping for essentials, and so on), and as a result e-platforms were an instant hit. Many tech companies went from operating at 5% of their usual traffic to 200% in just a few weeks.

Characters

In this post we will be discussing one of my services: Ratings/Reviews. The very nature of the service makes it one of the high-throughput ones, since nobody wants to buy a product online without looking at its ratings and what other customers have to say.

Tech stack: Elasticsearch, Redis, Kafka, MySQL, HBase

The Story

A few weeks after scaling down our whole infrastructure just enough to serve lockdown traffic, we had a major feature release. Fast forward a few days and we started getting alerts on our monitoring channels. We decided to add a few machines to handle the increased traffic, but that didn't help; the alarms kept ringing until we finally decided to upscale the whole infra across all services to pre-COVID levels to handle the surge.

After a few breathing days, our monitoring channels were once again flooded, with each day forming a new peak on the traffic charts. Everyone was in awe.

Traffic pattern

The Challenge

In the case of the review service, its Elasticsearch cluster was overwhelmed and couldn't cope with the load. So either we had to scale up the cluster, or we could try optimizing it. After a brief discussion with the AWS team, we got a hint that the review ES cluster was not fully utilized and could be optimized.

Initial Elasticsearch cluster setup

  • 6 data nodes (c5.2xlarge), 3 master nodes (c5.xlarge)
  • 5P 1R main index + 1P 1R .kibana (P: primary, R: replica)
  • Index size ~12 GB, ~12M documents
  • Shard size ~1.2 GB
  • No routing

We were observing high CPU and queuing in the search thread pool on the cluster. Although the ES search latency metric was under 5ms, the review service was becoming latent, and the two did not correlate at all. Eventually we realized that the AWS search latency metric doesn't account for how long requests wait in the queue before even reaching a shard.
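One way to see this queuing directly, outside the CloudWatch metrics, is the thread pool cat API; a non-zero queue or rejected count for the search pool is a sign that requests are piling up before they ever hit a shard:

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed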

ES CPU

The Struggle

The first thing we iterated on was the shard strategy. Our index was small, and the AWS folks also recommended keeping shards around ~15 GB, so our 5P 1R setup didn't seem to make sense. So we tried just the opposite: 1P 5R.

This way we would still keep the shard size under the limit, and a search request would only go to one shard, saving on query_then_fetch computations.
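For reference, this is roughly what such an index definition looks like (the index name reviews is illustrative). Note that number_of_replicas can be changed on a live index, while number_of_shards cannot be changed without a reindex or shrink, which is why this needed a fresh index:

PUT reviews
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 5
    }
  }
}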

Perf environment all set up, and I was ready to test the theory. As expected, the perf results were actually much better: ES CPU was down, there was no queuing in the search pool, and the overall latency of the review service improved significantly. Cool, all set, let's deploy to production. After deploying, as I monitored the metrics:

  • ES CPU low, check
  • No queuing in search pool, check
  • ES search latency, increased by 2X 😕
  • Review service latency increased by 2X

Initially I thought maybe this had something to do with index warmup, so I waited for some time hoping it would recover. And then I got a message from one of my colleagues that it was impacting the overall latency of the platform. Within a minute a team of 3–4 people was on a call, trying to figure out and correlate things. We finally had to roll back that night.

The Struggle Continues

After some discussion on why the solution didn't work in prod, we could find one potential reason: the perf environment was not exactly the same as prod. This seemed like a valid point, as write load was not being simulated in perf. Although the write load on the cluster is quite low, ~500 RPM on average, we still had to validate that.

Once again, with the whole infra set up, I started evaluating shard strategies. This time not just 1P 5R; I tried all the promising permutations and combinations. Again the results were the same: 1P 5R was the best-performing strategy among them all, with the same metrics as before, even with write load running in parallel.

With a little bit of fear this time, we deployed to production, and the fear came true. The same pattern as before appeared after deploying to production. Result: rollback. I just couldn't comprehend what was wrong this time.

I started to dig deeper for clues. I wanted to know what was actually happening inside the cluster. Elasticsearch provides various APIs to get cluster metrics, and one API stood out.

GET _nodes/hot_threads

This API returns the most time-consuming threads on each node, with a breakdown of tasks and the time/CPU consumed. For example:

100.5% (502.3ms out of 500ms) cpu usage by thread 'elasticsearch[bh0TRel][warmer][T#4]'
7/10 snapshots sharing following 19 elements
org.apache.lucene.util.PriorityQueue.updateTop(PriorityQueue.java:211)
org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:261)
org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)
org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:147)
org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build(GlobalOrdinalsBuilder.java:65)

After trying it a few times and analyzing the output, I noticed that most of the CPU-consuming threads had the term OrdinalMap in their stacks. In the search threads, too, this was the major time-consuming component. I didn't have any clue about ordinal maps, but it was clearly hampering the performance of our queries. After a bit of googling around, I ended up here.

In simple terms, global ordinals are like a map which ES maintains at the shard level to speed up terms-aggregation-type queries. The map is invalidated on every refresh, and with lazy loading (the default) it is rebuilt by the first aggregation that needs it, making that request slow. This was worth trying, because in our case most of the queries are terms aggregations.
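For context, the kind of query that benefits looks roughly like this (index and field names are illustrative); it is exactly this sort of terms aggregation on a keyword field that global ordinals accelerate:

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "rating_counts": {
      "terms": { "field": "rating" }
    }
  }
}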

Show time!

This time we set up a new ES cluster, just to rule out any lingering issues with the current one. We modified the index mapping to enable eager loading of global ordinals, increased the refresh interval from 1s (the default) to 1 minute, and re-indexed the documents into the new mapping. (Note: eager global ordinals move the ordinals-building cost to refresh/indexing time, so re-indexing gets slower; use this cautiously and only if it fits your use case well.)
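A sketch of what that change looks like, assuming a 7.x-style mapping and illustrative index/field names (reviews, reviews_v2, product_id); the fields to mark eager are the keyword fields you actually aggregate on:

# New index: relaxed refresh interval, eager global ordinals on the aggregated field
PUT reviews_v2
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 5,
      "refresh_interval": "1m"
    }
  },
  "mappings": {
    "properties": {
      "product_id": {
        "type": "keyword",
        "eager_global_ordinals": true
      }
    }
  }
}

# Copy the documents into the new index
POST _reindex
{
  "source": { "index": "reviews" },
  "dest": { "index": "reviews_v2" }
}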

With everything in place, we started routing a percentage of the traffic to the new cluster. After some time, when things seemed stable and all metrics looked fine, we routed the full traffic. Below are some metrics with the new cluster.

  • ES CPU reduced from 55% -> 15% (~30% at 2x load)
  • Review service 99.5th percentile latency reduced from 200ms -> 40ms
  • Able to handle 2x the load with still lower ES CPU

Hope this is helpful and saves someone some time. Please share feedback and feel free to reach out.
