Apache Solr and Elasticsearch are the most popular open source search engines. Both are built on top of the Apache Lucene text search library, and most of their functionality is similar. However, they differ significantly in ease of deployment and ease of scaling. Elasticsearch’s popularity has risen significantly in recent years, as is evident from the following Google Trends comparison of the two terms.
Search at Tokopedia
Tokopedia has been growing fast and with growth come challenges. We see millions of new products added to the platform every month by sellers and millions of searches happening every day.
For almost five years, Tokopedia used Apache Solr as its search platform, and it served our needs well. We store product data, including attributes, in Solr to speed up the search experience. Our Solr setup consisted of a single Apache Solr core with one shard. This worked well while our data was small, but as the data grew bigger, changed faster and became more complex, the setup started to show its age.
Ideally every search request should be served in under 1 second, but latency started to spike to as much as 5 minutes for a single search request. After a lot of debugging and head-scratching, we finally found the culprit. Our search infrastructure relies on an indexer to keep the search index up to date: whenever a product is added, deleted or changed, the indexer pushes that change to Solr. Solr’s ingestion rate had never been a problem, until one day every Solr commit started to eat all the CPU resources. To make matters worse, all query caches were cleared every time Solr committed data to the index.
It turned out that our data had become too big for our infrastructure at the time. As a temporary measure, we made commits less frequent. This reduced the number of spikes, but we knew it was not a proper solution. We realized that we could no longer use a one-machine, one-shard setup: each commit hit disk I/O really hard and kept bottlenecking the CPU. With a multi-shard, multi-machine cluster, not only commits but also search operations would be divided across nodes.
Why Elasticsearch?
We had two options in mind: building a Solr cluster using SolrCloud, or switching to Elasticsearch, which was gaining really good traction in the community. Based on our past experience and further research, we decided to pick Elasticsearch. It comes as a very solid bundle that is easy to deploy, maintain and scale, and compared to SolrCloud it would let us do all of this in a more frugal way.
More concretely, we decided to move to Elasticsearch for two reasons:
- Elasticsearch is easy to shard. You only need to set the number of shards in the configuration and it will automatically shard your indices.
- Elasticsearch has something similar to “Commit” called “Refresh”. A refresh is much faster and less expensive than a commit.
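As a rough illustration of the two points above, the sketch below builds the settings body one might send when creating an index. The setting names `number_of_shards`, `number_of_replicas` and `refresh_interval` are standard Elasticsearch index settings; the helper name and the values (5 shards, 1 replica, a 30 second refresh) are assumptions made for the example, not our production configuration.

```python
import json


def build_index_settings(num_shards, num_replicas, refresh_interval):
    """Build a request body for creating an index with explicit sharding
    and refresh settings. Hypothetical helper, for illustration only."""
    if num_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return {
        "settings": {
            "index": {
                # How many primary shards the index is split into.
                "number_of_shards": num_shards,
                # How many copies of each primary shard are kept.
                "number_of_replicas": num_replicas,
                # How often newly indexed documents become searchable.
                "refresh_interval": refresh_interval,
            }
        }
    }


# Example values are assumptions, not recommendations.
settings_body = build_index_settings(5, 1, "30s")
print(json.dumps(settings_body, indent=2))
```

A longer refresh interval trades search freshness for less refresh work, which can be acceptable when search does not need to be real-time.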
When we started the move, we didn’t know much about Elasticsearch, and during the migration we ran into several issues (performance, optimization, etc.). Here are some of the lessons we learned:
Initially, we deployed machines with a lot of memory because we assumed that Apache Solr and Elasticsearch would have similar machine requirements. They do not. In our experience, Elasticsearch needs more compute units and faster disks.
After we switched to nodes with more compute units and SSDs, performance and stability improved drastically.
doc_values vs fielddata
Fielddata is used mainly when sorting on a field or computing aggregations on a field. It loads all the field values into memory in order to provide fast document-based access to those values.
Doc values are an on-disk data structure, built at document index time, that makes this data access pattern possible. They store the same values as the source, but in a column-oriented fashion that is far more efficient for sorting and aggregations.
On a huge index, doc_values work much better than fielddata for sorting and aggregation.
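To make the distinction concrete, here is a minimal sketch of a mapping and a helper that lists which fields could sort or aggregate from doc values. It assumes a recent Elasticsearch version with `text` and `keyword` types (older versions use `string` with `not_analyzed` instead). The field names and the helper are hypothetical, and the rule in the helper is a simplification: analyzed text fields do not support doc values, while other field types have them unless the mapping disables them.

```python
# Hypothetical product mapping; field names are illustrative only.
product_mapping = {
    "properties": {
        "name": {"type": "text"},          # analyzed full-text field
        "sort_name": {"type": "keyword"},  # doc values enabled by default
        "price": {"type": "long"},         # doc values enabled by default
        # Doc values explicitly switched off for this field.
        "internal_note": {"type": "keyword", "doc_values": False},
    }
}


def fields_with_doc_values(mapping):
    """Return the names of fields that can be sorted or aggregated
    using on-disk doc values (simplified rule, see the text above)."""
    result = []
    for name, spec in mapping["properties"].items():
        is_text = spec.get("type") == "text"
        enabled = spec.get("doc_values", True)
        if not is_text and enabled:
            result.append(name)
    return sorted(result)


print(fields_with_doc_values(product_mapping))
```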
We tried to design an ideal custom mapping for our data, but it ended up causing performance issues. We used a nested mapping, and nested mappings make data retrieval much slower. Don’t use nested mappings unless you have to.
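One common way to avoid a nested mapping is to flatten nested attributes into top-level fields in the indexer before the document is sent to Elasticsearch. The sketch below shows that general idea only; the document shape, the `attr_` prefix and the function name are assumptions for illustration, not a description of our actual schema. Keep in mind that flattening is only safe when you do not need to match several properties of the same nested object together, and that many distinct attribute names lead to many fields in the mapping.

```python
def flatten_attributes(product, prefix="attr_"):
    """Turn a list of {"name": ..., "value": ...} attributes into
    flat top-level fields. Hypothetical helper for illustration."""
    # Copy every field except the nested attribute list.
    flat = {key: value for key, value in product.items() if key != "attributes"}
    for attr in product.get("attributes", []):
        # Normalize the attribute name into a field-name-friendly key.
        key = prefix + attr["name"].strip().lower().replace(" ", "_")
        flat[key] = attr["value"]
    return flat


flat_doc = flatten_attributes({
    "id": 1,
    "name": "Mechanical keyboard",
    "attributes": [
        {"name": "Color", "value": "black"},
        {"name": "Switch Type", "value": "blue"},
    ],
})
print(flat_doc)
```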
Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string into terms wherever it encounters whitespace or punctuation. Elasticsearch ships with a lot of tokenizers, and when we need a custom one we can define it.
However, using a custom tokenizer on a field disables doc_values for it. If you need to sort or aggregate on that field, consider doing the custom tokenization in your indexer instead.
For example, we use custom alphanumeric tokenizing, which converts the keyword “keyboard1” to “keyboard0001” so that alphanumeric sorting works. We moved this tokenizing step to the indexer, so we can use the standard tokenizer and doc_values.
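The padding step described above can be sketched in a few lines. This is an illustrative version, not our indexer code: the function name and the fixed width of 4 digits are assumptions (the width is inferred from the “keyboard0001” example).

```python
import re


def pad_numbers(text, width=4):
    """Left-pad every run of digits with zeros so that plain string
    sorting orders the values numerically. Runs already longer than
    `width` are left unchanged."""
    return re.sub(r"\d+", lambda match: match.group(0).zfill(width), text)


names = ["keyboard10", "keyboard2", "keyboard1"]

# Plain string sort puts "keyboard10" before "keyboard2".
plain_order = sorted(names)
# Sorting by the padded form gives the order a person would expect.
padded_order = sorted(names, key=pad_numbers)

print(pad_numbers("keyboard1"))
print(plain_order)
print(padded_order)
```

In the indexer, the padded string would be written to a separate sort field, so the field itself needs no custom analysis and can keep doc_values.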
Enable every cache that is available and make sure it is actually being used. Elasticsearch’s caching is less aggressive than Solr’s. If you don’t have a real-time requirement, adding a cache on top of Elasticsearch will boost performance. For example, we use nginx caching in front of Elasticsearch.
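We do this with nginx, but the underlying idea is a time-limited cache keyed by the search request. The sketch below shows that idea as a small in-process cache; it is a stand-in for illustration, not nginx configuration, and the class name, the TTL value and the `run_search` function are all hypothetical.

```python
import time


class TTLCache:
    """Cache results for a fixed number of seconds per key."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache is easy to test
        self._store = {}    # key -> (time stored, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        # Serve the cached value while it is still fresh.
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        # Otherwise run the real query and remember the result.
        value = compute()
        self._store[key] = (now, value)
        return value


backend_calls = []


def run_search():
    # Placeholder for a real search request to the cluster.
    backend_calls.append("keyboard")
    return ["product-1", "product-2"]


fake_now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: fake_now[0])

first = cache.get_or_compute("q=keyboard", run_search)
second = cache.get_or_compute("q=keyboard", run_search)  # served from cache
fake_now[0] = 11.0                                       # entry has expired
third = cache.get_or_compute("q=keyboard", run_search)   # hits the backend again
print(len(backend_calls))
```

The TTL bounds how stale a cached result can be, which is why this only fits searches without a real-time requirement.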
Moving from Solr to Elasticsearch showed major improvements for us in terms of stability, latency and scale:
- More frequent updates: With better stability and the dreaded CPU spikes gone, our index is updated more frequently.
- 10x faster queries: Peak-time latencies dropped from 3s to 300ms.
- Easy sharding and easy scaling: During experimentation we had to play around with the nodes in the cluster. We found scaling up and reindexing to be much “easier”.
The following shows the latency improvement when we moved from Apache Solr to Elasticsearch:
The following shows the stability within Elasticsearch as we tuned it based on the lessons above (e.g. solving CPU spikes caused by high I/O by using faster disks):
During our migration from Solr to Elasticsearch we made many tweaks and optimizations, as advised by veterans in numerous articles and mailing lists. The performance and stability improvements we observed are therefore partly a result of those changes, and not only of the move to Elasticsearch itself. We do feel, though, that Elasticsearch made experimentation easier than Solr did!