Apache Solr, an open-source search platform built on Lucene, has become a cornerstone for scalable search and analytics applications. As organizations scale their data infrastructure, performance engineering and scalability optimization become critical. This article explores a real-world implementation of Solr, focusing on performance tuning, scalability challenges, and advanced architectural strategies to handle extreme classification workloads. The case study highlights how Solr’s flexibility and extensibility enable robust solutions for high-traffic environments, such as managing millions of restaurant entries and multilingual queries.
Apache Solr is a distributed search engine that provides full-text search, faceting, and near-real-time indexing. In SolrCloud mode it leverages Apache ZooKeeper for distributed coordination, enabling horizontal scaling and fault tolerance.
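These capabilities are exposed over a simple HTTP API. As a minimal sketch, the snippet below builds a faceted full-text query against a collection's `/select` endpoint; the host, collection name (`restaurants`), and field names (`name`, `city`) are illustrative assumptions, not taken from the case study.

```python
from urllib.parse import urlencode

# Hypothetical query: full-text match on a "name" field, with facet
# counts grouped by a "city" field (both fields are assumptions).
params = {
    "q": "name:pizza",      # full-text query clause
    "facet": "true",        # enable faceting
    "facet.field": "city",  # field to compute facet counts on
    "rows": 10,             # page size
}
url = "http://localhost:8983/solr/restaurants/select?" + urlencode(params)
print(url)
```

Sending a GET request to this URL against a running Solr node would return matching documents plus per-city facet counts in a single response.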
The system manages approximately 500,000 restaurants, 50,000 brands, and 40,000 dishes, with peak traffic of 3,000 requests per second (RPS) during lunch and dinner hours. The initial architecture employed a master-slave replication model, in which the master node handled updates while slave nodes served read queries. To reduce costs, AWS Spot Instances were utilized, achieving an 80% cost saving. However, this introduced network load spikes whenever Spot Instances were terminated.
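The cited 80% saving can be sanity-checked with back-of-the-envelope arithmetic. The prices and fleet size below are hypothetical placeholders chosen to match the discount figure, not the case study's actual costs.

```python
# Hypothetical prices: an On-Demand node at $1.00/hour and an
# equivalent Spot node at $0.20/hour (an 80% discount).
ON_DEMAND_HOURLY = 1.00   # assumed On-Demand price, $/hour
SPOT_HOURLY = 0.20        # assumed Spot price, $/hour
NODES = 10                # assumed fleet size
HOURS_PER_MONTH = 730

on_demand_cost = NODES * HOURS_PER_MONTH * ON_DEMAND_HOURLY
spot_cost = NODES * HOURS_PER_MONTH * SPOT_HOURLY
saving = 1 - spot_cost / on_demand_cost
print(f"monthly saving: {saving:.0%}")  # → monthly saving: 80%
```

The trade-off, as the outage described later shows, is that Spot capacity can be reclaimed at any time, so the saving must be weighed against re-replication traffic and availability risk.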
To address these issues, the following optimizations were implemented:
JVM configuration was optimized to balance startup time and cost efficiency, with GraalVM selected for its performance benefits. Additionally, HTTP connection pooling was fine-tuned to handle high-traffic scenarios, improving request throughput.
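As an illustrative sketch, Solr's JVM can be configured through its `solr.in.sh` include file; the specific JDK path, heap size, and GC flags below are assumptions for demonstration, not the case study's exact settings.

```shell
# Fragment of solr.in.sh (values are illustrative assumptions)
SOLR_JAVA_HOME=/opt/graalvm          # run Solr on a GraalVM JDK
SOLR_HEAP="8g"                       # fixed heap avoids resize pauses
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"
```

Inter-node HTTP connection pooling can likewise be tuned in `solrconfig.xml` via the `HttpShardHandlerFactory` settings (e.g., maximum connections per host), which govern how many pooled connections a node uses when fanning queries out to other replicas.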
Dynamic fields were enhanced by enabling docValues to facilitate facet queries, significantly boosting search performance for complex classification tasks.
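In a Solr managed-schema, docValues are enabled per field or dynamic-field pattern. A hedged example, with an assumed `*_facet` naming pattern rather than the case study's actual schema:

```xml
<!-- Dynamic field pattern with docValues enabled: facet queries read a
     column-oriented on-disk structure instead of un-inverting the
     index at query time. Name and type are illustrative. -->
<dynamicField name="*_facet" type="string" indexed="true"
              stored="false" docValues="true" multiValued="true"/>
```

Because docValues are built at index time, adding them to an existing field generally requires reindexing the affected documents.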
Initial node startup times were excessive due to a 50GB index size. To resolve this, the single cluster was split into separate per-language clusters.
This reduced index size by 66%, enabling faster node initialization and improved scalability.
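The 66% figure is consistent with a roughly even three-way split by language, which is an assumption here; the article does not list the individual clusters. The arithmetic:

```python
# Assumption: the 50 GB index splits roughly evenly across three
# per-language clusters (consistent with the ~66% reduction cited).
TOTAL_INDEX_GB = 50
NUM_LANGUAGE_CLUSTERS = 3  # assumed even three-way split

per_cluster_gb = TOTAL_INDEX_GB / NUM_LANGUAGE_CLUSTERS
reduction = 1 - per_cluster_gb / TOTAL_INDEX_GB
print(f"per-cluster index: {per_cluster_gb:.1f} GB "
      f"({reduction:.0%} smaller)")
```

A ~17GB index per cluster means far less data to load and warm on startup, which is what drives the faster node initialization.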
A critical failure occurred when the master node, despite running on an On-Demand instance, was unexpectedly decommissioned, causing a service outage. To prevent a recurrence, the following measures were adopted:
A hybrid approach combining TLOG replicas (which maintain the transaction log and are eligible for leader election) and PULL replicas (which only replicate the index for read operations) was introduced. This minimized leader election conflicts and ensured continuous read availability during leader failures. The cluster was configured with 33% On-Demand and 67% Spot instances, balancing cost and reliability.
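Replica types are set when a collection is created through the Collections API (Solr 7+). The sketch below builds such a CREATE request; the host, collection name, and shard count are illustrative, while one TLOG plus two PULL replicas per shard mirrors the 33% / 67% split described above.

```python
from urllib.parse import urlencode

# Hypothetical collection layout: 1 TLOG replica (leader-eligible,
# placed on On-Demand nodes) and 2 PULL replicas (read-only, safe on
# Spot nodes) per shard.
params = {
    "action": "CREATE",
    "name": "restaurants_en",  # assumed per-language collection name
    "numShards": 2,            # assumed shard count
    "nrtReplicas": 0,          # no NRT replicas in this layout
    "tlogReplicas": 1,         # leader-eligible, On-Demand
    "pullReplicas": 2,         # read-only, Spot
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

With this layout, losing a Spot-hosted PULL replica never triggers a leader election, and reads continue from the surviving replicas while a replacement re-pulls the index.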
By leveraging Solr’s distributed architecture and performance engineering best practices, organizations can achieve robust, scalable search solutions for extreme classification workloads. The case study underscores the importance of balancing cost, performance, and reliability in large-scale search infrastructure.