Our deployment of cache-aware load balancing in the Google web search backend reduced cache misses by ~0.5x, contributing to a double-digit percentage increase in the throughput of our serving clusters by relieving a bottleneck. This innovation has benefited all production workloads since 2015, serving billions of queries daily.

A load balancer forwards each query to one of several identical serving replicas. The replica pulls each term's postings list into RAM from flash, either locally or over the network. Flash bandwidth is a critical bottleneck, motivating an application-directed RAM cache on each replica. Sending the same term reliably to the same replica would increase the chance it hits cache, and avoid polluting the other replicas' caches. However, most queries contain multiple terms and we have to send the whole query to one replica, so it is not possible to achieve a perfect partitioning of terms to replicas. We solve this via a voting scheme, whereby the load balancer conducts a weighted vote by the terms in each query, and sends the query to the winning replica. We develop a multi-stage scalable algorithm to learn these weights. We first construct a large-scale term-query graph from logs and apply a distributed balanced graph partitioning algorithm to cluster each term to a preferred replica. This yields a good but simplistic initial voting table, which we then iteratively refine via cache simulation to capture feedback effects.
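The weighted voting scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the production implementation: the `VOTE_TABLE` weights, the replica count, and the CRC32 fallback for unseen terms are hypothetical stand-ins for the learned voting table.

```python
import zlib

NUM_REPLICAS = 3  # hypothetical cluster size

# Hypothetical voting table: term -> per-replica vote weights.
# In the system described, such weights are learned offline via balanced
# graph partitioning of a term-query log graph and refined by cache
# simulation; the numbers below are invented for illustration.
VOTE_TABLE = {
    "apple":  [0.90, 0.05, 0.05],
    "banana": [0.10, 0.80, 0.10],
    "cherry": [0.10, 0.10, 0.80],
}

def route(query_terms, table=VOTE_TABLE, num_replicas=NUM_REPLICAS):
    """Weighted vote: each term votes for replicas; the whole query is
    sent to the replica with the highest total."""
    totals = [0.0] * num_replicas
    for term in query_terms:
        weights = table.get(term)
        if weights is None:
            # Unseen term: deterministic hash fallback so repeats of the
            # same term still reach the same replica (an assumption here,
            # not a detail from the abstract).
            totals[zlib.crc32(term.encode()) % num_replicas] += 0.5
        else:
            for replica, w in enumerate(weights):
                totals[replica] += w
    # max() returns the first maximum, so ties break to the lowest index.
    return max(range(num_replicas), key=lambda r: totals[r])
```

For example, `route(["apple", "banana"])` tallies (1.0, 0.85, 0.15) and sends the query to replica 0: the query follows the replica its terms collectively prefer, keeping each term's postings list warm in one cache rather than polluting all of them.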