To keep up with the continuous growth in demand, cloud providers spend millions of dollars augmenting the capacity of their wide-area backbones and devote significant effort to efficiently utilizing WAN capacity. A key challenge is striking a good balance between network utilization and availability, as these are inherently at odds; a highly utilized network might not be able to withstand unexpected traffic shifts resulting from link/node failures. We advocate a novel approach to this challenge that draws inspiration from financial risk theory: leverage empirical data to generate a probabilistic model of network failures and maximize bandwidth allocation to network users subject to an operator-specified availability target (e.g., 99.9% availability). Our approach enables network operators to strike the utilization-availability balance that best suits their goals and operational reality. We present TEAVAR (Traffic Engineering Applying Value at Risk), a system that realizes this risk management approach to traffic engineering (TE). We compare TEAVAR to state-of-the-art TE solutions through extensive simulations across many network topologies, failure scenarios, and traffic patterns, including real-world data about failures and traffic from a large service provider. Our results show that with TEAVAR, operators can support up to twice as much throughput as state-of-the-art TE schemes, at the same level of availability.
TeaVaR: Striking the Right Utilization-Availability Balance in WAN Traffic Engineering
J. Bogle, N. Bhatia, M. Ghobadi, I. Menache, N. Bjorner, A. Valadarsky, M. Schapira ACM SIGCOMM'19 (to appear)
[paper] [slides] [poster] [video] [code] [demo]