Profiling Network Performance for Multi-Tier Data Center Applications
- M. Yu
- A. Greenberg
- D. Maltz
- J. Rexford
- L. Yuan
- S. Kandula
- C. Kim
NSDI
Network performance problems are notoriously tricky to diagnose, and the difficulty is magnified when applications are split into multiple tiers of components spread across thousands of servers in a data center. Problems often arise in the communication between the tiers, where either the application or the network (or both!) could be to blame. In this paper, we present SNAP, a scalable network-application profiler that guides developers in identifying and fixing performance problems. SNAP passively collects TCP statistics and socket-call logs with low computation and storage overhead, and correlates across shared resources (e.g., host, link, switch) and connections to pinpoint the location of the problem (e.g., send buffer mismanagement, TCP/application conflicts, application-generated microbursts, or network congestion). Our one-week deployment of SNAP in a production data center (with over 8,000 servers and over 700 application components) has already helped developers uncover 15 major performance problems in application software, the network stack on the server, and the underlying network.
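The idea of correlating connections across a shared resource can be illustrated with a minimal sketch. This is not SNAP's actual implementation: the function name `suspect_resources`, the `(conn_id, resource, has_problem)` record layout, and the `min_conns` and `threshold` parameters are all hypothetical, chosen only to show how a problem seen on many connections that share one host, link, or switch points at that resource rather than at any single connection.

```python
from collections import defaultdict


def suspect_resources(observations, min_conns=2, threshold=0.5):
    """Return resources where a large fraction of connections show a problem.

    observations: iterable of (conn_id, resource, has_problem) tuples.
    This record layout is an assumption made for illustration only.
    """
    total = defaultdict(set)    # resource -> all connections seen on it
    flagged = defaultdict(set)  # resource -> connections showing the problem

    for conn_id, resource, has_problem in observations:
        total[resource].add(conn_id)
        if has_problem:
            flagged[resource].add(conn_id)

    suspects = {}
    for resource, conns in total.items():
        # Too few connections on a resource gives no basis for correlation.
        if len(conns) < min_conns:
            continue
        fraction = len(flagged[resource]) / len(conns)
        if fraction >= threshold:
            suspects[resource] = fraction
    return suspects


# Hypothetical data: most connections on host-A are flagged, few on host-B.
observations = [
    ("c1", "host-A", True),
    ("c2", "host-A", True),
    ("c3", "host-A", False),
    ("c4", "host-B", False),
    ("c5", "host-B", True),
    ("c6", "host-B", False),
    ("c7", "host-B", False),
]
result = suspect_resources(observations)
```

Here `result` contains only `host-A`, since two of its three connections are flagged, while `host-B` has one flagged connection out of four and falls below the assumed threshold.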