Guest post on the The ICDCS 2017 Best Paper
By: Amy Babay (Johns Hopkins University; USA), Emily Wagner (Johns Hopkins University, LTN Global Communications; USA), Michael Dinitz (Johns Hopkins University; USA), Yair Amir (Johns Hopkins University, LTN Global Communications; USA)
The Internet natively supports end-to-end reliable communication (e.g. using TCP) or best-effort timely communication (e.g. using UDP). However, emerging highly interactive applications such as remote manipulation and remote robotic surgery bring severe constraints on timeliness, while still requiring high reliability. The timeliness constraints come from the fact that human perception requires feedback to be received within about 130ms to be perceived as natural. This 130ms includes both the time for the command to reach the destination and for the feedback to be returned, which translates to a latency requirement of 65ms each way. This makes supporting such applications on a continent-wide scale demanding: the network propagation delay across North America, for example, is about 35-40ms.
We develop an overlay transport service to support these applications over the Internet. In recent years, overlay network architectures have been deployed to support applications that require both timeliness and reliability, such as VoIP and TV. A live TV service, supporting interviews from remote studios, requires a one-way latency bound of about 200ms, with 99.999% of packets delivered on time. A global overlay network with about 20 well-situated overlay nodes can support such a service by using the 160-165ms available after accounting for a 35-40ms propagation delay to allow some buffering and hop-by-hop recovery on overlay links. Such a service is available today, and we are involved with LTN Global Communications, a commercial service provider that uses this approach to support the TV and media industries (www.ltnglobal.com).
However, for the demanding applications we target, there is almost no flexibility to allow for recovery or buffering. Redundant dissemination, or sending multiple copies of each packet to improve the probability that at least one copy reaches the destination on time, is a useful building block, but it increases the cost of the transport service. Since ISPs charge for each packet sent, the cost of a redundant dissemination scheme corresponds to the number of overlay links a packet is sent on.
One approach to employing redundant dissemination is to send on multiple disjoint paths. Disjoint paths offer a coarse-grained tradeoff between cost and reliability: for example, sending on two disjoint paths costs slightly more than twice the cost of the single best path and allows a packet to reach its destination as long as it is successfully transmitted on one of the two paths. However, this approach uniformly invests resources along the paths from a source to a destination, while investing fewer resources in more reliable parts of the network and more resources in less reliable parts of the network can improve the cost-reliability tradeoff.
We present a new approach that transports packets in a timely, reliable, and cost-effective manner by constructing dissemination graphs based on network characteristics, application latency and reliability requirements, and cost. A dissemination graph is a connected subgraph of the overlay network topology that connects a source and destination. Each packet from the source to the destination is sent over all the links in the dissemination graph.
Ideally, we would calculate the cheapest dissemination graph that meets an application’s reliability and latency constraints. However, the problem of constructing such a dissemination graph is NP-hard. Instead, we analyzed real-world network data and examined the types of problems that occur in the field. A key finding of this analysis was that a routing approach using two disjoint paths performs well in most cases, and that cases where two disjoint paths do not perform well typically involve problems around a source or destination.
Based on the types of problems we observed, we developed a timely dissemination-graph-based routing method that precomputes four graphs per flow and switches between them as network conditions change. Specifically, the approach uses two disjoint paths under normal conditions, and switches to use a dissemination graph that adds targeted redundancy around a source or destination (or in rare cases both) when problems are detected in that region. We show that this approach can cover over 99% of the performance gap between a traditional single-path approach and an optimal (but prohibitively expensive) scheme, compared with about 70% for two dynamic disjoint paths or about 45% for two static disjoint paths. This performance improvement is obtained at a cost increase of about 2% over two disjoint paths.
We have implemented dissemination-graph-based routing in the Spines open-source overlay messaging framework (www.spines.org) to create a complete dissemination-graph-based transport service over the Internet. The full paper can be found here: www.dsn.jhu.edu/publications.html
Amy is a Ph.D. student in Computer Science at Johns Hopkins University, where she is a member of the Distributed Systems and Networks lab. Prior to starting her Ph.D., she worked at LTN Global Communications, after receiving her M.S.E. in Computer Science and B.A. in Cognitive Science from Johns Hopkins University. Her research focuses on using structured overlay networks to support a new generation of Internet services and building intrusion-tolerant critical infrastructure systems. She is a co-creator of the Spire intrusion-tolerant SCADA system for the power grid (www.dsn.jhu.edu).
Emily is a software engineer at LTN Global Communications. She obtained her BS with honors from Johns Hopkins University in 2015 with a major in physics and a minor in mathematics. She received her masters in Computer Science from Johns Hopkins in 2016. At LTN, Emily works on high-performance networking applications for LTN’s commercial overlay network.
Michael is an Assistant Professor in the Department of Computer Science at Johns Hopkins University. Prior to joining JHU he was a postdoctoral fellow at the Weizmann Institute of Science, after receiving his Ph.D. from Carnegie Mellon University and his A.B. from Princeton University. His research is in theoretical computer science, with an emphasis on algorithms (particularly approximation algorithms). He is also interested in algorithmic aspects of computer networking and distributed systems.
Yair is a Professor and Chair of the Computer Science Department, and the director of the Distributed Systems and Networks lab at Johns Hopkins University. He holds a B.Sc. and M.Sc. from the Technion, Israel Institute of Technology, and a Ph.D from the Hebrew University of Jerusalem, Israel. He is the recipient of the Alumni Association Excellence in Teaching Award, Whiting School of Engineering, Johns Hopkins University for 2014 and was nominated for the DARPA agency-wide “Performer with Significant Technical Achievement” award in 2004. Yair served on various technical program committees including co-chair of the IFIP/IEEE Dependable Systems and Networks (DSN) for 2015. He is a creator of several open source projects including the Spread toolkit (www.spread.org), the Spines overlay network (www.spines.org), and the Spire intrusion-tolerant control system for the power grid (www.dsn.jhu.edu).