Bootstrapping evolvability for inter-domain routing with D-BGP

Conference papers
Raja R. Sambasivan, David Tran-Lam, Aditya Akella, Peter Steenkiste
In Proceedings of SIGCOMM 2017
Publication year: 2017

The Internet’s inter-domain routing infrastructure, provided today by BGP, is extremely rigid and does not facilitate the introduction of new inter-domain routing protocols.  This rigidity has made it incredibly difficult to widely deploy critical fixes to BGP.  It has also depressed ASes’ ability to sell value-added services or replace BGP entirely with a more sophisticated protocol.  Even if operators undertook the significant effort needed to fix or replace BGP, it is likely the next protocol will be just as difficult to change or evolve.  To help, this paper identifies two features needed in the routing infrastructure (i.e., within any inter-domain routing protocol) to facilitate evolution to new protocols.  To understand their utility, it presents D-BGP, a version of BGP that incorporates them.

Principled workflow-centric tracing of distributed systems

Conference papers
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
In Proceedings of SoCC 2016
Publication year: 2016

Workflow-centric tracing captures the workflow of causally-related events (e.g., work done to process a request) within and among the components of a distributed system.  As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior.  Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis.  Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential.  To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks.  Our design space and the design choices we suggest are based on our experiences developing several previous workflow-centric tracing infrastructures.

Visualizing request-flow comparison to aid performance diagnosis in distributed systems

Conference papers, Journal papers
Raja R. Sambasivan, Ilari Shafer, Michelle Mazurek, Gregory R. Ganger
IEEE Transactions on Visualization and Computer Graphics (Proc. Information Visualization 2013), Vol. 19, no. 12, Dec. 2013
Publication year: 2013

Distributed systems are complex to develop and administer, and performance problem diagnosis is particularly challenging. When performance degrades, the problem might be in any of the system’s many components or could be a result of poor interactions among them. Recent research efforts have created tools that automatically localize the problem to a small number of potential culprits, but research is needed to understand what visualization techniques work best for helping distributed systems developers understand and explore their results.  This paper compares the relative merits of three well-known visualization approaches (side-by-side, diff, and animation) in the context of presenting the results of one proven automated localization technique called request-flow comparison.  Via a 26-person user study, which included real distributed systems developers, we identify the unique benefits that each approach provides for different problem types and usage modes.

Diagnosing performance changes by comparing request flows

Conference papers
Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger
In Proceedings of NSDI 2011
Publication year: 2011

The causes of performance changes in a distributed system often elude even its developers. This paper develops a new technique for gaining insight into such changes: comparing system behaviours from two executions (e.g., of two system versions or time periods). Building on end-to-end request flow tracing within and across components, algorithms are described for identifying and ranking changes in the flow and/or timing of request processing. The implementation of these algorithms in a tool called Spectroscope is described and evaluated. Six case studies are presented of using Spectroscope to diagnose performance changes in a distributed storage system caused by code changes, configuration modifications, and component degradations, demonstrating the value and efficacy of comparing request flows. Preliminary experiences of using Spectroscope to diagnose performance changes within Google are also presented.
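The core idea — grouping requests by the structure of their flows and flagging groups whose timing or structure differs between two executions — can be pictured with a small sketch. This is a simplified stand-in, not Spectroscope’s actual algorithm (which uses statistical tests rather than a fixed mean-shift threshold); the function names, record layout, and threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_by_structure(requests):
    """Group traced requests by flow structure: the ordered list of
    components each request visited."""
    groups = defaultdict(list)
    for path, latency_ms in requests:
        groups[tuple(path)].append(latency_ms)
    return groups

def rank_changes(before, after, threshold_ms=1.0):
    """Flag structures seen in only one execution as structural changes;
    rank shared structures by how far their mean latency moved."""
    g_before, g_after = group_by_structure(before), group_by_structure(after)
    structural, timing = [], []
    for structure in g_before.keys() | g_after.keys():
        if structure not in g_before or structure not in g_after:
            structural.append(structure)
            continue
        delta = mean(g_after[structure]) - mean(g_before[structure])
        if abs(delta) > threshold_ms:
            timing.append((abs(delta), structure, delta))
    timing.sort(reverse=True)  # largest latency shifts first
    return structural, [(s, d) for _, s, d in timing]

# Two executions of the same system: after the change, the metadata
# path got slower and a new replication step appeared.
before = [(["fe", "md", "disk"], 10.0), (["fe", "md", "disk"], 12.0),
          (["fe", "cache"], 1.0)]
after = [(["fe", "md", "disk"], 30.0), (["fe", "md", "disk"], 34.0),
         (["fe", "md", "repl", "disk"], 40.0)]
structural, timing = rank_changes(before, after)
```

Here the comparison surfaces both kinds of change: the cache-hit flow disappeared, a replication step appeared, and the shared metadata flow slowed by 21 ms on average.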

A transparently-scalable metadata service for the Ursa Minor storage system

Conference papers
Shafeeq Sinnamohideen, Raja R. Sambasivan, James Hendricks, Likun Liu, Gregory R. Ganger
In Proceedings of ATC 2010
Publication year: 2010

The metadata service of the Ursa Minor distributed storage system scales metadata throughput as metadata servers are added. While doing so, it correctly handles operations that involve metadata served by different servers, consistently and atomically updating such metadata. Unlike previous systems, Ursa Minor does so by reusing existing metadata migration functionality to avoid complex distributed transaction protocols. It also assigns object IDs to minimize the occurrence of multi-server operations. This approach allows Ursa Minor to implement a desired feature with less complexity than alternative methods and with minimal performance penalty (under 1% in non-pathological cases).
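One way to picture the object-ID idea: if a child’s ID is placed in the same contiguous ID range as its parent directory, a range-partitioned server table keeps most directory operations on a single server, and migration amounts to rewriting the table. The bit layout, names, and table scheme below are hypothetical, not Ursa Minor’s actual design.

```python
DIR_SHIFT = 16  # high bits name a "directory group" -- hypothetical layout

def child_object_id(parent_id, seq):
    """Give a child an ID in its parent directory's group, so parent and
    child land on the same metadata server in the common case."""
    group = (parent_id >> DIR_SHIFT) << DIR_SHIFT
    return group | (seq & ((1 << DIR_SHIFT) - 1))

def server_for(object_id, table):
    """table maps contiguous ID ranges to servers; migrating metadata is
    just rewriting this table, with no distributed transaction."""
    for (lo, hi), server in table.items():
        if lo <= object_id <= hi:
            return server
    raise KeyError(hex(object_id))

table = {(0x00020000, 0x0002FFFF): "mds1", (0x00030000, 0x0003FFFF): "mds2"}
parent = 0x00020000          # a directory's object ID
child = child_object_id(parent, 7)
```

With this placement, a create or rename within one directory touches only `mds1`; only operations spanning directory groups become multi-server.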

Modeling the relative fitness of storage

Conference papers
Michael Mesnier, Matthew Wachs, Raja R. Sambasivan, Gregory R. Ganger
In Proceedings of SIGMETRICS 2007
Publication year: 2007

Relative fitness is a new black-box approach to modeling the performance of storage devices. In contrast with an absolute model that predicts the performance of a workload on a given storage device, a relative fitness model predicts performance differences between a pair of devices. There are two primary advantages to this approach. First, because a relative fitness model is constructed for a device pair, the application-device feedback of a closed workload can be captured (e.g., how the I/O arrival rate changes as the workload moves from device A to device B). Second, a relative fitness model allows performance and resource utilization to be used in place of workload characteristics. This is beneficial when workload characteristics are difficult to obtain or concisely express (e.g., rather than describe the spatio-temporal characteristics of a workload, one could use the observed cache behavior of device A to help predict the performance of B).

This paper describes the steps necessary to build a relative fitness model, with an approach that is general enough to be used with any black-box modeling technique. We compare relative fitness models and absolute models across a variety of workloads and storage devices. On average, relative fitness models predict bandwidth and throughput within 10–20% and can reduce prediction error by as much as a factor of two when compared to absolute models.
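A toy illustration of the relative-fitness idea, using ordinary least squares as the black-box learner (the paper’s framework permits any learner). The synthetic workload, the feature layout, and the hidden “cache friendliness” variable are all fabricated for illustration: device A’s observed performance carries information about device B that the coarse workload features miss, so the relative model outpredicts the absolute one.

```python
import numpy as np

def fit(X, y):
    """Least-squares linear model with intercept (stand-in for any
    black-box learner)."""
    X1 = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic data: device B's bandwidth depends on cache behavior that
# the coarse workload features miss, but device A's observed bandwidth
# reflects it.
rng = np.random.default_rng(0)
feats = rng.uniform(0, 1, (200, 2))    # coarse workload features
hidden = rng.uniform(0, 1, 200)        # e.g., cache friendliness (unobserved)
perf_a = 50 + 30 * hidden + 5 * feats[:, 0]
perf_b = 80 + 60 * hidden + 2 * feats[:, 1]

absolute = fit(feats, perf_b)                             # features -> B
rel_X = np.column_stack([feats, perf_a])                  # features + A's perf
relative = fit(rel_X, perf_b)                             # -> B

err_abs = np.abs(predict(absolute, feats) - perf_b).mean()
err_rel = np.abs(predict(relative, rel_X) - perf_b).mean()
```

In this contrived setup the relative model is nearly exact, because B’s performance is linear in A’s performance plus the features, while the absolute model cannot see the hidden variable at all.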

//TRACE: Parallel trace replay with approximate causal events

Conference papers
Michael P. Mesnier, Matthew Wachs, Raja R. Sambasivan, Julio Lopez, James Hendricks, Gregory R. Ganger
In Proceedings of FAST 2007
Publication year: 2007

//TRACE is a new approach for extracting and replaying traces of parallel applications to recreate their I/O behavior. Its tracing engine automatically discovers inter-node I/O data dependencies and inter-I/O compute times for each node (process) in an application. This information is reflected in per-node annotated I/O traces. Such annotation allows a parallel replayer to closely mimic the behavior of a traced application across a variety of storage systems. When compared to other replay mechanisms, //TRACE offers significant gains in replay accuracy. Overall, the average replay error for the parallel applications evaluated in this paper is below 6%.
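In outline, a per-node annotated trace pairs each I/O with the compute (“think”) time that preceded it, and the replayer recreates both per node in parallel. The record layout and names below are illustrative, not //TRACE’s format, and a real replayer would also honor the discovered inter-node data dependencies, which this sketch omits.

```python
import threading
import time

def replay_node(trace, issue_io):
    """Replay one node's annotated trace: sleep the recorded inter-I/O
    compute time, then re-issue the I/O."""
    for think_time_s, op in trace:
        time.sleep(think_time_s)
        issue_io(op)

def replay_parallel(traces, issue_io):
    """Replay all nodes' traces concurrently, one thread per node."""
    threads = [threading.Thread(target=replay_node, args=(t, issue_io))
               for t in traces]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# A two-node trace with negligible think times, issued to a recorder.
issued, lock = [], threading.Lock()
def record(op):
    with lock:
        issued.append(op)

traces = [[(0.0, "read A"), (0.0, "write B")], [(0.0, "read C")]]
replay_parallel(traces, record)
```

Replay accuracy in this style comes from the per-node annotations: each node preserves its own compute gaps while the storage system under test determines the I/O times.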

Ursa Minor: versatile cluster-based storage

Conference papers
Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, Jay J. Wylie
In Proceedings of FAST 2005
Publication year: 2005

No single encoding scheme or fault model is optimal for all data. A versatile storage system allows them to be matched to access patterns, reliability requirements, and cost goals on a per-data item basis. Ursa Minor is a cluster-based storage system that allows data-specific selection of, and on-line changes to, encoding schemes and fault models. Thus, different data types can share a scalable storage infrastructure and still enjoy specialized choices, rather than suffering from “one size fits all.” Experiments with Ursa Minor show performance benefits of 2–3× when using specialized choices as opposed to a single, more general, configuration. Experiments also show that a single cluster supporting multiple workloads simultaneously is much more efficient when the choices are specialized for each distribution rather than forced to use a “one size fits all” configuration. When using the specialized distributions, aggregate cluster throughput nearly doubled.
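The “versatile” knob can be pictured as a per-data-item policy mapping access pattern and reliability goals to a data distribution. The policy, names, and thresholds below are invented for illustration; they are not Ursa Minor’s actual selection logic.

```python
def choose_distribution(read_frac, failures_to_tolerate, cost_sensitive):
    """Toy per-item policy in the spirit of versatile storage: pick a
    (scheme, copies-or-fragments) pair from the access pattern and
    reliability requirement. Thresholds are illustrative only."""
    if failures_to_tolerate == 0:
        return ("striping", 1)            # no redundancy required
    if read_frac > 0.9 and not cost_sensitive:
        # Read-mostly and performance-critical: replicate for fast reads.
        return ("replication", failures_to_tolerate + 1)
    # Write-heavy or cost-sensitive: erasure-code to save capacity.
    return ("erasure", failures_to_tolerate)

hot_index = choose_distribution(0.95, 2, cost_sensitive=False)
archive = choose_distribution(0.50, 2, cost_sensitive=True)
scratch = choose_distribution(0.95, 0, cost_sensitive=False)
```

Because each item carries its own choice, a read-mostly index, a capacity-sensitive archive, and redundancy-free scratch data can coexist on one cluster instead of all inheriting a single compromise configuration.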

Replication policies for layered clustering of NFS servers

Conference papers
Raja R. Sambasivan, Andrew J. Klosterman, Gregory R. Ganger
In Proceedings of MASCOTS 2005
Publication year: 2005

Layered clustering offers cluster-like load balancing for unmodified NFS or CIFS servers. Read requests sent to a busy server can be offloaded to other servers holding replicas of the accessed files. This paper explores a key design question for this approach: which files should be replicated? We find that the popular policy of replicating read-only files offers little benefit. A policy that replicates read-only portions of read-mostly files, however, implicitly coordinates with client cache invalidations and thereby allows almost all read operations to be offloaded. In a read-heavy trace, 75% of all operations and 52% of all data transfers can be offloaded.
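A back-of-the-envelope model of the policy comparison: replicate files whose write fraction falls below a threshold, and count a read as offloadable unless it immediately follows a write to the same file (while the replica would be stale). The trace, names, and threshold values are made up for illustration; the paper’s evaluation works from real NFS traces with client-cache effects.

```python
from collections import defaultdict

def offloadable_reads(trace, max_write_frac):
    """Toy replication-policy model: replicate files whose write fraction
    is <= max_write_frac; reads to replicated files are offloadable,
    except the first read after each write (replica being refreshed)."""
    ops = defaultdict(list)
    for fname, op in trace:
        ops[fname].append(op)
    offload, total_reads = 0, 0
    for seq in ops.values():
        writes = seq.count("w")
        total_reads += len(seq) - writes
        if writes / len(seq) > max_write_frac:
            continue  # not replicated under this policy
        dirty = False
        for op in seq:
            if op == "w":
                dirty = True
            elif dirty:
                dirty = False  # first read after a write: served locally
            else:
                offload += 1
    return offload / total_reads if total_reads else 0.0

trace = ([("a", "r")] * 8 + [("a", "w")] + [("a", "r")] * 8   # read-mostly
         + [("b", "r")] * 4                                   # read-only
         + [("c", "w")] * 5 + [("c", "r")])                   # write-heavy

read_only_policy = offloadable_reads(trace, max_write_frac=0.0)
read_mostly_policy = offloadable_reads(trace, max_write_frac=0.2)
```

Even in this tiny trace the gap is stark: replicating only never-written files offloads a small minority of reads, while also replicating the read-mostly file offloads nearly all of them, mirroring the paper’s finding.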