I am a research scientist at Boston University (BU). My research focuses on building the multi-party distributed systems needed to sustain innovation in the cloud ecosystem and the sophisticated tools needed to diagnose problems in them. At BU, I lead the Diagnosis and Control of Clouds Lab (DOCC Lab) and work with Orran Krieger on the Mass Open Cloud (MOC) project. I am on the job market for tenure-track faculty positions in systems & networking.
From 2013 to 2016, I was a postdoctoral researcher in the Computer Science Department at Carnegie Mellon University (CMU). I worked on the XIA project and was advised by Professor Peter Steenkiste. My research focused on identifying ways to facilitate the deployment of new, advanced inter-domain routing protocols on the Internet and was published in SIGCOMM’17. In the Fall of 2013, I co-developed and co-taught the initial offering of CMU’s graduate class on cloud computing (15-719).
I completed my PhD in the Electrical & Computer Engineering department at CMU in May 2013. I worked at the Parallel Data Lab (PDL) and was advised by Professor Greg Ganger. My dissertation focused on creating tools to reduce the difficulty of diagnosing problems in complex distributed systems. I built one of the earliest workflow-centric tracing infrastructures capable of strongly supporting diagnosis tasks. (Workflow-centric tracing is also called end-to-end tracing or distributed tracing.) I also built one of the earliest automated diagnosis tools that used workflow-centric traces. Conference papers related to my dissertation were published in NSDI’11, InfoVis’13, and SoCC’16. My dissertation work has informed both industrial workflow-centric tracing approaches and industrial diagnosis tools that use the resulting traces.
In 2007, I appeared in a PhDComics strip encouraging CS grad students to wear lab coats to work. In my spare time, I enjoy playing tennis, running, and photography. I also occasionally blog at Formalized Curiosity.
My students and I celebrated the end of the Fall ’18 semester with dinner at the Q restaurant in downtown Boston last night. It was an evening of good food and learning how to use chopsticks :).
Lily did a fabulous job presenting her early work on this research. I’ve listed the abstract and video below.
Logging what matters: The Pythia just-in-time instrumentation framework (Lily Sturmann) (Slides): We will present our current work on Pythia, a just-in-time instrumentation framework for distributed systems that automatically enables instrumentation in the right areas to provide visibility into newly-observed problems in a running system. The talk will discuss key challenges involved in creating such a framework: (1) understanding where in the distributed system (e.g., which components) additional instrumentation is needed, (2) understanding what instrumentation (e.g., log statements or information contained in logs, such as function parameter values) is needed, and (3) understanding how to limit the overheads of enabling too much instrumentation. It will discuss how end-to-end tracing, combined with statistical measures and machine-learning techniques, provides a foundation to address these challenges. The talk will conclude with our current progress in building Pythia and applying it to problems in OpenStack.
Many of our students presented at Red Hat’s developer conference (DevConf.US) this year. I’ve listed abstracts and talk videos below.
Logging what matters: Just-in-time instrumentation and tracing (Lily Sturmann and Emre Ates) (Video): Diagnosing problems in distributed systems is time-consuming and relies heavily on developer guesswork to know where to instrument the system. The Pythia “Just-in-Time” Instrumentation Framework uses statistical measures to detect where instrumentation is needed in a distributed system to isolate specific problems as they occur. We will demonstrate an initial proof of concept by showing that one key statistical measure—high variation in the performance of work that is expected to perform similarly—can predict where additional instrumentation is needed.
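To make the statistical measure from the abstract concrete, here is a minimal sketch of the underlying idea (this is illustrative only, not Pythia’s actual code; the function and data names are hypothetical): group request latencies by the workflow they followed, and flag workflows whose latency varies more than expected as candidates for additional instrumentation.

```python
# Illustrative sketch of the measure described above (not Pythia's code):
# work that follows the same workflow is expected to perform similarly,
# so high variation within a workflow's latencies is a signal that more
# instrumentation is needed to explain the difference.
from statistics import mean, stdev

def high_variation_workflows(latencies_by_workflow, threshold=0.5):
    """Return workflows whose latency coefficient of variation exceeds
    the threshold -- candidates for additional instrumentation."""
    flagged = []
    for workflow, latencies in latencies_by_workflow.items():
        if len(latencies) < 2:
            continue  # cannot estimate variation from a single sample
        cv = stdev(latencies) / mean(latencies)  # coefficient of variation
        if cv > threshold:
            flagged.append(workflow)
    return flagged

# Hypothetical example: cache hits perform consistently, but something
# makes cache misses vary wildly -- the latter workflow gets flagged.
groups = {
    "read-cache-hit":  [1.0, 1.1, 0.9, 1.0],   # low variation
    "read-cache-miss": [1.0, 9.0, 2.0, 14.0],  # high variation
}
print(high_variation_workflows(groups))  # -> ['read-cache-miss']
```

The threshold here is arbitrary; in practice choosing what counts as “expected to perform similarly” is itself one of the hard problems the talk discusses.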
Skua: Extending distributed-systems tracing into the Linux Kernel (Harshal Sheth and Andrew Sun) (Video): Modern applications are often architected as a sprawling fleet of microservices. While this does have benefits, it also makes it incredibly difficult for developers to diagnose issues with their applications. Many tools have been developed to trace applications by recording timing data and resolving service dependencies. However, these tools miss an important part of application performance: the kernel. We present Skua, a modified suite of tracing utilities that gains insight into both application- and kernel-level behavior. Logging information produced by LTTng is augmented with tracing context information and integrated into the existing distributed-systems tracing framework provided by Jaeger.
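The core idea behind Skua can be sketched in a few lines (this is an illustration of the concept, not Skua’s implementation; the data structures and function are hypothetical): once LTTng kernel events carry the same trace ID as the application’s spans, they can be attached to the request that triggered them.

```python
# Conceptual sketch (not Skua's actual code): join kernel-level events
# with application-level spans by a shared trace ID. The span/event
# dictionaries here are hypothetical stand-ins for Jaeger spans and
# LTTng events; block_rq_issue/block_rq_complete are real LTTng
# block-layer tracepoint names, used as example event names.
from collections import defaultdict

def attach_kernel_events(app_spans, kernel_events):
    """Group kernel events under the application span that shares
    their trace ID; events with unmatched IDs are dropped."""
    by_trace = defaultdict(list)
    for event in kernel_events:
        by_trace[event["trace_id"]].append(event["name"])
    return [
        {**span, "kernel_events": by_trace.get(span["trace_id"], [])}
        for span in app_spans
    ]

spans = [{"trace_id": "t1", "operation": "read_object"}]
events = [{"trace_id": "t1", "name": "block_rq_issue"},
          {"trace_id": "t1", "name": "block_rq_complete"}]
merged = attach_kernel_events(spans, events)
print(merged[0]["kernel_events"])  # -> ['block_rq_issue', 'block_rq_complete']
```

The hard part, of course, is the first step this sketch assumes away: propagating the trace context into the kernel so that LTTng’s events carry it at all.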
Tracing Ceph using Jaeger-Blkkin (Mania Abdi) (Video): Blkkin is a custom end-to-end tracing infrastructure for Ceph. It captures the work done to process individual requests within and among Ceph’s components. But it can only be turned on for individual requests and cannot be left always-on due to the resulting overhead. We present Jaeger-Blkkin, which can be used in always-on fashion in production with low overhead. Jaeger-Blkkin is constructed by replacing much of Blkkin’s tracing functionality with that of Jaeger, a widely-deployed open-source tracing infrastructure. Jaeger-Blkkin is OpenTracing compatible, meaning that it can be replaced easily with other, more advanced tracing infrastructures when they become available.
Diagnosing and fixing problems in distributed applications running in cloud environments is extremely challenging. One key reason is a lack of needed instrumentation: it is difficult to predict a priori where instrumentation is needed, what instrumentation is needed, and within what datacenter stack layer (e.g., application, virtualization, network) instrumentation is needed to provide visibility into future problems.
To help, this proposal describes a framework that will explore the search space of possible instrumentation choices to automatically enable the instrumentation needed to help engineers diagnose a new problem. This work builds on workflow-centric tracing (also called end-to-end tracing or distributed tracing), which was a focus of my dissertation work, machine-learning techniques, and domain-specific knowledge.
My Co-PIs and I are very excited to make progress on this project!
NSF CNS CSR Small: A just-in-time, cross-layer instrumentation framework to help diagnose performance problems in distributed applications. Raja R Sambasivan, Ayse K. Coskun, Orran Krieger. $460,249.
Thanks to NSF for selecting me to attend this workshop and for funding my travel costs. I’m looking forward to learning more about NSF’s programs and how to write great proposals :).
Harshal and Andrew’s project, Tarpan: a router that supports evolvability, involved implementing a robust version of D-BGP in Quagga. D-BGP is a version of BGP that includes extensions that let it bootstrap evolvability to new inter-domain routing protocols—i.e., facilitate their deployment and gradually deprecate itself in favor of one or more of them. D-BGP was the focus of my postdoc research and was published in SIGCOMM’17. Harshal and Andrew are high school students from the MIT PRIMES research program.
Congrats to Harshal and Andrew on this accomplishment!
Workflow-centric tracing (also called end-to-end tracing or distributed-systems tracing) captures the work done within and among distributed-system components to service individual requests. Due to its ability to provide deep visibility into complex distributed-system behaviors, it is rapidly being adopted by industry (e.g., by Facebook, Google, Yelp). However, there is a dangerous belief both in academia and industry that a single workflow-centric tracing design can serve all of the use cases commonly attributed to it (e.g., diagnosing different types of problems, resource attribution).
For this paper, we teamed up with other academics and practitioners working on workflow-centric tracing to distill its key design axes. For each axis, we identified design choices best suited for various tracing use cases. We also discussed how seemingly innocuous design choices for different axes can lead to poor outcomes due to the way they interact with one another.
We have been trying to get this paper published for four years, so I’m very happy about this acceptance! The initial technical report version of this paper, which we published in 2014, has already been cited by dozens of other research papers and covered in various “Papers We Love” meetups.