Raja R. Sambasivan – Raja's academic site
Photo of Assistant Professor Raja Sambasivan (Fall 2019)


Note: I am recruiting PhD students at Tufts! If you are interested, please apply to Tufts’ Computer Science Department here. Also, drop me an email telling me: (1) that you applied and (2) what interests you about systems and networking research. The deadline is January 15th, 2020.

I am an assistant professor in the Computer Science Department at Tufts University. My research focuses on supporting innovation in the cloud ecosystem. This involves addressing three broad research questions: (1) How can we build cloud-based distributed systems that are easily upgradeable or evolvable to support new use cases? (2) How can we build sophisticated tools to help engineers diagnose problems in cloud-based systems? (3) How can we build systems that are simpler and easier to observe and understand? The latter two research questions are critically important because engineers’ ability to diagnose problems in cloud systems is an upper bound on the amount of complexity (innovation) that they can support. To make progress on these questions, I combine systems/networking domain knowledge with ideas from other fields, such as machine learning and visualization.

In my previous life, I was a Red Hat Visiting Research Scientist at Boston University (BU). I led a research group that focused on building diagnosis tools for cloud systems and I worked with Orran Krieger on the Mass Open Cloud (MOC) project. From 2013 to 2016, I was a postdoctoral researcher in the CS Department at Carnegie Mellon University (CMU).  I worked on the XIA project and was advised by Peter Steenkiste.  My research focused on creating mechanisms for upgrading inter-domain routing on the Internet (SIGCOMM’17).

I completed my Ph.D. in the ECE Department at CMU in May 2013.  I worked at the Parallel Data Lab (PDL) and was advised by Professor Greg Ganger.  My dissertation focused on creating tools to reduce the difficulty of diagnosing problems in distributed systems (NSDI’11, InfoVis’13,  SoCC’16).   My research on creating workflow-centric tracing infrastructures and diagnosis tools that use the resulting traces has been cited over 200 times.  It has also influenced industrial tracing efforts—examples include Jaeger’s trace-comparison visualizations and Uber’s recent efforts to create many-to-many-trace-comparison tools.   (Workflow-centric tracing is also called end-to-end tracing.)

In 2007, I appeared in a PhDComics strip encouraging CS grad students to wear lab coats to work. In my spare time, I enjoy playing tennis (poorly), running (poorly), and reading sci-fi novels.  I also occasionally blog at Formalized Curiosity.


My students and I celebrated the end of Fall’18 with dinner at the Q restaurant in downtown Boston last night. It was an evening of good food and learning how to use chopsticks :).

Lily did a fabulous job presenting her early work on this research.  I’ve listed the abstract and video below.

Logging what matters: The Pythia just-in-time instrumentation framework (Lily Sturmann) (Slides): We will present our current work on Pythia, a just-in-time instrumentation framework for distributed systems that automatically enables instrumentation in the right areas to provide visibility into newly-observed problems in a running system. The talk will discuss key challenges involved in creating such a framework: (1) understanding where in the distributed system (e.g., which components) additional instrumentation is needed, (2) understanding what instrumentation (e.g., log statements or information contained in logs, such as function parameter values) is needed, and (3) understanding how to limit the overheads of enabling too much instrumentation. It will discuss how end-to-end tracing, combined with statistical measures and machine-learning techniques, provides a foundation to address these challenges. The talk will conclude with our current progress building Pythia and applying it to problems in OpenStack.

Many of our students presented at Red Hat’s developer conference (DevConf.US) this year.  I’ve listed abstracts and talk videos below.

Logging what matters: Just-in-time instrumentation and tracing (Lily Sturmann and Emre Ates) (Video): Diagnosing problems in distributed systems is time-consuming and heavily reliant on developer guesswork to know where to instrument the system. The Pythia “Just-in-Time” Instrumentation Framework uses statistical measures to detect where instrumentation is needed in a distributed system to isolate specific problems as they occur. We will demonstrate an initial proof of concept by showing that one key statistical measure (high performance variation among work that is expected to perform similarly) can predict where additional instrumentation is needed.
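To make the statistical measure in the abstract above concrete, here is a minimal sketch, not Pythia's actual implementation. It assumes requests have already been grouped by a label marking work that is expected to perform similarly, and it uses the coefficient of variation of latency as a stand-in for "high performance variation." The function name, the threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def high_variance_groups(requests, threshold=0.5):
    """Return {group: coefficient of variation} for groups above the threshold.

    `requests` is an iterable of (group, latency_ms) pairs, where `group` is a
    hypothetical label for work that is expected to perform similarly.
    """
    by_group = defaultdict(list)
    for group, latency_ms in requests:
        by_group[group].append(latency_ms)

    flagged = {}
    for group, latencies in by_group.items():
        if len(latencies) < 2:
            continue  # variation is undefined for a single sample
        avg = mean(latencies)
        if avg == 0:
            continue  # avoid dividing by zero
        cov = pstdev(latencies) / avg
        if cov > threshold:
            flagged[group] = cov
    return flagged


# Hypothetical sample data: one stable group and one highly variable group.
requests = [
    ("GET /volumes", 100.0), ("GET /volumes", 102.0), ("GET /volumes", 98.0),
    ("POST /servers", 50.0), ("POST /servers", 400.0), ("POST /servers", 60.0),
]
flagged = high_variance_groups(requests)
print(sorted(flagged))
```

In this sketch, a flagged group is only a candidate location for enabling more instrumentation; how Pythia chooses what to enable there is not specified in the abstract.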

Skua: Extending distributed-systems tracing into the Linux Kernel (Harshal Sheth and Andrew Sun) (Video): Modern applications are often architected as a sprawling fleet of microservices. While this does have benefits, it also makes it incredibly difficult for developers to diagnose issues with their applications. Many tools have been developed to trace applications by recording timing data and resolving service dependencies. However, these tools miss an important part of application performance: the kernel. We present Skua, a modified suite of tracing utilities that gains insight into both application- and kernel-level behavior. Logging information produced by LTTng is augmented with tracing context information and integrated into the existing distributed-systems tracing framework provided by Jaeger.

Tracing Ceph using Jaeger-Blkkin (Mania Abdi) (Video): Blkkin is a custom end-to-end tracing infrastructure for Ceph. It captures the work done to process individual requests within and among Ceph’s components. However, it can only be turned on for individual requests and cannot be left always-on due to the resulting overhead. We present Jaeger-Blkkin, which can be used in always-on fashion in production with low overhead. Jaeger-Blkkin is constructed by replacing much of Blkkin’s tracing functionality with that of Jaeger, a widely-deployed open-source tracing infrastructure. Jaeger-Blkkin is OpenTracing compatible, meaning that it can be replaced easily with other, even more advanced, tracing infrastructures when they become available.

Diagnosing and fixing problems in distributed applications running in cloud environments is extremely challenging.  One key reason is a lack of needed instrumentation: it is difficult to predict a priori where instrumentation is needed, what instrumentation is needed, and within what datacenter stack layer (e.g., application, virtualization, network) instrumentation is needed to provide visibility into future problems.

To help, this proposal describes a framework that will explore the search space of possible instrumentation choices to automatically enable the instrumentation needed to help engineers diagnose a new problem.  This work builds on workflow-centric tracing (also called end-to-end tracing or distributed tracing), which was a focus of my dissertation work, machine-learning techniques, and domain-specific knowledge.

My Co-PIs and I are very excited to make progress on this project!

NSF CNS CSR Small: A just-in-time, cross-layer instrumentation framework to help diagnose performance problems in distributed applications.  Raja R Sambasivan, Ayse K. Coskun, Orran Krieger.  $460,249.

Thanks to NSF for selecting me to attend this workshop and for funding my travel costs. I’m looking forward to learning more about NSF’s programs and how to write great proposals :).

Harshal and Andrew’s project, Tarpan: a router that supports evolvability, involved implementing a robust version of D-BGP in Quagga. D-BGP is a version of BGP that includes extensions that let it bootstrap evolvability to new inter-domain routing protocols (i.e., facilitate their deployment and gradually deprecate itself in favor of one or more of them). D-BGP was the focus of my postdoc research and was published in SIGCOMM’17. Harshal and Andrew are high school students from the MIT PRIMES research program.

Congrats to Harshal and Andrew on this accomplishment!

Workflow-centric tracing (also called end-to-end tracing or distributed-systems tracing) captures the work done within and among distributed-system components to service individual requests.  Due to its ability to provide deep visibility into complex distributed-system behaviors, it is rapidly being adopted by industry (e.g., by Facebook, Google, Yelp).  However, there is a dangerous belief both in academia and industry that a single workflow-centric tracing design can serve all of the use cases commonly attributed to it (e.g., diagnosing different types of problems, resource attribution).
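As a rough illustration of what "capturing the work done within and among components to service individual requests" can look like, the sketch below models a trace as records that share a per-request identifier and point to their parent record. This is a simplified, hypothetical data model; the `Span` class, its fields, and the sample components are assumptions and do not reflect any particular tracing infrastructure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One unit of work done by one component on behalf of one request."""
    trace_id: str             # shared by all work done for the same request
    span_id: str
    parent_id: Optional[str]  # None for the request's entry point
    component: str
    operation: str


def build_workflows(spans):
    """Group spans by request, then index each request's spans by parent."""
    workflows = defaultdict(lambda: defaultdict(list))
    for span in spans:
        workflows[span.trace_id][span.parent_id].append(span)
    return workflows


def render(workflow, parent_id=None, depth=0):
    """Return an indented outline of one request's workflow."""
    lines = []
    for span in workflow.get(parent_id, []):
        lines.append("  " * depth + f"{span.component}: {span.operation}")
        lines.extend(render(workflow, span.span_id, depth + 1))
    return lines


# Hypothetical records emitted by three components for two requests.
spans = [
    Span("req-1", "a", None, "frontend", "handle_request"),
    Span("req-1", "b", "a", "metadata", "lookup"),
    Span("req-1", "c", "a", "storage", "read_block"),
    Span("req-2", "d", None, "frontend", "handle_request"),
]
workflows = build_workflows(spans)
outline = render(workflows["req-1"])
print("\n".join(outline))
```

Real infrastructures differ along many axes that this sketch ignores, such as what causal relationships are preserved, how traces are sampled, and how concurrency is represented.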

For this paper, we teamed up with other academics and practitioners working on workflow-centric tracing to distill its key design axes. For each axis, we identified design choices best suited for various tracing use cases. We also discussed how seemingly innocuous design choices for different axes can lead to poor outcomes due to the way they interact with one another.

We have been trying to get this paper published for four years, so I’m very happy about this acceptance!  The initial technical report version of this paper, which we published in 2014, has already been cited by dozens of other research papers and covered in various “Papers We Love” meetups.


  • Creating tools for problem diagnosis
  • Workflow-centric tracing
  • Cloud computing
  • Distributed systems
  • Storage systems
  • Network architecture

Selected publications

Bootstrapping evolvability for inter-domain routing with D-BGP

Conference papers
Raja R. Sambasivan, David Tran-Lam, Aditya Akella, Peter Steenkiste
In Proceedings of SIGCOMM 2017.
Publication year: 2017

Principled workflow-centric tracing of distributed systems

Conference papers
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
In Proceedings of SoCC 2016
Publication year: 2016

Diagnosing performance changes by comparing request flows

Conference papers
Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, Gregory R. Ganger
In Proceedings of NSDI 2011
Publication year: 2011

Visualizing request-flow comparison to aid performance diagnosis in distributed systems

Conference papers, Journal papers
Raja R. Sambasivan, Ilari Shafer, Michelle Mazurek, Gregory R. Ganger
IEEE Transactions on Visualization and Computer Graphics (Proc. Information Visualization 2013), Vol. 19, no. 12, Dec. 2013
Publication year: 2013