Diagnosing and fixing problems in distributed applications running in cloud environments is extremely challenging. One key reason is a lack of needed instrumentation: it is difficult to predict a priori where instrumentation is needed, what instrumentation is needed, and within what datacenter stack layer (e.g., application, virtualization, network) instrumentation is needed to provide visibility into future problems.
To help, this proposal describes a framework that will explore the search space of possible instrumentation choices to automatically enable the instrumentation needed to help engineers diagnose a new problem. This work builds on workflow-centric tracing (also called end-to-end tracing or distributed tracing), which was a focus of my dissertation work, machine-learning techniques, and domain-specific knowledge.
My Co-PIs and I are very excited to make progress on this project!
NSF CNS CSR Small: A just-in-time, cross-layer instrumentation framework to help diagnose performance problems in distributed applications. Raja R Sambasivan, Ayse K. Coskun, Orran Krieger. $460,249.