TracingSummit2014FirstFailure

Abstract
Successful software problem determination depends heavily on the availability of debugging data such as logs, traces, and dumps. More often than not, the required information is not readily available, which makes live debugging, additional instrumentation, and problem reproduction necessary. But what if this approach is impractical because the system is not accessible, a further outage is not acceptable, or the problem cannot easily be reproduced?

First Failure Data Capture (FFDC) is a concept that aims to ensure that all relevant data is collected, retained, and reported at the first occurrence of an error. It has been implemented successfully for years in core mainframe components with high availability requirements, such as system firmware and operating systems. This presentation discusses ideas on how the FFDC concept could be applied to Linux.