
NetCheck components

NetCheck overview

As shown in the figure above, NetCheck orders the syscalls with a heuristic Ordering Algorithm and a Network Model. Following this, NetCheck uses a Diagnoses Engine to compile any detected deviations from the network model into a diagnosis.

Ordering host traces

The order reconstruction algorithm is efficient because it simulates syscalls in an order derived from the dependencies between POSIX syscalls. These dependencies were identified by examining the POSIX specification for each system call and determining which calls can modify state that other calls use.
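The idea of deriving a simulation order from call dependencies can be sketched as follows. This is a minimal illustration, not NetCheck's actual code: the `DEPENDS_ON` table, the `build_ranks` helper, and the `candidates_in_priority_order` function are hypothetical, and the listed dependencies are a simplified assumption rather than the table NetCheck derives from the POSIX specification.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, simplified dependency table: each call maps to the calls
# whose effects it relies on (possibly on another host).
DEPENDS_ON = {
    "socket": set(),
    "bind": {"socket"},
    "listen": {"bind"},
    "connect": {"socket", "listen"},
    "accept": {"listen", "connect"},
    "send": {"connect", "accept"},
    "recv": {"send"},
}


def build_ranks(depends_on):
    """Give every call a rank such that its dependencies rank lower."""
    order = TopologicalSorter(depends_on).static_order()
    return {name: position for position, name in enumerate(order)}


RANK = build_ranks(DEPENDS_ON)


def candidates_in_priority_order(heads):
    """Sort the next unprocessed call of each host trace by rank.

    `heads` maps a host name to the name of the syscall at the front
    of that host's trace.
    """
    return sorted(heads.items(), key=lambda item: RANK[item[1]])
```

With such a ranking, a pending `connect` on one host would be tried before a pending `accept` on another, because `accept` depends on state that `connect` creates.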

Model-based syscall simulation

The network model component of NetCheck simulates syscalls to determine whether a given syscall can be added as the next syscall in the global order. The network model treats the network and the application that generated the traces as a black box and requires no application-specific information.

To simulate a syscall, the model uses the current network and host states that it tracks, together with the network semantics defined by the POSIX API. The network model state includes information about the observed connections and protocols (e.g., pending or established TCP connections), buffer lengths and their contents, datagrams sent or lost, etc. Simulating a syscall with the model results in one of three determinations: accept the call, reject the call, or permanently reject the call. Call rejections correspond to invalid model transitions; the conditions that trigger them are listed here.
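The three determinations can be illustrated with a toy model of a single one-directional byte stream. This is a sketch under stated assumptions, not NetCheck's model: `Verdict`, `StreamModel`, and the call names are hypothetical, and it assumes that a plain rejection means the call cannot be placed yet, while a permanent rejection means no later ordering could make it valid.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REJECT = "reject"                      # assumed: may become valid later
    PERMANENT_REJECT = "permanently reject"  # assumed: can never become valid


@dataclass
class StreamModel:
    """Hypothetical state for one sender-to-receiver byte stream."""
    in_flight: bytes = b""         # bytes sent but not yet received
    sender_finished: bool = False  # True once the sender has closed

    def simulate(self, call, payload=b""):
        if call == "send":
            if self.sender_finished:
                return Verdict.PERMANENT_REJECT
            self.in_flight += payload
            return Verdict.ACCEPT
        if call == "close_sender":
            self.sender_finished = True
            return Verdict.ACCEPT
        if call == "recv":
            if self.in_flight[:len(payload)] == payload:
                # The received bytes match the front of the buffer.
                self.in_flight = self.in_flight[len(payload):]
                return Verdict.ACCEPT
            if not payload.startswith(self.in_flight) or self.sender_finished:
                # The buffered bytes contradict the payload, or no more
                # bytes can ever arrive.
                return Verdict.PERMANENT_REJECT
            return Verdict.REJECT  # matching bytes may still be sent
        return Verdict.PERMANENT_REJECT  # call unknown to this toy model
```

In this sketch, a `recv` simulated before its matching `send` is rejected and can be retried later, whereas a `recv` for data that can no longer be sent is permanently rejected.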

Fault diagnoses engine

When NetCheck finishes processing the traces, whether by consuming all actions, finding an ordering error, or permanently rejecting an action, the state of the model contains valuable information. The diagnoses engine in NetCheck analyzes the model simulation state and any simulation errors to derive a fault diagnosis. This makes the simulation results more meaningful to an administrator who might be tasked with resolving the issue.
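A rule-based pass over the final model state might look like the sketch below. The `diagnose` function, the state keys `unread_bytes` and `pending_connections`, and the message wording are all hypothetical; they only illustrate how leftover state and simulation errors could be turned into readable findings, not the rules NetCheck actually applies.

```python
def diagnose(final_state, simulation_errors):
    """Turn leftover model state and errors into readable findings.

    `final_state` is a dict of hypothetical counters; `simulation_errors`
    is a list of (call, reason) pairs recorded during simulation.
    """
    findings = []
    for call, reason in simulation_errors:
        findings.append(f"{call} could not be explained by the model: {reason}")
    unread = final_state.get("unread_bytes", 0)
    if unread:
        findings.append(f"{unread} bytes were sent but never received")
    pending = final_state.get("pending_connections", 0)
    if pending:
        findings.append(f"{pending} connection(s) were never accepted")
    if not findings:
        findings.append("no deviations from the network model were detected")
    return findings
```

For example, a final state with buffered but unread data and one permanently rejected `recv` would yield two findings, one for each deviation.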

Efficiency and performance

NetCheck successfully classifies more than 90% of failures when applied to execution traces of bugs affecting dozens of popular applications and protocols, including FTP, Python, VLC, and Ruby. The generated unit tests are described in more detail on the following page:

Related work

Numerous fault diagnosis tools have been developed [1-6], but few of them are applicable to large applications whose source code is not available. Without source code, administrators often resort to probing tools such as ping and traceroute, which can help to diagnose reachability problems but cannot diagnose application-level issues.

References

[1] Bhavish Aggarwal, Ranjita Bhagwan, Tathagata Das, Siddharth Eswaran, Venkata N. Padmanabhan, and Geoffrey M. Voelker. NetPrints: Diagnosing home network misconfigurations using shared knowledge. In NSDI (2009).

[2] Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, and Jeffrey S. Chase. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In OSDI (2004).

[3] Nick Feamster and Hari Balakrishnan. Detecting BGP configuration faults with static analysis. In NSDI (2005).

[4] Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, and Amin Vahdat. Pip: Detecting the unexpected in distributed systems. In NSDI (2006).

[5] Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis in enterprise networks. ACM SIGCOMM Computer Communication Review 39, 4 (2009), 243–254.

[6] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A pervasive network tracing framework. In NSDI (2007).

Last modified on Feb 10, 2014 1:04:11 PM
