Self-Healing System in Distributed Computing Environments



Reliability and robustness to hardware and software failures are important design criteria for distributed systems and their applications. The growing complexity, interconnectedness, and interdependence of the components, together with their asynchronous interactions, make fault detection and fault tolerance a challenging research problem. These components include hardware resources (computers, servers, network devices) and software (application services, middleware, web services, etc.). In this paper, we present an innovative approach based on statistical and data mining techniques to detect hardware and software faults and to perform root-cause analysis of system and application faults. In our approach, we monitor and analyze all the interactions among the components of a distributed system. We use data mining and supervised learning techniques to obtain rules that accurately model the normal interactions among these components. Our anomaly analysis engine immediately produces an alert whenever one or more of the interaction rules that capture normal operation are violated because of a software or hardware failure. Our analysis shows that our approach outperforms other techniques. For example, when trained with Training with PN, it achieves a precision of 0.998 at a 50% noise level, with missed-alarm and false-alarm rates near 0%. We also developed algorithms that automatically acquire new rules as the environment and load change over time.
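The idea of learning rules from normal interactions and alerting on violations can be illustrated with a minimal sketch. It is not the project's actual engine: it replaces the data mining and supervised learning step with a simple frequency count, and the `Interaction` record, the `min_support` threshold, and the component names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Interaction(NamedTuple):
    # Hypothetical record of one observed interaction between two components.
    source: str
    target: str
    operation: str


def learn_rules(normal_trace: Iterable[Interaction], min_support: int = 2) -> set:
    """Keep every interaction seen at least `min_support` times in fault-free traces."""
    counts = Counter(normal_trace)
    return {interaction for interaction, n in counts.items() if n >= min_support}


def detect_anomalies(rules: set, observed: Iterable[Interaction]) -> list:
    """Return the observed interactions that no learned rule accounts for."""
    return [interaction for interaction in observed if interaction not in rules]


# Training data: a fault-free trace (illustrative values only).
normal = [
    Interaction("web", "app", "request"),
    Interaction("app", "db", "query"),
] * 3
rules = learn_rules(normal)

# Live traffic containing one interaction that was never seen during training.
live = [
    Interaction("web", "app", "request"),
    Interaction("app", "cache", "query"),
]
alerts = detect_anomalies(rules, live)
```

Here `alerts` holds only the `app` to `cache` interaction, since it matches no learned rule. A real deployment would presumably use richer features (timing, payload size, call sequences) than this exact-match check.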


Our approach is based on the autonomic computing paradigm, which requires continuous monitoring and analysis of the system state, followed by planning and executing the appropriate actions whenever the system is found not to be meeting its requirements.
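One pass of such a monitor, analyze, plan, and execute cycle might look like the sketch below. Everything specific in it is an assumption made for illustration: the `response_time_ms` metric, the 200 ms requirement, and the `restart_service` action do not come from the project itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SystemState:
    # Hypothetical metric; the monitored quantities are not named in the text.
    response_time_ms: float


def analyze(state: SystemState, max_response_time_ms: float) -> bool:
    """Return True when the state violates the assumed requirement."""
    return state.response_time_ms > max_response_time_ms


def plan(violated: bool) -> Optional[str]:
    """Choose a recovery action; 'restart_service' is an illustrative placeholder."""
    return "restart_service" if violated else None


def control_step(
    monitor: Callable[[], SystemState],
    execute: Callable[[str], None],
    max_response_time_ms: float = 200.0,
) -> Optional[str]:
    """Run one monitor -> analyze -> plan -> execute pass and return the action taken."""
    state = monitor()
    action = plan(analyze(state, max_response_time_ms))
    if action is not None:
        execute(action)
    return action


# Two passes: one with a healthy reading, one with a degraded reading.
executed = []
healthy = control_step(lambda: SystemState(120.0), executed.append)
degraded = control_step(lambda: SystemState(450.0), executed.append)
```

In a running system, `control_step` would be called repeatedly, with `monitor` reading live measurements and `execute` invoking real recovery mechanisms.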




Figure 1: Self-Healing System


This project is sponsored by NSF grant number 0758579.