Invention
United States
16/252,733
US 10,831,579
The method and design of an efficient hierarchical root cause analysis mechanism
National Central University
2020/11/10
The present invention aims to quickly diagnose the root cause of errors that occur in an environment. To this end, we build an error set by collecting a large number of errors and then use an analysis tool to analyze the set and find the characteristics shared among the errors. Based on these characteristics, we develop an algorithm and a process that handle the problem quickly and thereby reduce detection time.

With the rapid development of technology and the information industry, the failure of a machine or system can cause significant losses during the resulting downtime. Companies urgently need ways to reduce downtime, and fault-tolerant, high-availability systems are becoming correspondingly more important.

Many systems today provide fault tolerance or high availability. According to our observation, most of them follow the same basic process: detect an error, then execute a recovery mechanism. Most of them can handle more than one type of error. Some of these errors are independent of each other, but others are dependent: when one type of error occurs, it also triggers the symptoms of other types of error, and failing to distinguish between them leads to misdiagnosis. In this situation, most existing systems detect all the error symptoms and then analyze them together. This ensures correctness but adds considerable detection time. In addition, some errors may be transient failures, which recover by themselves after a period of time. Determining whether such an error is transient or permanent usually requires a sufficiently long detection time, typically several times longer than that of other errors.

As noted above, companies are concerned with reducing downtime effectively, so a time-consuming detection method cannot meet their needs. We therefore propose a modified error detection process that reduces the required detection time. The process is divided into a detection phase, a diagnosis phase, a confirmation phase, and a recovery phase. In the detection phase, we detect only the last error symptom to determine whether any error has occurred. In the diagnosis phase, we use our proposed algorithm to determine which type of error actually occurred. In the confirmation phase, the process checks whether the diagnosed error could be a transient failure and, if so, performs a further check to determine whether it is transient or permanent. Once this confirmation is complete, the process enters the recovery phase and executes the recovery mechanism for that error.
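
The four phases can be illustrated with a minimal Python sketch. It is not the patented algorithm, whose details the abstract does not give. It assumes that the "last error symptom" is the one furthest downstream in a dependency chain, that diagnosis walks upstream while a parent fault's symptom is also present, and that confirmation re-probes a possibly transient fault a fixed number of times. The `Fault` class, the probe and recovery callables, and the host/network/service hierarchy are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Fault:
    name: str
    symptom_present: Callable[[], bool]   # hypothetical probe for this fault's symptom
    recover: Callable[[], str]            # hypothetical recovery action
    causes: List[str] = field(default_factory=list)  # upstream faults that can trigger it
    may_be_transient: bool = False


def diagnose(faults: Dict[str, Fault], start: str) -> str:
    """Diagnosis phase: walk upstream while some parent fault also shows its symptom."""
    current, visited = start, {start}
    while True:
        parent = next(
            (p for p in faults[current].causes
             if p not in visited and faults[p].symptom_present()),
            None,
        )
        if parent is None:
            return current  # no active upstream fault: treat this one as the root cause
        visited.add(parent)
        current = parent


def confirm_permanent(fault: Fault, retries: int, wait: Callable[[], None]) -> bool:
    """Confirmation phase: re-probe a possibly transient fault before recovering."""
    if not fault.may_be_transient:
        return True
    for _ in range(retries):
        wait()  # in practice this would sleep for one probe interval
        if not fault.symptom_present():
            return False  # symptom cleared by itself: transient
    return True


def run_once(faults: Dict[str, Fault], last_symptom: str, retries: int = 3,
             wait: Callable[[], None] = lambda: None) -> Optional[str]:
    # Detection phase: probe only the last (most downstream) symptom.
    if not faults[last_symptom].symptom_present():
        return None
    root = diagnose(faults, last_symptom)
    if not confirm_permanent(faults[root], retries, wait):
        return f"{root}: transient, recovery skipped"
    # Recovery phase: run the recovery mechanism of the diagnosed fault.
    return f"{root}: {faults[root].recover()}"


def demo(symptoms: Dict[str, bool],
         wait: Callable[[], None] = lambda: None) -> Optional[str]:
    """Hypothetical three-level hierarchy driven by a dict of symptom flags."""
    faults = {
        "host": Fault("host", lambda: symptoms["host"],
                      lambda: "fail over to backup host"),
        "network": Fault("network", lambda: symptoms["network"],
                         lambda: "switch network path",
                         causes=["host"], may_be_transient=True),
        "service": Fault("service", lambda: symptoms["service"],
                         lambda: "restart service", causes=["network"]),
    }
    return run_once(faults, last_symptom="service", wait=wait)
```

In this sketch, a healthy system costs a single probe per cycle, and the longer confirmation window is spent only on faults flagged as possibly transient. This mirrors the abstract's stated goal of reducing detection time.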
Intellectual Property and Technology Transfer Section
03-4227151 ext. 27076