How to Manage System Failures


Service failures are typically detected at the system level. The helpdesk gets a call that the customer quotation system is down; the component services behind that system are not apparent to the helpdesk person, so system-level procedures are used. Every failure must be treated as an incident. An organization needs formal incident recording, tracking, management and monitoring if it is to understand its current performance level and build the capacity for long-term improvement of its incident management. Incident recording should be deferred only in life-threatening or hazardous situations.

Historically, IT staff have been disproportionately involved in service continuity because the IT services they supplied were particularly unreliable. Now that IT hardware is much more reliable, the emphasis has shifted to the complexity of IT installations and the organization's dependence on these services.

Most incidents will have an established recovery approach that has been designed, planned and approved in advance. Staff will have been trained in the appropriate recovery actions; they carry them out and then complete the incident record. In large installations, any incident that falls outside the defined cases covered by standard training is quickly escalated to a specialist team that can investigate, request additional facilities, and work towards a solution.
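
As a rough illustration of this flow, the sketch below looks up a pre-approved recovery procedure for an incident and escalates when none is defined. Everything in it is an assumption made for the example: the `IncidentRecord` fields, the `RUNBOOKS` catalogue, the symptom names and the `restart_service` action are hypothetical and do not come from any particular incident management product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IncidentRecord:
    """Hypothetical incident record; the fields are illustrative only."""
    system: str
    symptom: str
    actions: List[str] = field(default_factory=list)
    escalated: bool = False


def restart_service(record: IncidentRecord) -> None:
    # Stand-in for a pre-approved recovery action that staff are trained on.
    record.actions.append(f"restarted {record.system}")


# Hypothetical catalogue of pre-designed, pre-approved recovery procedures,
# keyed by the symptom reported to the helpdesk.
RUNBOOKS: Dict[str, Callable[[IncidentRecord], None]] = {
    "service_unresponsive": restart_service,
}


def handle(record: IncidentRecord) -> IncidentRecord:
    """Apply the known recovery procedure, or escalate if none is defined."""
    runbook = RUNBOOKS.get(record.symptom)
    if runbook is None:
        # Outside the cases covered by standard training: hand over to specialists.
        record.escalated = True
        record.actions.append("escalated to specialist team")
    else:
        runbook(record)
    # In both cases the actions taken are written to the incident record.
    return record
```

A call such as `handle(IncidentRecord("customer quotation system", "service_unresponsive"))` would apply the restart procedure, while an unlisted symptom would be marked as escalated. Either way the record is completed, which is what makes later performance analysis possible.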

Every defined incident has an incident level that establishes its communication channels and priority. Any member of the organization can raise an incident notification directly through the IT helpdesk. The IT staff themselves will generate additional notifications.
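
One way to picture the link between incident level, communication channels and priority is a simple lookup table. The sketch below is only an assumption about how such a table might look: the three level names, the priority numbers and the channel names are invented for illustration, and a real organization would define its own.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Tuple


class IncidentLevel(IntEnum):
    # Hypothetical three-level scheme; real schemes vary by organization.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class LevelPolicy:
    priority: int               # 1 is the most urgent
    channels: Tuple[str, ...]   # who is told, and how


# Illustrative policy table: each level fixes a priority and its channels.
POLICIES: Dict[IncidentLevel, LevelPolicy] = {
    IncidentLevel.MINOR: LevelPolicy(3, ("helpdesk_queue",)),
    IncidentLevel.MAJOR: LevelPolicy(2, ("helpdesk_queue", "it_duty_manager")),
    IncidentLevel.CRITICAL: LevelPolicy(
        1, ("helpdesk_queue", "it_duty_manager", "business_continuity_team")
    ),
}


def policy_for(level: IncidentLevel) -> LevelPolicy:
    """Return the priority and communication channels for an incident level."""
    return POLICIES[level]
```

Keeping the mapping in one table means helpdesk staff and IT staff read the same definition of who is notified at each level, whichever of them raised the incident.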

Higher levels of criticality will invoke the business continuity plan or the crisis management plan. IT staff will need to know the escalation plan, contacts and urgent actions that they must carry out. The complementary situation arises when a business continuity failure in another part of the organization leads to special, additional requirements on the IT facilities in this location. With critical incidents the focus is on restoration – the challenge of finding the root cause of the problem will have to wait until appropriate resources can be brought to bear and priorities determined.

