Business Continuity & Disaster Recovery: Issues for IT

The Challenge

Despite considerable advances in equipment and telecommunications design and in recovery services, IT disaster recovery is becoming increasingly challenging. Continuity and recovery requirements are shaping IT strategy, and their cost is straining IT budgets. The time window for recovery is shrinking in the face of the demand for 24 / 365 operations. Some studies claim that around 30% of high-availability applications have to be recovered in under three hours and a further 45% within 24 hours before losses become unsustainable [1]; others claim that 60% of Enterprise Resource Planning (ERP) systems have to be restored in under 24 hours [2].

This means that traditional off-site backup and restore methods are often no longer adequate. It simply takes too long to recover incremental and full image backups of various inter-related applications (backed up at different times), synchronise them and re-create the position as it stood at the time of the disaster. Continuous-operation data mirroring to off-site locations, together with standby computing and telecommunications, may be the only solution. A risk assessment and business impact analysis should establish the justification for continuity for specific IT and telecommunication services and applications.

Non-Stop Dependability or Recovery

Especially for e-commerce, the emphasis is increasingly on non-stop dependability rather than on recovery. Creating a highly robust capability costs money in replication and resilience of equipment and communications. This should be considered at the project initiation stage. Even this design will not guarantee availability: it raises the question of the continuity of operations of perhaps dozens of intermediate suppliers who provide access up to the firewall, as well as issues of application continuity and scalability.

Impact on IT Strategy

The need for non-stop computing may drive IT equipment procurement policy, suggesting a Neverfail approach.
Consolidation of mission-critical applications and services may help to justify investment in resilient servers for them, or possibly clustering. It is usually cheaper and easier to concentrate protection than to spread it over many servers. Also, the fewer servers that need high protection, the cheaper the subscriptions to recovery services and the easier and quicker it is to effect restoration.

Not all commercial standby sites keep up to date with equipment. They exist to make a profit. They do not, therefore, usually invest in the latest equipment and then try to sell subscriptions for it. They are more likely to wait for client demand to justify acquiring the new equipment, so that its cost is quickly recovered. This means that organizations that buy the latest and greatest may find a time gap between acquiring this equipment and having a standby recovery site available. They either have to bear the full cost themselves as an in-company solution, or pay a high price for their subscriptions to a commercial service, since the vendor has little competition and will seek to recover as much of the investment as possible from the first customer.

While it might be difficult to justify investment in Storage Area Networks (SAN) on the basis of recovery capability, it certainly helps recovery if data is concentrated in a few places and can be easily backed up from them. Similarly, intelligent back-up capability such as that offered by Veritas [3] or Previo [4] has everyday operational benefits, providing e-support capability as well as speeding recovery in the event of major data loss.

Incidentally, we have come across numerous cases where standard, routine backups have failed. This may be because:
Contractual Arrangements for Disaster Recovery Services

In theory a commercial hot or warm standby site is available 24 / 365. It has staff skilled in assisting recovery. Its equipment is constantly kept up to date, while older equipment remains supported. It is always available for use and offers testing periods once or twice a year.

The practice may be different. These days, organizations have a wide range of equipment from different vendors and different models from the same vendor. Not every commercial standby site is able to support the entire range of equipment you may have. Instead, vendors form alliances with others, but this may mean that your recovery effort is split between more than one standby site.

When you invoke the standby site, the desks may have mainframe or midrange terminals or thin clients on them, instead of the PC / Local Area Network (LAN) / server environment you need. There may be a lead time to de-install the terminals and install the PC / LAN environment. Perhaps the standby site is already occupied by another client who is testing. This client has to be allowed time to close down and move out. Typically four hours' notice may be required before the invoking client can actually occupy the standby facility (although they may gain access to meeting rooms earlier).

The standby site may not have identical IT equipment: instead of an identical piece of equipment, it may offer a partition on a compatible large computer or server. Operating systems and security packages may not be the same version as the client usually uses. These differences may cause setbacks when attempting recovery of IT systems and applications, and weak change control at the recovery site could cause a disaster on return to the normal site.

Call Center standby sites may not always be compatible in equipment, nor able to replicate the whole of an integrated customer relationship management system.
Indeed, most of the Call Center recovery plans we have seen simply would not work.

Commercial standby sites vary in the standard of facilities they provide. Many have workspaces with small desks and limited storage space. Some have meeting rooms, restaurants, shower and rest facilities, while others may have very basic facilities. The presentation and location of some is superb; others are basic buildings in insalubrious surroundings that may cause concern for staff arriving and leaving at night. Note that some commercial standby sites have only a limited amount of equipment actually on site and will contract for quick resupply of additional equipment if a client invokes use of the standby site in a disaster. Some are not convenient for public transport. Others may not have sufficient parking space.

Telecommunications issues may also arise: it is important to ensure that relevant links are in place and that communications capability is compatible. The adequacy of voice and data capacity needs to be checked. Telephony needs to be switched from the disaster site to the standby site: can this be done? Can your staff operate the switchboard at the standby site? (Incidentally, having telephone and fax numbers mixed up sequentially does not help in recovery: it is much easier if fax and telephone numbers are separable in distinct blocks.)

Most commercial standby sites offering IT and work area recovery facilities do not guarantee a service: the contract merely provides access to the equipment. Although most reputable vendors will negotiate a Service Level Agreement that specifies the quality of the service, one is rarely offered unprompted. It is important to ensure that your service will not suffer from unacceptable downtime or response times.

The vendor charging structure needs to be carefully considered. It is not unusual for vendors to seek to recover fixed costs plus basic profit margin for the whole recovery site from the first five subscribers.
These first five may be tempted by discounts for three- or five-year contracts. However, as soon as the fixed costs plus basic profit margin are covered, the vendor can afford to discount significantly, since its additional costs are marginal and new subscriptions go almost entirely to the profit line. Knowing this, in one case, we achieved a discount of over 70% from one vendor for the same service. However, the initial subscribers may be locked in to paying top dollar for the next three or five years. When one considers that each facility may typically have up to 35 subscribers, it can be a lucrative business for the vendor. That, coupled with future earnings being covered by medium- and long-term contracts, is why one service vendor came to the market on 68 times earnings.

Some vendors offer a drop-ship service as an alternative to occupying the standby site. That is, in the event of equipment failure, for instance, they will drop off a replacement rather than insist the client occupy the standby site, with all the inconvenience that may involve. Some vendors include this as part of the standard subscription, while others treat it as a premium service and charge extra for it.

The vendor may have skilled staff available, but this is rarely guaranteed and they come at a cost. There may also be additional fees to pay for testing, on invocation of a disaster, and for occupation in a disaster. Some of these costs may be covered by your insurance as extra cost of working, but this should be checked before the disaster.

Mobile / portable alternate facilities sound attractive, but it is essential that a site survey is undertaken to ensure they can be parked on the required site. We know of at least one disaster invocation where a mobile unit arrived outside the disaster site, only to be moved on by the police.

The better vendors have algorithms and checks to ensure a minimal possibility of a client invoking and finding the standby site occupied (i.e. they calculate an actuarial basis for standby). They will welcome your attendance at user group meetings and will provide references (subject to confidentiality agreements with their clients). Larger vendors will have a number of other sites to which clients can overflow if necessary.

Among the questions we need to ask are:
This may seem negative or even hostile to vendors. It is not intended to be so. Most standby site vendors provide sound service at reasonable cost and are genuinely dedicated to assisting their clients under the most difficult circumstances. They have an enviable record of successful recoveries. But, as in any industry, there are a few unscrupulous suppliers. It is the responsibility of the IT manager to ensure effective recovery by selecting vendors who apply the highest standards, and by supporting this with a stringent contract that clearly defines service specifications, technical requirements and Service Level Agreements.
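As a rough illustration of the charging structure described earlier, the sketch below works through the arithmetic. It takes the figures mentioned in the text (five initial subscribers, up to 35 subscribers per facility, a 70% discount) and adds two assumptions of our own: a purely hypothetical fixed cost plus margin of 1,000,000 per year, and the simplification that every later subscriber receives the same 70% discount. The function name and parameters are illustrative only and do not describe any particular vendor's pricing.

```python
def subscription_revenue(fixed_cost_plus_margin,
                         early_subscribers=5,
                         total_subscribers=35,
                         late_discount_pct=70):
    """Return (early_fee, late_fee, total_revenue) per year.

    Assumes the vendor recovers its fixed cost plus basic margin from the
    early subscribers, and that every later subscriber pays the early fee
    less a flat percentage discount (a simplifying assumption).
    """
    # Fee each early subscriber pays so that together they cover the site.
    early_fee = fixed_cost_plus_margin / early_subscribers
    # Fee paid by each later subscriber after the discount.
    late_fee = early_fee * (100 - late_discount_pct) / 100
    late_count = total_subscribers - early_subscribers
    total_revenue = early_fee * early_subscribers + late_fee * late_count
    return early_fee, late_fee, total_revenue


# Hypothetical annual fixed cost plus margin for one standby facility.
early, late, total = subscription_revenue(1_000_000)
print(f"Early subscriber fee: {early:,.0f}")
print(f"Late subscriber fee:  {late:,.0f}")
print(f"Total revenue:        {total:,.0f}")
```

Under these assumptions, a fully subscribed facility brings in 2.8 times its fixed cost plus margin even with every later subscriber heavily discounted, which is why the timing of a subscription matters when negotiating.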
Credit: Andrew Hiles is a director of Kingswell International, consultants in business risk management and service management.

[1] Source: Compaq / Tandem
[2] Source: Comdisco