
How to test a BCDR plan

You already know the importance of having a robust disaster recovery plan for your customers—and your own business. But that’s just the first part of preparing for a disruptive event. A comprehensive BCDR plan must also include testing that covers three areas: people, processes, and technology. You have to determine that:

  • The technical infrastructure can handle the demands of the plan.
  • Employees have the information and tools they need to carry out their responsibilities in the BCDR plan.
  • The procedures and protocols in place work in practice as well as in theory.

Testing is the final part of designing and implementing BCDR plans that work the way they are supposed to. This post provides comprehensive guidance on BCDR testing to ensure those plans will function as required when needed.

Importance of testing a disaster recovery plan

Putting detailed business continuity disaster recovery (BCDR) plans in place for your customers is one of an MSP’s most critical functions. If a client’s organization does face a disruptive event, you need to make sure it—and you—are ready. Watch our on-demand webinar, BDR + NOC: Backup Your Data Better, to learn more about the different solutions that will ensure you have all the necessary bases covered.

If you already have those plans in place—great! But have you tested them recently?

BCDR plans don’t fall under the “set it and forget it” category. Threats evolve, technologies change, and unexpected issues arise. Even the most detailed and thorough plan can look flawless in theory, but in practice, you may uncover some serious issues that could lead to potentially catastrophic data loss or downtime. The most careful planning is pointless without regular and rigorous testing.

BCDR testing involves running exercises and simulations to ensure there are no gaps, vulnerabilities, or unforeseen issues with a BCDR plan. Key aspects generally include:

  • Defining and designing specific scenarios that align with possible real-world threats (such as cyberattacks or natural disasters)
  • A detailed testing plan that outlines objectives, scope, methodologies, personnel, timelines, logistics, and criteria for success
  • Post-testing assessment and analysis to identify areas of improvement and lessons learned

It’s also essential to assess communication and coordination processes (such as notifications, employee responsibilities, and escalation procedures) at every step, as these are critical to the success of the plan. You should ensure that organizational stakeholders understand their roles and responsibilities, as well as where and how to share and get information during a crisis (for example, by using instant messaging if the business’s email system is inaccessible).

The consequences of not engaging in BCDR plan testing can be severe for both you and your clients, including:

  • Lost data
  • Downtime
  • Loss of professional reputation and credibility
  • Significant financial costs

The loss of customer trust can be catastrophic. Current clients may decide to work with another provider, and the damage to your reputation could scare off potential customers. If you are serious about ensuring your customers can survive a disaster, cyberattack, or any other incident, you must include testing as a consistent element of BCDR planning and readiness. By doing so, you also help develop your customers’ resilience against evolving threats and cultivate your own professional credibility.


BCDR goals for testing

Goals give BCDR testing a clear direction and help ensure that tests align with overall business goals. In particular, you should establish Recovery Point Objectives (RPOs), which define the maximum amount of data (usually expressed as a window of time) that the business can afford to lose, and Recovery Time Objectives (RTOs), which define the maximum acceptable time before services are restored.
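As an illustration, the timings recorded during a recovery test can be compared against the agreed RPO and RTO. The following is a minimal Python sketch, not a prescribed method: the `RecoveryTestResult` fields, the `evaluate` function, and the sample timestamps and targets are all hypothetical and would be replaced by whatever your own test log captures.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTestResult:
    """Timings captured during one recovery test (hypothetical field names)."""
    last_good_backup: datetime   # newest backup that could actually be restored
    outage_start: datetime       # when the simulated disruption began
    service_restored: datetime   # when the service was usable again


def evaluate(result: RecoveryTestResult, rpo: timedelta, rto: timedelta) -> dict:
    """Compare the achieved recovery point and recovery time with the objectives."""
    achieved_rpo = result.outage_start - result.last_good_backup
    achieved_rto = result.service_restored - result.outage_start
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo,
        "rto_met": achieved_rto <= rto,
    }


if __name__ == "__main__":
    # Made-up example values: a 4-hour RPO and a 3-hour RTO.
    result = RecoveryTestResult(
        last_good_backup=datetime(2024, 1, 10, 1, 0),
        outage_start=datetime(2024, 1, 10, 9, 30),
        service_restored=datetime(2024, 1, 10, 12, 15),
    )
    print(evaluate(result, rpo=timedelta(hours=4), rto=timedelta(hours=3)))
```

In this made-up run, the service comes back within the RTO, but the newest restorable backup is older than the RPO allows, which would point to backup frequency rather than the restore procedure as the area to fix.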

Additional goals for disaster recovery and business continuity plan tests can relate to:

  • The integrity and availability of the recovered data
  • The functionality and performance of recovered systems and applications
  • Feedback from personnel and other users of the recovered systems
  • Comparison with results and outcomes from previous BCDR tests

These objectives will vary for each customer. You should help define goals and other desired results by working with key stakeholders from executive leadership, IT teams, and departmental managers; considering budget and resources; and emphasizing continuous improvement.

Types of testing

There are several different types of BCDR testing, each of which offers pros and cons. The business continuity and disaster recovery test types that are appropriate for an organization will depend on a variety of factors, including its size and nature, available resources, and the stage of BCDR testing taking place.

Tabletop exercises

These involve real-time discussions with organizational leaders and anyone else with a critical role in the BCDR plan. The group examines the plan, explores different scenarios, and ensures that all business units are accounted for.

  • Pros: This method requires limited resources, offers an opportunity to ask questions and add to knowledge, and supports cross-departmental communication and coordination.
  • Cons: Since the test is “on paper,” there is no chance to validate technical aspects or see how it plays out in practice.

This type of testing is best suited for the beginning stages of the process. Tabletop exercises can also be an effective training tool.

Walk-throughs

In walk-through BCDR testing, the team is faced with a specific type of disruptive event, and each member goes through their individual roles and responsibilities to identify any gaps or inefficiencies.

  • Pros: Walk-throughs provide the opportunity to do a comprehensive evaluation of an entire plan to find bottlenecks or other inefficiencies. Team members can also share expertise and gain an overview of the entire BCDR process.
  • Cons: Like tabletop exercises, walk-throughs do not provide technical or practical validation.

This is another type of test that is most appropriate for the preliminary stages of the testing process.

Parallel tests

This test checks if failover systems — backup modes that go into action when a primary system fails — can handle required business operations, processes, and applications after a disrupting event.

  • Pros: By providing validation of data integrity and security, this test is more realistic than purely theoretical options.
  • Cons: Parallel tests can be complex, time-consuming, and resource intensive. There also may be a risk to the production environment.

To reduce the risk of wasted time and resources, parallel tests should be undertaken only when teams have successfully addressed all gaps and issues with tabletop exercises and walk-throughs.

Cutover tests

Unlike the parallel test, in a cutover test, the failover systems are completely disconnected from the primary systems to take on the full load of business operations. It is the closest possible simulation of an actual disaster event. 

  • Pros: This test is highly realistic and provides comprehensive insights into the readiness and sufficiency of the BCDR plan as well as potential gaps.
  • Cons: Because primary systems must be offline, this test can be highly complex and can be difficult to schedule.

Because cutover tests require critical systems to be disconnected, these tests should be conducted in the final phase of the BCDR testing process.

Levels of testing for MSPs

In addition to various types of tests, a comprehensive BCDR testing strategy checks systems at different levels of depth to ensure all aspects function as expected.

  • Data verification shows whether the backups made under the BCDR plan are consistent, accurate copies of the original data files and whether the data is recoverable. Validating the data integrity provides confidence that it can be restored in the event of an incident. However, it often requires validating data across different systems, databases, or physical locations, which can be complicated. In addition, certain validation techniques may miss subtle variations, especially in complex files.
  • Database mounting ensures that a database backup can be mounted, read, and used for other basic functions. It offers realistic testing in an environment that resembles an actual recovery scenario and also enables testing of applications that rely on the database. However, mounting the backup may affect primary systems, and ensuring that data is consistent between the backup and original database can be difficult. Database mounting often goes hand-in-hand with data verification.
  • Single machine boot verification tests whether a server can be rebooted after going down. It allows MSPs to test individual systems or machines, enabling them to isolate issues in the recovery process. It can also be performed relatively quickly, making it easy to incorporate into testing. However, this level only tests the server, not the applications or data on it, and focuses only on the booting process.
  • Runbook testing checks the functionality and efficiency of step-by-step procedures for different recovery processes. It exposes any weaknesses or vulnerabilities and offers an opportunity for team members with BCDR responsibilities to get familiar with the process. It also serves as evidence that the organization is complying with audit rules, industry regulations, and other requirements. However, because runbook testing occurs in a controlled environment and time period, it may not incorporate the complexities and pressures of an actual event. Runbook testing is best suited for highly detailed and comprehensive BCDR plans.
  • Recovery assurance is the most advanced level of BCDR testing, involving many components of hardware and applications, the assessment of service level agreements, and diagnostics to evaluate the successful recovery of critical systems, applications, and data. It provides the highest level of confidence that BCDR plans will be successful, but its complexity also requires significant time, personnel, and infrastructure to complete. However, for businesses that provide critical services, it is essential.
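For the data verification level described above, one common approach is to compare checksums of the original files with those of a restored copy. The sketch below is only an illustration using Python's standard library. The function names `sha256_of` and `compare_trees` are hypothetical, and it assumes the restored files sit in a directory that mirrors the layout of the source.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source: Path, restored: Path) -> dict:
    """Return files that are missing from, or differ in, the restored copy."""
    missing, mismatched = [], []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src_file.relative_to(source)
        restored_file = restored / relative
        if not restored_file.is_file():
            missing.append(str(relative))
        elif sha256_of(src_file) != sha256_of(restored_file):
            mismatched.append(str(relative))
    return {"missing": missing, "mismatched": mismatched}
```

A check like this only detects byte-level differences. Files that changed on the source after the backup ran will also show up as mismatches, so in practice the comparison is usually made against checksums recorded at backup time rather than against the live source.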

When advising customers on the levels of disaster recovery and business continuity plan testing they need, keep these factors in mind:

  • The business requirements of the organization (such as the priority of critical systems and tolerance for downtime)
  • The risks and potential impact of disruptive events on the business
  • Budget and resource constraints
  • Compliance and regulatory requirements
  • The customer’s plans for growth and expansion

You should also take time to educate customers on the different levels of BCDR plan testing to help them understand which ones are most appropriate for their needs and capabilities.

How often should BCDR testing take place?

Because they require less infrastructure and fewer employees, theoretical tests like tabletop exercises and walk-throughs should be undertaken several times a year. More comprehensive and advanced tests that require significant resources and time, such as parallel and cutover testing, should be done at least annually.

However, the schedule will also depend on several factors, such as:

  • The size, nature, and industry of the business
  • The complexity of its data and networks
  • Its security and vulnerability profile
  • Whether it must meet specific compliance and auditing regulations

Failing to test often enough can have consequences, ranging from annoying and expensive to disastrous. These include a lack of preparedness, compliance risks, fines, and permanent data loss.

When working with customers to design a BCDR testing strategy and schedule, you should aim to align the timing with business cycles, any business updates, and regular maintenance periods to reduce the disruption to normal operations. The agreed-upon schedule should be communicated to all employees who will be affected, particularly those who will be needed for the testing process.

Protect your business from unexpected disasters

BCDR testing is a critical but often-overlooked aspect of planning for business continuity and disaster recovery. Creating and executing a testing plan can be a time-consuming and complicated process, which is why many businesses fail to do it. MSPs can help by:

  • Working with their customers to identify their testing needs, including types and levels of tests
  • Defining goals for testing
  • Scheduling testing on a consistent basis
  • Documenting and communicating outcomes
  • Incorporating findings into the BCDR plan for continuous improvement

In addition to discovering that processes do not include essential steps or employees do not know what their responsibilities are, businesses that don’t test their plans regularly may find backups have been corrupted or are otherwise unusable.

BCDR solutions from ConnectWise help MSPs provide clients with secure, automated, and reliable data recovery—a key element of BCDR planning and testing. Start your free BCDR demo today to take the next step toward improving your disaster recovery service offering. Also, ConnectWise Co-Managed Backup includes regular disaster recovery testing on behalf of ConnectWise MSP partners and their clients.

FAQs

MSPs should take care to document, review, and analyze the outcomes of every test, then prioritize the lessons and revise the affected elements of the BCDR plan accordingly.

Technology is absolutely essential for data backup and recovery during testing, as well as to protect primary systems during the testing process. Many tools also support automation and simulation, which can streamline testing and make it more realistic.

Knowing how disaster recovery plans can be tested safely is critical to a BCDR testing strategy. All testing should take place in an environment that has been securely isolated from primary systems. To reduce disruptions to normal operations, testing should be scheduled during slower periods of business activity, such as in the evenings, early mornings, or on weekends.

Use data validation scripts to verify data consistency, relationships between data, and any errors. Data should also be encrypted, anonymized, and tracked during testing to provide further assurance that it retains its integrity.
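A data validation script of this kind might look like the following Python sketch. It assumes, purely for illustration, that the restored database is a SQLite file and that row counts were recorded when the backup was taken. The `validate_restored_db` function and the `expected_counts` argument are hypothetical names, and other database engines provide their own equivalent checks.

```python
import sqlite3


def validate_restored_db(conn: sqlite3.Connection, expected_counts: dict) -> list:
    """Return a list of human-readable problems found in a restored database."""
    problems = []

    # Consistency: ask SQLite to check the database's internal structure.
    status = conn.execute("PRAGMA integrity_check").fetchone()[0]
    if status != "ok":
        problems.append(f"integrity_check reported: {status}")

    # Relationships: rows whose foreign key points at a missing parent row.
    for table, rowid, parent, _ in conn.execute("PRAGMA foreign_key_check"):
        problems.append(f"{table} row {rowid} references missing row in {parent}")

    # Completeness: compare row counts with those recorded at backup time.
    for table, expected in expected_counts.items():
        actual = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if actual != expected:
            problems.append(f"{table}: expected {expected} rows, found {actual}")

    return problems
```

An empty list means none of the three checks found a problem. It does not prove the data is correct, only that the structure, declared relationships, and row counts match what was expected.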

Phases of the testing lifecycle include:

  • Planning, including defining goals and scope, and selecting the types of tests to conduct
  • Carrying out the tests
  • Analyzing performance and outcomes
  • Addressing significant issues
  • Implementing changes and updates into the BCDR plan
  • Communicating changes to the plan and retraining teams as needed

Upon completion of BCDR plan testing and analysis of the outcomes, MSPs should write a comprehensive report that covers details on testing goals, how testing was carried out, and the results. It should include:

  • Findings from the tests, such as gaps or weaknesses
  • How they were remedied
  • How lessons learned will be incorporated into BCDR planning, including training needs
  • Visual representations and concise summaries to help communicate complex information for non-technical personnel