
Article · Apr 6, 2021

Backup and Disaster Recovery Testing for MSPs and Enterprise Organizations

Backup is an essential element in an overall cyber resilience strategy. But backup has one purpose, which is recovery. No one wants to find out, in the middle of a disaster, that their backup plan will not let them recover in the time they need, or recover all the data they need. That’s why disaster recovery testing should be a regular part of the cyber resilience strategy.


Our last post on backup and recovery testing was intended for small to midsize businesses (SMBs). In this post, we’ll look at the same topic but with an eye toward managed service providers (MSPs) and enterprise organizations.

The ability to recover from data loss is a measure of cyber fitness. As with any good fitness plan, it requires regular workouts as part of an overall fitness regimen. The difference is, instead of lifting heavier weights for more reps, the goal you’re working toward is assurance that service level agreements (SLAs) can be met. To get started, take an inventory of all the systems, software and platforms in the organization and create categories (a sketch of how this inventory might be recorded follows the list):

  • Mission-critical to the business
    • Recovery Time Objective (RTO) or acceptable downtime
    • Recovery Point Objective (RPO) or acceptable loss of data
    • The people who access this information
    • The department responsible for the application or platform
    • Total cost of downtime
  • Important to the business but can do without for a period of time
    • Recovery Time Objective (RTO) or acceptable downtime
    • Recovery Point Objective (RPO) or acceptable loss of data
    • The people who access this information
    • The department responsible for the application or platform
    • Total cost of downtime
  • Not critical to the business and can be rebuilt from new without major impact
    • Recovery Time Objective (RTO) or acceptable downtime
    • Recovery Point Objective (RPO) or acceptable loss of data
    • The people who access this information
    • The department responsible for the application or platform
    • Total cost of downtime
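
As a concrete starting point, the inventory could be captured in a structure like the one below. This is a minimal sketch; the record name, fields and example values are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class SystemRecord:
    """One entry in the recovery-planning inventory (illustrative fields)."""
    name: str
    category: str               # "mission-critical", "important", or "non-critical"
    rto_minutes: int            # Recovery Time Objective: acceptable downtime
    rpo_minutes: int            # Recovery Point Objective: acceptable data loss
    users: list                 # who accesses this information
    owner: str                  # department responsible for the app or platform
    downtime_cost_per_hour: float

inventory = [
    SystemRecord("erp-db", "mission-critical", rto_minutes=60, rpo_minutes=60,
                 users=["finance", "operations"], owner="IT",
                 downtime_cost_per_hour=25_000.0),
    SystemRecord("intranet-wiki", "non-critical", rto_minutes=2880, rpo_minutes=1440,
                 users=["all staff"], owner="IT",
                 downtime_cost_per_hour=100.0),
]
```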

Now that we have buckets for different systems, we can look at where they fit in our recovery plan. This will help us define a framework for mapping our internal SLAs back to the business need and category. It also helps IT determine the best strategy for recovering and, in turn, the best method of protecting systems in each category.

At this point, it’s important to have the business application stakeholders along with IT build the plan for stress-testing the different workflows. Together, they can establish the different scenarios to be tested based on business needs and risk tolerances. This means figuring out where the risks are and properly planning and testing for them. It’s not everything under the sun – just the worst that could happen, with some other basic testing scenarios, like recovering a file or application.

For example, if backup is the only thing protecting the system and there are three mission-critical servers, then one test should be recovering all three servers from backup in addition to an application recovery test. Say you have three servers running inside VMware’s ESXi hypervisor, protected with a solution that can quickly (in minutes) recover them back into the existing production environment. Under this scenario, you should be able to meet an RTO of no longer than an hour for these three servers. Now the question is, can you meet your RPO? This depends on how many backups you’re taking daily. If you’re backing up every hour, then you should always be able to meet an RPO of an hour. But the only way to know for sure is through regularly scheduled tests. A backup that completed in minutes today may take longer tomorrow.
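
To make the RPO arithmetic concrete, here is a minimal sketch of the check a scheduled test could automate. The timestamps are illustrative; the point is that the worst-case data loss is the time elapsed since the last backup that actually completed.

```python
from datetime import datetime, timedelta
from typing import Optional

def rpo_met(last_successful_backup: datetime, rpo: timedelta,
            now: Optional[datetime] = None) -> bool:
    """Worst-case data loss is the time elapsed since the last good backup."""
    now = now or datetime.utcnow()
    return (now - last_successful_backup) <= rpo

# Hourly backups should keep you inside a one-hour RPO -- but only if the
# latest backup actually completed on schedule.
last_ok = datetime(2021, 4, 6, 9, 58)
print(rpo_met(last_ok, timedelta(hours=1), now=datetime(2021, 4, 6, 10, 30)))  # True
print(rpo_met(last_ok, timedelta(hours=1), now=datetime(2021, 4, 6, 11, 30)))  # False
```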

One method that’s very common is screenshot validation: after backups run, a backend process boots the VM outside of production and captures a screenshot of the login screen to show the VM booted. While it’s useful to test your VM environment and verify that it will boot, this doesn’t account for other complexities, such as whether systems need to come up in a set order, or whether the data and applications on them are actually consistent.
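
Going beyond a login screenshot means scripting checks against the recovered VMs themselves. The sketch below is hypothetical: boot_vm and service_responds stand in for whatever hypervisor API and application-level health checks your environment provides.

```python
# Hypothetical post-recovery validation sketch: boot recovered VMs in
# dependency order, then confirm each one's application actually answers.
BOOT_ORDER = ["db-server", "app-server", "web-server"]  # order matters

def validate_recovery(boot_vm, service_responds) -> bool:
    for vm in BOOT_ORDER:
        boot_vm(vm)                    # assumed hypervisor call
        if not service_responds(vm):   # assumed application-level check
            print(f"FAIL: {vm} booted but its application is not answering")
            return False
    print("PASS: all systems up, in order, and consistent")
    return True
```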

A couple of other essential elements are regular reporting and the ability to automate alerts when there’s an issue, such as a CRC (cyclic redundancy check) error or data corruption. Having solid monitoring APIs and robust reporting helps. With good reporting, you can plan better and see how changes affect your recovery plan.
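
With a reporting API in place, alerting can be as simple as scanning each job report for error conditions. The report schema here is an assumption for illustration; substitute whatever your backup platform’s API returns.

```python
def check_job_report(report: dict) -> list:
    """Scan a single backup job report for alert conditions (assumed schema)."""
    alerts = []
    if report.get("crc_errors", 0) > 0:
        alerts.append(f"{report['job']}: {report['crc_errors']} CRC error(s)")
    if report.get("corrupt_blocks", 0) > 0:
        alerts.append(f"{report['job']}: data corruption detected")
    return alerts

# In practice this would feed an email, pager or ticketing hook.
for msg in check_job_report({"job": "nightly-erp", "crc_errors": 2, "corrupt_blocks": 0}):
    print("ALERT:", msg)
```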

Routine testing should occur every month and, in some cases, more often. It’s important to cover all the bases and ensure that you’re following the 3-2-1 backup rule: keep three copies or versions of your data, in two different storage locations, with one being offsite or on a separate network. For systems that are not protected with backup, you can use high availability protection to replicate your data offsite. However, these systems still need backup, because if corruption is replicated, your backup is your fallback.
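
The 3-2-1 rule lends itself to a simple automated audit. A minimal sketch, assuming each copy is described by its storage location and whether that location is offsite:

```python
def satisfies_3_2_1(copies: list) -> bool:
    """copies: one dict per copy, e.g. {"location": "nas-hq", "offsite": False}."""
    locations = {c["location"] for c in copies}
    return (len(copies) >= 3                        # three copies or versions
            and len(locations) >= 2                 # two different storage locations
            and any(c["offsite"] for c in copies))  # one offsite / separate network

print(satisfies_3_2_1([
    {"location": "prod-san", "offsite": False},
    {"location": "nas-hq",   "offsite": False},
    {"location": "cloud",    "offsite": True},
]))  # True
```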

The final question is, what do you need to test? This is where you should go back to the beginning and the risks you identified. If you’re running Microsoft SQL Server as the database and two Apache servers as a web front end, then at a minimum you need to test recovery of these systems as well as access to them on the front end and the back end. Since you can’t take down production, you need a test environment with an isolated test network that can be used to verify connectivity to the systems, the data and the web interface.
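
A recovery test in an isolated network can be verified with basic reachability and application checks. The test-network hostnames below are assumptions for illustration; 1433 is SQL Server’s default port, and the Apache front ends are assumed to serve HTTP on port 80.

```python
import socket
import urllib.request

def tcp_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Basic reachability check against a recovered system."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hostnames are illustrative placeholders in the isolated test network.
for host, port in [("sql-test", 1433), ("web1-test", 80), ("web2-test", 80)]:
    print(host, "OK" if tcp_open(host, port) else "UNREACHABLE")

# Front-end check: get an actual page back, not just a TCP handshake.
try:
    with urllib.request.urlopen("http://web1-test/", timeout=5) as resp:
        print("web front end returned HTTP", resp.status)
except OSError as e:
    print("web front end check failed:", e)
```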

For more information on protecting servers and performing disaster recovery, visit our Carbonite™ Server product page.

Author


Matt Seeley

Sr. Principal Solutions Consultant
