Journal

Aikido, Creative Work, Photography & Documentary.

Backup vs Fault Tolerance

A sexy title, eh!? Well, it's the elephant in the room. Remember that sick-to-your-stomach feeling? Your work, your photos, your digital life, gone in a puff of binary smoke.

Whether at home or at work, the end result is the same: lost time, lost energy and just plain misery.

I have been giving the concept of backup some consideration whilst helping a colleague develop a backup solution for their business. The process led me to re-evaluate my own workflows, and it will inform future choices for my own business.

I wanted to discuss some distinctions, conclusions and ideas with the aim of encouraging others to think deeper about protecting their most valuable digital assets.

Fault Tolerance

One key distinction is between fault tolerance and backup; the two are often confused as one and the same thing. Fault tolerance provides quick access to data in the event of a problem, allowing the user to be up and running again without loss of data and without too much downtime. Think of it as a quick fix: generally, a fault tolerance procedure may only account for a single problem, such as a disk failure. Afterwards, follow-up action needs to be taken to correct the problem and re-establish the fault tolerance. One technology that provides different levels of fault tolerance for hard drives is RAID.

RAID, or Redundant Array of Independent Disks, allows multiple drives to work together in lots of possible configurations, each usually given a number designation (RAID-0, RAID-1, RAID-5 and so on). For more information, check out this awesome article on RAID.

A RAID array is simply a group of drives, starting from two, joined together with software or dedicated hardware. Some configurations (such as RAID-1) mirror content from one disk to another, giving the user a chance to recover the data if a single drive fails.

RAID-1 Mirrored
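To make the mirroring idea concrete, here is a toy model in Python. It is a sketch under simplifying assumptions, not how any real RAID controller works: each "disk" is just a dictionary of numbered blocks, and the class and method names (`Raid1Mirror`, `write`, `delete`, `fail_disk`, `read`) are hypothetical.

```python
class Raid1Mirror:
    """Toy RAID-1: every write and delete is applied to all healthy disks."""

    def __init__(self, disk_count=2):
        self.disks = [{} for _ in range(disk_count)]  # block number -> bytes
        self.failed = set()

    def write(self, block, data):
        # Mirror the same data onto every healthy disk.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk[block] = data

    def delete(self, block):
        # Deletions are mirrored too, just like writes.
        for i, disk in enumerate(self.disks):
            if i not in self.failed:
                disk.pop(block, None)

    def fail_disk(self, index):
        # Simulate a dead drive: its contents become unreadable.
        self.failed.add(index)
        self.disks[index] = {}

    def read(self, block):
        # Serve the block from the first healthy disk that holds it.
        for i, disk in enumerate(self.disks):
            if i not in self.failed and block in disk:
                return disk[block]
        raise OSError(f"block {block} is not on any healthy disk")


mirror = Raid1Mirror()
mirror.write(0, b"project_0001.raw")
mirror.fail_disk(0)
print(mirror.read(0))  # b'project_0001.raw', served by the surviving disk
```

The `delete` method is the catch: a mistaken deletion is mirrored to every disk just as faithfully as good data, which is one reason a mirror protects against drive failure but is not a backup.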

RAID configurations can bring other benefits too, such as speed: many drives working together can deliver faster read and write times than a single lone drive. They can also provide much greater storage, presented as one single volume.

With this multiple-drive system in mind, fault tolerance can be configured by striping data across all the drives and reserving a portion of the space for redundancy (parity information): this is a RAID-5 configuration. If a single drive fails, it can be replaced and the RAID rebuilds itself without loss of data. Surviving two simultaneous drive failures requires a double-parity level such as RAID-6.
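The redundancy in RAID-5 is usually parity computed with XOR. The Python sketch below shows the principle on a single toy stripe of three data blocks plus one parity block; real arrays use far larger blocks and rotate the parity across all the drives, and the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One toy stripe spread across three data disks.
data = [b"RAW1", b"RAW2", b"RAW3"]
parity = xor_blocks(data)  # stored on a fourth disk

# The second data disk dies: rebuild its block from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'RAW2'
```

The same arithmetic shows why RAID-5 survives only one failure: with two blocks missing, XOR-ing what remains yields a mixture of the lost data rather than either block on its own.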

These examples are brilliant for many applications, and can save our bacon in a crisis, but they still aren't backup.

Just as a side note, RAID systems can also be configured for the highest level of performance (RAID-0), but this offers no built-in fault tolerance.
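RAID-0's speed comes from striping: data is split into chunks that are dealt round-robin across the drives so they can be read and written in parallel. This minimal sketch (hypothetical `stripe` and `unstripe` helpers, with a tiny chunk size for readability) also shows the downside: a file spread across the array cannot be reassembled once one drive is gone.

```python
def stripe(data, disk_count, chunk_size):
    """Split data into chunks and deal them round-robin across disks."""
    disks = [[] for _ in range(disk_count)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for n, chunk in enumerate(chunks):
        disks[n % disk_count].append(chunk)
    return disks


def unstripe(disks):
    """Reassemble data by reading one chunk from each disk in turn."""
    out = []
    for row in range(max(len(d) for d in disks)):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


disks = stripe(b"ABCDEFGHIJ", disk_count=2, chunk_size=2)
print(unstripe(disks))           # b'ABCDEFGHIJ'
print(unstripe([disks[0], []]))  # b'ABEFIJ': half the file is gone
```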

How we integrate fault tolerance with large data sets is pretty simple. We have a master RAID server that provides fast access to our media assets. It combines multiple drives into a single large volume using RAID-5.

Example of RAID-5

This is our working platform; we don't use it for anything other than current, live projects. RAID-5 with 6 disks provides single-drive fault tolerance, so if a drive goes down we can replace it and be online again within a couple of hours. The delay is caused by the array working out what was on the dead drive, recalculating it from the parity data on the remaining drives, and rebuilding itself onto the replacement.
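The trade-off between usable capacity and fault tolerance can be worked out with simple arithmetic. This sketch assumes equal-sized drives and ignores filesystem overhead; the function name `raid_summary` and the 4 TB drive size in the example are hypothetical, not the spec of our server.

```python
def raid_summary(level, disks, drive_tb):
    """Return (usable capacity in TB, drive failures tolerated) for equal-sized drives."""
    if level == 0:
        return disks * drive_tb, 0            # all space used, no redundancy
    if level == 1:
        return drive_tb, disks - 1            # every drive holds a full copy
    if level == 5:
        if disks < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (disks - 1) * drive_tb, 1      # one drive's worth of parity
    if level == 6:
        if disks < 4:
            raise ValueError("RAID-6 needs at least 4 drives")
        return (disks - 2) * drive_tb, 2      # two drives' worth of parity
    raise ValueError(f"RAID-{level} is not modelled in this sketch")


print(raid_summary(5, 6, 4.0))  # (20.0, 1)
```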

This gives us a good balance of speed and security. But what if a few hours is too long to wait for the array to rebuild? Or what if two or more drives go down, or the entire server?

We have a second point of tolerance: an independent G-Tech G-Drive, set to automatically back up our master RAID volume using Carbon Copy Cloner. If we had to, we could use the G-Drive as a working drive. Although it is slower than the RAID, it connects over Thunderbolt, which is fast enough to work from, unlike a USB 2 drive.
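Carbon Copy Cloner is a GUI application, and what follows is not its implementation or settings. As a generic illustration of what a scheduled, incremental clone does, here is a minimal Python sketch using only the standard library: it copies new or changed files to the destination, then prunes anything that no longer exists in the source. The function name `mirror_tree` and the size-plus-modification-time change check are assumptions for illustration.

```python
import os
import shutil


def mirror_tree(source, destination):
    """One-way incremental clone: copy new/changed files, prune files gone from source."""
    for root, _dirs, files in os.walk(source):
        target_dir = os.path.join(destination, os.path.relpath(root, source))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            st = os.stat(src)
            # Treat a file as changed if its size or whole-second mtime differs.
            if (not os.path.exists(dst)
                    or os.path.getsize(dst) != st.st_size
                    or int(os.path.getmtime(dst)) != int(st.st_mtime)):
                shutil.copy2(src, dst)  # copy2 also preserves the modification time

    # Second pass, bottom-up: delete anything the source no longer has.
    for root, _dirs, files in os.walk(destination, topdown=False):
        src_dir = os.path.join(source, os.path.relpath(root, destination))
        for name in files:
            if not os.path.exists(os.path.join(src_dir, name)):
                os.remove(os.path.join(root, name))
        if not os.path.isdir(src_dir):
            os.rmdir(root)  # its files and subfolders were already removed above
```

Note the pruning pass: in this sketch, a file deleted from the RAID by mistake disappears from the clone on the next run as well, which is exactly why a single clone on its own still sits in the realm of fault tolerance.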

Alas, despite what I used to believe, even this is not considered backup; we are still in the realm of fault tolerance.

In summary, relying on RAID for true backups is not the smart move. The purpose of a backup is to forestall data loss, but a RAID uses more drives and adds hardware and software complexity, and each of these is an additional risk to the system.

And if a RAID has no fault tolerance, it is at higher risk than a single drive, because the failure of any one of its drives takes down the whole array (for example, a RAID-0 stripe used as a backup).

RAID-0 Configuration
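The extra risk of a stripe can be put into rough numbers. Assuming each drive fails independently with the same probability p over some period (a simplification: drives bought together, or sharing a power supply or a room, tend to fail together), a RAID-0 stripe of n drives is lost if any one drive fails, with probability 1 - (1 - p)^n, while an n-way mirror loses data only if every drive fails, with probability p^n. The 2% figure below is a hypothetical value, not measured data.

```python
def stripe_loss_probability(p, drives):
    """RAID-0: the array is lost if ANY drive fails, i.e. 1 - (1 - p)**n."""
    return 1 - (1 - p) ** drives


def mirror_loss_probability(p, drives):
    """RAID-1: data is lost only if EVERY drive fails, i.e. p**n (ignores rebuild windows)."""
    return p ** drives


p = 0.02  # hypothetical chance that any one drive fails in a year
for n in (1, 2, 4):
    print(f"{n} drive(s): stripe {stripe_loss_probability(p, n):.2%}, "
          f"mirror {mirror_loss_probability(p, n):.6%}")
# With 2 drives: roughly 4% for the stripe versus 0.04% for the mirror.
```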

The appeal of a RAID is threefold:

  • Fault tolerance (RAID-1 mirror)
  • Higher performance (RAID-4/5 for example)
  • Large single volume storage (if required).

A single-drive backup has less chance of going wrong, but on its own, a single drive backing up your data still only provides fault tolerance. Having more than one backup of your data, in more than one place, moves us away from fault tolerance and into the world of backup.
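The rule of thumb above (more than one backup, in more than one place) is simple enough to express as a check. This is a hypothetical sketch: the `Copy` record, the `classify` function and the device and location names are made up for illustration, and the working copy on the RAID is deliberately not counted.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    device: str    # what holds the copy, e.g. "G-Drive"
    location: str  # where it physically lives, e.g. "studio"


def classify(copies):
    """Classify the copies kept IN ADDITION to the working data."""
    locations = {c.location for c in copies}
    if len(copies) > 1 and len(locations) > 1:
        return "backup"
    if copies:
        return "fault tolerance"
    return "unprotected"


print(classify([Copy("G-Drive", "studio")]))       # fault tolerance
print(classify([Copy("G-Drive", "studio"),
                Copy("portable drive", "home")]))  # backup
```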

A RAID in particular cannot substitute for multiple backups; if you need the benefits of a RAID system, you still need further backups on top of its built-in fault tolerance. More on this soon.

Backup · James Stier