This post is mostly for my own reference. I am posting it here to possibly aid someone else in the future.
Backing up computers is a complex topic; entire books have been written on the subject, and the Wikipedia article on backup offers a good introduction. One factor to weigh heavily when choosing a backup method is the storage medium. Hard drives are fairly fragile mechanical devices with a limited expected lifetime, and they sometimes fail prematurely. Magnetic tape is more durable, but doesn't allow quick random seeks into the data set. Optical discs are the most durable, but their capacity hasn't kept pace with that of magnetic disks. All things considered, the magnetic disk is a relatively attractive choice. Another benefit is that magnetic disks are available in external form factors (FireWire, USB, eSATA) that allow quick, ad-hoc connections to a computer. If magnetic disks are selected as the storage medium, then addressing their fragile nature is a top concern. One way to do that is redundancy: keep multiple copies of the data on multiple physical devices. To that end, this discussion addresses methods for duplicating a data set to multiple physical devices.
Let's jump in with a concrete example. Assume I need the contents of disk partition sdx1 duplicated to one or more other disk partitions, and that I have a suitable target device, sdy1. To simplify the problem further, I will assume there are no complications from open files: the data is completely static. To meet that assumption, I will specify that both sdx1 and sdy1 are unmounted. The obvious tool for the copy is the dd command:
# dd if=/dev/sdx1 of=/dev/sdy1 bs=4M
Upon successful completion of that operation, I will have two copies of the data. I have the original copy on sdx1 and a new copy on sdy1. For data safety, I can remove either device and move it to a safe location. I can place it in a drawer; I can move it physically offsite*; I can contract with a data storage firm to store the device for me; etc.
The above command is certainly simple, but simplicity has a downside, too. In the case of the dd command, the downside is that it simply rewrites the entire contents of sdx1 to sdy1 in bulk. The time required to execute the command is proportional to the device's capacity and inversely proportional to its sustained write speed; the larger the capacity, the longer the write. Today's terabyte-sized disks take several hours to copy completely. The obvious question to ask is whether it makes sense to rewrite the entire copy if only a few bytes have changed. Is another method available which can optimize the copy operation, possibly reducing the time required from hours to minutes? The following discusses one possibility using a RAID 1 array.
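To put rough numbers on "several hours", here is a quick back-of-envelope calculation. The 1 TB capacity and ~100 MB/s sustained write speed are assumed figures; adjust for your own hardware.

```shell
# Whole hours needed to rewrite 1 TB at ~100 MB/s (assumed figures).
bytes=1000000000000       # 1 TB partition
speed=100000000           # ~100 MB/s sustained write speed
echo "$(( bytes / speed / 3600 )) hours (plus change)"
```

The integer division rounds down; the real figure is closer to three hours, and slower drives or buses stretch it further.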
The key piece to solving this problem is the fact that RAID 1 can be set up with a write-intent bitmap. Let's proceed with a concrete example. I will work with three identical devices: sdx1, sdy1, sdz1. We'll proceed assuming they are blank, fresh from the store.
Because the devices are assumed to be identical, it is a good idea to label them in some way. Linux includes tools to read the device serial number and other identifying details. A simple approach is to connect each new device one at a time: identify the disk, create a partition table, and record the unique identifiers. Exactly how you do this is up to you; possible tools include smartctl and hdparm. Here is some sample output.
# hdparm -i /dev/sdw
/dev/sdw:
 Model=Hitachi HTS541060G9SA00, FwRev=MB3OC60R, SerialNo=MPBCP0XGJW6PJM
 :
 :
Partition the new drives (not shown). Use partition type 0xFD (Linux RAID autodetect).
Because we will be using encryption, you may want to prefill the disk with random data. The general idea is not to leak information about used/unused portions of the disk. This step is not shown.
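A sketch of that prefill step follows. To keep it safe to run as-is, a small scratch file stands in for the real partition; substitute the actual target device (sdy1 in this example) and drop the count limit when doing this for real.

```shell
# Fill the target with data from /dev/urandom so that used and unused
# regions of the encrypted disk are indistinguishable.  A 4 MiB scratch
# file stands in for the real partition here; for a real device you
# would write until dd hits the end of the device.
dd if=/dev/urandom of=/tmp/scratch.img bs=1M count=4 2>/dev/null
```

Expect this to take roughly as long as a full dd copy; /dev/urandom throughput, not the disk, is often the bottleneck on older kernels.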
With all drives connected, previously partitioned (not shown), proceed to create a RAID 1 array with an internal write-intent bitmap.
# mdadm --create /dev/mdx --level=1 --raid-devices=3 /dev/sdx1 missing missing
# mdadm --grow /dev/mdx --bitmap=internal
The other mirror devices are not joined to the RAID array initially, but can be joined at your convenience using these commands:
# mdadm --manage /dev/mdx --add /dev/sdy1
# mdadm --manage /dev/mdx --add /dev/sdz1
Check the progress of the sync using the proc interface:
# cat /proc/mdstat
If your disk controller is not state of the art, it may be better to delay adding the extra mirror devices until after the initial data copy; the sync competes with other I/O for bus bandwidth.
Linux includes block level encryption via its device mapper interface. A complete discussion of encryption is beyond the scope of this discussion, but it is shown here to note where the layer exists in the device stack. The RAID layer is created from raw devices. The encrypted layer is created on top of the RAID layer. This command creates an encrypted container with a specified key and other default encryption parameters.
# cat key | cryptsetup create emdx /dev/mdx
I have been using the XFS filesystem for quite a while. Use whatever filesystem that you are comfortable with. Here is a typical format command:
# mkfs.xfs /dev/mapper/emdx
Here is a typical mount command:
# mount /dev/mapper/emdx /mnt/emdx
The device "stack" at this point, from bottom to top:

  /dev/sdx1 (+ mirrors)    raw disk partitions
  /dev/mdx                 RAID 1 array with internal write-intent bitmap
  /dev/mapper/emdx         dm-crypt encrypted layer
  /mnt/emdx                mounted XFS filesystem
The method used to actually back up the target data varies widely depending on many factors. For this example, let's assume the initial point-in-time backup can be obtained with a simple solution. Here is a sequence using two locally mounted filesystems: one holding the source data and one on the backup device.
# SRC=/mnt/source_data
# DEST=/mnt/emdx/snapshot.2011-01-21
# mkdir $DEST
# (cd $SRC && tar -cpf - .) | (cd $DEST && tar -xvf -)
At this point, add the "missing" elements of the RAID array (that is, if there are any "missing" elements.) Wait for the operation to complete. Check with the proc interface as before.
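Waiting can be automated. `mdadm --wait /dev/mdx` blocks until the sync finishes; the sketch below does the same thing by polling /proc/mdstat. The status file is a parameter so the loop logic can be exercised without a real array.

```shell
# Block until no resync/recovery is in progress on the array.
# $1: mdstat file (defaults to /proc/mdstat).
wait_for_sync() {
    while grep -Eq 'resync|recovery' "${1:-/proc/mdstat}"; do
        sleep 60
    done
}
```

Either form is handy in an unattended backup script, so the snapshot step only runs against a fully redundant array.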
Again, the actual method for making the incremental backups varies a lot. I show a simple method that uses hardlinks (cp -l) combined with the rsync command. Note the --dry-run flag: it previews the transfer without changing anything; drop it to perform the copy for real.
# SRC=/mnt/source_data
# PRV=/mnt/emdx/snapshot.2011-01-21
# DEST=/mnt/emdx/snapshot.2011-01-28
# mkdir $DEST
# cp -anl ${PRV}/. $DEST
# rsync -lptrv --delete --dry-run ${SRC}/ ${DEST}/
We need to move the data that has been backed up offsite. RAID 1 elements can be removed from active arrays.
# mdadm --manage /dev/mdx --fail /dev/sdz1
# mdadm --manage /dev/mdx --remove /dev/sdz1
Once removed, the device associated with sdz1 can be taken offline and offsite for extra data safety. The RAID array remains active with one missing device.
# SRC=/mnt/source_data
# PRV=/mnt/emdx/snapshot.2011-01-28
# DEST=/mnt/emdx/snapshot.2011-02-04
# mkdir $DEST
# cp -anl ${PRV}/. $DEST
# rsync -lptrv --delete --dry-run ${SRC}/ ${DEST}/
These jobs can be scheduled in advance using cron or at.
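For example, a crontab entry along these lines would run a weekly job early Saturday morning. The script path is hypothetical; it would wrap the mkdir/cp/rsync sequence shown above.

```
# m  h  dom mon dow  command        (edit with: crontab -e)
 30  2   *   *   6   /usr/local/bin/weekly-snapshot.sh
```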
Return the disk drive from offsite storage and reconnect it to the computer. Rejoin it to the array and wait for the automatic resync to complete.
# mdadm --manage /dev/mdx --re-add /dev/sdz1
# cat /proc/mdstat
The device sdz1 can be removed again after the RAID resync operation completes. Another approach would swap in a series of devices; RAID 1 adapts well to a simple alternating scheme. That is, the first week sdz1 is brought back from offsite, the next week sdy1, and so on.
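The alternation can be scripted. This sketch (bash; the device names follow the example above and are otherwise assumptions) picks the element to rotate offsite based on the parity of the ISO week number:

```shell
# Pick which mirror element to rotate offsite this week.
# $1: ISO week number (defaults to the current week, per date +%V).
# The 10# prefix forces base-10 so "08"/"09" aren't read as octal.
rotate_target() {
    week=${1:-$(date +%V)}
    if (( 10#$week % 2 == 0 )); then
        echo /dev/sdy1
    else
        echo /dev/sdz1
    fi
}
```

A weekly cron job could call this to decide which device to --fail/--remove before the courier arrives.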
to be determined
Page Last Modified: 2011-01-25