How to save a hard drive or corrupt filesystem under Linux

It doesn't happen often, but in case it does, we'll step you through Linux disaster recovery.

Disaster recovery is an interesting beast. Even though it’s such a critical task, it’s usually not something we think about until we find ourselves in the middle of a disaster.

With any luck, you’ll never have a disaster under Linux, but if it happens, that suite of tools you’ve gathered for Windows over the years may not do you much good. I helped a friend through some Linux data woes recently, and found some great tools that could help you survive the worst.

Types of disasters
With data, I group disasters into three categories:

Hardware failures, such as a hard drive with bad sectors;

‘Soft’ failures, where the physical hard drive is fine, but the filesystem on it has been corrupted somehow; and

Human error, where the hardware and filesystems are intact, but the data on them has been deleted.

Of course, the root causes often aren’t as simple as this – while a ‘soft’ failure may be due to a kernel bug, it could as easily be caused by faulty RAM or controller hardware that corrupted the data before it got to the hard drive. It’s still a useful grouping though, because of the different approaches used to correct them.

Soft and hard failures
Linux is generally resistant to filesystem corruption, because all of the major filesystems, including ext3, ReiserFS, and XFS, support journaling, a technique that records filesystem modifications in a log (the journal) before writing them to the filesystem itself. If the filesystem was not cleanly unmounted at shutdown (due to a forced reboot, for example), the system ‘plays’ the log at start-up and completes the changes to the filesystem. While data changes made in files at the time of the crash may be lost, the log playback ensures that the filesystem structure is consistent.

Filesystems can still become corrupted through software bugs or hardware failures. In some cases, where accessing particular files results in I/O errors, the cause is usually hardware. You can confirm this by checking the output of the ‘dmesg’ command for ‘I/O’ messages:

end_request: I/O error, dev 03:00 (hda), sector 1322520
hda: irq timeout: status=0xd0 { Busy }
hda: ATAPI reset complete

Other cases can be harder to diagnose. If you have problems with certain files, but no I/O errors in your ‘dmesg’ output, then the hard drive is probably fine, but the filesystem may be corrupted. If you can’t boot the system at all, then it’s hard to tell without further testing what the problem is.
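To make the ‘dmesg’ check repeatable, a small helper can count the I/O error lines in the kernel log. This is a minimal sketch: the function name ‘count_io_errors’ is made up for illustration, and it assumes the errors contain the text ‘I/O error’ as in the sample output above.

```shell
# Count kernel log lines that mention an I/O error (case-insensitive).
# Reads log text on standard input, so it works with dmesg output
# or with a saved log file.
count_io_errors() {
    grep -ci 'I/O error'
}

# Typical use (dmesg may need root on systems that restrict it):
#   dmesg | count_io_errors
```

A count of zero doesn’t prove the drive is healthy; it only means the kernel hasn’t logged a read or write failure since the log was last cleared.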

The best idea is to boot from a live CD. General live CDs like Knoppix tend to come with a lot of data recovery tools, but there are also smaller CDs designed specifically for the task, such as SystemRescueCD. Once booted, you’ll find a range of options. If you’re particularly paranoid about your data, you can mount the damaged filesystem in read-only mode and copy as many files as you can from it on to other local drives or across a network.
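The read-only mount and copy might look something like the sketch below. The helper name ‘rescue_copy’, the device ‘/dev/sdc1’ and the directories under ‘/mnt’ are examples only, so substitute your own.

```shell
# Copy everything readable from a mounted tree to a safe location.
# cp -a preserves ownership, permissions and timestamps; cp reports
# any file it cannot read and carries on with the rest.
rescue_copy() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -a "$src"/. "$dest"/
}

# Typical use, as root (device and directory names are examples):
#   mkdir -p /mnt/damaged
#   mount -o ro /dev/sdc1 /mnt/damaged
#   rescue_copy /mnt/damaged /mnt/backup/sdc1
```

Copying from ‘/mnt/damaged/.’ rather than ‘/mnt/damaged/*’ picks up hidden files as well.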

Another step to try before any filesystem repair tools is a block scan of the hard drive. You can skip this if you’re sure the drive is fine, because it will take a while to run, but it doesn’t hurt to play it safe. Use the ‘badblocks’ command to run the block scan:

sudo badblocks -sv /dev/sdc

The ‘badblocks’ command will, unsurprisingly, list any bad blocks discovered as it runs, so it should be clear if you’ve hit problems. If you’re still not sure, run ‘dmesg’ in another console and look for I/O errors.

If the hardware passes inspection, you can run the appropriate repair tool for your filesystem. For ext2 and ext3 filesystems, use ‘e2fsck’. The ‘-y’ option tells ‘e2fsck’ to answer ‘yes’ to every repair prompt, which is usually what’s needed if the filesystem is corrupted enough to require manual repair:

e2fsck -v -y /dev/sdc1

For XFS filesystems, use the ‘xfs_repair’ command:

xfs_repair /dev/sdc1

For ReiserFS filesystems, first run an extended filesystem check:

reiserfsck --check /dev/sdc1

If the ‘--check’ option reports fixable corruptions, you can use the ‘--fix-fixable’ option to repair the filesystem. If it reports fatal corruptions, or crashes, you can try the more extreme (and time-consuming) ‘--rebuild-tree’ option.
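If you’re not sure which of these tools applies, the choice can be summed up in a small lookup. This is only a sketch: ‘repair_command_for’ is a made-up helper that prints the matching command from this article without running it, and it assumes the type names that ‘blkid’ reports (‘ext2’, ‘ext3’, ‘xfs’, ‘reiserfs’).

```shell
# Print the repair command suggested above for a given filesystem type.
# Nothing is executed; you decide whether to run the printed command.
repair_command_for() {
    fstype=$1
    device=$2
    case "$fstype" in
        ext2|ext3) echo "e2fsck -v -y $device" ;;
        xfs)       echo "xfs_repair $device" ;;
        reiserfs)  echo "reiserfsck --check $device" ;;
        *)         echo "no repair tool listed for '$fstype'" >&2
                   return 1 ;;
    esac
}

# Typical use, as root, on an unmounted partition (example device):
#   fstype=$(blkid -o value -s TYPE /dev/sdc1)
#   repair_command_for "$fstype" /dev/sdc1
```

Printing the command rather than running it gives you a chance to double-check the device name before anything touches the disk.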

Dealing with dying drives
In the case of a drive failure, all you can do is copy everything possible on to another drive. If you don’t have any luck simply copying files from the drive, or you just want to try something more sophisticated, try performing a raw copy of the entire drive to another device: typically either as an image file or a new partition (at least as large as the damaged partition).

The traditional ‘dd’ tool can do this in a pinch, using some extra options to have it skip over errors:

dd if=/dev/sdc1 of=/dev/sdb1 bs=4k conv=noerror,sync

However, a better option is GNU ddrescue – it’s available in most distributions (in the ‘gddrescue’ package in Ubuntu), and it’s also on the SystemRescueCD. It’s quicker than ‘dd’, gets better results and it also keeps a logfile that records each of the bad blocks found – which you can feed back into ddrescue to have it retry only those areas. Use this command for an initial pass copying only the good blocks:

ddrescue -n /dev/sdc1 /dev/sdb2 logfile

Then retry each bad block three times and hopefully rescue some more data:

ddrescue -r3 /dev/sdc1 /dev/sdb2 logfile

Once you’ve recovered as much data from the faulty drive as possible, you can run the appropriate filesystem repair command on the new drive to correct the errors caused by the unreadable blocks, and then mount the new filesystem as usual to access the files.
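The two ddrescue passes can be wrapped in a short script so you don’t mistype a device name halfway through. This is a sketch only: ‘run’, ‘rescue_partition’ and the ‘DRY_RUN’ variable are names invented for this example, and the ddrescue options are exactly the ones shown above.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Two-pass rescue: a quick first pass with -n, then up to three
# retries of the areas recorded as bad in the logfile.
rescue_partition() {
    src=$1
    dst=$2
    log=$3
    run ddrescue -n "$src" "$dst" "$log" || return
    run ddrescue -r3 "$src" "$dst" "$log"
}

# Preview the commands first, then run them for real as root:
#   DRY_RUN=1 rescue_partition /dev/sdc1 /dev/sdb2 logfile
#   rescue_partition /dev/sdc1 /dev/sdb2 logfile
```

Both passes must use the same logfile, because the second pass reads it to find out which areas still need work.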

Human error
When you delete a file on Linux, there’s really no going back – none of the common filesystems support any useful form of file undelete. This is why file managers like Nautilus and Konqueror use Trash folders, so for your personal files, make sure you’re sending files to the Trash rather than deleting them outright.

If you’ve deleted key system files, like a folder under ‘/usr’ for example, there are a few options for recovery. If the damage is minor, and you know exactly what was affected, you may be able to get away with forcibly reinstalling the packages that you’ve deleted files from. Otherwise, an OS reinstallation is in order.

The first step is to ensure that your personal files are safe. If you have ‘/home’ on a separate partition, you should be right. If not, grab a live CD and back up your home folder to another hard drive, or even another system across the network. To make life easier after the reinstallation, you might want to grab your ‘/etc’ folder as well, assuming it wasn’t what you deleted. You could also grab the list of installed packages, to refer back to after the reinstallation. On a Debian or Ubuntu system, this command should do the job:

cat /var/lib/dpkg/status | grep -B 1 "Status: install" | grep Package
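The grep approach assumes each ‘Status:’ line sits directly below its ‘Package:’ line, which isn’t the case for entries that carry an extra field such as ‘Essential: yes’ in between. A slightly more forgiving sketch uses awk to remember the most recent package name; ‘list_installed’ is a made-up helper name, and the default path is the same status file as above.

```shell
# Print the name of every package whose status begins with "install",
# remembering the latest "Package:" field instead of relying on the
# two lines being adjacent.
list_installed() {
    awk '/^Package: /        { pkg = $2 }
         /^Status: install / { print pkg }' "${1:-/var/lib/dpkg/status}"
}

# Typical use:
#   list_installed > installed-packages.txt
```

If dpkg itself still works, ‘dpkg --get-selections’ produces a similar list; the awk version is handy when you only have the status file, for example when reading it from a live CD.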

Missing partition table?
One problem that’s fairly easy to cause, usually through human error, and potentially catastrophic, is a missing or corrupted partition table. Partitioning a drive is usually a safe operation, but if something goes wrong when writing the partition table to disk, or if you accidentally delete the wrong partitions, you could lose access to some or all of your data. If this occurs, boot into a live CD (SystemRescueCD is again an excellent choice) and run TestDisk:

testdisk /dev/sdc

This tool scans the drive for filesystems, determining their start points and sizes, and then builds a new partition table to match. In most cases, where no space on the drive is left unpartitioned, TestDisk can rebuild the partition table within seconds. It’s also quite easy to use – just take your time, follow the prompts, double-check everything before selecting the ‘Write’ option, and then reboot to ensure the changes take effect.

Skipping the ‘prevention’ speech
I could go on about the need for backups, and some of the Linux tools for making them, but if you’re like me you’d think ‘Hmm, that’s interesting’ and then not do anything about it, due to the effort involved. Similarly, you can use RAID to prevent data loss from disk failures, but converting your system to use multiple drives in a RAID configuration takes time and money too.

Instead, I’ll show you something that takes minimal effort, and might help you shift your data off a failing drive before it dies. Most hard drives ship with a monitoring system called S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), which keeps track of the drive’s health. With appropriate tools and configuration, you can use this information to help predict imminent drive failures.

Under Linux, S.M.A.R.T. is handled using smartmontools, which you should find packaged in your distribution. Once installed, open the configuration file (usually /etc/smartd.conf) and have a read through the examples. If there’s a line that begins with ‘DEVICESCAN’, comment it out, and then add a configuration line for each drive you wish to monitor. I use a setup like this on each drive:

/dev/sda -a -d ata -o on -S on -s (S/../.././02|L/../../6/03) -m root

This enables S.M.A.R.T. monitoring on the drive, with a short test each morning at 2am and a long test each Saturday morning at 3am. Automatic email notifications follow any failure. The ‘-d ata’ flag signifies that while this is a SCSI device name (‘sda’ rather than an ATA name like ‘hda’), the drive is actually an ATA device. You’ll need this for SATA drives. On some distributions, like Ubuntu Feisty, you may need it for plain IDE drives too, since they also appear as SCSI-style devices. You must remove it for real SCSI drives, if you have any.
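The schedule after ‘-s’ is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH: test type (S for short, L for long), month, day of month, day of week (1 for Monday through 7 for Sunday) and hour. To check that an expression fires when you expect, you can try it against sample slots with grep. This is a rough sketch: ‘schedule_matches’ is a made-up helper, and grep’s extended regular expressions are assumed to be close enough to smartd’s for simple patterns like this one.

```shell
# Return success if a schedule expression matches a given test slot.
# Slots use the T/MM/DD/d/HH layout described above.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx "$1"
}

schedule='(S/../.././02|L/../../6/03)'

# Short test, 6 October, Saturday, 2am: matches.
#   schedule_matches "$schedule" 'S/10/06/6/02' && echo due
# Would a short test be due in the current hour?
#   schedule_matches "$schedule" "S/$(date +%m/%d/%u/%H)" && echo due
```

Each dot stands for any single character, so ‘S/../.././02’ means ‘a short test in any month, on any day, at 02:00’, while the 6 in the second alternative limits the long test to Saturdays.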

The email notifications are sent to the root account on your system. You can read them by installing a simple command-line mail client like Mutt. You can also check the test logs with the ‘smartctl’ command:

sudo smartctl -l selftest -d ata /dev/sda

If you’re using Ubuntu, you can also install the ‘smart-notifier’ package. This contains a small GUI notification tool that will automatically display any warnings on the desktop for you to see.

This feature appeared in the October 2007 issue of PC & Tech Authority Magazine
