I was up late last night working on the dead hard drive problems and spent pretty much all day today fixing it. In retrospect I should have just gone out and purchased a replacement HD to salvage everything onto and be done with it.
Instead I spent a tremendous amount of time that frankly wasn’t worth it. While I did learn a lot, I don’t think the lesson was worth 1.5 days of work.
The setup was:
[code]Old WD 80GB IDE HD (/dev/hda):
/dev/hda1 - 50MB mounted on /boot
/dev/hda2 - 1GB swap
/dev/hda3 - 70GB mounted on /
Newer 250GB SATA HD (/dev/sda):
/dev/sda1 - 250GB mounted on /home[/code]
/dev/hda was failing. Yesterday when I attempted to fix the root (/dev/hda3) partition with ‘reiserfsck --rebuild-tree’, things got really bad. It detected a seek error and gave up. At this point, I was unable to boot or mount the partition.
So now I had a broken partition on a dying HD. After letting spinrite work on the drive yesterday, most of the bad blocks went away again (for how long I do not know).
I booted into a knoppix-based System Rescue CD and used dd_rescue to dump the contents of /dev/hda3 into some free space on /dev/sda1 (currently used as the /home partition):
[code]dd_rescue -b 4096 /dev/hda3 /mnt/sda1/hda3.img[/code]
This took a while to dump the 70GB partition, but it completed with no problems, and I don’t recall seeing any seek errors.
Once I had an image-dump of the old HD, I now had to repair it (repairo!)
In order to do that, I had to ‘mount’ the image:
[code]losetup /dev/loop0 /mnt/sda1/hda3.img
mount -t reiserfs -o ro /dev/loop0 /mnt/hda3
[/code]
Then I could check it (reiserfsck --check /dev/loop0), but that just told me that it was really screwed up, so I had to run with --rebuild-tree instead:
[code]reiserfsck --rebuild-tree /dev/loop0[/code]
The --rebuild-tree operation took a long time, but it did find and repair a lot of errors.
Once it eventually completed, the image was fine. I then used dd_rescue to also save an image of the /boot partition (ext3) as well:
[code]dd_rescue /dev/hda1 /mnt/sda1/hda1.img[/code]
Now I had all of the important data off of the old 80GB HD (/dev/hda). I decided to just combine everything on the 250GB HD (sda) that I was using for /home. This meant that I had to re-arrange the partition.
Before I did this, I had to move the hda1.img and large hda3.img files off that partition to somewhere else. I didn’t have any external HDs handy, so I decided to move them across the network to freeside, which is sporting a new 300GB HD with plenty of space to hold the images.
I wrestled with how to copy the files (in particular the large 70GB hda3.img file) the fastest way possible. I didn’t have an ftp server running on freeside and didn’t want to bother with setting one up, so that left me with either rsync, scp, or samba.
Naturally I figured that rsync (via ssh) would be the fastest way. It was not. scp was not fast either. I even tried both rsync and scp with arcfour, the ssh cipher that trades weaker encryption for faster file transfers. This did not help.
Instead, I mounted a windows share with smbmount and simply copied the files with ‘cp’:
[code]cp -aug /mnt/sda1/hda1.img /mnt/freeside/
cp -aug /mnt/sda1/hda3.img /mnt/freeside/[/code]
It was super fast. I’m still baffled as to why samba was faster than the other two methods since everything I read on the nets suggested that it should not be.
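Since the next step rewrites the partition the originals live on, it is worth confirming the copies first. A minimal sketch, assuming both paths are still readable; verify_copy is a hypothetical helper name, not something shipped on the rescue CD:

```shell
# Hypothetical helper: compare a copy against its original byte for byte.
verify_copy() {
    # $1 = original file, $2 = copy to check
    if cmp -s "$1" "$2"; then
        echo "OK: $2 matches $1"
    else
        echo "MISMATCH: $2 differs from $1" >&2
        return 1
    fi
}

# Example call (paths taken from the commands above):
# verify_copy /mnt/sda1/hda3.img /mnt/freeside/hda3.img
```

This rereads the whole 70GB image over the network, so it is slow, but it is cheap compared with discovering a bad copy after the original is gone.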
Now that the old hda images were safely over on freeside, I ran qtparted from the rescue CD to resize the /dev/sda1 partition. This turned out to be a big mistake.
Only after I began the very long resize operation did I read somewhere that it doesn’t behave with reiserfs partitions. So I aborted it, but now everything was screwed up with /dev/sda1 and I was in jeopardy of losing all the data on my /home partition. This isn’t too disastrous because I do daily backups of the home directories, but I would like to salvage what I can.
I think I had to run [code]reiserfsck --rebuild-tree /dev/sda1[/code] in order to repair the drive. This took a long time. There were a lot of files sent to lost+found and it looked like a mess. I was mainly concerned about the 41GB mp3 collection. It turns out that this was mostly all in lost+found with random characters for filenames. What a mess! I don’t keep daily backups of the mp3s, but there is a quarterly backup saved offsite on an external HD. I didn’t have this handy and wanted to salvage them now. I had an idea.
First, I needed to get everything from /dev/sda1 off of the HD, so I tarred it all up (dd_rescue would have created a large 230GB image file which would not have fit on freeside). In total, the tar file was about 60-80GB. I saved the tar directly to freeside over samba.
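A rough sketch of that tar step, assuming the damaged partition is mounted at /mnt/sda1 and the share at /mnt/freeside as above. The exact command is not given here, and archive_dir is a hypothetical helper:

```shell
# Hypothetical helper: archive a directory's contents and check the result.
archive_dir() {
    # $1 = directory to archive, $2 = tar file to create (must live outside $1)
    tar -cf "$2" -C "$1" . || return 1
    # Reading the member list back catches a truncated or unreadable archive.
    tar -tf "$2" > /dev/null
}

# Example call (paths are assumptions based on the mounts above):
# archive_dir /mnt/sda1 /mnt/freeside/home.tar
```

Archiving relative to the directory itself (the trailing ".") means the files can later be unpacked straight into whatever directory the new /home partition is mounted on.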
Now that everything was off the 250GB HD, I blew it away and repartitioned it in the following scheme:
[code]/dev/sda1: 50MB /boot partition
/dev/sda2: 1GB swap partition
/dev/sda3: 50GB / partition
/dev/sda4: 182GB /home partition[/code]
I then created the swap ‘partition’ and formatted the other partitions. I think I used a GUI tool to do this.
After this was done, I mounted the boot, root, and home partitions:
[code]mount /dev/sda1 /mnt/sda1
mount /dev/sda3 /mnt/sda3
mount /dev/sda4 /mnt/sda4[/code]
I mounted the hda1.img and hda3.img files remotely over samba using the loop device trick. However, the old way of mounting (mount -t reiserfs -o ro /dev/loop0 /mnt/hda3) did not work, so I had to use mount a different way, perhaps because it was over samba? Note that hda1.img is the ext3 /boot image, so it is mounted as ext3 rather than reiserfs:
[code]mount /mnt/freeside/hda1.img /mnt/hda1 -t ext3 -o loop=/dev/loop0
mount /mnt/freeside/hda3.img /mnt/hda3 -t reiserfs -o loop=/dev/loop1[/code]
With the two images mounted remotely from the freeside samba share, I was able to recursively restore all of the files, preserving everything:
[code]cp -aug /mnt/hda1/. /mnt/sda1/
cp -aug /mnt/hda3/. /mnt/sda3/[/code]
I also untarred everything from the tarfile I made of the home partition:
[code]cd /mnt/sda4
tar xvf /mnt/freeside/home.tar .[/code]
Next, I did what I could to salvage any files from lost+found. I was a lot more successful at this than I thought I would be. The only stuff that got missed was most of the mp3s.
I used my old backups to restore the /home/jeff and /home/jen home dirs with the latest data taken from the backups. Because I do a full backup on Monday followed by incrementals on each of the other days, I first restored the full and then did all of the incrementals in succession.
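The order matters: the full backup goes first, then each incremental from oldest to newest, so that later versions of a file overwrite earlier ones. A minimal sketch, assuming the backups are plain tar archives (the tool that produced them is not named here); restore_in_order and the archive names are hypothetical:

```shell
# Hypothetical helper: unpack a full backup and its incrementals in sequence.
restore_in_order() {
    # $1 = destination directory; remaining arguments = archives, oldest first
    dest=$1
    shift
    for archive in "$@"; do
        # Stop at the first failure so a bad archive is not silently skipped.
        tar -xf "$archive" -C "$dest" || return 1
    done
}

# Example call (archive names are made up for illustration):
# restore_in_order /home/jeff full-mon.tar incr-tue.tar incr-wed.tar
```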
At this point everything was back to ‘normal’ except for the mp3s. I wanted to work on those after getting turing up and running. So I tweaked /etc/fstab to handle the new partitions correctly:
[code]/dev/sda1 /boot ext2 noauto,noatime 1 2
/dev/sda2 none swap sw 0 0
/dev/sda3 / reiserfs noatime 0 1
/dev/sda4 /home reiserfs noatime 0 1[/code]
And then rebooted. To my delight, everything came back online just as I left it and the system was running fine.
There was one oddity that exposed a hole in my backup scheme. I ended up with a lot of duplicated email. Here is why:
1) A full backup is taken including everything in /home/jeff/.maildir
2) During the day new emails are received and some are deleted and some are moved around.
3) The next day, an incremental backup is taken to pick up only new or changed items in /home/jeff/.maildir
4) Steps 2 and 3 repeat for however many days went on before I had the outage.
So when I restored the full backup first, the .maildir snapshot was restored - no problem. But as I restored the remaining incremental backups in order, I ended up with extra emails that had since been deleted, because an incremental only records new or changed files and never records a deletion. The only way around this seems to be to take a snapshot of that directory every time a backup is done instead of just the changes. I’ll ponder on how to accomplish this in the context of my existing backup scheme at a later date.
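One possible way to do that, sketched under the assumption that the backup job can run an extra command: write a list of every file in the directory at backup time, store it with the backup, and after a restore delete anything the newest list does not mention. write_manifest and prune_to_manifest are hypothetical helpers, not part of any existing backup tool:

```shell
# Hypothetical helper: record every file under a directory at backup time.
write_manifest() {
    # $1 = directory to snapshot, $2 = manifest file to write (outside $1)
    ( cd "$1" && find . -type f | LC_ALL=C sort ) > "$2"
}

# Hypothetical helper: after restoring the full backup and all incrementals,
# delete files that are not listed in the most recent manifest.
prune_to_manifest() {
    # $1 = restored directory, $2 = manifest from the last backup
    ( cd "$1" && find . -type f | LC_ALL=C sort ) |
        LC_ALL=C comm -23 - "$2" |
        while IFS= read -r stale; do
            # Lines reaching this loop exist on disk but not in the manifest.
            rm -f -- "$1/$stale"
        done
}

# Example calls (the manifest path is an assumption):
# write_manifest /home/jeff/.maildir /backup/maildir.manifest
# prune_to_manifest /home/jeff/.maildir /backup/maildir.manifest
```

Messages deleted between backups are then removed again once the incrementals have been replayed. File names containing newlines would confuse this sketch, though maildir names normally do not contain them.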
With the system mostly repaired, I could take a breather and focus on fixing the mp3s. Because most of my mp3s were scattered in randomly-named directories in /home/lost+found/, I had to find a way to get them all back into the names and directories that they once had.
MediaMonkey is perfect for that. I had been using MediaMonkey to manage the music collection, so the database still knew where everything was supposed to be. I made a backup of the current database file in case I need to reference it again. Then I used MediaMonkey to scan through all of the files