Jan 12, 2008

image recovery


I take a lot of pictures. On occasion, I get too impatient when downloading images from my compact flash cards. I'll swap the card without ejecting it properly and sometimes the cards get corrupted. Typically, when that happens the file allocation table of the previous card that was in the reader gets written onto the new card, or the FAT gets corrupted in some other way. The images are still there, but you can't access them. This happened to me last weekend and I didn't have any recovery software on this laptop. I had a look around online and the only recovery programs I could find were close to $100. I had a bit of free time so I decided to try writing my own instead. Turns out a basic recovery tool is actually really simple to put together.

A couple of things made simple image recovery possible. Firstly, I always format the cards in the camera before I use them, so I know the card is empty when the camera starts writing images to it. Secondly, I never delete images in the camera. This means there is no fragmentation on the drive; the images are simply stored sequentially. The FAT format is fairly simple: the disk is divided into sectors (512 bytes on these cards), which are grouped into clusters whose size depends on how the disk was formatted. Each file is stored as a linked list of clusters. Potentially the clusters could be fragmented across the drive, particularly if images are deleted and new ones stored on the disk. With a clean start and no images deleted, it is reasonable to assume that each image will be stored in consecutive clusters. I think damaged sectors are managed at a physical level on the disks, so they are mapped out of the available space (feel free to correct me on this). Anyway, with these assumptions made, it is possible to write a simple tool to parse a disk image and extract images, with a high likelihood of a successful result.
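To make the "linked list of clusters" idea concrete, here is a toy model in Python 3. It is only an illustration: the dictionaries stand in for a file allocation table, and the `END` marker, the cluster numbers, and the function names are all made up for this sketch rather than taken from the real on-disk format.

```python
# Toy model of a FAT: key = cluster number, value = next cluster of the same file.
# END is a made-up end-of-chain marker (real FATs use reserved values instead).
END = -1

def cluster_chain(fat, start):
    """Follow the linked list of clusters beginning at `start`."""
    chain = []
    cluster = start
    while cluster != END:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain

def is_contiguous(chain):
    """True when every cluster directly follows the previous one."""
    return all(b == a + 1 for a, b in zip(chain, chain[1:]))

# Freshly formatted card, nothing deleted: each file sits in consecutive clusters.
clean_fat = {2: 3, 3: 4, 4: END, 5: 6, 6: END}

# After deletions, a new file can end up scattered into the freed gaps.
fragmented_fat = {2: 5, 5: 3, 3: END}

print(cluster_chain(clean_fat, 2), is_contiguous(cluster_chain(clean_fat, 2)))
print(cluster_chain(fragmented_fat, 2), is_contiguous(cluster_chain(fragmented_fat, 2)))
```

In the clean case the chain can be ignored entirely, because reading forward from the first cluster gives you the whole file. That is the assumption the rest of this post relies on.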

The first step is to get the data. I did the recovery on a Unix system and used dd to get the initial image. You have to dump the actual physical device, not one of the disk partitions (as those are essentially what has become corrupt).

dd if=/dev/rdisk1 of=image.img bs=512

The block size is set to 512 to match the formatting of the compact flash card. This step takes a while, but eventually you'll have an image file, image.img, which is a low-level copy of the data on the drive. The next step is to work out a way to identify the files you are looking to recover. I wrote a simple hex dump tool that prints the first few bytes of a file. I used this on a representative sample of the Canon cr2 RAW files to get a search key to identify the start of a file.

--- show_header.py ---

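import binascii
import sys

# Read the first 12 bytes of the file named on the command line.
with open(sys.argv[1], 'rb') as f:
    header = f.read(12)

# Print the bytes as hex (hexlify works on both Python 2 and Python 3).
print(binascii.hexlify(header).decode('ascii'))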

--- end show_header.py ---

This little bit of python can be applied to a group of files with xargs

ls *.cr2 | xargs -n 1 python show_header.py

From that output, it is easy enough to find a representative run of leading bytes that can be used to identify the start of a file. I had also recorded some audio with the camera, so I went through a similar process with .wav files to extract them correctly.
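If you'd rather not compare the hex output by eye, the search key can be computed as the longest run of leading bytes that all the sample headers share. The following is a small Python 3 sketch of that idea; `read_header` and `common_prefix` are names invented here, and the sample byte strings are made-up stand-ins, not real camera data.

```python
def read_header(path, length=16):
    """Return the first `length` bytes of the file at `path`."""
    with open(path, 'rb') as f:
        return f.read(length)

def common_prefix(headers):
    """Longest run of leading bytes shared by every header."""
    if not headers:
        return b''
    shortest = min(headers, key=len)
    for i in range(len(shortest)):
        if any(h[i] != shortest[i] for h in headers):
            return shortest[:i]
    return shortest

# Made-up stand-in headers: same first four bytes, different tails.
samples = [b'ABCD1234', b'ABCD9876', b'ABCDxyz']
print(common_prefix(samples))
```

With real files you would build the list with something like `[read_header(p) for p in paths]`. The more sample files you feed it, the less likely the prefix is to include bytes that only match by coincidence.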

Then all you have to do is iterate through the disk image in block_size chunks, checking for those file signatures at the start of each sector. When you find a file signature, start dumping all the data to a new file until you find another signature. That's all there is to it. Note that there are no warranties with this: I'm offering no guarantees that it will work, or even that it won't wipe your computer. Use at your own risk. With this I was able to recover the 150+ images that I'd taken and several audio files. It actually works surprisingly quickly once the disk image has been made. It's also worth mentioning that the JPEG header matching is untested, as I didn't have any JPEG files on this particular disk, but it is included here for completeness.
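For anyone who just wants the shape of that loop, here is a minimal Python 3 sketch. It is not the downloadable image_recovery.py, only an illustration of the same approach. The signature bytes in `SIGNATURES`, the output file naming, and the function names are assumptions made for this sketch; check the signatures against your own files with show_header.py before trusting them.

```python
import os

SECTOR_SIZE = 512  # same value as the bs=512 passed to dd

# Assumed signatures: extension -> bytes expected at the start of a file.
# The 'wav' entry is loose, since other RIFF-based formats start the same way.
SIGNATURES = {
    'cr2': b'II*\x00\x10\x00\x00\x00CR',
    'wav': b'RIFF',
    'jpg': b'\xff\xd8\xff',
}

def match_signature(sector, signatures):
    """Return the extension whose signature starts this sector, or None."""
    for ext, magic in signatures.items():
        if sector.startswith(magic):
            return ext
    return None

def recover(image_path, out_dir, signatures=SIGNATURES, sector_size=SECTOR_SIZE):
    """Split a raw disk image into files, starting a new one at each signature."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    out = None
    try:
        with open(image_path, 'rb') as image:
            while True:
                sector = image.read(sector_size)
                if not sector:
                    break  # end of the disk image
                ext = match_signature(sector, signatures)
                if ext is not None:
                    # A new file begins here, so finish the previous one.
                    if out is not None:
                        out.close()
                    count += 1
                    name = 'recovered_%04d.%s' % (count, ext)
                    out = open(os.path.join(out_dir, name), 'wb')
                # Sectors before the first signature are skipped.
                if out is not None:
                    out.write(sector)
    finally:
        if out is not None:
            out.close()
    return count
```

Calling `recover('image.img', 'recovered')` would write the files into a `recovered` directory and return how many it found. Each recovered file carries some padding from its last cluster, and the final file runs to the end of the image. Image viewers generally seem to cope with trailing junk, but that is worth checking on your own files.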

Download the source for image_recovery.py (you'll probably need to change the file extension - web server doesn't like serving .py files)
