rentzsch.com: tales from the red shed

Surviving I/O Errors Followup

Notes

It’s a treat Alastair came out of his blogging shell to comment on my little piece. You should write more, Alastair.

Jonathan, in his post, takes the hacker’s approach of trying to copy as much of the file as possible and for the bits that can’t be copied, ignore the errors and pretend the data was zero. This is fine for people like Jonathan, or like me, who understand about disks and computer programs and know the kinds of problems this might create. However, very few people in the general user population should really be trying to use tricks like overwriting blocks with zeroes or copying the file with options set to ignore the I/O error, since the consequences of those actions are not always wholly predictable.

Absolutely, this is cowboy computing. “Kids, don’t try this at home.” Substituting zeros on an I/O error is like using a screwdriver to avoid super-criticality: do it enough and you’ll get burned.

Alastair mentions the correct way to get out of this mess is a restoration from backups. Agreed. The only reason I didn’t take that course of action is that the error hit data I had created that day. There was no backup available. That work also represented hundreds of dollars of onsite client time, so I was loath to attempt to recreate it.

Dave Nanian of SuperDuper fame also chimed in:

I’m shocked, frankly, that Wolf’s Parallels disk was OK given the damage: he was very lucky.

Heh, it’s good Dave is equally horrified at my technique. I don’t know where that 4K of zeros wound up file-system wise. And Alastair is right: it’s a great way of silently corrupting your data. Fortunately, my data was source code, which is ASCII text. Compilers usually don’t take kindly to 4K-zero bullet holes in source text (Perl possibly notwithstanding), so I knew the files I had authored that day were still safe. After I recovered them, I threw away the disk image; it’s not worth attempting to figure out where those zeros landed and what got smashed.
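For anyone who would rather not rely on the compiler to notice, here is a minimal sketch of a direct check for zero-filled holes in a recovered text file. The function name `find_nul_runs` and the 512-byte threshold are assumptions made for illustration, not anything from the recovery described above.

```python
def find_nul_runs(path, min_run=512):
    """Return (offset, length) for each run of at least min_run NUL bytes.

    min_run is an assumed threshold; plain ASCII source should contain
    no NUL bytes at all, so any long run is suspicious.
    """
    with open(path, "rb") as f:
        data = f.read()  # source files are small enough to read whole

    runs = []
    pos = 0
    while True:
        start = data.find(b"\0", pos)
        if start == -1:
            break
        end = start
        # Walk forward to the end of this run of zero bytes.
        while end < len(data) and data[end] == 0:
            end += 1
        if end - start >= min_run:
            runs.append((start, end - start))
        pos = end
    return runs
```

An empty result only says that no long run of zeros landed in that particular file; it says nothing about the rest of the disk image.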

Dave makes three other great points:

  1. ditto is merely a fallback to SuperDuper’s usual copy mechanism. Sorry Dave, I saw “Error ditto: Input/output error” as the last line of the log file, and jumped to an incorrect conclusion.

  2. I knew iSights can wreak havoc on Firewire busses (causing I/O errors), but iPods were news to me. Perhaps it isn’t all bad that iPods have gone completely USB.

  3. Indeed, SMART would have been nice to have here. Unfortunately, it’s a Firewire drive, and SMART doesn’t work over Firewire for reasons I don’t understand. I have a 2.5"-to-3.5" IDE adapter, so I plan on dissecting the Firewire drive and slamming it into my old G4 Sawtooth tower so I can read its SMART status and decide if the drive goes in for warranty replacement.

Both Alastair and Dave mention sparing. That’s where a drive takes note of a bad section of a disk and cordons it off so it can’t play with the other magnetic children. For better or worse, sparing usually happens transparently and data isn’t lost, but as evidenced in my scenario, this isn’t always the case.

I was able to force sparing on the affected blocks by writing zeros to the entire drive. Before that, executing dd if=/dev/disk3 of=/dev/null conv=noerror would spew nine errors. Afterwards, the same command completed peacefully. Unnerving, but there you go.
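The dd scan above reads everything, throws the data away, and keeps going past errors. A rough Python sketch of the same idea follows, recording which offsets failed rather than just counting them. The name `find_unreadable_blocks` and the injectable `read_block` parameter are hypothetical, added so the loop can be exercised without a failing drive. It assumes a Unix system (`os.pread`) and a path whose size `lseek` can report; some systems report a size of zero for device nodes, so treat it as an illustration rather than a dd replacement.

```python
import os


def find_unreadable_blocks(path, block_size=512, read_block=os.pread):
    """Read path block by block and return the offsets that raised an I/O error.

    block_size defaults to 512 bytes, dd's default. read_block is a
    hypothetical hook with the same signature as os.pread(fd, n, offset).
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        for offset in range(0, size, block_size):
            want = min(block_size, size - offset)
            try:
                read_block(fd, want, offset)  # data is discarded, like of=/dev/null
            except OSError:
                # Keep going after a failure, as conv=noerror does.
                bad_offsets.append(offset)
    finally:
        os.close(fd)
    return bad_offsets
```

Running it before and after the zeroing pass and comparing the two lists would give the same before-and-after picture as the two dd runs.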

Wednesday, June 21, 2006
03:08 AM