Unix Observations

Today has been a somewhat odd day. This morning I had an ext3 partition crap out completely at random, and later I had to munge about 2 million strings into SQL. But through it all I've noticed that the problems I solved today would not have been easily solvable on a less robust operating system (read: Windows).

So, this morning I noticed that my email was down. That's odd; email never goes down. So I SSHed in to my server and noticed bind had crashed. This too was odd, so I restarted it, and nothing. The entire / filesystem was read-only. That's really odd, because the / partition lives on a software RAID 1 which was reporting that it was healthy. On top of that, the /home partition (on the same RAID) was perfectly normal. After a little freaking out I had one of the NOC guys at the datacenter hook up a remote KVM and I ran an fsck at boot. That solved my problem completely and life went on. (As an aside, mad props to the guys at NetActuate; they are lifesavers in an emergency.)

While that may seem normal, think about it... when was the last time you were able to repair a corrupted NTFS filesystem that easily? (To Microsoft's credit, I'm sure it's just as hard to repair an HFS volume, but I have no experience there.) In the Windows world this kind of problem is called a format-and-start-over problem.

My evening was spent munging about a million rows of ambiguous text into unique SQL insert statements. The end result was reasonably clever. I was given a data dump of categories in comma-separated format (id,category) which I had to match up to a file with the category name. I then had to pick out only the unique entries in the file and create an insert statement for each row, using the category ID as one of the columns in the insert. Oh... and did I mention the files were randomly littered with ^M (that's the carriage return left over from Windows line endings)? Sounds simple, no?

Given sed, sort, uniq and some bash scripting, the project turned out to be a piece of cake. Then I started to think about how I would do this on a non-Unix platform and couldn't think of a way that was half as simple, since most of the commands I needed don't even exist for non-Unix platforms. For the second time that day Unix saved my sorry butt.
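For the curious, here is a rough sketch of what that kind of pipeline can look like. It is not the script I actually ran: the file names, the sample rows and the item_category table are all made up for illustration, and I've swapped in tr to strip the ^M characters, sort -u in place of sort piped to uniq, and join to do the matching.

```shell
#!/bin/sh
# Sketch only: file names, sample rows and the item_category table are hypothetical.
set -eu
LC_ALL=C; export LC_ALL   # keep sort and join agreeing on collation order

workdir=$(mktemp -d)
# categories.csv is the "id,category" dump; items.txt lists category names
# with duplicates. Both are sprinkled with stray carriage returns (^M).
printf '1,books\r\n2,music\n3,games\r\n' > "$workdir/categories.csv"
printf 'music\r\nbooks\nmusic\nbooks\r\n' > "$workdir/items.txt"

# Strip ^M once per file, then sort on the field that join will match on.
tr -d '\r' < "$workdir/categories.csv" | sort -t, -k2,2 > "$workdir/cats.sorted"
tr -d '\r' < "$workdir/items.txt" | sort -u > "$workdir/items.uniq"

# join prints "name,id" for every name found in both files; one sed pass
# doubles any single quotes and rewrites each row as an INSERT statement.
inserts=$(join -t, -1 2 -2 1 "$workdir/cats.sorted" "$workdir/items.uniq" \
  | sed -E "s/'/''/g; s/^([^,]*),([^,]*)$/INSERT INTO item_category (category_id, name) VALUES (\2, '\1');/")

printf '%s\n' "$inserts"
rm -rf "$workdir"
```

With the sample data above this prints one INSERT for books and one for music, and nothing for games since it never shows up in the item list. A category name containing a comma would break this, so treat it as a starting point.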

Two other observations for today: sed is incredibly slow when called serially many times, since every call has to start a brand new process. If you are processing massive amounts of text, it is not efficient or practical (or sane, for that matter) to call it for each line of text; rather, it should be called once on the original file first. Following that simple advice will save you hours, if not days, of processing.
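To make the sed point concrete, here is a small sketch of the two approaches side by side. The sample data and the s/old/new/ substitution are invented for the example; the only thing that matters is where sed gets called.

```shell
#!/bin/sh
# Sketch only: the sample data and the substitution are hypothetical.
set -eu

input=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 2000; i++) printf "row %d,old\n", i }' > "$input"

# Slow: the loop starts a new sed process for every single line.
per_line() {
  while IFS= read -r line; do
    printf '%s\n' "$line" | sed 's/old/new/'
  done < "$input"
}

# Fast: one sed process streams through the whole file.
whole_file() {
  sed 's/old/new/' "$input"
}

# Both approaches produce identical output; only the process count differs.
slow_sum=$(per_line | cksum)
fast_sum=$(whole_file | cksum)
first_line=$(whole_file | head -n 1)

rm -f "$input"
```

If you paste the two functions into bash and run each one under time, the gap should be obvious even at a couple thousand lines, and it only gets worse as the file grows.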

When compressing a SQL file that was about 320MB, bzip2 was the most space efficient, creating a file of about 7MB in 5 minutes, while gzip was less space efficient but much faster, creating a 10MB file in about 1 minute.