Over at O’Reilly they have an article by a sysadmin who recently did some rescue work using a Knoppix CD. (Knoppix is a distribution of Linux that runs entirely off the CD without touching the hard drive.)
Let’s look at his first paragraph:
| As a sysadmin, I wear many hats. Some days I’m the janitor–I clean up discarded files on the file server and clear spam from the mail server. Other days I’m the maintenance man–I make sure all the servers are running smoothly and that any holes have been patched. Some days I’m the architect–I plan, organize, and design systems to suit our needs. Some of my favorite days, however, are the days I put on my rescue hat. When a machine is in trouble, the whistle sounds, I grab my rescue gear, and I run down the beach with my life preserver. |
Wonderful! This sysadmin has given me the opportunity to say something I’ve wanted to say for a long time. You see, he’s given us a very common (I think) view of what a system administrator’s job is. And, in a way, this is what sysadmins do. More specifically, it’s the extent of what mediocre sysadmins do.
While at Indiana University Bloomington, I discovered that the Computer Science department there has some of the best systems administrators in the world. Well, I must admit, I’ve only sampled three academic institutions to date (and have word from a number of others through friends and colleagues), and so far no one believes me when I tell them what our systems staff were like.
Practically speaking, the machines never went down. It wasn’t an issue. There was the occasional downtime for the big servers when memory or hardware upgrades took place, but here “occasional” means “yearly” and “scheduled really far in advance.” When machines went down on schedule, downtime was typically arranged to take place early in the morning on a weekend. So the main departmental fileserver would go offline on (say) a Saturday at 7 AM. For an hour.
When something catastrophic happened, things were fixed fast; a pager number was publicly posted for all to use in the event of systems going down. I called once on a weekend when an NFS handle went stale. Occasionally, someone would let a process run away on the webserver, killing things for all other users (OK, OK, that was me). These were often remedied in minutes after the call was made. Likewise, when someone discovered they had deleted a file or email they shouldn’t have, file restores were handled in minutes or, at worst, hours.
The largest failure I recall was on the undergraduate fileserver. It lost a drive in the RAID, which was replaced. No one noticed. Not, anyway, until a second drive failed before the new drive had finished being introduced into the RAID. RAID5 dies completely if it loses two drives. As it turns out, it was a bad controller, which in turn was destroying the drives (hard to diagnose). But they had the parts on hand to rebuild the array (that’s two spare drives and a controller in place), and we lost something like 5 hours of data (between midnight and 5 AM) on a subset of our machines. In other words, the losses were minimal, and the machine was back up after a complete restore of over a million files from backup. That took a number of hours, yes, and our admins were there through the night to see that things were back up as soon as possible. (We were discouraged from using the pager for backup requests, mind you.)
But do you know what? The integrity of the machines was secondary. The systems group was, first and foremost, a group committed to supporting the productivity of the faculty, staff, and students in the department. Their commitment was to the people, and the smooth-running operation of the systems was just part of how they manifested that commitment.
Software requests were promptly handled, and often proactively. “I’ve installed both the stable and unstable versions, as it looks like the unstable tree had the features you need; they’re in /l/blahblah-(stable|unstable)” might be a common reply to a request for an application or set of libraries. Once, when I found that my BerkeleyDB files were no longer accessible, not only were the apps for the older files restored to a part of the tree, but the admins (quickly) investigated likely problems, and were able to give me a direct solution: “Run x on your databases, and it should bring them up to the current version; I tried it on y, and it looks like it does the trick.” I didn’t know why my CGI application couldn’t access the DB files; they not only discovered why, but proactively found the tools I needed to get my app up and running again.
So when I encounter sysadmins who are chuffed about how great their systems run, I think I’ve met a sysadmin who doesn’t really understand what it takes to be great at what they do. Why? Because when the systems are running smoothly, it’s not time to fire up a video game and play some Civ or Counterstrike all afternoon; that’s when you start finding out what your users are doing, what their needs are, and how you can best support those needs.
And that’s what lets your users not just get their jobs done, but do really cool stuff and really great things.


