Once upon a time, I worked for a department with an enormous website.  This website consisted of more than 18,000 static files scattered through dozens of subdirectories. Some folders had a structure reflecting the office hierarchy, while others were put in the root to create URLs that were “shorter and easier to remember.”

As if that weren’t complicated enough, the website had over a dozen webmasters who could add, change or delete the web server’s files whenever they wanted.

Although the IT department managed the web server for the entire organization, they had seemingly little interest in working with the webmasters from other departments.  (They also managed to delete our entire department’s website on three separate occasions within a three-year period, but that’s a story for another day.)

When the IT department told me it was my responsibility to make backups of the website, I put together an automated system with an “obsolete” laptop, Windows Scheduler and a nifty piece of software that allowed you to map a network drive to FTP servers.

But it didn’t take long for me to realize that, with little additional effort, I could run a directory listing command on the files I’d just finished downloading to create an inventory list, then compare it against the inventory list from the previous night to see what had been added, changed or removed.

It worked something like this: every night, a Windows Scheduler task would trigger and copy the contents of the department’s public html folder from the web server to a folder on the local hard drive, overwriting the previous night’s result. Following that, a batch file would run a recursive listing of the copied folders and files, creating a simple text file I could use as an inventory snapshot.  A final step invoked WinDiff to compare today’s inventory snapshot against yesterday’s, so I could quickly see which files had been added, changed or deleted as soon as I walked into my office the next morning.
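
For anyone who wants to try the same idea without a batch file and WinDiff, here is a minimal sketch in Python. It is not the setup I actually ran, just one way to express the snapshot-and-compare step. The function names (build_inventory, compare_inventories) and the sample file names are made up for illustration, and it assumes the site has already been copied to a local folder.

```python
import os


def build_inventory(root):
    """Walk root recursively and return {relative_path: (size_bytes, mtime_seconds)}."""
    inventory = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Store paths relative to the root, with forward slashes,
            # so snapshots taken on different machines stay comparable.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            info = os.stat(full)
            inventory[rel] = (info.st_size, int(info.st_mtime))
    return inventory


def compare_inventories(old, new):
    """Return sorted lists of paths that were added, removed, or changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # "Changed" here means the size or the modification time differs.
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed


# Hypothetical snapshots standing in for yesterday's and today's listings.
yesterday = {"index.html": (1200, 1262304000), "staff/anne.jpg": (52000, 1262304000)}
today = {"index.html": (1350, 1262390400), "events/flyer.pdf": (88000, 1262390400)}

added, removed, changed = compare_inventories(yesterday, today)
print("Added:  ", added)    # ['events/flyer.pdf']
print("Removed:", removed)  # ['staff/anne.jpg']
print("Changed:", changed)  # ['index.html']
```

In a real nightly job you would call build_inventory on the freshly copied folder and save the result to disk, since the copy itself gets overwritten the next night and the saved snapshot is the only record of what was there.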

I know that may sound like OCD to some, but consider what having such detailed information readily available every morning means when you are running a large and complicated website without a content management system.

It makes conversations like this one possible:

“Hi, Maxine, it’s Andrew.  I couldn’t help but notice that several dozen pictures got deleted from your web folder yesterday?”

“Yes! We had a whole bunch of event photos on the website, but moved them over to Flickr instead . . . .”

“Oh, good.  Just checking.  Have a nice day.”

Or this one:

“Hi, Jack, it’s Andrew.  Got a second?”

“Sure, what’s up?”

“I see you copied a PDF file of [such and such book] yesterday, but it looks like it’s copyrighted material . . . .”

“Yeah, but the author’s posting it for download on his website, so that makes it all right.”  (Note: This ain’t necessarily so.)

“Tell you what, instead of putting the file on our server and risking a nastygram from their lawyer, can you do me a favor and make a link to it instead?”

“All right, I’ll do it . . . but I still think it isn’t a big deal.”

“Thank you. I’ll rest easier not worrying about that possibility.”

Or this one:

“Hi, Andrew, it’s Brenda.”

“Hi, Brenda, what’s up?”

“We accidentally deleted some image files yesterday . . . ”

“The seven jpgs in the staff folder?”

“Yes! Can you restore them from backup?”

“No problem.  I’ll take care of it.”

Or, God forbid, even this one:

“The folder with the phishing website in it has a timestamp from six months ago . . . . ”

“Yes, I know. But timestamps can be altered with the touch command in Unix. I know it was created some time within the last 24 hours because I run a daily inventory report.”

“In that case, I’ll need access to your web server’s traffic logs from the last 24 hours.”

“I’ll give you my point of contact in the IT department. They should be able to provide that.”

And that’s just from comparing the two inventory lists for differences.  Imagine what you can do by analyzing the most current inventory list with a few command prompt tricks:

  • Use a series of FIND commands to break your files down by year, based on the last five characters (e.g. “/2010”) of the datestamp.  This gives you a quick idea of which portions of the website need a content review and update.
  • Use the FIND command to track down extensions of unsupported file formats.
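
Both tricks in the list above can also be sketched in Python rather than with FIND. This is only an illustration under stated assumptions: the line format below (date, time, size, path) is a simplified stand-in for a directory listing, not the exact format of my inventory file, and the sample files and the choice of “unsupported” extensions are hypothetical.

```python
import os
import re
from collections import Counter

# Assumed line format: MM/DD/YYYY  HH:MM AM|PM  size  path
LINE_RE = re.compile(r"^\d{2}/\d{2}/(\d{4})\s+\d{2}:\d{2}\s+[AP]M\s+[\d,]+\s+(.+)$")

# Hypothetical inventory lines, for illustration only.
SAMPLE = """\
03/14/2008  10:02 AM            52,113 staff/anne.jpg
11/02/2009  04:45 PM             8,210 events/flyer.doc
06/30/2010  09:15 AM             1,350 index.html
07/01/2010  01:20 PM           640,000 media/intro.rm
"""


def count_by_year(lines):
    """Count files per year, using the year at the end of the datestamp."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return dict(sorted(counts.items()))


def find_by_extension(lines, extensions):
    """Return the paths whose extension is in the given set (case-insensitive)."""
    wanted = {ext.lower() for ext in extensions}
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and os.path.splitext(match.group(2))[1].lower() in wanted:
            hits.append(match.group(2))
    return hits


lines = SAMPLE.splitlines()
print(count_by_year(lines))                      # {'2008': 1, '2009': 1, '2010': 2}
print(find_by_extension(lines, {".rm", ".doc"}))  # ['events/flyer.doc', 'media/intro.rm']
```

Lines that don’t match the assumed format (folder headers, summary lines and the like) are simply skipped, which is roughly what filtering the listing with FIND does as well.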

The point is, backing up your website is a good start, but it’s even better if you have a decent sense of the inventory of that website.