Yesterday I attended GovLoop’s “Open Data Event” and was inspired to revisit a project I’d set aside a few months ago . . . to try a different approach.
For those who don’t know, the Fairfax County Police Department posts a weekly “data dump” of the tickets issued and arrests made. I know Fairfax Underground has created a search engine based around a collection of this data, but it’s little more than a novelty item to see if anyone you know (e.g. friend, neighbor, blind date, etc.) has been arrested or ticketed. I haven’t seen anyone plot the original data set on a Google Map, let alone mashup with other data sets such as real estate prices or school test scores. It’s become a curiosity for me, to see if something more significant can be done with that information.
One frustrating challenge with this particular data file is that the data is stored in a fixed width text file— a very generic format, but not as immediately useful as XML or JSON, for instance. You have to write code that tells your application to use characters 1-40 for column 1 data, use characters 41-60 for column 2 data, and so on . . . plus you have to hard code the names and quantity of the data fields, and that just didn’t seem like a particularly robust situation for me.
I wanted to create a php function that could, for lack of a better term, “read and analyze” any fixed width text file and return the names of the fields as well as the start positions for each field dynamically. I’ve written before about the struggle to come up with reusable algorithms that we can apply to different variations of the same basic principles, but this one took longer than I thought.
I won’t rehash the different approaches and failures, and jump to the approach that finally worked for me.
/* let's open our fixed width text file */ $handle = @fopen("arrests.txt", "r"); if ($handle) { // read our first line $headerRow = fgets($handle, 4096); /* parse for header row names, using TWO or more spaces for RegEx pattern */ $headerRowNames = preg_split('/(?:\s\s+|\n|\t)/', $headerRow, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE); // display array contents for examination echo "<pre>"; var_dump($headerRowNames); echo "</pre>";
Basically, we grab the first line of the file and use a regular expression of two or more consecutive spaces to figure out where one header row name ends and the next one begins. Since one of the header rows in my sample file has a space in it (i.e. Charge Description), we can’t use a single space as the delimiter. NOTE: There is an edge case where a header row name could be so long that only a single space separates it from the next header row name; so it’s not 100% “bulletproof”, but this was as close I could get to a solution with only one evening of work.
Here are the results when we display the array:
array(8) { [0]=> array(2) { [0]=> string(5) "LName" [1]=> int(0) } [1]=> array(2) { [0]=> string(5) "FName" [1]=> int(40) } [2]=> array(2) { [0]=> string(5) "MName" [1]=> int(60) } [3]=> array(2) { [0]=> string(3) "Age" [1]=> int(100) } [4]=> array(2) { [0]=> string(7) "DateArr" [1]=> int(105) } [5]=> array(2) { [0]=> string(6) "Charge" [1]=> int(135) } [6]=> array(2) { [0]=> string(15) "Charge Descript" [1]=> int(160) } [7]=> array(2) { [0]=> string(7) "Address" [1]=> int(210) } }
This gives us a multidimensional array with both the field names contained in the header row and their offset positions, which will certainly come in handy as we process the remaining lines of the file. Time permitting, I’d like to drop this as a method into a php class, alongside methods to convert the extracted data into either a JSON or XML file. That way, I’ll have a utility class that can add value to other fixed width text files in the future.
Just to add a quick update, I’ve created a project on GitHub for the Fixed Width Text File toolkit and will be working to add functionality as time permits.