Procedure To Clean Fairfax County Police Department Ticket/Arrest Data

a sticker depicting a fog aerosol can has been placed on a faded out parking notice sign.

The weekly ticket/arrests data file for Fairfax County Police Department has changed quite a bit since 2015. For starters, it is no longer uploaded weekly, contains both tabs and spaces, and appears to be cumulative over the entire year to date. Since the file is no longer a fixed-width text file, it made more sense to come up with a clean up procedure which converted the hybrid file into a pure tab-separated file.

Note: This procedure was based http://www.fairfaxcounty.gov/police/crime/arrest.txt published on July 25th, 2017. If new inconsistencies are introduced in subsequently published files, it may be necessary to revise this procedure.

1) Open the most recently downloaded arrest.txt file in Microsoft Notepad.

2) Replace all tabs with [IAMATAB].

(We’re replacing real tabs with virtual ones)

3) Replace all “[IAMATAB] [IAMATAB]” with “[IAMATAB][IAMATAB]”

(This removes tabs separated by a single space in addresses)

4) Replace all ” [IAMATAB]” with “[IAMATAB]”

(You repeat this step dozens of times until all blank spaces BEFORE virtual tabs have been removed).

5) Replace all “[IAMATAB] ” with “, ”

(A virtual tab with a single trailing space gets transformed into a comma with a trailing space)

6) Replace all “[IAMATAB]” with real tabs.

7) Use “Save As” feature to create a copy, such as “arrest-tabbed.txt”, which you should be able to open in your preferred spreadsheet program.

NOTE: This is almost perfect, but there are some inconsistencies which cannot be addressed.

1) Addresses with apartment numbers in them.

Some of the apartment numbers are flanked by tabs, while others are flanked with trailing spaces within the tabs. This inconsistency makes it difficult to deal with at the text processing level, but can be addressed with a spreadsheet formula which concatenates two columns simply enough.

2) Missing header row

This will have to be manually added to the file after it is converted. Again, this is easily done in a spreadsheet program.

3) Incorrect Charge Notation

Some charges were converted into date format (e.g. Drunk in Public, which is §5-1-1, became 5/1/2001) in the FCPD original document, and then subsequently exported. In other words, it’s broken before we even get our hands on it. Since it matches date formats, it’s tricky to modify it without accidentally changing actual dates in our file.

Working With Fixed Width Text Files

"Someday" is the busiest day of the week.

Yesterday I attended GovLoop’s “Open Data Event” and was inspired to revisit a project I’d set aside a few months ago . . . to try a different approach.

For those who don’t know, the Fairfax County Police Department posts a weekly “data dump” of the tickets issued and arrests made.  I know Fairfax Underground has created a search engine based around a collection of this data, but it’s little more than a novelty item to see if anyone you know (e.g. friend, neighbor, blind date, etc.) has been arrested or ticketed.  I haven’t seen anyone plot the original data set on a Google Map, let alone mashup with other data sets such as real estate prices or school test scores. It’s become a curiosity for me, to see if something more significant can be done with that information.

One frustrating challenge with this particular data file is that the data is stored in a fixed width text file— a very generic format, but not as immediately useful as XML or JSON, for instance.  You have to write code that tells your application to use characters 1-40 for column 1 data, use characters 41-60 for column 2 data, and so on . . . plus you have to hard code the names and quantity of the data fields, and that just didn’t seem like a particularly robust situation for me.

I wanted to create a php function that could, for lack of a better term, “read and analyze” any fixed width text file and return the names of the fields as well as the start positions for each field dynamically. I’ve written before about the struggle to come up with reusable algorithms that we can apply to different variations of the same basic principles, but this one took longer than I thought.

I won’t rehash the different approaches and failures, and jump to the approach that finally worked for me.

 /* let's open our fixed width text file */
 $handle = @fopen("arrests.txt", "r");

 if ($handle) {
 // read our first line
 $headerRow = fgets($handle, 4096);
 
 /* parse for header row names, using TWO
  or more spaces for RegEx pattern */
 $headerRowNames = preg_split('/(?:\s\s+|\n|\t)/', $headerRow, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

// display array contents for examination
 echo "<pre>";
 var_dump($headerRowNames);
 echo "</pre>";

Basically, we grab the first line of the file and use a regular expression of two or more consecutive spaces to figure out where one header row name ends and the next one begins. Since one of the header rows in my sample file has a space in it (i.e. Charge Description), we can’t use a single space as the delimiter. NOTE: There is an edge case where a header row name could be so long that only a single space separates it from the next header row name; so it’s not 100% “bulletproof”, but this was as close I could get to a solution with only one evening of work.

Here are the results when we display the array:

array(8) {
  [0]=>
  array(2) {
    [0]=>
    string(5) "LName"
    [1]=>
    int(0)
  }
  [1]=>
  array(2) {
    [0]=>
    string(5) "FName"
    [1]=>
    int(40)
  }
  [2]=>
  array(2) {
    [0]=>
    string(5) "MName"
    [1]=>
    int(60)
  }
  [3]=>
  array(2) {
    [0]=>
    string(3) "Age"
    [1]=>
    int(100)
  }
  [4]=>
  array(2) {
    [0]=>
    string(7) "DateArr"
    [1]=>
    int(105)
  }
  [5]=>
  array(2) {
    [0]=>
    string(6) "Charge"
    [1]=>
    int(135)
  }
  [6]=>
  array(2) {
    [0]=>
    string(15) "Charge Descript"
    [1]=>
    int(160)
  }
  [7]=>
  array(2) {
    [0]=>
    string(7) "Address"
    [1]=>
    int(210)
  }
}

This gives us a multidimensional array with both the field names contained in the header row and their offset positions, which will certainly come in handy as we process the remaining lines of the file. Time permitting, I’d like to drop this as a method into a php class, alongside methods to convert the extracted data into either a JSON or XML file. That way, I’ll have a utility class that can add value to other fixed width text files in the future.