Stop Sabotaging by Not Standardizing your Data

As GIS professionals, we deal with a lot of data, much of which must be entered by hand – sometimes by multiple people over long periods of time.  While we often like to focus on new and exciting ways to use this data, such as new technologies or sophisticated forms of analysis, we don’t always focus on the details of how data was created in the first place.  Unfortunately, this can lead to problems that make those exciting methods less effective and harder to work with.  One important factor to consider is consistency in how values are entered in text fields.  

Here’s an example of inconsistently formatted data getting in the way of a good idea. I was once asked to create a Web AppBuilder widget for a city government that would let city employees select a section of a proposed construction project and then generate and download a PDF. This PDF needed to include the names and addresses of everyone who owned property within a certain distance, so that those addresses could easily be used to mail letters alerting citizens to  upcoming projects.  Getting all the properties within the required distance and extracting the address data was easy enough, but because the same person often owned multiple properties in the same area, the list of addresses was full of duplicates.  

I was asked to filter out the duplicates to get a set of unique property addresses, and that was where we ran into trouble: the owner and address values were very inconsistent. If Jane C Doe owned property at 123 Main Street, the name in the Owner field might have been entered as “Jane C Doe”, or “Doe, Jane C”, or “Doe Jane C”. Sometimes the Owner was “Doe J”, in which case someone had to work out manually whether this was the same person or a different person with the same last name and first initial. The Address might be “123 Main”, with “Street” in the Suffix field, or the Address field might also include the street suffix, as either “123 Main Street” or “123 Main St”. Sometimes the ZIP code was in the State field and vice versa. Though the goal had been to create a tool that completely automated address extraction, the end result still required a lot of work from the user to remove duplicates and sort out ambiguous data, all because the data was inconsistent.
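To make the duplicate problem concrete, here is a minimal Python sketch of the kind of normalization such a filter needs. It is not the code from the original widget; the suffix map, function names, and sample records are all hypothetical and only cover the variations described above.

```python
import re

# Hypothetical suffix map for illustration; a real one would need many more entries.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def _tokens(text):
    # Upper-case and replace punctuation with spaces so that
    # "Doe, Jane C" and "Doe Jane C" split into the same words.
    return re.sub(r"[^A-Za-z0-9 ]", " ", text).upper().split()


def owner_key(owner):
    # Sorting the words makes the key independent of name order.
    return tuple(sorted(_tokens(owner)))


def address_key(address, suffix=""):
    # Combine the Address and Suffix fields, then abbreviate the last word
    # if it is a known street suffix.
    tokens = _tokens(f"{address} {suffix}")
    if tokens:
        tokens[-1] = SUFFIXES.get(tokens[-1], tokens[-1])
    return " ".join(tokens)


def unique_records(records):
    # Keep the first record seen for each (owner, address) key.
    seen = set()
    unique = []
    for owner, address, suffix in records:
        key = (owner_key(owner), address_key(address, suffix))
        if key not in seen:
            seen.add(key)
            unique.append((owner, address, suffix))
    return unique


records = [
    ("Jane C Doe", "123 Main", "Street"),
    ("Doe, Jane C", "123 Main Street", ""),
    ("Doe Jane C", "123 Main St", ""),
    ("Doe J", "123 Main St", ""),
]
unique = unique_records(records)
```

With these sample records, the first three collapse into one entry, but “Doe J” survives as a separate entry. No amount of string cleanup can tell whether that is the same owner, which is why a person still had to review the list.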

When multiple people input data without clear standards in place, inconsistencies can arise.

This isn’t the only situation I’ve encountered where inconsistent data makes an otherwise good application harder to use. Text fields in ArcGIS are generally searched by string matching, so inconsistently formatted data will affect any application that involves searching for or comparing text in a field. If an application lets users search for an address and someone types in “123 Main Street” while the address in the database is “123 Main St”, that result will not be found.
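The search failure is easy to reproduce outside of ArcGIS. The short Python sketch below compares an exact string match with a match on normalized values. The suffix map and the stored addresses are made up for illustration, and the normalization is deliberately simplistic.

```python
# Hypothetical abbreviations; real address data would need a fuller list.
SUFFIX_ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def normalize(address):
    # Upper-case, drop periods, and abbreviate any known suffix word.
    words = address.upper().replace(".", "").split()
    return " ".join(SUFFIX_ABBREVIATIONS.get(word, word) for word in words)


stored = ["123 Main St", "456 Oak Avenue"]
query = "123 Main Street"

# Exact string comparison misses the record entirely.
exact_hits = [a for a in stored if a == query]

# Comparing normalized forms finds it.
normalized_hits = [a for a in stored if normalize(a) == normalize(query)]
```

Here `exact_hits` is empty while `normalized_hits` contains “123 Main St”. Normalizing at search time is only a workaround, though; it is far simpler when the stored values are consistent to begin with.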

This kind of problem can be found in a lot of datasets, but it’s particularly common in addresses, because addresses have so many pieces of information that can be entered in a variety of ways. Fixing these kinds of issues after the fact can be a daunting task, as no one wants to go back and redo work that’s already “done”. That’s why it’s important to enforce consistency in datasets early and often.

Adding field domains to the street suffix field to ensure consistency

One way you can do this with some fields is through the use of field domains. Field domains restrict the values that can be entered in a field. For example, you might create a field domain that only allows “ST” in the suffix field, not “St” or “Street”. Or you might create a field domain that only allows a field to have one of the standard two-letter abbreviations for the fifty states; if someone tries to enter a ZIP code in that field, the value will be rejected. For online datasets, you can improve your use of field domains with the Update Field Domains for Feature Layers tool in GEO Jobe’s Admin Tools for ArcGIS. However, domains won’t work for fields that can have any value, such as owner names or street addresses. In that case, it’s important to establish and rigorously enforce standards for entering data. Everyone involved in the data entry process should be informed of these standards, and new data entries should be reviewed regularly to make sure they conform to the chosen formatting.
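The same idea can be used to review existing records. The Python sketch below mimics what a field domain does by checking values against a fixed set of allowed entries. It does not use any ArcGIS API; the field names, allowed values, and sample records are hypothetical, and the state list is shortened for illustration.

```python
# Hypothetical allowed values, standing in for field domains.
SUFFIX_DOMAIN = {"ST", "AVE", "RD", "BLVD"}
STATE_DOMAIN = {"AL", "GA", "TN"}  # shortened for illustration

DOMAINS = {"Suffix": SUFFIX_DOMAIN, "State": STATE_DOMAIN}


def validate(record, domains=DOMAINS):
    # Return (field, value) pairs whose value falls outside the field's domain.
    return [
        (field, record.get(field))
        for field, allowed in domains.items()
        if record.get(field) not in allowed
    ]


good = {"Suffix": "ST", "State": "AL"}
bad = {"Suffix": "Street", "State": "12345"}

good_errors = validate(good)
bad_errors = validate(bad)
```

For the `good` record the result is an empty list, while the `bad` record is flagged for both the spelled-out suffix and the ZIP code sitting in the State field. A check like this could be run over new entries as part of a regular review.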
If you already have a large dataset with many inconsistencies, in an ideal world someone would review this data and make all necessary changes, but that isn’t always practical.  However, it’s never too late to come up with standards and put them in place for all future entries.  If you would like professional assistance in cleaning and standardizing data, or in working with data with inconsistencies, GEO Jobe’s Professional Services team is available for individualized consulting. This kind of work isn’t as fun or interesting as other aspects of GIS, but it’s crucial to ensuring that tools and applications can work as smoothly and effectively as possible.

Senior Application Developer