Spatially enabling the data: What is geocoding?
Chapter 4: Mapping Crime and Geographic Information Systems
If we break the word geocoding into its components, it means coding the Earth: providing geographic reference information that can be used for computer mapping. The history of geocoding is tied to efforts at the U.S. Census Bureau to find ways of mapping data gathered across the country, address by address.
In the 1960 Census of Population and Housing, questionnaires were mailed to respondents and picked up from each household by enumerators. In 1970, the plan was to use the mail for both sending and returning surveys, hence references to that census as "mail out/mail back." This demanded geocoding capability and, subsequently, the development of an address coding guide (ACG). According to Cooke (1998), the Data Access and Use Labs created to accomplish this were responsible for creating today's demographic analysis industry.
The first geocoding efforts permitted only street addresses to be digitized (admatch), but the capability to show blocks and census tracts was soon added. This demanded that block faces be recognized, and this was done by digitizing the nodes representing intersections. This, in turn, meant that intersections had to be numbered and address ranges had to be reconciled to the correct block faces. The shape of the lines on the map had to be precisely determined and annotated, creating the map's topology. The name given to this new block mapping process was dual independent map encoding (DIME) and, when combined with the address matching process, it was referred to as ACG/DIME. By 1980, ACG/DIME had become geographic base file (GBF)/DIME. This was followed by a call for a nationwide, seamless, digital map, to be called TIGER, short for topologically integrated geographic encoding and referencing. Census Bureau geographer Robert Marx and his team implemented TIGER for the 1990 Census (Cooke, 1998; Marx, 1986).
TIGER files contain address ranges rather than individual addresses. An address range refers to the first and last possible structure numbers along a block face, even though the physical structures may not exist (figure 4.5). For each chain of addresses between the start node and end node, there are two address ranges, one for odd numbers on the left, the other for even numbers on the right. For a complete explanation, see U.S. Census Bureau (1997).
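As a rough illustration of how an address range can be turned into a map position, the following Python sketch places a structure number along a chain by simple linear interpolation between the start and end nodes. The class name, field names, and the straight-line interpolation are assumptions made for this example; they are not the TIGER record layout, and the odd-left, even-right rule simply follows the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]

@dataclass
class Chain:
    """One street chain; all field names here are hypothetical."""
    start: Point      # x-y coordinates of the start node
    end: Point        # x-y coordinates of the end node
    left_from: int    # first possible odd number (left side)
    left_to: int      # last possible odd number (left side)
    right_from: int   # first possible even number (right side)
    right_to: int     # last possible even number (right side)

def locate(chain: Chain, number: int) -> Optional[Tuple[str, Point]]:
    """Estimate the side and x-y position of a structure number on a chain."""
    if number % 2 == 1:
        side, first, last = "left", chain.left_from, chain.left_to
    else:
        side, first, last = "right", chain.right_from, chain.right_to
    if not min(first, last) <= number <= max(first, last):
        return None  # out-of-range address: cannot be placed on this chain
    # Fraction of the way from the start node to the end node.
    fraction = 0.5 if first == last else (number - first) / (last - first)
    x = chain.start[0] + fraction * (chain.end[0] - chain.start[0])
    y = chain.start[1] + fraction * (chain.end[1] - chain.start[1])
    return side, (x, y)
```

For a chain running from (0, 0) to (100, 0) with ranges 101 to 199 on the left and 100 to 198 on the right, locate(chain, 150) falls on the right side roughly halfway along. Because the range covers possible rather than actual structure numbers, the result is an estimate of position, not a surveyed location.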
Geocoding is vitally important for crime mapping since it is the most commonly used way of getting crime or crime-related data into a GIS. Crime records almost always have street addresses or other locational attributes, and this information enables the link between the database and the map.
How does the computer map in a GIS know where the data points should be put? It reads the x-y coordinates representing their locations. When crime locations are geocoded, the address is represented by x-y coordinates, usually either in latitude and longitude decimal degrees or in State-plane x-y coordinates identified by feet or meter measurements from a specific origin. The big headache in working with address data is that those data are often ambiguous and may be erroneously entered in field settings. Common field errors include:
- Giving a street the wrong directional identifier, such as using east instead of west or north instead of south.
- Giving a street the wrong suffix or street type (e.g., "avenue" instead of "boulevard"); providing no suffix when there should be one.
- Using an abbreviation the streets database may not recognize (e.g., St., Ave., Av., or Blvd.).
- Misspelling the street name.
- Providing an out-of-range, or impossible, address. For example, a street is numbered 100 to 30000, but an extra zero is added, accidentally producing the out-of-range number 300000.
- Omitting the address altogether.
As you initially attempt automatic geocoding, street addresses are compared against the existing street file database, and coordinates are assigned to the "hits." This process is sometimes called batch matching. The process is a one-time affair, done automatically. Then, it becomes necessary to deal with the "misses," those addresses that did not geocode automatically.
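Batch matching, and the way several of these field errors turn into misses, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular GIS package works: the abbreviation tables, the street file layout (street name mapped to a lowest and highest valid number), and the function names are all hypothetical.

```python
import re

# Hypothetical lookup tables; a real street file would need far more entries.
SUFFIX = {"ST": "STREET", "AVE": "AVENUE", "AV": "AVENUE", "BLVD": "BOULEVARD"}
DIRECTION = {"N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST"}

def normalize(address):
    """Uppercase, drop periods and commas, and expand known abbreviations."""
    tokens = re.sub(r"[.,]", "", address.upper()).split()
    out = []
    for i, tok in enumerate(tokens):
        if i == len(tokens) - 1 and tok in SUFFIX:
            tok = SUFFIX[tok]          # street type is assumed to come last
        elif i == 1 and len(tokens) > 3 and tok in DIRECTION:
            tok = DIRECTION[tok]       # directional right after the number
        out.append(tok)
    return " ".join(out)

def batch_match(addresses, street_file):
    """Split addresses into hits and misses in one automatic pass.

    street_file maps a normalized street name to (lowest, highest) number.
    """
    hits, misses = {}, []
    for raw in addresses:
        parts = normalize(raw).split(" ", 1)
        if len(parts) == 2 and parts[0].isdigit():
            number, street = int(parts[0]), parts[1]
            valid = street_file.get(street)
            if valid and valid[0] <= number <= valid[1]:
                hits[raw] = (number, street)
                continue
        misses.append(raw)  # wrong suffix, out of range, missing, and so on
    return hits, misses
```

With a street file containing only "WEST MAIN STREET" (100 to 30000) and "PERSHING AVENUE", the entry "250 W. Main St." becomes a hit once its abbreviations are expanded, while "300000 W Main St" (out of range), "6256 Pershing Street" (wrong street type), and an empty address all end up as misses for manual review.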
Handling misses is done manually. The bad address is displayed with the closest possible matches the database includes. Analysts use these options to select the most likely match. This involves some guesswork and risks geocoding errors. For example, if the address entered is 6256 Pershing Street, and the only reference to Pershing in the database is to Pershing Avenue, then assigning the geocode to "avenue" is not likely to be an error. On the other hand, if the database also contains Pershing Boulevard, Pershing Circle, and other Pershing suffixes, assigning "avenue" could be wrong. This shows how important it is to have standards for entering addresses into a file, whether the system deals with records or computer-assisted dispatching (CAD).
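The Pershing example can be expressed as a small helper that lists the database streets sharing a base name and flags whether the choice is clear or ambiguous. This is a sketch only: the function names and labels are hypothetical, street names are assumed to be stored in uppercase, and the last word of a street name is assumed to be its street type. The final decision still rests with the analyst.

```python
def candidates(street, street_names):
    """Return database streets with the same base name, ignoring street type."""
    base = street.upper().rsplit(" ", 1)[0]
    return sorted(s for s in street_names if s.rsplit(" ", 1)[0] == base)

def review(street, street_names):
    """Label a miss as 'likely', 'ambiguous', or 'unresolved' for the analyst."""
    found = candidates(street, street_names)
    if len(found) == 1:
        return "likely", found       # only one street carries this name
    if found:
        return "ambiguous", found    # several street types share the name
    return "unresolved", found       # nothing in the database to offer
```

Given a database holding only Pershing Avenue, review("Pershing Street", ...) reports a single likely candidate; once Pershing Boulevard and Pershing Circle are also present, the same entry is flagged as ambiguous.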
Not all records in large data sets are likely to be successfully geocoded. The title of a section in a chapter in the MapInfo Professional User's Guide (MapInfo Corporation, 1995), "Troubleshooting: Approaching the 100% Hit Rate," hints at this. Some records may not be salvageable for a variety of reasons, including ambiguity in an address that cannot be resolved. Two other issues deserve mention, as well. First, street addresses are estimated along block faces and may not represent true block face locations. (For more on this, consult technical documentation.) Second, address matching can be done for locations other than street addresses, such as street centerlines, land parcels, or buildings, depending on the availability of each element in a spatially enabled format.
Surprisingly, there is no minimum standard for geocoding. Maps can be produced and distributed based on a 25-percent hit rate. Readers may have no idea that a map represents only a small fraction of all cases. Worse, the missing cases may not be randomly distributed, thus possibly concealing a critical part of the database. For example, in the geocoding process, a person or persons may be inept or may decide to distort the data. If this error originates in the field, it will probably have a geographic bias based on the location of the person making errors. Analysts may consider reporting the hit rate for geocoding to better inform map readers.
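One way to check both the overall hit rate and whether misses cluster geographically is to compute the rate separately for each area. The Python sketch below assumes, hypothetically, that every record carries an area label (such as a beat or district) that does not depend on the address being geocoded; the function name and record layout are illustrative only.

```python
from collections import defaultdict

def hit_rates(records):
    """Return (overall percent, {area: percent}) of records that geocoded.

    records is an iterable of (area, geocoded) pairs, where geocoded is
    True for a hit and False for a miss.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for area, geocoded in records:
        totals[area] += 1
        hits[area] += bool(geocoded)
    if not totals:
        raise ValueError("no records supplied")
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    by_area = {area: 100.0 * hits[area] / totals[area] for area in totals}
    return overall, by_area
```

For example, an overall rate of 55 percent could hide a 90-percent rate in one district and a 20-percent rate in another, which is the kind of geographic bias described above.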
Although most map users may not understand the hit rate, a technical footnote reading, "X percent of cases were omitted due to technical problems, but the police department considers the pattern shown to be representative of the total cases under consideration," may clarify the information. (Seek legal advice for actual wording.)
Given that there is no minimum standard, the issue becomes: What hit rate is acceptable? This is a subjective decision, but a 60-percent hit rate is unacceptable and may lead to false assumptions. Hit rates this low should raise questions about a crime analysis unit's level of readiness because low hit rates indicate that the base maps in use and/or incoming data are seriously deficient.
A distinction needs to be made between the hit rate and another geocoding measure, the match score. The latter is a score derived from matches on each component of the address. If all components of an address are correct (street name, direction, street type), the address will receive a perfect score. Missing or incorrect parts reduce the score. This differs from the hit rate, which is the percentage of all addresses that are capable of being geocoded in either batch or manual mode. Therefore, the hit rate and match score can be used to set acceptable geocoding standards. However, setting the acceptable threshold of either rate too high or too low may result in too few records making the cut or, in the worst case scenario, incidents being given wrong addresses, thus placing crimes on the map where they did not happen.
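To make the idea of a match score concrete, the following Python sketch scores an entered address against a reference record component by component. The weights and the acceptance threshold of 80 are invented for illustration; geocoding software uses its own scoring schemes, which are not reproduced here.

```python
# Hypothetical weights for each address component; they sum to 100.
WEIGHTS = {"name": 60, "direction": 20, "type": 20}

def match_score(entered, reference):
    """Add up the weights of the components on which two addresses agree."""
    return sum(
        weight
        for part, weight in WEIGHTS.items()
        if entered.get(part, "") == reference.get(part, "")
    )

def accept(score, threshold=80):
    """Accept a match only if its score reaches the chosen threshold."""
    return score >= threshold
```

With a reference of West Pershing Avenue, an entry with the wrong street type scores 80 and is accepted at this threshold, while an entry that also omits the direction scores 60 and is rejected. Lowering the threshold would admit more records but raise the risk of placing incidents at the wrong address.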
Like some other aspects of computer mapping, geocoding can be quite involved and demands considerable practice before you can regard yourself as an expert. The technical procedures used to fix geocoding problems are beyond this document's scope. Readers are referred to the user's guides, online help, and reference guides that accompany software or that are available on a proprietary basis. Asking more experienced GIS users for advice, perhaps in other departments of local government (management information systems, planning, engineering, and so forth), is another possibility. For additional information, see Block (1995).