Wednesday Half-Day Class Ron Cody

An Introduction to Data Cleaning Techniques

Presented: Wednesday September 5, 2018, 8:00am-11:30am

Presented by:
Dr. Ron Cody was a Professor of Biostatistics at the Rutgers Robert Wood Johnson Medical School in New Jersey for 26 years and is now a writer and contract instructor for SAS Institute. He has been a SAS user since the late 1970s and has authored or coauthored books on both SAS programming and using SAS to perform statistical analysis.

This class discusses ways to look for data errors using both DATA step programming and Base SAS procedures. We start with programs and procedures to detect errors in character data. Functions such as NOTDIGIT and NOTALPHA are especially useful in this regard. For numeric data, we investigate methods for detecting data points outside of fixed ranges and continue to develop programs for automatic outlier detection.
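As a small taste of the character-data checks covered in class, the sketch below uses NOTDIGIT to flag patient numbers containing any non-digit character. The data set and variable names (clean.patients, Patno) are illustrative, not part of the class materials:

```sas
/* Sketch: flag character values that should contain only digits.
   Data set and variable names here are hypothetical. */
data _null_;
   set clean.patients;
   file print;
   /* NOTDIGIT returns the position of the first non-digit
      character, or 0 if every character is a digit.  TRIM
      removes trailing blanks, which would otherwise count
      as non-digit characters. */
   if notdigit(trim(Patno)) then
      put "Invalid patient number: " Patno;
run;
```

NOTALPHA works the same way for values that should be entirely alphabetic.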

Although this course does not require you to know the SAS macro language, you will be taught how to run macros that perform a variety of data cleaning tasks. For example, one macro, called AUTO_OUTLIERS, checks for possible data errors in numeric data, using a concept called "trimmed statistics."
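The idea behind trimmed statistics is to compute the mean and standard deviation after discarding the most extreme values, so that the outliers you are hunting for do not inflate the very cutoffs used to detect them. The following is a rough sketch of that idea, not the AUTO_OUTLIERS macro itself; the data set name, variable (HR), trim count, and cutoff multiplier are all illustrative:

```sas
/* Sketch of the trimmed-statistics idea (not the actual
   AUTO_OUTLIERS macro).  Names and constants are hypothetical. */
proc sort data=clean.patients(keep=Patno HR) out=sorted;
   by HR;
run;

/* Drop the 5 lowest and 5 highest values before computing stats */
data trimmed;
   set sorted nobs=n;
   if _n_ > 5 and _n_ le n - 5;
run;

proc means data=trimmed noprint;
   var HR;
   output out=stats mean=Mean std=Std;
run;

/* Flag values beyond 2 trimmed standard deviations of the
   trimmed mean */
data outliers;
   set clean.patients(keep=Patno HR);
   if _n_ = 1 then set stats;
   if HR ne . and (HR lt Mean - 2*Std or HR gt Mean + 2*Std)
      then output;
run;
```

The class macro generalizes this pattern so you can supply the data set, variable, and trimming level as parameters.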

The class ends with a demonstration of integrity constraints and audit trails. Integrity constraints allow you to define rules or constraints on one or more data values. For example, you may require that values for Gender must be 'M' or 'F'. As another example, you may define valid ranges for each of your numeric variables. All these constraints are stored in the data descriptor portion of your SAS data set. Once these constraints are in place, they can prevent new data that violates one or more constraints from being added to your data set.
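Integrity constraints and audit trails are created with PROC DATASETS. The sketch below shows the general shape; the library, data set, constraint names, and valid ranges are illustrative:

```sas
/* Sketch: adding check constraints and an audit trail.
   Library, data set, and constraint names are hypothetical. */
proc datasets library=clean nolist;
   modify patients;
   /* Gender must be 'F' or 'M' */
   ic create gender_chk = check (where=(Gender in ('F','M')));
   /* Heart rate must be in range, or missing */
   ic create hr_chk = check
      (where=(HR between 40 and 100 or HR is missing));
run;

/* Start an audit trail to record rejected observations */
proc datasets library=clean nolist;
   audit patients;
   initiate;
run;
quit;
```

Once the audit trail is initiated, observations rejected by a constraint are recorded and can be reviewed by reading the data set with the TYPE=AUDIT option.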

Intended Audience: Beginner to intermediate

Tools Discussed: Base SAS

Prerequisite: None

Class Outline:

  • Introduction
  • Working with character data
    • Introduction
    • Description of the Patients data set
    • Using PROC FREQ to list values
    • Using a DATA step to check for invalid values
    • Describing some useful data cleaning functions
      • VERIFY
      • COMPBL
      • COMPRESS
      • MISSING
      • TRIM and STRIP
    • Using formats to check for invalid values
  • Using Perl regular expressions to verify the structure of character data
    • Introduction
    • Using PRXMATCH to verify data
    • Examples using US Zip codes and Canadian postal codes
    • Standardizing data
    • Describing the PUT function
    • Standardizing company names using SAS formats
    • Describing the TRANWRD function
    • Using TRANWRD to standardize street names
  • Data cleaning techniques for numeric data
    • Using PROC UNIVARIATE to list the ‘n’ lowest and highest values
    • Developing a macro to list the ‘n’ lowest and highest values
    • Using PROC UNIVARIATE to list the top and bottom ‘n’ percent of data values
    • Using pre-determined ranges to check for possible data errors
      • Using a DATA step to check for out-of-range values
      • Developing a macro to check for out-of-range values
      • Describing a macro to check multiple out-of-range values and produce a consolidated report
    • Describing automatic numeric outlier detection
      • Using means and standard deviations to detect outliers
      • Using trimmed statistics to improve outlier detection
      • Describing a macro for automatic detection of outliers
      • Using exploratory data techniques (EDT) to detect outliers
    • Describing issues related to missing values and special values (such as 9999) that represent types of missing data
      • Introduction
      • Using the SAS log to detect invalid values
      • Detecting character missing values using the DATA step
      • Counting the number of missing values for character or numeric variables
      • Creating a macro to detect and count specific data values (such as 9999)
    • Working with dates
      • Introduction
      • Checking for dates outside a specified range
      • Checking for invalid dates
    • Looking for duplicates and ‘n’ observations per group
      • Introduction – describing the difference between duplicate records and duplicate key variables
      • Using PROC SORT to detect duplicate records
      • Using PROC SORT to detect duplicate key values
      • Detecting duplicates using a DATA step
      • Using a DATA step to count the number of patient visits
      • Using a DATA step to ensure there are ‘n’ observations per subject
    • Working with multiple files
      • Determining that there is a corresponding ID in each of two files
      • Determining that there is a corresponding ID in each of three files
      • Describing a macro to determine if there is a corresponding ID in each of ‘n’ files
    • Correcting errors
      • “Hard coding” corrections for small data sets with few errors
      • Describing List Input
      • Using a transaction data set to correct data errors in a master file
    • Creating integrity constraints and audit trails
      • Introduction
      • Creating and using general integrity constraints
        • Adding an audit trail to keep track of errors
        • Demonstrating general constraints involving more than one variable