Big Data and Data Science Track

TB Fuzzy-Matches Made Easy: Designing shareable SAS code to accurately and efficiently identify exact and close genetic matching tuberculosis isolates

Evan Timme

The Tuberculosis Genotyping Information Management System (TB-GIMS) is a genetic data source available to public health programs. We attempted to design universally shareable SAS code to efficiently analyze TB-GIMS data for the purpose of detecting isolates that exactly or closely match an identified active TB case of interest (IATCI).

The Arizona TB-GIMS extract contains 2,800+ isolates. The code was written in Base SAS 9.4 and designed for easy adaptation by other health departments. Modifiable macro thresholds set the inclusion criteria for matches. Conventional assessments require an exact match on one 15-digit number and two 12-character alphanumeric variables. Genetic changes over time are common, and fuzzy matches accommodate such changes. Using a simple DO loop with the SUBSTR function and a counter, we counted the total number of place-value matches in the 15-digit and two 12-character variables between each IATCI and all other isolates. Only fuzzy matches meeting or exceeding the macro-set thresholds are retained. Fuzzy matches were then assessed for distance (ZIPCITYDISTANCE function) and time (years between isolate collections). Output reports include a dot plot (time on the x-axis, distance on the y-axis), a single .XML file with a separate sheet for each IATCI, and a single .PDF with an individual report for each IATCI. When combined with epidemiologic data, the outputs were helpful for understanding potential transmission clustering.
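The place-value counting described above can be illustrated with a short sketch. This is not the authors' SAS code: it is a Python illustration of the same idea, and the field names (`code15`, `locus_a`, `locus_b`), the threshold values, and the sample strings are all made up for the example.

```python
def positional_matches(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(1 for x, y in zip(a, b) if x == y)


# Hypothetical minimum match counts per field (the abstract sets these
# through modifiable macro thresholds; the values here are invented).
THRESHOLDS = {"code15": 13, "locus_a": 11, "locus_b": 11}


def is_fuzzy_match(case: dict, other: dict, thresholds: dict = THRESHOLDS) -> bool:
    """Keep an isolate only if every field meets or exceeds its threshold."""
    return all(
        positional_matches(case[field], other[field]) >= minimum
        for field, minimum in thresholds.items()
    )


# Invented example values: one 15-digit field and two 12-character fields.
case = {"code15": "777776777760771", "locus_a": "223325173533", "locus_b": "444234423328"}
other = {"code15": "777776777760601", "locus_a": "223325173533", "locus_b": "444234423327"}

print(positional_matches(case["code15"], other["code15"]))  # 13
print(is_fuzzy_match(case, other))                          # True
```

Setting every threshold to the full field length (15, 12, 12) reduces the rule to the conventional exact-match assessment.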

This code is a helpful complement to epidemiologic linking data. User-defined thresholds allow for simple code adaptation and make efficient use of staff time and resources. These findings are encouraging and warrant further exploration.

View paper.

Multi-way Splits in Decision Trees Where The Dependent Variable Has More Than Two Levels

Russ Lavery and YuTing Tian

Decision trees are no longer new tools for data scientists and are frequently used to split people into binary groups (two-way splits). However, SAS Enterprise Miner can create decision trees whose nodes split into more than two branches. This can be very useful if an analyst is trying to assign observations to more than two groups. This paper uses examples to explore this powerful feature (multi-way splits) of SAS Enterprise Miner to predict both n-way categorical variables and continuous variables.
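To make the benefit of a multi-way split concrete, the sketch below compares a two-way and a three-way split of a three-level target using weighted Gini impurity. It is a Python illustration only: Gini is one commonly used splitting criterion and is assumed here for simplicity, not taken from the paper or from SAS Enterprise Miner's settings, and the labels are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one branch: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(branches):
    """Size-weighted average impurity over all branches of a split."""
    total = sum(len(branch) for branch in branches)
    return sum(len(branch) / total * gini(branch) for branch in branches)


# Invented target with three levels: A, B, C.
two_way = [["A", "A", "B", "B"], ["C", "C"]]
three_way = [["A", "A"], ["B", "B"], ["C", "C"]]

print(round(split_impurity(two_way), 4))    # 0.3333
print(round(split_impurity(three_way), 4))  # 0.0
```

In this toy case a single three-way split separates all three levels, whereas a binary tree would need a second split to reach the same purity.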

View paper.

Mocking Data To Support Performance Load Testing

Troy Martin Hughes

This text introduces the MOCKDATA macro, which creates sample data sets that can be used in load testing, stress testing, and other performance testing. Users can configure the MOCKDATA macro to alter the mock data set that is produced. Parameters include the number of observations, number of character variables, length of character variables, number of numeric variables, highest number saved as a numeric variable, and percentage of variables that are complete. MOCKDATA creates SAS data sets and/or text flat files, so it is well suited to testing the relative performance of input/output (I/O) functionality of flat files as compared with SAS data sets. Used in coordination with the author's PINCHLOG macro, MOCKDATA can quantitatively demonstrate the performance advantages (i.e., reduced runtime) of certain functionally equivalent methods over others. Data mocking is critical when data are sensitive and cannot be used, when data cannot be transferred (but must be tested among various locations), or simply to ensure statistically equivalent data for all types of load and performance testing.
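The parameter list above can be sketched as a small generator. This is not the MOCKDATA macro: it is a Python illustration, and the function name `mock_rows`, its argument names, the column naming scheme, and the reading of "percentage complete" as a per-value probability of being non-missing are all assumptions made for the example.

```python
import random
import string


def mock_rows(n_obs, n_char, char_len, n_num, max_num, pct_complete, seed=0):
    """Build a list of dict rows of random character and numeric values.

    pct_complete (0-100) is treated as the chance that any single value
    is populated; unpopulated values are stored as None.
    """
    rng = random.Random(seed)  # seeded so repeated runs give identical data
    rows = []
    for _ in range(n_obs):
        row = {}
        for i in range(1, n_char + 1):
            value = "".join(rng.choice(string.ascii_uppercase) for _ in range(char_len))
            row[f"char{i}"] = value if rng.random() * 100 < pct_complete else None
        for i in range(1, n_num + 1):
            value = rng.randint(0, max_num)
            row[f"num{i}"] = value if rng.random() * 100 < pct_complete else None
        rows.append(row)
    return rows


# 1,000 observations, 2 character variables of length 8,
# 3 numeric variables up to 999, about 90% of values populated.
sample = mock_rows(1000, 2, 8, 3, 999, 90)
print(len(sample), sorted(sample[0].keys()))
```

Fixing the seed is one way to obtain the same mock data at every test location without transferring any files.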

View paper.

Bayesian Nonparametric Clustering

Hend Aljobaily

Traditional parametric models use a fixed and finite number of parameters, which can make them a poor fit for data mining and machine learning: depending on the complexity of the model relative to the data, they may overfit or underfit. The Bayesian nonparametric (BNP) approach is an alternative to the traditional parametric approach. Nonparametric probabilistic models are appropriate for data mining and machine learning because they are data-driven. One of the most popular BNP models is the Dirichlet Process. For clustering, the Dirichlet Process in the Gaussian Mixture Model (GMM) is used to find the best number of clusters in the data using the gmm action in the CAS procedure within SAS®. It can add new clusters and remove existing clusters during the clustering process, thus finding the best number of clusters adaptively. In the gmm action, the Dirichlet Process serves as the prior for the mixture proportions of the Gaussian mixture. In this study, a real-world example will be used to demonstrate the use of the Dirichlet Process Gaussian Mixture Model for nonparametric clustering in SAS® to analyze a big dataset.
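One standard way to picture a Dirichlet Process prior over mixture proportions is the truncated stick-breaking construction, sketched below in Python. This illustrates the general technique only and makes no claim about how the gmm action is implemented; the function name, the truncation level, and the concentration values are assumptions chosen for the example.

```python
import random


def stick_breaking_weights(alpha, n_components, seed=0):
    """Draw truncated stick-breaking weights for a Dirichlet Process prior.

    Each step breaks off a Beta(1, alpha) fraction of the remaining stick.
    Returns the component weights and the leftover (unassigned) mass.
    """
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(n_components):
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


# A small alpha tends to concentrate mass on a few clusters;
# a large alpha tends to spread it over many.
for alpha in (0.5, 5.0):
    weights, leftover = stick_breaking_weights(alpha, n_components=10)
    print(alpha, [round(w, 3) for w in weights], round(leftover, 3))
```

Because the weights are not tied to a preset number of clusters, components with negligible weight can effectively drop out, which matches the adaptive behavior described in the abstract.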

View paper.

Treating the Data Overdose: Navigating Big Data to Find Answers to Our Growing Opioid Problem

Deanna Schreiber-Gregory

It is well known that the prevalence and severity of our opioid problem have been growing quickly over the past few years. As a health care provider in mental health, I have seen the impact that this issue has had not only on our culture and environment, but also on the individuals suffering from this addiction and on their loved ones. This is a heartbreaking issue that could benefit greatly from the expertise and dedication of data scientists, analysts, and data managers alike. The relevant data available to the general public is vast and robust, if only we knew how to use it!

This presentation will focus on how to identify meaningful databases to answer our data-for-good-themed question (opioid dependence and overdose prevalence), how to scrub and prepare our data for analysis (including a brief outline of common pitfalls, obstacles, and frequently overlooked considerations), which analyses are or could be meaningful, and how to present the data and results in a meaningful and attention-catching way. Given that not everyone has access to a variety of SAS software packages, this presentation concentrates on how to achieve our goals using Base SAS programming techniques. The author invites a spirited discussion on the different ways to explore this issue and where best to present and distribute our findings. Data scientists, analysts, and data managers would benefit the most from this presentation, but anyone with an interest in this topic is invited to attend!

View paper.