Analytics and Statistics Track

Exploring the Skills Needed by the Data Science / Analytics Professional

Kirk Paul Lafler

As 2.5 quintillion bytes (a 1 followed by 18 zeros) of new data are created every day, the age of big data has taken on new meaning, with a renewed sense of urgency to prepare students, young professionals, and other workers across job functions for today's and tomorrow's analytics roles and the analytical skills needed to tackle growing data demands. As organizations embrace Data Science / Analytics skills and tools, LinkedIn, a leading professional networking and employment-oriented website and app, found that Data Scientist roles saw a 56% increase in the US job market in 2018. To keep up with the huge demand for analytics talent in 2019 and beyond, many colleges, universities, and training organizations offer comprehensive Data Science / Analytics degree and certificate programs to meet the increasing demand for analytical skills. This presentation explores the skills needed by the Data Science / Analytics professional, including critical thinking; statistical programming languages such as SAS®, R, or Python; Structured Query Language (SQL); Microsoft Excel; and data visualization.

View paper.

Limit of Detection (LoD) Estimation Using Maximum Likelihood from (Hit) Rate Data: The LoD_MLE SAS Macro

Jesse Canchola, Jeffrey E. Vaks and Shaowu Tang

The Limit of Detection (LoD) is defined as the lowest concentration or amount of material, target, or analyte that is consistently detectable (for quantitative PCR studies, in at least 95% of the samples tested) [1]. In practice, the LoD is estimated by fitting a parametric curve to a set of panel member (PM1, PM2, PM3, etc.) data in which the responses are binary. Typically, the parametric curve fit to the percent-detection levels takes the form of a probit or logistic distribution. The LoD_Est SAS macro (Canchola & Hemyari, SAS Global Forum 2016), which uses the SAS PROBIT procedure as its main engine, fits such parametric curves. A rarely used but preferred method applies maximum likelihood (ML) to estimate the LoD assuming one detectable copy of template. We introduce the LoD_MLE SAS macro, which maximizes the log-likelihood function and returns the ML estimate (MLE) of the LoD along with its 95% confidence interval (CI). In addition, the macro returns the percent-detection table with the associated 95% exact (Clopper-Pearson) confidence intervals for the hit rates at each level.
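
The sketch below is not the authors' LoD_MLE macro; it is a minimal illustration of the conventional probit-based fit that the LoD_Est approach automates, assuming a hypothetical hit-rate dataset hitrates with variables conc (panel-member concentration), tested (replicates tested), and detected (replicates with a positive call).

    proc probit data=hitrates log10;                        /* LOG10 models detection against log10(concentration)     */
       model detected/tested = conc / d=normal inversecl;   /* D=NORMAL gives the probit fit; INVERSECL reports the     */
    run;                                                    /* concentration (with CI) at each probability, incl. 0.95 */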

View paper.

Measuring Test-Retest Reliability: The Intraclass Kappa

Dennis Fisher, Grace Reynolds-Fisher, Eric Neri, Art Noda and Helena Kraemer

Anyone using structured interview or questionnaire instruments must establish the psychometric properties of the instrument (i.e., reliability and validity). Reliability must be established first, because a measure cannot be valid unless it is sufficiently reliable. When the response data are dichotomous (Yes/No, Presence/Absence, Positive/Negative, etc.), the most common measure in the literature is Cohen's kappa (Cohen, 1960). This measure is appropriate for interrater reliability, in which the responses from two different raters are assessed for agreement. However, many reliability studies collect data from the same rater at two different points in time; this is known as intrarater or test-retest reliability. Cohen's kappa "forgives" rater bias, which is not desirable for a measure used in test-retest reliability assessment. The correct statistic to use is the intraclass kappa (Kraemer, Periyakoil, & Noda, 2002). We present a SAS macro that uses a bootstrap procedure to obtain both the point estimate and the confidence limits of the intraclass kappa, so that applied researchers reporting test-retest reliability can report the correct statistic. The macro will not run in SAS 9.4 (TS1M2); it will run in 9.4 (TS1M5) and 9.4 (TS1M6).
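
For reference (this is not the authors' macro), the point estimate of the intraclass kappa for paired binary data can be computed directly, assuming a hypothetical dataset retest with one row per subject and 0/1 variables time1 and time2; the authors' macro additionally bootstraps this quantity (for example, with resamples drawn by PROC SURVEYSELECT) to obtain confidence limits.

    proc sql noprint;
       select mean(time1 = time2),                  /* observed agreement p_o                                       */
              (mean(time1) + mean(time2)) / 2       /* pooled "yes" proportion p (the two occasions are treated     */
         into :p_o, :p                              /* as interchangeable, which is what distinguishes the          */
         from retest;                               /* intraclass kappa from Cohen's kappa)                         */
    quit;

    data _null_;
       p_e   = &p*&p + (1 - &p)*(1 - &p);           /* chance agreement from the pooled margin */
       kappa = (&p_o - p_e) / (1 - p_e);            /* intraclass kappa point estimate         */
       put "Intraclass kappa = " kappa 8.4;
    run;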

View paper.

General, Age-Specific, and Total Fertility Rates in Montana, 2008-2017

Matthew Ringel

Fertility rates have been steadily declining in the United States over the past several years, and this trend may have implications for the economy and for other elements of society. Some are concerned because declining fertility rates may be indicative of economic anxiety and could also cause future problems for the economy due to a dwindling number of workers and consumers. Others see this as a positive change because it could indicate higher educational and career achievement among women and could lead to decreased consumption and pollution. Either way, this trend may be of interest to federal, state, and local governments. This paper shows readers how to use SAS to calculate general, age-specific, and total fertility rates and how to use linear regression to test for a significant trend. It uses Montana Birth Data and National Center for Health Statistics (NCHS) Population Data to illustrate the process. In Montana, between 2008 and 2017, there was a significant downward trend in both the general fertility rate and the total fertility rate. This trend was largely due to a substantial reduction in births among women between the ages of 15 and 24 and may reflect women delaying childbirth for educational purposes. Therefore, there is reason to believe that the current low total fertility rate may be temporary. The trend should be monitored to see whether it continues or reverses course.
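
A minimal sketch of the rate arithmetic and trend test described above, assuming a hypothetical dataset mt_fertility with one row per year and 5-year age group (15-19 through 40-44) and the variables year, births, and women; the paper's actual handling of the Montana birth and NCHS population files is more involved.

    data asfr;
       set mt_fertility;
       asfr = 1000 * births / women;                   /* age-specific fertility rate per 1,000 women            */
    run;

    proc sql;
       create table rates as
       select year,
              1000 * sum(births) / sum(women) as gfr,  /* general fertility rate, women aged 15-44               */
              5 * sum(asfr) / 1000            as tfr   /* total fertility rate, expected births per woman        */
         from asfr
        group by year;
    quit;

    proc reg data=rates;                                /* linear trend: test whether the slope on year is nonzero */
       model gfr = year;
       model tfr = year;
    run;
    quit;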

View paper.

Automating a Summary Report of PSM Model and Match Results

Desiree Hall and Blake Krebs

When performing propensity score modeling and matching, various pieces of output are produced across separate reports and datasets. As you iterate the process after adjusting the model or match inputs, it is helpful to have the key results in one report for easy comparison with later iterations. This paper demonstrates a method to automate the creation of a report that summarizes the key results using the ODS output from PROC LOGISTIC and PROC PSMATCH. The intended audience is users with a basic familiarity with SAS/STAT software.
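
One possible skeleton (not the authors' report code) captures the relevant ODS tables as datasets so they can be stitched into a single summary. The dataset cohort, the treatment variable trt, and the covariates below are hypothetical, and the exact ODS table names (especially for PROC PSMATCH) should be verified with ODS TRACE because they can differ by SAS/STAT release.

    ods trace on;                                             /* write ODS table names to the log          */

    proc logistic data=cohort;
       class sex;
       model trt(event='1') = age sex comorbid;
       ods output ParameterEstimates=pe Association=assoc;    /* Association holds the c-statistic         */
    run;

    proc psmatch data=cohort region=allobs;
       class trt sex;
       psmodel trt(treated='1') = age sex comorbid;
       match method=greedy(k=1) caliper=0.25;
       assess lps var=(age comorbid);
       output out(obs=match)=matched matchid=_matchid;
       ods output StdDiff=stddiff;                            /* verify this table name with ODS TRACE     */
    run;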

View paper.

Statistical Power Calculations Using SAS Software

Russ Lavery

SAS can perform power calculations for many of the tests used by pharmaceutical companies. This paper reviews the basics of statistical power calculations and provides examples of how to use the two PROCs that perform these calculations.
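
The two procedures are presumably PROC POWER and PROC GLMPOWER, SAS/STAT's power-analysis procedures. A minimal PROC POWER example, solving for the per-group sample size of a two-sample t-test under assumed effect and variability values, looks like this:

    proc power;
       twosamplemeans test=diff
          meandiff  = 5        /* clinically meaningful difference     */
          stddev    = 12       /* assumed common standard deviation    */
          alpha     = 0.05
          power     = 0.80     /* target power                         */
          npergroup = .;       /* solve for the per-group sample size  */
    run;

PROC GLMPOWER covers more general linear models and works from an exemplary dataset rather than summary values.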

View paper.

Using Convergence Status in a Macro to Iterate Through a Series of Models

Kristi Metzger

Some analyses require running a series of models with the same specifications but with different outcomes. In epidemiologic analyses, we often want to record the results of a crude model (i.e., one with no covariates) and an adjusted model (i.e., one that includes covariates). In these situations, it can be helpful to construct a macro that loops through both the crude and adjusted models and outputs the results into data sets using ODS OUTPUT statements. For analyses in which an iterative model may fail to converge (such as for rarer outcomes), it is helpful to also output the convergence status, which can be used within the macro to automatically direct the analytic process through the desired analyses if convergence is attained, or to alternative options if it is not. In this example, I use a series of repeated-measures generalized estimating equations models in PROC GENMOD to model rate ratios of specific types of motor vehicle crashes by driver age. For each crash type, the macro loops through a crude model, then a partially adjusted model (controlling for a limited number of covariates), and then a fully adjusted model (controlling for all covariates) if the more parsimonious model converges, or exits the macro loop if it does not. The macro concatenates the results of each model for all outcomes into a single dataset with indicators for the type of model (crude, partially adjusted, or fully adjusted).
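
A skeleton of the convergence check (not the author's macro, and with hypothetical dataset, variable, and model specifications) might capture the ConvergenceStatus table from PROC GENMOD and branch on its Status value, which is 0 when the model converged:

    %macro fit_gee(outcome=, covars=, label=);
       ods output GEEEmpPEst=est_&label ConvergenceStatus=cs_&label;
       proc genmod data=crashes;
          class driver_id agegrp;
          model &outcome = &covars / dist=poisson link=log;
          repeated subject=driver_id / type=exch;
       run;

       %let converged = 0;
       data _null_;
          set cs_&label;
          call symputx('converged', (status = 0));    /* Status = 0 means the model converged */
       run;

       %if &converged %then %put NOTE: &label model converged.;
       %else %put WARNING: &label model did not converge - falling back to a simpler model.;
    %mend fit_gee;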

View paper.

That's the One: Choosing the Correct Statistical Test for Trend in SAS

Christina Park, Jui-Ting Hsiung, Melissa Soohoo and Elani Streja

Tests for trend are an informative and useful tool to examine the directionality of differences in the means, medians, or proportions of continuous or categorical variables across ordered groups. In clinical and epidemiological research, comparisons of baseline patient characteristics (e.g., demographic, clinical, and laboratory data) across ordered levels of the categorized primary exposure are often examined with chi-square or analysis of variance (ANOVA) tests, which identify the existence of differences but not trends across the ordered groups. Trend tests provide additional insight into the pattern of the relationship between independent and dependent variables. Multiple methods are available in SAS to evaluate trends in continuous and categorical variables, using PROC REG (simple linear regression) and PROC FREQ (Cochran-Armitage, Jonckheere-Terpstra, and Cochran-Mantel-Haenszel tests). However, choosing the appropriate statistical test can be a challenge. The choice of test depends on assumptions about the variable of interest, including its type and distribution, and selecting an inappropriate test may lead to incorrect inferences about the association of the variable across ordered exposure groups. This is important, especially when the results of trend tests influence which variables are considered as covariates in adjusted models. In this paper, we aim to (1) describe when to use specific statistical tests to evaluate trends in continuous or categorical variables across ordered groups, and (2) provide examples of SAS code for trend tests and interpret the resulting output.
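
A compact sketch of the corresponding SAS statements, with hypothetical dataset and variable names, is:

    /* Categorical outcomes across ordered exposure groups:                          */
    /*   TREND = Cochran-Armitage, JT = Jonckheere-Terpstra, CMH = Mantel-Haenszel   */
    proc freq data=baseline;
       tables exposure_quartile * diabetes  / trend cmh;   /* binary outcome  */
       tables exposure_quartile * ckd_stage / jt;          /* ordinal outcome */
    run;

    /* Continuous variable: simple linear regression on the ordinal group score */
    proc reg data=baseline;
       model bmi = exposure_quartile;
    run;
    quit;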

View paper.

A Strategy for Speed Dating

YuTing Tian

Online dating is a growing industry, with recent quarterly profits well in excess of millions of dollars. The goal of PIZZAZ.com is to break into this industry by using the power of statistics to optimally match couples. To attain this target, we use a fictional speed-dating dataset to build models that test the feasibility of using statistics to match couples.

View paper.

Implementing the Hamiltonian Monte Carlo Sampling Algorithm in PROC MCMC

Maria Vazquez and Jeffrey Wilson

Bayesian statistical methods are widely used in a variety of fields, such as biomedical research, public health, the social sciences, and banking. Researchers can fit Bayesian models in SAS using the MCMC procedure (PROC MCMC). However, the procedure's default sampling algorithms for the Markov chains are based on the Metropolis method, which can be inefficient and may require a large number of iterations to converge. The Hamiltonian Monte Carlo algorithm is an alternative sampling method that usually converges faster than Metropolis approaches. We show how to request this algorithm in PROC MCMC using the ALG option and compare its performance with the default, using clustered data to fit a hierarchical logistic regression model.
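
As a rough illustration, and under the assumption that the ALG= option and its NUTS value are available in the SAS/STAT release being used, a hierarchical logistic regression on clustered data might be requested as follows; the dataset clinics and the variables y, x, and clinic are hypothetical.

    proc mcmc data=clinics nmc=20000 seed=2702 outpost=posterior
              alg=nuts;                                /* No-U-Turn variant of HMC; assumption: supported in this release */
       parms beta0 0 beta1 0;
       parms s2u 1;
       prior beta0 beta1 ~ normal(0, var=100);
       prior s2u ~ igamma(0.01, scale=0.01);
       random u ~ normal(0, var=s2u) subject=clinic;   /* clinic-level random intercept */
       p = logistic(beta0 + beta1*x + u);              /* inverse-logit link            */
       model y ~ binary(p);
    run;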

View paper.

Exploring the Difference in Multicollinearity Adjustments Between Logistic and Linear Regression

Deanna Schreiber-Gregory

Multicollinearity can be briefly described as the phenomenon in which two or more identified predictor variables are linearly related, or codependent. The presence of this phenomenon can have a negative impact on an analysis as a whole and can severely limit the conclusions of a research study, regardless of whether you are employing linear or logistic regression techniques.

In this paper, we will briefly review how to detect multicollinearity and, once it is detected, which regularization techniques are most appropriate to combat it. The nuances and assumptions of L1 (LASSO), L2 (ridge regression), and elastic net regularization, and their application to linear and logistic regression model construction, will be covered in order to provide adequate background for appropriate analytic implementation. This paper is intended for any level of SAS® user and is written for an audience with a background in theoretical and applied statistics, though the material is presented so that readers at any level of statistical or mathematical knowledge will be able to follow the content.
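
As a hedged sketch of what this looks like in practice for the linear case, assume a hypothetical dataset study with response y and predictors x1-x10; PROC GLMSELECT also offers SELECTION=ELASTICNET in recent releases, and logistic-side analogues (such as LASSO in PROC HPGENSELECT) depend on the products and releases available.

    /* Detection: variance inflation factors and collinearity diagnostics */
    proc reg data=study;
       model y = x1-x10 / vif collin;
    run;
    quit;

    /* L1 (LASSO) shrinkage with the penalty chosen by cross validation */
    proc glmselect data=study plots=coefficients;
       model y = x1-x10 / selection=lasso(choose=cv) cvmethod=random(5);
    run;

    /* L2 (ridge) trace over a grid of ridge parameters */
    proc reg data=study outest=ridge_est ridge=0 to 0.10 by 0.01;
       model y = x1-x10;
    run;
    quit;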

View paper.

Variable Selection Using Random Forests

Denis Nyongesa

Random forests, introduced by Leo Breiman in 2001, are an increasingly used statistical method for classification and regression. A good prediction model begins with a strong feature-selection process. This paper proposes ways of selecting the important variables to include in a model using random forests: candidate variables are indexed, or ranked, according to their variable-importance scores.
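
One way to produce such a ranking in SAS (assuming access to PROC HPFOREST, which ships with SAS Enterprise Miner; PROC FOREST plays a similar role in SAS Viya) is to capture the variable-importance table and sort it. The dataset and variable names below are hypothetical.

    proc hpforest data=train maxtrees=300 seed=1234;
       target outcome    / level=binary;
       input age bmi sbp / level=interval;
       input sex smoker  / level=nominal;
       ods output VariableImportance=vimp;   /* importance scores; confirm the table and column names with ODS TRACE */
    run;

    proc sort data=vimp;                     /* rank candidate predictors by their importance score */
       by descending Gini;                   /* column name varies by target type and release       */
    run;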

View paper.

Polynomial Regression for Modeling Curvilinear Data: A Biological Example

Elisha Johnston

This paper introduces polynomial regression, which is useful for analyzing curvilinear data. I illustrate the power of the technique with biological data I generated for the 2019 science fair. To select the best-fitting model, I used the Bayesian Information Criterion (BIC), which reduces overfitting by penalizing higher-order models. To visualize model fit, I overlaid the selected model's fitted curve on a plot of the data. To estimate the coefficients and their variances as accurately as possible, I centered the data and then confirmed the results with orthogonal polynomial regression. After reducing multicollinearity, some coefficients were no longer statistically significant but were retained in the model to follow model hierarchy guidelines. This paper is geared toward a general audience.
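
A minimal sketch of the centering and order-selection steps, with hypothetical dataset and variable names (SBC is SAS's label for the Schwarz/Bayesian information criterion); lower-order terms are kept even when not individually significant so the model respects the hierarchy principle, and an orthogonal-polynomial recoding (for example, via the ORPOL function in SAS/IML) can then be used to confirm the centered fit.

    data centered;
       set experiment;
       dose_c  = dose - 25;        /* center the predictor; 25 stands in for the sample mean */
       dose_c2 = dose_c**2;        /* quadratic term */
       dose_c3 = dose_c**3;        /* cubic term     */
    run;

    /* Fit increasing polynomial orders and compare SBC across the runs */
    proc glmselect data=centered;
       model growth = dose_c dose_c2 / selection=none stats=sbc;           /* quadratic model */
    run;

    proc glmselect data=centered;
       model growth = dose_c dose_c2 dose_c3 / selection=none stats=sbc;   /* cubic model     */
    run;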

View paper.