Dealing with Missing Data: A Comparative Exploration of Approaches Utilizing the Integrated City Sustainability Database

Editor’s Note: The article this blog post is based on is now available in the new March issue of UAR (Volume 55 Issue 2)

By Cali Curley, Rachel Krause, Richard Feiock, and Chris Hawkins

In our UAR article, we seek to raise awareness about how to treat missing data in urban studies research. A large proportion of the empirical research on urban politics and policy relies on data collected through surveys of local government or community organization leaders. Surveys provide a relatively efficient way to collect the large amounts of consistently measured individual or organizational information needed to conduct comprehensive and accurate statistical analysis. This is particularly important if the aim of the research is to produce generalizable findings and to contribute to understanding a particular phenomenon by testing theory. However, missing data is a common and significant challenge in survey-based research. It often influences the choice of statistical method and, depending on its severity, can undermine confidence in the analysis. Nonetheless, the problems associated with missing data are among the least acknowledged issues when conducting and reporting analysis.

The goal of this article is to compare three techniques for dealing with missing data (listwise deletion, mean replacement, and multiple imputation) and to demonstrate their utility in analyzing survey data. The table below summarizes the paper's detailed comparison of the three techniques. To demonstrate the performance of these approaches, we use data from the Integrated City Sustainability Database (ICSD), portions of which are available to the larger public upon request. The ICSD merges seven surveys administered to U.S. cities during an 18-month period in 2010-2011, all of which include similar questions about local sustainability policy (economic, environmental, equity, climate governmental priority, collaboration, policy adoption, etc.). All seven surveys were sent to all U.S. cities with populations greater than 50,000 in 2010. The ICSD provides scholars and practitioners with a unique opportunity to examine a very robust set of responses to important questions.

We generate three versions of the ICSD data, one for each of the common missing data techniques mentioned above (listwise deletion, mean replacement, and multiple imputation), and use them to run three identically specified models. Our analysis finds great variation in the models' performance depending on the version of the data used. The paper suggests that why data is missing, and how the missingness is treated, can explain both the inflation of certain findings and the null results that diminish theoretical progress.
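As a rough illustration of the first two techniques, the sketch below builds listwise-deleted and mean-replaced versions of a small made-up data set. The rows, the variable names (`priority`, `policies`), and the helper functions are hypothetical. They are not drawn from the ICSD or from the article's own code.

```python
from statistics import mean, stdev

# Hypothetical toy survey rows (not ICSD data); None marks a missing answer.
rows = [
    {"city": "A", "priority": 4.0, "policies": 10.0},
    {"city": "B", "priority": None, "policies": 7.0},
    {"city": "C", "priority": 2.0, "policies": None},
    {"city": "D", "priority": 5.0, "policies": 12.0},
    {"city": "E", "priority": 3.0, "policies": 6.0},
]
VARIABLES = ["priority", "policies"]


def listwise_delete(rows, variables):
    """Keep only rows with no missing value in any analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def mean_replace(rows, variables):
    """Fill each missing value with the mean of that variable's observed values."""
    means = {v: mean(r[v] for r in rows if r[v] is not None) for v in variables}
    return [
        {**r, **{v: means[v] if r[v] is None else r[v] for v in variables}}
        for r in rows
    ]


complete_cases = listwise_delete(rows, VARIABLES)
mean_filled = mean_replace(rows, VARIABLES)

# Spread of "priority" before and after mean replacement.
observed_sd = stdev(r["priority"] for r in rows if r["priority"] is not None)
filled_sd = stdev(r["priority"] for r in mean_filled)

print(len(complete_cases), len(mean_filled))
print(round(observed_sd, 2), round(filled_sd, 2))
```

On this toy data, listwise deletion keeps three of the five rows. Mean replacement keeps all five but shrinks the standard deviation of `priority` from about 1.29 to about 1.12, which is the distortion the comparison table attributes to mean substitution.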

One key finding of our study is the advantage of employing a theory-based imputation process. The mechanics of imputation may be relatively straightforward, but by developing ‘informing variables’ (broad groupings of variables that have theoretic relationships), we have greater confidence that the results reflect more accurate explanatory relationships than those produced by alternative methods of handling missing data. Overall, the results of our analysis confirm the usefulness of the ICSD in the study of environmental, sustainability, and other policy in U.S. cities, and suggest pathways for studying urban issues with survey data.

In our analysis, the multiple imputation approach was most appropriate and resulted in the strongest outcomes. This is because the missing values in the ICSD are Missing at Random (MAR), and because the pattern of missingness in the multivariate regression model we estimated would have caused a large number of observations to be dropped without value replacement. Despite the strong performance of multiply imputed data in our example, we emphasize that there is no one-size-fits-all “best” approach for handling missing data. It is imperative that researchers understand the causes behind the missingness in their own data and the consequences of each potential approach.
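To see why a multivariate model can lose so many observations under listwise deletion, consider a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each variable is missing independently at a fixed rate. The rates are hypothetical and say nothing about the actual missingness pattern in the ICSD.

```python
def complete_case_share(missing_rates):
    """Expected share of rows with no missing value, assuming each
    variable's missingness is independent of the others (an assumption
    made only for this illustration)."""
    share = 1.0
    for rate in missing_rates:
        share *= 1.0 - rate
    return share


# Hypothetical model with ten variables, each missing in 10% of responses.
share = complete_case_share([0.10] * 10)
print(f"{share:.1%} of rows would survive listwise deletion")
```

Under these assumed rates only about 35% of rows are complete (0.9 raised to the tenth power), even though no single variable is missing more than 10% of its values.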

| Techniques | Listwise Deletion (Complete Case Analysis) | Mean Replacement (Mean Substitution) | Multiple Imputation |
|---|---|---|---|
| Technique Summary | Remove any entries with missing values; perform analysis without these observations | For variable “a” with missing values, take the mean of all included observations. Substitute the mean of “a” for missing values of “a.” | Estimate the distribution (Bayesian posterior distribution) of the missing variable, given covariates; take random draws from this distribution to produce multiple versions (usually 3-10) of an imputed data set; perform analysis on each imputed data set and pool the results |
| Missingness Assumption* | MCAR, occasionally MAR | MCAR | MCAR or MAR |
| Advantages | Easiest, simplest | Preserves the mean of the dataset; simple; allows use of all observations | Accounts for the extra uncertainty produced by imputing data; produces better estimates of missing values |
| Disadvantages | Loses valuable information; potentially contributes to bias | Artificially reduces standard deviation of data set; distorts relationships between variables | Requires complicated statistical methods or complicated software; harder to understand; takes extra steps |
| Impacts on Interpretation | Statistical analysis loses power; estimates could be biased if data is not missing completely at random | Estimate could be biased; standard errors will be artificially low; could produce results that are highly statistically significant, but inaccurate | Because the method accounts for extra uncertainty, results can be interpreted as if data was not missing |
| References: Method Exploration | Jones 1996, 223; Schafer and Graham 2002, 155 | Downey and King 1998; Schafer and Graham 2002, 159 | Donders et al. 2006, 1089; King et al. 2001; Rubin 1987; Schafer 1997; Zhang 2003 |
| References: Application | Park and Ha 2012, 394; Ryff and Keyes 1995, 722 | Allen et al. 2006, 572; Gallimore et al. 2011, 186-187 | Abayomi et al. 2008; Fox and Swatt 2009; Miyama and Managi 2014 |

*Missingness Assumption Abbreviations: Missing Completely at Random (MCAR), Missing at Random (MAR)
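The multiple imputation steps summarized in the table (draw several imputed data sets, analyze each, pool the results) can be sketched in a few lines. This is a simplified illustration, not the procedure used in the article. It imputes one variable from one covariate with a linear regression, approximates posterior draws by refitting on a bootstrap sample of the complete cases, and pools with Rubin's rules. The toy data, the function names, and the choice of five imputations are all assumptions.

```python
import random
from statistics import mean, variance


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def impute_once(xs, ys, rng):
    """One imputed copy of ys: refit the regression on a bootstrap sample of
    the complete cases, then fill each gap with prediction plus random noise."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    while True:
        boot = [rng.choice(complete) for _ in complete]
        if len({x for x, _ in boot}) > 1:  # a slope needs two distinct x values
            break
    a, b, sd = fit_line(*zip(*boot))
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance across imputations."""
    m = len(estimates)
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # variance of estimates across imputations
    return mean(estimates), within + (1 + 1 / m) * between


# Hypothetical toy data: y depends on x, and y is missing only at high x,
# so missingness depends on an observed covariate (MAR given x).
rng = random.Random(0)
xs = list(range(1, 21))
ys = [2 + 0.5 * x + rng.gauss(0, 0.5) for x in xs]
for x in (15, 17, 18, 20):
    ys[x - 1] = None

M = 5  # number of imputed data sets (the table cites 3-10 as typical)
imputed_sets = [impute_once(xs, ys, rng) for _ in range(M)]
estimates = [mean(s) for s in imputed_sets]                # analysis: mean of y
variances = [variance(s) / len(s) for s in imputed_sets]   # variance of each mean
pooled_mean, total_var = pool(estimates, variances)
complete_case_mean = mean(y for y in ys if y is not None)

print(pooled_mean, total_var, complete_case_mean)
```

Because the missing values sit at high values of `x`, the complete-case mean of `y` understates the full-sample mean, while the pooled estimate uses the covariate to correct for this. The between-imputation term in `pool` is what carries the extra uncertainty from imputing. In practice, researchers would typically rely on an established statistical package rather than hand-written code.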

Read the article here.

Photo by Lukas Blazek on Unsplash

Author Biography

Cali Curley is an assistant professor at the Indiana University Purdue University Indianapolis School of Public and Environmental Affairs. Her research focuses on environmental policy, energy policy, and local governance.

Rachel M. Krause is an associate professor at the University of Kansas School of Public Affairs and Administration. She researches urban sustainability, local governance, and climate protection initiatives.

Richard Feiock holds the Jerry Collins Eminent Scholar Endowed Chair and is the Augustus B. Turnbull Professor of Public Administration and Policy in the Askew School at Florida State University, where he directs the FSU Local Governance Research Laboratory. He is an elected fellow of the National Academy of Public Administration and serves on the U.S. Environmental Protection Agency's Board of Scientific Counselors.

Christopher V. Hawkins is an associate professor in the School of Public Administration at the University of Central Florida. His research focuses on local economic development, metropolitan governance, and urban sustainability policy.
