# Analyzing the 2011-2012 California Health Inteview Survey with R

The California Health Interview Survey (CHIS) is a remarkable biannual survey of health status, care access, and demographics for California residents. The 2011-2012 public use survey data has recently been released and is freely available after registering at the CHIS site. CHIS currently offers data in SAS, Stata, and SPSS formats. However, thanks to a few great R packages, it’s not very hard to bring this complex and interesting data set into R. In this post, we’ll cover some practical details about using CHIS data. We’ll also take up a more technical topic about how the CHIS weights were build and about how they should and should not be used. Before we begin, here are some key references for CHIS:

• UCLA maintains extensive documentation of the CHIS methodlogy including important details of how the sample weighting was developed and implemented. Definitions for the CHIS variables can be found in the data dictionary included with the CHIS public use datasets.
• Thomas Lumley created and maintains the survey package. He has provided a description of how to work with the CHIS survey weights here and also in his book.

### Getting The Data

To begin, we first need to register (free) at the CHIS site and download the data. The CHIS is a landline and mobile telephone survey, designed to represent the non-institutionalized resident population of California. The data is divided into three subsets: child, adolescent, and adult. For simplicity, we’ll just work with the adult data set which has about 43k records. To bring the data into R, download the Stata flavor from CHIS, unzip it, and load it into R as follows:

Now that we’ve got the data loaded, we want to take a stab at analyzing it. Browsing through the data dictionary, we see that CHIS records educational attainment in the variable aheduc. How can we tabulate this variable? Observe that CHIS includes 81 variables, rakedw0, rakedw1,…,rakedw80 , which are called replicate weights. For the moment, we’ll disregard all of these weights except for rakedw0. The variable rakedw0 is the sample weight and it tells us how much each of the 43k observations should “count” in an analysis. Using R’s xtabs function, we can tabulate educational attainment with the weight rakedw0:

Though we had only 43k records in the sample, the sum over all the categories is about 28M people which is roughly the size of the non-institutionalized adult population of California. This is because the sample weight rakedw0 was designed so that CHIS totals match closely to known CA demographics (primarily from the California Department of Finance’s Demographics Unit).

Also, we observe that:

The raw CHIS sample contains many more women than we’d expect from a completely random draw from the California population. The sample weight rakedw0 includes an adjustment to rebalance the over-representation of females. Later in the post, we’ll explore the CHIS weighting method in more detail and also explain the reason for the other weights rakedw1,…, rakedw80.

### Simple point estimates with CHIS

CHIS is a great resource for investigating how demographic and health outcome factors are related. Let’s use the data to assess a simple hypothesis: Californians with lower educational attainment are more likely to be overweight/obese (ovrwt).

When working with CHIS data it is frequently convenient to group and recode categorical variables. There are a number of ways we could do this but the “car” package offers a recoding function with a simple syntax:

Here’s a simple visual tabulation from CHIS to support our hypothesis:

Now that we’ve found a potentially interesting relationship in the dataset, our next step might be to check that this relationship is statistically significant. This means we need to compute standard errors for our cross tabulation. Here’s where working with the complex CHIS survey design gets a bit tricky…

### Replicate weights and estimator variances

At this point, it’s useful take a more detailed look at the weighting methodology employed in CHIS and other complex surveys. This section is a little technical but the key practical consequences are summarized at the end. The CHIS sample weights are designed through a process called raking (which is similar to iterated proportional fitting). To understand this process, imagine that we are conducting a study on smoking and have collected the following sample:

From an external source, such as census data, suppose we know with high-confidence that 52% of the total population (city A + city B) are female and 48% are male. Suppose we also know that the population is distributed so that 55% of people live in A and 45% live in B. These external facts are called usually called control totals. Our sample is clearly not representative of the population as 40% of respondents are from city A and 70% are female. If we want to estimate the joint distribution of smoking rates by gender and city, what can we do?

Consider the strategy of assigning a weight $w_i$ to each respondent in the sample. This weight tells us how much each observation should count for in the cross tabulation. Then, for example, we can estimate:
$$p(\mathrm{smoke}, M ,A) \approx \frac{1}{10} \sum_{i: \mathrm{gender}_i = M, \mathrm{city}_i = A } w_i \cdot \mathrm{smoke}.$$ The key issue is how to choose the weight $w_i$. Raking is one such strategy which can make sense for complex surveys.
In a raking procedure, we begin with one of the control variables (say gender) and choose $w_i$ so that the sample distribution will match the population distribution. For example, to get 48% male representation we’d need to give each male in the sample a weight such that
$$.3 \cdot w_i(\mathrm{gender}_i = M) = .48 \qquad \Rightarrow \qquad w_i(\mathrm{gender}_i = M) = \frac{0.48}{0.3} = 1.6.$$ Similarly, to get 52% female representation, we’d need to give each female in the sample a weight such that
$$w_i(\mathrm{gender}_i = F) = \frac{0.52}{0.7} \approx 0.7429.$$ We can then recompute the crosstabs with this initial weight:

Now gender is in balance but city remains out of balance. To balance the city proportions, we need to weight the weighted data.

So we’ve balanced city but in doing so we’ve also unbalanced gender. However, gender is not nearly as out of balance as it was when we began. This suggests that we can iterate this process to build a weight variable that will balance both gender and city simultaneously. Indeed, here is a simple code that will produce the sample weight for this toy problem:

This simplified raking algorithm is a good approximation to how the weights in CHIS (and other complex surveys) are developed. Of course in CHIS, the process is much more complicated and the rebalancing is done over 11 design variables and with many more factor levels.

Raking is an interesting procedure from a technical perspective. Why would we expect this algorithm to converge? When it does converge, why can we use these weights to obtain unbiased estimates within cells? These questions are well outside the scope of this post but there is a fair amount of literature on the topic (this article, for example). Practically speaking, what we need to know for working with CHIS are the following facts:

• The raked weight RAKEDW0 can be used to obtain unbiased estimators of means and proportion in the usual way, e.g. with xtabs in R.
• Variances (and hence standard errors) for these estimators cannot be computed directly using RAKEDW0. A nice description of this issue can be found in this article.

In the next section, we’ll explain how to correctly compute the standard errors for means and proportions in CHIS.

### Standard errors with Replicate Weights

To handle the standard errors, we need to make use of the replicate weights rakedw1,…,rakedw80. Essentially, these weights capture variability across subsamples of the CHIS sample and from the variability within these subsamples we can deduce the correct variance estimates for our point estimators. The details of how the CHIS replicate weights were created and how they are used can be found in section 9.2 of the CHIS methodology manual.

Fortunately, R’s survey package was built precisely to handle complex survey designs, including replicate weights. The key feature of the survey package is that all trickiness of the CHIS design can be encapsulated in a survey design object $D$. Once R knows about the design, survey’s methods will take care of the details of correctly computing the standard errors behind the scenes. Let’s return to our education/obesity example and compute standard errors for our cross tabulations:

## 2 thoughts on “Analyzing the 2011-2012 California Health Inteview Survey with R”

1. this is awesome. could we collaborate or (if you are busy) could i broaden your code to work on every year and republish that? please e-mail me? 🙂 ajdamico@gmail.com

2. This is one of those posts that validates my love for the R ecosystem. Fantastic in many ways. Thanks for sharing!