I’m working through Wasserman’s *All of Nonparametric Statistics*, a wonderful and concise tour of nonparametric techniques. What is nonparametric statistics? It is a collection of estimation techniques that make as few assumptions as possible about the distribution from which your data came.

Let’s work through an example in R that’s mentioned in Chapter 3 of the book but not worked in much detail. We’ll use the nerve data set from Cox and Lewis (1966) frequently referenced by Wasserman and available on his website (data | description). The data measures the waiting time between pulses along a nerve fiber.

1 2 3 4 5 |
# load the data data_url = "http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/nerve.dat" download.file(data_url, "nerve.dat") wt = scan("nerve.dat") n = length(wt) |

### Estimating the CDF

Given a sample $x_1,…,x_n \sim F$, where $F$ is a CDF, we can approximate $F$ by the empirical CDF:

$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n I_{x_i}(x)$$ where $I_{x_i}$ is $0$ for $x<x_i$ and 1 otherwise. The $\hat{F}$ is designed so that the cumulative probability increases by $1/n$ after each sample points $x_i$. We can plot $\hat{F}$ with the following R snippet:

1 2 3 4 5 6 7 8 9 10 11 |
# empirical CDF F_hat = function(x) { sum(wt<=x)/length(wt) } # construct plot t = seq(0, 1.5, by=1.5/10000) pr = vapply(t, F_hat, 1) plot(t, pr, xlab="wait time", ylab="probability", type="s", col="blue", lwd=2) points(wt, vapply(wt, F_hat, 1), type="p", pch=20, cex=0.7, col="red", lwd=0) rug(wt) # adds the "rug" plot to the bottom of the plot |

### Estimating the Skewness

The skewness of a distribution is a measure of its symmetry. The skewness of a distribution $F$ is the third standardized moment of the distribution:

$$E\left[\frac{(x-\mu)^3}{\sigma^3}\right] =\frac{1}{\sigma^3} \int (x –\mu)^3 dF.$$

A normal distribution is perfectly symmetric and therefore has skewness $0$.

There are various estimators for the sample skewness. From the example above, we might chose to take: $$T = \frac{1}{\sigma} \int (x –\mu)^3 d\hat{F} = \frac{1}{n}\sum_{i=1}^n \left(\frac{ (x_i – \bar{x})}{\bar{\sigma}} \right)^3$$ where $\bar{x}$ is the sample mean and $\bar{\sigma}$ is the sample standard deviation. With this definition we can compute that the skewness is approximately $1.75794$.

1 2 3 4 5 6 7 8 9 |
skewness = function(x) { n = length(x) mu_hat = mean(x) m2 = sd(x) m3 = sum((x-mu_hat)**3)/n m3/(m2^3) } sk_plug = sk_form(wt) cat("Plug-in estimated skewness: ", sk_plug) |

### Standard Errors: Jackknife

How can we decide if the skewness estimate of $1.75794$ is significantly different from $0$? Let’s compute a confidence interval. To do this though, we’ll need a way to estimate standard errors for the distribution of the skewness estimator $T$. If we’re not willing to make any assumptions about this distribution, how can we estimate its standard errors? One approach is to apply the Jackknife algorithm. With the simplest version of Jackknife, we first estimate the skewness over the sample $n$ times, excluding one value at each pass. Let $T_{-i}$ denote the skewness estimate from the data $x_1,…,x_n$ with the $i$-th observation excluded. The Jackknife estimator is:

$$T_{\mathrm{jack}} = T – (n-1)\left(T – \sum_{i=1}^n T_{-i}\right) = \frac{1}{n} \sum_{i=1}^n \psi_i$$

where:

$$ \psi_i = nT – (n-1)T_{-i}$$

are called the pseudo-values. The estimated variance of $T$ via the Jackknife is

$$ V(T) = \frac{v_{\psi}}{n} $$

where $v_{\psi}$ is the sample variance of the pseudo-values. We compute this in R with:

1 2 3 4 |
sk_plug = skewness(wt) pseudo_values = vapply(1:n, function(i) (n*sk_plug - (n-1)*skewness(wt[-i])), 1) sk_se_jack = sqrt(var(pseudo_values)/n) cat("Jackknife estimated skewness SE: ", sk_se_jack) |

and we find that the standard error is approximately $0.1716485$. If we apply a normal confidence interval approximation, we estimate that a 95% confidence interval for the skewness of the nerve data is:

$$1.75794 \pm 2 \cdot 0.1716485$$

Because $0$ does not lie in this confidence interval, we can be fairly certain that the true distribution of the nerve wait times is skewed.