13.3: Simple Random Samples and Statistics

The sample average and the population mean

Consider the numerical average of the values in the sample \(\bar{t} = \dfrac{1}{n} \sum_{i = 1}^{n} t_i\). This is an observation of the sample average

\(A_n = \dfrac{1}{n} \sum_{i = 1}^{n} X_i = \dfrac{1}{n} S_n\)

The sample sum \(S_n\) and the sample average \(A_n\) are random variables. If another observation were made (another sample taken), the observed value of these quantities would probably be different. Now \(S_n\) and \(A_n\) are functions of the random variables \(\{X_i: 1 \le i \le n\}\) in the sampling process. As such, they have distributions related to the population distribution (the common distribution of the \(X_i\)). According to the central limit theorem, for any reasonably sized sample they should be approximately normally distributed. As the examples demonstrating the central limit theorem show, the sample size need not be large in many cases. Now if the population mean \(E[X]\) is \(\mu\) and the population variance \(\text{Var} [X]\) is \(\sigma^2\), then

\(E[S_n] = \sum_{i = 1}^{n} E[X_i] = nE[X] = n\mu\) and \(\text{Var}[S_n] = \sum_{i = 1}^{n} \text{Var} [X_i] = n \text{Var} [X] = n \sigma^2\)

\(E[A_n] = \dfrac{1}{n} E[S_n] = \mu\) and \(\text{Var}[A_n] = \dfrac{1}{n^2} \text{Var} [S_n] = \sigma^2/n\)

Herein lies the key to the usefulness of a large sample. The mean of the sample average \(A_n\) is the same as the population mean, but the variance of the sample average is \(1/n\) times the population variance. Thus, for a large enough sample, the probability is high that the observed value of the sample average will be close to the population mean. The population standard deviation, as a measure of the variation, is reduced by a factor \(1/\sqrt{n}\).
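A quick numerical check of this \(1/\sqrt{n}\) effect can be run in base MATLAB. The sketch below is ours (not one of the text's m-procedures): it draws many samples of several sizes from a population uniform on [-1, 1] and compares the observed standard deviation of the sample average with \(\sigma/\sqrt{n}\).

% Sketch (ours): spread of the sample average shrinks as 1/sqrt(n)
rng(0);                              % fix the generator for repeatability
sigma = sqrt(1/3);                   % population standard deviation for uniform[-1,1]
N = 10000;                           % number of repeated samples
for n = [10 100 1000]
    A = mean(2*rand(n,N) - 1);       % N observations of the sample average A_n
    fprintf('n = %4d   std(A_n) = %.4f   sigma/sqrt(n) = %.4f\n', ...
        n, std(A), sigma/sqrt(n))
end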

Example \(\PageIndex{1}\) Sample size

Suppose a population has mean \(\mu\) and variance \(\sigma^2\). A sample of size \(n\) is to be taken. There are two complementary questions:

  1. If \(n\) is given, what is the probability the sample average lies within distance \(a\) from the population mean?
  2. What value of \(n\) is required to ensure a probability of at least \(p\) that the sample average lies within distance \(a\) from the population mean?

Solution

Suppose the population variance is known or can be approximated reasonably. If the sample size \(n\) is reasonably large, depending on the population distribution (as seen in the previous demonstrations), then \(A_n\) is approximately \(N(\mu, \sigma^2/n)\).

1. Sample size given, probability to be determined.

\(P(|A_n - \mu| \le a) = P \left(\left|\dfrac{A_n - \mu}{\sigma/\sqrt{n}} \right| \le \dfrac{a\sqrt{n}}{\sigma} \right) = 2\phi (a\sqrt{n}/\sigma) - 1\)

where \(\phi\) is the standard normal distribution function.

2. Sample size to be determined, probability specified.

\(2 \phi (a \sqrt{n}/\sigma) - 1 \ge p\) iff \(\phi (a\sqrt{n} /\sigma) \ge \dfrac{p + 1}{2}\)

Find from a table or by use of the inverse normal function the value of \(x = a\sqrt{n}/\sigma\) required to make \(\phi (x)\) at least \((p + 1)/2\). Then

\(n \ge \sigma^2 \left(\dfrac{x}{a}\right)^2 = \left(\dfrac{\sigma}{a}\right)^2 x^2\)

We may use the MATLAB function norminv to calculate values of \(x\) for various \(p\).

p = [0.8 0.9 0.95 0.98 0.99];
x = norminv((1+p)/2,0,1);    % MATLAB's norminv takes arguments (P,MU,SIGMA)
disp([p;x;x.^2]')            % columns: p, x, x^2
    0.8000    1.2816    1.6424
    0.9000    1.6449    2.7055
    0.9500    1.9600    3.8415
    0.9800    2.3263    5.4119
    0.9900    2.5758    6.6349

For \(p = 0.95\), \(\sigma = 2\), \(a = 0.2\), \(n \ge (2/0.2)^2 \cdot 3.8415 = 384.15\). Use at least 385, or perhaps 400 because of uncertainty about the actual \(\sigma^2\).
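The same calculation is easily scripted. The fragment below is our sketch of it (norminv is in the Statistics Toolbox):

% Sketch (ours): minimum sample size for P(|A_n - mu| <= a) >= p
p = 0.95;  sigma = 2;  a = 0.2;
x = norminv((1 + p)/2, 0, 1);        % x = 1.9600
n = ceil((sigma/a)^2 * x^2)          % n = 385, since (2/0.2)^2*3.8415 = 384.15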

The idea of a statistic

As a function of the random variables in the sampling process, the sample average is an example of a statistic.

A statistic is a function of the class \(\{X_i: 1 \le i \le n\}\) which uses explicitly no unknown parameters of the population.

Example \(\PageIndex{2}\) Statistics as functions of the sampling process

The random variable

\(W = \dfrac{1}{n} \sum_{i = 1}^{n} (X_i - \mu)^2\), where \(\mu = E[X]\)

is not a statistic, since it uses the unknown parameter \(\mu\). However, the following is a statistic.

\(V_n^* = \dfrac{1}{n} \sum_{i = 1}^{n} (X_i - A_n)^2 = \dfrac{1}{n} \sum_{i = 1}^{n} X_i^2 - A_n^2\)

It would appear that \(V_n^*\) might be a reasonable estimate of the population variance. However, the following result shows that a slight modification is desirable.

Example \(\PageIndex{3}\) An estimator for the population variance

The statistic

\(V_n = \dfrac{1}{n - 1} \sum_{i = 1}^{n} (X_i - A_n)^2\)

is an estimator for the population variance.

VERIFICATION

Consider the statistic

\(V_n^* = \dfrac{1}{n} \sum_{i = 1}^{n} (X_i - A_n)^2 = \dfrac{1}{n} \sum_{i = 1}^{n} X_i^2 - A_n^2\)

Noting that \(E[X^2] = \sigma^2 + \mu^2\) and \(E[A_n^2] = \text{Var} [A_n] + E[A_n]^2 = \sigma^2/n + \mu^2\), we use the last expression to show

\(E[V_n^*] = \dfrac{1}{n} n (\sigma^2 + \mu^2) - \left(\dfrac{\sigma^2}{n} + \mu^2 \right) = \dfrac{n - 1}{n} \sigma^2\)

The quantity has a bias in the average. If we consider

\(V_n = \dfrac{n}{n - 1} V_n^* = \dfrac{1}{n - 1} \sum_{i = 1}^{n} (X_i - A_n)^2\), then \(E[V_n] = \dfrac{n}{n - 1} \cdot \dfrac{n - 1}{n} \sigma^2 = \sigma^2\)
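The bias, and its removal, are easy to see in simulation. The following sketch is ours: it draws many samples of size \(n = 10\) from a population uniform on [-1, 1] (so \(\sigma^2 = 1/3\)) and averages the observed values of \(V_n^*\) and \(V_n\).

% Sketch (ours): bias of V_n^* versus the corrected V_n
rng(1);
n = 10;  N = 100000;                 % small samples, many replications
X = 2*rand(n,N) - 1;                 % each column is one sample of size n
Vstar = var(X,1);                    % 1/n version (second argument 1 => divide by n)
V = var(X);                          % 1/(n-1) version, MATLAB's default
fprintf('mean(V_n^*) = %.4f   (n-1)/n * sigma^2 = %.4f\n', mean(Vstar), (n-1)/n * 1/3)
fprintf('mean(V_n)   = %.4f   sigma^2 = %.4f\n', mean(V), 1/3)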

The quantity \(V_n\) with \(1/(n - 1)\) rather than \(1/n\) is often called the sample variance to distinguish it from the population variance. If the set of numbers

\((t_1, t_2, \cdots, t_N)\)

represents the complete set of values in a population of \(N\) members, the variance for the population would be given by

\(\sigma^2 = \dfrac{1}{N} \sum_{i = 1}^{N} (t_i - \bar{t})^2\), where \(\bar{t} = \dfrac{1}{N} \sum_{i = 1}^{N} t_i\)

Here we use \(1/N\) rather than \(1/(N - 1)\).

Since the statistic \(V_n\) has mean value \(\sigma^2\), it seems a reasonable candidate for an estimator of the population variance. If we ask how good it is, we need to consider its variance. As a random variable, it has a variance. An evaluation similar to that for the mean, but more complicated in detail, shows that

\(\text{Var} [V_n] = \dfrac{1}{n} \left(\mu_4 - \dfrac{n - 3}{n - 1} \sigma^4 \right)\) where \(\mu_4 = E[(X - \mu)^4]\)

For large \(n\), \(\text{Var} [V_n]\) is small, so that \(V_n\) is a good large-sample estimator for \(\sigma^2\).
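This, too, can be checked numerically. For \(X\) uniform on [-1, 1] we have \(\sigma^4 = 1/9\) and \(\mu_4 = E[X^4] = 1/5\), so the sketch below (ours) compares the variance of many observed values of \(V_n\) with the formula.

% Sketch (ours): Var[V_n] against (mu4 - (n-3)/(n-1)*sigma^4)/n
rng(2);
n = 25;  N = 100000;
X = 2*rand(n,N) - 1;                 % columns are samples of size n
V = var(X);                          % N observations of V_n
mu4 = 1/5;  sig4 = 1/9;              % fourth central moment and sigma^4
fprintf('observed Var[V_n] = %.3e   formula = %.3e\n', ...
    var(V), (mu4 - (n-3)/(n-1)*sig4)/n)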

Example \(\PageIndex{4}\) A sampling demonstration of the CLT

Consider a population random variable \(X\) ~ uniform [-1, 1]. Then \(E[X] = 0\) and \(\text{Var} [X] = 1/3\). We take 100 samples of size 100, and determine the sample sums. This gives a sample of size 100 of the sample sum random variable \(S_{100}\), which has mean zero and variance 100/3. For each observed value of the sample sum random variable, we plot the fraction of observed sums less than or equal to that value. This yields an experimental distribution function for \(S_{100}\), which is compared with the distribution function for a random variable \(Y\) ~ \(N(0, 100/3)\).

rand('seed',0)                   % Seeds random number generator for later comparison
tappr                            % Approximation setup
Enter matrix [a b] of x-range endpoints  [-1 1]
Enter number of x approximation points  100
Enter density as a function of t  0.5*(t<=1)
Use row matrices X and PX as in the simple case
qsample                          % Creates sample
Enter row matrix of VALUES  X
Enter row matrix of PROBABILITIES  PX
Sample size n = 10000            % Master sample size 10,000
Sample average ex = 0.003746
Approximate population mean E(X) = 1.561e-17
Sample variance vx = 0.3344
Approximate population variance V(X) = 0.3333
m = 100;
a = reshape(T,m,m);              % Forms 100 samples of size 100
A = sum(a);                      % Matrix A of sample sums
[t,f] = csort(A,ones(1,m));      % Sorts A and determines cumulative
p = cumsum(f)/m;                 % fraction of elements
(See Figure 13.3.1)

(Figure: "Central limit theorem for sample sums." Two distribution functions plotted against sample sum values (horizontal axis, -15 to 20) versus cumulative fraction (vertical axis, 0 to 1): a smooth dashed curve labeled "gaussian" and a jagged solid curve labeled "experimental", which follow essentially the same S-shaped path.)


Figure 13.3.1. The central limit theorem for sample sums.
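The demonstration above relies on the text's m-procedures tappr, qsample, and csort. For readers without those files, the following base-MATLAB sketch (ours; normcdf is in the Statistics Toolbox) reproduces the comparison directly:

% Sketch (ours): CLT for sample sums without the special m-procedures
rng(0);
m = 100;                             % 100 samples of size 100
T = 2*rand(100,m) - 1;               % X ~ uniform[-1,1]; columns are samples
A = sum(T);                          % the 100 observed sample sums
t = sort(A);                         % sorted sums
p = (1:m)/m;                         % empirical cumulative fractions
plot(t, p, '-', t, normcdf(t, 0, sqrt(100/3)), '--')
xlabel('sample sum values'), ylabel('cumulative fraction')
legend('experimental', 'gaussian', 'Location', 'southeast')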

This page titled 13.3: Simple Random Samples and Statistics is shared under a CC BY 3.0 license and was authored, remixed, and/or curated by Paul Pfeiffer via source content that was edited to the style and standards of the LibreTexts platform.
