Hypothesis Testing:
Continuous Variables (1 Sample)
 Sampling Distribution of the Mean
 Empirical
 Theoretical
 Sampling Distributions & Normality
 1 Sample Z - Parameters Known
 Rationale
 Formal Example
 [Minitab]
 Research Question
 Hypotheses
 Assumptions
 Decision Rules
 Computation
 Decision
 Errors & the Power of a Test
 1 Sample t - Sigma Unknown
 Rationale
 Formal Example
 [Minitab]
 Research Question
 Hypotheses
 Assumptions
 Decision Rules
 Computation
 Decision
 Interval Estimation
Practice Problems (Answers)
Homework
I. Sampling Distribution of the Mean
As might be expected, inference with continuous variables is more complicated than with dichotomous variables. Fortunately however, the general principles are the same. Again, we will use a sampling distribution to index the probability that the observed outcome is due to chance.
 Empirical (capable of
being verified or disproved by observation or experiment)
The sampling distribution of the mean is a probability distribution of the possible
values of the mean that would occur if we were to draw all possible samples of
a fixed size from a given population.
To get a better feel for this notion, let’s consider an empirical example
(or one that could be actually performed). Let us choose 10 samples of size 4
from a population of size 20.
Population Distribution:
6, 2, 9, 5, 0, 1, 3, 2, 1, 1, 5, 2, 7, 7, 7, 8, 1, 1, 3, 7

10 Observed Sample Distributions (Ns = 4) and the resulting Empirical Sampling Distribution of the Mean:

Sample        Mean
1, 5, 9, 0    3.75 (& s_{X} = 4.11)
0, 3, 1, 5    2.25
5, 8, 3, 0    4.00
1, 5, 0, 7    3.25
7, 6, 1, 3    4.25
3, 2, 1, 7    3.25
2, 0, 3, 5    2.50
1, 2, 1, 1    1.25
2, 7, 1, 7    4.25
9, 7, 6, 2    6.00
Notes:


 There are three types of distributions illustrated above: population,
sample, and sampling.
 Empirical sampling distributions are only used to help students understand
the concept. They are not true sampling distributions, since all possible samples
are not chosen.
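The sampling procedure above can be sketched with Python's standard library. This is a hypothetical illustration (the specific draws will differ from the table, since samples are random):

```python
import random
from statistics import mean

# The population of 20 scores from the example above.
population = [6, 2, 9, 5, 0, 1, 3, 2, 1, 1,
              5, 2, 7, 7, 7, 8, 1, 1, 3, 7]

random.seed(1)  # arbitrary seed, for reproducible draws

# Draw 10 samples of size 4 and record each sample mean; together the
# means form an empirical sampling distribution of the mean.
sample_means = [mean(random.sample(population, 4)) for _ in range(10)]

print("sample means:", sample_means)
print("mean of the sample means:", mean(sample_means))
print("population mean:", mean(population))
```

Note that the mean of the sample means tends to cluster near the population mean (3.9), even though any single sample mean can miss it badly.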
 Theoretical
These are called theoretical because all possible samples (an infinite number) would have to be drawn. Since this is impossible, the characteristics (i.e., the mean & standard deviation) of the distribution are determined mathematically.
It turns out that μ_{X̄} = μ_{X}; that is, the mean of the sampling distribution of the mean and the mean of the population are synonyms for the same value.
The standard deviation of the distribution of sample means is called the standard
error of the mean (or more simply, the standard error). It measures
variability in the distribution of sample means or, in other words, sampling
error (the amount of error we can expect due to using a sample mean to
estimate a population mean). Perhaps it is easier to think of sampling error as "chance" like we did at the beginning of the semester.
One would expect the size of the standard error to
be related to the sample size, and it is.
When population values are known: σ_{X̄} = σ_{X}/√N
Thus, as the sample size gets bigger, sampling error gets smaller.
When population values are estimated from sample values: s_{X̄} = s_{X}/√N
This formula requires s_{X} to be an unbiased estimator of σ_{X} (i.e., computed with N − 1 in the denominator).
Computational formula for the standard error estimated from sample values:
s_{X̄} = √[(∑X² − (∑X)²/N) / (N(N − 1))]
Example (using the population distribution from the empirical sampling
distribution above): σ_{X} = 2.81, so σ_{X̄} = σ_{X}/√N = 2.81/√4 = 1.40.
If we didn’t know the population values, we could use the s_{X} from
the first sample: s_{X̄} = s_{X}/√N = 4.11/√4 = 2.06.
As you can see, s_{X̄} only estimates σ_{X̄}, and it does so poorly in this case (because of the small sample size).
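The two standard errors in the example can be checked with a short script. This is a sketch using Python's standard library, where `pstdev` computes the population standard deviation and `stdev` the unbiased (N − 1) sample estimate:

```python
from math import sqrt
from statistics import pstdev, stdev

population = [6, 2, 9, 5, 0, 1, 3, 2, 1, 1,
              5, 2, 7, 7, 7, 8, 1, 1, 3, 7]
first_sample = [1, 5, 9, 0]

N = 4  # sample size

# True standard error, using the population sigma.
sigma = pstdev(population)     # population SD, approx. 2.81
true_se = sigma / sqrt(N)      # sigma/sqrt(N), approx. 1.40

# Estimated standard error, using the sample's unbiased s (N - 1).
s = stdev(first_sample)        # sample SD, approx. 4.11
est_se = s / sqrt(N)           # s/sqrt(N), approx. 2.06

print(round(true_se, 2), round(est_se, 2))
```

With a sample of only 4, the estimate (2.06) is far from the true value (1.40), just as the text warns.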
 Sampling Distributions & Normality
The techniques that we are discussing require that the sampling distribution (in
this case the distribution of sample means) be normal in shape. This will be the
case if either of the following two conditions is met.
 The population distribution of raw scores is normal.
It is difficult to actually know this, but fortunately, many variables are.
 The sampling distribution will approach normality as the sample size is
increased.
This occurs even though the population distribution may not be normal in shape.
Note, though, that the more skewed the population distribution, the larger the
N (sample size) needed for the sampling distribution of the mean to be normal.
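This behavior of the sampling distribution can be demonstrated by simulation. The sketch below (a hypothetical illustration, not part of the course materials) draws samples from a strongly right-skewed population and shows the distribution of means tightening around the population mean as N grows:

```python
import random
from statistics import mean, stdev

random.seed(0)  # arbitrary seed, for reproducibility

# A strongly right-skewed population; the raw scores are NOT normal.
population = [random.expovariate(1.0) for _ in range(10_000)]

def sample_means(n, reps=2000):
    """Means of `reps` random samples of size n drawn from the population."""
    return [mean(random.sample(population, n)) for _ in range(reps)]

# As n grows, the distribution of sample means tightens around the
# population mean and becomes more symmetric (bell-shaped), even though
# the population itself is skewed.
for n in (2, 10, 50):
    ms = sample_means(n)
    print(n, round(mean(ms), 2), round(stdev(ms), 2))
```

The printed spread (the empirical standard error) shrinks roughly as 1/√n, in line with the formula above.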
II. 1 Sample Z - Parameters Known
 Rationale
Now that we have an understanding of sampling distributions of a continuous variable,
we can go on to test a hypothesis. Recall that any normally distributed variable
can be transformed into a standard normal
distribution (i.e., z scores). We also saw that area under the curve implies
probability. Thus, if the sampling distribution of the mean is normal we can establish
the probability of obtaining a particular sample mean.
 Formal Example [Minitab]
Let us look at an example from your book. Animal studies suggest that the anticholinesterase drug physostigmine improves memory. This could have some clinical applications in humans (e.g., senility, Alzheimer’s disease). Studies with humans typically report that we remember an average of seven of 15 words given an 80-minute retention interval. These studies also suggest a standard deviation for the population of two.
 Research Question
Does physostigmine improve memory in humans?
 Hypotheses

In Symbols 
In Words 
H_{O} 
μ=7 
Physostigmine has no effect on memory. 
H_{A} 
μ≠7 
Physostigmine has an effect on memory. 
 Assumptions
 Population of non-drugged folks has μ=7 and σ=2 (i.e., the null).
 Sample is randomly selected.
 Population of non-drugged folks is normal.
Reason is so that the sampling distribution of the mean will be normal. Although
a large sample size would also produce a normally shaped sampling distribution,
we will rarely use large samples.

 Decision Rules
We will use the standard normal curve (Z scores) to obtain the probabilities.
Our alpha level is .05 with a two-tailed test. When we look in a Z table, we see
that the critical value of Z is 1.96 (Z_{crit}).
Thus, the area in the two tails beyond ±1.96 is the critical region. If our observed z value
falls into this area, we will reject the null hypothesis. More formally:
If Z_{obs} ≤ −1.96 or Z_{obs} ≥ +1.96, then reject H_{O}.
If Z_{obs} > −1.96 and Z_{obs} < +1.96, then do not reject
H_{O}.
 Computation
The computations have two goals corresponding to the descriptive and inferential
statistics. Suppose we obtain the following scores for a sample of 20 subjects:
9  8  8  9  9  7  7  8  8  10
8  10  8  10  7  9  8  8  7  9
The first step is to describe the data. The most important descriptive
statistic in this case is the mean or average number of words remembered by the
20 subjects receiving the drug. The calculation reveals a mean of (∑X/N = 167/20 =) 8.35, which
is greater than the population mean of 7.
The second step of the computation
is to perform an inferential test to determine whether this difference between
means is worth paying attention to (in other words, is the improvement in memory
due to sampling error or to the drug?).
Remember: z = (X − μ_{X})/σ_{X}
More generally: z = (statistic − parameter)/(standard error of the statistic)
Thus, the appropriate formula would be: Z_{obs} = (X̄ − μ_{X})/σ_{X̄}
And substituting the values in for the standard error gives: Z_{obs} = (8.35 − 7)/(2/√20) = 1.35/0.447 = 3.02
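The whole computation, descriptive and inferential, can be sketched in a few lines of Python (standard library only):

```python
from math import sqrt
from statistics import mean

# Words recalled by the 20 drugged subjects (from the table above).
scores = [9, 8, 8, 9, 9, 7, 7, 8, 8, 10,
          8, 10, 8, 10, 7, 9, 8, 8, 7, 9]

mu, sigma = 7, 2            # population parameters (given)
N = len(scores)             # 20

xbar = mean(scores)         # descriptive step: 8.35
se = sigma / sqrt(N)        # standard error, sigma/sqrt(N)
z_obs = (xbar - mu) / se    # inferential step

print(round(z_obs, 2))      # 3.02
```

Since 3.02 exceeds the critical value of 1.96, the script reproduces the decision reached below.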
 Decision
Since 3.02 (Z_{obs}) > 1.96 (Z_{crit}) we reject H_{O}
and assert the alternative. Now we must go beyond this simple decision of rejecting
the null or not to what it all means. In other words, we need to make a conclusion
based on our decision and the particular results observed. In this case, we would
conclude that the physostigmine improves memory. Notice that we have actually
gone beyond the alternative hypothesis by specifying that the effect has a direction
(memory was improved). We do this because the mean words remembered for the drugged
group was higher than for the population.
III. Errors & the Power of a Test
As can be seen, hypothesis testing is just educated guessing. Moreover, guesses (educated or not) are sometimes wrong. Consider the possible decisions we can make:

                          H_{O} is true                H_{O} is false
Reject H_{O}              Type I Error (α)             Correct Decision II (1 − β)
Do not reject H_{O}       Correct Decision I (1 − α)   Type II Error (β)

Let us now consider each decision in more detail.
 A Type I Error is the false rejection of a true null. It has a probability of alpha (α). In other words, this error occurs because we must draw a line somewhere to separate the probable from the improbable, and alpha is that line.
 Correct Decision I occurs when we fail to reject a true null. It has a probability of 1 − α.
From a scientist's perspective this is a "boring" result.
 A Type II Error is the false retention of a false null. It has a probability equal to beta (β).
 Correct Decision II occurs when we reject a false
null. The whole purpose of the experiment is to provide the occasion for this
type of decision. In other words, we performed the statistical test because we
expect the sample to differ. This decision has a probability of 1 − β.
This probability is also known as the power
of the statistical test. In other words, the ability of a test to find a difference
when there really is one, is power.
 Factors Influencing Power:
 Alpha (α). Alpha and beta are inversely related. In other words, all other things being equal, as one increases, the other decreases. Thus, using an alpha of .05 will result in a more powerful test than using an alpha of .01.
 Sample Size (N). The bigger the sample (i.e., the more work we do), the more powerful the test.
 Type of Test. Metric tests (as compared to nonparametric tests that we discuss later in the semester) are generally more powerful due to assumptions that are more restrictive.
 Variability. Generally speaking, variability in the sample and/or population results in a less powerful test.
 Test Directionality. One-tailed tests have the potential to be more powerful than two-tailed tests.
 Robustness of the Effect. Six beers are more likely to influence reaction time than one beer.
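Two of these factors, sample size and effect robustness, can be seen directly by simulation. The sketch below is a hypothetical illustration using the physostigmine example's null (μ=7, σ=2), with an assumed true drug effect that is not part of the original notes:

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(42)                          # arbitrary seed, for reproducibility
mu0, sigma = 7, 2                        # null-hypothesis parameters
true_mu = 8                              # hypothetical real effect of the drug
z_crit = NormalDist().inv_cdf(0.975)     # 1.96: two-tailed, alpha = .05

def power(n, reps=2000):
    """Proportion of simulated experiments of size n that reject H0."""
    rejections = 0
    for _ in range(reps):
        xbar = mean(random.gauss(true_mu, sigma) for _ in range(n))
        z = (xbar - mu0) / (sigma / sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / reps

for n in (5, 10, 20):
    print(n, power(n))
```

The estimated power climbs steadily with N; raising `true_mu` (a more robust effect) or lowering `sigma` (less variability) would raise it as well.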
IV. 1 Sample t - Sigma Unknown
In the 1 Sample Z example, both the mean (μ) and standard deviation (σ) of the population were given. However, these parameters are rarely known. In this section, we will consider how the test is performed when σ is unknown.
 Rationale
As we noted earlier, s_{X̄} can be used to estimate σ_{X̄}. One complication of doing this is that the shape of the theoretical distribution of sample means will depend on the sample size. Thus, this sampling distribution is actually a family of distributions and is called Student’s t. To better understand the t distributions, we need to consider a new way of thinking of sample size.
The Degrees of Freedom (df) for a statistic refer to the number of calculations in its computation that are free to vary. For example, the df for the variance of a sample (S_{x}^{2}) is N1.
In other words, since the sum of the deviations equals zero, N−1 of the deviations are free to vary. That is, given N−1 of the deviations, we can easily determine the final deviation because it is not free to vary. For example, with N=5, if the first four deviations were −4, 3, −2, and 1 (which sum to −2), the unknown fifth deviation must be 2.
With the 1 sample t test, the df for t equals the df for S_{x} which is N1. And Student’s t is a family of distributions differing in their kurtosis (or peakedness).
Note that when df are infinite (i.e., the sample size is very large), the t distribution will equal the z distribution.
As for the formula, remember the z test: Z_{obs} = (X̄ − μ_{X})/σ_{X̄}
The formula for the t is similar, with s_{X̄} replacing σ_{X̄}: t_{obs} = (X̄ − μ_{X})/s_{X̄}
Like the z test, the critical values of t are obtained from a table. To determine the critical value of t from the table, you will need to know α, the df, and whether you are using a one- or two-tailed test.
You must be conservative when using these tables. For example, if your df=45 and the table only gives values for a df of 40 and 60, then you must use the critical value given for the df of 40 (or find yourself a better table).
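The conservative-table rule can be expressed as a small lookup. The excerpt below contains standard two-tailed α = .05 critical values; the function name and the particular rows included are illustrative, not from the notes:

```python
# Excerpt of a two-tailed t table at alpha = .05 (standard table values).
t_crit_05_two_tailed = {
    1: 12.706, 5: 2.571, 10: 2.228, 19: 2.093,
    20: 2.086, 40: 2.021, 60: 2.000,
}

def conservative_t_crit(df, table):
    """Return the critical value for the largest tabled df not exceeding df.

    Using a smaller df gives a LARGER critical value, which is the
    conservative choice when the exact df is missing from the table.
    """
    usable = [d for d in table if d <= df]
    return table[max(usable)]

# df = 45 is not in the table, so we fall back to the df = 40 row.
print(conservative_t_crit(45, t_crit_05_two_tailed))   # 2.021
```

For df = 45, the function returns the df = 40 value (2.021) rather than the df = 60 value (2.000), exactly as the rule above prescribes.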
 Formal Example
 [Minitab]
You are interested in whether the average IQ for a group of "bad kids" (the ones that put a tack on your seat before you sit down) in a school is different from the rest of the kids in the school. The average IQ for the school as a whole is 102 with the standard deviation unavailable.
 Research Question
Do "bad kids" have normal intelligence?
 Hypotheses

In Symbols 
In Words 
H_{O} 
μ=102 
Bad kids have normal IQs. 
H_{A} 
μ≠102 
Bad kids do not have normal IQs. 
 Assumptions
 Population of IQ has μ=102 (i.e., the null).
 Sample is randomly selected.
 Population of IQ is normal.
Reason is so that the sampling distribution of the mean will be normal. Although
a large sample size would also produce a normally shaped sampling distribution,
we will rarely use large samples.
 Decision Rules
Using alpha of .05 with a two-tailed test and N=20 (df=N−1=19), we determine from
the t table that the critical value is 2.093.
Thus:
If t_{obs} ≤ −2.093 or t_{obs}
≥ +2.093, then reject H_{O}.
If t_{obs} > −2.093 and t_{obs} < +2.093, then
do not reject H_{O}.
 Computation
The IQs for the 20 bad kids are as follows:
Subj.    X      X²
1        106    11236
2        120    14400
3        118    13924
4        124    15376
5        111    12321
6        123    15129
7        88     7744
8        116    13456
9        120    14400
10       127    16129
11       97     9409
12       118    13924
13       88     7744
14       91     8281
15       110    12100
16       114    12996
17       109    11881
18       130    16900
19       92     8464
20       108    11664
∑        2,210  247,478

N = 20, Mean = 110.5
Describing the data, we see that the average IQ is 110.5, which is higher than the school average of 102 for the "normal kids."
We also need the standard deviation to be able to estimate the standard error when
performing the inferential test.
Thus, s_{X} = √[(∑X² − (∑X)²/N)/(N − 1)] = √[(247478 − 2210²/20)/19] = √(3273/19) = 13.12,
and s_{X̄} = s_{X}/√N = 13.12/√20 = 2.93.
Now we can compute the t test: t_{obs} = (X̄ − μ_{X})/s_{X̄} = (110.5 − 102)/2.93 = 2.90
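The computational formula above can be sketched directly in Python (standard library only), working from the raw sums in the table:

```python
from math import sqrt

# IQs of the 20 "bad kids" (from the table above).
iqs = [106, 120, 118, 124, 111, 123, 88, 116, 120, 127,
       97, 118, 88, 91, 110, 114, 109, 130, 92, 108]

N = len(iqs)                      # 20
sum_x = sum(iqs)                  # 2210
sum_x2 = sum(x * x for x in iqs)  # 247478

xbar = sum_x / N                  # 110.5
ss = sum_x2 - sum_x ** 2 / N      # sum of squared deviations: 3273
s = sqrt(ss / (N - 1))            # unbiased SD, approx. 13.12
se = s / sqrt(N)                  # standard error, approx. 2.93
t_obs = (xbar - 102) / se         # approx. 2.90

print(round(t_obs, 2))
```

The result (about 2.90) exceeds the critical value of 2.093, reproducing the decision below.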
 Decision
Since 2.90 (t_{obs}) > 2.093 (t_{crit}) we reject
H_{O} and assert the alternative. In other words, we conclude that the
"bad kids" are smarter than average. Notice that we have actually gone
beyond the alternative hypothesis by specifying that the effect has a direction
(bad kids are smarter).
V. Interval Estimation
Sometimes the t test does not give us enough information. Simply knowing that a sample mean differs from a population mean may not be enough. Suppose you are a researcher interested in self-destructiveness. You develop a scale to measure this trait. Example questions might include:
 I like to listen to loud music.
 I use (or have used) drugs.
 I like to drive fast.
Next you obtain a random sample of 25 people and give them the scale. (The difficulty in obtaining a random sample might be noted.) The mean for this sample is 120 and the standard deviation is 10.
One of the things that we may want to know is the range of values within which the population mean is likely to fall. (Note that this interval estimates the population mean; it does not give the range of individual scores.)
Let’s say we wanted a range that we can be 95% confident contains the population mean. This is termed the 95% Confidence Interval (CI) and is given by:
CI = X̄ ± (t_{crit})(s_{X̄}) with df=N−1,
and remembering s_{X̄} = s_{X}/√N = 10/√25 = 2.
Thus, the current example has df=N−1=25−1=24, alpha of .05 (for 95% CI), two-tailed test. From the t table, we determine that the t_{crit} is 2.064.
Therefore, the upper limit will be: 120 + (2.064)(2) = 124.13
And the lower limit will be: 120 − (2.064)(2) = 115.87
We can now be 95% confident that the population mean lies between 115.87 and 124.13.
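The interval calculation can be sketched in a few lines of Python, using the summary statistics from the example:

```python
from math import sqrt

xbar, s, N = 120, 10, 25    # sample mean, SD, and size (given)
t_crit = 2.064              # from the t table: df = 24, alpha = .05, two-tailed

se = s / sqrt(N)            # standard error: 10/5 = 2
lower = xbar - t_crit * se  # lower limit of the 95% CI
upper = xbar + t_crit * se  # upper limit of the 95% CI

print(round(lower, 2), round(upper, 2))   # 115.87 124.13
```

Widening the confidence level (a larger t_{crit}) or shrinking the sample (a larger standard error) would widen the interval.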
Copyright © 1997-2016 M. Plonsky, Ph.D.
Comments? mplonsky@uwsp.edu.