Based on the study you designed in the Research Design discussion, apply the scientific method by specifying the five steps of hypothesis testing for your study.

Step 1: State the hypothesis.

Step 2: Collect the data – For the purpose of this discussion, you will state how you would collect the data.

Step 3: Calculate statistics – For the purpose of this discussion, you will indicate the statistical analysis technique(s) you would use.

Step 4: Compare to a critical value – For the purpose of this discussion, you will indicate where you would set the alpha value and why. Note: This step is hypothetical, as you are not actually conducting a statistical analysis. Consequently, you will choose whether the results of your hypothetical analysis are above or below the critical value.

Step 5: Make a decision – For the purpose of this discussion, you will create a conclusion based on the hypothetical results from Step 4. Be sure to include a recommendation on the effectiveness of the new drug based on the results.

Utilize a minimum of two peer-reviewed sources that were published within the last 10 years and are documented in APA style.

Attachment: psy326_chapter02.pdf

Chapter 2
Design, Measurement, and
Testing Hypotheses
Chapter Contents
• Overview of Research Designs
• Reliability and Validity
• Scales and Types of Measurement
• Hypothesis Testing
Introduction
In the early 1950s, Canadian physician Hans Selye introduced the term stress into both
the medical and popular lexicons. By that time, it had been accepted that humans have
a well-evolved fight-or-flight response, which prepares us to either fight back or flee
from danger, largely by releasing adrenaline and mobilizing the body’s resources more
efficiently. While working at McGill University, Selye began to wonder about the health
consequences of this adrenaline and designed an experiment to test his ideas using rats.
Selye injected rats with doses of adrenaline over
a period of several days and then euthanized
the rats in order to examine the physical effects
of the injections. As expected, the rats that were
exposed to adrenaline had developed ill effects,
such as ulcers, increased arterial plaques, and
decreases in the size of reproductive glands—all
now understood to be consequences of long-term
stress exposure. But there was just one problem.
When Selye took a second group of rats and
injected them with a placebo, they also developed ulcers, plaques, and shrunken reproductive
glands!
Fortunately, Selye was able to solve this scientific
mystery with a little self-reflection. Despite all his
methodological savvy, he turned out to be rather
clumsy when it came to handling rats, occasionally dropping one when he removed it from its
cage for an injection. In essence, the experience
for these rats was one that we would now call stressful, and it is no surprise that they
developed physical ailments in response to it. Rather than testing the effects of adrenaline
injections, Selye was inadvertently testing the effects of being handled by a clumsy scientist. It is important to note that if Selye ran this study in the present day, ethical guidelines
would dictate much more stringent oversight of his study procedures in order to protect
the welfare of the animals.
This story illustrates two key points about the scientific process. First, as we discussed in
Chapter 1, it is always good to be attentive to your apparent mistakes because they can
lead to valuable insights. Second, it is absolutely vital to measure what you think you
are measuring. In this chapter, we get more concrete about what it means to do research,
beginning with a broad look at the three types of research design. Our goal at this stage
is to get a general sense of what these designs refer to, when they are used, and the main
differences among them. (Chapters 3, 4, and 5 are each dedicated to one type of research
design and elaborate further on each one.) Following our overview of designs, this chapter covers a set of basic principles that are common to all research designs. Regardless of
the particulars of your design, all research studies involve making sure our measurements
are accurate and consistent and that they are captured using the appropriate type of scale.
Finally, we will discuss the general process of hypothesis testing, from laying out predictions to drawing conclusions.
2.1 Overview of Research Designs
As you learned in Chapter 1, scientists can have a wide range of goals going into a
research project, from describing a phenomenon to attempting to change people’s
behavior. It turns out that these goals lend themselves to different approaches to
answering a research question. That is, you will approach the problem differently when
you want to describe voting patterns than when you want to explain them or predict
them. These approaches are called research designs, or the specific methods that are used
to collect, analyze, and interpret data. The choice of a design is not one to be made lightly;
the way you collect data trickles down to the kinds of conclusions that you can draw
about them. This section provides a brief introduction to the three main types of design—
descriptive, correlational, and experimental.
Descriptive Research
Recall from Chapter 1 that one of the basic goals of research is to describe a phenomenon. If your research question centers around description, then your research design
falls under the category of descriptive research, in which the primary goal is to describe
thoughts, feelings, or behaviors. Descriptive research provides a static picture of what
people are thinking, feeling, and doing at a given moment in time, as seen in the following
examples of research questions:
• What percentage of doctors prefer Xanax for the treatment of anxiety? (thoughts)
• What percentage of registered Republicans vote for independent candidates? (behaviors)
• What percentage of Americans blame the president for the economic crisis? (thoughts)
• What percentage of college students experience clinical depression? (feelings)
• What is the difference in crime rates between Beverly Hills and Detroit? (behaviors)
What these five questions have in common is the attempt to get a broad understanding of
a phenomenon without trying to delve into its causes.
The crime rate example highlights the main advantages and disadvantages of descriptive
designs. On the plus side, descriptive research is a good way to get a broad overview of a
phenomenon and can inspire future research. It is also a good way to study things that are
difficult to translate into a controlled experimental setting. For example, crime rates can
affect every aspect of people’s lives, and this importance would likely be lost in an experiment that manipulated income in a laboratory. On the downside, descriptive research
provides a static overview of a phenomenon and cannot dig into the reasons for it. A
descriptive design might tell us that Beverly Hills residents are half as likely as Detroit
residents to be assault victims, but it would not reveal the reasons for this discrepancy.
(If we wanted to understand why this was true, we would use one of the other designs.)
Descriptive research can be either qualitative or quantitative; in fact, the large majority of
qualitative research falls under the category of descriptive designs. Descriptions are quantitative when they attempt to make comparisons and/or to present a random sampling
of people’s opinions. The majority of our sample questions above would fall into this
group because they quantify opinions from samples of households, or cities, or college
students. Good examples of quantitative description appear in the “snapshot” feature on
the front page of USA Today. The graphics represent poll results from various sources; the
snapshot for August 3, 2011, reveals that only 61%
of Americans turn off the water while they brush
their teeth (i.e., behavior).
Descriptive designs are qualitative when they
attempt to provide a rich description of a particular set of circumstances. A great example of this
approach can be found in the work of neurologist Oliver Sacks. Sacks has written several books
exploring the ways that people with neurological
damage or deficits are able to navigate the world
around them. In one selection from The Man Who
Mistook His Wife for a Hat (1998), Sacks relates
the story of a man he calls William Thompson.
As a result of chronic alcohol abuse, Thompson
developed Korsakov’s syndrome, a brain disease
marked by profound memory loss. The memory
loss was so severe that Thompson had effectively
“erased” himself and could remember only scattered fragments of his past.
Whenever Thompson encountered people, he
would frantically try to determine who he was.
He would develop hypotheses and test them, as
in this excerpt from one of Sacks’s visits:
I am a grocer, and you’re my customer, right? Well, will that be paper or plastic? No, wait, why are you wearing that white coat? You must be Hymie, the
kosher butcher. Yep. That’s it. But why are there no bloodstains on your coat?
(Sacks, 1998, p. 112)
Sacks concludes that Thompson is “continually creating a world and self, to replace what
was continually being forgotten and lost” (p. 113). In telling this story, Sacks helps us to
understand Thompson’s experience and to be grateful for our ability to form and retain
memories. This story also illustrates the trade-off in these sorts of descriptive case studies:
Despite all its richness, we cannot generalize these details to other cases of brain damage;
we would need to study and describe each patient individually.
Correlational Research
The second goal of research that we discussed in Chapter 1 was to predict a phenomenon.
If your research question centers around prediction, then your research design falls under
the category of correlational research, in which the primary goal is to understand the
relationships among various thoughts, feelings, and behaviors. Examples of correlational
research questions include:
• Are people more aggressive on hot days?
• Are people more likely to smoke when they are drinking?
• Is income level associated with happiness?
• What is the best predictor of success in college?
• Does television viewing relate to hours of exercise?
What each of these questions has in common is that the goal is to predict one variable
based on another. If you know the temperature, can you predict aggression? If you know
a person’s income, can you predict her level of happiness? If you know a student’s SAT
scores, can you predict his college GPA?
These predictive relationships can turn out in one of three ways (more detail on each
one when we get to Chapter 4): A positive correlation means that higher values of one
variable predict higher values of the other variable. For example, more money predicts higher
levels of happiness, and less money predicts lower levels of happiness. The key is that
these variables move up and down together, as shown in the first row of Table 2.1. A
negative correlation means that higher values of one variable predict lower values of
the other variable. For example, more television viewing predicts fewer hours of exercise, and
fewer hours of television predict more hours of exercise. The key is that one variable
increases while the other decreases, as seen in the second row of Table 2.1. Finally, it is
worth noting a third possibility, which is to have no correlation between two variables,
meaning that you cannot predict one variable based on another. The key is that changes
in one variable are not associated with changes in the other, as seen in the third row of
Table 2.1.
Table 2.1: Three Possibilities for Correlational Research

Positive Correlation: Variables go up and down together. For example: taller people have bigger feet, and shorter people have smaller feet. (Visual: shoe size rises with height.)

Negative Correlation: One variable goes up and the other goes down. For example: as the number of beers consumed goes up, speed of reactions goes down. (Visual: reaction speed falls as the number of beers rises.)

No Correlation: The variables have nothing to do with one another. For example: shoe size and number of siblings are completely unrelated. (Visual: no pattern between number of siblings and shoe size.)
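To make these three outcomes concrete, here is a minimal sketch in Python (the numbers are invented for illustration, not data from any study) that computes a Pearson correlation coefficient for each row of Table 2.1; values near +1, -1, and 0 correspond to positive, negative, and no correlation, respectively.

```python
# A minimal sketch (invented numbers, not data from any study) of the three
# outcomes in Table 2.1, expressed as Pearson correlation coefficients.
import numpy as np

height = np.array([60, 63, 66, 69, 72, 75])                 # inches
shoe_size = np.array([7, 8, 9, 10, 11, 12])                 # rises with height

beers = np.array([0, 1, 2, 3, 4, 5])
reaction_speed = np.array([9.5, 9.0, 7.8, 6.9, 5.5, 4.8])   # falls as beers rise

siblings = np.array([0, 3, 1, 2, 0, 4])
shoe_size_2 = np.array([9, 8, 11, 7, 10, 12])               # unrelated to siblings

print(np.corrcoef(height, shoe_size)[0, 1])        # near +1: positive correlation
print(np.corrcoef(beers, reaction_speed)[0, 1])    # near -1: negative correlation
print(np.corrcoef(siblings, shoe_size_2)[0, 1])    # near  0: no correlation
```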
Correlational designs are about prediction, and we are still unable to make causal, explanatory statements (that comes next...). A common mantra in the field of psychology is that
correlation does not equal causation. In other words, just because variable A predicts variable B does not mean that A causes B. This is true for two reasons, which we refer to as the
directionality problem and the third variable problem. (See Figure 2.1.)
First, when we measure two variables at the same time, we have no way of knowing
the direction of the relationship. Take the relationship between money and happiness:
It could be true that money makes people happier because they can afford nice things
and fancy vacations. It could also be true that happy people have the confidence and
charm to obtain higher-paying jobs, resulting in more money. In a correlational study, we
are unable to distinguish between these possibilities. Or, take the relationship between
television viewing and obesity: It could be that people who watch more television get
heavier because TV makes them snack more and exercise less. It could also be that people
who are overweight don’t have the energy to move around and end up watching more
television as a consequence. Once again, we cannot identify a cause–effect relationship
in a correlational study.
Second, when we measure two variables as they naturally occur, there is always the possibility of a third variable that actually causes both of them. For example, imagine we find a correlation between the number of churches and the number of liquor stores in a city. Do people build more churches to offset the threat of liquor stores? Do people build more liquor stores to rebel against churches? Most likely, the link involves the third variable of population: The more people there are living in a city, the more churches and liquor stores they can support.

Figure 2.1: Correlation Is Not Causation! (The directionality problem: does income cause happiness, or does happiness cause income? The third variable problem: temperature drives both ice cream sales and homicides.)

Or, consider this example from analyses of posts on the recommendation website Hunch.com. One of the cofounders of the website conducted extensive analyses of people's activity and brand preferences and found a positive correlation between how much people liked to dance and how likely they were to prefer Apple computers (Fake, 2009). Does this mean that owning a Mac makes you want to dance? Does dancing make you think highly of Macs? Most likely, the link here involves a third variable of personality: People who are more unconventional may be more likely to prefer both Apple computers and dancing.
Experimental Research
Finally, recall that the most powerful goal of research is to attempt to explain a phenomenon. When your research goal involves explanation, then your research design falls under
the category of experimental research, in which the primary goal is to explain thoughts,
feelings, and behaviors and to make causal statements. Examples of experimental research questions include:
• Does smoking cause cancer?
• Does alcohol make people more aggressive?
• Does loneliness cause alcoholism?
• Does stress cause heart disease?
• Can meditation make people healthier?
What these five questions have in common is a
focus on understanding why something happens.
Experiments move beyond asking, for example,
whether alcoholics are more aggressive to ask
whether alcohol causes an increase in aggression.
Experimental designs are able to address the shortcomings of correlational designs because
the researcher has more control over the environment. We will cover this in great detail in Chapter
5, but for now, experiments are a relatively simple process: A researcher has to control the
environment as much as possible so that all participants in the study have the same experience. She will then manipulate, or change, one key variable and then measure outcomes
in another key variable. The variable that gets manipulated by the experimenter is called
the independent variable. The outcome variable that is measured by the experimenter is
called the dependent variable. The combination of controlling the setting and changing
one aspect of this setting at a time allows her to state with some certainty that the changes
caused something to happen.
Let’s make this a little more concrete. Imagine that you wanted to test the hypothesis
that meditation causes improvements in health. In this case, meditation would be the
independent variable and health would be the dependent variable. One way to test this
hypothesis would be to take a group of people and have half of them meditate 20 minutes
per day for several days while the other half did something else for the same amount of
time. The group that meditates would be the experimental group because it provides
the test of our hypothesis. The group that does not meditate would be the control group
because it provides a basis of comparison for the experimental group. You would want
to make sure that these groups spent the 20 minutes in similar conditions so that the only
difference would be the presence or absence of meditation. One way to accomplish this
would be to have all participants sit quietly for the 20 minutes but give the experimental
group specific instructions on how to meditate. Then, to test whether meditation led
to increased health and happiness, you would give both groups a set of outcome measures—perhaps a combination of survey measures and a doctor’s examination. If you
found differences between these groups on the dependent measures, you could be fairly
confident that meditation caused them to happen. For example, you might find lower
blood pressure in the experimental group; this would suggest that meditation causes a
drop in blood pressure.
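As a purely hypothetical illustration of how such an experiment might be analyzed, the following Python sketch simulates blood pressure for an experimental and a control group and compares them with an independent-samples t test; the group sizes, means, and standard deviations here are all invented.

```python
# A hypothetical sketch of analyzing the meditation experiment described above.
# Group sizes, means, and standard deviations are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated systolic blood pressure (the dependent variable) after the study
meditation_group = rng.normal(loc=118, scale=10, size=30)  # experimental group
control_group = rng.normal(loc=124, scale=10, size=30)     # control group

t_stat, p_value = stats.ttest_ind(meditation_group, control_group)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value would suggest the group difference is unlikely to be chance.
```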
Research: Making an Impact
Helping Behaviors
The 1964 murder of Kitty Genovese in plain sight of her neighbors, none of whom helped, drove
numerous researchers to investigate why people may not help others in need. Are people selfish and
bad, or is there a group dynamic at work that leads to inaction? Is there something wrong with our
culture, or are situations more powerful than we think?
Among the body of research conducted in the late 1960s and 1970s was one pivotal study that
revealed why people may not help others in emergencies. Darley and Latane (1968) conducted an
experiment in which individuals in different rooms communicated via intercom. In reality, it
was one participant and a number of confederates, one of whom pretended to have a seizure. Among
participants who thought they were the only other person listening over the intercom, more than
80% helped, and they did so in less than 1 minute. However, among participants who thought they
were one of a group of people listening over the intercom, less than 40% helped, and even then only
after more than 2.5 minutes. This phenomenon, that the more people who witness an emergency,
the less likely any of them is to help, has been dubbed the “bystander effect.” One of the main reasons that this occurs is that responsibility for helping gets “diffused” among all of the people present, so that each one feels less personal responsibility for taking action.
This research can be seen in action and has influenced safety measures in today's society. For example, when witnessing an emergency, it no longer suffices to simply yell to the group, "Call 9-1-1!" Because of the bystander effect, we know that most people will believe someone else will do it,
and the call will not be made. Instead, it is necessary to point to a specific person to designate them
as the person to make the call. In fact, part of modern-day CPR training involves making individuals
aware of the bystander effect and best practices for getting people to help and be accountable.
Although this phenomenon may be the rule, there are always exceptions. For example, on September 11, 2001, the fourth hijacked airplane was overtaken by a courageous group of passengers. Most
people on the plane had heard about the twin tower crashes, and recognized that their plane was
heading for Washington, D.C. Despite being among dozens of other people, a few people chose
to help the intended targets in D.C. Risking their own safety, these heroic people chose to help so
as to prevent death and suffering to others. So, while we see events every day that remind us of the
reality of the bystander effect, we also see moments where people are willing to help, no matter the
number of people that surround them.
Choosing a Research Design
The choice of a research design is guided first and foremost by your research question
and then adjusted depending on practical and ethical concerns. At this point, there may
be a nagging question in the back of your mind: If experiments are the most powerful
type of design, why not use them every time? Why would you ever give up the chance
to make causal statements? One reason is that we are often interested in variables that
cannot be manipulated, for ethical or practical reasons, and that therefore have to be
studied as they occur naturally. In one example, Matthias Mehl and Jamie Pennebaker
happened to start a weeklong study of college students’ social lives on September 10,
2001. Following the terrorist attacks on the morning of September 11, Mehl and Pennebaker were able to track changes in people’s social connections and use this to understand how groups respond to traumatic events (Mehl & Pennebaker, 2003). Of course,
it would have been unthinkable to experimentally manipulate a terrorist attack for this
study, but since it occurred naturally, the researchers were able to conduct a correlational study of coping.
Another reason to use descriptive and correlational designs is that these are useful in the
early stages of a research program. For example, before you start to think about the causes
of binge drinking among college students, it is important to understand how common this
phenomenon is. Before you design a time- and cost-intensive experiment on the effects of
meditation, it is a good idea to conduct a correlational study to test whether meditation even
predicts health. In fact, this example comes from a series of real research studies conducted
by psychiatrist Sara Lazar and her colleagues at Massachusetts General Hospital. This
research team first discovered that experienced practitioners of mindfulness meditation had
more development in brain areas associated with attention and emotion. But this study was
correlational at best; perhaps meditation causes changes in brain structure or perhaps people
who are better at integrating emotions are drawn to meditation. In a follow-up study, they
randomly assigned people to either meditate or complete stretching exercises for 2 months.
These experimental findings confirmed that mindfulness meditation actually caused structural changes to the brain (Hölzel et al., 2011). In addition, this is a fantastic example of how
a research program can progress from correlational to experimental designs. Table 2.2 summarizes the main advantages and disadvantages of our three types of design.
Table 2.2: Summary of Research Designs

Descriptive. Goal: describe characteristics of an existing phenomenon. Advantages: provides a complete picture of what is occurring at a given time. Disadvantages: does not assess relationships; no explanation for the phenomenon.

Correlational. Goal: predict behavior; assess strength of relationship between variables. Advantages: allows testing of expected relationships; predictions can be made. Disadvantages: can't draw inferences about causal relationships.

Experimental. Goal: explain behavior; assess impact of the IV on the DV. Advantages: allows conclusions to be drawn about causal relationships. Disadvantages: many important variables can't be manipulated.
Designs on the Continuum of Control
Before we leave our design overview behind, a few words are in order on how these designs relate to
one another. The best way to think about the differences between the designs is in terms
of the amount of control you have as a researcher. That is, experimental designs are the
most powerful because the researcher controls everything from the hypothesis to the environment in which the data are collected. Correlational designs are less powerful because
the researcher is restricted to measuring variables as they occur naturally. However, with
correlational designs, the researcher does maintain control over several aspects of data
collection, including the setting and the choice of measures. Descriptive designs are the
least powerful because it is difficult to control outside influences on data collection. For
example, when people answer opinion polls over the phone, they might be sitting quietly
and pondering the questions or they might be watching television, eating dinner, and
dealing with a fussy toddler. As a result, a researcher is more limited in the conclusions
he or she can draw from these data. Figure 2.2 shows an overview of research designs in
order of increasing control, from descriptive, to predictive, to experimental. As we progress through Chapters 3, 4, and 5, we will cover variations on these designs in more detail.
Figure 2.2: The Continuum of Control Framework

Descriptive Methods: case study, archival research, observation
Predictive Methods: survey research
Experimental Methods: quasi-experiments, "true" experiments

(Control increases from descriptive, to predictive, to experimental methods.)
2.2 Reliability and Validity
E
ach of the three types of research design has the same basic goal: to take a hypothesis about some phenomenon and translate it into measurable and testable terms.
That is, whether we use a descriptive, correlational, or experimental design to test
our predictions about income and happiness, we still need to translate (or operationalize) the concepts of income and happiness into measures that will be useful for the study.
The sad truth is that our measurements will always be influenced by factors other than
the conceptual variable of interest. Answers to any set of questions about happiness will
depend both on actual levels of happiness and the ways people interpret the questions.
Our meditation experiment may have different effects depending on people’s experience
with meditation. Even describing the percentage of Republicans voting for independent
candidates will vary depending on characteristics of a particular candidate.
These additional sources of influence can be grouped into two categories: random and systematic errors. Random error involves chance fluctuations in measurements, such as when
a few people misunderstand the question or the experimenter enters the wrong values into a
statistical spreadsheet. Although random errors can influence measurement, they generally
cancel out over the span of an entire sample. That is, some people may overreact to a question while others underreact. The experimenter may accidentally type a 6 instead of a 5 but
then later type a 5 instead of a 6 when entering the data. While both of these examples would
add error to our dataset, they would cancel each other out in a sufficiently large sample.
Systematic errors, in contrast, are those that systematically increase or decrease along
with values on our measured variable. For example, people who have more experience
with meditation may show consistently more improvement in our meditation experiment
than those with less experience. Or, people with higher self-esteem may score higher on
our measure of happiness than those with lower self-esteem. In this case, our happiness
scale will end up assessing a combination of happiness and self-esteem. These types of
errors can cause more serious trouble for our hypothesis tests because they interfere with
our attempts to understand the link between two variables.
In sum, the measured values of our variable reflect a combination of the true score, random error, and systematic error, as shown in the following conceptual equation:
Measured Score = True Score + (Random Error + Systematic Error)
For example:
Happiness Score = Level of Happiness + (Misreading Question + Self-Esteem)
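A small simulation helps show why random error is less damaging than systematic error. In this Python sketch (the assumptions and numbers are mine, for illustration only), every respondent has the same true happiness score; random misreadings average out across the sample, while a systematic self-esteem bias does not.

```python
# A minimal simulation (assumptions mine) of the conceptual equation above:
# random error cancels out across a sample, systematic error does not.
import numpy as np

rng = np.random.default_rng(0)
true_happiness = 5.0                          # the same "true score" for everyone

random_error = rng.normal(0, 1, size=1000)    # misreadings: centered on zero
systematic_error = 0.5                        # e.g., self-esteem inflating answers

measured = true_happiness + random_error + systematic_error
print(measured.mean())  # about 5.5: random error averages out, the bias remains
```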
So, if our measurements are also affected by outside influences, how do we know whether
our measures are meaningful? Occasionally, the answer to this question is straightforward; if we ask people to report their weight or their income level, these values can be
verified using objective sources. However, many of our research questions within psychology involve more ambiguity. How do we know that our happiness scale is the best
one? The problem in answering this question is that we have no way to objectively verify
happiness. What we need, then, are ways to assess how close we are to measuring happiness in a meaningful way. This assessment involves two related concepts, reliability, or
the consistency of a measure; and validity, or the accuracy of a measure. In this section,
we will examine both of these concepts in detail.
Reliability
The consistency of time measurement by watches, cell phones, and clocks reflects a high
degree of reliability. We think of a watch as reliable when it keeps track of the time consistently. Likewise, our scale is reliable when it gives the same value for our weight in back-to-back measurements.
Reliability is technically defined as the extent to which a measured variable is free from
random errors. As we discussed above, our measures are never perfect, and reliability is
threatened by five main sources of random error:
• Transient states, or temporary fluctuations in participants' cognitive or mental state; for example, some participants may complete your study after an exhausting midterm or in a bad mood after a fight with their significant others
• Stable individual differences among participants; for example, some participants are habitually more motivated, or happier, than other participants
• Situational factors in the administration of the study; for example, running your experiment in the early morning may make everyone tired or grumpy
• Bad measures that add ambiguity or confusion to the measurement; for example, participants may respond differently to a question about "the kinds of drugs you are taking." Some may take this to mean illegal drugs, whereas others interpret it as prescription or over-the-counter drugs
• Mistakes in coding responses during data entry; for example, a handwritten 7 could be mistaken for a 4
We naturally want to minimize the influence of all of these sources of error, and we will
touch on techniques for doing so throughout the book. However, researchers are also
resigned to the fact that all of our measurements contain a degree of error. The goal, then,
is to develop an estimate of how reliable our measures are. Researchers generally estimate
reliability in three ways.
Test–Retest Reliability refers to the consistency of our measure over time—much like our
examples of a reliable watch and a reliable scale. A fair number of research questions in
the social and behavioral sciences involve measuring stable qualities. For example, if you
were to design a measure of intelligence or personality, both of these characteristics should
be relatively stable over time. Your score on an intelligence test today should be roughly
the same as your score when you take it again in 5 years. Your level of extraversion today
should correlate highly with your level of extraversion in 20 years. The test–retest reliability of these measures is quantified by simply correlating measures at two time points.
The higher these correlations are, the higher the reliability will be. This makes conceptual
sense as well; if our measured scores reflect the true score more than they reflect random
error, then this will result in increased stability of the measurements.
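In practice, this is a one-line computation. The following sketch correlates a set of invented intelligence scores measured at two time points; the resulting correlation serves as the test–retest reliability estimate.

```python
# A sketch of estimating test-retest reliability: correlate the same people's
# scores at two time points (the scores below are invented).
import numpy as np

iq_time1 = np.array([98, 105, 112, 91, 120, 101, 108])
iq_time2 = np.array([99, 103, 115, 93, 118, 104, 106])  # the retest years later

r = np.corrcoef(iq_time1, iq_time2)[0, 1]
print(f"test-retest reliability: r = {r:.2f}")  # higher r = more consistent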
Interitem Reliability refers to the internal consistency among different items on our measure. If you think back to the last time you completed a survey, you may have noticed that
it seemed to ask the same questions more than once (more on this technique in Chapter
4). This is done because a single item is more likely to contain measurement error than is
the average of several items—remember that small random errors tend to cancel out. Consider the following items from Sheldon Cohen’s Perceived Stress Scale (Cohen, Kamarck, &
Mermelstein, 1983):
1. In the last month, how often have you felt that you were unable to control the
important things in your life?
2. In the last month, how often have you felt confident about your ability to handle
your personal problems?
3. In the last month, how often have you felt that things were going your way?
4. In the last month, how often have you felt difficulties were piling up so high that
you could not overcome them?
Each of these items taps into the concept of
“stressed out,” or overwhelmed by the demands
of one’s life. One standard way to evaluate a measure like this is by computing the average correlation between each pair of items, a statistic referred
to as Cronbach’s alpha. The more these items tap
into a central, consistent construct, the higher the
value of this statistic is. Conceptually, a higher
alpha means that variation in responses to the different items reflects variation in the “true” variable being assessed by the scale items.
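The following Python sketch computes Cronbach's alpha for invented responses to a four-item scale, using the standard variance-based formula, which is closely related to the average inter-item correlation described above. On the real Perceived Stress Scale, positively worded items such as 2 and 3 would be reverse-scored first; the data below assume that recoding has already been done.

```python
# A sketch of Cronbach's alpha for a four-item scale. Rows are respondents,
# columns are items (data invented; reverse-scored items assumed recoded).
import numpy as np

responses = np.array([
    [4, 3, 4, 5],
    [2, 2, 1, 2],
    [5, 4, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
])

k = responses.shape[1]                              # number of items
item_variances = responses.var(axis=0, ddof=1)      # variance of each item
total_variance = responses.sum(axis=1).var(ddof=1)  # variance of summed scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")  # closer to 1 = more consistent
```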
Interrater Reliability refers to the consistency
among judges observing participants’ behavior.
The previous two forms of reliability were relevant in dealing with self-report scales; interrater
reliability is more applicable when research involves behavioral measures. Imagine you
are studying the effects of alcohol consumption on aggressive behavior. You would most
likely want a group of judges to observe participants in order to make ratings of their
levels of aggression. In the same way that using multiple scale items helps to cancel out
the small errors of individual items, using multiple judges cancels out the variations in
each individual’s ratings. In this case, people could have different ideas and thresholds
for what constitutes aggression. Much like the process of evaluating multiple scale items,
we can evaluate the judges’ ratings by calculating the average correlation among the ratings. The higher our alpha values, the more the judges agree in their ratings of aggressive
behavior. Conceptually, a higher alpha value means that variation in the judges’ ratings
reflects real variation in levels of aggression.
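The same logic can be sketched for judges: here three hypothetical raters score the aggression of the same eight participants (all ratings invented), and the average of the pairwise correlations serves as the interrater reliability estimate.

```python
# A sketch of interrater reliability: average the pairwise correlations among
# three judges rating the aggression of the same eight participants (invented).
from itertools import combinations
import numpy as np

ratings = [
    np.array([2, 5, 3, 4, 1, 5, 2, 3]),  # judge A
    np.array([3, 5, 2, 4, 1, 4, 2, 3]),  # judge B
    np.array([2, 4, 3, 5, 2, 5, 1, 3]),  # judge C
]

pairwise_rs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(ratings, 2)]
print(f"mean interrater correlation = {np.mean(pairwise_rs):.2f}")
```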
Validity
Let’s return to our watch and scale examples. Perhaps you are the type of person who
sets your watch 10 minutes ahead to avoid being late. Or perhaps you have adjusted
your scale by 5 pounds to boost your motivation or your self-esteem. In these cases, your
watch and your scale may produce consistent measurements, but the measurements are
not accurate. It turns out that the reliability of a measure is a necessary but not sufficient
basis for evaluating it. Put bluntly, our measures can be (and have to be) consistent but
might still be garbage. The additional piece of the puzzle is the validity of our measures,
or the extent to which they accurately measure what they are designed to measure.
Whereas reliability is threatened more by random error, validity is threatened more by systematic error. If the measured scores on our happiness scale reflect, say, self-esteem more
than they reflect happiness, this would threaten the validity of our scale. We discussed in
the previous section that a test designed to measure intelligence ought to be consistent
over time. And in fact, these tests do show very high degrees of reliability. However, several researchers have cast serious doubts on the validity of intelligence testing, arguing
that even scores on an official IQ test are influenced by a person’s cultural background,
socioeconomic status (SES), and experience with the process of test taking (for discussion
of these critiques, see Daniels et al., 1997; Gould, 1996). For example, children growing up
in higher SES households tend to have more books in the home, spend more time interacting with one or both parents, and attend schools that have more time and resources available—all of which are correlated with scores on IQ tests. Thus, all of these factors amount
to systematic error in the measure of intelligence and, therefore, threaten the validity of a
measured score on an intelligence test.
Researchers have two main ways to discuss and evaluate the validity, or accuracy, of measures: construct validity and criterion validity.
Construct Validity is evaluated based on how well the measures capture the underlying
conceptual ideas (i.e., the constructs) in a study. These constructs are equivalent to the “true
score” discussed in the previous section. That is, how accurately does our bathroom scale
measure the concept of weight? How accurately does our IQ test measure the construct of
intelligence relative to other things? There are a couple of ways to assess the validity of our
measures. On the subjective end of the continuum, we can assess the face validity of the
measure, or the extent to which it simply seems like a good measure of the construct. The
items from the Perceived Stress Scale have high face validity because the items match what
we intuitively mean by “stress” (e.g., “how often
have you felt difficulties were piling up so high
that you could not overcome them?”). However,
if we were to measure your speed at eating hot
dogs and then tell you it was a stress measure,
you might be dubious because this would lack
face validity as a measure of stress.
Although face validity is nice to have, it can
sometimes (ironically) reduce the validity of
the measures. Imagine seeing the following two
measures on a survey of your attitudes:
1. Do you dislike people whose skin color
is different from yours?
2. Do you ever beat your children?
On the one hand, these are extremely face-valid measures of attitudes about prejudice and corporal punishment—they very much capture our intuitive ideas about these concepts. On the
other hand, even people who do support these
attitudes may be unlikely to answer honestly because they can recognize that neither attitude is popular. In cases like this, a measure low in face validity might end up being the
more accurate approach. We will discuss ways to strike this balance in Chapter 4.
On the less subjective end, we can assess the validity of our constructs by examining their
empirical connections to both related and unrelated measures. Imagine for a moment that
you are developing a new measure of liberal political attitudes. If we think about someone
who describes herself as liberal, she is likely to support gun control, equal rights, and a
woman’s right to choose. And, she is less likely to be pro-war, anti-immigration, or homophobic. Therefore, we would expect our new liberalism measure to correlate positively
with existing measures of attitudes toward guns, affirmative action, and abortion. This
pattern of correlations taps into the metric of convergent validity, or the extent to which
our measure overlaps with conceptually similar measures. In addition, we would expect
our new liberalism measure not to correlate with attitudes toward wars, immigrants, or
gays and lesbians. This hypothesized lack of correlations taps into the metric of discriminant validity, or the extent to which our measure diverges from unrelated measures.
To take another example, imagine you wanted to develop a new measure of narcissism,
usually defined as an intense desire to be liked and admired by other people. Narcissists
tend to be self-absorbed but also very attuned to the feedback they receive from other
people—at least as it pertains to the extent to which people admire them. Narcissism is
somewhat similar to self-esteem but different enough; it is perhaps best viewed as high
and unstable self-esteem. So, given these facts, we might assess the discriminant validity
of our measure by making sure it did not overlap too closely with measures of self-esteem
or self-confidence. This would establish that our measure stands apart from these different constructs. We might then assess the convergent validity of our measure by making
sure that it did correlate with things like sensitivity to rejection and need for approval.
These correlations would place our measure into a broader theoretical context and help to
establish it as a valid measure of the construct of narcissism.
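Computationally, this validation amounts to examining a pattern of correlations. The sketch below uses invented scores to check that a hypothetical narcissism scale correlates strongly with a conceptually related measure (convergent validity) and only weakly with self-esteem (discriminant validity).

```python
# A sketch of convergent and discriminant validity checks for the hypothetical
# narcissism scale (all scores invented for illustration).
import numpy as np

narcissism = np.array([12, 25, 18, 30, 9, 22, 15, 27])
rejection_sensitivity = np.array([10, 24, 16, 28, 8, 20, 14, 26])  # related
self_esteem = np.array([17, 18, 14, 13, 14, 19, 12, 13])           # distinct

print(np.corrcoef(narcissism, rejection_sensitivity)[0, 1])  # convergent: want high
print(np.corrcoef(narcissism, self_esteem)[0, 1])            # discriminant: want low
```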
Criterion Validity is evaluated based on the association between measures and relevant
behavioral outcomes. The criterion in this case refers to a measure that can be used to make
decisions. For example, if you developed a personality test to assess management style,
the most relevant metric of its validity would be whether it predicted a person’s behavior as a manager. That is, you might expect people scoring high on this scale to be able to
increase the productivity of their employees and to maintain a comfortable work environment. Likewise, if you developed a measure that predicted the best careers for graduating
seniors based on their skills and personalities, then criterion validity would be assessed
through people’s actual success in these various careers. Whereas construct validity is more
concerned with the underlying theory behind the constructs, criterion validity is more concerned with the practical application of measures. As you might expect, this approach is
more likely to be used in applied settings.
That said, criterion validity is also a useful way to
supplement validation of a new questionnaire. For
example, a questionnaire about generosity should
be able to predict people’s annual giving to charities, and a questionnaire about hostility ought to
predict hostile behaviors. To supplement the construct validity of our narcissism measure, we might
examine its ability to predict the ways people
respond to rejection and approval. Based on the definition of our construct, we might hypothesize that
narcissists would become hostile following rejection and perhaps become eager to please following
approval. If these predictions were supported, we
would end up with further validation that our measure was capturing the concept of narcissism.
Criterion validity falls into one of two categories, depending on whether the researcher is
interested in the present or the future. Predictive validity involves attempting to predict a
future behavioral outcome based on the measure, as in our examples of the management
style and career placement measures. Predictive validity is also at work when researchers
(and colleges) try to predict likelihood of school success based on SAT or GRE scores. The
goal here is to validate our construct via its ability to predict the future.
In contrast, concurrent validity involves attempting to link a self-report measure with a
behavioral measure collected at the same time, as in our examples of the generosity and
hostility questionnaires. The phrase "at the same time" is used loosely here; our self-report and behavioral measures may be separated by a short time span. In fact, concurrent
validity sometimes involves trying to predict behaviors that occurred before completion
of the scale, such as trying to predict students’ past drinking behaviors from an “attitudes
toward alcohol” scale. The goal in this case is to validate our construct via its association
with similar measures.
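For instance, the predictive validity of SAT scores could be assessed by regressing later college GPA on the scores, as in this sketch (all numbers invented); the correlation between predictor and outcome is often called a validity coefficient.

```python
# A sketch of predictive validity: do SAT scores predict later college GPA?
# (Scores and GPAs are invented; real validation would use actual records.)
import numpy as np

sat = np.array([1050, 1190, 1300, 1420, 980, 1260, 1110, 1380])
gpa = np.array([2.8, 3.1, 3.4, 3.7, 2.5, 3.2, 2.9, 3.6])

slope, intercept = np.polyfit(sat, gpa, 1)       # simple linear regression
r = np.corrcoef(sat, gpa)[0, 1]                  # the "validity coefficient"

print(f"predicted GPA at SAT 1200: {slope * 1200 + intercept:.2f}")
print(f"validity coefficient: r = {r:.2f}")      # higher r = better prediction
```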
Summary: Comparing Reliability and Validity
As we have seen in this section, both reliability (consistency) and validity (accuracy) are
ways to evaluate measured variables and to assess how well these measurements capture
the underlying conceptual variable. In establishing estimates of both of these metrics, we
essentially examine a set of correlations with our measured variables. But while reliability
involves correlating our variables with themselves (e.g., happiness scores at week 1 and
week 4), validity involves correlating our variables with other variables (e.g., our happiness scale with the number of times a person smiles). Figure 2.3 displays the relationships
among types of reliability and validity.
Figure 2.3: Types of Reliability and Validity

Reliability (Consistency): Test–Retest Reliability, Interitem Reliability, Interrater Reliability

Validity (Accuracy): Construct Validity, comprising Convergent Validity and Discriminant Validity; and Criterion Validity, comprising Predictive Validity and Concurrent Validity
We learned earlier that reliability is necessary but not sufficient to evaluate measured variables. That is, reliability has to come first and is an essential requirement for any variable—
you would not trust a watch that was sometimes 5 minutes fast and other times 10 minutes slow. If we cannot establish that a measure is reliable, then there is really no chance
of establishing its construct validity because every measurement might be a reflection of
random error. However, just because a measure is consistent does not make it accurate.
Your watch might consistently be 10 minutes fast; your scale might always be 5 pounds
under your actual weight. For that matter, your test of intelligence might result in consistent scores but actually be capturing respondents’ cultural background. Reliability tells us
the extent to which a measure is free from random error. Validity takes the second step of
telling us the extent to which the measure is also free from systematic error.
Finally, it is worth pointing out that establishing validity for a new measure is hard work.
Reliability can be tested in a single step by correlating scores from multiple measures,
multiple items, or multiple judges within a study. But testing the construct validity of
a new measure involves demonstrating both convergent and discriminant validity. In
developing our narcissism scale, we would need to show that it correlated with things
like fear of rejection (convergent) but was reasonably different from things like self-esteem
(discriminant). The latter criterion is particularly difficult to establish because it takes time
and effort—and multiple studies—to demonstrate that one scale is distinct from another.
There is, however, an easy way to avoid these challenges: Use existing measures whenever
possible. Before creating a brand new happiness scale, or narcissism scale, or self-esteem
scale, check to see if one exists that has already gone through the ordeal of being validated.
2.3 Scales and Types of Measurement
As you may remember from your statistics class, not all measures are created equal.
One of the easiest ways to decrease error variance, and thereby increase our reliability and validity, is to make smart choices when we design and select our measures. Throughout this book, we will discuss guidelines for each type of research design
and ways to ensure that our measures are as accurate and unbiased as possible. In this
section, we examine some basic rules that apply across all three types of design. We first
review the four scales of measurement and discuss the proper use of each one; we then
turn our attention to three types of measurement used in psychological research studies.
Scales of Measurement
Whenever we go through the process of translating our conceptual variables into measurable variables (i.e., operationalization; see Chapter 1), it is important to ensure that our
measurements accurately represent the underlying concepts. We have covered this process already; in our discussion of validity, you learned that this accuracy is a critical piece
of hypothesis testing. For example, if we develop a scale to measure job satisfaction, then
we need to verify that this is actually what the scale is measuring. But there is an additional, subtler dimension to measurement accuracy: We also need to be sure that our chosen measurement accurately reflects the underlying mathematical properties of the concept. In many cases in the natural sciences, this process is automatically precise. When we
measure the speed of a falling object or the temperature of a boiling object, the underlying
concepts (speed and temperature) translate directly into scaled measurements. But in the
social and behavioral sciences, this process is trickier; we have to decide carefully how
best to represent abstract concepts such as happiness, aggression, and political attitudes.
As we take the step of scaling our variables, or specifying the relationship between our
conceptual variable and numbers on a quantitative measure, we have four different scales
to choose from, presented below in order of increasing statistical power and flexibility.
Nominal Scales
Nominal scales are used to label or identify a particular group or characteristic. For example, we can label a person’s gender male or female, and we could label a person’s religion
Catholic, Buddhist, Jewish, Muslim, or some other religion. In experimental designs, we
can also use nominal scales to label the condition to which a person has been assigned
(e.g., experimental or control groups). The assumption in using these labels is that members of the group have some common value or characteristic, as defined by the label. For
example, everyone in the Catholic group should have similar religious beliefs, and everyone in the female group should be of the same gender.
It is common practice in research studies to represent these labels with numeric codes, such
as using a 1 to indicate females and a 2 to indicate males. However, these numbers are completely arbitrary and meaningless—that is, males do not have more gender than females.
We could just as easily replace the 1 and the 2 with another pair of numbers or with a pair
of letters or names. Thus, the primary characteristic of nominal scales is that the scaling
itself is arbitrary. This prevents us from using these values in mathematical calculations.
One helpful way to appreciate the difference between this scale and the other three is to
think of nominal scales as qualitative, because they label and identify, and to think of the
other scales as quantitative, because they indicate the extent to which someone possesses a
quality or characteristic. Let’s turn our attention to these quantitative scales in more detail.
Ordinal Scales
Ordinal scales are used to represent ranked orders of conceptual variables. For example,
beauty contestants, horses, and Olympic athletes are all ranked by the order in which they
finish—first, second, third, and so on. Likewise, movies, restaurants, and consumer goods
are often rated using a system of stars (i.e., 1 star is not good; 5 stars is excellent) to represent their quality. In these examples, we can draw conclusions about the relative speed,
beauty, or deliciousness of the rating target. But the numbers used to label these rankings
do not necessarily map directly onto differences in the conceptual variable. The fourth-place finisher in a race is rarely twice as slow as the second-place finisher; the beauty contest winner is not three times as attractive as the third-place finisher; and the boost in quality between a four-star and a five-star restaurant is not the same as the boost between a
two-star and three-star restaurant. Ordinal scales represent rank orders, but the numbers
do not have any absolute value of their own. Thus, this type of scale is more powerful than
a nominal scale but still limited in that we cannot perform mathematical operations. For
example, if an Olympic athlete finished first in the 800-meter dash, third in the 400-meter
hurdles, and second in the 400-meter relay, you might be tempted to calculate her average
finish as being in second place. Unfortunately, the properties of ordinal scales prevent us
from doing this sort of calculation because the distance between first, second, and third
place would be different in each case. In order to perform any mathematical manipulation
of our variables, we need one of the next two types of scale.
Interval Scales
Interval scales represent cases where the numbers on a measured variable correspond to equal
distances on a conceptual variable. For example,
temperature increases on the Fahrenheit scale
represent equal intervals—warming from 40 to
47 degrees is the same increase as warming from
90 to 97 degrees. Interval scales share the key feature of ordinal scales—higher numbers indicate
higher relative levels of the variable—but interval scales go an important step further. Because
these numbers represent equal intervals, we are
able to add, subtract, and compute averages.
That is, whereas we could not calculate our athlete’s average finish, we can calculate the average temperature in San Francisco or the average
age of our participants.
Ratio Scales
Ratio scales go one final step further, representing interval scales that also have a true
zero point, that is, the potential for a complete absence of the conceptual variable. Ratio
scales can be used in the case of physical measurements, such as length, weight, and time
since it is possible to have a complete absence of any of these. Ratio scales can also be
used in measurement of behaviors since it is possible to have zero drinks per day, zero
presses of a reward button, or zero symptoms of the flu. Temperature in degrees Kelvin
is measured on a ratio scale because 0 Kelvin indicates an absence of molecular motion.
(In contrast, 0 degrees Fahrenheit is only a center point on the temperature scale.) Contrast these measurements with many of the conceptual variables featured in psychology
research—there is no such thing as zero happiness or zero self-esteem. The big advantage
of having a true zero point is that it allows us to add, subtract, multiply, and divide scale
values. When we measure weight, for example, it makes sense to say that a 300-pound
adult weighs twice as much as a 150-pound adult. And, it makes sense to say that having
two drinks per day is only 1/4 as many as having eight drinks per day.
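The practical upshot of the four scales is which operations are legitimate on the resulting numbers. This Python sketch (with invented data throughout) pairs each scale with an operation it supports: counts for nominal, medians and ranks for ordinal, means for interval, and meaningful ratios only for ratio scales.

```python
# A sketch pairing each scale of measurement with an operation it supports
# (invented data throughout).
from statistics import mode, median
import numpy as np

religion = ["Catholic", "Buddhist", "Catholic", "Jewish"]  # nominal: labels only
race_finish = [1, 2, 3]                                    # ordinal: ranks
temps_f = np.array([58, 62, 71, 65])                       # interval: equal steps
drinks_per_day = np.array([0, 2, 4, 8])                    # ratio: true zero

print(mode(religion))               # nominal: counts and modes only
print(median(race_finish))          # ordinal: medians and ranks, not means
print(temps_f.mean())               # interval: averages are meaningful
print(drinks_per_day[3] / drinks_per_day[1])  # ratio: 8 is truly 4x as many as 2
```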
Summary—Choosing and Using Scales of Measurement
The take-home point from our discussion of these four scales of measurement is twofold. First, you should always use the most powerful and flexible scale possible for your
conceptual variables. In many cases, there is no choice; time is measured on a ratio scale
and gender is measured on a nominal scale. But in some cases, you have a bit of freedom
in designing your study. For example, if you were interested in correlating weight with
happiness, you could capture weight in a few different ways. One option would be to
ask people their satisfaction with their current weight on a seven-point scale. However,
the resulting data would be on an ordinal or interval scale (see discussion below), and
the degree to which you could manipulate the scale values would be limited. Another,
more powerful option, would be to measure people’s weight on a scale, resulting in ratio
scale data. Thus, whenever possible, it is preferable to incorporate physical or behavioral
measures. But the primary goal is also to represent your data accurately. Most variables
in the social and behavioral sciences do not have a true zero point and must therefore be
measured on nominal, ordinal, or interval scales.
Second, you should always be aware of the limitations of your measurement scale. As discussed above, these scales lend themselves to different amounts of mathematical manipulation. It is not possible to calculate statistical averages with anything less than an interval
scale, and not possible to multiply or divide with anything less than a ratio scale. What does
this mean for you? If you have collected ordinal data, you are limited to discussing the
rank ordering of the values (e.g., the critics liked Restaurant A better than Restaurant B). If
you have collected nominal data, you are limited to describing the different groups (e.g.,
numbers of Catholics and Protestants).
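To make these constraints concrete, here is a minimal sketch (ours, not from the text; the names are invented) that encodes which operations each scale supports:

```python
# A minimal sketch of the operations each scale of measurement supports,
# following the rules described above. Names are illustrative only.
ALLOWED_OPERATIONS = {
    "nominal":  {"count"},                            # e.g., religious affiliation
    "ordinal":  {"count", "rank"},                    # e.g., race finishes
    "interval": {"count", "rank", "add", "subtract",
                 "mean"},                             # e.g., Fahrenheit temperature
    "ratio":    {"count", "rank", "add", "subtract",
                 "mean", "multiply", "divide"},       # e.g., weight, drinks per day
}

def check_operation(scale: str, operation: str) -> None:
    """Raise an error if an operation is not meaningful for a scale."""
    if operation not in ALLOWED_OPERATIONS[scale]:
        raise ValueError(f"cannot apply '{operation}' to {scale} data")

check_operation("ratio", "divide")      # fine: 8 drinks/day is 4 times 2 drinks/day
try:
    check_operation("ordinal", "mean")  # invalid: no "average finish" for ranks
except ValueError as err:
    print(err)                          # cannot apply 'mean' to ordinal data
```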
One conspicuous grey area for both of these points is the use of attitude scales in the social
and behavioral sciences. If you were to ask people to rate their attitudes about the death
penalty on a seven-point rating scale, would this be an ordinal scale or an interval scale?
This turns out to be a contentious issue in the field. The conservative point of view is
that these attitude ratings constitute only ordinal scales. We know that a 7 indicates more
endorsement than a 3 but cannot say that moving from a 3 to a 4 is equivalent to moving
from a 6 to a 7 in people’s minds. The more liberal point of view is that these attitude
ratings can be viewed as interval scales. This perspective is generally guided by practical concerns—treating these as equal intervals allows us to compute totals and averages
for our variables. A good guideline is to assume that these individual attitude questions
represent ordinal scales by default. We will return to this issue again in Chapter 4 in our
discussion of creating questionnaire items.
Types of Measurement
Each of the four scales of measurement can be used across a wide variety of research
designs. In this section, we shift gears slightly and discuss measurement at a more conceptual level. The types of dependent measures that are used in psychological research studies can be grouped into three broad categories: behavioral, physiological, and self-report.
Behavioral Measurement
Behavioral measures are those that involve direct and systematic recording of observable
behaviors. If your research question involves the ways that married couples deal with
conflict, you could include a behavioral measure by observing the way participants interact during an argument. Do they cut one another off? Listen attentively? Express hostility? Behaviors can be measured and quantified in one of four primary ways, as illustrated
using the scenario of observing married couples during conflict situations:
• Frequency measurements involve counting the number of times a behavior occurs. For example, you could count the number of times each member of the couple rolled his or her eyes, as a measure of dismissive behavior.
• Duration measurements involve measuring the length of time a behavior lasts. For example, you could quantify the length of time the couple spends discussing positive versus negative topics as a measure of emotional tone.
• Intensity measurements involve measuring the strength or potency of a behavior. For example, you could quantify the intensity of anger or happiness in each minute of the conflict using ratings by trained judges.
• Latency measures involve measuring the delay before onset of a behavior. For example, you could measure the time between one person's provocative statement and the other person's response.
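As a rough illustration of how such coding works in practice, the following Python sketch (ours, with invented event codes and timings) computes frequency, duration, and latency from a time-stamped observation log; intensity would instead come from judges' ratings:

```python
# Hypothetical observation log for one couple: (start_sec, end_sec, code).
# The codes and numbers are invented for illustration.
observations = [
    (5.0, 6.0, "eye_roll"),
    (10.0, 45.0, "positive_topic"),
    (45.0, 90.0, "negative_topic"),
    (52.0, 53.0, "eye_roll"),
    (61.0, 62.0, "provocative_statement"),
    (64.5, 70.0, "response"),
]

# Frequency: how many times a behavior occurs.
eye_rolls = sum(1 for _, _, code in observations if code == "eye_roll")

# Duration: total time spent on a behavior.
positive_secs = sum(end - start for start, end, code in observations
                    if code == "positive_topic")

# Latency: delay between a provocation and the response to it.
provocation = next(start for start, _, code in observations
                   if code == "provocative_statement")
response = next(start for start, _, code in observations if code == "response")
latency_secs = response - provocation

print(eye_rolls, positive_secs, latency_secs)  # 2, 35.0, 3.5
```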
John Gottman, a psychologist at the University of Washington, has been conducting
research along these lines for several decades, observing body language and interaction
styles among married couples as they discuss an unresolved issue in their relationship
(you can read more about this research and its implications for therapy on Dr. Gottman’s
website, http://www.gottman.com/). What all of these behavioral measures provide is
a nonreactive way to measure the health of a relationship. That is, the major strength of
behavioral responses is that they are typically more honest and unfiltered than responses
to questionnaires. As we will discuss in Chapter 4, people are sometimes dishonest on
questionnaires in order to convey a more positive (or less negative) impression.
This is a particular plus if you are interested in unpopular attitudes, such as prejudice
and discrimination. If you were to ask people the extent to which they disliked members of other ethnic groups, they might not admit to these prejudices. Alternatively, you
could adopt the approach used by Yale psychologist Jack Dovidio and colleagues and measure
how close people sat to people of different ethnic
and racial groups, using this distance as a subtle
and effective behavioral measure of prejudice (see
http://www.yale.edu/intergroup/ for more information). But you may have spotted the primary
downside to using behavioral measures: We end
up having to infer the reasons that people behave
as they do. Let’s say European-American participants, on average, sit farther away from African-Americans than from other European-Americans. This could—and usually does—indicate prejudice; but, for the sake of argument, the farthest seat from the minority group member might also be the one closest to the window. In order
to understand the reasons for behaviors, researchers have to supplement the behavioral
measures with either physiological or self-report measurements.
Physiological Measurement
Physiological measures are those that involve quantifying bodily processes, including
heart rate, brain activity, and facial muscle movements. If you were interested in the experience of test anxiety, you could measure heart rate as people completed a difficult math
test. If you wanted to study emotional reactions to political speeches, you could measure
heart rate, facial muscles, and brain activity as people viewed video clips. The big advantage of these types of measures is that they are the least subjective and controllable. It is
incredibly difficult to control your heart rate or brain activity consciously, making these
a great tool for assessing emotional reactions. However, as with behavioral measures, we
always need some way to contextualize our physiological data.
The best example of this shortcoming is the use of the polygraph, or lie detector, to detect
deception. The lie detector test involves connecting a variety of sensors to the body to measure heart rate, blood pressure, breathing rate, and sweating. All of these are physiological
markers of the body’s fight-or-flight stress response; so the goal is to observe whether you
show signs of stress while being questioned. But here’s the problem: It is also stressful to
worry about being falsely accused. A trained polygraph examiner must place all of your
physiological responses in the proper context. Are you stressed throughout the exam or
only stressed when asked whether you pilfered money from the cash box? Are you stressed
when asked about your relationship with your spouse because you killed him or because
you were having an affair? The examiner has to be extremely careful to avoid false accusations based on misinterpretations of physiological responses.
Self-Report Measurement
Self-report measures are those that involve asking people to report on their own thoughts,
feelings, and behaviors. If you were interested in the relationship between income and
happiness, you could simply ask people to report their income and their level of happiness. If you wanted to know whether people were satisfied in their romantic relationships,
you could simply ask them to rate their degree of
satisfaction. The big advantage of these measures
is that they provide access to internal processes.
That is, if you want insight into why people voted
for their favorite political candidate, you could
simply ask them. However, as we have suggested
already, people may not necessarily be honest
and forthright in their answers, especially when
dealing with politically incorrect or unpopular
attitudes. We will return to this balance again in
Chapter 4 and discuss ways to increase the likelihood of honest self-reported answers.
There are two broad categories of self-report measures. One of the most common approaches is to ask for people’s responses using a fixed-format scale, which asks them to indicate their opinion on a preexisting scale. For example, you might ask
people, “How likely are you to vote for the Republican candidate for president?” on a scale from 1 (not likely) to 7 (very likely). The other
broad approach is to ask for responses using a free-response format, which asks people
to express their opinion in an open-ended format. For example, you might ask people to
explain, “What are the factors you consider in choosing a political candidate?” The trade-off
between these two categories is essentially a choice between data that is easy to code and
analyze and data that is rich and complex. In general, fixed-format scales are used more
in quantitative research while free-response formats are used more in qualitative research.
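The coding trade-off is easy to see in miniature. In the hypothetical sketch below (invented data and codes), fixed-format answers arrive as numbers you can average immediately, while free responses are text that must first be read and coded:

```python
# Fixed-format: a 1-7 rating arrives already quantified (invented data).
likert_responses = [6, 2, 7, 4, 5]
mean_likelihood = sum(likert_responses) / len(likert_responses)
print(mean_likelihood)  # 4.8

# Free-response: rich text that must be read and coded by a person
# (or a content-analysis program) before any statistics are possible.
free_responses = [
    "I mostly look at a candidate's economic policy.",
    "Honesty matters more to me than party affiliation.",
]
# A coder might later tag these as {"economy"} and {"character"}.
```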
Research: Thinking Critically
Neuroscience and Addictive Behaviors
By Christian Nordqvist
Some people really are addicted to foods in a similar way others might be dependent on certain
substances, like addictive illegal or prescription drugs, or alcohol, researchers from Yale University
revealed in Archives of General Psychiatry. Those with an addictive-like behavior seem to have more
neural activity in specific parts of the brain in the same way substance-dependent people appear to
have, the authors explained.
It’s a bit like saying that if you dangle a tasty chocolate milkshake in front of a pathological eater,
what goes on in that person’s brain is similar to what would happen if you placed a bottle of scotch
in front of an alcoholic.
The researchers wrote:
One-third of American adults are now obese and obesity-related disease is the second
leading cause of preventable death. Unfortunately, most obesity treatments do not
result in lasting weight loss because most patients regain their lost weight within five
years. Based on numerous parallels in neural functioning associated with substance
dependence and obesity, theorists have proposed that addictive processes may be
involved in the etiology of obesity.
Food and drug use both result in dopamine release in mesolimbic regions and the degree
of release correlates with subjective reward from both food and drug use.
The authors note that no previous studies had looked into the neural correlates of addictive-like eating behavior. They explained that some studies had demonstrated that photos of appetizing food can make the brain’s reward centers become more active in much the same way that photos of alcoholic drinks might do for alcoholics. However, this latest study is the first to distinguish food addicts from mere overeaters.
Ashley N. Gearhardt, M.S., M.Phil., and team looked at the relation between the symptoms of food
addiction and neural activation. Food addiction was assessed by the Yale Food Addiction Scale, while
neural activation was gauged via functional MRI (magnetic resonance imaging). Forty-eight study
participants responded to cues which signaled the imminent arrival of very tasty food, such as a
chocolate milkshake, compared to a control solution (something with no taste). The researchers also compared brain activity while participants consumed the milkshake versus the tasteless solution.
The Yale Food Addiction Scale questionnaire identified 15 women with high scores for addictive-like eating behaviors. All 48 study participants were young women, ranging in BMI from lean to
obese. They were recruited from a healthy weight maintenance study.
The scientists discovered a correlation between food addiction and greater activity in the amygdala,
medial orbitofrontal cortex, and the anterior cingulate cortex when tasty food delivery was known
to arrive soon.
Those with high food addiction, the fifteen women, showed greater activity in the dorsolateral prefrontal cortex compared to those with low addiction to foods. They also had reduced activity in the
lateral orbitofrontal cortex while they were eating their nice food.
The authors explained:
As predicted, elevated FA (food addiction) scores were associated with greater activation
of regions that play a role in encoding the motivational value of stimuli in response to
food cues. The ACC and medial OFC have both been implicated in motivation to feed and
to consume drugs among individuals with substance dependence.
In sum, these findings support the theory that compulsive food consumption may be
driven in part by an enhanced anticipation of the rewarding properties of food. Similarly,
addicted individuals are more likely to be physiologically, psychologically, and behaviorally reactive to substance-related cues.
They concluded:
To our knowledge, this is the first study to link indicators of addictive eating behavior
with a specific pattern of neural activation. The current study also provides evidence that
objectively measured biological differences are related to variations in YFAS (Yale Food
Addiction Scale) scores, thus providing further support for the validity of the scale. Further, if certain foods are addictive, this may partially explain the difficulty people experience in achieving sustainable weight loss. If food cues take on enhanced motivational
properties in a manner analogous to drug cues, efforts to change the current food environment may be critical to successful weight loss and prevention efforts. Ubiquitous food
advertising and the availability of inexpensive palatable foods may make it extremely
difficult to adhere to healthier food choices because the omnipresent food cues trigger
the reward system. Finally, if palatable food consumption is accompanied by disinhibition
[loss of inhibition], the current emphasis on personal responsibility as the antidote to
increasing obesity rates may have minimal effectiveness.
Think about it
1. Is the study described here descriptive, correlational, or experimental? Explain.
2. Can one conclude from this study that food addiction causes brain abnormalities? Why or why
not?
3. The authors of the study concluded: “The current study also provides evidence that objectively measured biological differences are related to variations in YFAS (Yale Food Addiction
Scale) scores, thus providing further support for the validity of the scale.” What type(s) of
validity are they referring to? Explain.
4. What types of measures are included in this study (e.g., behavioral, self-report)? What are the
strengths and limitations of these measures in this study?
Choosing a Measurement Type
As you can see from these descriptions, each type of measurement has its strengths and
flaws. So, how do you decide which one to use? This question has to be answered for
every case, and the answer depends on three factors. First, and most obviously, the measure depends on the research question. If you are interested in effects of public speaking
on stress levels, then the best measures will be physiological. If you are interested in attitudes toward capital punishment, these are better measured using self-reports. Second,
the choice of measures is guided by previous research on the topic. If studies have assessed
prejudice by using self-reports, then you could feel comfortable doing the same. If studies
have measured fear responses using facial expressions, then let that be a starting point
for your research. Finally, a mix of availability and convenience often guides the choice
of measures. Measures of brain activity are a fantastic addition to any research program,
but these measures also require a specialized piece of equipment that can run upwards
of $2 million. As a result, many researchers interested in physiological measures opt for
something less expensive like a measure of heart rate or movement of facial muscles, both
of which can be measured using carefully placed sensors (i.e., on the chest or face).
In an ideal world, a program of research will use a wide variety of measures and designs.
The term for this is converging operations, or the use of multiple research methods to solve
a single problem. In essence, over the course of several studies—perhaps spanning several
years—you would address your research question using different designs, different measures, and different levels of analysis. One good example of converging operations comes
from the research of psychologist James Gross and his colleagues at Stanford University.
Gross studies the ways that people regulate their emotional responses and has conducted
this work using everything from questionnaires to brain scans (see http://spl.stanford.edu/
projects.html).
One branch of Gross’s research has examined the consequences of trying to either suppress emotions (pretend they’re not happening) or reappraise them (think of them in a
different light). Suppression is studied by asking people to hold in their emotional reactions while watching a graphic medical video. Reappraisal is studied by asking people
to watch the same video while trying to view it as a medical student, thus changing the
meaning of what they see. When people try to suppress emotional responses, they experience an ironic increase in physiological and self-reported emotional responses, as well as
deficits in cognitive and social functioning. Reappraising emotions, in contrast, actually
works quite well. In another branch of the research, Gross and colleagues have examined the neural processes at work when people change their perspective on an emotional
event. In yet another branch of the research, they have examined individual differences in
emotional responses, with the goal of understanding why some people are more capable
of managing their emotions than others. Taken together, these studies all converge into a
more comprehensive picture of the process of emotion regulation than would be possible
from any single study or method.
2.4 Hypothesis Testing
Regardless of the details of a particular study, be it correlational, experimental, or descriptive, all quantitative research follows the same process of testing a hypothesis. This section provides an overview of this process, including a discussion of
the statistical logic, the five steps of the process, and the two ways we can make mistakes
during our hypothesis test. Some of this may be a review from your statistics class, but
it forms the basis of our scientific decision-making process and thus warrants repeating.
The Logic of Hypothesis Testing
In Chapter 1, we discussed several criteria for identifying a “good” theory, one of which is
that our theories have to be falsifiable. In other words, our research questions should have
the ability to be proven wrong under the right set of conditions. Why is this so important?
This will sound counterintuitive at first, but by the standards of logic, it is more meaningful when data run counter to our theory than when data support the theory.
Let’s say you predict that growing up in a low-income family puts children at higher risk
for depression. If your data fit this pattern, your prediction might very well be correct.
But it’s also possible that these results are due to a third variable—perhaps low-income families live in more stressful neighborhoods, and stress turns out to increase one’s
depression risk. Or, perhaps your sample accidentally contained an abnormal number of
depressed people. This is why we are always cautious in interpreting positive results from
a single study. But now, imagine that you test the same hypothesis and find that those
who grew up in low-income families show a lower rate of depression. This is still a single
study, but it suggests that our hypothesis may have been off base.
Another way to think about this is from a statistical perspective. As we discussed earlier
in this chapter, all measurements contain some amount of random error, which means
that any pattern of data could be caused by random chance. This is the primary reason
that research is never able to “prove” a theory. You’ll also remember from your statistics
class that at the end of any hypothesis test, we will calculate a p value, representing the probability of obtaining results at least as extreme as ours if chance alone were at work. Conceptually, this means we are calculating the probability that we’re wrong rather than the probability that we’re right
in our predictions. And the bigger our effect, the smaller this probability will generally
be. So, as strange as this seems, the ideal result of hypothesis testing is to have a small
probability of being wrong.
This focus on falsifiability carries over to the way we test our hypotheses in that our
goal is to reject the possibility of our results being due to chance. The starting point of a
hypothesis test is to state a null hypothesis, or the assumption that there is no real effect
of our variables in the overall population. This is another way of saying that our observed
patterns of data are due to random chance. In essence, we propose this null in hopes of
minimizing the odds that it is true. Then, as a counterpoint to the null hypothesis, we
propose an alternative hypothesis that represents our predicted pattern of results. In statistical jargon, the alternative hypothesis represents our predicted deviation from the null.
These alternative hypotheses can be directional, meaning that we specify the direction of
the effect, or nondirectional, meaning that we simply predict an effect.
Let’s say you want to test the hypothesis that people like cats better than dogs. You would
start with the null hypothesis, that people like cats and dogs the same amount (i.e., there’s
no difference). The next step is to state your alternative hypothesis, which in this case is
that people will prefer cats. Because you are predicting a direction (cats more than dogs),
this is a directional hypothesis. The other option would be a nondirectional hypothesis, or simply stating that people’s cat preferences differ from their dog preferences.
(Note that we’ve avoided predicting which one people like better; this is what makes it
nondirectional.)
Finally, these three hypotheses can also be expressed using logical notation, as shown below. The letter H is used as an abbreviation for “Hypothesis,” and the Greek letter μ (mu) is a common abbreviation for the mean, or average.

Conceptual Hypothesis: People like cats better than dogs.

Null Hypothesis: H₀: μ_cat = μ_dog
(the “cat” mean is equal to the “dog” mean; people like cats and dogs the same)

Nondirectional Alternative Hypothesis: H₁: μ_cat ≠ μ_dog
(the “cat” mean is not equal to the “dog” mean; people like cats and dogs different amounts)

Directional Alternative Hypothesis: H₁: μ_cat > μ_dog
(the “cat” mean is greater than the “dog” mean; people like cats more than dogs)
Why do we need to distinguish between directional and nondirectional hypotheses? As
you’ll see when we get to the statistical calculations, this decision has implications for our
level of statistical significance. Because we always want to minimize the risk of coming to
the wrong conclusion based on chance findings, we have to be more conservative with a
nondirectional test. This idea is illustrated in Figure 2.4.
Figure 2.4: One-Tailed vs. Two-Tailed Hypothesis Tests

[Two probability curves of the group mean difference, x̄₁ − x̄₂, plotted against its probability, p(x̄₁ − x̄₂); the left panel shades a single 5% rejection region in one tail, the right panel shades 2.5% in each tail.]
These graphs represent the probability of obtaining a particular difference between our
groups. The graph on the left represents a simple directional hypothesis—we will be
comfortable rejecting the null hypothesis if our mean difference is above the alpha cutoff (usually 5%). The graph on the right, however, represents a nondirectional hypothesis, which simply predicts that one group is higher or lower than the other. Because we
are being less specific, we have to be more conservative. With a directional hypothesis
(also called one-tailed), we predict that the group difference will fall on one extreme of
the curve; with a nondirectional hypothesis (also called two-tailed), we predict that the
group difference will fall on either extreme of the curve. The implication of a two-tailed
hypothesis is that our 5% cutoff could become a 10% cutoff, with 5% on each side. Rather
than double our chance of an error, we follow standard practice and use a 2.5% cutoff on
each side of the curve.
Translation: We need bigger group differences to support our two-tailed, nondirectional
hypotheses. In the cats-versus-dogs example, it would take a bigger difference in ratings
to support the claim that people like cats and dogs different amounts than it would to
support the claim that people like cats more than dogs. The goal of all this statistical and
logical jargon is to place our hypothesis testing in the proper frame. The most important
thing to remember is that hypothesis testing is designed to reject the null hypothesis, and
our statistical tests tell us how confident to be in this rejection.
Five Steps to Hypothesis Testing
Now that you understand how to frame your hypothesis, what do you do with this
information? The good news is that you’ve now mastered the first step of a five-step
process of hypothesis testing. In this section, we walk through an example of hypothesis testing from start to finish, that is, from an initial hypothesis to a conclusion about
the hypothesis. In this fictitious study, we will test the prediction that married couples
without children are happier than those with children in the home. This example is
inspired by an actual study by Harvard social psychologist Dan Gilbert and his colleagues, described in a news article at http://www.telegraph.co.uk/news/1941195/Marriage-without-children-the-key-to-bliss.html. Our hypothesis may seem counterintuitive, but Gilbert’s research suggests that people tend to both overestimate the extent to
which children will make them happy and underestimate the added stress and financial
demands of having children in the house.
Step 1—State the Hypothesis
The first step in testing this hypothesis is to spell it out in logical terms. Remember that
we want to start with the null hypothesis that there is no effect. So, in this case, the null
hypothesis would be that couples are equally happy with and without children. Or, in
logical notation, H₀: μ_children = μ_no_children (i.e., the mean happiness rating for couples with children equals the mean happiness rating for couples without children). From there, we can spell out our alternative hypothesis; in this case, we predict that having children will make couples less happy. Because this is a directional hypothesis, it is written H₁: μ_children < μ_no_children (i.e., the mean happiness rating for couples with children is lower than the mean happiness rating for couples without children).
Step 2—Collect Data
The next step is to design and conduct a study that will test our hypothesis. We will elaborate on this process in great detail over the next three chapters, but the general idea is the
same regardless of the design. In this case, the most appropriate design would be correlational because we want to predict happiness based on whether people have children.
It would be impractical and unethical to randomly assign people to have children, so an
experimental design is not possible in this case. One way to conduct our study would be
to survey married couples about whether they had children and ask them to rate their current level of happiness with the marriage. Let’s say we conduct this study and end
up with the data in Table 2.3.
As you can see, we get an average happiness rating of 5.7 for couples without children,
compared to an average happiness rating of 2.0 for couples with children. These groups
certainly look different—and encouraging for our hypothesis—but we need to be sure
that the difference is big enough that we can reject the null hypothesis.
Table 2.3: Sample Data for the “Children and Happiness” Study

No Children    Children
7              2
5              3
7              1
5              2
4              4
5              3
6              2
7              1
6              1
5              1
mean = 5.7     mean = 2.0
S = 1.06       S = 1.05
SE = .33       SE = .33
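If you want to verify the table’s summary statistics, the following sketch (ours, not from the text) reproduces them in Python. Note that matching the printed S values requires the sample standard deviation, which divides by n − 1:

```python
import math

no_children = [7, 5, 7, 5, 4, 5, 6, 7, 6, 5]
children = [2, 3, 1, 2, 4, 3, 2, 1, 1, 1]

def describe(scores):
    n = len(scores)
    mean = sum(scores) / n
    # Sample standard deviation (divides by n - 1); this is what
    # reproduces the S values printed in Table 2.3.
    s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    se = s / math.sqrt(n)  # standard error of the mean
    return round(mean, 2), round(s, 2), round(se, 2)

print(describe(no_children))  # (5.7, 1.06, 0.34); the text rounds this SE to .33
print(describe(children))     # (2.0, 1.05, 0.33)
```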
Step 3—Calculate Statistics
The next step in our hypothesis test is to calculate statistical tests to decide how confident
we can be that our results are meaningful. As a researcher, you have a wide variety of statistical tools at your disposal and different ways to analyze all manner of data. These tools
can be broadly grouped into descriptive statistics, which describe the patterns and distribution of measured variables, and inferential statistics, which attempt to draw inferences about the population from which the sample was drawn. These inferential statistics
are used to make decisions about the significance of the data. Your statistics class covered many of these in detail, and we will cover a few examples throughout this book. All
of these different techniques share a common principle: They attempt to make inference
by comparing the relationship among variables to the random variability of the data. As
we discussed earlier in this chapter, people’s measured levels of everything from happiness to heart rate can be influenced by a wide range of variables. The hope in testing our
hypotheses is that differences in our measurements will primarily reflect differences in the
variables we’re studying. In the current example, we would want to see that differences
in happiness ratings of the married couples were influenced more by the presence of children than by random fluctuations in happiness.
One of the most straightforward statistical tests to understand is Student’s t-test, which
is widely used to compare differences in the means of two groups. Because of its simplicity, it is also a great way to demonstrate the hypothesis-testing process. Conceptually, the
t-test compares the difference between two group means with the overall variability in
the data set. The end result is a test of whether our groups differ by a meaningful amount.
Imagine you found a 10-point difference in intelligence test scores between Republicans
and Democrats. Before concluding that your favorite party was smarter, you would need
to know how much scores varied on average. If your intelligence test were on a 100-point
scale, with a standard deviation of 5, then your 10-point difference would be interesting
and meaningful. But if you measured intelligence on a 1,000-point scale, with a standard
deviation of 100, then 10 points probably wouldn’t reflect a real difference.
So, conceptually, the t-test is a ratio of the mean difference to the average variability. Mathematically, the t-test is calculated like so:

t = (x̄₁ − x̄₂) / SE_pooled
Let’s look at the pieces of this formula individually. First, the x̄s (x-bars) in the numerator are a common symbol for referring to the mean, or average, in our sample. Thus the terms x̄₁ and x̄₂ refer to the means for groups 1 and 2 in our sample, or the mean happiness for couples with children and no children. The term in the denominator, SE_pooled, represents our
estimate of variability in the sample. You may remember this term from your statistics
class, but let’s walk through a quick review. One common estimate of variability is the
standard deviation, which represents the average difference between individual scores and the mean of the group. It is calculated by subtracting each score from the mean, squaring the deviation, adding up these squared deviations, dividing by the sample size (minus one, when estimating from a sample as in Table 2.3), and taking the square root of the result.
One problem with the standard deviation is that it generally underestimates the variability of the population, especially in small samples, because small samples are less likely to
include the full range of population values. So, we need a way to correct our variability
estimate in a small sample. Enter the standard error, which is computed by dividing the
standard deviation by the square root of the sample size. (To save time, these values are
already calculated and presented in Table 2.3.) The “pooled” standard error represents a
combination of the standard errors from our two groups:
SE_pooled = √(SE₁² + SE₂²) = √((.33)² + (.33)²) = √.218 = .47
Our final step is to plug the appropriate numbers from our “children and happiness” data
set into the t-test formula.
t = (x̄₁ − x̄₂) / SE_pooled = (5.7 − 2) / .47 = 3.7 / .47 = 7.87
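The same arithmetic can be checked in a few lines of Python (a sketch of ours, starting from the summary values in Table 2.3):

```python
import math

# Summary values from Table 2.3.
mean_no_children, mean_children = 5.7, 2.0
se_no_children, se_children = 0.33, 0.33

# Pool the two standard errors, then form the t ratio.
se_pooled = math.sqrt(se_no_children ** 2 + se_children ** 2)
t = (mean_no_children - mean_children) / se_pooled

print(round(se_pooled, 2), round(t, 2))  # 0.47 7.93
# (The text rounds SE_pooled to .47 before dividing, giving 7.87.)
```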
If this all seems overwhelming, stop and think about what we’ve done in conceptual
terms. The goal of our statistical test—the t-test—is to determine whether our groups
differ by a meaningful and significant amount. The best way to do that is to examine the
group difference as a ratio, relative to the overall variability in the sample. When we calculate this ratio, we get a value of 7.87, which certainly seems impressive, but there’s one
more step we need to take to interpret this number.
Step 4—Compare to a Critical Value
What does a 7.87 mean for our hypothesis test? To answer this question, we need to gather
two more pieces of information and then look up our t-test value (i.e., 7.87) in a table. The
first piece of information is the alpha level, representing the probability cutoff for our
hypothesis test. The standard alpha level to use is .05, meaning that we want to have less
than a 5% chance of the result being due to chance. In some cases, you might elect to use
an alpha level of .01, meaning that you would only be comfortable with a less than 1%
chance of your results being due to chance.
The second piece of information we need is the degrees of freedom in the data set; this
number represents the sample size and is calculated for a t-test via the formula n − 2, the
number of couples in our sample minus 2. Think of it as a mathematical correction for
the fact that we are estimating values in a sample rather than from the entire population.
Another helpful way to think of degrees of freedom is as the number of values that are
“free to vary.” In our sample experiment, the no-children group has a mean of 5.7 while
the children group has a mean of 2. Theoretically, the values for 9 of the couples in each
group can be almost anything, but the 10th couple has to have a happiness score that will
yield the correct overall group mean. Thus, of the 20 happiness scores in our experiment,
18 are free to vary, giving us 18 degrees of freedom (i.e., n − 2).
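A toy illustration of “free to vary” (ours, using the no-children group): once nine scores and the group mean are fixed, the tenth score is fully determined.

```python
# Nine of the ten "no children" scores can be anything...
nine_scores = [7, 5, 7, 5, 4, 5, 6, 7, 6]
required_mean = 5.7

# ...but the tenth must bring the group total to 10 * 5.7 = 57.
tenth_score = 10 * required_mean - sum(nine_scores)
print(tenth_score)  # 5.0: this score is not free to vary
```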
Armed with these two numbers—18 degrees of freedom and an alpha level of .05—we
turn to a critical value table, which contains cutoff scores for our statistical tests. (You
can find these values for a t-test at http://www.stattools.net/tTest_Tab.php). The numbers
in a critical value table represent the minimum value needed for the statistical test to be
significant. In this case, with 18 degrees of freedom and an alpha level of .05, we would
need a t-test value of 1.73 for a one-tailed (directional) hypothesis test and a t-test value
of 2.10 for a two-tailed (nondirectional) hypothesis test. (Remember, we have to be more
conservative for a nondirectional test.) In our children and happiness study, we had a
clear directional/one-tailed hypothesis that children would make couples less happy, so
we can legitimately use the one-tailed cutoff score of 1.73. Because our t-test value of 7.87
is unquestionably higher than 1.73, our statistical test is significant. In other words, there
is less than a 5% chance that the difference in happiness ratings is due to chance.
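Instead of a printed table, the same cutoffs can be obtained from software. The sketch below uses scipy’s t distribution (our choice of tool, not one the text prescribes) to reproduce the one-tailed and two-tailed critical values for 18 degrees of freedom:

```python
from scipy import stats

alpha, df = 0.05, 18

one_tailed_cutoff = stats.t.ppf(1 - alpha, df)      # ~1.73
two_tailed_cutoff = stats.t.ppf(1 - alpha / 2, df)  # ~2.10

t_observed = 7.87
print(t_observed > one_tailed_cutoff)  # True -> reject the null hypothesis
```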
Step 5—Make a Decision
Finally, we are able to draw a conclusion about our experiment. Based on the outcome of
our statistical test (i.e., steps 3 and 4), we will make one of two decisions about our null
hypothesis:
Reject null: decide that the probability of the null being correct is sufficiently small;
that is, results are due to differences in groups
or
Fail to reject null: decide that the probability of the null being correct is too big;
that is, results are due to chance
Because our t-test value was quite a bit higher than the required cutoff value, we can be
confident in rejecting the null hypothesis. And, at long last, we can express our findings in
plain English: Couples with children are less happy than couples without children!
Now that we have walked through this five-step process, it’s time to let you in on a little
secret. When it comes to analyzing your own data, to test your own hypotheses, you will
actually rely on a computer program for part of this process—Steps 3 and 4 in particular.
In these modern times, it is rare to compute even a t-test by hand. Software programs
such as SPSS, SAS, and Microsoft Excel can take a table of data, compute the mean difference, compare it to the variability, and calculate the probability that the results are due to
chance. However, because these calculations happen behind the scenes, it is very important
to understand the process. To draw conclusions about your hypotheses, you have to understand what a p value and a t-test value mean. By understanding how the software operates,
you can reach informed conclusions about your research questions. Otherwise, you risk
making one of two possible errors in your hypothesis test, discussed in the next section.
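For example, the entire Table 2.3 analysis reduces to a single call in Python’s scipy (one of many equivalent tools; SPSS and SAS report the same quantities):

```python
from scipy import stats

no_children = [7, 5, 7, 5, 4, 5, 6, 7, 6, 5]
children = [2, 3, 1, 2, 4, 3, 2, 1, 1, 1]

# Independent-samples t-test; alternative="greater" mirrors our
# directional (one-tailed) hypothesis that the first group is happier.
result = stats.ttest_ind(no_children, children, alternative="greater")
print(result.statistic, result.pvalue)
# t ~ 7.83 (matching our hand calculation up to rounding), p far below .05
```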
Errors in Hypothesis Testing
In the children and happiness study, we concluded with a reasonable amount of confidence that our hypothesis was supported. But what if we make the wrong decision?
Because our conclusions are based on interpreting probability, there is always a chance
that we will draw the wrong conclusion. In interpreting our hypothesis tests, there are two
potential errors to be made, referred to as Type I and Type II errors.
Type I Errors occur when the results are due to chance, but the researcher mistakenly concludes that the effect is significant. In other words, no effect of the variables exists in the
population, but some quirk of the sample makes the effect appear significant. This error
can be viewed as a false positive—you get excited over results that are not actually meaningful. In our children and happiness study, a Type I error would occur if children had no
effect on happiness in the real world, but some quirk of chance made our “no children”
group happier than the “children” group. For example, our sample of childless couples
might accidentally contain a greater proportion of people with happy personalities or
greater job stability or simply more marital satisfaction to start with.
Fortunately, although this error sounds scary, we can generally compute the probability of
making it. Our alpha level sets the bar for how extreme our data must be in order to reject
the null hypothesis. At the end of the statistical calculation, we get a p value that tells us
how extreme the data actually are. When we set an alpha level of, say, .05, we are attempting to avoid a Type I error; our results will only be statistically significant if the effect
outweighs the random variability by a big-enough amount. If our p value falls below our
predetermined alpha level, we decide that the risk of a Type I error is sufficiently small
and can therefore reject the null hypothesis. If, however, our p value is greater than (or
even equal to) our alpha cutoff, we decide that the risk of Type I error is too high to ignore
and will therefore fail to reject the null hypothesis.
Type II Errors occur when an effect is real, but the researcher mistakenly concludes that the results are due to chance. In other words, there actually is an effect of the variables
in the population, but some quirk of the sample makes the effect appear nonsignificant.
This error can be viewed as a false negative—you miss results that actually could have
been meaningful. In our children/happiness experiment, a Type II error would occur if
couples without children really were happier than couples with children but some flaw
in the experiment kept us from detecting the difference. For example, if our measures of
happiness were poorly designed, people could interpret the items in a variety of ways,
making it difficult to spot an overall difference between the groups.
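Both error rates can also be estimated by brute-force simulation. The sketch below (ours, with an invented effect size and sample size) draws many random samples under a no-effect scenario and a real-effect scenario and counts how often each wrong decision occurs:

```python
import random
from scipy import stats

random.seed(1)

def rejection_rate(true_difference, trials=2000, n=10, alpha=0.05):
    """Fraction of simulated studies that reject the null hypothesis."""
    rejections = 0
    for _ in range(trials):
        group1 = [random.gauss(0, 1) for _ in range(n)]
        group2 = [random.gauss(true_difference, 1) for _ in range(n)]
        if stats.ttest_ind(group1, group2).pvalue < alpha:
            rejections += 1
    return rejections / trials

# With no real effect, rejections are false positives (Type I errors).
print(rejection_rate(0.0))      # ~0.05, matching our alpha level

# With a real effect, failures to reject are misses (Type II errors).
print(1 - rejection_rate(1.0))  # ~0.4 for this effect size and sample size
```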
Fortunately, although this error sounds disappointing, there are some fairly easy ways to
avoid or minimize it. The key factor in reducing Type II error is to maximize the power
of the statistical test, or the probability of detecting a real difference. In fact, power is
inversely related to the probability of a Type II error—the higher the power, the lower the
chance of Type II error. Power is analogous to the sensitivity, or accuracy, of the hypothesis test; it is under the researcher’s control in three main ways. First, as we discussed in
the section “Reliability and Validity,” it is important to make sure that your measures are
capturing what you think they are. If your happiness scale actually captures something
like narcissism, then this will cause problems for your hypothesis about the predictors of
happiness. Second, it is important to be careful throughout the process of coding and …