DOI: 10.1002/hrm.21852
SPECIAL ISSUE ARTICLE
Using multi-item psychometric scales for research and practice
in human resource management
Mark A. Robinson
Leeds University Business School, University
of Leeds
Correspondence:
Mark A. Robinson, Leeds University Business
School, University of Leeds, Leeds,
LS2 9JT, UK.
Email: m.robinson@lubs.leeds.ac.uk.
Questionnaires are a widely used research method in human resource management (HRM), and
multi-item psychometric scales are the most widely used measures in questionnaires. These
scales each have multiple items to measure a construct in a reliable and valid manner. However,
using this method effectively involves complex procedures that are frequently misunderstood
or unknown. Although there are existing methodological texts addressing this topic, few are
exhaustive and they often omit essential practical information. The current article therefore
aims to provide a detailed and comprehensive guide to the use of multi-item psychometric
scales for HRM research and practice, including their structure, development, use, administration, and data preparation.
KEYWORDS
measurement, multi-item scales, psychometric scales, questionnaires, surveys
1 | I N T RO D U C TI O N
Although methodological guidance is available, most texts focus
on specific topics or phases and omit essential practical information.
Questionnaires are one of the most widely used research methods in
Accordingly, this article addresses all phases of multi-item psychomet-
the social sciences (Bourque, 2004) and multi-item psychometric
ric scale use—including their structure, development, administration,
scales are the most widely used measures in questionnaires. For
and the preparation of the collected data for analysis—to provide a
instance, 29 of the 62 articles published in Human Resource Manage-
comprehensive resource for HRM researchers and practitioners that
ment during 2015 used multi-item psychometric scales to collect data
addresses the many practical issues and common points of confusion.
about topics as diverse as organizational ambidexterity (Halevi, Car-
The article will also be useful for researchers and practitioners in other
meli, & Brueller, 2015), employee voice (Matsunaga, 2015), and per-
social sciences (e.g., industrial and organizational psychology, manage-
formance management (Festing, Knappert, & Kornau, 2015).
ment) who use multi-item psychometric scales frequently.
Despite their widespread use, however, the complex principles
Questionnaires comprise a number of questions that participants
and procedures underlying multi-item psychometric scales are fre-
are required to answer and are therefore usually a self-report
quently misunderstood or unknown, even by experienced researchers
research method (Stone & Turkkan, 2000); although the same meth-
and practitioners in human resource management (HRM). This is par-
ods are sometimes used to rate others, such as supervisor ratings of
ticularly true of HRM practitioners conducting staff surveys
performance (see, e.g., Yam, Fehr, & Barnes, 2014). Multi-item psy-
(e.g., employee engagement), which ostensibly appear psychometric
chometric scales, the focus of this article, are a specialized type of
in nature but often neglect key steps in research design and analysis.
quantitative measure used in questionnaires (see, e.g., Nevill, Lane,
Such errors, omissions, and misunderstandings have major implica-
Kilgour, Bowes, & Whyte, 2001) and the most frequently used mea-
tions for HRM research and practice. Unreliable scales prevent the
sure in HRM research. Such scales each have multiple items to mea-
consistent measurement of variables, while scales low in validity may
sure a variable of interest in a reliable and valid manner (Kline, 2000).
not be measuring the intended variables (Cook, 2009). Such problems
Throughout the article, the term psychometric scale or simply scale is
can distort research findings, hinder theoretical development, and
used for brevity rather than the full term multi-item psychometric
result in ineffective or even counterproductive HRM practice.
scale. However, it is clear from the literature that several synonymous
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is
properly cited.
© 2017 The Author. Human Resource Management published by Wiley Periodicals, Inc.
Hum Resour Manage. 2018;57:739–750.
wileyonlinelibrary.com/journal/hrm
739
ROBINSON
740
terms are used, including multi-item measures, multi-item scales, psy-
and yielding parametric data. It should not be confused with a rating
chometric measures, and psychometric scales. The term psychometric
scale, despite the shared terminology. A psychometric scale com-
scale is preferred here to emphasize the need for reliable and valid
prises multiple questions or statements, each with their own rating
measurement.
scale. Therefore, a rating scale measures participants' responses to a
The article starts by discussing the structure of psychometric
single question or statement, while a psychometric scale measures
scales, then details their use and the various stages of their develop-
participants against a focal variable using multiple items (see DeVellis,
ment, before finally considering their administration and data prepa-
1991). In this article, for brevity and clarity, the single word scale is
ration issues.
used to refer to psychometric scales only; rating scales are always
referred to by their full name.
When including psychometric scales in questionnaires, research-
2 | STRUCTURE OF PSYCHOMETRIC
SCALES
ers have two broad options: they can use existing scales from the
published research literature, or they can develop their own scales.
Each of these options is discussed in detail below. First, however, the
Before proceeding, it is first necessary to define the key components
design properties of psychometric scales are discussed.
and characteristics of psychometric scales, for there is often confusion
here arising from the terminology used. Figure 1 provides an example
2.2 | Psychometric scale design properties
of a hypothetical psychometric scale measuring managerial support
(in lowercase letters) and illustrates its components (in uppercase let-
Psychometric scales have several design properties that require care-
ters), descriptions of which are provided below (partially adapted from
ful consideration, whether evaluating existing scales or developing
DeVellis, 1991; Kline, 2000).
new scales, and these are now discussed.
2.1 | Definition of key terms
2.2.1 | Likert rating scales
Questions/statements. Participants respond to questions or state-
Psychometric scales always use fixed-format rating scales and by far
ments about a focal variable.
the most widely used are Likert rating scales (Likert, 1932), the focus
Response points. A response point is a circle or box in which partic-
of this article. Likert rating scales comprise a number of response
ipants indicate their response to a question or statement, either by
points, usually 4 to 9, with accompanying verbal anchors. A key fea-
ticking, circling (on paper questionnaires), or clicking it (on electronic
ture is that there should be equally appearing intervals (Thurstone,
questionnaires).
1929), or identical space perceived by participants, between each
Anchors. An anchor is a verbal label accompanying a response
response point. This is important because it is a prerequisite of inter-
point.
val or ratio-level data, which in turn is one prerequisite of the para-
Rating scales. A rating scale is the measure along which partici-
metric data required for psychometric scales (Foster & Parker, 1995).
pants respond to a question or statement. Each question or state-
Such equally appearing intervals (Thurstone, 1929) should be
ment has its own rating scale, comprising a number of response
reflected both in the physical presentation of the questionnaire items
points and accompanying anchors. Usually, for efficiency, a single
and also in the meaning of the accompanying verbal anchors. In the
shared set of anchors is presented for multiple rating scales, as shown
former case, the response points should be equidistant from neigh-
in Figure 1.
boring response points even if this necessitates nonequidistant spa-
Items. An item comprises a question or statement about a focal
cing between verbal anchors of different lengths. In the latter case,
variable and an accompanying rating scale on which participants
antonyms (or opposite terms) should be selected for verbal anchors
respond.
at equivalent positions on either side of the rating scale, to ensure
Psychometric scales. A psychometric scale comprises multiple
that there is linguistic symmetry of different valence either side of
items measuring the same focal variable in a reliable and valid manner
the rating scale's midpoint. An excellent example of this is the
QUESTIONS /
STATEMENTS
ITEM
RATING SCALES
(ANCHORS ARE THE LABELS)
Strongly
disagree
Disagree
Neutral
Agree
Strongly
agree
1.My manager considers my feelings.
2. My wellbeing is a priority for my manager.
3. My manager defends me from unfair
criticism.
RESPONSE POINTS
RELIABILITY
VALIDITY
PARAMETRIC DATA
PSYCHOMETRIC
SCALE
Components and characteristics of a
psychometric scale
Note: This figure summarizes the more detailed
discussions in the section Structure of Psychometric
Scales, where full references can be found.
FIGURE 1
ROBINSON
741
commonly used anchor set of strongly disagree, disagree, neutral, agree,
scale, with some response points ignored. In these circumstances, a 7-
strongly agree (see, e.g., De Jong & Dirks, 2012). Here, the response
point rating scale will be preferable to a 5-point one, as it will still pro-
points on either side of the midpoint have anchors that are exact
vide a wide range of responses (see Baumgartner & Steenkamp,
antonyms—disagree and agree—while the response points two away
2001, for an extensive review of survey response psychology). For
from the midpoint retain these antonyms but add an identical
similar reasons, a 7-point rating scale can also be beneficial with parti-
adverb—strongly. Verbal anchors should ascend from left to right, in
cipants who are reluctant to select the most extreme responses (Hui &
level of agreement or level of the rated variable, as this is conventional
Triandis, 1985). Second, if the questionnaire is to be administered on
when listing measurements in written English (e.g., a ruler) and is
media with limited display space, such as smartphones, a 5-point rating
therefore clearest for participants. Figure 1 illustrates these principles.
scale will enable a less cramped presentation than a 7-point one.
Traditionally, each response point has an accompanying verbal
Finally, when surveying highly educated samples, 7-point rating scales
anchor, but another common approach is to label only the endpoints
are preferable, as these participants are able to comprehend the addi-
of the rating scale. Labeling only the endpoints can alleviate the prob-
tional response complexity, whereas 5-point rating scales are prefera-
lem of selecting appropriate labels for each response point, but it
ble for the general public (Weijters et al., 2010).
increases the difficulty of responding for participants (Darbyshire &
There are conflicting views about whether researchers should
McDonald, 2004). Furthermore, although fully labeled response
use an even number of response points with no midpoint or an odd
points increase acquiescence, they also reduce extreme responses
number with a midpoint. Weijters et al. (2010) provide a detailed
and increase the clarity of reverse-coded items (Weijters, Cabooter, &
summary of these counterarguments. Essentially, proponents of the
Schillewaert, 2010). So, on balance, labeling each response point is
former argue that participants should be forced to choose whether
usually preferable, unless it is impossible to select balanced and equi-
their response to an item is broadly negative or positive. Conversely,
distant verbal anchors throughout.
proponents of the latter argue that some participants will genuinely
Two modified versions of the traditional Likert rating scale are
have neutral views about some topics, so they should be free to
sometimes used, as described by DeVellis (1991). Semantic differen-
express these accurately. There is some evidence to suggest that the
tial rating scales present antonymic anchors at either end of a rating
elimination of a midpoint leads to less positive responses to items
scale, along which participants respond. Visual analogue scales
and may therefore mitigate socially desirable responding (Garland,
replace discrete response points with a continuous line along which
1991). Similarly, other evidence suggests that the inclusion of a mid-
participants indicate their response.
point increases acquiescence with statements but also decreases
extreme ratings (Weijters et al., 2010). Overall, though, there is not
yet a clear consensus on this issue, although scales with midpoints
2.2.2 | Number of response points
are more frequently used than those without.
Another key issue with Likert rating scales is deciding how many
Finally, when using existing scales, if researchers specifically wish
response points to include. Typically, researchers use between 4 and
to compare rating levels with those from previous studies, the original
9 response points, with some favoring an even number and others an
rating scales should be used with identical response points and
odd number with a midpoint. The optimal number of response points
anchors. This may be the case, for instance, if data on norms for dif-
has been frequently debated and examined statistically. For instance,
ferent populations exist for particular scales or items.
research examining the effect of 2 to 11 response points has shown
that reliability, criterion validity, and the ability to discriminate
between participants' ratings increase as the number of response
2.2.3 | Number of scale items
points increase, plateauing at around 7 response points (Preston &
In practice, the number of items in a scale is likely to be predeter-
Colman, 2000). However, other research has indicated that 5-point
mined, either by the researchers who published it or during the factor
rating scales yield higher quality data than those with 7 or 11 points
analytic development process, as discussed below. However, scale
(Revilla, Saris, & Krosnick, 2014). Still other research has indicated no
length is considered here, as it is a key criterion for scale selection.
difference between data collected from 5-point or 7-point rating
Conventionally, psychometric scales comprise multiple items.
scales (Dawes, 2008). Despite these slight disagreements, though,
Indeed, it could be argued that this is a prerequisite of psychometric
these studies do generally conclude that either 5 or 7 response points
scales, for only multiple items enable the assessment of internal relia-
yield higher quality data than fewer response points and are more
bility (as detailed later) and reliability is a prerequisite of psychometric
practical than longer rating scales. Thus, researchers should use either
scales (Kline, 2000). A minimum of three items per scale is usually
5-point or 7-point rating scales, with decisions between these two
recommended, as this number will reliably yield convergent solutions
options depending on the specifics of the study, such as the variables
in confirmatory factor analysis (Marsh, Hau, Balla, & Grayson, 1998).
being investigated, questionnaire space limits, or participant charac-
However, frequently in research, some problematic items are identi-
teristics, as discussed below.
fied and may therefore have to be deleted, as discussed later. There-
First, for items about some topics where certain responses are
more socially desirable—such as positive performance ratings—there
fore, it is prudent to include an additional item—so a minimum of four
items in a scale—where practical.
can be a tendency for participants to use the corresponding side of
The maximum number of items per scale will depend on the
the rating scale more frequently (Moorman & Podsakoff, 1992). This
complexity of the variable being measured. A larger number of items
leads to highly skewed distributions and effectively a truncated rating
will be required to capture the richness of multidimensional variables
ROBINSON
742
(see, e.g., Allen & Meyer, 1990). However, this must be balanced
Given the many journal articles available about most topics,
against the need for scale brevity to maximize response rates. For this
though, an equally likely problem is that researchers will find several
reason, short versions of well-established scales are often developed
potentially suitable scales that they must choose between. In these
(see, e.g., Thompson, 2007, for a short version of the mood scale by
cases, and when evaluating existing scales, researchers should use
Watson, Clark, & Tellegen, 1988). If such a short scale is unavailable,
two criteria to select the most appropriate scales: (1) psychometric
researchers may be able to adapt their own from the original publica-
data, concerning reliability and validity, as extensively detailed in the
tion describing the development of the scale. To do so, only those
next section, Developing Psychometric Scales, and (2) conceptual fit, as
items with the highest factor loadings for that scale should be
discussed below.
selected, and the internal reliability (as detailed later) of the resultant
Conceptual fit concerns the extent to which the scale matches
shorter scale should be carefully checked. This is a controversial prac-
the variable that the researcher wishes to measure. Ideally, and in
tice, however, as it can reduce the validity of the scale (Raykov, 2008).
many cases, researchers will be able to find an exact match between
Although still widely recognized as best practice, some research-
scale and variable. In other cases, though, the lack of an exact con-
ers have recently questioned the use of multi-item scales. They argue
ceptual fit may necessitate minor modifications to the scale's items.
that participants perceive them as repetitive and onerous, therefore
There are no exact rules about acceptable levels of modification, but
reducing response rates, and suggest the use of single-item measures
a useful guideline might be that changing the subject or object of a
in some circumstances to counter this (see, e.g., Wanous, Reichers, &
statement (or question), provided that the statement still relates to
Hudy, 1997). Accordingly, several single-item measures have been
the same domain, is generally acceptable provided the researcher
developed that demonstrate good validity against equivalent full-
checks the psychometric properties (i.e., reliability and validity, see
scale versions (see, e.g., Nagy, 2002). Indeed, some researchers
below) of the modified scale against those of the original. So, for
believe that for homogeneous construct variables, single-item mea-
instance, Martin, Washburn, Makri, and Gomez-Mejia (2015) modified
sures may even be preferable to multi-item scales (Postmes,
the subject of the original items from Ryckman, Robbins, Thornton,
Haslam, & Jans, 2013), as the latter's specificity may inadvertently
and Cantrell's (1982) self-efficacy scale (e.g., “I will be able to success-
exclude key facets of the variable (Scarpello & Campbell, 1983).
fully overcome future challenges”) to examine the self-efficacy of
firms instead of individuals (e.g., “The firm will be able to successfully
overcome future challenges”) in their study of CEO risk-taking, as
3 | USING EXISTING PSYCHOMETRIC
SCALES
there were no suitable existing scales available. If the scale's items
require excessive modification, however, or if there are no conceptually related scales available, which is common in new or emerging
research fields, researchers may have to develop their own scales
Several useful repositories of existing psychometric scales are available via the Internet, such as the Academy of Management's (AoM)
especially for the study. The procedures for doing so are outlined
below.
Measure Chest (n.d.) and the Social-Personality Psychology Questionnaire Instrument Compendium (Reifman, 2014). These provide lists of
published scales, categorized by topics. Many researchers also make
4 | DEVELOPING PSYCHOMETRIC SCALES
their own published scales freely available via their personal web
pages (see, e.g., Spector, n.d.), so researchers should consult those of
If suitable scales do not exist to measure the variables of interest, or
influential researchers in their field of interest. However, the primary
if researchers feel that existing scales are inadequate, then they may
source of existing scales is peer-reviewed journal articles. Generally,
need to develop their own. The general process for developing a
details about the scales used can be found in the Method section,
scale is outlined in Figure 2 and described below, drawing on well-
often in a subsection entitled Measures, with full scales often pro-
established principles discussed in detail in several sources (see,
vided in the appendix.
e.g., Hinkin, 1998; Kline, 2000; Matsunaga, 2015, for the develop-
Finding journal articles with suitable scales can sometimes prove
ment of a scale measuring employee voice strategy). Often, for effi-
difficult, but has been made considerably easier recently with the
ciency, several scales are developed simultaneously using this process
introduction of freely available Google Scholar (n.d.) software. Rather
(see e.g., Morgeson & Humphrey, 2006). Each stage in Figure 2 will
than relying solely on keywords like many traditional academic search
now be described, thereby also providing information for evaluating
engines, the advanced search capability of Google Scholar enables
the psychometric properties of existing scales from the literature.
researchers to search for exact phrases present anywhere in an article's text, such as the exact name of variables researchers are seeking
to measure (e.g., “job satisfaction”). In many cases, this will locate arti-
4.1 | Generate preliminary items
cles reporting empirical studies where data on that variable have
Researchers can use several methods to identify item content, includ-
been collected using a suitable scale, particularly if the word question-
ing literature reviews, interviews with experts, and content analysis
naire or survey is used as an additional search term for the article's
of existing data sets and resources. This stage is the foundation of
text. Furthermore, if there are widely recognized leading journals in
the entire process, so it is vital that it is theoretically driven. The
the field of research interest, it is often worth performing a Google
nomological network of the variables (Cronbach & Meehl, 1955;
Scholar advanced search restricted to those journals.
Gregory, 2007)—that the generated items will represent—should be
ROBINSON
743
4. Avoid negatively worded questions, as these can be confusing
1.
2.
3.
4.
5.
6.
7.
8.
Generate preliminary items
a. Develop theoretical model
b. Generate questions / statements
c. Generate rating scales
Evaluate preliminary items
a. Evaluate clarity
b. Evaluate content validity
Administer preliminary items
a. Prepare questionnaire
b. Administer questionnaire
c. Collect data
d. Collect feedback
Implement participant feedback
a. Address unclear items
b. Address controversial items
Analyze preliminary item data
a. Exploratory factor analysis
b. Identify preliminary scales
c. Remove surplus items
Administer revised items
a. Prepare questionnaire
b. Administer questionnaire
c. Collect data
Analyze revised item data
a. Confirmatory factor analysis
b. Verify preliminary scales
c. Evaluate internal reliability
d. Evaluate construct validity
e. Confirm final scales
Criterion validate psychometric scales
a. Evaluate criterion validity
(although the advice on this issue is mixed, as discussed below).
5. Avoid leading questions, as these can bias participants' responses
by suggesting how they should answer.
Furthermore, items should be kept short, where possible, for clarity. Rating scales for the items should be developed in accordance
with the design guidelines previously discussed.
Negatively worded items are controversial. They are generally
used as “cognitive speed bumps” to prevent participants slipping into
inaccurate automatic response patterns (Podsakoff, MacKenzie,
Lee, & Podsakoff, 2003, p. 884). However, research has shown that
participants generally do not understand such items (Idaszak & Drasgow, 1987), so their benefits are more than outweighed by their
costs. Some of this confusion arises from the use of complex doublenegatives and similar phrases (Foster & Parker, 1995). In some cases,
it may be possible to rephrase the question or statement to alleviate
such confusion. For instance, in a scale measuring absenteeism, the
statement “I am rarely absent from work” is preferable to the equivalent yet confusing statement “I infrequently do not attend work.”
Finally, when developing psychometric scales, researchers should
include a larger number of items in the preliminary item pool than the
number required for the final psychometric scale(s). The subsequent
development process will eliminate many items for statistical or
methodological reasons, as discussed below, so some redundancy is
useful initially, with double the desired number recommended
(Hinkin, 1998).
Process for developing psychometric scales
Note: This figure summarizes the more detailed discussions in the
section Developing Psychometric Scales, where full references can be
found.
FIGURE 2
4.2 | Evaluate preliminary items
Once the items have been generated, they are then evaluated by the
carefully considered, including their theoretically related variables,
researchers for clarity of expression, to ensure that they are easily
antecedents, and outcomes. This identifies the unique conceptual ter-
understandable and assess exactly what is required. Next, the content
ritory of the new scales, enabling the themes of the items to be spe-
validity of the items is evaluated; that is the extent to which all facets
cified and strengthening the construct validity of the resultant scales
of the focal variable(s) have been comprehensively addressed by the
(see Stage 7). A key consideration here is whether each focal variable
collective items and without redundancy (Cook, 2009). Often, at this
(i.e., theoretical construct) is represented by a single dimension (unidi-
stage, the items are reviewed by a panel of experts, such as those
mensional) or multiple dimensions (multidimensional), notes Edwards
consulted by Sendjaya, Sarros, and Santora (2008) when developing
(2001). The latter, he suggests, can be further divided into superordi-
their scales measuring servant leadership. Here, 15 domain experts
nate constructs (such as personality traits composed of facets) or
drawn from academia and business rated each item for relevance,
aggregate constructs (such as different components of job perfor-
with content validity established when 50% of the experts agreed the
mance). For such multidimensional variables, it is particularly impor-
item was essential. Similarly, it is possible to calculate a coefficient,
tant that the items generated represent the range and richness of the
the content validity index (CVI), to indicate the proportion of experts
underlying dimensions to ensure sound content validity (see Stage 2).
who agree about the relevance of single or multiple items, with values
Questions should be written in a clear and specific manner. Fos-
exceeding .80 considered acceptable (Polit, Beck, & Owen, 2007).
ter and Parker (1995) propose several rules, suggesting that generally
questions should:
4.3 | Administer preliminary items
1. Avoid jargon or specialist terminology, unless well known by the
intended participant population.
Once evaluated, and improved if necessary, the preliminary items are
incorporated into a questionnaire and administered to participants.
2. Avoid ambiguity and be specific. For instance, they suggest the
Often, this stage is referred to as a pilot study, or a pilot question-
term frequently is subjective, and it would be better to ask about
naire. This questionnaire should be designed and administered in
objective quantities.
accordance with the guidance provided elsewhere in this article.
3. Avoid combining questions; ask about only one issue at a time.
However, the pilot questionnaire also usually includes an additional
ROBINSON
744
free-format response section at the end where participants are asked
to provide feedback about the items and questionnaire overall.
Provided that such statistical criteria are satisfied, however, the
choice of how many items to select for each scale is likely to be
determined by practical considerations. Researchers should strive to
achieve an optimal balance between parsimony and comprehensive
4.4 | Implement participant feedback
The feedback provided by participants about the preliminary items
theoretical coverage of the focal variables, removing further surplus
preliminary items where required.
and questionnaire is analyzed. If a consensus emerges that particular
items are unclear or controversial, they should be removed or modified for the next questionnaire (Stage 6). This feedback can also be
supplemented with further interviews or focus groups with participants, if required.
4.6 | Administer revised items
Once the preliminary items have been analyzed, modified as necessary, and reduced to an optimal number per scale, these revised items
are then administered to a further sample of participants in another
questionnaire. Again, this questionnaire should be designed and admi-
4.5 | Analyze preliminary item data
nistered in accordance with the guidance provided throughout this
article. Here, researchers should also administer existing scales mea-
Exploratory factor analysis (EFA) is then conducted on the responses
suring conceptually related and unrelated variables to enable con-
to the preliminary items, to identify the factor structure within the
struct validity to be assessed (see, e.g., Lewis, 2003), as detailed in
items and thus the preliminary psychometric scales. The basic proce-
the description of Stage 7 below.
dures of EFA are detailed in many statistical textbooks (see,
e.g., Tabachnick & Fidell, 2007). The ideal minimum sample size for
EFA has been debated frequently, with larger samples and larger
4.7 | Analyze revised item data
item-to-participant ratios considered better (see Osborne & Costello,
Confirmatory factor analysis (CFA) is then conducted on the responses
2004, for a detailed discussion). Minimum sample sizes of 300 partici-
to the revised items to verify the factor structure identified by the ini-
pants are generally advocated (Comrey & Lee, 1992; Tabachnick &
tial EFA. If verified, the item composition of the scale(s) can be consid-
Fidell, 2007) unless loadings are particularly high (> .60), in which
ered finalized. If the CFA does not support the factor structure
case 150 participants are adequate (Guadagnoli & Velicer, 1988). A
identified by the initial EFA, however, then it may be necessary to
minimum of 10 items per participant are generally recommended
readminister the revised items to a further sample of participants until
(Guadagnoli & Velicer, 1988; Osborne & Costello, 2004). Indeed,
statistical consensus is achieved regarding factor structure. The sample
medians of 267 participants and 11 items per participant were found
size recommendations provided for EFA above are generally also of
in a review of published EFA studies (Henson & Roberts, 2006), cor-
relevance to CFA (Mundfrom, Shaw, & Ke, 2005; Tabachnick & Fidell,
roborating these guidelines.
2007). Specifically, though, a minimum sample of 200 participants has
For unidimensional variables, EFA is performed only on the
frequently been recommended for CFA (Barrett, 2007; Tabachnick &
matrix of correlations between the items, but for multidimensional
Fidell, 2007). Once the final factor structure has been confirmed, the
variables this initial EFA identifies the first-order factors, and a sec-
internal reliability of the scales should then be assessed.
ond EFA is then performed on the matrix of correlations between
Psychometric scales use multiple items measuring the same focal
these first-order factors to identify second-order factors (Edwards,
variable so that the consistency or internal reliability with which par-
2001; Gorsuch, 1983). So, when used for psychometric scale devel-
ticipants respond to them can be assessed. Reliability is a prerequisite
opment, first-order factors would identify subscales that are nested
for validity, and both are essential characteristics of psychometric
within the wider construct represented by the second-order factor.
scales (Kline, 2000). Cronbach's alpha coefficient (α; Cronbach, 1951)
Consideration should be given here to the optimal number of
is the most frequently used statistic for this purpose. There is wide-
items in each scale, with factor loadings examined accordingly. Com-
spread debate about what the minimum acceptable alpha coefficient
rey and Lee (1992) have proposed statistical loading thresholds to aid
level is for psychometric scales. Traditionally, the figure of α ≥ .70
factor interpretability, and these could also be usefully applied to
was widely suggested, although the origin was uncertain. Cortina
determine how many items to retain in each psychometric scale. Not-
(1993) discusses alpha in considerable depth, noting that the same
ing how squared factor loadings indicate the proportion of shared
mean inter-item correlation will yield higher alpha coefficients for
variance between those items and the factors on which they load,
longer scales than shorter scales, and advises cautious interpretation.
they suggested that items loading over .71 (50% shared variance)
Nevertheless, he suggests α ≥ .75 as the conventional accepted level.
have an excellent fit with the factor, those loading over .63 (40%
If researchers discover the alpha coefficient of a scale they have
shared variance) a very good fit, over .55 (30% shared variance) a
used is below .75, they may wish to consider deleting an item to
good fit, and over .45 a fair fit (20% shared variance). This resonates
increase this coefficient. Statistical analysis software such as SPSS
with Costello and Osborne's (2005) more recent recommended
(n.d.) can calculate alpha coefficients with each item deleted along-
threshold for loadings of .50 or higher. Furthermore, both pairs of
side the alpha coefficient for the overall scale, helping to identify
authors caution against retaining items loading below .30. So items
rogue items for potential deletion. However, such item deletion is a
with factor loadings above this .45 threshold would therefore make
controversial practice, with some arguing that it dilutes the concep-
excellent scales when combined.
tual coverage and validity of the scale (Raykov, 2008). If the rogue
ROBINSON
745
item is a negatively worded one, however, the problem is likely a
with these values, a minimum correlation threshold of .40 is therefore
methodological artifact (see Idaszak & Drasgow, 1987), as discussed
recommended for establishing sound criterion validity (Peers, 1996).
above, and it should therefore be deleted.
Following this final validation, the psychometric scales are now
However, for CFA, researchers are now increasingly using the
ready to use for research purposes. They may also be published, with
composite reliability method of Dillon-Goldstein's rho (or Jöreskog's
accompanying psychometric data concerning their reliability and
rho) (ρc), for which values over .70 indicate acceptable reliability
validity, for use by other researchers.
(Chin, 1998; Werts, Linn, & Jöreskog, 1974). Unlike Cronbach's alpha,
Finally, sometimes at this stage the scales are readministered to
it does not assume each of a scale's constituent items is of equal
the same participants from Stage 6 and their scores are compared to
importance, as it is based on the factor loadings instead of the corre-
establish test–retest reliability (Cook, 2009). Pearson correlations
lations between items, and is therefore more accurate (Esposito Vinzi,
exceeding .80 would indicate acceptable test–retest reliability, but
Trinchera, & Amato, 2010).
this is only a relevant concept for the few variables expected to
Finally in this stage, construct validity is assessed, which con-
exhibit stability over time such as personality (Kline, 2000).
cerns whether the scales measure the constructs they claim to (Cook,
Table 1 provides a summary of the different types of reliability
2009). This should be demonstrated in three ways. First, the factor
and validity of relevance to developing psychometric scales, as dis-
analyses performed in Stage 5 and here in Stage 7 will establish some
cussed above.
construct validity through the distinct factor structures identified and
the absence of cross-loadings (Tabachnick & Fidell, 2007). Second, to
establish convergent construct validity, scale scores for each variable
should be highly correlated (i.e., converge) with scores from other
5 | ADMINISTRATION
established measures of the same variable and also with measures of
other variables from within that variable's nomological network of
theoretically related constructs (Cronbach & Meehl, 1955; Gregory,
2007), administered in Stage 6. Third, to establish divergent
(or discriminant) construct validity, scale scores for each variable
should be uncorrelated (i.e., diverge) with scores from theoretically
unrelated constructs (Gregory, 2007), also administered in Stage
6. For instance, Maynes and Podsakoff (2014) established the construct validity of their four measures of employee voice behaviors
Psychometric scales are administered within questionnaires, so when
using them for research there are several practical issues to consider,
and these are now reviewed. The first broad issue is the content of
the remainder of the questionnaire, including its introduction, the
generation of identification codes, and demographic questions. The
second broad issue concerns the format of questionnaire administration and the implications of this for the presentation of the psychometric scales to participants.
through examining the relationships between these scale scores and
data from related and unrelated measures of the Big Five personality
traits. As a guide, convergent construct validity would be demonstrated by medium to large correlations exceeding .30, while small
correlations below .20 would indicate divergent construct validity
(Cohen, 1992; see also Lewis, 2003).
5.1 | Introductory information
First, participants are briefly provided with an overview of the
research, the purpose of their involvement, and the contact details of
the researchers. This engages participants, and also provides sufficient information for them to give their informed consent to participate (American Psychological Association [APA], 2010). Typically, the
4.8 | Criterion validate psychometric scales
research overview is relatively general, justifying their participation
but not detailing specific research questions or hypotheses. Indeed,
By this stage, the reliability of the scales has been established, as has
excessive information of this nature may prime participants to
their content validity through the generation and evaluation of the
respond in particular ways, which may bias the research.
items, and their construct validity through the two factor analyses
Second, a number of statements relating to research ethics are
and the convergent and divergent analyses. So the aim of this stage
presented. Conventions can vary by discipline, but the APA (2010)
is to establish the criterion validity of the scale(s). Here, participants'
provides extensive guidelines, the key principles of which are sum-
scores on the scale(s) from Stage 7 are statistically correlated with
marized below. First, participants are told that their involvement is
independent objective data measuring the same or related variables
entirely voluntary and that they have the right to withdraw at any
for each participant (Cook, 2009). For instance, a self-report scale
time. They are also assured that any information they provide will
measuring staff absenteeism could be criterion validated against
remain entirely confidential, and that results will only be presented in
absence data from official organizational records, using suitable iden-
an aggregated format so that no individual's responses are identifia-
tification codes to match the two data sources. The criterion data
ble. Then, they are asked to provide their informed consent to partici-
against which scale scores are validated can either be collected at the
pate, either by answering a direct question to this effect or by being
same time the scales are completed, to establish concurrent criterion
informed that their continuation implies this. Most institutions and
validity, or at a future date, to establish predictive criterion validity
organizations in which researchers work will have their own formal
(Cook, 2009). Drawing on Cohen's (1992) guidance about effect
ethical clearance procedures based on similar principles. Finally, ques-
sizes, correlations exceeding .30 would indicate reasonable criterion
tions about sensitive or controversial topics will usually require thor-
validity, with correlations exceeding .50 being excellent. Resonating
ough justification via an ethics committee.
ROBINSON
746
TABLE 1
Types of reliability and validity relevant to psychometric scale development
Concept
Definition
Measure
Internal reliability
All the items comprising the psychometric scale are
measuring the same variable consistently.
Cronbach's alpha (α) ≥ .75
Dillon-Goldstein's rho (or Jöreskog's rho) (ρc) ≥ .70
Test–retest reliability
Scores on the psychometric scale are consistent over time
for the same person (i.e., scores attained at different
times).
Pearson correlation (r) ≥ .80
Test–retest reliability is only relevant for variables
expected to be stable over time (e.g., personality).
Content validity
Collectively, the items comprising the psychometric scale
address all key aspects of the variable and no irrelevant
aspects.
Content validity index (CVI) ≥ .80
Construct validity
The psychometric scale is measuring the specific variable
it claims to measure (and not another similar variable).
Items load onto a single factor that is distinct from
other factors (i.e., no cross-loadings)
Convergent construct validity
Scores on the psychometric scale are strongly related to
scores on psychometric scales measuring conceptually
related variables.
Pearson correlation (r) ≥ .30
Divergent (or discriminant) construct validity
Criterion validity
Scores on the psychometric scale are not strongly related
to scores on psychometric scales measuring
conceptually unrelated variables.
Pearson correlation (r) < .20
Scores on the psychometric scale are strongly related to
objective measures of the same variable.
Pearson correlation (r) ≥ .40
Concurrent criterion validity
Scores on the psychometric scale are strongly related to
objective measures of the same variable measured at
the same time.
Predictive criterion validity
Scores on the psychometric scale are strongly related to
objective measures of the same variable measured at a
future time.
Note: This table summarizes the more detailed discussions in the section Developing Psychometric Scales, where full references can be found.
5.2 | Identification codes
In many research studies, questionnaires can be completed entirely
anonymously, so no identification features are required. However, in
some instances, it is necessary for researchers to be able to identify
participants. For example, participants may be distributed between
various groups (e.g., teams or organizations) and researchers may
wish to analyze the data at a group level, such as Halevi et al.’s
(2015) study of organizational ambidexterity in strategic business
units. In such instances, this should be explained to participants in
such as a nine-letter code comprising the first three letters from each
of the following three words: (1) first pet's name, (2) hometown,
(3) favorite sports team. In this way, the code does not have to be
remembered, although this is desirable, as it can be generated again
through asking the same questions. Then, each time participants complete a new questionnaire, they should be asked to provide this code
to enable matching with their previous questionnaires.
5.3 | Demographic questions
the introductory section and questionnaires should carry a suitable
Often in research, it is necessary to collect demographic data from
code (e.g., Team 1) to enable their grouping when returned.
participants. Sometimes, this is an integral part of the research itself;
A second example is the use of longitudinal research designs,
for instance, when conducting research examining age or gender (see
where data are collected from participants—using questionnaires
e.g., Festing et al., 2015, for a study of performance management
and/or other methods—at two or more time points, requiring
preferences between genders). In other instances, it is necessary to
researchers to match data from the same participant. For example,
control for demographic variables when performing statistical ana-
Sturges and Guest (2004) examined work–life balance in recent grad-
lyses (see e.g., Mäkelä, Kinnunen, & Suutari, 2015). When conducting
uates, tracking them from before they started work and into their
research in organizations, it may also be desirable to collect data con-
first appointment, by administering questionnaires at three time
cerning variables such as participants' roles, seniority, and experience.
points six months apart. Generally, participants are unwilling to
For standardization, demographic data are generally best collected
divulge readily identifiable personal information (e.g., names), how-
using questions with fixed response categories, from which partici-
ever, so they should each be asked to generate an anonymous and
pants select the appropriate response. One possible exception is
unique identification code to include on each questionnaire they
questions concerning time, such as age and organizational tenure,
complete. This code should contain information that will not change,
where the collection of exact data (e.g., 33 years)—provided this does
ROBINSON
747
not identify participants—can be subsequently coded into fixed cate-
anchors displayed at the top, to replicate a conventional paper ques-
gories (e.g., 30–34 years) if required.
tionnaire format. A further option is to display the scale anchors at
both the top and bottom of each screen, or above shorter blocks of
5.4 | Administration format: Paper or electronic
items, so that one set of anchors is always visible to participants.
Second, there are numerous different orders in which question-
Generally, there are two broad approaches for administering ques-
naire items and scales can be displayed, and these may have subtle
tionnaires: paper-based and electronic. In recent years, due to the
effects on how participants respond. In general, it is best to cover
advent of specialist software, the latter approach is increasingly used;
important topics earlier and sensitive topics later. In this way, if parti-
however, each has its advantages and disadvantages.
cipants do not complete the questionnaire, due to length or sensitiv-
The response rates and financial costs of each approach were
ity, some useful data may still be collected.
systematically investigated by Greenlaw and Brown-Welty (2009).
Third, in recent years, researchers have debated whether ques-
Questionnaires administered electronically through the Internet
tionnaire items should be grouped by topic, or presented in a mixed
resulted in a higher response rate (52.46%) and a substantially lower
or random order. Podsakoff et al. (2003) summarize the key issues in
cost per response ($0.64) than those administered in a paper format
this debate, as discussed below. Essentially, advocates of the former
(42.03%, $4.78). Although providing participants with both options
approach argue that grouping items is clearer for participants and
yielded the highest response rate of all (60.27%), the high cost of
allows them to carefully consider each topic holistically. However,
doing so outweighed this advantage ($3.61). Overall, then, the elec-
critics of this approach argue that by grouping multiple similar items,
tronic Internet-based approach was superior. Administering question-
common method bias may lead to artificially high consistency
naires via Internet-based services (e.g., Qualtrics, n.d.) is also
between responses to a scale's items. Indeed, Podsakoff et al. note
extremely efficient, as it enables data to be downloaded electroni-
that clustering items from the same scales together inflates intra-
cally, which greatly reduces the time taken to enter and format data
scale correlations, and thus also inflates Cronbach's alpha internal
for statistical analysis. Given the global reach of the Internet, it is also
reliability, while mixing them inflates inter-scale correlations, and thus
possible to recruit large numbers of participants with relative ease.
some bias is inevitable either way. They therefore conclude that the
Increasingly, electronic questionnaires are being administered on
issue has yet to be resolved.
handheld digital devices such as PDAs, smartphones, and tablets, and
Given the lack of consensus, then, the simplest approach is prob-
this approach has great research potential (Miller, 2012). In particular,
ably best. Unless the process of mixed or random item ordering can
the portability and convenience of such devices greatly facilitates the
be fully electronically automated, the potential for subsequent confu-
repeated collection of diary data over extended periods (Robinson,
sion and mistakes—when regrouping the items for analysis—would
2012). Reassuringly, research indicates that data collected from iden-
suggest that the simpler method of grouping the items by scale
tical surveys on computers and smartphones are comparable
(or theme) is the best procedure when administering questionnaires.
(de Bruijne & Wijnant, 2013). However, items and responses should
be clearly formatted for display on such small screens, in accordance
with the guidelines further below.
6 | DATA PREPARATION
Despite these advantages, though, there are still two circumstances where paper-based questionnaire administration may be preferable. First, some participants may not have access to the Internet
6.1 | Numerically coding responses
(e.g., factory workers), so paper questionnaires are the only practical
Once researchers receive the data, the first step in calculating scale
option. Second, in some cases, paper questionnaires may be more
scores is to numerically code the response points to quantify the par-
convenient, particularly if potential participants are gathered together
ticipants' response data. Consecutive ascending whole numbers are
in a single venue (such as a conference) and have some spare time to
almost always used to reflect the equally appearing intervals
participate.
(Thurstone, 1929) indicated by the verbal anchors. This is the recommended option. So, for instance, strongly disagree could be coded
5.5 | Questionnaire presentation: Methodological
issues
1, disagree coded 2, neutral coded 3, agree coded 4, and strongly agree
coded 5 (see e.g., Albirini, 2006).
Through convention, most researchers code the lowest response
There are three methodological issues concerning the presentation of
point as 1 rather than 0 (see, e.g., Albirini, 2006), although some do
items in questionnaires that researchers should pay careful attention
the latter (see e.g., Bolger, Zuckerman, & Kessler, 2000). Either way,
to, as discussed below. In principle, these issues relate to question-
inter-scale correlations will be the same; although scale scores (see
naires administered in any format; although, in practice, they relate
below) will naturally be 1 higher in the former case. However, there
mainly to electronic questionnaires.
are good reasons for coding response points from 0 upwards rather
First, if the rating scale anchors are only provided at the start of
than from 1. First, if scale scores are displayed in a graph, then it is
the questionnaire, participants may lose sight of them as they scroll
possible to start the y-axis at 0, which makes intuitive sense and is
down the screen, potentially leading to confusion or incorrect
the default option in most graphics software. Indeed, a very common
responses. This issue can be partly resolved by displaying the ques-
error is to display scale scores measured using 1–5 coding on graphs
tionnaire on a number of shorter consecutive screens, each with the
with 0–5 y-axes, thereby erroneously inflating perceived scores.
ROBINSON
748
Second, and related, when scale scores starting at 1 are reported as
approach is used, SPSS (n.d.) software offers researchers an option to
having a maximum score of 5, for example, or alternatively as meas-
specify the minimum number of items for which a response must
ured on a 5-point scale, a misperception often arises where readers
have been recorded before a scale score is calculated.
erroneously assume that 2.5 is the midpoint rather than the true
value of 3, again serving to erroneously inflate perceived scores.
There are notable exceptions to this recommendation, however.
Some scales have aggregated item scale scores that correspond to
Finally, where negatively worded items have been used, these
thresholds or particular critical levels of a variable. For instance, the
need to be reverse-coded before proceeding. So, using a traditional
revised Negative Acts Questionnaire has threshold scores corre-
5-point rating scale, the reverse coding would proceed as follows:
sponding to different degrees of workplace bullying (Notelaers &
strongly disagree (1 ! 5), disagree (2 ! 4), neutral (3 ! 3), agree
Einarsen, 2013). In these cases, an aggregated item scale score may
(4 ! 2), strongly agree (5 ! 1). To check the accuracy of this recod-
be required, and precautions should therefore be taken to ensure par-
ing, it is prudent to correlate the original negatively worded items
ticipants respond to all items.
with their reverse-coded counterpart items, to ensure that correlations are r = −1.00 as they should be. Finally, it is important to note
that negatively worded items are identified relative to the variable
the scale is measuring. So, for example, an item about alertness would
be a negatively worded one in a scale measuring fatigue, even if the
item itself does not contain a negative prefix (e.g., not or un-).
7 | CONC LU SION
Psychometric scales are arguably the most frequently used research
method in the social sciences. However, their effective development
and use requires a detailed knowledge of technical procedures and
6.2 | Calculating scale scores
issues that are frequently not well taught or misunderstood. It is
hoped that this article will therefore provide HRM researchers and
There are two ways in which scale scores can be calculated. First, the
mean rating of all items comprising the scale can be calculated. Sec-
practitioners with a solid grounding in this important method for the
benefit of future research and practice.
ond, the ratings of all items comprising the scale can be aggregated. If
there are no missing data, either option will yield identical inter-scale
correlations, albeit with different scale scores, naturally. However, in
ORCID
reality, there are almost always missing data, in which case aggregat-
Mark A. Robinson
http://orcid.org/0000-0001-5535-8737
ing item ratings will yield lower scale scores than appropriate for participants with missing data. Consequently, the first option—
calculating the mean rating of all items comprising the scale—is
strongly recommended, and this method is almost always used in
published research. A further advantage of this mean item rating
method is that the scale score is calibrated to the original rating
scale—for instance, a scale score of 3.4 on a 1–5 rating scale—and
therefore has more meaning for readers than an aggregated item
scale score that is more reflective of the number of items than the
strength of response.
When calculating the scale score from the mean of its constituent items' ratings, however, it is necessary to decide how many of
the scale's items a participant must respond to for this calculation to
be valid. Graham (2009) suggests that participants must have
responded to at least 50% of a scale's constituent items before their
scale scores can be calculated using the mean item rating method. He
also cautions that the scale's Cronbach alpha internal reliability should
be high and the items answered should adequately represent the
scale's construct. While agreeing with these latter two restrictions,
Newman (2014) suggests that even one item is sufficient to calculate
a scale score, reasoning that this is less wasteful of precious data and
therefore increases statistical power. To balance these competing
demands, a conservative rule of thumb might therefore be that: (a) a
threshold of responses to at least 50% of a scale's items should be
reached before calculating a scale score, unless (b) this approach
reduces the sample size to below recommended levels for statistical
analyses, in which case scale scores should be calculated from
responses to one or more items provided the Cronbach's alpha internal reliability of the scale is high (α ≥ .75; Cortina, 1993). Whichever
RE FE RE NC ES
Academy of Management (AoM). (n.d.). Measure chest. Research Methods
Division, AoM. Retrieved from http://rmdiv.org/?page_id=104
Albirini, A. (2006). Teachers' attitudes toward information and communication technologies: The case of Syrian EFL teachers. Computers &
Education, 47(4), 373–398.
Allen, N. J., & Meyer, J. P. (1990). The measurement and antecedents of
affective, continuance and normative commitment to the organization.
Journal of Occupational Psychology, 63(1), 1–18.
American Psychological Association (APA). (2010). Ethical principles of
psychologists and code of conduct. Retrieved from http://www.apa.
org/ethics/code/principles.pdf
Barrett, P. (2007). Structural equation modeling: Adjudging model fit. Personality and Individual Differences, 42(5), 815–824.
Baumgartner, H., & Steenkamp, J.-B. E. M. (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing
Research, 38(2), 143–156.
Bolger, N., Zuckerman, A., & Kessler, R. C. (2000). Invisible support and
adjustment to stress. Journal of Personality and Social Psychology, 79(6),
953–961.
Bourque, L. B. (2004). Self-administered questionnaire. In M. S. LewisBeck, A. Bryman, & T. Futing Liao (Eds.), The Sage encyclopedia of
social science research methods (Vol. 3, pp. 1012–1013). London, England: Sage.
Chin, W. W. (1998). The partial least squares approach to structural equation modeling. In G. A. Marcoulides (Ed.), Modern methods for business
research (pp. 295–336). Mahwah, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.).
Hillsdale, NJ: Erlbaum.
Cook, M. (2009). Personnel selection: Adding value through people (5th ed.).
Chichester, England: Wiley-Blackwell.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory
and applications. Journal of Applied Psychology, 78(1), 98–104.
ROBINSON
Costello, A. B., & Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your
analyses. Practical Assessment, Research & Evaluation, 10(7), 1–9.
Cronbach, L. J. (1951). Coefficient alpha and the internal consistency of
tests. Psychometrika, 16(3), 297–334.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological
tests. Psychological Bulletin, 52(4), 281–302.
Darbyshire, P., & McDonald, H. (2004). Choosing response scale labels
and length: Guidance for researchers and clients. Australasian Journal
of Market Research, 12(2), 17–26.
Dawes, J. (2008). Do data characteristics change according to the number
of scale points used? An experiment using 5-point, 7-point and 10point scales. International Journal of Market Research, 50(1), 61–77.
de Bruijne, M., & Wijnant, A. (2013). Comparing survey results obtained
via mobile devices and computers: An experiment with a mobile web
survey on a heterogeneous group of mobile devices versus a computerassisted web survey. Social Science Computer Review, 31(4), 482–504.
De Jong, B. A., & Dirks, K. T. (2012). Beyond shared perceptions of trust
and monitoring in teams: Implications of asymmetry and dissensus.
Journal of Applied Psychology, 97(2), 391–406.
DeVellis, R. F. (1991). Scale development: Theory and applications. London,
England: Sage.
Edwards, J. R. (2001). Multidimensional constructs in organizational
behavior research: An integrative analytical framework. Organizational
Research Methods, 4(2), 144–192.
Esposito Vinzi, V., Trinchera, L., & Amato, S. (2010). PLS path modeling:
From foundations to recent developments and open issues for model
assessment and improvement. In V. Esposito Vinzi, W. W. Chin,
J. Henseler, & H. Wang (Eds.), Handbook of partial least squares
(pp. 47–82). Berlin, Germany: Springer-Verlag.
Festing, M., Knappert, L., & Kornau, A. (2015). Gender-specific preferences in global performance management: An empirical study of male
and female managers in a multinational context. Human Resource
Management, 54(1), 55–79.
Foster, J. J., & Parker, I. (1995). Carrying out investigations in psychology:
Methods and statistics. Leicester, England: British Psychological Society
(BPS) Books.
Garland, R. (1991). The mid-point on a rating-scale: Is it desirable? Marketing Bulletin, 2, 66–70.
Google Scholar. (n.d.). Academic literature search software. http://scholar.
google.com
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.
Graham, J. W. (2009). Missing data analysis: Making it work in the real
world. Annual Review of Psychology, 60, 549–576.
Greenlaw, C., & Brown-Welty, S. (2009). Testing assumptions of survey
mode and response cost: A comparison of web-based and paperbased survey methods. Evaluation Review, 33(5), 464–480.
Gregory, R. J. (2007). Psychological testing: History, principles, and applications. London, England: Pearson Education.
Guadagnoli, E., & Velicer, W. F. (1988). Relation of sample size to the stability of component patterns. Psychological Bulletin, 103(2), 265–275.
Halevi, M. Y., Carmeli, A., & Brueller, N. N. (2015). Ambidexterity in SBUs:
TMT behavioral integration and environmental dynamism. Human
Resource Management, 54(Suppl. 1), 223–238.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis
in published research: Common errors and some comment on
improved practice. Educational and Psychological Measurement, 66(3),
393–416.
Hinkin, T. R. (1998). A brief tutorial on the development of measures for
use in survey questionnaires. Organizational Research Methods, 1(1),
104–121.
Hui, C. H., & Triandis, H. C. (1985). The instability of response sets. Public
Opinion Quarterly, 49(2), 253–260.
Idaszak, J. R., & Drasgow, F. (1987). A revision of the job diagnostic survey: Elimination of a measurement artifact. Journal of Applied
Psychology, 72(1), 69–74.
Kline, P. (2000). Handbook of psychological testing (2nd ed.). London, England: Routledge.
Lewis, K. (2003). Measuring transactive memory systems in the field:
Scale development and validation. Journal of Applied Psychology, 88(4),
587–604.
749
Likert, R. (1932). A technique for the measurement of attitudes. Archives
of Psychology, 22(140), 5–53.
Mäkelä, L., Kinnunen, U., & Suutari, V. (2015). Work-to-life conflict and
enrichment among international business travelers: The role of international career orientation. Human Resource Management, 54(3),
517–531.
Marsh, H. W., Hau, K. T., Balla, J. R., & Grayson, D. (1998). Is more ever
too much? The number of indicators per factor in confirmatory factor
analysis. Multivariate Behavioral Research, 33(2), 181–220.
Martin, G., Washburn, N., Makri, M., & Gomez-Mejia, L. R. (2015). Not all
risk taking is born equal: The behavioral agency model and CEO's perception of firm efficacy. Human Resource Management, 54(3),
483–498.
Matsunaga, M. (2015). Development and validation of an employee voice
strategy scale through four studies in Japan. Human Resource
Management, 54(4), 653–671.
Maynes, T. D., & Podsakoff, P. M. (2014). Speaking more broadly: An
examination of the nature, antecedents, and consequences of an
expanded set of employee voice behaviors. Journal of Applied
Psychology, 99(1), 87–112.
Miller, G. (2012). The smartphone psychology manifesto. Perspectives on
Psychological Science, 7(3), 221–237.
Moorman, R. H., & Podsakoff, P. M. (1992). A meta-analytic review and
empirical test of the potential confounding effects of social desirability
response sets in organizational behaviour research. Journal of Occupational and Organizational Psychology, 65(2), 131–149.
Morgeson, F. P., & Humphrey, S. E. (2006). The work design questionnaire
(WDQ): Developing and validating a comprehensive measure for
assessing job design and the nature of work. Journal of Applied
Psychology, 91(6), 1321–1339.
Mundfrom, D. J., Shaw, D. G., & Ke, T. L. (2005). Minimum sample size
recommendations for conducting factor analyses. International Journal
of Testing, 5(2), 159–168.
Nagy, M. S. (2002). Using a single-item approach to measure facet job satisfaction. Journal of Occupational and Organizational Psychology, 75(1),
77–86.
Nevill, A. M., Lane, A. M., Kilgour, L. J., Bowes, N., & Whyte, G. P. (2001).
Stability of psychometric questionnaires. Journal of Sports Sciences,
19(4), 273–278.
Newman, D. A. (2014). Missing data: Five practical guidelines. Organizational Research Methods, 17(4), 372–411.
Notelaers, G., & Einarsen, S. (2013). The world turns at 33 and 45: Defining simple cutoff scores for the Negative Acts Questionnaire—Revised
in a representative sample. European Journal of Work and Organizational Psychology, 22(6), 670–682.
Osborne, J. W., & Costello, A. B. (2004). Sample size and subject to item
ratio in principal components analysis. Practical Assessment, Research &
Evaluation, 9(11), 1–9.
Peers, I. S. (1996). Statistical analysis for education and psychology researchers. London, England: Falmer Press.
Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., & Podsakoff, N. P. (2003).
Common method biases in behavioral research: A critical review of
the literature and recommended remedies. Journal of Applied
Psychology, 88(5), 879–903.
Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable
indicator of content validity? Appraisal and recommendations.
Research in Nursing & Health, 30(4), 459–467.
Postmes, T., Haslam, S. A., & Jans, L. (2013). A single-item measure of
social identification: Reliability, validity, and utility. British Journal of
Social Psychology, 52(4), 597–617.
Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminatory power, and
respondent preferences. Acta Psychologica, 104, 1–15.
Qualtrics. (n.d.). Questionnaire administration software. http://www.
qualtrics.com/
Raykov, T. (2008). Alpha if item deleted: A note on loss of criterion
validity in scale development if maximizing coefficient alpha.
British Journal of Mathematical and Statistical Psychology, 61,
275–285.
Reifman, A. (2014). Social-personality psychology questionnaire instrument compendium. http://www.webpages.ttu.edu/areifman/qic.htm
ROBINSON
750
Revilla, M. A., Saris, W. E., & Krosnick, J. A. (2014). Choosing the number
of categories in agree–disagree scales. Sociological Methods &
Research, 43(1), 73–97.
Robinson, M. A. (2012). How design engineers spend their time: Job content and task satisfaction. Design Studies, 33(4), 391–425.
Ryckman, R. M., Robbins, M. A., Thornton, B., & Cantrell, P. (1982). Development and validation of a physical self-efficacy scale. Journal of Personality and Social Psychology, 42(5), 891–900.
Scarpello, V., & Campbell, J. P. (1983). Job satisfaction: Are all the parts
there? Personnel Psychology, 36(3), 577–600.
Sendjaya, S., Sarros, J. C., & Santora, J. C. (2008). Defining and measuring
servant leadership behaviour in organizations. Journal of Management
Studies, 45(2), 402–424.
Spector, P. E. (n.d.). Psychological instrument resources. http://shell.cas.
usf.edu/~pspector/scalepage.html
SPSS. (n.d.). Statistical analysis software. http://www-01.ibm.com/
software/analytics/spss/
Stone, A. A., & Turkkan, J. S. (2000). Preface. In A. A. Stone, J. S. Turkkan,
C. A. Bachrach, J. B. Jobe, H. S. Kurtzman, & V. S. Cain (Eds.), The science of self-report (pp. ix–xi). Mahwah, NJ: Erlbaum.
Sturges, J., & Guest, D. (2004). Working to live or living to work? Work/
life balance early in the career. Human Resource Management Journal,
14(4), 5–20.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th
ed.). London, England: Pearson/Allyn & Bacon.
Thompson, E. R. (2007). Development and validation of an internationally
reliable short-form of the Positive and Negative Affect Schedule
(PANAS). Journal of Cross-Cultural Psychology, 38(2), 227–242.
Thurstone, L. L. (1929). Theory of attitude measurement. Journal of Experimental Psychology, 12(3), 214–224.
Wanous, J. P., Reichers, A. E., & Hudy, M. J. (1997). Overall job satisfaction: How good are single-item measures? Journal of Applied
Psychology, 82(2), 247–252.
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS
scales. Journal of Personality and Social Psychology, 54(6), 1063–1070.
Weijters, B., Cabooter, E., & Schillewaert, N. (2010). The effect of rating
scale format on response styles: The number of response categories
and response category labels. International Journal of Research in
Marketing, 27(3), 236–247.
Werts, C. E., Linn, R. L., & Jöreskog, K. G. (1974). Intraclass reliability estimates: Testing structural assumptions. Educational and Psychological
Measurement, 34(1), 25–33.
Yam, K. C., Fehr, R., & Barnes, C. M. (2014). Morning employees are perceived as better employees: Employees' start times influence supervisor performance ratings. Journal of Applied Psychology, 99(6),
1288–1299.
AUTHOR'S BIOGRAPHY
Mark Robinson holds a PhD in Organizational Psychology from
the University of Leeds, where he currently works as a faculty
member in Leeds University Business School. He is also Deputy
Director of the Socio-Technical Centre, an interdisciplinary
research centre, and a member of the Workplace Behaviour
Research Centre. His research interests include human performance, group behavior, social cognition, and complex systems.
How to cite this article: Robinson MA. Using multi-item psychometric scales for research and practice in human resource
management.
Hum
Resour
Manage.
https://doi.org/10.1002/hrm.21852
2018;57:739–750.