Data Preparation Strategies in Excel
TABLE 1 Types of variables collected for data analysis
ID number or code: An identification variable that uniquely identifies a subject or entity. This could be a patient identifier, a number assigned by the computer, or some other assigned code.
Demographic variables: These variables include measures such as age, gender, and ethnicity that describe the subject population.
Outcome variable(s) (dependent variable): This variable is the primary outcome measure. It can be discrete (eg, dead/alive, cured/not cured) or continuous (eg, time to recurrence, blood pressure, cost incurred).
Predictor variables (independent variables): These variables may include treatment (grouping) variables and other measured or observed information, such as weight, height, or smoking status.
Covariate measures: Covariate measures are variables that are related to the outcome variable and may be used to adjust the mean of the response variable and account for variability in a statistical model.
Verification measures: These variables may be used to verify the reliability of the data, such as a measure of how compliant a subject is in taking medicine or multiple laboratory values to assess the reliability of laboratory data.
feasible and data entry is performed using Excel, the savvy investigator can still implement the "good practice" techniques described here.

SELECTION AND DESCRIPTION OF DATA ELEMENTS

Accurate data collection begins with planning. Before collecting any data, an investigator should define research questions and determine what measurements are needed to answer them. Typically, a research data set includes at least one outcome variable (dependent variable) and one or more predictor (independent) variables. Other demographic, covariate, or verification measures may also be recorded. It is essential that a unique key identifier be included for every observed subject (or entity) in a data set. Table 1 describes types of variables collected for data analysis. For each project, the researcher should use this information to verify that all of the variables needed to perform an analysis are included. It is disheartening to realize too late that a variable needed to complete an analysis was not collected.

An important part of a well-designed study is the documentation of each variable in a table called a data dictionary. An example data dictionary is shown in Table 2. This table can be created in a word processor or spreadsheet program and, once created, defines the characteristics of variables in both the data entry spreadsheet and statistical program. A brief explanation of each item of the dictionary follows:

Variable name. Select simple variable names using naming conventions compatible with the program that will be used for analysis. General naming conventions that work for most programs (such as SAS and SPSS) include the following:

Make variable names short and explanatory. Typical variable names are ID, GENDER, B_DATE, COST, GROUP, IQ_SCORE, and DEATH.
Because some programs (such as older versions of SPSS) limit names to eight characters, it is best to comply with this restriction.

Begin variable names with a letter (A–Z). Some programs also allow variable names to begin with an underscore (_). Variable names may include numbers (not as a first character) but not blanks or other special characters (an underscore is valid). For example, AGE_2004 or AGE2004 is valid but not 2004AGE or AGE 2004 (with a blank between AGE and 2004). Each name in the data set must be unique.

Name variables in a sequence when appropriate. For example, responses on a questionnaire might be named Q1, Q2, Q3, etc. The sequence of variables may then be referred to using a shortcut such as Q1-Q47 in statistical programs such as SAS.

Most programs also allow the inclusion of a descriptive label for each variable.

Label. Include a brief description of the variable that can be used as the variable label in the analysis program. For example, the label for AIS_SCORE might be "AIS based on the ICD-9-CM scoring."

Format. Specify the format to be used to enter data. For example, use a single-digit integer to indicate the presence or absence of a condition (0 or 1), a five-digit number including one decimal point to indicate weight, or a date in the format MM/DD/YYYY. Measurement data values should include a sufficient number of digits but not too many. For example, recording a person's weight to the second decimal place would be unnecessary in most cases, even if the weighing device reported the data to that number of decimal places. For measurement variables, specify units such as pounds, inches, or liters. The most commonly used data types are as follows:

Numeric variables are those for which mathematical calculations make sense, such as age, salary, or weight. These variables are measured numerically. A binary code (0, 1) representing the presence or absence of a condition is usually coded as numeric (even though the number is an identifier and has no real numeric meaning).

Text variables (also called string or character variables) are codes, descriptions, or nonmathematical numbers. For example, gender recorded as male and female or M and F is a text variable. Specifying an ID number such as "23432" as a text variable prevents it from being used to calculate a meaningless statistic, such as an average. If text variables are used in a data set, it is best to avoid using entries that contain a large number of characters. Data values such as "pulmonary embolus" and "loss of operative reduction/fixation" are lengthy to type into the spreadsheet and invite error. Use a code such as PE or LORF instead. If long descriptions must be used, it is best to use the list box selection criteria described later in this article to ensure consistency. Depending on the statistical program, there may be a downside to using text categorical variables. Some programs, SPSS in particular, will not allow the use of text variables as grouping variables in some analyses. Instead, SPSS requires that variables be classified as numbers (eg, 1, 2, and 3 for red, white, and blue).

Date variables can be used to represent dates and times of the day. Variables classified as dates warrant special attention. Make sure dates are defined with a four-digit year format to prevent any of the old "Y2K" problems from occurring. For example, if dates are entered using two-digit years in Excel and a date calculation is performed using its default date settings, dates ranging from 01/01/00 to 12/31/29 are considered in the years 2000 to 2029, whereas the date 01/01/30 (and after) is considered as being in the year 1930 (and after). Thus, if age is calculated on January 1, 2006, for a subject born on July 10, 1925, and entered as 07/10/25, Excel will treat the birth year as 2025 and calculate the age as -19. When dates using two-digit years are imported into a statistics program, a similar error may result. It is best to store dates as a single properly formatted date variable rather than storing month, day, and year as three separate variables. However, be aware that there are different ways of formatting a date. A typical US date format is MM/DD/YYYY, but some countries (and the military) use DD/MM/YYYY. If data are collected or recorded in another country, make sure the data entry procedure takes this into account.

Codes and ranges. Define the range of values for each variable. For example, AGE might be limited to values from 0 to 100. Categorical variables should be limited to a specific list of possible values. For example, 0 = no and 1 = yes or AA = African American, H = Hispanic, C = Caucasian, and O = other. In most statistical programs, formats can be defined for coded variables, so output will display the descriptions (male and female) rather than a cryptic code (0 and 1). If a data set contains numeric variables that are recorded with entries such as ">50" or "40-50," consider recoding this variable into a categorical variable. For example, this variable could be recoded using three coded values: 1 = "0-50," 2 = "51-100," and 3 = "greater than 100." It is not possible to calculate averages and related statistics on data that consist of a mixture of numbers and ranges.

Missing values. A missing value is a data element for which there is no available value. Missing data points can result from lost, never collected, or unknown information. There are several methods of handling missing values. If an Excel cell is left blank for a missing value and subsequently imported into SAS or SPSS, the blank value will be imported properly as a missing value. However, it is best to define an explicit missing value code as a confirmation that the data value has been
accounted for and has not been overlooked. For numeric variables, this code is typically an impossible value, such as -9 for age. Sometimes it is necessary to define more than one missing value code for a variable, such as -9 for "not available" and -8 for "not done." Missing values for a text variable might be defined, such as MISS or NA. For date values, use a blank to indicate a missing value or a dot (.) as a missing value code. Another option would be to use a date impossible for your study, such as 11/11/1111. If missing value codes are used in a data set, they must subsequently be defined within the statistics program in which the data will be analyzed.

DATA COLLECTION STRATEGIES TO ENSURE BETTER DATA ACCURACY

Data set design and collection strategies that lead to increased accuracy, reliability, and analyzability of a data set include the following:

Use open-ended questions with caution. Questions such as "List the medicines you are taking" or "What magazines do you read?" are open-ended questions. Free-form answers to these types of questions are difficult to analyze using statistical procedures (although such information may be useful to analyze in a more subjective way). If your desire is to collect data useful for statistical analysis, construct questions that require subjects to select answers from a checklist (which should include "unknown" or "other" as a category) or be prepared to classify answers to open-ended questions into categories for analysis.

Avoid unnecessary data collection. Collecting an excessive number of data elements (not pertinent to the research question) can lead to coding or "fatigue" errors. Collect only the data needed for the study.

Perform a pilot study. Gather a small amount of data and perform a preliminary analysis before collecting the full research data. Many design and data collection flaws can be found in this way. Information gleaned from a pilot study will often help in planning a more effective larger study. It is also helpful to have a knowledgeable and critical colleague look over the data collection forms or questionnaires before they are used in an actual study.

Develop a data audit procedure. Even though Excel is a convenient tool for data entry, it was not designed for data auditing. Therefore, the burden of adapting Excel (or any similar strategy) to the data entry process lies with the researcher. Good practice (and some governmental regulations) require that an audit trail be maintained for data changes in certain types of research (such as clinical trials). Professional data entry programs will automatically keep track of any records whose values are changed (including date, time, data entry person, etc), but there is no provision to do so in Excel. This means that when data are altered in any way, there must be a procedure in place to keep track of that change, such as with a written change form.

Addressing the issues related to the design of the study and the way in which data will be collected and recorded is an important step in increasing the accuracy of the data set. For practical guidelines for entering and verifying your data in Excel, see the Appendix to this article.

IMPORTING DATA

Once you have entered your data in Excel (or some other program), you must import that data into your statistical analysis program of choice before you can analyze it. Most statistics programs, such as SAS and SPSS, can import data files directly from Excel. If you have followed the guidelines in this article, the import will be straightforward, with few or no problems. However, you should always perform data checks once your data are imported to verify that a complete and accurate import occurred. In addition, you may want to add variable labels and define categorical codes once data have been imported.

CONCLUSION

If the data entered into your statistical program have errors, many analyses you perform will be wrong. To increase your chance of entering your data correctly into the computer, you must develop a data management strategy. This article described guidelines for creating such a strategy and provided information on how to use Microsoft Excel as your data entry tool. The guidelines described here, if followed, will help you create a cleaner, more accurate, and more appropriate data set that is well designed to answer research questions. The Appendix illustrates how these techniques can be implemented in Excel.

REFERENCES

1. English dictionary. Available at: [Link] us/meaning/wilf_hey.asp (accessed July 18, 2005).
2. Cryer J. Problems with using Microsoft Excel for statistics. In: Proceedings of the 2001 joint statistical meetings [CD-ROM]. Alexandria, VA: American Statistical Association; 2002.
3. Knüsel L. On the reliability of Microsoft Excel XP for statistical purposes. Comput Stat Data Anal 2002;39(1):109–10.
4. McCullough BD, Wilson B. On the accuracy of statistical procedures in Microsoft Excel 97. Comput Stat Data Anal 1999;31:27–37.
5. McFadden E. Management of data in clinical trials. New York: Wiley-Interscience; 2002.
6. Prokscha S. Practical guide to clinical data management. Denver (CO): Interpharm Press.
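
The data dictionary rules described earlier in this article (naming conventions, four-digit years, codes and ranges, and missing value codes) can also be checked programmatically after data entry. The following is only a minimal sketch in Python, not a method from this article: the dictionary contents, the function names, the eight-character cap, the letter-only first character, and the specific codes (-9, M/F/X) are illustrative assumptions based on the examples in the text.

```python
import re
from datetime import datetime

# Hypothetical data dictionary; entries are illustrative assumptions only.
DICTIONARY = {
    "ID":     {"type": "text"},
    "AGE":    {"type": "numeric", "range": (0, 100), "missing": {"-9"}},
    "GENDER": {"type": "text", "codes": {"M", "F"}, "missing": {"X"}},
    "B_DATE": {"type": "date", "format": "%m/%d/%Y", "missing": {""}},
}

# Conservative rule: a letter first, then letters, digits, or underscores,
# eight characters at most.
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,7}")


def valid_name(name):
    """Return True if the variable name follows the conservative rule."""
    return NAME_RULE.fullmatch(name) is not None


def excel_default_year(two_digit_year):
    """Expand a two-digit year using the default rule described in the text:
    00-29 become 2000-2029, 30-99 become 1930-1999."""
    if two_digit_year <= 29:
        return 2000 + two_digit_year
    return 1900 + two_digit_year


def check_value(name, raw):
    """Return 'ok', 'missing', or a short description of the problem."""
    rule = DICTIONARY[name]
    value = raw.strip()
    if value in rule.get("missing", set()):
        return "missing"
    if rule["type"] == "numeric":
        try:
            number = float(value)
        except ValueError:
            return "not a number"      # for example '>50' or '40-50'
        low, high = rule["range"]
        return "ok" if low <= number <= high else "out of range"
    if rule["type"] == "date":
        try:
            # %Y accepts only a four-digit year, so '07/10/25' is rejected.
            datetime.strptime(value, rule["format"])
        except ValueError:
            return "bad date"
        return "ok"
    # Text variables: the comparison is case sensitive, so 'm' is not 'M'.
    if "codes" in rule and value not in rule["codes"]:
        return "invalid code"
    return "ok"
```

For example, check_value("AGE", ">50") reports that the entry is not a number, and check_value("GENDER", "m") flags the lowercase entry that a spreadsheet list check might let through. A check of this kind is meant to complement, not replace, the entry-time validation described in the Appendix.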
APPENDIX
must be between 0 and 100 or -9 for missing." If a value outside the limits is entered into the cell, the dialog box shown in Figure 3 appears.

As mentioned earlier, the data entry person can override the warning by clicking "Yes" and then enter a value of -9. If the "Stop" option had been selected rather than "Warning," Excel would not allow the entry person to override the data range. A third option, "Information," displays a message when a value is entered out of range, but it does not prevent the entry of a number outside the specified range.

This example illustrates how to limit the entry of whole number values, but Excel also allows the specification of limits for decimals, dates, lists, and clock times. When one of these "Allow" criteria is selected in the "Data Validation" dialog box shown in Figure 2, the other entry options change to match the options allowed for that value type.

Limit data values to a list. This list could be a list of US state abbreviations, names of months, days of the week, gender, hospital names, diagnoses, and so on. To limit an entry to M and F for a gender variable (and X for missing), for example, use these steps:

1. Create the list of items in the same spreadsheet as the data. For example, in cell L2, place the value "M"; in cell L3, place the value "F"; and in cell L4, enter "X" (no quotation marks).
2. Select the range of cells to be marked for validation by highlighting a range of cells.
3. Select the menu option Data/Validation...
4. On the Settings tab, select List from the Allow: criteria option.
5. For the Source, select the range of cells containing the admissible values. In this case, enter "=$L$2:$L$4" (no quotation marks). The dollar signs in the specification force the reference to be absolute. This range must be in the active spreadsheet. It is best to allow several blank columns between the actual data values and this list because the list can interfere with importing the data later.
6. Create an input message by clicking on the Input Message tab in the Data Validation dialog box. Check the box titled "Show input message when cell is selected." In the Title textbox, enter "Select Gender," and in the Input Message textbox, enter "Select M=Male, F=Female or X=Missing."
7. Click on the Error Alert tab. Select Warning from the Style pull-down menu. For the Title, enter "Gender," and for the Error message, put "Only uppercase M and F allowed in this field, or X for missing."
8. Click OK.

Once these verification criteria are set up, when a cell within the specified range is selected, the messages "Select Gender" and "Select M=Male, F=Female or X=Missing" appear in a yellow information box. If a value other than M, F, or X is entered, the message "Only uppercase M and F allowed in this field or X for missing" appears in a warning dialog box. Clicking on the pull-down menu indicator (down arrow) in the box displays the list of defined values (M, F, or X).

Unfortunately, Excel will not match case, so it is possible for the entry person to enter a lowercase "m" instead of an "M." Keep this in mind when you import these data into your statistics program.

These simple verification checks take only a few minutes to set up in Excel. If there is more than a small data set to enter, or if multiple people will be entering data, these validations will prevent the entry of obviously incorrect data.

Make each row of data represent a single subject (usually). In most cases, data for a single subject or observation should be on a single row in the spreadsheet. A few analyses in SPSS and SAS, typically repeated measures models, expect data for a single subject in multiple rows. If you use multiple rows per subject, additional variable(s), such as visit number, date, or time, must be included so that each row is uniquely identified. Always use the single subject–single row option, unless the multirow format is required. If data are entered on a single row, the data can later be transformed from the multicolumn format to a multirow format within the statistics program if needed (or vice versa).

Data Entry Guidelines

Along with the techniques described above, here are other suggestions that can ensure a cleaner data set:

Freeze column headings so they will not scroll off the screen. When data are entered in Excel, it is easy for the column names to scroll off the screen. This makes it more likely that data will be entered in the wrong column. To prevent this, freeze the variable names so that they always remain at the top of the screen. To freeze the variable names, click on A2 in the Excel spreadsheet (variable names are in row 1) and select Window/Freeze Panes. In a similar way, if ID is in the first column of a data entry spreadsheet, freeze both variable names and the ID column of the spreadsheet by following these steps:
FIGURE 6 Sheet3 (Difference) spreadsheet.
FIGURE 7 Sheet3 (Difference) displaying actual differences.

corrections. Once all of the corrections have been made, the cells in the Difference spreadsheet should all be 0 (zero).

To make the difference more informative, use the more complicated Excel formula below (in a single line):

    =IF(EXACT(SHEET1!A2, SHEET2!A2), 0, SHEET1!A2&"/"&SHEET2!A2)

This formula produces the spreadsheet shown in Figure 7.

The Figure 7 version of the differences shows the actual data values from the two sheets displayed so that the differences are more readily visible. For example, the digits for AGE in cell C3 are reversed on the two sheets (43 versus 34). Notice in the date comparison in cell B5 that date codes (38396/38395) are displayed rather than actual dates. Because these numbers differ by 1, the dates on Spread1 and Spread2 are 1 day apart. The original spreadsheet contains the date as February 12, 2005, and the other spreadsheet contains it as February 13, 2005.

Once you have verified that the two spreadsheets are identical, you are ready to import your data in a statistics program. If you have followed the guidelines in this article, your data set should accurately reflect the data that were collected.
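
The same double-entry comparison can be reproduced outside Excel once both entries have been exported. The sketch below is a rough Python illustration under stated assumptions, not part of the original procedure: the function names and the sample rows are hypothetical, the string comparison only approximates EXACT, and the date conversion assumes the 1900 date system that Excel uses by default on Windows.

```python
from datetime import date, timedelta


def difference(a, b):
    """Approximate =IF(EXACT(a, b), 0, a & "/" & b) for one pair of cells."""
    return 0 if str(a) == str(b) else f"{a}/{b}"


def compare_sheets(sheet1, sheet2):
    """Compare two equally shaped lists of rows, cell by cell."""
    return [
        [difference(a, b) for a, b in zip(row1, row2)]
        for row1, row2 in zip(sheet1, sheet2)
    ]


def serial_to_date(serial):
    """Convert an Excel date code (1900 date system) to a calendar date.

    The 1899-12-30 offset is valid for date codes after February 1900.
    """
    return date(1899, 12, 30) + timedelta(days=serial)


# Hypothetical rows: ID, date code, and AGE as keyed by two entry persons.
first_entry = [["001", 38396, 43]]
second_entry = [["001", 38395, 34]]

differences = compare_sheets(first_entry, second_entry)
# differences == [[0, "38396/38395", "43/34"]]
```

Converting the two date codes with serial_to_date gives February 13, 2005 for 38396 and February 12, 2005 for 38395, which makes a one-day discrepancy easier to read than the raw codes.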