
Data Preparation Strategies in Excel

The document discusses guidelines for preparing data for analysis in Microsoft Excel. It describes defining variables, documenting them in a data dictionary, formatting values, ensuring data quality, and structuring the spreadsheet. Following these strategies can save researchers time and produce higher quality data for analysis.

Uploaded by

Ramanpreet Kaur

Downloaded from [Link] on August 18, 2016 - Published by [Link].

com

TOOLS AND ISSUES

Preparing Data for Analysis Using Microsoft Excel


Alan C. Elliott, Linda S. Hynan, Joan S. Reisch, Janet P. Smith

From the Department of Clinical Sciences (A.C.E., L.S.H., J.S.R., J.P.S.), Division of Biostatistics, UT Southwestern Medical Center, Dallas, Dallas, TX.
Address correspondence to: Alan C. Elliott, Department of Clinical Sciences, Division of Biostatistics, UT Southwestern Medical Center, Dallas, Dallas, TX 75390; tel: 214-648-2712; fax: 214-648-7673; e-mail: [Link]@[Link].
DOI 10.2310/6650.2006.05038
JOURNAL OF INVESTIGATIVE MEDICINE, volume 54, number 6, September 2006

ABSTRACT
A critical component essential to good research is the accurate and efficient collection and preparation of data for analysis. Most medical researchers have little or no training in data management, often causing not only excessive time spent cleaning data but also a risk that the data set contains collection or recording errors. The implementation of simple guidelines based on techniques used by professional data management teams will save researchers time and money and result in a data set better suited to answer research questions. Because Microsoft Excel is often used by researchers to collect data, specific techniques that can be implemented in Excel are presented.
Key Words: data collection, database management system, research design, pilot projects, informatics

During a presentation at an IBM technicians meeting in 1966, a programmer named Wilf Hey coined the phrase “Garbage In, Garbage Out.” Now abbreviated GIGO, this term has become a catchphrase for the too common situation in which inaccurate data entered into a computer are used to produce misleading or erroneous results.¹

Investigative teams, ranging from a few individuals to complex organizations at universities, governments, and corporations, are all involved in the planning, execution, and analysis of research. A critical component essential to each of these research projects is collecting data and entering results into a computer in preparation for statistical analysis. Often because of a lack of funding, experience, or both, these data are entered into the computer using an ad hoc process that results in poorly coded data, incorrectly formatted values, incomplete information, and typographical errors. It is not uncommon for researchers to discover that their data sets must be extensively “cleaned” before they can be properly analyzed.

Written for research teams who do not have the services of a professional data management team, this article provides guidance on how to develop a strategy to create a well-designed and verified data set. Following these guidelines will save researchers time and money during all phases of the research project and will result in data that can be used in a statistical software program with minimal modification.

All of the strategies in this article can (and should) be employed whether the data are entered into the computer using SAS (SAS Institute, Cary, NC), SPSS (SPSS Inc, Chicago, IL), Access (Microsoft Corporation, Redmond, WA), or any number of programs. To illustrate the guidelines in this article and appendix, we use the Microsoft Excel (Microsoft Corporation) spreadsheet program. Although Excel was not designed to be a research data entry tool, it is commonly used because almost every researcher already knows the basics of how to use it. This article does not address the use of Excel for data analysis because its limited data analysis capabilities and sometimes confusing output make it suitable only for preliminary analyses. As Jonathan Cryer put it, “Friends don’t let friends use Excel for statistics.”² Other articles have also discussed the problems associated with performing statistical analysis using Excel.³,⁴ Furthermore, Excel is limited to spreadsheets containing no more than 256 variables (columns) and 65,536 records (rows).

In well-funded studies, a professional data management team works with investigators from the planning stage through forms development, database design, and data collection and entry and in the preparation of data for analysis. The characteristics of a professionally designed data management process include a thorough description of the data variables, validation of data values as they are entered into the computer, and the use of a double-entry data process into a relational database. Such processes use specialized programs for data entry rather than Excel.⁵,⁶ Programs designed for professional data entry include the SPSS Data Entry Builder, Key Entry III from Southern Computer Systems (Birmingham, AL), SAS/AF, and Access. Some of these programs require the expertise of a programmer to create data entry screens, validation code, and data verification procedures. For smaller projects in which the use of a professional data management team is not



feasible and data entry is performed using Excel, the savvy investigator can still implement the “good practice” techniques described here.

SELECTION AND DESCRIPTION OF DATA ELEMENTS

Accurate data collection begins with planning. Before collecting any data, an investigator should define research questions and determine what measurements are needed to answer them. Typically, a research data set includes at least one outcome variable (dependent variable) and one or more predictor (independent) variables. Other demographic, covariate, or verification measures may also be recorded. It is essential that a unique key identifier be included for every observed subject (or entity) in a data set. Table 1 describes types of variables collected for data analysis. For each project, the researcher should use this information to verify that all of the variables needed to perform an analysis are included. It is disheartening to realize too late that a variable needed to complete an analysis was not collected.

TABLE 1 Variable Types Collected for Research

Variable Type            Description
ID number or code        An identification variable that uniquely identifies a subject or entity. This could be a patient identifier, a number assigned by the computer, or some other assigned code.
Demographic variables    These variables include measures such as age, gender, and ethnicity that describe the subject population.
Outcome variable(s)      (Dependent variable) This variable is the primary outcome measure. It can be discrete (eg, dead/alive, cured/not cured) or continuous (eg, time to recurrence, blood pressure, cost incurred).
Predictor variables      (Independent variables) These variables may include treatment (grouping) variables and other measured or observed information, such as weight, height, or smoking status.
Covariate measures       Covariate measures are variables that are related to the outcome variable and may be used to adjust the mean of the response variable and account for variability in a statistical model.
Verification measures    These variables may be used to verify the reliability of the data, such as a measure of how compliant a subject is in taking medicine or multiple laboratory values to assess the reliability of laboratory data.

An important part of a well-designed study is the documentation of each variable in a table called a data dictionary. An example data dictionary is shown in Table 2. This table can be created in a word processor or spreadsheet program and, once created, defines the characteristics of variables in both the data entry spreadsheet and statistical program.

A brief explanation of each item of the dictionary follows:

• Variable name. Select simple variable names using naming conventions compatible with the program that will be used for analysis. General naming conventions that work for most programs (such as SAS and SPSS) include the following:
  • Make variable names short and explanatory. Typical variable names are ID, GENDER, B_DATE, COST, GROUP, IQ_SCORE, and DEATH.

TABLE 2 Example Data Dictionary

Column  Variable Name  Label (Units)               Format             Codes and Ranges      Missing Values
A       SUBJECT        Subject ID number           Text (4)           1000–9999             Not allowed
B       VDATE          Date client visited clinic  Date (MM/DD/YYYY)  None                  . (dot) or 11/11/1111
C       AGE            Age at visit date           Numeric (3.0)      Range 0–100           -9
D       TEMP_F         Temperature (°F)            Numeric (4.1)      None                  -9
E       GENDER         Gender                      Text (1)           F = female, M = male  X
F       ARRIVE         Mode of arrival             String (4)         Car, Bus, Walk        MISS
G       ANTIBIO        Was antibiotic prescribed?  Numeric (1.0)      1 = yes, 0 = no       -9

MISS = missing.
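
A data dictionary such as the one in Table 2 can also be written in machine-readable form so that entered rows are checked against it. The Python sketch below is illustrative only. The `DATA_DICTIONARY` structure and the `check_row` helper are hypothetical names, not part of the article or of any statistics package, only four of the seven variables are encoded, and the missing value code used for AGE is an assumption.

```python
# Hypothetical, simplified encoding of part of the Table 2 data dictionary.
DATA_DICTIONARY = {
    "SUBJECT": {"kind": "range", "low": 1000, "high": 9999, "missing": None},
    "AGE": {"kind": "range", "low": 0, "high": 100, "missing": -9},  # assumed code
    "GENDER": {"kind": "codes", "codes": {"F", "M"}, "missing": "X"},
    "ARRIVE": {"kind": "codes", "codes": {"Car", "Bus", "Walk"}, "missing": "MISS"},
}


def check_row(row):
    """Return a list of problems found in one data row (a dict of values)."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        value = row.get(name)
        if value is None or value == "":
            problems.append(f"{name}: no value entered")
            continue
        # An explicit missing value code is an acceptable entry.
        if spec["missing"] is not None and str(value) == str(spec["missing"]):
            continue
        if spec["kind"] == "range":
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append(f"{name}: {value!r} is not a number")
                continue
            if not spec["low"] <= number <= spec["high"]:
                problems.append(
                    f"{name}: {value!r} outside {spec['low']}-{spec['high']}"
                )
        elif value not in spec["codes"]:
            problems.append(f"{name}: {value!r} is not an allowed code")
    return problems
```

Calling `check_row` on each entered row returns an empty list when every value is within its range, on its code list, or equal to its missing value code. Codes are compared case-sensitively, so a lowercase `m` for GENDER is reported as a problem.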

Preparing Data for Analysis Using Excel/ELLIOTT ET AL 335



  • Because some programs (such as older versions of SPSS) limit names to eight characters, it is best to comply with this restriction.
  • Begin variable names with a letter (A–Z). Some programs also allow variable names to begin with an underscore (_). Variable names may include numbers (not as a first character) but not blanks or other special characters (an underscore is valid). For example, AGE_2004 or AGE2004 is valid but not 2004AGE or AGE 2004 (with a blank between AGE and 2004). Each name in the data set must be unique.
  • Name variables in a sequence when appropriate. For example, responses on a questionnaire might be named Q1, Q2, Q3, etc. The sequence of variables may then be referred to using a shortcut such as Q1-Q47 in statistical programs such as SAS.
  • Most programs also allow the inclusion of a descriptive label for each variable.

• Label. Include a brief description of the variable that can be used as the variable label in the analysis program. For example, the label for AIS_SCORE might be “AIS based on the ICD-9-CM scoring.”

• Format. Specify the format to be used to enter data. For example, use a single-digit integer to indicate the presence or absence of a condition (0 or 1), a five-digit number including one decimal point to indicate weight, or a date in the format MM/DD/YYYY. Measurement data values should include a sufficient number of digits but not too many. For example, recording a person’s weight to the second decimal place would be unnecessary in most cases, even if the weighing device reported the data to that number of decimal places. For measurement variables, specify units such as pounds, inches, or liters. The most commonly used data types are as follows:
  • Numeric variables are those for which mathematical calculations make sense, such as age, salary, or weight. These variables are measured numerically. A binary code (0, 1) representing the presence or absence of a condition is usually coded as numeric (even though the number is an identifier and has no real numeric meaning).
  • Text variables (also called string or character variables) are codes, descriptions, or nonmathematical numbers. For example, gender recorded as male and female or M and F is a text variable. Specifying an ID number such as “23432” as a text variable prevents it from being used to calculate a meaningless statistic, such as an average. If text variables are used in a data set, it is best to avoid using entries that contain a large number of characters. Data values such as “pulmonary embolus” and “loss of operative reduction/fixation” are lengthy to type into the spreadsheet and invite error. Use a code such as PE or LORF instead. If long descriptions must be used, it is best to use the list box selection criteria described later in this article to ensure consistency. Depending on the statistical program, there may be a downside to using text categorical variables. Some programs, SPSS in particular, will not allow the use of text variables as grouping variables in some analyses. Instead, SPSS requires that variables be classified as numbers (eg, 1, 2, and 3 for red, white, and blue).
  • Date variables can be used to represent dates and times of the day. Variables classified as dates warrant special attention. Make sure dates are defined with a four-digit year format to prevent any of the old “Y2K” problems from occurring. For example, if dates are entered using two-digit years in Excel and a date calculation is performed using its default date settings, dates ranging from 01/01/00 to 12/31/29 are considered in the years 2000 to 2029, whereas the date 01/01/30 (and after) is considered as being in the year 1930 (and after). Thus, if age is calculated on January 1, 2006, for a subject born on July 10, 1925, and entered as 07/10/25, Excel will calculate the age as -19. When dates using two-digit years are imported into a statistics program, a similar error may result. It is best to store dates as a single properly formatted date variable rather than storing month, day, and year as three separate variables. However, be aware that there are different ways of formatting a date. A typical US date format is MM/DD/YYYY, but some countries (and the military) use DD/MM/YYYY. If data are collected or recorded in another country, make sure the data entry procedure takes this into account.

• Codes and ranges. Define the range of values for each variable. For example, AGE might be limited to values from 0 to 100. Categorical variables should be limited to a specific list of possible values. For example, 0 = no and 1 = yes or AA = African American, H = Hispanic, C = Caucasian, and O = other. In most statistical programs, formats can be defined for coded variables, so output will display the descriptions (male and female) rather than a cryptic code (0 and 1). If a data set contains numeric variables that are recorded with entries such as “>50” or “40-50,” consider recoding this variable into a categorical variable. For example, this variable could be recoded using three coded values: 1 = “0-50,” 2 = “51-100,” and 3 = “greater than 100.” It is not possible to calculate averages and related statistics on data that consist of a mixture of numbers and ranges.

• Missing values. A missing value is a data element for which there is no available value. Missing data points can result from lost, never collected, or unknown information. There are several methods of handling missing values. If an Excel cell is left blank for a missing value and subsequently imported into SAS or SPSS, the blank value will be imported properly as a missing value. However, it is best to define an explicit missing value code as a confirmation that the data value has been


accounted for and has not been overlooked. For numeric variables, this code is typically an impossible value, such as -9 for age. Sometimes it is necessary to define more than one missing value code for a variable, such as -9 for “not available” and -8 for “not done.” Missing values for a text variable might be defined, such as MISS or NA. For date values, use a blank to indicate a missing value or a dot (.) as a missing value code. Another option would be to use a date impossible for your study, such as 11/11/1111. If missing value codes are used in a data set, they must subsequently be defined within the statistics program in which the data will be analyzed.

DATA COLLECTION STRATEGIES TO ENSURE BETTER DATA ACCURACY

Data set design and collection strategies that lead to increased accuracy, reliability, and analyzability of a data set include the following:

• Use open-ended questions with caution. Questions such as “List the medicines you are taking” or “What magazines do you read?” are open-ended questions. Free-form answers to these types of questions are difficult to analyze using statistical procedures (although such information may be useful to analyze in a more subjective way). If your desire is to collect data useful for statistical analysis, construct questions that require subjects to select answers from a checklist (which should include “unknown” or “other” as a category) or be prepared to classify answers to open-ended questions into categories for analysis.
• Avoid unnecessary data collection. Collecting an excessive number of data elements (not pertinent to the research question) can lead to coding or “fatigue” errors. Collect only the data needed for the study.
• Perform a pilot study. Gather a small amount of data and perform a preliminary analysis before collecting the full research data. Many design and data collection flaws can be found in this way. Information gleaned from a pilot study will often help in planning a more effective larger study. It is also helpful to have a knowledgeable and critical colleague look over the data collection forms or questionnaires before they are used in an actual study.
• Develop a data audit procedure. Even though Excel is a convenient tool for data entry, it was not designed for data auditing. Therefore, the burden of adapting Excel (or any similar strategy) to the data entry process lies with the researcher. Good practice (and some governmental regulations) requires that an audit trail be maintained for data changes in certain types of research (such as clinical trials). Professional data entry programs will automatically keep track of any records whose values are changed (including date, time, data entry person, etc), but there is no provision to do so in Excel. This means that when data are altered in any way, there must be a procedure in place to keep track of that change, such as with a written change form.

Addressing the issues related to the design of the study and the way in which data will be collected and recorded is an important step in increasing the accuracy of the data set. For practical guidelines for entering and verifying your data in Excel, see the Appendix to this article.

IMPORTING DATA

Once you have entered your data in Excel (or some other program), you must import that data into your statistical analysis program of choice before you can analyze it. Most statistics programs, such as SAS and SPSS, can import data files directly from Excel. If you have followed the guidelines in this article, the import will be straightforward, with few or no problems. However, you should always perform data checks once your data are imported to verify that a complete and accurate import occurred. In addition, you may want to add variable labels and define categorical codes once data have been imported.

CONCLUSION

If the data entered into your statistical program have errors, many analyses you perform will be wrong. To increase your chance of entering your data correctly into the computer, you must develop a data management strategy. This article described guidelines for creating such a strategy and provided information on how to use Microsoft Excel as your data entry tool. The guidelines described here, if followed, will help you create a cleaner, more accurate, and more appropriate data set that is well designed to answer research questions. The Appendix illustrates how these techniques can be implemented in Excel.

REFERENCES

1. English dictionary. Available at: [Link] us/meaning/wilf_hey.asp (accessed July 18, 2005).
2. Cryer J. Problems with using Microsoft Excel for statistics. In: Proceedings of the 2001 joint statistical meetings [CD-ROM]. Alexandria, VA: American Statistical Association; 2002.
3. Knüsel L. On the reliability of Microsoft Excel XP for statistical purposes. Comput Stat Data Anal 2002;39(1):109–10.
4. McCullough BD, Wilson B. On the accuracy of statistical procedures in Microsoft Excel 97. Comput Stat Data Anal 1999;31:27–37.
5. McFadden E. Management of data in clinical trials. New York: Wiley-Interscience; 2002.
6. Prokscha S. Practical guide to clinical data management. Denver (CO): Interpharm Press.


APPENDIX

Implementing Data Management Techniques in Excel


This appendix illustrates the data management techniques
described in the article. Follow these examples in Microsoft
Excel to create a data management strategy that will
improve the accuracy and usefulness of your data.

Design Your Spreadsheet Using the Data Dictionary


Establishing the data dictionary is an essential first step in preparing data for entry into an Excel spreadsheet. With the dictionary in hand, following this list of guidelines will help design the spreadsheet for data entry:

• Place variable names in row 1. The first row of the data spreadsheet should contain only variable names. For example, Figure 1 shows a spreadsheet containing the seven variables specified in the previous data dictionary, one per column. These variable names are found in the data dictionary shown in Table 2. Case does not matter for variable names. “Subject” works as well as “SUBJECT.”

FIGURE 1 Variable names for a data set.

• Format columns to match the variable type. To help prevent inaccurate values from being entered into the data spreadsheet, format the column cells to match the prescribed data values for that column. For example, column “B” contains a date variable that should be in the form MM/DD/YYYY. To format the VDATE column in Excel,
  1. Highlight the cells that will contain date values by clicking the column header, B in this example.
  2. From the Excel menu, select Format/Cells/Date.
  3. Select the MM/DD/YYYY format (which appears on the list of formats in the form of the current date, such as 07/10/2005). This will cause data in these cells to appear in the specified date format.
  In a similar way, format the SUBJECT, GENDER, and ARRIVE columns as text. The TEMP_F column should be defined as a number with a single decimal place, and the ANTIBIO column should be a number defined with no decimal place because the data entered will be a 0, 1, or -9. The AGE column should be three digits with no decimal. Never format columns using the currency ($) or comma formats because that may cause problems when the data are imported into a statistics program.

• Specify a range of allowed values. Using the criteria in the data dictionary, specify a range of possible values allowed in a particular range of cells. The following steps can be used to specify that the format for the AGE variable in Excel column “C” can only contain values between 0 and 100 but will also allow you to override the check to enter a missing value code of -9:
  1. Select the range of cells to be marked for validation by highlighting a range of cells.
  2. Select the menu option Data/Validation...
  3. On the Settings tab option, select “Whole Number” from the “Allow:” criteria option.
  4. On the “Data:” option, leave it as “between” and enter 0 as the minimum and 100 as the maximum (Figure 2).
  5. Click on the Input Message tab. In the Title text box, enter “Age verification,” and in the Input Message text box enter “Age must be between 0 and 100 or -9 (override) for missing.”
  6. Click on the Error Alert tab. Select the yellow Warning icon in the Style pull-down box. In the Title text box, enter “Age out of range.” In the Error message box, enter “Enter an age between 0 and 100 or -9 (override) for missing.” Choosing the Warning icon rather than the Stop icon allows the data entry person to override the preset limits. This is recommended in this case because it is possible for a subject to be over the age of 100 and because you designated -9 as a missing value code.
  7. Click OK.

FIGURE 2 Creating data validation criteria.

Once defined in Excel, the verification criteria will check values as they are entered to make sure the entry does not violate the specification. When a cell in the age column is selected, a yellow box appears with the input message “Age


must be between 0 and 100 or -9 for missing.” If a value outside the limits is entered into the cell, the dialog box shown in Figure 3 appears.

FIGURE 3 Warning that an invalid number has been entered.

As mentioned earlier, the data entry person can override the warning by clicking “Yes” and then entering a value of -9. If the “Stop” option had been selected rather than “Warning,” Excel would not allow the entry person to override the data range. A third option, “Information,” displays a message when a value is entered out of range, but it does not prevent the entry of a number outside the specified range.

This example illustrates how to limit the entry of whole number values, but Excel also allows the specification of limits for decimals, dates, lists, and clock times. When one of these “Allow” criteria is selected in the “Data Validation” dialog box shown in Figure 2, the other entry options change to match the options allowed for that value type.

• Limit data values to a list. This list could be a list of US state abbreviations, names of months, days of the week, gender, hospital names, diagnoses, and so on. To limit an entry to M and F for a gender variable (and X for missing), for example, use these steps:
  1. Create the list of items in the same spreadsheet as the data. For example, in cell L2, place the value “M”; in cell L3, place the value “F”; and in cell L4, enter “X” (no quotation marks).
  2. Select the range of cells to be marked for validation by highlighting a range of cells.
  3. Select the menu option Data/Validation...
  4. On the Settings tab option, select List from the Allow: criteria option.
  5. For the Source, select the range of cells containing the admissible values. In this case, enter “=$L$2:$L$4” (no quotation marks). The dollar signs in the specification force the reference to be absolute. This range must be in the active spreadsheet. It is best to allow several blank columns between the actual data values and this list because the list can interfere with importing the data later.
  6. Create an input message by clicking on the Input Message tab in the Data Validation dialog box. Check the box titled “Show input message when cell is selected.” In the Title text box, enter “Select Gender,” and in the Input Message text box, enter “Select M=Male, F=Female or X=Missing.”
  7. Click on the Error Alert tab. Select Warning from the Style pull-down menu. For the Title, enter “Gender,” and for the Error message, put “Only uppercase M and F allowed in this field, or X for missing.”
  8. Click OK.

Once these verification criteria are set up, when a cell within the specified range is selected, the messages “Select Gender” and “Select M=Male, F=Female or X=Missing” appear in a yellow information box. If a value besides an M, F, or X is entered, the message “Only uppercase M and F allowed in this field or X for missing” appears in a warning dialog box. Clicking on the pull-down menu indicator (down arrow) in the box displays the list of defined values (M, F, or X).

Unfortunately, Excel will not match case, so it is possible for the entry person to enter a lowercase “m” instead of an “M.” Keep this in mind when you import these data into your statistics program.

These simple verification checks take only a few minutes to set up in Excel. If there is more than a small data set to enter, or if multiple people will be entering data, these validations will prevent the entry of obviously incorrect data.

• Make each row of data represent a single subject (usually). In most cases, data for a single subject or observation should be on a single row in the spreadsheet. A few analyses in SPSS and SAS, typically repeated measures models, expect data for a single subject in multiple rows. If you use multiple rows per subject, additional variable(s), such as visit number, date, or time, must be included so that each row is uniquely identified. Always use the single subject–single row option, unless the multirow format is required. If data are entered on a single row, they can later be transformed within the statistics program from the multicolumn format to a multirow format if needed (or vice versa).

Data Entry Guidelines

Along with the techniques described above, here are other suggestions that can ensure a cleaner data set:

• Freeze column headings so they will not scroll off the screen. When data are entered in Excel, it is easy for the column names to scroll off the screen. This makes it more likely that data will be entered in the wrong column. To prevent this, freeze the variable names to always remain at the top of the screen. To freeze the variable names, click on A2 in the Excel spreadsheet (variable names are in row 1) and select Window/Freeze Panes. In a similar way, if ID is in the first column of a data entry spreadsheet, freeze both variable names and the ID column of the spreadsheet by following these steps:
  1. Click on B2 in the data entry spreadsheet.
  2. Select Window/Freeze Panes.


3. The variable names and ID column remain on the


screen even when scrolled.
4. To Unfreeze the panes, click Windows/Unfreeze
Panes.
• Enter string variables in a consistent case. String (text/categorical) variables should always be entered in the same case. When entering a letter-coded gender variable, consistently use either M and F or m and f. If cases are mixed (both "M" and "m"), the statistics program may see these letters as two different data values. When a comparison by gender is performed, the program then finds four gender categories (F, f, M, and m). Forcing the data to match a list, as described above, is one way to prevent the mismatched case problem.
• Do not leave any blank rows in the spreadsheet. Blank rows are sometimes imported incorrectly into the statistics program (depending on the program) and may complicate an analysis.
• Do not include unessential text or fancy formatting in the spreadsheet. Extra text, colors, unusual fonts, separator lines, and other formatting options that are not meant to be imported into the statistics program may cause problems. Keep the data entry spreadsheet straightforward and simple.
• Get rid of formulas. If the data spreadsheet in Excel contains formulas, there may be unexpected errors when the data are imported into a statistics program. A technique to get rid of all formulas is to copy the entire data spreadsheet (leaving out any cells containing lists used in list entries), go to a new blank sheet, and select Edit, Paste Special. From the Paste Special dialog box, select the "Values" option and OK. This will paste the data into the new spreadsheet with all formulas removed. Save this new spreadsheet (under a new name) and use it to import the data into the statistics program.
• Sort data with caution. It is often helpful to sort a data set in Excel to put it into some order, such as patient number. However, be cautious when sorting data in Excel because it is easy to sort a single column while leaving the other columns intact, thus ruining the integrity of the data. Data sets should be saved before performing any sorting. To sort correctly, highlight all of the columns containing data and click Data/Sort. Follow the prompts to select which columns to use as sorting variables.

After designing a data entry spreadsheet using the guidelines above, the data entry should not only be more accurate but will also result in a data set that can be imported seamlessly into a statistical program, avoiding much of the time-consuming data manipulation and cleaning that must otherwise take place before data can be analyzed.

Verify Data Using Double Data Entry in Excel

The gold standard for professional data entry is to enter data not once but twice. The two data sets are then compared, differences are examined, and corrections are made. To use this double data entry method, create two identical blank data entry spreadsheets. The data should then be entered into the spreadsheets by two different people. If it is impossible to use two different people, at least enter the data at two different sessions. Once the data are entered, compare the two spreadsheets for differences in Excel using the following technique. Figure 4 shows the first spreadsheet to compare (Sheet1), and Figure 5 shows the second spreadsheet (Sheet2).

FIGURE 4 First spreadsheet to compare (Sheet1).

FIGURE 5 Second spreadsheet to compare (Sheet2).

If the two spreadsheets containing the entered data are not in the same worksheet file, copy the second spreadsheet and paste it into Sheet2 of the original worksheet. Note that these spreadsheets must have the data in the same order and in identical cells. To compare these two spreadsheets, follow these steps:

1. In the Sheet1 spreadsheet, select Insert/Worksheet to insert a third worksheet (Sheet3). Copy the labels (row 1) from the Sheet1 worksheet to the Sheet3 (Difference) worksheet.
2. In Sheet3, place the cursor in cell A2 and enter the following Excel formula:

   =IF(EXACT(Sheet1!A2, Sheet2!A2), 0, 1)

3. Copy this formula to all cells from A2 to G5 (the range of cells to compare). One method of copying this formula in Excel is to place the cursor in cell A2 and press CTRL-C (Copy). Then highlight the cells from A2 to G5 and press CTRL-V (Paste). This copies the formula to all of the specified cells. The Difference spreadsheet (Sheet3) looks like the one illustrated in Figure 6.
4. Notice the cells in the Difference spreadsheet. Cells containing a 1 indicate that the values of the two spreadsheets in that cell do not match.
5. When doing this comparison on the data set, examine the cells that are different (marked as 1) and make

corrections. Once all of the corrections have been made, the cells in the Difference spreadsheet should all be 0 (zero).

FIGURE 6 Sheet3 (Difference) spreadsheet.

To make the difference more informative, use the more complicated Excel formula below (in a single line):

   =IF(EXACT(Sheet1!A2, Sheet2!A2), 0, Sheet1!A2&"/"&Sheet2!A2)

This formula produces the spreadsheet shown in Figure 7.

FIGURE 7 Sheet3 (Difference) displaying actual differences.

The Figure 7 version of the differences shows the actual data values from the two sheets displayed so that the differences are more readily visible. For example, the digits for AGE in cell C3 are reversed on the two sheets (43 versus 34). Notice in the date comparison in cell B5 that date codes (38396/38395) are displayed rather than actual dates. Because these numbers differ by 1, the dates on Sheet1 and Sheet2 are 1 day apart. The original spreadsheet contains the date as February 12, 2005, and the other spreadsheet contains it as February 13, 2005.

Once you have verified that the two spreadsheets are identical, you are ready to import your data into a statistics program. If you have followed the guidelines in this article, your data set should accurately reflect the data that were collected.
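
For readers who keep copies of both data entry sheets outside Excel, the cell-by-cell comparison described above can also be sketched in Python. This is only an illustration under stated assumptions: the function name difference_grid, its show_values parameter, and the sample rows are hypothetical and not part of the article, and each sheet is assumed to be already loaded as a list of rows with the header row removed. Values are compared as text, which is case sensitive like EXACT, so 43 and 43.0 would be reported as a difference.

```python
def difference_grid(sheet1, sheet2, show_values=False):
    """Compare two equally shaped grids of cell values, cell by cell.

    A matching cell yields 0. A mismatch yields 1, or the string
    "value1/value2" when show_values is True (the more informative variant).
    """
    if len(sheet1) != len(sheet2):
        raise ValueError("sheets must have the same number of rows")
    grid = []
    for row1, row2 in zip(sheet1, sheet2):
        if len(row1) != len(row2):
            raise ValueError("rows must have the same number of cells")
        out = []
        for a, b in zip(row1, row2):
            if str(a) == str(b):  # text comparison, case sensitive
                out.append(0)
            else:
                out.append(f"{a}/{b}" if show_values else 1)
        grid.append(out)
    return grid


# Hypothetical data: ID, AGE, GENDER as typed in two separate sessions.
sheet1 = [[1001, 43, "M"], [1002, 27, "F"]]
sheet2 = [[1001, 34, "M"], [1002, 27, "f"]]

flags = difference_grid(sheet1, sheet2)
details = difference_grid(sheet1, sheet2, show_values=True)
print(flags)    # [[0, 1, 0], [0, 0, 1]]
print(details)  # [[0, '43/34', 0], [0, 0, 'F/f']]
```

As in the spreadsheet version, both grids must hold the same records in the same order; the sketch raises an error when the shapes differ rather than guessing how rows line up.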

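The date codes in the Figure 7 discussion are Excel serial day numbers. As a hedged sketch, assuming the workbook uses Excel's default 1900 date system (not the 1904 system), such a code can be turned back into a calendar date as follows. The function name excel_serial_to_date is hypothetical, and the conversion is only valid for serials of 61 or more (1 March 1900 onward), because Excel's 1900 system counts a nonexistent 29 February 1900.

```python
from datetime import date, timedelta

# Offset commonly used for Excel's 1900 date system (an assumption about
# the workbook); it absorbs Excel's fictitious 29 February 1900.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial day number to a calendar date."""
    if serial < 61:
        raise ValueError("serials below 61 are not handled by this sketch")
    # int() drops any fractional part, which Excel uses for the time of day.
    return EXCEL_EPOCH + timedelta(days=int(serial))


first = excel_serial_to_date(38395)
second = excel_serial_to_date(38396)
print(first.isoformat(), second.isoformat())  # 2005-02-12 2005-02-13
print((second - first).days)                  # 1
```

Decoding both codes shows directly that the two entries are one day apart, which is easier to check against the source forms than the raw numbers.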
Preparing Data for Analysis Using Microsoft Excel
Alan C. Elliott, Linda S. Hynan, Joan S. Reisch and Janet P. Smith
J Investig Med 2006 54: 334-341
doi: 10.2310/6650.2006.05038
