README.md: 40 additions & 39 deletions
@@ -8,7 +8,7 @@ This is a guide for anyone who needs to share data with a statistician. The targ
8
8
* Junior statistics students whose job it is to collate/clean data sets
9
9
10
10
The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls
11
-
and sources of delay in the transition from data collection to data analysis. The Leek group works with a large
11
+
and sources of delay in the transition from data collection to data analysis. The [Leek group](http://biostat.jhsph.edu/~jleek/) works with a large
12
12
number of collaborators and the number one source of variation in the speed to results is the status of the data
13
13
when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.
14
14
@@ -38,37 +38,37 @@ Let's look at each part of the data package you will transfer.
38
38
It is critical that you include the rawest form of the data that you have access to. Here are some examples of the
39
39
raw form of data:
40
40
41
-
* The strange binary file your measurement machine spits out
42
-
* The unformated Excel file with 10 worksheets the company you contracted with sent you
43
-
* The complicated JSON data you got from scraping the Twitter API
41
+
* The strange [binary file](http://en.wikipedia.org/wiki/Binary_file) your measurement machine spits out
42
+
* The unformatted Excel file with 10 worksheets the company you contracted with sent you
43
+
* The complicated [JSON](http://en.wikipedia.org/wiki/JSON) data you got from scraping the [Twitter API](https://twitter.com/twitterapi)
44
44
* The hand-entered numbers you collected looking through a microscope
45
45
46
46
You know the raw data is in the right format if you:
47
47
48
48
1. Ran no software on the data
49
-
2. Did not manipulate any of the numbers in the data
50
-
3. You did not remove any data from the data set
51
-
4. You did not summarize the data in any way
49
+
1. Did not manipulate any of the numbers in the data
50
+
1. Did not remove any data from the data set
51
+
1. Did not summarize the data in any way
52
52
53
53
If you did any manipulation of the data at all it is not the raw form of the data. Reporting manipulated data
54
54
as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a
55
55
forensic study of your data to figure out why the raw data looks weird.
56
56
57
57
### The tidy data set
58
58
59
-
The general principles of tidy data are laid out by Hadley Wickham in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
60
-
and [this video](http://vimeo.com/33727555). The paper and the video are both focused on the R package, which you
59
+
The general principles of tidy data are laid out by [Hadley Wickham](http://had.co.nz/) in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
60
+
and [this video](http://vimeo.com/33727555). The paper and the video are both focused on the [R](http://www.r-project.org/) package, which you
61
61
may or may not know how to use. Regardless, the four general principles you should pay attention to are:
62
62
63
63
1. Each variable you measure should be in one column
64
-
2. Each different observation of that variable should be in a different row
65
-
3. There should be one table for each "kind" of variable
66
-
4. If you have multiple tables, they should include a column in the table that allows them to be linked
64
+
1. Each different observation of that variable should be in a different row
65
+
1. There should be one table for each "kind" of variable
66
+
1. If you have multiple tables, they should include a column in the table that allows them to be linked
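
As a rough illustration of the principles above, here is a minimal Python sketch using only the standard library. The table contents and the column names (`PatientId`, `GeneA`, `GeneB`) are made up for this example; the point is only that each "kind" of variable gets its own table and that a shared column lets the tables be linked.

```python
import csv
import io

# Hypothetical example data: one table per "kind" of variable,
# linked by a shared PatientId column.
clinical_csv = """PatientId,AgeAtDiagnosis,Sex
P01,54,female
P02,61,male
"""

expression_csv = """PatientId,GeneA,GeneB
P01,12,0
P02,7,3
"""

def read_table(text):
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

clinical = read_table(clinical_csv)
expression = read_table(expression_csv)

# Index the clinical table by the linking column, then join row by row.
clinical_by_id = {row["PatientId"]: row for row in clinical}
linked = [{**clinical_by_id[row["PatientId"]], **row} for row in expression]

for row in linked:
    print(row)
```

Because every variable is a column and every observation is a row, the analyst can link the two tables with a simple lookup on the shared column.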
67
67
68
68
While these are the hard and fast rules, there are a number of other things that will make your data set much easier
69
69
to handle. First is to include a row at the top of each data table/spreadsheet that contains full column names.
70
-
So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead
71
-
of something like ADx or another abreviation that may be hard for another person to understand.
70
+
So if you measured age at diagnosis for patients, you would head that column with the name `AgeAtDiagnosis` instead
71
+
of something like `ADx` or another abbreviation that may be hard for another person to understand.
72
72
73
73
74
74
Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with
@@ -80,8 +80,9 @@ is summarized at the level of the number of counts per exon. Suppose you have 10
80
80
table/spreadsheet that had 21 rows (a row for gene names, and one row for each patient) and 100,001 columns (one column for patient
81
81
ids and one column for each data type).
82
82
83
-
If you are sharing your data with the collaborator in Excel the tidy data should be in one Excel file per table. They
83
+
If you are sharing your data with the collaborator in Excel, the tidy data should be in one Excel file per table. They
84
84
should not have multiple worksheets, no macros should be applied to the data, and no columns/cells should be highlighted.
85
+
Alternatively, share the data in a [CSV](http://en.wikipedia.org/wiki/Comma-separated_values) or [TAB-delimited](http://en.wikipedia.org/wiki/Tab-separated_values) text file.
85
86
86
87
87
88
### The code book
@@ -90,8 +91,8 @@ For almost any data set, the measurements you calculate will need to be describe
90
91
into the spreadsheet. The code book contains this information. At minimum it should contain:
91
92
92
93
1. Information about the variables (including units!) in the data set not contained in the tidy data
93
-
2. Information about the summary choices you made
94
-
3. Information about the experimental study design you used
94
+
1. Information about the summary choices you made
95
+
1. Information about the experimental study design you used
95
96
96
97
In our genomics example, the analyst would want to know what the unit of measurement for each
97
98
clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They
@@ -100,30 +101,30 @@ would also want to know any other information about how you did the data collect
100
101
are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic
101
102
like age? Are they randomized to treatments?
102
103
103
-
A common format for this document is a word file. There should be a section called "Study design" that has a thorugh
104
+
A common format for this document is a Word file. There should be a section called "Study design" that has a thorough
104
105
description of how you collected the data. There is a section called "Code book" that describes each variable and its
105
106
units.
106
107
107
108
### How to code variables
108
109
109
-
When you put variables into a spreadsheet there are several main categories you will run into:
110
+
When you put variables into a spreadsheet there are several main categories you will run into depending on their [data type](http://en.wikipedia.org/wiki/Statistical_data_type):
110
111
111
112
1. Continuous
112
-
2. Ordinal
113
-
3. Categorical
114
-
4. Misssing
115
-
5. Censored
113
+
1. Ordinal
114
+
1. Categorical
115
+
1. Missing
116
+
1. Censored
116
117
117
118
Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example
118
-
would be something like weight measured in kg. Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered.
119
-
This could be for example survey responses where the choices are: poor, fair, good. Categorical data are data where there
120
-
are multiple categories, but they aren't ordered. One example would be sex: male or female. Missing data are data
121
-
that are missing and you don't know the mechanism. You should code missing values as NA. Censored data are data
119
+
would be something like weight measured in kg. [Ordinal data](http://en.wikipedia.org/wiki/Ordinal_data) are data that have a fixed, small (< 100) number of levels but are ordered.
120
+
This could be for example survey responses where the choices are: poor, fair, good. [Categorical data](http://en.wikipedia.org/wiki/Categorical_variable) are data where there
121
+
are multiple categories, but they aren't ordered. One example would be sex: male or female. [Missing data](http://en.wikipedia.org/wiki/Missing_data) are data
122
+
that are missing and you don't know the mechanism. You should code missing values as `NA`. [Censored data](http://en.wikipedia.org/wiki/Censoring_(statistics)) are data
122
123
where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit
123
-
ora patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should
124
-
also add a new column to your tidy data called, "VariableNameCensored" which should have values of TRUE if censored
125
-
and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report
126
-
to the analyst if there is a reason you know about that some of the data are missing. You should also not impute/make up/
124
+
or a patient being lost to follow-up. They should also be coded as `NA` when you don't have the data. But you should
125
+
also add a new column to your tidy data called "VariableNameCensored", which should have values of `TRUE` if censored
126
+
and `FALSE` if not. In the code book you should explain why those values are missing. It is absolutely critical to report
127
+
to the analyst if there is a reason you know about that some of the data are missing. You should also not [impute](http://en.wikipedia.org/wiki/Imputation_(statistics))/make up/
127
128
throw away missing observations.
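
The coding rule for missing and censored values can be sketched in Python as follows. This is only an illustration under assumed names: the variable `Biomarker`, the patient ids, and the helper `to_tidy_rows` are hypothetical, and the companion column follows the "VariableNameCensored" pattern described above.

```python
import csv
import io

# Hypothetical measurements: (PatientId, Biomarker value or None, censored?)
measurements = [
    ("P01", 4.2, False),
    ("P02", None, True),   # below the detection limit: censored
    ("P03", None, False),  # missing, mechanism unknown
]

def to_tidy_rows(records):
    """Code absent values as NA and flag censoring in a separate column."""
    rows = []
    for patient_id, value, censored in records:
        rows.append({
            "PatientId": patient_id,
            "Biomarker": "NA" if value is None else str(value),
            "BiomarkerCensored": "TRUE" if censored else "FALSE",
        })
    return rows

tidy_rows = to_tidy_rows(measurements)

# Write the tidy table as CSV text so it can be shared as a plain file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["PatientId", "Biomarker", "BiomarkerCensored"]
)
writer.writeheader()
writer.writerows(tidy_rows)
print(buffer.getvalue())
```

Note that nothing is imputed or dropped: both the censored and the plainly missing value stay in the table as `NA`, and only the extra column tells them apart. The reason for each should still go in the code book.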
128
129
129
130
In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy
@@ -138,17 +139,17 @@ That means, when you submit your paper, the reviewers and the rest of the world
138
139
the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform
139
140
some summarization/data analysis steps before the data can be considered tidy.
140
141
141
-
The ideal thing for you to do when performing summarization is to create a computer script (in R, Python, or something else)
142
+
The ideal thing for you to do when performing summarization is to create a computer script (in `R`, `Python`, or something else)
142
143
that takes the raw data as input and produces the tidy data you are sharing as output. You can try running your script
143
144
a couple of times and see if the code produces the same output.
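
A minimal sketch of such a script in Python might look like the following. The raw data, the column names, and the choice of a per-sample mean as the summary are all assumptions made for illustration; your own summarization step will differ. The sketch runs the same function twice and compares a hash of the output to check that the result is reproducible.

```python
import csv
import hashlib
import io

# Hypothetical raw input: one reading per line, several readings per sample.
RAW_DATA = """sample,reading
S1,1.0
S1,3.0
S2,2.0
"""

def tidy_from_raw(raw_text):
    """Summarize raw readings into one row per sample (mean reading)."""
    readings = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        readings.setdefault(row["sample"], []).append(float(row["reading"]))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sample", "MeanReading"])
    for sample in sorted(readings):  # sorted so the row order is deterministic
        values = readings[sample]
        writer.writerow([sample, sum(values) / len(values)])
    return out.getvalue()

def digest(text):
    """Return a SHA-256 fingerprint of the output for easy comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

first_run = tidy_from_raw(RAW_DATA)
second_run = tidy_from_raw(RAW_DATA)
print(digest(first_run) == digest(second_run))
```

Sharing the script together with the raw data lets the analyst regenerate the tidy data and confirm that they get the same fingerprint.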
144
145
145
146
In many cases, the person who collected the data has incentive to make it tidy for a statistician to speed the process
146
147
of collaboration. They may not know how to code in a scripting language. In that case, what you should provide the statistician
147
-
is something called psuedocode. It should look something like:
148
+
is something called [pseudocode](http://en.wikipedia.org/wiki/Pseudocode). It should look something like:
148
149
149
150
1. Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
150
-
2. Step 2 - run the software separatly for each sample
151
-
3. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set
151
+
1. Step 2 - run the software separately for each sample
152
+
1. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set
152
153
153
154
You should also include information about which system (Mac/Windows/Linux) you used the software on and whether you
154
155
tried it more than once to confirm it gave the same results. Ideally, you will run this by a fellow student/labmate
@@ -168,8 +169,8 @@ checks.
168
169
You should then expect from the statistician:
169
170
170
171
1. An analysis script that performs each of the analyses (not just instructions)
171
-
2. The exact computer code they used to run the analysis
172
-
3. All output files/figures they generated.
172
+
1. The exact computer code they used to run the analysis
173
+
1. All output files/figures they generated.
173
174
174
175
This is the information you will use in the supplement to establish reproducibility and precision of your results. Each
175
176
of the steps in the analysis should be clearly explained and you should ask questions when you don't understand
@@ -181,7 +182,7 @@ to explain why the statistician performed each step to a labmate/your principal
181
182
Contributors
182
183
====================
183
184
184
-
[Jeff Leek](http://biostat.jhsph.edu/~jleek/) - Wrote the initial version.
185
-
185
+
* [Jeff Leek](http://biostat.jhsph.edu/~jleek/) - Wrote the initial version.