Commit a3961cc

Fixed typos, added links
1 parent 041e2a4 commit a3961cc

File tree: 1 file changed (+40, -39 lines)

README.md

Lines changed: 40 additions & 39 deletions
@@ -8,7 +8,7 @@ This is a guide for anyone who needs to share data with a statistician. The targ
* Junior statistics students whose job it is to collate/clean data sets

The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls
-and sources of delay in the transition from data collection to data analysis. The Leek group works with a large
+and sources of delay in the transition from data collection to data analysis. The [Leek group](http://biostat.jhsph.edu/~jleek/) works with a large
number of collaborators and the number one source of variation in the speed to results is the status of the data
when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.

@@ -38,37 +38,37 @@ Let's look at each part of the data package you will transfer.
It is critical that you include the rawest form of the data that you have access to. Here are some examples of the
raw form of data:

-* The strange binary file your measurement machine spits out
-* The unformated Excel file with 10 worksheets the company you contracted with sent you
-* The complicated JSON data you got from scraping the Twitter API
+* The strange [binary file](http://en.wikipedia.org/wiki/Binary_file) your measurement machine spits out
+* The unformatted Excel file with 10 worksheets the company you contracted with sent you
+* The complicated [JSON](http://en.wikipedia.org/wiki/JSON) data you got from scraping the [Twitter API](https://twitter.com/twitterapi)
* The hand-entered numbers you collected looking through a microscope

You know the raw data is in the right format if you:

1. Ran no software on the data
-2. Did not manipulate any of the numbers in the data
-3. You did not remove any data from the data set
-4. You did not summarize the data in any way
+1. Did not manipulate any of the numbers in the data
+1. You did not remove any data from the data set
+1. You did not summarize the data in any way

If you did any manipulation of the data at all it is not the raw form of the data. Reporting manipulated data
as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a
forensic study of your data to figure out why the raw data looks weird.

### The tidy data set

-The general principles of tidy data are laid out by Hadley Wickham in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
-and [this video](http://vimeo.com/33727555). The paper and the video are both focused on the R package, which you
+The general principles of tidy data are laid out by [Hadley Wickham](http://had.co.nz/) in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
+and [this video](http://vimeo.com/33727555). The paper and the video are both focused on the [R](http://www.r-project.org/) package, which you
may or may not know how to use. Regardless the three general principles you should pay attention to are:

1. Each variable you measure should be in one column
-2. Each different observation of that variable should be in a different row
-3. There should be one table for each "kind" of variable
-4. If you have multiple tables, they should include a column in the table that allows them to be linked
+1. Each different observation of that variable should be in a different row
+1. There should be one table for each "kind" of variable
+1. If you have multiple tables, they should include a column in the table that allows them to be linked

While these are the hard and fast rules, there are a number of other things that will make your data set much easier
to handle. First is to include a row at the top of each data table/spreadsheet that contains full row names.
-So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead
-of something like ADx or another abreviation that may be hard for another person to understand.
+So if you measured age at diagnosis for patients, you would head that column with the name `AgeAtDiagnosis` instead
+of something like `ADx` or another abbreviation that may be hard for another person to understand.


Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with
@@ -80,8 +80,9 @@ is summarized at the level of the number of counts per exon. Suppose you have 10
table/spreadsheet that had 21 rows (a row for gene names, and one row for each patient) and 100,001 columns (one row for patient
ids and one row for each data type).

-If you are sharing your data with the collaborator in Excel the tidy data should be in one Excel file per table. They
+If you are sharing your data with the collaborator in Excel, the tidy data should be in one Excel file per table. They
should not have multiple worksheets, no macros should be applied to the data, and no columns/cells should be highlighted.
+Alternatively share the data in a [CSV](http://en.wikipedia.org/wiki/Comma-separated_values) or [TAB-delimited](http://en.wikipedia.org/wiki/Tab-separated_values) text file.


### The code book
@@ -90,8 +91,8 @@ For almost any data set, the measurements you calculate will need to be describe
into the spreadsheet. The code book contains this information. At minimum it should contain:

1. Information about the variables (including units!) in the data set not contained in the tidy data
-2. Information about the summary choices you made
-3. Information about the experimental study design you used
+1. Information about the summary choices you made
+1. Information about the experimental study design you used

In our genomics example, the analyst would want to know what the unit of measurement for each
clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They
@@ -100,30 +101,30 @@ would also want to know any other information about how you did the data collect
are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic
like age? Are they randomized to treatments?

-A common format for this document is a word file. There should be a section called "Study design" that has a thorugh
+A common format for this document is a Word file. There should be a section called "Study design" that has a thorough
description of how you collected the data. There is a section called "Code book" that describes each variable and its
units.

### How to code variables

-When you put variables into a spreadsheet there are several main categories you will run into:
+When you put variables into a spreadsheet there are several main categories you will run into depending on their [data type](http://en.wikipedia.org/wiki/Statistical_data_type):

1. Continuous
-2. Ordinal
-3. Categorical
-4. Misssing
-5. Censored
+1. Ordinal
+1. Categorical
+1. Missing
+1. Censored

Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example
-would be something like weight measured in kg. Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered.
-This could be for example survey responses where the choices are: poor, fair, good. Categorical data are data where there
-are multiple categories, but they aren't ordered. One example would be sex: male or female. Missing data are data
-that are missing and you don't know the mechanism. You should code missing values as NA. Censored data are data
+would be something like weight measured in kg. [Ordinal data](http://en.wikipedia.org/wiki/Ordinal_data) are data that have a fixed, small (< 100) number of levels but are ordered.
+This could be for example survey responses where the choices are: poor, fair, good. [Categorical data](http://en.wikipedia.org/wiki/Categorical_variable) are data where there
+are multiple categories, but they aren't ordered. One example would be sex: male or female. [Missing data](http://en.wikipedia.org/wiki/Missing_data) are data
+that are missing and you don't know the mechanism. You should code missing values as `NA`. [Censored data](http://en.wikipedia.org/wiki/Censoring_(statistics)) are data
where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit
-ora patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should
-also add a new column to your tidy data called, "VariableNameCensored" which should have values of TRUE if censored
-and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report
-to the analyst if there is a reason you know about that some of the data are missing. You should also not impute/make up/
+or a patient being lost to follow-up. They should also be coded as `NA` when you don't have the data. But you should
+also add a new column to your tidy data called, "VariableNameCensored" which should have values of `TRUE` if censored
+and `FALSE` if not. In the code book you should explain why those values are missing. It is absolutely critical to report
+to the analyst if there is a reason you know about that some of the data are missing. You should also not [impute](http://en.wikipedia.org/wiki/Imputation_(statistics))/make up/
throw away missing observations.

In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy
@@ -138,17 +139,17 @@ That means, when you submit your paper, the reviewers and the rest of the world
the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform
some summarization/data analysis steps before the data can be considered tidy.

-The ideal thing for you to do when performing summarization is to create a computer script (in R, Python, or something else)
+The ideal thing for you to do when performing summarization is to create a computer script (in `R`, `Python`, or something else)
that takes the raw data as input and produces the tidy data you are sharing as output. You can try running your script
a couple of times and see if the code produces the same output.

In many cases, the person who collected the data has incentive to make it tidy for a statistician to speed the process
of collaboration. They may not know how to code in a scripting language. In that case, what you should provide the statistician
-is something called psuedocode. It should look something like:
+is something called [pseudocode](http://en.wikipedia.org/wiki/Pseudocode). It should look something like:

1. Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
-2. Step 2 - run the software separatly for each sample
-3. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set
+1. Step 2 - run the software separately for each sample
+1. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set

You should also include information about which system (Mac/Windows/Linux) you used the software on and whether you
tried it more than once to confirm it gave the same results. Ideally, you will run this by a fellow student/labmate
@@ -168,8 +169,8 @@ checks.
You should then expect from the statistician:

1. An analysis script that performs each of the analyses (not just instructions)
-2. The exact computer code they used to run the analysis
-3. All output files/figures they generated.
+1. The exact computer code they used to run the analysis
+1. All output files/figures they generated.

This is the information you will use in the supplement to establish reproducibility and precision of your results. Each
of the steps in the analysis should be clearly explained and you should ask questions when you don't understand
@@ -181,7 +182,7 @@ to explain why the statistician performed each step to a labmate/your principal
Contributors
====================

-[Jeff Leek](http://biostat.jhsph.edu/~jleek/) - Wrote the initial version.
-
+* [Jeff Leek](http://biostat.jhsph.edu/~jleek/) - Wrote the initial version.
+* [L. Collado-Torres](http://bit.ly/LColladoTorres) - Fixed typos, added links.


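To make the descriptive column naming, `NA` coding, and `VariableNameCensored` conventions described above concrete, here is a minimal sketch in R of the kind of flat, tidy table being asked for. The table, its column names (`PatientID`, `AgeAtDiagnosis`, `TreatmentGroup`, `CreatinineLevel`, `CreatinineCensored`), and all values are hypothetical illustrations, not data from the README's genomics example.

```r
# A tiny, made-up tidy table: descriptive column names, NA for missing values,
# and a paired VariableNameCensored-style column flagging censored measurements.
tidy_data <- data.frame(
  PatientID          = c("P01", "P02", "P03"),
  AgeAtDiagnosis     = c(54, 61, NA),         # missing, mechanism unknown -> NA
  TreatmentGroup     = c("DrugA", "Placebo", "DrugA"),
  CreatinineLevel    = c(1.1, NA, 0.9),       # NA because below the detection limit
  CreatinineCensored = c(FALSE, TRUE, FALSE)  # TRUE marks the censored value
)

# Share as a single flat CSV file (one table per file, no worksheets, no macros).
write.csv(tidy_data, "tidy_data.csv", row.names = FALSE)

# TAB-delimited alternative:
write.table(tidy_data, "tidy_data.txt", sep = "\t", row.names = FALSE, quote = FALSE)
```

Writing the table out as plain CSV or TAB-delimited text keeps it readable by any analysis tool, matching the single-table text files suggested as an alternative to multi-worksheet Excel files.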