Practical Computing For Biologists
Release 1.0
Cliburn Chan
CONTENTS
Updates
Introduction
Course Description
Index
CHAPTER
ONE
UPDATES
24 April 2012 Personal web space on the Duke servers is not turned on by default for DUMC personnel. However, if
you make a request for AFS space to [email protected], it will be available to you within 24 hours.
9 April 2012 The PCfB textbook is now available for collection for course participants at Room 120, Surgical Oncology Research Facility. Please read or at least scan the book before the workshop starts. There are also pre-workshop
Assignments that you will need to do. We will shortly be contacting course participants for data sets/repetitive tasks
that could serve as relevant demonstrations or examples of regular expression manipulation, programming or use of
relational databases.
3 April 2012 The course is now fully subscribed, and new registrants will be placed on a wait list. Please continue to
register if you are interested - if there is sufficient demand, we will plan for a second workshop. Thanks so much for
your enthusiasm and support!
CHAPTER
TWO
INTRODUCTION
The CFAR Biostatistics and Computational Biology Core is conducting a free four-day workshop for Duke researchers
to learn how to use the computer more effectively for scientific work. It is designed for people who need to work with
large and complex data sets and suspect that there is a better and faster way to get their work done. The course
will use the textbook Practical Computing for Biologists (PCfB) by Steven Haddock and Casey Dunn, and CFAR is
generously giving each participant a free copy of the book. The main intent of the course is to teach researchers how
to use the Unix shell, the Python programming language, databases and image manipulation tools to execute common
scientific chores. An OS X system is preferred since Macs provide a Unix command line natively. Windows users
can also participate by setting up Linux in an emulator (this is perfectly safe and instructions are given in the PCfB
textbook).
The course is designed for people trained in biology, and no previous Unix or programming experience is necessary.
The course will be limited to 12 participants and will be held at the Surgical Oncology Research Facility (SORF) Beard
Conference Room from 29 May 2012 to 1 June 2012. Please email [email protected] if you have any enquiries
or wish to register for the course. Acceptance will be on a first-come first-serve basis, but CFAR investigators and
their trainees will be given priority.
We will contact course participants before the workshop starts to collect your copy of Practical Computing for Biologists. To make it relevant for your needs, participants will also be asked to suggest computational tasks that you would
like to automate or simplify, as well as to contribute data sets that are tedious to preprocess and filter manually. We
will try to work these examples into the demonstrations or class assignments if at all possible. Updates and course
materials will be posted at http://www.duke.edu/~ccc14/pcfb/.
CHAPTER
THREE
COURSE DESCRIPTION
29 May 2012 (Tuesday)
AM: Software installation and working with text editors. We will install the TextWrangler editor (jEdit for Linux
users), the Enthought Python distribution (Academic license), ImageMagick, ImageJ, MySQL Community Server and
MySQL Workbench. Participants are expected to install the software ahead of the workshop following instructions in
PCfB, but help and troubleshooting will be provided in the morning session if necessary. Many operations on large
file sets, especially for text data, are performed much more efficiently from the command line than from a graphical
interface. We will learn how to open a Terminal, and perform text processing, access material from the web, and write
simple shell scripts to automate common tasks.
Installation and introduction
Basic Unix commands
PM: We will learn to use the TextWrangler/jEdit editor to understand the basics of regular expressions, and how to
reformat text using regular expressions. TextWrangler/jEdit will also be used to develop programs from Day 2. We
will also learn to transfer and synchronize files with remote computers from the command line, and to run programs on
remote computers using the command line (ssh). We will conclude by showing how to construct a simple homepage
using Sphinx and upload it to the Duke server.
Using a text editor and regular expressions
Remote computing and web page generation
30 May 2012 (Wednesday)
AM: Day 2 introduces you to the Python programming language, a modern dynamic language that is (relatively) easy
to learn. The morning session will introduce you to the powerful IPython interpreter, where you will test out code
snippets with instant feedback, and learn about the Python documentation and help system. We will then move on to
Python scripting, including decisions and loops, reading from and writing to files, and writing your own functions.
Python Basics I
Python Basics II
PM: The afternoon will introduce you to the most useful Python modules in the standard library, followed by an
introduction to the NumPy module for numerical work, and Matplotlib for graphics.
Python Modules
NumPy and Matplotlib
31 May 2012 (Thursday)
AM: You will learn more about NumPy and Matplotlib, together with how to use the Biopython module for sequence
and array analysis, as well as how to access the NCBI databases programmatically.
Biopython I
Biopython II
PM: The afternoon starts with an introduction to relational databases and how to query them using SQL, then concludes
with some intermediate examples of using Python for data analysis and statistical simulation.
Data management and relational databases
Data analysis with Python
01 June 2012 (Friday)
AM: On the final day, we will have a tutorial for how to create scientific diagrams using the vector illustration program
Inkscape. The course will conclude with working through developing a moderately complex Python program to parse,
summarize and display data from a cytokine assay experiment.
Vector graphics with Inkscape
Capstone example
CHAPTER
FOUR
4.1.8 Biopython I
1. orchid FASTA file
4.1.9 Biopython II
4.1.10 Data management and relational database
SQLite example database Code to generate the database
4.2 Assignments
4.2.1 Pre-workshop
#1 Software installation
Once you have collected your copy of the PCfB book from SORF, install the following software. If you will be using
a Windows system, please follow the instructions starting on page 458 under Installing VirtualBox till the end of
Appendix 1.
For Mac users, install TextWrangler (Page 12) and MySQL (Page 260). We also recommend installing the Enthought
Python Distribution by requesting a free academic copy from http://www.enthought.com/products/edudownload.php
(this will email you a download link). It will also be useful to learn how to compile and install software from source
by following the instructions given in Chapter 21. If you find the instructions extremely confusing, an alternative is to
use a package management system such as MacPorts. MacPorts and how to use it to install software are described on
Page 415.
At the end of this assignment, you should have installed the following software:
1. TextWrangler (jEdit for Windows/Ubuntu)
2. MySQL
3. Enthought Python Distribution
4. ImageMagick (compiling from source or using a package management system such as MacPorts)
#2 Creating your Duke home page
Requesting AFS space: All DUMC personnel with a NetID are eligible for AFS space (5GB) for hosting personal
web pages. However, it is not available by default. Please email [email protected] to request AFS space if
necessary to complete this assignment. It should be available to you within 24 hours of the request.
1. Create a file called index.html in your text editor (TextWrangler or jEdit) and type or copy the following text:
<html>
<head>
<title>My home page for PCfB</title>
</head>
<body>
Congratulations, you have successfully created your home page!
</body>
</html>
2. Use your NetID and password to log into WebFiles. You'll be connected to your home directory.
3. Click the Shared Spaces tab.
4. Under Your Personal Web Space, click Create public_html
5. Under Your Personal Web Space, click Upload to public_html and upload the index.html file you created in
Step 1.
6. To view your Web site, visit http://www.duke.edu/~NetID. (Replace NetID with your NetID but keep the ~.)
4.3 References
The website for the textbook Practical Computing for Biologists.
4.3.3 Unix
Unix Cheat Sheet
4.3.5 Python
Online tutorials
Learn Python the Hard Way: If you have found the learning curve for our exercises to be too steep, try the 52 exercises
at this site, which provide a much more gentle ramp. The author shares our philosophy that the only way to effectively
learn programming is by working on programming exercises. Don't be put off by the title - the exercises are not as
hard as the ones in the workshop - by "the hard way" the author just means learning by doing instead of learning by
reading.
Think Python - How to Think Like a Computer Scientist: Once you are comfortable with the basic syntax of Python
(e.g. from the book above), this book introduces you gently to the conceptual ideas you will need to program effectively.
PyPi - A repository of software for the Python programming language
pypi
Useful packages for scientific computing
Python
Numpy
Scipy
Matplotlib
Sphinx
4.4 Participants
4.4.1 Registered
1. Will Williams <[email protected]>
2. Jessica Peel <[email protected]>
3. John Yi <[email protected]>
4. Alex Price <[email protected]>
5. Christopher J. Pierick <[email protected]>
6. Sandeep Dave <[email protected]>
7. Janet Staats <[email protected]>
8. Joe Saelens <[email protected]>
9. Anna Maria Masci <[email protected]>
If any of you have had trouble installing software, we will spend some time helping you to troubleshoot.
4.5.3 Feedback
Preparing for and running such a workshop takes a lot of time and effort. We are therefore very interested in any
feedback that you can provide that will help us improve. During the workshop, if you have any suggestions for
improvement, please let us know on the spot. Since this is a small class, we want the sessions to be highly informal
and welcome questions and interruptions.
We will probably run this course again in the future if you found it useful. It is also possible that we will run other
similar workshops, depending on interest. As a simple survey, what computational topics would you be interested in?
1. Practical programming for biologists - An intermediate course on the use of Python for scientific computation.
2. Practical statistics for biologists - An introduction to basic statistics in Python and R.
3. Practical data management for biologists - An introduction to creating and using relational database systems to
manage laboratory data.
4. Practical data visualization for biologists - An introduction to statistical and scientific graphics for exploratory
data analysis and scientific communication.
5. Modeling and simulations in biology - How to construct and simulate computational models of biological phenomena.
6. Others (please specify)
4.5.4 Pre-test
1. Do you know how to open a Unix shell/console/terminal on your computer?
2. How do you create a directory foo that has a subdirectory bar that has a subdirectory baz with a single
command?
3. How do you write a regular expression to find sequences that lie between specific restriction enzyme motifs?
4. What is the difference between ssh and scp?
5. How do you write a function in Python to plot a histogram of some data?
6. How do you use BioPython to get information from the NCBI databases?
7. What does this mean select f.name, b.value from foo f, bar b where f.foo_id =
b.foo_id;?
8. How can you estimate the 95% confidence intervals for a statistic without using any formulas?
9. Can you illustrate a conceptual biological model using a vector graphics program?
10. Can you write a program to summarize data from a typical laboratory spreadsheet?
Record your score from 0 to 10. We are curious to see if there is any improvement in your score by the end of the
workshop!
cd - change directories
pwd - print working directory
mkdir - make directory
rmdir - remove directory
ls - list directory
cp - copy files
mv - move files
rm - remove files
Changing and Making Directories
pwd is a command that prints the current directory. Depending on how your shell is configured, your current directory or part of it is displayed in your prompt (the prompt is the bit in your shell that looks like this iMac:pcfb
cliburn$). Typically your shell starts you in your home directory, where you would have permissions to write and
create files. To change directories you would use cd, the change directory command.
[jacob@moku ~]$ pwd
/home/jacob
[jacob@moku ~]$ cd /tmp
[jacob@moku /tmp]$ pwd
/tmp
[jacob@moku /tmp]$ cd
[jacob@moku ~]$ pwd
/home/jacob
You can use cd without specifying a directory - this returns you to your home directory. You can also use ~ as an alias
for your home directory. Creating directories uses the mkdir command. If you don't specify a full path (a path
starting with a /), mkdir creates the directory inside the current directory.
[jacob@moku ~]$ pwd
/home/jacob
[jacob@moku ~]$ mkdir foo
[jacob@moku ~]$ cd foo
[jacob@moku ~/foo]$ pwd
/home/jacob/foo
[jacob@moku ~/foo]$ mkdir /tmp/bar
[jacob@moku ~/foo]$ cd /tmp/bar
[jacob@moku /tmp/bar]$ pwd
/tmp/bar
If you need to make a deep hierarchy of directories all at once, you can use the -p argument to mkdir to create all the
necessary preceding directories.
[jacob@moku ~]$ pwd
/home/jacob
[jacob@moku ~]$ cd foo
foo: No such file or directory.
[jacob@moku ~]$ mkdir -p foo/bar
[jacob@moku ~]$ cd foo
[jacob@moku ~/foo]$ cd bar
[jacob@moku ~/foo/bar]$ pwd
/home/jacob/foo/bar
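The same idea carries over to Python, which is introduced on Day 2. The following is a minimal sketch, assuming Python 3 (the exist_ok argument is not available in older versions); the directory names and the use of a temporary directory are illustrative choices, not part of the original notes.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing in your home folder is touched.
base = tempfile.mkdtemp()

# os.makedirs creates every missing directory along the path, like `mkdir -p`.
nested = os.path.join(base, "foo", "bar", "baz")
os.makedirs(nested)

# exist_ok=True makes a repeated call harmless, again like `mkdir -p`.
os.makedirs(nested, exist_ok=True)

created = os.path.isdir(nested)
print(created)
```

Without exist_ok=True, the second call would raise an error because the directory already exists.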
The rmdir command removes directories. Directories must be empty to be removed. Just like mkdir, if a full path
is not specified, it tries to remove the directory from the current directory. Similar to mkdir, the -p argument tries to
remove all the preceding directories as well.
[jacob@moku /tmp/bar]$ cd
[jacob@moku ~]$ rmdir /tmp/bar
[jacob@moku ~]$ rmdir -p foo/bar
[jacob@moku ~]$ mkdir bar
[jacob@moku ~]$ touch bar/foo
[jacob@moku ~]$ rmdir bar
rmdir: bar: Directory not empty
touch is a command that creates an empty file. We will find out more about it when we look at working with files.
Examining directories
Now that we understand directories, we want to look at what files the directories contain. ls will list the files in a
directory.
[jacob@moku ~]$ ls
A.txt
B.txt
C.txt
[jacob@moku ~]$ ls bar
foo
bar
Just like mkdir, ls has several useful command line options. ls -l will list the extra properties of each entry
(file permissions, owner, last time modified). ls -a will also list hidden files (those files whose names begin with a .).
ls -F will append a / to directory names (along with other symbols after other special file types).
[jacob@moku ~]$ ls -l
total 5
-rw-r--r-- 1 jacob jacob 32 May 25 08:48 A.txt
-rw-r--r-- 1 jacob jacob 32 May 25 08:49 B.txt
-rw-r--r-- 1 jacob jacob 64 May 25 08:53 C.txt
drwxr-xr-x 2 jacob jacob  3 May 23 15:54 bar
[jacob@moku ~]$ ls -a
.              .cshrc       .mail_aliases  .rhosts  A.txt  bar
..             .login       .mailrc        .shrc    B.txt
.bash_history  .login_conf  .profile       .ssh     C.txt
[jacob@moku ~]$ ls -laF
total 20
drwxr-xr-x 4 jacob jacob   16 May 27 12:11 ./
drwxr-xr-x 4 root  wheel    5 May 23 15:12 ../
-rw------- 1 jacob jacob  459 May 25 09:32 .bash_history
-rw-r--r-- 1 jacob jacob 1014 May 23 15:12 .cshrc
-rw-r--r-- 1 jacob jacob  257 May 23 15:12 .login
-rw-r--r-- 1 jacob jacob  167 May 23 15:12 .login_conf
-rw------- 1 jacob jacob  379 May 23 15:12 .mail_aliases
-rw-r--r-- 1 jacob jacob  339 May 23 15:12 .mailrc
-rw-r--r-- 1 jacob jacob  753 May 23 15:12 .profile
-rw------- 1 jacob jacob  284 May 23 15:12 .rhosts
-rw-r--r-- 1 jacob jacob  978 May 23 15:12 .shrc
drwx------ 2 jacob jacob    3 May 23 16:15 .ssh/
-rw-r--r-- 1 jacob jacob   32 May 25 08:48 A.txt
-rw-r--r-- 1 jacob jacob   32 May 25 08:49 B.txt
-rw-r--r-- 1 jacob jacob   64 May 25 08:53 C.txt
drwxr-xr-x 2 jacob jacob    3 May 23 15:54 bar/
You can also copy multiple files at once. cp will copy all the files listed on the command line into the directory
specified in the last argument.
[jacob@moku ~]$ cp A.txt B.txt C.txt bar/
[jacob@moku ~]$ ls bar
A.txt
B.txt
C.txt
foo
Globbing will allow us to use many files at once rather than typing them all out explicitly. Globbing is a form of
wildcards.
Glob    Effect
*       any number of any character
?       any single character
[abc]   one of a, b, or c
or
[jacob@moku ~]$ cp ?.txt bar/
[jacob@moku ~]$ ls bar
A.txt
B.txt
C.txt
foo
or even
[jacob@moku ~]$ cp [ABC].txt bar/
[jacob@moku ~]$ ls bar
A.txt
B.txt
C.txt
foo
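The three glob forms can also be tried out from Python, whose standard library implements the same wildcard rules in the fnmatch module. This is a rough sketch for experimenting with patterns; the list of file names (including AB.txt) and the helper function matching are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing used only for this demonstration.
names = ["A.txt", "B.txt", "C.txt", "AB.txt", "foo"]

def matching(pattern):
    """Return the names that the glob pattern matches, in the original order."""
    return [name for name in names if fnmatchcase(name, pattern)]

star = matching("*.txt")       # * matches any number of any character
question = matching("?.txt")   # ? matches exactly one character
bracket = matching("[AB].txt") # [AB] matches one of A or B

print(star)
print(question)
print(bracket)
```

Note that AB.txt is matched only by the * pattern, since ? and [AB] each stand for exactly one character.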
With the -r command line argument, you can recursively copy whole directories.
[jacob@moku ~]$ cp -r bar foo
[jacob@moku ~]$ ls foo
A.txt
B.txt
C.txt
foo
[jacob@moku ~]$ mv A.txt foo
[jacob@moku ~]$ mv [BC].txt foo
[jacob@moku ~]$ mv foo/*.txt bar/
[jacob@moku ~]$ mv foo baz
To remove files, use the rm command. A word of caution: there is no trash can or waste basket. Removed files are
gone. It is very easy to accidentally shoot yourself in the foot when blindly removing files.
[jacob@moku ~]$ rm bar/A.txt
[jacob@moku ~]$ rm bar/[BC].txt
The -r command line flag removes files recursively, while -f attempts to remove files without prompting, regardless of
their permissions. The combination of -r and -f flags can be useful to remove a whole directory tree. BE VERY CAREFUL
using the -r and -f flags.
> is a redirect to create a new file (overwriting the old file if it exists). >> is the append redirect, while | (pipe) allows
you to send the output of one command as input to a new command.
[jacob@moku ~]$ cat [AB].txt
This is file A.
It has 2 lines.
This is file B.
It has 3 lines.
While cat is useful for displaying small files, longer files quickly scroll off the screen. To display longer files, use a
pager such as less.
[jacob@moku ~]$ less <file name>
effect
Go to the last line
Go to the first line
Go to line number #
Search forward for foo
Search Backward for foo
Quit less
The first argument passed to grep is the pattern to search for, in the above example poor Yorick.
Regular expressions provide the ability to search beyond known text, using wildcards to build complex patterns.
Key     Meaning
.       any single character
+       one or more of the preceding character
*       zero or more of the preceding character
^       matches the beginning of the line
$       matches the end of the line
[abc]   matches a single character: a, b, or c
So, to find all the lines that begin with the word HAMLET and end with DENMARK
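The same pattern can be tested in Python with the re module. This is only a sketch of the idea and not the grep command from the session: it assumes the intended closing word is DENMARK, and the three sample lines are invented for the demonstration.

```python
import re

# Invented sample lines; only the first both starts with HAMLET and ends with DENMARK.
lines = [
    "HAMLET, PRINCE OF DENMARK",
    "HAMLET: Alas, poor Yorick!",
    "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK",
]

# ^ anchors the start of the line, .* spans anything in between, $ anchors the end.
pattern = re.compile(r"^HAMLET.*DENMARK$")

matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Dropping the ^ anchor would also match the third line, which shows why the positional characters matter.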
4.6.4 Exercise
Make a directory in your home folder named spam containing subfolders eggs, bacon, foo and bar, and then remove
spam/foo and spam/bar.
4.7.3 Mini-exercise
When you first open the Lorem ipsum.txt file, it looks like
Now drag the splitbar all the way to the top to get back a single window.
4.7.6 Mini-exercise
1. Construct a regular expression to find two or more consecutive vowels in the cats and dogs file.
2. Use a regular expression pattern to delete all punctuation from the example.
3. Use a positional assertion with the alternation pattern to find all cat, cats, dog or dogs that occur at the end of a
sentence. You should find that the ends of sentences 1, 3, 5, 9 and 11 match.
4.7.8 Exercise
Open the file Ch3observations.txt in the examples folder. It looks like this
13 13 53 -1.414 5.781
1961 Mar. 17 03 46 14 3.6
2002 Oct. 1 18 22 36.51 -3.4221
1863 Jul. 20 12 02 1.74 133
Hint: Construct a regular expression to match one single line. Look at the patterns that you must capture from
the original to perform the conversion. Construct the appropriate subpatterns to do so. Now construct the regular
patterns between the subpatterns to match the unwanted separating characters. When you have a regular expression
that matches a single line, check by clicking Next - the highlighted match should jump from one complete line to the
other. Now use the references \1, \2 etc, re-ordering if necessary, and adding in filler such as extra punctuation or
tabs to construct the desired replacement string. Save the regular expression by clicking the little g button on the Find
dialog box and clicking Save .... Now hit Replace and see if it does what you expect. If it works, hit replace a few
more times or click Replace All. If it doesn't work, Undo and try again.
If you are totally lost and about to pull all your hair out, the construction of the solution is described in detail on pages
38-40 of the PCfB textbook. However, you should not peek at the answer without trying for at least 15 minutes.
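The capture-and-reorder technique described in the hint works the same way in Python's re module. The sketch below deliberately uses unrelated data so that it does not give away the exercise; the names in records and the pattern are made up for illustration.

```python
import re

# Made-up records in "Lastname, Firstname" form.
records = ["Darwin, Charles", "Franklin, Rosalind"]

# Two captured subpatterns, separated by the unwanted ", " filler.
pattern = re.compile(r"^(\w+), (\w+)$")

# \1 and \2 refer back to the captured groups; here they are swapped.
swapped = [pattern.sub(r"\2 \1", record) for record in records]
print(swapped)
```

Everything outside the parentheses is matched but discarded, which is how unwanted separators are dropped during the conversion.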
Self provisioned systems are now available for remote usage through OITs Virtual Computing Lab servi
Mon May 28 00:44:43 EDT 2012
[ccc14@login5 ~]$ ls
AFSDocs public_html Sites
[ccc14@login5 ~]$
If you know the IP address of a Linux or Mac computer that you have login rights to, you can usually connect to it via
ssh. For example, this is how we typically access departmental servers or computing workstations from home. If you
work with the Duke Beowulf cluster, you will also use ssh to connect and run your programs remotely.
If you can ssh to a computer, you can also copy files to or from the remote computer using scp (secure copy). Here
is an example:
[ccc14@login5 ~]$ cat > remote.txt
This is my remote file on lgoin.oit.duke.edu
[ccc14@login5 ~]$ exit
logout
Connection to login.oit.duke.edu closed.
eris:pcfb cliburn$ scp [email protected]:~/remote.txt .
[email protected] password:
remote.txt                                    100%   45     0.0KB/s   00:00
eris:pcfb cliburn$ cat remote.txt
This is my remote file on lgoin.oit.duke.edu
If you wish to synchronize entire directory trees between computers, it is more efficient to use rsync, which performs
data compression, only transfers files that are different, and allows resuming of interrupted transfers. For example,
rsync is a simple way to back up your files to another computer.
eris:tmp cliburn$ rsync -avz foo [email protected]:~/
[email protected] password:
building file list ... done
foo/
foo/foo.txt
foo/bar/
foo/bar/bar.txt
foo/bar/baz/
foo/bar/baz/baz.txt
sent 376 bytes  received 104 bytes  73.85 bytes/sec
total size is 75  speedup is 0.16
The flags -avz are short for --archive, --verbose and --compress. The --archive flag preserves symbolic links and is perfect for remote backups. As usual, you can look at man rsync if you want to know the details
of how rsync works.
file homepage/conf.py.
file homepage/index.rst.
file homepage/Makefile.
file homepage/make.bat.
The next thing to do is to cd homepage to enter the directory that was just created for us and edit the conf.py
file to set up a configuration that we like. The only change to be made for now is to change the html_theme from
default to agogo to match our workshop website theme. The themes that come with Sphinx can be viewed at
http://sphinx.pocoo.org/theming.html.
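Since conf.py is an ordinary Python file, the change amounts to editing one assignment. The excerpt below is a sketch of what the relevant line might look like after the edit; the rest of the generated file is left untouched and is not shown.

```python
# Excerpt of homepage/conf.py after editing (all other generated settings unchanged).
# The value was 'default' in the file written by sphinx-quickstart.
html_theme = 'agogo'
```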
The first page to edit is the index.rst file. The rst extension is for ReStructuredText, a simple plain text
markup language that is much easier to work with than HTML. Look at the primer on ReStructuredText at
http://sphinx.pocoo.org/rest.html to see examples of how to use it. Open the index.rst file in your text editor:
.. Demo home page documentation master file, created by
   sphinx-quickstart on Mon May 28 01:12:56 2012.
   You can adapt this file completely to your liking, but it should at least
   contain the root toctree directive.

Welcome to Demo home page's documentation!
==========================================

Contents:

.. toctree::
   :maxdepth: 2
Since this is to be a home page rather than documentation page, we can simplify the structure. Edit the file so that the
last part looks like this:
Contents:

.. toctree::
   :maxdepth: 2
   :hidden:

   Home <self>
   research
   publications
We want to keep the table of contents hidden, and have set up a simple structure where the home page (index.html)
links to a research.html and a publications.html file. Just as the index.html file will be generated by this index.rst file,
the other two files are generated from research.rst and publications.rst files that we write using ReStructuredText.
The full contents of the 3 rst files are included verbatim for reference:
4.8.2 index.rst
Cliburn's very boring home page
==========================================

Stuff I do
------------------

Tongue ribeye pig, tenderloin turducken salami frankfurter strip
steak. T-bone turducken meatball flank, beef ribs brisket corned
beef tail. Ball tip tongue flank beef ribs, biltong tri-tip salami
chicken sausage leberkas chuck tail. Kielbasa shankle pork chop
sirloin, leberkas bresaola tail. Ham hamburger venison sausage
biltong, pork loin brisket pig sirloin pastrami short loin shank
chicken. Pig andouille leberkas beef short loin ribeye turkey ham
hock. Cow ham kielbasa, capicola short ribs brisket shoulder
pancetta t-bone pork belly tri-tip pork loin tenderloin.
Ground round pork belly pastrami pork chop, drumstick corned beef
t-bone tail bresaola filet mignon meatloaf. Boudin spare ribs ham
hock short loin. Prosciutto ham hock sausage, biltong leberkas
turkey hamburger pork meatball bresaola pork belly. Shankle tri-tip
frankfurter ribeye leberkas ham hock, tongue beef ribs speck venison
pork chop andouille chuck. Rump pastrami bresaola, strip steak short
loin andouille pork chop beef boudin capicola bacon shank prosciutto
beef ribs swine. Meatloaf leberkas pancetta beef.
More stuff I do
--------------------

Enim do boudin officia labore tail. Pork exercitation short ribs
deserunt laboris, tenderloin drumstick in dolor tongue sunt ex. Ham
hock t-bone exercitation pork loin non mollit. Jowl boudin magna
adipisicing in dolore. Brisket quis shoulder nostrud tempor
ea. Aliquip officia consequat deserunt, dolore nostrud est tri-tip
ut pancetta speck shank excepteur. Sausage cillum ground round velit
rump, dolore laboris.
Commodo consectetur ut, officia proident eu cillum jowl aute flank
sausage ut beef ribs. Deserunt occaecat pariatur elit. Pork chop ut
tempor, enim aliqua laborum cillum eiusmod t-bone occaecat aute
laboris labore. Ham hock turkey beef nostrud excepteur
dolor. Consectetur meatball chicken deserunt exercitation, corned
beef beef in short ribs ut ea velit beef ribs. Enim andouille in,
dolore ut meatball ea ut tail proident short ribs leberkas ground
round filet mignon.
Andouille sirloin chicken tempor aute, cow salami commodo dolore
leberkas culpa in ea esse. Id ground round tongue velit. Ex elit
minim sirloin fatback laboris. Irure andouille shankle cupidatat,
nostrud bresaola id shank do jowl. Swine sirloin pork loin,
prosciutto bresaola rump cillum in exercitation capicola.
Contents:

.. toctree::
   :maxdepth: 2
   :hidden:

   Home <self>
   research
   publications
4.8.3 research.rst
Cliburn's boring research page
=================================================

Current research interests
---------------------------------------

1. Bacon
2. Pork rind
3. Trotters

.. image:: bacon.jpg
   :width: 60%

Past research interests
----------------------------------------

1. LOL cats

.. image:: Lolcat.JPG
4.8.4 publications.rst
Not really Cliburn's publications
==================================

First 5 hits on Pubmed search for "Sphinx"
---------------------------------------------

1. Quadrature RF Coil for In Vivo Brain MRI of a Macaque Monkey in
a Stereotaxic Head Frame. Roopnariane CA, Ryu YC, Tofighi MR,
Miller PA, Oh S, Wang J, Park BS, Ansel L, Lieu CA, Subramanian T,
Yang QX, Collins CM. Concepts Magn Reson Part B Magn Reson
Eng. 2012 Feb;41B(1):22-27. Epub 2012 Feb 18. PMID: 22611340
[PubMed]
2. The place of general practitioners in cancer care in
Champagne-Ardenne. Tardieu E, Thiry-Bour C, Devaux C, Ciocan D,
de Carvalho V, Grand M, Rousselot-Marche E, Jovenin N. Bull
Cancer. 2012 May 1;99(5):557-562. PMID: 22522646 [PubMed - as
supplied by publisher]
make.bat
publications.rst
research.rst
22035.20 bytes/sec
Now, if we navigate to http://www.duke.edu/~ccc14/homepage/, we will see the homepage and the links on the sidebar
to publications and research work as well.
create persistent data stores that allow you to slice and dice well-structured data that will be briefly touched upon
in the session on Data management and relational databases, but will require
another full workshop to cover in any depth.
Data analysis is largely about how to do statistics. We will show very simple examples of analysis in Data
analysis with Python, but the proper cultivation of the statistical way of thinking probably
requires not just another workshop, but returning to graduate school.
Finally, there are two main reasons for data visualization - the first is for exploratory data analysis, since the human
brain is highly optimized to detect patterns in pictures; and the second is for communicating results, since every
biologist I've ever met is only ever interested in the figures in a paper and never the raw data. Making pictures from
data for exploratory analysis and communication is covered in NumPy and Matplotlib, and
the creation of schematics to illustrate concepts in Vector graphics with Inkscape.
But first - in order to tell your slave how to do these jobs, you need to think like a programmer.
For now, we simply define a variable as a name we give to data so that we can retrieve it later. The other terms should
be familiar to everyone.
For example, here is a simple example that checks if a word is a palindrome:
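A minimal sketch of such a check is given here as an illustration, not as the workshop's own listing; the function name is_palindrome and the test words are made up, and the print calls assume Python 3.

```python
def is_palindrome(word):
    """Return True if the word reads the same forwards and backwards."""
    letters = word.lower()           # ignore differences in case
    return letters == letters[::-1]  # compare with the reversed string

# Loop over some words and make a decision for each one.
for word in ["racecar", "Level", "python"]:
    if is_palindrome(word):
        print(word, "is a palindrome")
    else:
        print(word, "is not a palindrome")
```

Even this short program uses a variable, a comparison, a decision and a loop.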
Since the 5 operations are about all that a computer can do, even big complex projects must boil down to smaller tasks
that mix and match these operations. Essentially, if you know these 5 operations, you know how to program. The rest
are details.
There is also a vanilla python interpreter, but ipython provides so many nice features such as Unix shell integration, tab completion, history, etc., that I hardly ever use the python interpreter. Try typing ? in ipython to see what
it offers. To exit the information screen, type q to quit. There are several ways to get more information about a Python
language feature - for example, what does the python range function do? Type help(range) or help range
or range? or ?range.
In [9]: math.sin(math.pi/4)
Out[9]: 0.7071067811865475
In [10]: math.asin(math.sin(math.pi/4))
Out[10]: 0.7853981633974482
In [11]: math.sqrt(16)
Out[11]: 4.0
In [12]: 16**0.5
Out[12]: 4.0
Note the gotcha in the [2] calculation - when both numerator and denominator are integers, the division operator
returns an integer, which might not be what you want. Make either numerator or denominator a float (a number with
a decimal point) as in [3] to get the usual answer.
4.9.6 Types
As we have already seen in the previous example, there is a difference between 2 and 2.0. In particular, they differ in
type - 2 is an integer, while 2.0 is a float. Types are necessary so that Python can distinguish between different kinds
of things that may have different behaviors. Here are the most commonly used basic types in Python:
1. Integers are whole numbers, ..., -3, -2, -1, 0, 1, 2, 3, ...
2. Floats are decimal numbers e.g. 0.01, 1e-6, math.pi etc
3. Bools are the values True and False
4. Strings are anything within single quotes, double quotes, or triple quotes such as 'hello', "hello",
'''hello''' and """hello""".
While the first 3 types are atomic, strings are actually sequences of characters, and we can retrieve characters at specific
positions by indexing and slicing. An example of how we can slice and dice sequences is useful here:
In [1]: s = "My first string"
In [2]: s[0]
Out[2]: 'M'
In [3]: s[1]
Out[3]: 'y'
In [4]: s[-1]
Out[4]: 'g'
In [5]: s[0:2]
Out[5]: 'My'
In [6]: s[3:8]
Out[6]: 'first'
In [7]: s[3:8:2]
Out[7]: 'frt'
In [8]: s[::-1]
Out[8]: 'gnirts tsrif yM'
Note that in Python, we count from zero, not one. Note also that a negative index means count backwards from the
end of the sequence.
Another type of sequence that is ubiquitous in Python programs is the list, consisting of a sequence of other types
delimited by square brackets [ and ]. Unlike strings which only contain characters, list elements can be anything,
including other lists. Another difference between strings and lists is that the elements in a list can be changed by
assigning new values to them. In geek-speak, lists are mutable and strings are immutable. You may also see tuples
which are items separated by commas, and typically delimited by ( and ). For the most part, we can just consider
tuples to be immutable lists. A neat trick we can do with tuples is unpacking, perhaps easier demonstrated than
explained:
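A minimal sketch of the two unpacking idioms discussed here, with made-up values; it runs under both Python 2.7 and Python 3.

```python
# First example: unpack a length-3 list into three variables at once
a, b, c = [1, 2, 3]
print("%d %d %d" % (a, b, c))

# Second example: swap two variables without a temporary variable.
# The right-hand side is packed into a tuple (b, a) before assignment.
a, b = b, a
print("%d %d" % (a, b))
```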
In the first example above, we unpacked the length 3 list into the variables a, b, c in a single statement. In the
second example, we swapped the contents of a and b.
Time for more experimentation in ipython:
In [1]: alist = [1,2,3.14,'foo','bar',['a','b',True]]
In [2]: alist[5]
Out[2]: ['a', 'b', True]
In [3]: alist[5][2]
Out[3]: True
In [4]: alist[1] = 99
In [5]: alist
Out[5]: [1, 99, 3.14, 'foo', 'bar', ['a', 'b', True]]
In [6]: atuple = (1,2,3.14,'foo','bar',['a','b',True])
In [7]: atuple[5]
Out[7]: ['a', 'b', True]
In [8]: atuple[5][2]
Out[8]: True
In [9]: atuple[1] = 99
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Volumes/HD3/hg/pcfb/<ipython-input-9-5a8168f444b1> in <module>()
----> 1 atuple[1] = 99
TypeError: 'tuple' object does not support item assignment
In [10]: astring = "hi there"
In [11]: astring[2] = 'x'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Volumes/HD3/hg/pcfb/<ipython-input-11-71b2376dea24> in <module>()
----> 1 astring[2] = 'x'
TypeError: 'str' object does not support item assignment
Here the difference between mutable and immutable is clearly shown. We can grow lists in several ways - using
insert, append and list concatenation. We can remove items from a list by using pop, del or assigning
the empty list to a slice.
In [1]: blist = []
In [2]: blist.append(1)
In [3]: blist.append(99)
In [4]: blist = blist + [3,4,5]
In [5]: blist
Out[5]: [1, 99, 3, 4, 5]
In [6]: blist[2:2] = ['a','b','c']
In [7]: blist
Out[7]: [1, 99, 'a', 'b', 'c', 3, 4, 5]
In [8]: blist[2:4] = []
In [9]: blist
Out[9]: [1, 99, 'c', 3, 4, 5]
In [10]: blist.pop()
Out[10]: 5
In [11]: blist
Out[11]: [1, 99, 'c', 3, 4]
In [12]: del blist[3]
In [13]: blist
Out[13]: [1, 99, 'c', 4]
The final basic type we will look at is the dictionary. A dictionary consists of (key, value) pairs, where
the key is an immutable type (e.g. a number, a string, a tuple) and the value is anything. We retrieve the value in a
dictionary by using the associated key. Dictionaries are delimited by { and }. For example, we can make a dictionary
of email addresses:
In [1]: emails = {}
In [2]: emails['cliburn'] = '[email protected]'
In [3]: emails['jacob'] = '[email protected]'
In [4]: emails['cliburn']
Out[4]: '[email protected]'
In [5]: emails.keys()
Out[5]: ['jacob', 'cliburn']
In [6]: emails.values()
Out[6]: ['[email protected]', '[email protected]']
In [7]: emails
Out[7]: {'cliburn': '[email protected]', 'jacob': '[email protected]'}
We can also think of dictionaries as fancy lists that are not restricted to consecutive integers for indexing. Note that
we create dictionaries with curly braces {} but assign elements to and retrieve elements from dictionaries with square
brackets [key]. If the key is not found in the dictionary, Python will raise a KeyError exception and abort. To avoid
that, we can either check for the key before retrieval, tell Python to ignore KeyErrors in a try-except statement,
or return a default value using the get method instead of [] to access the dictionary. Here are more examples of
dictionary creation and usage:
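As a minimal sketch of dictionary creation with zip, and of the get and setdefault methods, consider the following. The variable names keys, values, missing, still_absent and inserted are illustrative choices, and the code runs under both Python 2.7 and Python 3.

```python
# Build a dictionary from two matching lists of keys and values using zip
keys = ['a', 'b', 'c', 'd']
values = [1, 2, 3, 4]
adict = dict(zip(keys, values))

# get returns the default (here 0) when the key is missing,
# and leaves the dictionary unchanged
missing = adict.get('e', 0)
still_absent = 'e' not in adict

# setdefault also returns the default, but inserts the key as well
inserted = adict.setdefault('e', 0)

# sort the items because dictionaries do not guarantee an order in Python 2
print(sorted(adict.items()))
```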
Dictionaries can also be constructed from a list of (key, value) pairs (or 2-tuples). The zip function takes the first
element from list 1 and the first element from list 2 to make a tuple, then does the same for the second element etc until
one or both lists are exhausted. It is used here to construct a list of pairs from two matching lists of keys and values.
The get method returns the default (second) argument when the key given by its first argument is not found in the
dictionary. The setdefault method does the same thing, but additionally inserts the new key / default value into
the dictionary if not found. If we want to ignore missing keys but just retrieve values for valid keys, we can wrap the
dictionary access in a try-except statement:
In [1]: adict
Out[1]: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 0}
In [2]: for ch in "ajfljldjajfeljad":
   ...:     try:
   ...:         print adict[ch]
   ...:     except KeyError:
   ...:         pass
   ...:
1
4
1
0
1
4
The pass keyword means "do nothing". Without the try-except statement, the program would crash with a
KeyError the first time ch was not found in the dictionary keys 'a', 'b', 'c', 'd', 'e'.
You might have noticed that we sometimes used a funny notation with a dot . between names, for example
list.append(1). This is because Python is an object-oriented language, and these basic types are also classes.
We won't discuss classes here except to note that we use the dot notation to access values (attributes) and functions
(methods) associated with the class. In ipython, hit the tab key after the dot to see what attributes and methods are available.
In [1]: blist
Out[1]: [1, 99, 'c', 4]
The code [b for b in dir(blist) if not b.startswith('_')] is a list comprehension to show all the normal
methods of the list class, filtering out methods that look like __xxx__. The methods with __ prefixes and suffixes
are special internal methods that we won't use in this workshop. You can use help to find out what extend, index
etc do. The count, reverse and sort methods are quite simple:
In [1]: numlist = [3,1,4,1,5,1,6,9]
In [2]: numlist.sort()
In [3]: numlist
Out[3]: [1, 1, 1, 3, 4, 5, 6, 9]
In [4]: numlist.reverse()
In [5]: numlist
Out[5]: [9, 6, 5, 4, 3, 1, 1, 1]
 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate',
 'upper', 'zfill']
In [3]: quote.lower()
Out[3]: 'my philosophy, like color television, is all there in black and white'
In [4]: quote.upper()
Out[4]: 'MY PHILOSOPHY, LIKE COLOR TELEVISION, IS ALL THERE IN BLACK AND WHITE'
In [5]: quote.split()
Out[5]:
['My',
 'philosophy,',
 'like',
 'color',
 'television,',
 'is',
 'all',
 'there',
 'in',
 'black',
 'and',
 'white']
In [6]: quote = ''.join(quote)
In [7]: quote.split(',')
Out[7]: ['My philosophy', ' like color television', ' is all there in black and white']
4.9.7 Operators
We have already seen some operations, such as + and * for addition and multiplication. In addition to the numeric
operators, there are also Boolean operators and, or and not, comparison operators <, <=, >, >=, ==, !=, is and
is not, and some other operators we will not discuss here (e.g. bitwise operators). Most of these operators are quite
self-evident, and if not, experimentation in the interpreter will clarify what they do:
In [1]: True and False
Out[1]: False
In [2]: True or False
Out[2]: True
In [3]: not True
Out[3]: False
In [4]: 3 == 3
Out[4]: True
In [5]: 3 == 4
Out[5]: False
In [6]: 3 != 4
Out[6]: True
In [7]: 3 > 4
Out[7]: False
In [8]: 4 > 3
Out[8]: True
In [9]: None is None
Out[9]: True
In [10]: 0 is None
Out[10]: False
In [11]: 0 is not None
Out[11]: True
   ...: else:
   ...:     grade = 'D'
   ...:
In [4]: grade
Out[4]: 'B'
As you can see, the if statement can be used by itself without an else part with the understanding that if the condition
is not true, then nothing is done. If you need to make decisions based on many conditions, the if-[elif]-else
form is useful, where the final optional else statement will be executed if none of the others above it are true. Note
that the last example depends on the ordering of the conditions, and works because the if-elif-else statement
works from top to bottom. See if you can figure out why the code below doesn't work as intended:
In [1]: score = 86
In [2]: grade = None
In [3]: if (score > 70):
   ...:     grade = 'C'
   ...: elif (score > 85):
   ...:     grade = 'B'
   ...: elif (score > 93):
   ...:     grade = 'A'
   ...:
In [4]: grade
Out[4]: 'C'
OK, back to looping. There are two main ways to loop in Python using the for and while statements. The for
statement goes through a sequence of items one at a time, typically performing some work on that item as it iterates
over it. The examples below should make clear what a for loop does:
In [1]: range(10, 15)
Out[1]: [10, 11, 12, 13, 14]
In [2]: for number in range(10, 15):
   ...:     print number
   ...:
10
11
12
13
14
In [3]: for char in 'abcde':
   ...:     print char
   ...:
a
b
c
d
e
In [4]: for name in ['adam', 'eve']:
   ...:     print name
   ...:
adam
eve
Remember that strings, lists and tuples are all sequences, and hence iterable; dictionaries are not sequences, but they are iterable too (looping over a dictionary visits its keys). So we can use the for loop
on any of these. It is getting rather tedious to use the word sequence, so from now on, I will use lists, or strings or
dictionaries, but you should remember that the looping constructs work on all of them. Another Python idiom that is
sometimes useful in looping is the use of enumerate to keep track of position (or index) while looping over a list.
Here is how it is used:
In [1]: for i, name in enumerate(['cliburn', 'jacob', 'adam']):
   ...:     print i, name
   ...:
0 cliburn
1 jacob
2 adam
A common use of the for loop is to create a new list from an old one. For example, here is how you can create a list of
the squares of all the numbers from 10 to 15.
In [1]: squares = []
In [2]: for i in range(10, 16):
   ...:     squares.append(i**2)
   ...:
In [3]: squares
Out[3]: [100, 121, 144, 169, 196, 225]
Notice that to get the numbers [10,11,12,13,14,15], we call range(10, 16) since range, like slicing, includes the start but excludes the end.
We can now combine if checks with loops to filter lists that we are constructing, only adding items to our list if they
meet certain conditions. Suppose we wanted to only collect the squares of the odd numbers and discard the even ones
in the previous example:
In [1]: oddsquares = []
In [2]: for i in range(10, 16):
   ...:     if i%2==1:
   ...:         oddsquares.append(i**2)
   ...:
In [3]: oddsquares
Out[3]: [121, 169, 225]
This process of looping over a sequence and collecting the items in a list, filtering by some condition if necessary, is so
common that Python has a short cut way of doing it known as list comprehension. Here is the nicer list comprehension
version of the above two examples:
In [1]: squares = [i**2 for i in range(10, 16)]
In [2]: squares
Out[2]: [100, 121, 144, 169, 196, 225]
In [3]: oddsquares = [i**2 for i in range(10, 16) if i%2==1]
In [4]: oddsquares
Out[4]: [121, 169, 225]
We can nest loops within each other - for example, to generate labels for a 96-well plate, we can do this list comprehension:
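A nested list comprehension for the well labels might look like the sketch below. It is an illustration rather than the exact listing from the workshop; the row string follows the loop version in the text, and the code runs under both Python 2.7 and Python 3.

```python
# Rows A-F match the nested-loop version in the text; a full 96-well
# plate (8 rows x 12 columns) would use 'ABCDEFGH' instead.
# The first "for" is the outer loop, the second "for" is the inner loop.
wells = ['%s%02d' % (r, c) for r in 'ABCDEF' for c in range(1, 13)]

print(wells[:15])
```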
It might be clearer to understand what is happening using the longer version of creating an empty list, then using
nested for loops:
In [1]: wells = []
In [2]: for r in 'ABCDEF':
   ...:     for c in range(1,13):
   ...:         wells.append('%s%02d' % (r, c))
   ...:
In [3]: wells[:15]
Out[3]:
['A01',
 'A02',
 'A03',
 'A04',
 'A05',
 'A06',
 'A07',
 'A08',
 'A09',
 'A10',
 'A11',
 'A12',
 'B01',
 'B02',
 'B03']
Don't worry about the funny '%02d' and '%s' bits in the code. We will explain string interpolation in the next
session. Congratulations! You have learnt the basic building blocks of a python program. Once you complete the
exercise below, we will finish off our Python crash course with I/O and creating your own functions.
then save it as expert.py. Now open a new terminal, and run the program like so:
eris:~ cliburn$ python expert.py
I am an expert programmer
You can also run Python programs from within ipython with the run keyword:
In [1]: run expert.py
ERROR: File expert.py not found.
For the above two programs to run, you need to be in the same directory as expert.py. Later, Jacob will show you
how to run programs in arbitrary locations.
4.9.10 Exercise
It is claimed by an interviewer that the majority of computer science graduates cannot write a correct solution to
this problem (http://imranontech.com/2007/01/24/using-fizzbuzz-to-find-developers-who-grok-coding/). Are you better than the average computer science graduate?
Write a program that prints the numbers from 1 to 100. But for multiples
of three print Fizz instead of the number and for the multiples of five
print Buzz. For numbers which are multiples of both three and five print
FizzBuzz.
You should save your program in a text editor and execute it as shown above. You can experiment within
ipython, and copy and paste working code from ipython into the text editor. The instruction history -n
startline:stopline shows you code in your history without the line numbers, which is convenient for cutting and
pasting. Do it in stages - first, how do you print the numbers from 1 to 100? Next, how do you find multiples of 3?
How do you change what is printed if the number is a multiple of 3? And so on ...
Typically, the condition is updated within the body of the while statement such that it eventually becomes false. A
simple example follows:
In [1]: i = 0
In [2]: while (i < 5):
   ...:     i = i+1
   ...:     print i
   ...:
1
2
3
4
5
Note that the variables to be inserted into the string are given as a tuple following the % separator. However, the default
formatting leaves something to be desired. For numbers, we can specify the minimum width, as well as the number of
decimal places when the number is a float.
In [1]: '%4d' % 123
Out[1]: ' 123'
In [2]: '%4d' % 12345
Out[2]: '12345'
In [3]: '%5f' % 3.14
Out[3]: '3.140000'
In [4]: '%5.2f' % 3.14
Out[4]: ' 3.14'
Sometimes it is also convenient to pad strings or change alignment so rows line up nicely using the flags 0 (left pad
with zeros), ' ' (a space: left pad with spaces), - (left align) and + (add sign character).
In [1]: '%05d' % 23
Out[1]: '00023'
In [2]: '% 5d' % 23
Out[2]: '   23'
In [3]: '%-5d' % 23
Out[3]: '23   '
In [4]: '%+5d' % 23
Out[4]: '  +23'
Simple tables are often created using loops and string interpolation. For example, here is the code to print out the
layout of a 96 well plate:
In [1]:
...:
A01 A02 A03 A04 A05 A06 A07 A08 A09 A10 A11 A12
B01 B02 B03 B04 B05 B06 B07 B08 B09 B10 B11 B12
C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12
D01 D02 D03 D04 D05 D06 D07 D08 D09 D10 D11 D12
E01 E02 E03 E04 E05 E06 E07 E08 E09 E10 E11 E12
F01 F02 F03 F04 F05 F06 F07 F08 F09 F10 F11 F12
The join method of a string joins together all the strings in a list, separated by the original string. In this case the
original string is a space ' ', so all the strings in the list comprehension will be joined with spaces separating them
before being printed. Take your time to deconstruct this short example - it pulls together many concepts - looping, list
comprehension, string interpolation and the use of the string method join.
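A plate layout of this kind could be produced along the following lines. This is a sketch under our own assumptions (rows A to F, 12 columns, and the helper names lines and labels), not necessarily the original listing; it runs under both Python 2.7 and Python 3.

```python
lines = []
for r in 'ABCDEF':
    # list comprehension plus string interpolation: one label per column
    labels = ['%s%02d' % (r, c) for c in range(1, 13)]
    # join the labels of one row with single spaces
    lines.append(' '.join(labels))

# join the rows with newlines and print the whole table at once
print('\n'.join(lines))
```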
4.10.3 Mini-exercise
1. Write a program that produces these 2 lines of output from range(1,11):
0001 0002 0003 0004 0005 6.00 7.00 8.00 9.00 10.00
2. Write a program that starts with range(1, 6) and ends up with this string 1-one-thousand-2-one-thousand-3-one-thousand-4-one-thousand-5, using a list comprehension, the str() function and a string join.
Opening files for writing or appending is similar, but replace the 'r' in the argument with 'w' or 'a'. Remember if you
open the file sequence1.txt with the 'w' flag, the current contents are gone forever.
OK. Now we will open a file for writing, write some lines, close it, open again for appending more lines, close it, and
finally open again for reading.
In [1]: graffiti = '\n'.join(['Roses are red', 'Violets are blue', 'The dog is pregnant', 'Thanks to
In [2]: fo = open('graffiti.txt', 'w')
In [3]: fo.write(graffiti)
In [4]: fo.close()
Here, we write some lines of doggerel in a list, join them as separate lines with the newline separator \n, then write it
to a file called graffiti.txt that has been opened for writing. Sometimes, you will see another newer idiom for
opening files:
In [1]: with open('graffiti.txt', 'w') as fo:
   ...:     fo.write(graffiti)
   ...:
The difference is that when using the with statement, you don't need to remember to close the file handle. The
operating system limits the number of file handles that are available, and once that limit is reached, further attempts
to open a file will fail. Closing the file frees up the resource, but it is easy to forget to do so in more complicated programs, hence the
availability of the with statement. Either way is fine. You can see what's in the file by using less graffiti.txt
either in ipython or on the command line.
Let's add another line for the author of the poem:
In [1]: fo = open('graffiti.txt', 'a')
In [2]: fo.write('\n' + 'by anonymous college toilet poet')
In [3]: fo.close()
Note that we add a newline \n before the attribution string so that it appears on a separate line.
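The whole write, append and read cycle can be sketched with the with statement as follows. To keep the example self-contained it writes to a scratch directory created with tempfile rather than to the current directory; the variable names path and poem are our own, and the code runs under both Python 2.7 and Python 3.

```python
import os
import tempfile

# scratch location so that no existing graffiti.txt is overwritten
path = os.path.join(tempfile.mkdtemp(), 'graffiti.txt')

with open(path, 'w') as fo:      # 'w' creates the file or empties an existing one
    fo.write('Roses are red\nViolets are blue')

with open(path, 'a') as fo:      # 'a' adds to the end of the file
    fo.write('\n' + 'by anonymous college toilet poet')

with open(path, 'r') as fi:      # 'r' opens the file for reading
    poem = fi.read().splitlines()

print(poem)
```

Each with block closes its file automatically when the block ends, so there is no call to close.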
4.10.5 Mini-exercise
1. Find the AT/GC ratio in sequence1.txt.
2. Find all palindromes of length = 9 in sequence1.txt and save them to a file called palindromes.txt.
3. Now, re-open palindromes.txt and append all palindromes of length 8 to the file.
In [2]: poem
Out[2]: 'Roses are red\nViolets are blue\nThe dog is pregnant\nThanks to you\nby anonymous college to
In [3]: poem = open('graffiti.txt', 'rU').readlines()
In [4]: poem
Out[4]:
['Roses are red\n',
 'Violets are blue\n',
 'The dog is pregnant\n',
 'Thanks to you\n',
 'by anonymous college toilet poet']
We can now process the string in poem1 or the list in poem2 as necessary.
4.10.7 Exercise
1. Convert the contents of the file graffiti.txt to all uppercase letters. That is, calling cat or less on graffiti.txt
should look like this before and after your program is run:
BEFORE
eris:pcfb cliburn$ cat graffiti.txt
Roses are red
Violets are blue
The dog is pregnant
Thanks to you
by anonymous college toilet poet
AFTER
eris:pcfb cliburn$ cat graffiti.txt
ROSES ARE RED
VIOLETS ARE BLUE
THE DOG IS PREGNANT
THANKS TO YOU
BY ANONYMOUS COLLEGE TOILET POET
eris:pcfb cliburn$
2. Count the number of times each word appears in hamlet.txt found in the
examples folder. For this exercise, we define a word to be any string of characters
that is separated by white space (space, tab, newline). We also ignore
1. Open the file hamlet.txt and assign its contents to a variable as a single string
2. Convert the string to lower case
3. Remove all punctuation characters from the string (punctuation characters are !#$%&()*+,./:;<=>?@[\]^_{|}~, which you can also find in the string module)
4. Split the string into a list of words, where a word is defined to be any sequence of characters separated by white
space
5. Create an empty dictionary to store word counts
6. Loop over the list of words and increment the dictionary count for that word by 1
7. Print the number of occurrences of hamlet in Hamlet
8. Close the file if necessary
The use of the built-in sum function hides the details of having to initialize the sum to zero and looping over each
number while adding that number to the sum variable. While Python comes with many useful built-in functions, sooner
or later, you will need to write your own functions. As you will see, writing your own functions is really simple. Let's
write our version of the sum function and a product function that when given a sequence of numbers, returns the
product rather than the sum of numbers. We will save the functions in examples/functions.py.
def sum(xs):
    """Given a sequence of numbers, return the sum."""
    s = 0
    for x in xs:
        s += x
    return s

def prod(xs):
    """Given a sequence of numbers, return the product."""
    s = 1
    for x in xs:
        s *= x
    return s
4.10.9 Mini-exercise
1. Write a function that returns the cumulative sum of numbers in a list. For example, if the function is given the
list [1,2,3,4,5], it should return the list [ 1, 3, 6, 10, 15].
2. Write a function fib that generates the first n Fibonacci numbers. The Fibonacci numbers are the sequence
[1,1,2,3,5,8,13,...], where each successive number is the sum of the two preceding numbers. Here are some
results that your function should give:
In [1]: fib(1)
Out[1]: [1]
In [2]: fib(2)
Out[2]: [1, 1]
In [3]: fib(3)
Out[3]: [1, 1, 2]
In [4]: fib(4)
Out[4]: [1, 1, 2, 3]
In [5]: fib(10)
Out[5]: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
If calling functions.prod is too verbose for you, you can modify the import statement like so:
In [1]: from functions import prod
Just be aware that this will make any existing function with the name prod inaccessible. So for instance, if we
used from functions import sum, we would no longer have access to the built in sum function. Whereas
if we used import functions, we can choose which function to use - sum will use the built-in function, while
functions.sum will use our function. We recommend using the full name all the time to avoid such name clashes,
using a shorter alias for the imported module with the as keyword if you are really lazy.
In [1]: import functions as f
In [2]: f.prod(xs)
Out[2]: 24
There is another way to write short throwaway functions for one-time use that is much terser, using lambda or anonymous functions:
In [1]: f = lambda x: x*x
In [2]: f(3)
Out[2]: 9
This use of lambda is typically seen in the context of the built-in higher order functions (functions that take functions
as arguments) map and filter.
In [1]: filter(lambda x: x % 2==0, range(10))
Out[1]: [0, 2, 4, 6, 8]
In [2]: map(lambda x: x**2, range(5))
Out[2]: [0, 1, 4, 9, 16]
In general, Python programmers prefer to use defined rather than anonymous functions, and the use of list comprehensions rather than map and filter as they are more explicit and easier to understand, but you may come across
lambda, map and filter in books or on the web.
4.10.12 Mini-Exercise
1. Replace the filter and map functionality in the above example using list comprehension.
2. Rewrite f = lambda x: x**2 as a regular function also called f using def.
1 2 3 4
In [4]: f(d=1, c=2, b=3, a=4)
4 3 2 1
Warning: When you assign a list or a dictionary as a default value for an argument, the list is created once, at the time
the function is defined, and persists over subsequent function calls if not overwritten. That is probably not what you
intended - if you do not want the default list to persist, you have to set the default to None in the argument, then set it to
the empty list in the function after checking that it has not been assigned. An example should make this clear:
# we set b to have a default of an empty list
In [1]: def f(a, b=[]):
   ...:     b.append(a)
   ...:     print a, b
   ...:
# but the behavior is rather counter-intuitive
In [2]: f(2)
2 [2]
In [3]: f(3)
3 [2, 3]
# if we over-write the default argument, everything is OK
In [4]: f(3, [1,2])
3 [1, 2, 3]
# this is the way to get the non-persistent behavior
In [5]: def f(a, b=None):
   ...:     if b is None:
   ...:         b = []
   ...:     b.append(a)
   ...:     print a, b
   ...:
In [6]: f(2)
2 [2]
In [7]: f(3)
3 [3]
4.10.14 Exercise
1. Write a function that finds palindromic sequences of length k from a string, and use it to find all palindromic
sequences of length 9 in sequence1.txt in the examples folder. The function should take 2 arguments, the string
and k, the palindrome length
2. Write a program that plays the children's guessing game with you. Running the program and playing with it
looks like this:
eris:examples cliburn$ python guessing.py
I'm thinking of a number between 1 and 100. Guess what it is!
Guess a number: 50
Too small
Guess a number: 75
Too large
Guess a number: 63
Too large
Guess a number: 56
Too small
Guess a number: 60
Too large
Guess a number: 58
You've guessed it! The number is 58
4.11.1 os module
The os module provides a platform independent way to work with the operating system, for example to make or remove files and
directories.
In [1]: import os
In [2]: print os.getcwd()
/home/jacob
In [3]: os.chdir('/home/jacob/baz')
In [4]: os.getcwd()
Out[4]: '/home/jacob/baz'
In [5]: os.remove('foo')
In [6]: os.chdir('/home/jacob')
In [7]: os.rmdir('baz')
row in cfr:
    print ', '.join(row)
eggs, spam
spam and eggs, spam
If you'd prefer a different separator than commas, the optional delimiter argument can be used:
In [1]: import csv
In [2]: f = open('tabbs.csv', 'rU')
In [3]: csvfile = csv.reader(f, delimiter='\t')
In [4]: for row in csvfile:
   ...:     print row
   ...:
['1', '2', '3']
['2', '3', '4']
['4', '5', '6']
Often you'll want to skip the first row in a csv file, and a simple way to do that is
import csv
f = open('test_scores.csv', 'rU')
csvfile = csv.reader(f)
header = False
for row in csvfile:
    if not header:
        header = True   # first row seen: skip it
    else:
        print row
f.close()
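An alternative sketch for skipping the header uses the built-in next function to consume the first row before the loop starts. The data rows below are made up for illustration and are not the contents of test_scores.csv; a plain list of strings stands in for the open file, since csv.reader accepts any iterable of lines. The code runs under both Python 2.7 and Python 3.

```python
import csv

# Made-up stand-in for the lines of a csv file
lines = ["name,sex,age,score", "ann,F,30,88", "bob,M,41,75"]

csvfile = csv.reader(lines)
header = next(csvfile)               # read and set aside the first row
rows = [row for row in csvfile]      # the remaining rows are the data

print(header)
print(rows)
```

Keeping the header row in a variable is handy when you later need to look up a column by name.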
csv exercise
Use the test_scores.csv file to calculate the average score (column 4) for each sex (column 2).
4.11.3 sys.argv
The sys module contains many objects and functions for dealing with how python was compiled or called when
executed. The most significant is argv, which is a list containing all the parameters passed on the command line when
python was executed, including the name of the python program in position 0. Note that all the elements of sys.argv are
strings - if you want a number, you will have to convert it using int() or float(). For example, if you want to
assign the argument at position 1 as an integer variable, you can use n = int(sys.argv[1]).
import sys
if len(sys.argv) > 1:
    print sys.argv
else:
    print 'no arguments passed'
[jacob@moku ~]$ python argv_example.py foo bar baz
['argv_example.py', 'foo', 'bar', 'baz']
sys.argv exercise
Write a program that takes two arguments, your first name and your age, and then prints out your name and the year
you were born.
In [4]: math.log10(100)
Out[4]: 2.0
In [5]: math.log(math.e)
Out[5]: 1.0
In [7]: math.cos(math.pi)
Out[7]: -1.0
In [10]: math.exp(1)
Out[10]: 2.7182818284590455
In [11]: math.pow(5,2)
Out[11]: 25.0
In [12]: math.sin(math.pi)
Out[12]: 1.2246467991473532e-16
The time module provides simple estimates for how long a command takes.
In [1]: import time
In [2]: a = time.time()
In [3]: time.sleep(10)
In [4]: b = time.time()
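Subtracting the two time stamps gives the elapsed wall-clock time in seconds. A minimal sketch follows, using a short sleep as a stand-in for the command being timed; the names start and elapsed are our own, and the code runs under both Python 2.7 and Python 3.

```python
import time

start = time.time()
time.sleep(0.1)                  # stand-in for the command being timed
elapsed = time.time() - start    # seconds of wall-clock time

print("elapsed: %.2f s" % elapsed)
```

This measures wall-clock time, so it includes any time the computer spent on other tasks in between.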
Uninstalling xlrd:
/Users/cliburn/Library/Python/2.7/site-packages/xlrd
/Users/cliburn/Library/Python/2.7/site-packages/xlrd-0.7.7-py2.7.egg-info
/Users/cliburn/bin/runxlrd.py
Proceed (y/n)? y
Successfully uninstalled xlrd
4.11.10 Exercise
Write a program that takes a number on the command line and calculates the log, square, sin and cosine, and writes
them out in a csv file.
The main object in NumPy is the homogeneous, multidimensional array. An array is a table of elements. For example,
the matrix

$$x = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

can be represented as
>>> import numpy as np
>>> x = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> x
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> x.shape
(3, 3)
The array x has 2 dimensions. In NumPy the number of dimensions is referred to as rank. The ndim attribute is the same as
the number of axes, or the length of the output of x.shape:
>>> x.ndim
2
>>> x.size
9
>>> x.sum(axis=0)
array([12, 15, 18])
>>> x.sum(axis=1)
array([ 6, 15, 24])
>>> x.mean(axis=0)
array([ 4., 5., 6.])
>>> x.mean(axis=1)
array([ 2., 5., 8.])
But arrays are also useful because they interact with other NumPy functions as well as being central to other package
functionality. To make a sequence of numbers, similar to range in the Python standard library, we use arange.
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(5,10)
array([5, 6, 7, 8, 9])
>>> np.arange(5,10,0.5)
array([ 5. ,  5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])
Also we can recreate the first matrix by reshaping the output of arange.
>>> x = np.arange(1,10).reshape(3,3)
>>> x
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Another function similar to arange is linspace, which fills a vector with evenly spaced values over a specified interval.
>>> x = np.linspace(0,5,5)
>>> x
array([ 0.  ,  1.25,  2.5 ,  3.75,  5.  ])
Visualizing linspace...
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
N = 8
y = np.zeros(N)
x1 = np.linspace(0, 10, N, endpoint=True)
p1 = plt.plot(x1, y, 'o')
ax.set_xlim([-0.5,10.5])
plt.show()
[Figure: the eight points returned by np.linspace(0, 10, 8), plotted as circles at y = 0 along the x axis from 0 to 10]
There are several convenience functions for making arrays that are worth mentioning:
zeros
ones
>>> x = np.zeros([3,4])
>>> x
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
>>> x = np.ones([3,4])
>>> x
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
Exercise
1. Create the following array (1 line)
$$a = \begin{pmatrix} 1 & 2 & \cdots & 10 \\ 11 & 12 & \cdots & 20 \\ \vdots & \vdots & \ddots & \vdots \\ 91 & 92 & \cdots & 100 \end{pmatrix}$$
2. Use the array object to get the number of elements, rows and columns
3. Get the mean of the rows and columns
4. What do you get when you do this?
>>> a[4,:]
NumPy - basics
Quick reference
Here we provide a quick reference guide to the commonly used functions from the NumPy package along with several
frequently encountered examples.
NumPy command               Note
a.ndim                      returns the number of dimensions
a.shape                     returns the number of rows and columns
arange(start,stop,step)     returns a sequence vector
linspace(start,stop,steps)  returns an evenly spaced sequence in the specified interval
dot(a,b)                    matrix multiplication
vstack([a,b])               stack arrays a and b vertically
hstack([a,b])               stack arrays a and b horizontally
where(a>x)                  returns the indices of the elements that satisfy the condition
argsort(a)                  returns the indices that would sort the input array
Basic operations
>>> a = np.array([3,4,5])
>>> b = np.ones(3)
>>> a - b
array([ 2.,  3.,  4.])
Something that can be tricky for people familiar with other programming languages is that the * operator does not
carry out a matrix product. This is done with the dot function.
>>> a = np.array([[1,2],[3,4]])
>>> b = np.array([[1,2],[3,4]])
>>> a
array([[1, 2],
[3, 4]])
>>> b
array([[1, 2],
[3, 4]])
>>> a * b
array([[ 1, 4],
[ 9, 16]])
>>> np.dot(a,b)
array([[ 7, 10],
[15, 22]])
Concatenation
>>> a = np.array([1,2,3])
>>> b = np.array([4,5,6])
>>> c = np.array([7,8,9])
>>> np.hstack([a,b,c])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.vstack([a,b,c])
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Sorting arrays
>>> x = np.array([1,3,4,0,0,5])
>>> np.argsort(x)
array([3, 4, 0, 1, 2, 5])
>>> x.sort()
>>> x
array([0, 0, 1, 3, 4, 5])
5.44139809,
6.28318531])
4h    12h   24h   48h
0.12  0.08  0.06  0.02
0.01  0.07  0.11  0.09
0.03  0.04  0.04  0.02
0.05  0.09  0.11  0.14
Tip:
>>> geneList =
>>> values0 =
>>> values1 =
>>> values2 =
>>> values3 =
Additional NumPy
Indexing and Slicing
1D arrays can be indexed in the same way a Python list can.
>>> a = np.arange(10)
>>> a[2:4]
array([2, 3])
>>> a[:10:2]
array([0, 2, 4, 6, 8])
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
Where
>>> a = np.array([1,1,1,2,2,2,3,3,3])
>>> a[a>1]
array([2, 2, 2, 3, 3, 3])
>>> a[a==3]
array([3, 3, 3])
>>> np.where(a<3)
(array([0, 1, 2, 3, 4, 5]),)
>>> np.where(a<3)[0]
array([0, 1, 2, 3, 4, 5])
>>> np.where(a>9)
(array([], dtype=int64),)
Printing
>>> for row in x:
...     print row
...
[0 1 2 3]
[4 5 6 7]
[ 8  9 10 11]
>>> for element in x.flat:
...     print element
...
0
1
2
3
4
5
6
7
8
9
10
11
Copying
>>> a = np.array(['a','b','c'])
>>> b = a
>>> b[1] = 'z'
>>> a
array(['a', 'z', 'c'],
      dtype='|S1')
>>> a = np.array(['a','b','c'])
>>> b = a.copy()
>>> b[1] = 'z'
>>> a
array(['a', 'b', 'c'],
      dtype='|S1')
Missing data
>>> import numpy as np
>>> from scipy.stats import nanmean
>>> a = np.array([[1,2,3],[4,5,np.nan],[7,8,9]])
>>> a
array([[  1.,   2.,   3.],
       [  4.,   5.,  nan],
       [  7.,   8.,   9.]])
>>> columnMean = nanmean(a,axis=0)
>>> columnMean
array([ 4., 5., 6.])
>>> rowMean = nanmean(a,axis=1)
>>> rowMean
array([ 2. , 4.5, 8. ])
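In more recent releases the same column and row means can be computed with NumPy alone. The sketch below assumes NumPy 1.8 or later, where np.nanmean is available; it is offered as an alternative to the scipy.stats import used above, not as part of the original session.

```python
import numpy as np

# same matrix as above, with one missing value
a = np.array([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]])

# NaN-aware means: missing entries are skipped rather than propagated
column_mean = np.nanmean(a, axis=0)
row_mean = np.nanmean(a, axis=1)

print(column_mean)
print(row_mean)
```

The plain a.mean(axis=0) would return nan for the third column, which is why the NaN-aware variant is needed here.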
There are many useful functions in random; however, we are showing only a few so that they will be familiar when we
get to plotting.
NumPy - linear algebra
Linear algebra is a branch of mathematics concerned with vector spaces and the mappings between those spaces.
NumPy has a package called linalg. This page is meant only to familiarize those who are interested with NumPy's
linear algebra functions.
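An N × 1 dimensional vector x

\[
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}
\]

and its transpose \(x^T = (x_1, x_2, \ldots, x_N)\) can be expressed in Python as
>>> import numpy as np
>>> x = np.array([[1,2,3]]).T
>>> xt = x.T
>>> x.shape
(3, 1)
>>> xt.shape
(1, 3)

\[
x = \begin{pmatrix} 3 \\ 4 \\ 5 \\ 6 \end{pmatrix}
\]

>>> x = np.array([[3,4,5,6]]).T

\[
x = \begin{pmatrix} 3 & 4 & 5 & 6 \end{pmatrix}
\]

>>> x = np.array([[3,4,5,6]])

\[
A_{m,n} = \begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix}
\]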
Common tasks
Matrix determinant
>>> a = np.array([[3,-9],[2,5]])
>>> np.linalg.det(a)
33.000000000000014
Matrix inverse
>>> A = np.array([[-4,-2],[5,5]])
>>> A
array([[-4, -2],
[ 5, 5]])
>>> invA = np.linalg.inv(A)
>>> invA
array([[-0.5, -0.2],
[ 0.5, 0.4]])
>>> np.round(np.dot(A,invA))
array([[ 1., 0.],
[ 0., 1.]])
Because \(AA^{-1} = A^{-1}A = I\).
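Rounding works for this small example, but a tolerance-based comparison is a more general way to confirm an inverse. The following is a minimal sketch using np.allclose and np.eye; the variable names are illustrative only.

```python
import numpy as np

A = np.array([[-4, -2], [5, 5]])
inv_A = np.linalg.inv(A)

# compare both products with the identity, allowing for floating point error
identity = np.eye(2)
is_inverse = (np.allclose(np.dot(A, inv_A), identity) and
              np.allclose(np.dot(inv_A, A), identity))
print(is_inverse)
```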
Eigenvalues and Eigenvectors
>>> a = np.diag((1, 2, 3))
>>> a
array([[1, 0, 0],
[0, 2, 0],
[0, 0, 3]])
>>> w,v = np.linalg.eig(a)
>>> w;v
array([ 1., 2., 3.])
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
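One way to read the output of eig is that each column of v is an eigenvector paired with the eigenvalue at the same position in w. The sketch below checks the defining relation a·v = w·v for every pair; it is an illustration rather than part of the original session.

```python
import numpy as np

a = np.diag((1, 2, 3))
w, v = np.linalg.eig(a)

# column v[:, i] is the eigenvector that belongs to eigenvalue w[i]
checks = [np.allclose(np.dot(a, v[:, i]), w[i] * v[:, i])
          for i in range(len(w))]
print(checks)
```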
This is by no means a complete list; the SciPy package also has additional functions if this is an area of interest.
Bibliographic notes
1. Duda, R. O., Hart, P. E. & Stork, D. G. Pattern Classification, John Wiley & Sons, Inc., 2001.
Useful links
NumPy homepage
Official NumPy tutorial
NumPy for MATLAB users
The most frequently used plotting package in Python, matplotlib, is written in pure Python and is heavily dependent
on NumPy. The main webpage introduction itemizes what John Hunter (mpl creator) was looking for in a plotting
toolkit.
Plots should look great - publication quality. One important requirement for me is that the text looks good
(antialiased, etc.)
Postscript output for inclusion with TeX documents
Embeddable in a graphical user interface for application development
Code should be easy enough that I can understand it and extend it
The Axes class is the most important class in mpl. The following three lines are used to get an Axes instance ready for
use.
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(2,1,1)
After a figure is drawn you may save it and/or display it with the following.
>>> fig.savefig('foo.png',dpi=200)
>>> plt.show()
The dpi argument is optional, and figures can be saved in a number of formats, including JPEG, PNG, TIFF, PDF and EPS.
Here is the example from the artist tutorial.
An example
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
fig.subplots_adjust(top=0.8)
ax1 = fig.add_subplot(211)
ax1.set_ylabel('volts')
ax1.set_title('a sine wave')
t = np.arange(0.0, 1.0, 0.01)
s = np.sin(2*np.pi*t)
line, = ax1.plot(t, s, color='blue', lw=2)
ax2 = fig.add_axes([0.15, 0.1, 0.7, 0.3])
n, bins, patches = ax2.hist(np.random.randn(1000), 50,
                            facecolor='yellow', edgecolor='yellow')
ax2.set_xlabel('time (s)')
[Figure: top panel titled "a sine wave" with y-axis "volts"; bottom panel a histogram with x-axis "time (s)"]
[Figure: a single panel titled "a sine wave" with y-axis "volts"]
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
## the data
t = np.arange(0.0, 1.0, 0.01)
s = np.sin(2*np.pi*t)
## the top axes
ax1 = fig.add_subplot(3,1,1)
ax1.set_ylabel('volts')
ax1.set_title('a sine wave')
line1 = ax1.plot(t, s+5.0, color='blue', lw=2)
line2 = ax1.plot(t, s+2.5, color='red', lw=2)
line3 = ax1.plot(t, s, color='orange', lw=2)
## the middle axes
ax2 = fig.add_subplot(3,1,2)
ax2.set_ylabel('volts')
ax2.set_title('a sine wave')
line1 = ax2.plot(t, s+5.0, color='black', lw=2,linestyle="--")
line2 = ax2.plot(t, s+2.5, color='black', lw=2,linestyle="-.")
line3 = ax2.plot(t, s, color='#000000', lw=2,linestyle=":")
## the third axes
ax3 = fig.add_subplot(3,1,3)
ax3.set_ylabel('volts')
ax3.set_title('a sine wave')
line1 = ax3.plot(t,s+5.0, color='blue', marker="+")
line2 = ax3.plot(t,s+2.5, color='red', marker="o")
line3 = ax3.plot(t,s, color='orange', marker="^")
[Figure: three stacked panels, each titled "a sine wave" with y-axis "volts", showing the three curves drawn with different colors, line styles and markers respectively]
Box plots
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
x1 = np.random.normal(0,1,50)
x2 = np.random.normal(1,1,50)
x3 = np.random.normal(2,1,50)
ax.boxplot([x1,x2,x3])
plt.show()
[Figure: box plots of x1, x2 and x3]
## the bars
rects1 = ax.bar(ind, menMeans, width,
                color='black',
                yerr=menStd,
                error_kw=dict(elinewidth=2,ecolor='red'))
rects2 = ax.bar(ind+width, womenMeans, width,
                color='red',
                yerr=womenStd,
                error_kw=dict(elinewidth=2,ecolor='black'))
# axes and labels
ax.set_xlim(-width,len(ind)+width)
ax.set_ylim(0,45)
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
xTickMarks = ['Group'+str(i) for i in range(1,6)]
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)
## add a legend
ax.legend( (rects1[0], rects2[0]), ('Men', 'Women') )
plt.show()
[Figure: bar chart of Scores for Group1 to Group5, with a legend for Men and Women]
[Figure: 3D plot with axes X, Y and Z]
Scatter plot
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax1 = fig.add_subplot(121)
## the data
N=1000
x = np.random.randn(N)
y = np.random.randn(N)
## left panel
ax1.scatter(x,y,color='blue',s=5,edgecolor='none')
ax1.set_aspect(1./ax1.get_data_ratio()) # make axes square
## right panel
ax2 = fig.add_subplot(122)
props = dict(alpha=0.5, edgecolors='none' )
handles = []
[Figure: two scatter plots side by side; the right panel has a legend with the entries blue, green, magenta and cyan]
Histogram plot
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.random.normal(0,1,1000)
numBins = 50
ax.hist(x,numBins,color='green',alpha=0.8)
plt.show()
[Figure: histogram of x]
Matplotlib - exercises
Yeast cell cycle data
File download
The data were originally downloaded from the Yeast Cell Cycle Analysis Project Page. These data [1] by Spellman
and Sherlock have likely been used in over a hundred papers.
1. Look at the file so you know what you are getting into (less, wc).
2. Copy this script into an editor and let's go over it.
#!/usr/bin/env python
import csv,os,sys,pickle
import numpy as np

## read the file once to get numRows and numCols
txtFilePath = os.path.join("..","data","cellcycle.txt")
numRows = 0
reader = csv.reader(open(txtFilePath,'r'),delimiter='\t')
expListIDs = reader.next()
expListIDs = np.array(expListIDs[1:])
for linja in reader:
    numRows+=1

## populate a matrix and name vectors with file info
numColumns = len(expListIDs)
exprMat = np.zeros([numRows,numColumns])
reader = csv.reader(open(txtFilePath,'r'),delimiter='\t')
header = reader.next()
rowInd = 0
geneList = []
for linja in reader:
    row = np.array(linja[1:])
    newRow = np.zeros(len(row),)
    nanInds = np.where(row == '')
    goodInds = np.where(row != '')
    newRow[nanInds] = np.nan
    newRow[goodInds] = [float(element) for element in row[goodInds]]
    exprMat[rowInd,:] = newRow
    rowInd +=1
    geneList.append(linja[0])
geneList = np.array(geneList)

## print out info
print ".............."
print "matrix of size (%s,%s) created..."%(exprMat.shape)
print "gene list size - %s"%geneList.size
print "exp list size - %s"%expListIDs.size
print ".............."

## write the data to a file
outFilePath = os.path.join(".","excercise-np.pickle")
tmp = open(outFilePath,'w')
pickle.dump([geneList,expListIDs,exprMat],tmp)
tmp.close()
3. Save the file using your editor to a directory and use it to read cellcycle.txt. To do this you will need to change
at least one line.
4. Create your own script(s) that do the following:
opens the pickle file
calculates the mean expression value for a gene
plots expression values for a given gene in a histogram
[extra] for a given gene, creates a line plot that shows expression values for all the conditions
[extra] makes a plot that has boxplots for the 5 genes with the greatest expression mean
[extra] creates a scatter plot (1 condition) where the negative expression values are green and positive ones
are red
Note that:
>>> a = np.array([1,2,3,np.nan])
>>> a.max()
nan
>>> a.min()
nan
>>> a.mean()
nan
>>> np.where(np.isnan(a)==False)[0]
array([0, 1, 2])
Also, note that we do not transform, normalize or otherwise process the data in this example. We are using this data
set as a learning tool. The missing data difficulties that arise are common in the biological sciences.
Bibliographic notes
4.13 Biopython I
From the Biopython:
Biopython is a set of freely available tools for biological computation written in Python by an international team of
developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs
of current and future work in bioinformatics. The source code is made available under the Biopython License, which
is extremely liberal and compatible with almost every license in the world.
Note
finds regions of local similarity between sequences
multiple sequence alignment program
NCBI sequence database
Document database
SIB resource portal (Enzyme and Prosite)
Structural Classification of Proteins (e.g. dom,lin)
computationally identifies transcripts from the same locus
annotated and non-redundant protein sequence database
Chapter 4. Instructor: Cliburn Chan, Biostatistics and Bioinformatics.
Some examples will also require a working internet connection in order to run.
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> aStringSeq = str(my_seq)
>>> aStringSeq
'AGTACACTGGT'
>>> my_seq_complement = my_seq.complement()
>>> my_seq_complement
Seq('TCATGTGACCA', Alphabet())
>>> my_seq_reverse = my_seq[::-1]
>>> my_seq_rc = my_seq.reverse_complement()
>>> my_seq_rc
Seq('ACCAGTGTACT', Alphabet())
There is so much more, but before we get into it we should figure out how to get sequences in and out of Python.
File download
FASTA is a standard format for storing sequence data. Here is a little reminder about sequences.
Nucleic acid code   Note              Nucleic acid code   Note
A                   adenosine         K                   G/T (keto)
T                   thymidine         M                   A/C (amino)
C                   cytidine          R                   G/A (purine)
G                   guanine           S                   G/C (strong)
N                   A/G/C/T (any)     W                   A/T (weak)
U                   uridine           B                   G/T/C
D                   G/A/T             Y                   T/C (pyrimidine)
H                   A/C/T             V                   G/C/A
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
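A record like the one above (an identifier, a sequence and its length) comes from a FASTA file, where each record starts with a ">" header line followed by one or more lines of sequence. In practice Bio.SeqIO does the parsing; the following dependency-free sketch only illustrates what such a reader has to do. The function name read_fasta and the two example records are made up for illustration.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# a made-up two-record example (not data from the course files)
example = [
    ">seq1 hypothetical record",
    "AGTACACTGGT",
    "ACGT",
    ">seq2 another hypothetical record",
    "CGTAACAAGG",
]
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Sequence lines are joined because FASTA files usually wrap long sequences over several lines.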
You can translate directly from the DNA coding sequence or you can use the mRNA directly.
Now, you may want to translate the nucleotides up to the first in-frame stop codon, and then stop (as happens in nature):
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
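To make the effect of to_stop concrete, here is a small sketch of codon-by-codon translation in plain Python using the standard genetic code. It is not how Biopython implements translate, and the input sequence is an assumption chosen because it reproduces the protein shown above.

```python
from itertools import product

# standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))

def translate(dna, to_stop=False):
    """Translate a DNA string in frame 0; optionally stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# assumed input: a sequence that yields the protein shown above
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
```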
Exercise
1. There is so much stuff available in biopython. What happens if you do this?
>>>
>>>
>>>
>>>
>>>
# exercise 3 -- get the reverse of the sequence (just like for lists)
We use a list here to save the gene records from a FASTA file
If you have many records a dictionary will make the program faster.
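The difference between the two containers can be sketched as follows. The records here are hypothetical (identifier, sequence) pairs standing in for parsed FASTA entries; with a list every lookup scans the records, while a dictionary is indexed once and then looked up by key.

```python
# hypothetical records as (identifier, sequence) pairs
records = [("gene1", "ATGC"), ("gene2", "GGCC"), ("gene3", "TTAA")]

def find_in_list(records, wanted):
    """Scan the list until the identifier matches."""
    for identifier, sequence in records:
        if identifier == wanted:
            return sequence
    return None

# build the dictionary once; each later lookup is a single key access
record_index = dict(records)

print(find_in_list(records, "gene3"))
print(record_index["gene3"])
```

For a handful of records the difference is negligible; it matters when the same file is queried many times.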
Just as it is easy to read a set of files, we can save modified versions (e.g. for QA). Also, it is almost exactly the same code
as above to parse sequences from a GenBank (.gb) file.
There is really far too much to cover in the time we have, but if you have Next Generation Sequencing data then refer
to sections 4.8, 16.1.7 and 16.1.8 of the Biopython tutorial. There is even support for binary formats (e.g. SFF).
4.14 Biopython II
4.14.1 Biopython - Entrez databases
NCBI's Guidelines
Taken from the tutorial.
Before using Biopython to access the NCBI's online resources (via Bio.Entrez or some of the other modules), please
read the NCBI's Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban
your access!
To paraphrase:
For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey.
Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address.
Make no more than three requests every second (relaxed from at most one request every three seconds in early 2009).
This is automatically enforced by Biopython.
Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this
as a parameter with each call to Entrez (e.g. include email="A.N.Other@example.com" in the argument list), or as of
Biopython 1.48, you can set a global email address:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
Bio.Entrez will then use this email address with each call to Entrez. The example.com address is a reserved domain
name specifically for documentation (RFC 2606). Please DO NOT use a random email; it's better not to give an
email at all. The email parameter will be mandatory from June 1, 2010. In case of excessive usage, NCBI will attempt
to contact a user at the e-mail address provided prior to blocking access to the E-utilities.
If you are using Biopython within some larger software suite, use the tool parameter to specify this. You can either
explicitly set the tool name as a parameter with each call to Entrez (e.g. include tool="MyLocalScript" in the argument
list), or as of Biopython 1.54, you can set a global tool name:
>>> from Bio import Entrez
>>> Entrez.tool = "MyLocalScript"
The tool parameter will default to Biopython. For large queries, the NCBI also recommend using their session history
feature (the WebEnv session cookie string, see Section 8.15). This is only slightly more complicated.
Other databases?
>>> handle = Entrez.esearch(db="nucleotide",term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["Count"]
'75'
>>> handle.close()
>>> print record
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
/sequence_version=1
/source=chloroplast Selenipedium aequinoctiale
/taxonomy=[Eukaryota, Viridiplantae, Streptophyta, Embryophyta, Tracheophyta, Spermatophyt
/keywords=[]
/references=[Reference(title=Phylogenetic utility of ycf1 in orchids: a plastid gene more variable t
/accessions=[EU490707]
/data_file_division=PLN
/date=15-JAN-2009
/organism=Selenipedium aequinoctiale
/gi=186972394
Seq(ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA, IUPACAmbiguousDNA())
>>> handle = Entrez.efetch(db="pubmed", id="21210977")
>>> print handle.read()
Select
The select command retrieves data from the database.
sqlite> select * from people;
0|Alice|Research Director|555-123-0001|4b
1|Bob|Research assistant|555-123-0002|17
2|Charles|Research assistant|555-123-0001|24
3|David|Research assistant|555-123-0001|8
sqlite> select * from experiment;
0|EBV Vaccine trial|0|A vaccine trial
1|Flu antibody study|2|Study of the morphology of flu antibodies
The * in the select statement says to select all columns. If you only need a few of the columns you can select them by
name.
sqlite> select name, phone from people;
Alice|555-123-0001
Bob|555-123-0002
Charles|555-123-0001
David|555-123-0001
sqlite> select name, description from experiment;
EBV Vaccine trial|A vaccine trial
Flu antibody study|Study of the morphology of flu antibodies
You can also limit the returned results to rows that match specified information using the where directive.
sqlite> select * from people where name == 'Alice';
0|Alice|Research Director|555-123-0001|4b
sqlite> select position from people where name == 'David';
Research assistant
Insert
Adding values to the database is done by using the insert statement.
sqlite> insert into people values (Null, 'Edward', 'Toadie', 'None', 'Basement');
sqlite> select * from people where name == 'Edward';
4|Edward|Toadie|None|Basement
Update
You can also change existing rows once they've been inserted. update takes a table name as its first argument followed
by set column = value. Without a where clause this will set the value in every row. You will therefore almost always use
a where clause so that only the specific row or rows you intend are updated.
sqlite> select * from people;
0|Alice|Research Director|555-123-0001|4b
1|Bob|Research assistant|555-123-0002|17
2|Charles|Research assistant|555-123-0001|24
3|David|Research assistant|555-123-0001|8
4|Edward|Toadie|None|Basement
sqlite> update people set name='Eddie' where id=4;
sqlite> select * from people;
0|Alice|Research Director|555-123-0001|4b
1|Bob|Research assistant|555-123-0002|17
2|Charles|Research assistant|555-123-0001|24
3|David|Research assistant|555-123-0001|8
4|Eddie|Toadie|None|Basement
Delete
Similar to updating, you can delete rows from the database. Again, you will almost always want a where clause
to prevent deleting all rows in a table.
sqlite> select * from people;
0|Alice|Research Director|555-123-0001|4b
1|Bob|Research assistant|555-123-0002|17
2|Charles|Research assistant|555-123-0001|24
3|David|Research assistant|555-123-0001|8
4|Eddie|Toadie|None|Basement
sqlite> delete from people where name='Eddie';
sqlite> select * from people;
0|Alice|Research Director|555-123-0001|4b
1|Bob|Research assistant|555-123-0002|17
2|Charles|Research assistant|555-123-0001|24
3|David|Research assistant|555-123-0001|8
Joins
The power of relational databases lies in recording relations (the foreign key in the table declaration). To join two
tables you use the join keyword in the select statement and provide a relation to join the two tables. Note that since
both the people and experiment tables have a column called name, we must give each table an alias using the as keyword.
sqlite> select p.name, e.name from people as p join experiment as e where e.researcher == p.id;
Alice|EBV Vaccine trial
Charles|Flu antibody study
In [5]: r = con.execute(select p.name, e.name from people as p join experiment as e where e.research
In [6]: for i in r:
   ...:     print "Name: %s\n\tExperiment: %s" % (i[0],i[1])
   ...:
Name: Alice
        Experiment: EBV Vaccine trial
Name: Charles
        Experiment: Flu antibody study
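The session above can be reproduced end to end with Python's built-in sqlite3 module. The sketch below builds the two tables in memory so it runs without the course database file. The column names id, name, position, phone and researcher follow the queries shown earlier; office and description are assumed names for the remaining columns, and the join is written with an on clause, which for this query selects the same rows as the where form.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table people (id integer primary key, name text, "
            "position text, phone text, office text)")  # 'office' is an assumed name
con.execute("create table experiment (id integer primary key, name text, "
            "researcher integer, description text, "
            "foreign key(researcher) references people(id))")

people = [(0, "Alice", "Research Director", "555-123-0001", "4b"),
          (1, "Bob", "Research assistant", "555-123-0002", "17"),
          (2, "Charles", "Research assistant", "555-123-0001", "24"),
          (3, "David", "Research assistant", "555-123-0001", "8")]
experiments = [(0, "EBV Vaccine trial", 0, "A vaccine trial"),
               (1, "Flu antibody study", 2,
                "Study of the morphology of flu antibodies")]
# '?' placeholders let sqlite3 do the quoting for us
con.executemany("insert into people values (?, ?, ?, ?, ?)", people)
con.executemany("insert into experiment values (?, ?, ?, ?)", experiments)

query = ("select p.name, e.name from people as p "
         "join experiment as e on e.researcher = p.id order by p.name")
rows = con.execute(query).fetchall()
for person, experiment in rows:
    print("Name: %s\n\tExperiment: %s" % (person, experiment))

# the same placeholder style works for where clauses
position = con.execute("select position from people where name = ?",
                       ("David",)).fetchone()
con.close()
```

Using placeholders instead of pasting values into the SQL string avoids the quoting mistakes that are easy to make at the sqlite prompt.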
4.15.3 Exercise:
Write a script to add a new user and experiment to the database, remove Alice, and reassign her experiments to the
new user. Then have it print out all the experiment names along with who owns each experiment.
5.781
3.6
-3.4221
133
This session will introduce you to the next stage of data analysis and data visualization. Because it is impossible to do
any justice to these areas in a few hours, the aim of this session is to provide a taste of what analysis and visualization
in Python look like, and a tour of some of the many modules available for scientific computation in Python.
\[
f(x) = \frac{A - D}{1 + (x/C)^{B}} + D
\]
where x is the concentration, A is the minimum asymptote, B is the steepness, C is the inflection point and D is the
maximum asymptote.
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
It will be straightforward to modify this code to use, for example, a five parameter logistic or other equation, offering
a flexibility rarely available with standard analysis software.
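As a minimal sketch of the model itself, the four parameter logistic can be written as a NumPy function together with a residual function of the shape that scipy.optimize.leastsq expects (residuals first, then extra data passed through args). The parameter values below are made up for illustration and the fitting call is only described, not run.

```python
import numpy as np

def logistic4(x, A, B, C, D):
    """Four parameter logistic: A = minimum asymptote, B = steepness,
    C = inflection point, D = maximum asymptote."""
    return (A - D) / (1.0 + (x / C) ** B) + D

def residuals(params, x, y):
    """Observed minus predicted values for a given parameter vector."""
    A, B, C, D = params
    return y - logistic4(x, A, B, C, D)

# made-up parameter values, for illustration only
A, B, C, D = 0.5, 2.5, 8.0, 7.3
x = np.linspace(0.0, 20.0, 5)
y = logistic4(x, A, B, C, D)

print(logistic4(0.0, A, B, C, D))  # equals the minimum asymptote A
print(logistic4(C, A, B, C, D))    # halfway between A and D
```

With noisy measurements in y, a call of the form leastsq(residuals, initial_guess, args=(x, y)) would estimate the four parameters. Moving to a five parameter model means changing only logistic4 and the unpacking in residuals.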
0.47060219,
In [3]: npr.random((3,4))
Out[3]:
array([[ 0.29302404, 0.9372624 ,
[ 0.42932372, 0.20717542,
[ 0.83125219, 0.85109042,
In [4]: npr.normal(5, 1, 4)
Out[4]: array([ 3.45966714,
0.10922504,
0.36149538,
0.18447385,
0.32720074,
3.99815107,
0.70776782,
0.41784061])
0.59367473],
0.91159639],
0.33453366]])
5.0191997 ,
5.93739408])
1.77176392,
3.96170981,
5.91184323,
6.43878462,
2.49197674,
4.57074553])
[1,2,3,4,5,6]
In [9]: npr.shuffle(x)
In [10]: x
Out[10]: [2, 1, 5, 6, 4, 3]
In [11]: npr.permutation(10)
Out[11]: array([2, 7, 4, 1, 6, 5, 8, 9, 3, 0])
For example, choosing a new sample with replacement from an existing sample (i.e. we draw one item from the data,
record what it is, then replace it in the data and repeat to get a new sample) can be done efficiently in this way:
In [1]: import numpy as np
In [2]: import numpy.random as npr
In [3]: data = np.array(['tom', 'jerry', 'mickey', 'minnie', 'pocahontas'])
In [4]: idx = npr.randint(0, len(data), (4,len(data)))
In [5]: idx
Out[5]:
array([[3, 0, 2, 1, 0],
       [4, 2, 3, 4, 4],
       [4, 0, 0, 0, 3],
       [2, 2, 3, 1, 4]])
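The index array only becomes a set of resamples once it is used to index the data. The sketch below completes that step; the fixed seed is an addition so that repeated runs give the same draw.

```python
import numpy as np
import numpy.random as npr

npr.seed(0)  # fixed seed so the sketch is repeatable
data = np.array(['tom', 'jerry', 'mickey', 'minnie', 'pocahontas'])

# 4 resamples, each the same size as the data, drawn with replacement
idx = npr.randint(0, len(data), (4, len(data)))
resamples = data[idx]
print(resamples)
```

Each row of resamples is one new sample; names can repeat within a row because the indices were drawn with replacement.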
In the next version of numpy (1.7.0), a new function choice is available in numpy.random to do the same thing
with a nicer syntax. Version 1.7.0 is only currently available from the git repository as source code that you must
compile yourself, but should be available for easy_install/pip installation soon.
In [1]: import numpy.random as npr
In [2]: data = ['tom', 'jerry', 'mickey', 'minnie', 'pocahontas']
# only available if you install numpy 1.7.0 from the git repository
In [3]: npr.choice(data, size=(4, len(data)), replace=True)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/Volumes/HD3/hg/pcfb/<ipython-input-3-84e7d179a607> in <module>()
----> 1 npr.choice(data, size=(4, len(data)), replace=True)
AttributeError: 'module' object has no attribute 'choice'
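On an installation with NumPy 1.7.0 or later the same call succeeds. A short sketch, assuming such a version is installed:

```python
import numpy.random as npr

data = ['tom', 'jerry', 'mickey', 'minnie', 'pocahontas']

# draw 4 resamples of len(data) names each, with replacement
resamples = npr.choice(data, size=(4, len(data)), replace=True)
print(resamples)
```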
Moving on to our first simulation example - if we want to plot the 95% confidence interval for the mean of our data
samples, we can use the bootstrap to do so. The basic idea is simple - draw many, many samples with replacement
from the data available, estimate the mean from each sample, then rank order the means to estimate the 2.5 and 97.5
percentile values for the 95% confidence interval. Unlike using normal assumptions to calculate the 95% CI, the results
generated by the bootstrap are robust even if the underlying data are very far from normal.
import numpy as np
import numpy.random as npr
import pylab

def bootstrap(data, num_samples, statistic, alpha):
    """Returns bootstrap estimate of 100.0*(1-alpha) CI for statistic."""
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    # index the data argument rather than a global variable
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],
            stat[int((1-alpha/2.0)*num_samples)])

if __name__ == '__main__':
    # data of interest is bimodal and obviously not normal
    x = np.concatenate([npr.normal(3, 1, 100), npr.normal(6, 2, 200)])
    # find mean 95% CI using 100,000 bootstrap samples
    low, high = bootstrap(x, 100000, np.mean, 0.05)
    # make plots
    pylab.figure(figsize=(8,4))
    pylab.subplot(121)
    pylab.hist(x, 50, histtype='step')
    pylab.title('Histogram of data')
    pylab.subplot(122)
    pylab.plot([-0.03,0.03], [np.mean(x), np.mean(x)], 'r', linewidth=2)
    pylab.scatter(0.1*(npr.random(len(x))-0.5), x)
    pylab.plot([0.19,0.21], [low, low], 'r', linewidth=2)
    pylab.plot([0.19,0.21], [high, high], 'r', linewidth=2)
    pylab.plot([0.2,0.2], [low, high], 'r', linewidth=2)
    pylab.xlim([-0.2, 0.3])
    pylab.title('Bootstrap 95% CI for mean')
    pylab.savefig('examples/boostrap.png')
Note that the bootstrap function is a higher order function, and will return the bootstrap CI for any valid statistical
function, not just the mean. For example, to find the 95% CI for the standard deviation, we only need to change
np.mean to np.std in the arguments:
# find standard deviation 95% CI bootstrap samples
low, high = bootstrap(x, 100000, np.std, 0.05)
The function is also fast because the resampling is vectorized, and takes under 2 seconds to calculate the bootstrap CI
for the mean of a data sample of size 300 using 100,000 bootstrap samples on a 4 year old MacBook Pro with a 2.4 GHz
Intel Core 2 Duo processor.
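The rank ordering step can also be delegated to np.percentile. The following is a sketch of that variant, not the course's function: the name bootstrap_percentile, the seed and the smaller number of resamples are choices made for illustration, and the two versions can differ slightly because np.percentile interpolates between ranks.

```python
import numpy as np
import numpy.random as npr

def bootstrap_percentile(data, num_samples, statistic, alpha):
    """Percentile bootstrap CI for statistic at level 100*(1-alpha)."""
    data = np.asarray(data)
    n = len(data)
    # one row of indices per bootstrap sample
    idx = npr.randint(0, n, (num_samples, n))
    stats = statistic(data[idx], axis=1)
    low, high = np.percentile(stats, [100 * alpha / 2.0,
                                      100 * (1 - alpha / 2.0)])
    return low, high

npr.seed(1)  # fixed seed so the sketch is repeatable
x = np.concatenate([npr.normal(3, 1, 100), npr.normal(6, 2, 200)])
low, high = bootstrap_percentile(x, 2000, np.mean, 0.05)
print(low, np.mean(x), high)
```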
Permutation-resampling is another form of simulation-based statistical calculation, and is often used to evaluate the
p-value for the difference between two groups, under the null hypothesis that the groups are invariant under label
permutation. For example, in a case-control study, it can be used to find the p-value for the hypothesis that the mean of
the case group is different from that of the control group when we cannot use the t-test because the distributions are
highly skewed.
import numpy as np
import numpy.random as npr
import pylab

def permutation_resampling(case, control, num_samples, statistic):
    """Returns p-value that statistic for case is different
    from statistic for control."""
    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)
    combined = np.concatenate([case, control])
    diffs = []
    for i in range(num_samples):
        xs = npr.permutation(combined)
        diff = statistic(xs[:num_case]) - statistic(xs[num_case:])
        diffs.append(diff)
    # convert to an array so that the comparisons below are element-wise
    diffs = np.array(diffs)
    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff))/float(num_samples)
    return pval, observed_diff, diffs

if __name__ == '__main__':
    # make up some data
    case = [94, 38, 23, 197, 99, 16, 141]
    control = [52, 10, 40, 104, 51, 27, 146, 30, 46]
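The remainder of the script is not reproduced here. As a hedged sketch of how such a test might be run on the case and control data above, the compact version below computes the same two-sided p-value; the function name permutation_pvalue, the seed and the number of permutations are assumptions made for illustration.

```python
import numpy as np
import numpy.random as npr

def permutation_pvalue(case, control, num_samples, statistic=np.mean):
    """Two-sided permutation p-value for a difference in statistic."""
    case = np.asarray(case, dtype=float)
    control = np.asarray(control, dtype=float)
    observed_diff = abs(statistic(case) - statistic(control))
    combined = np.concatenate([case, control])
    diffs = np.empty(num_samples)
    for i in range(num_samples):
        # shuffle the group labels by permuting the pooled values
        xs = npr.permutation(combined)
        diffs[i] = statistic(xs[:len(case)]) - statistic(xs[len(case):])
    # fraction of permutations at least as extreme as the observed difference
    pval = np.mean(np.abs(diffs) > observed_diff)
    return pval, observed_diff

npr.seed(2)  # fixed seed so the sketch is repeatable
case = [94, 38, 23, 197, 99, 16, 141]
control = [52, 10, 40, 104, 51, 27, 146, 30, 46]
pval, observed_diff = permutation_pvalue(case, control, 10000)
print(pval, observed_diff)
```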
import numpy as np
import pylab

xs = np.loadtxt('anscombe.txt')
for i in range(4):
    x = xs[:,i*2]
    y = xs[:,i*2+1]
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y)[0]
    pylab.subplot(2,2,i+1)
    pylab.scatter(x, y)
    pylab.plot(x, m*x+c, 'r')
    pylab.axis([2,20,0,14])
pylab.savefig('anscombe.png')
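The line fit in the loop above works by stacking x with a column of ones, so that solving the least squares problem yields the slope and the intercept together. A self-contained sketch on made-up data that lie exactly on a line (so the answer is known in advance) follows; rcond=None is passed because recent NumPy versions expect it.

```python
import numpy as np

# made-up data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# design matrix: one column for the slope, one column of ones for the intercept
A = np.vstack([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(m, c)
```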
For communication of results, matplotlib offers a huge range of graphics. We have only scratched the surface of
what the package has to offer. The fastest way to get a custom graphic to communicate your results is to look at the
thumbnails at http://matplotlib.sourceforge.net/gallery.html. If one of the graphics looks appropriate for your needs,
just click on the thumbnail to get the source code. You should now know enough Python to customize the graphic to
your specific needs.
requirements. Please refer to Chapter 17 for further depth than we go into here on graphics.
Vector and pixel images are probably concepts most biologists are familiar with. Pixel images (bitmap, raster) are a
uniform grid of colored dots. In vector art, shapes are described geometrically; a line, for example, can be defined by
its two endpoints. The vector representation often has the advantage of being easier to store. Also, the amount of
information for vector art stays the same no matter how large the plot is, and the image stays sharp as you zoom in.
File formats that store vector based images include: PDF, EPS, SVG and AI. File formats that store pixel-based images
are JPEG, PNG, TIFF, BMP and PSD. PDF, EPS and AI can store embedded pixel information (hybrid images).
It is always possible to go from vector to pixel art, but it is not so easy to go the other way. Another advantage of
vector art is that its text remains text, whereas text in a pixel image is not easily retrieved by a machine.
Pixel art is everywhere (photos, machine output); however, if we have to create a completely new plot then there are a
number of reasons to do so using vector graphics. Vector art is covered in chapter 18 of the book.
4.17.2 Inkscape
Inkscape uses the SVG (Scalable Vector Graphics) format for its files. SVG is an open standard widely supported by
graphic software.
Inkscape basics
QuickRef
Command               Note
Ctrl+arrow            pan or scroll the canvas
Ctrl+B                hides scrollbars
- + =                 zoom
Ctrl+N                creates a new Inkscape document
Ctrl+O                opens an existing SVG file
Ctrl+S                saves the current file
holding Alt           restricts movement in the move mode
holding Ctrl          preserves the original height/width ratio during resize
space or F1           activates the selector
[ and ]               rotate an object
arrows                move objects
Alt+< and Alt+>       resize an object. < and > work too
tom                   bombadil
hold Shift + click    selects multiple objects
Getting familiar
Inkscape tutorials
Since there are people who make their living by working with Inkscape, I figured there were better tutorials already
available on the web than what I could come up with.
Contents
One of the most common issues in Inkscape is making arrows. Although it is a little work, once you have made one
you can save it and use it for all future derivative arrows.
Arrow tutorial
One nice aspect of Inkscape is how easy it is to trace a bitmap.
Resizing of images
Adding text or shapes or colors to a bitmap
Additional Resources
The inkscape documentation
Nice set of tutorials
Very nice gallery of wallpapers
Python effects tutorial
4.18.2 Exercise
Write a function called read_table to load a tab delimited file as a numpy array using the numpy loadtxt
function. The function should take a filename and an optional delimiter (e.g. '\t' for tab-delimited and ',' for CSV) as
arguments, and return a numpy array of strings:
def read_table(filename, delimiter='\t'):
    # your code goes here and should return a numpy array of strings
Now that we have the table as a numpy array, let's focus on converting the middle data portion into an array of
floats. We need to somehow convert all those error or warning messages into numbers. One simple way to do this is
to encode each unique message as a number that cannot be mistaken for a data value. Since concentrations are never
negative, we will use the negative integers to encode the errors. But in order to do the encoding, we first need to find
out what types of messages exist in the data set.
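The encoding idea can be illustrated on a toy example before tackling the real table. The cells below are made up, and the helper is_number is a hypothetical name; the exercises that follow ask you to do the same kind of thing for the actual data.

```python
# made-up cells from the data part of a table: numbers and warnings, all as strings
cells = ["12.5", "Bead Issues", "3.1", "UN", "Bead Issues", "0.7"]

def is_number(text):
    """True if the string can be converted to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# give each distinct warning the next unused negative integer
codes = {}
for cell in cells:
    if not is_number(cell) and cell not in codes:
        codes[cell] = -(len(codes) + 1)

# replace warnings by their codes so that everything becomes a float
encoded = [float(cell) if is_number(cell) else float(codes[cell])
           for cell in cells]
print(codes)
print(encoded)
```

Because real concentrations are never negative, any negative value in encoded can later be recognised as a warning and excluded from plots.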
4.18.3 Exercise
Write a function that finds all the warnings in the table returned by the previous read_table function and returns a
dictionary of {warning : code} where warning is a string (e.g. 'Bead Issues' or 'UN') and code is a negative number.
Each unique warning should be given a different negative number. Basically we want to find all strings in the data
part of the table, i.e. table[1:, 1:], and every time we find a string, we add it to the dictionary with a new negative
number code if it is not already there:
def parse_warnings(table):
    # your code goes here and should return a dictionary of warnings
Next, let's deal with the sample information in column 1, excluding row 1, column 1. We notice that the same sample
name can occur on multiple rows - let's create a dictionary whose key is the sample name, and whose value is the list of
row numbers minus one (minus one because we want row number = 0 to index the first data row, not the header information).
4.18.4 Exercise
Complete the function:
def parse_samples(table):
    # your code goes here and should return a dictionary of lists of row numbers for each sample
Parsing the header information is slightly more challenging. We notice that two different pieces of information are
provided by each cell in the header - the cytokine name (e.g. IL-2 beta) and the day the sample was taken (e.g. day 3).
Each cytokine is sampled on multiple days (0, 3 and 28). We want to create a dictionary that will tell us in which column
we can find the values for a given (cytokine, day) combination. One way to deal with the fact that the same cytokine is
sampled on multiple days is to use a dictionary of dictionaries. In particular, we will construct a dictionary whose key
is a cytokine name, and whose value is another dictionary. This other dictionary has a key representing the day, and a
value representing the column number minus one for the (cytokine, day) combination.
4.18.5 Exercise
Complete the function:
def parse_headers(table):
    # your code goes here and should return a dictionary of dictionaries as described in the text
We finally get to convert the data from an array of strings to an array of floats. Replace all the warning messages with
the code numbers, and return a new array of floats comprising the following subarray - table[1:, 1:].
4.18.6 Exercise
Complete the function:
def parse_data(table, warning_dict):
    # your code goes here and should return a numpy array of floats
Let's consolidate all the above functions into a single function that, when given a filename, reads the file and breaks it
down into a sample dictionary, a cytokine dictionary and a data array. My function is shown below - you only have to
change the names to match the structures that you created:
def parse_cytokine_table(filename):
    # extract data and sample and cytokine mapping dictionaries from table
    table = read_table(filename)
    warning_dict = parse_warnings(table)
    sample_mapper = parse_samples(table)
    cytokine_mapper = parse_headers(table)
    data = parse_data(table, warning_dict)
    return warning_dict, sample_mapper, cytokine_mapper, data
Well done! The most difficult part of the program is now complete. We next see how we can use the structures we
have created to make summaries of the data. First, let's use a box-and-whiskers plot to show the distribution of each
cytokine over days 0, 3 and 28. Generate one such figure for each cytokine - it should look something like this.
4.18.7 Exercise
Write the function to generate the box-and-whiskers plots. Remember that the negative numbers are not really
cytokine concentrations and should not be used for plotting:
def plot_cytokine(cytokine, cytokine_mapper, data, save=False, directory='.'):
    # generate plots like the one shown
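The data selection behind such a plot can be sketched separately from the plotting itself. The structures below are small made-up stand-ins for cytokine_mapper and data, and values_by_day is an illustrative helper name; the negative entries play the role of warning codes.

```python
import numpy

# Hypothetical structures in the shape described in the text.
cytokine_mapper = {'IL-2': {0: 0, 3: 1, 28: 2}}
data = numpy.array([
    [1.0, 2.0, -1.0],
    [3.0, -2.0, 4.0],
    [5.0, 6.0, 7.0],
])

def values_by_day(cytokine, cytokine_mapper, data):
    """Return the sorted days and, for each day, the non-negative measurements."""
    days = sorted(cytokine_mapper[cytokine])
    ys = []
    for day in days:
        col = data[:, cytokine_mapper[cytokine][day]]
        # the boolean mask drops the negative warning codes
        ys.append(col[col >= 0])
    return days, ys

days, ys = values_by_day('IL-2', cytokine_mapper, data)
print(days, [y.tolist() for y in ys])
```

A list of arrays such as ys, one per day, is a form that pylab.boxplot accepts, with the days serving as tick labels.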
Finally, let's write the QC data to file. There seem to be an enormous number of errors in this file - let's summarize
them so that Herman can figure out what is going on! Generate one text file per unique warning, where each row has
two columns - the sample name, and the number of warnings of that type associated with the sample. For instance, the
Bead_Issues.txt file will contain the following values:
0921-X-2-2	2
1059-X-2-2	12
0273-X-2-2	2
0740-X-2-2	4
0263-X-2-2	18
0175-X-2-2	2
0012-X-2-2	4
0108-X-2-2	1
0066-X-2-2	2
0057-X-2-2	12
0313-X-2-2	7
0103-X-2-2	25
0685-X-2-2	2
0799-X-2-2	4
0749-X-2-2	16
0693-X-2-2	1
1023-X-2-2	6
0300-X-2-2	2
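The counting behind such a file can be sketched as follows. All inputs are made-up stand-ins in the shapes described in the text (nested lists stand in for the data array), and count_warnings and warning_filename are illustrative names rather than the course's own functions.

```python
from collections import Counter

# Hypothetical inputs in the shapes described in the text.
warning_dict = {'Bead Issues': -1, 'Out of Range': -2}
sample_mapper = {'S1': [0, 2], 'S2': [1]}
data = [
    [1.5, -1.0, -1.0],
    [-2.0, 2.0, 3.0],
    [-1.0, 4.0, -2.0],
]

def count_warnings(warning_dict, sample_mapper, data):
    """Return {warning name: Counter({sample: number of occurrences})}."""
    code_to_name = {code: name for name, code in warning_dict.items()}
    qc = {name: Counter() for name in warning_dict}
    for sample, rows in sample_mapper.items():
        for r in rows:
            for value in data[r]:
                if value in code_to_name:
                    qc[code_to_name[value]][sample] += 1
    return qc

def warning_filename(name):
    """Turn a warning message into a file name such as Bead_Issues.txt."""
    return name.replace(' ', '_').replace('/', '-') + '.txt'

qc = count_warnings(warning_dict, sample_mapper, data)
for name in sorted(qc):
    print(warning_filename(name), dict(qc[name]))
```

Samples with no warnings of a given type never enter that Counter, so only non-zero counts would reach the file.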
"""Intermediate
Task: Summarize
Input: An Excel
Output: Figures
"""
import numpy
import pylab
import os
def parse_cytokine_table(filename):
    # extract data and sample and cytokine mapping dictionaries from table
    table = read_table(filename)
    warning_dict = parse_warnings(table)
    sample_mapper = parse_samples(table)
    cytokine_mapper = parse_headers(table)
    data = parse_data(table, warning_dict)
    return warning_dict, sample_mapper, cytokine_mapper, data

def parse_warnings(table):
    # create a dictionary to convert missing data codes into numeric codes
    data = table[1:, 1:]
    warning_dict = {}
    idx = -1
    for x in data.ravel():
        try:
            float(x)
        except ValueError:
            # x is not a number, so treat it as a warning message
            if x not in warning_dict:
                warning_dict[x] = idx
                idx -= 1
    return warning_dict

def parse_samples(table):
    # create a dictionary that returns the rows where each sample has data
    samples = table[1:, 0]
    sample_mapper = {}
    for i, sample in enumerate(samples):
        sample_mapper.setdefault(sample, []).append(i)
    return sample_mapper

def parse_headers(table):
    # get header information in first row
    headers = table[0, :]
    # parse header information to find columns for each cytokine/day combination
    cytokine_mapper = {}
    for i, label in enumerate(headers[1:]):
        cytokine, day = label.split('day')
        cytokine_mapper.setdefault(cytokine.strip(), {})[int(day)] = i
    return cytokine_mapper

    # collect measured values for each day in a list ignoring missing values
    ys = []
    for i in range(xs.shape[1]):
        col = xs[:, i]
        ys.append(col[col >= 0])
    if save:
        if not os.path.exists(directory):
            os.makedirs(directory)
        pylab.savefig(os.path.join(directory, cytokine + '.png'))
    for k in sorted(qc):
        filename = k.replace(' ', '_').replace('/', '-') + '.txt'
        fo = open(os.path.join(directory, filename), 'w')
        sample_list = qc[k]
        for sample in sample_list:
            if sample[1] != 0:
                fo.write('%s\t%d' % sample + '\n')
        fo.close()
if __name__ == '__main__':
    # parse the file
    warning_dict, sample_mapper, cytokine_mapper, data = parse_cytokine_table(
        'Cytokine_assay_31Dec08PAD.txt')