Jones - 2015 - Python For Biologists Write Your Own Software, Become More Productive, and Take Control of Your Res
All rights reserved. This book or any portion thereof may not be reproduced
or used in any manner whatsoever without the express written permission of
the publisher except for the use of brief quotations in a book review.
ISBN-13: 978-1492346135
ISBN-10: 1492346136
http://pythonforbiologists.com
Set in PT Serif and Source Code Pro
Table of Contents
About the author
8: Dictionaries
Storing paired data
Creating a dictionary
Iterating over a dictionary
Recap
Exercises
Solutions
1 Chapter 1: Introduction and environment
programming book, but which are very useful to biologists (for example, regular
expressions and subprocesses). Having a biology-specific textbook allows us to
include these features, along with explanations of why they are particularly useful
to us.
A related point is that a textbook written just for biologists allows us to introduce
features in a way that allows us to start writing useful programs right away. We can
do this by taking into account the sorts of problems that repeatedly crop up in
biology, and prioritising the features that are best at solving them. This book has
been designed so that you should be able to start writing small but useful programs
using only the tools in the first couple of chapters.
Why Python?
Let me start this section with the following statement: programming languages are
overrated. What I mean by that is that people who are new to programming tend to
worry far too much about what language to learn. The choice of programming
language does matter, of course, but it matters far less than people think it does. To
put it another way, choosing the "wrong" programming language is very unlikely to
mean the difference between failure and success when learning. Other factors
(motivation, having time to devote to learning, helpful colleagues) are far more
important, yet receive less attention.
The reason that people place so much weight on the "what language should I learn?"
question is that it's a big, obvious question, and it's not difficult to find people who
will give you strong opinions on the subject. It's also the first big question that
beginners have to answer once they've decided to learn programming, so it assumes
a great deal of importance in their minds.
There are three main reasons why choice of programming language is not as
important as most people think it is. Firstly, nearly everybody who spends any
significant amount of time programming as part of their job will eventually end up
using multiple languages. Partly this is just down to the simple constraints of
various languages – if you want to write a web application you'll probably do it in
JavaScript, if you want to write a graphical user interface you'll probably use
something like Java, and if you want to write low-level algorithms you'll probably
use C.
Secondly, learning a first programming language gets you 90% of the way towards
learning a second, third, and fourth one. Learning to think like a programmer is
largely a matter of learning to break down complex tasks into simple ones, and is a
skill that cuts across all languages. So if you spend a few months learning Python
and then discover that you really need to write in C, your time won't have been
wasted as you'll be able to pick it up much quicker.
Thirdly, the kinds of problems that we want to solve in biology are generally
amenable to being solved in any language, even though different programming
languages are good at different things. In other words, as a beginner, your choice of
language is vanishingly unlikely to prevent you from solving the problems that you
need to solve.
Having said all that, when learning to program we do need to pick a language to
work in, so we might as well pick one that's going to make the job easier. Python is
such a language for a number of reasons:
• It has a mostly-consistent syntax, so you can generally learn one way of
doing things and then apply it in multiple places
• It has a sensible set of built-in libraries for doing lots of common tasks
• It is designed in such a way that there's an obvious way of doing most things
• It's one of the most widely-used languages in the world, and there's a lot of
advice, documentation and tutorials available on the web
• It's designed in a way that lets you start to write useful programs as soon as
possible
• Its use of indentation, while annoying to people who aren't used to it, is
great for beginners as it enforces a certain amount of readability
Python also has a couple of points to recommend it to biologists and scientists
specifically:
• It's widely used in the scientific community
• It has a couple of very well-designed libraries for doing complex scientific
computing (although we won't encounter them in this book)
• It lends itself well to being integrated with other, existing tools
• It has features which make it easy to manipulate strings of characters (for
example, strings of DNA bases and protein amino acid residues, which we as
biologists are particularly fond of)
Further reading
I've deliberately limited the scope of this book to introductory material, in order to
keep the size manageable. As a result, there are lots of useful techniques and tools
that I've had to leave out. The good stuff that I couldn't fit into this book forms the
basis of my second book, Advanced Python for Biologists. You can read more about
Advanced Python for Biologists at this URL:
http://pythonforbiologists.com/books
There are several places in this book where we'll look briefly at a tool or technique
that is covered in much more depth in Advanced Python for Biologists. In these cases
I've mentioned the appropriate chapter in a footnote so it doesn't disrupt the flow
of the text.
Formatting
A couple of notes on typography: bold type is used to emphasize important points
and italics for technical terms and filenames. Where code is mixed in with normal
text it's written in a mono-spaced font like this. Occasionally there are
footnotes1 to provide additional information that is interesting to know but not
crucial to understanding, or to give links to web pages.
Example code is highlighted with a solid border and the name of the matching
example file is written just underneath the example to the right:
1 Like this.
example.py
Not every bit of code has a matching example file – much of the time we'll be
building up a Python program bit-by-bit, in which case there will be a single
example file containing the finished version of the program. The example files are
in separate folders, one for each chapter, to make them easy to find.
Example output (i.e. what we see on the screen when we run the code) is
highlighted with a dotted border:
Often we want to look at the code and the output it produces together. In these
situations, you'll see a solid-bordered code block followed immediately by a dotted-
bordered output block.
Sometimes it's necessary to refer in the text to individual lines of code or output, in
which case I've used line numberings on the left:
1 first line
2 second line
3 third line
Other blocks of text (usually file contents or typed command lines) don't have any
kind of border and look like this:
contents of a file
Getting in touch
One of the most convincing arguments for presenting a course like this one in the
form of an ebook is that it can be continually updated and tweaked based on reader
feedback. So, if you find anything that is hard to understand, or you think may
contain an error, please get in touch – just drop me an email at
[email protected] and I promise to get back to you.
Installing Python
The process of installing Python depends on the type of computer you're running
on. If you're running a mainstream Linux distribution like Ubuntu, Python is
probably already installed. To find out, open a terminal and type
python
If Python starts and shows a version message followed by a >>> prompt, then you are ready to go. If your Linux installation doesn't already have Python
installed, try installing it with your package manager (the command will probably
be either sudo apt-get install python or sudo yum install python).
If this doesn't work, then download the package from the Python download page1.
The official Python website has installation instructions for Mac 2 and Windows3
computers as well; these are likely to be the most up-to-date instructions, so follow
them closely.
/usr/local/bin/python
c:\Python27\python
1 http://www.python.org/getit/
2 http://www.python.org/getit/mac/
3 http://www.python.org/getit/windows/
4 When we refer to "a Python program" in this book, we are usually talking about the text file that holds the
code.
To run a Python program, it's generally easiest to be in the same folder as it. By
convention, Python programs are given the extension .py, so to run a program
called test.py, we just type:
/usr/local/bin/python test.py
There are a couple of tricks that can be useful when experimenting with programs 1.
Firstly, you can run Python in an interactive (or "shell") mode by running it without
the name of a program file. This allows you to type individual statements and see
the result straight away.
Secondly, you can run Python with the -i option, which will cause it to run your
program and then enter interactive mode. This can be handy if you want to
examine the state of variables after your code has run.
Text editors
Since a Python program is just a text file, you can create and edit it with any text
editor of your choice. Note that by a text editor I don't mean a word processor – do
not try to edit Python programs with Microsoft Word, LibreOffice Writer, or similar
tools, as they tend to insert special formatting marks that Python cannot read.
When choosing a text editor, there is one feature that is essential2 to have, and one
which is nice to have. The essential feature is something that's usually called tab
emulation. The effect of this feature at first seems quite odd; when enabled, it
replaces any tab characters that you type with an equivalent number of space
characters (usually set to four). The reason why this is useful is discussed at length
in chapter 4, but here's a brief explanation: Python is very fussy about your use of
tabs and spaces, and unless you are very disciplined when typing, it's easy to end up
1 Don't worry if these two options make no sense to you right now – they will do so later on in the book, once
you've learned what statements and variables actually are.
2 OK, so it's not strictly essential, but you will find life much easier if you have it.
with a mixture of tabs and spaces in your programs. This causes very infuriating
problems, because they look the same to you, but not to Python! Tab emulation
fixes the problem by making it effectively impossible for you to type a tab character.
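To make the problem concrete, here is a small sketch (not one of the book's example files; the variable names are made up for illustration) showing that a tab and four spaces are different characters as far as Python is concerned, even though many editors display them identically. The \t inside the first string stands for a single tab character.

```python
# two ways of indenting the same line of code
tab_indent = "\tprint(my_dna)"       # indented with one tab character
space_indent = "    print(my_dna)"   # indented with four spaces

# the two strings are not equal, even if they look the same on screen
print(tab_indent == space_indent)

# expandtabs() does a similar job to tab emulation: it turns tabs into spaces
print(tab_indent.expandtabs(4) == space_indent)
```

The first comparison should print False and the second True, which is roughly what tab emulation achieves for us as we type.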
The feature that is nice to have is syntax highlighting. This will apply different
colours to different parts of your Python code, and can help you spot errors more
easily.
Recommended text editors are Notepad++ for Windows1, TextWrangler for Mac
OSX2, and gedit for Linux3, all of which are freely available.
1 http://notepad-plus-plus.org/
2 http://www.barebones.com/products/TextWrangler/
3 https://projects.gnome.org/gedit/
4 https://code.google.com/p/spyderlib/
programming and makes a very nice environment once you are familiar with the
Python language.
We won't go into the explanation behind this line, except to say that it's necessary
in order to correct a small quirk with the way that Python 2 handles division of
numbers.
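The line itself isn't reproduced in this part of the text. Assuming it is the standard from __future__ import division statement (an assumption, since the line isn't shown), the sketch below illustrates the quirk in question using made-up numbers. It is written for Python 3, where the two kinds of division have separate symbols.

```python
# Assumption: the line referred to in the text is probably
#     from __future__ import division
# which makes Python 2 divide numbers the way Python 3 does.

numerator = 7      # made-up numbers, purely for illustration
denominator = 2

# in Python 3 (or in Python 2 with the line above), / gives a decimal answer
true_result = numerator / denominator

# // always rounds down, which is what plain Python 2 does when / is
# used on two whole numbers
rounded_result = numerator // denominator

print(true_result)
print(rounded_result)
```

Without the correction, Python 2 would give the rounded-down answer for both calculations, which is rarely what we want when working out things like proportions.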
Depending on what version you use, you might see slight differences between the
output in this book and the output you get when you run the code on your
computer. I've tried to note these differences in the text where possible.
1 You might encounter writing online that makes the 2 to 3 changeover seem like a big deal, and it is – but
only for existing, large projects. When writing code from scratch, as you'll be doing when learning, you're
unlikely to run into any problems.
1 http://www.python.org/doc/
15 Chapter 2: Printing and manipulating text
the word we use to refer to a bit of text in a computer program (it just means a
string of characters). From this point on we'll use the word string when we're talking
about computer code, and we'll reserve the word sequence for when we're discussing
biological sequences like DNA and protein.
print("Hello world")
hello_world.py
Let's take a look at the various bits of this line of code, and give some of them
names:
The whole line is called a statement.
print() is the name of a function. The function tells Python, in vague terms, what
we want to do – in this case, we want to print some text. The function name is
always1 followed by parentheses2.
The bits of text inside the parentheses are called the arguments to the function. In
this case, we just have one argument (later on we'll see examples of functions that
1 This is not strictly true, but it's easier to just follow this rule than worry about the exceptions.
2 There are several different types of brackets in Python, so for clarity we will always refer to parentheses
when we mean these: (), square brackets when we mean these: [] and curly brackets when we mean these: {}
take more than one argument, in which case the arguments are separated by
commas).
The arguments tell Python what we want to do more specifically – in this case, the
argument tells Python exactly what it is we want to print: a friendly greeting.
Assuming you've followed the instructions in chapter 1 and set up your Python
environment, type the line of code above into your favourite text editor, save it, and
run it. You should see a single line of output like this:
Hello world
print("Hello world")
print('Hello world')
different_quotes.py
Hello world
Hello world
1 From this point on, I won't tell you to create a new file, enter the text, and run the program for each
example – I will simply show you the output – but I encourage you to try the examples yourself.
You'll notice that the output above doesn't contain quotes – they are part of the
code, not part of the string itself. If we do want to include quotes in the output, the
easiest thing to do1 is use the other type of quotes for surrounding the string:
printing_quotes.py
Be careful when writing and reading code that involves quotes – you have to make
sure that the quotes at the beginning and end of the string match up.
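The code for this example isn't shown above, so here is a minimal sketch of the idea (the strings are invented for illustration). It also includes the backslash alternative mentioned in the footnote, and stores two strings in variables, which are introduced later in this chapter, so that they can be compared.

```python
# double quotes outside, so single quotes can appear inside the string
print("She said, 'Hello world'")

# single quotes outside, so double quotes can appear inside the string
print('He said, "Hello world"')

# escaping the quotes with a backslash gives exactly the same string
escaped = "He said, \"Hello world\""
unescaped = 'He said, "Hello world"'
print(escaped == unescaped)
```

The last line should print True, because the two ways of writing the string produce the same characters.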
comment.py
1 The alternative is to place a backslash character (\) before the quote – this is called escaping the quote and
will prevent Python from trying to interpret it.
2 This symbol has many names – you might know it as number sign, pound sign, octothorpe, sharp (from
musical notation), cross, or pig-pen.
You're going to see a lot of comments in the source code examples in this book, and
also in the solutions to the exercises. Comments are a very useful way to document
your code, for a number of reasons:
• You can put the explanation of what a particular bit of code does right next
to the code itself. This makes it much easier to find the documentation for a
line of code that is in the middle of a large program, without having to
search through a separate document.
• Because the comments are part of the source code, they can never get mixed
up or separated. In other words, if you are looking at the source code for a
particular program, then you automatically have the documentation as well.
In contrast, if you keep the documentation in a separate file, it can easily
become separated from the code.
• Having the comments right next to the code acts as a reminder to update the
documentation whenever you change the code. The only thing worse than
undocumented code is code with old documentation that is no longer
accurate!
Don't make the mistake, by the way, of thinking that comments are only useful if
you are planning on showing your code to somebody else. When you start writing
your own code, you will be amazed at how quickly you forget the purpose of a
particular section or statement. If you are working on a solution to one of the
exercises in this book on Friday afternoon, then come back to it on Monday
morning, it will probably take you quite a while to pick up where you left off.
Comments can help with this problem by giving you hints about the purpose of
code, meaning that you spend less time trying to understand your old code, thus
speeding up your progress. A side benefit is that writing a comment for a bit of code
reinforces your understanding at the time you are doing it. A good habit to get into
is writing a quick one-line comment above any line of code that does something
interesting:
You'll see this technique used a lot in the code examples in this book, and I
encourage you to use it for your own code as well.
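As a small illustration of the one-line comment habit, here is a sketch (not one of the book's example files) in which each line of code has a short comment above it. It stores the greeting in a variable, which is covered later in this chapter.

```python
# store the greeting so that it's easy to change later
greeting = "Hello world"

# print the greeting to the screen
print(greeting)
```

Python ignores everything from the # symbol to the end of the line, so the comments have no effect on what the program does.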
Forgetting quotes
Here's one possible error we can make when printing a line of output – we can
forget to include the quotes:
print(Hello world)
missing_quotes.py
This is easily done, so let's take a look at the output we'll get if we try to run the
above code1:
1 The output that you see might be very slightly different from this, depending on a bunch of factors like
your operating system and the exact version of Python you are using.
1 $ python error.py
2 File "error.py", line 1
3 print(Hello world)
4 ^
5 SyntaxError: invalid syntax
Referring to the line numbers on the left we can see that the name of the Python
file is error.py (line 1) and that the error occurs on the first line of the file (line
2). Python's best guess at the location of the error is just before the closing
parenthesis (line 3). Depending on the type of error, this can be wrong by quite a
bit, so don't rely on it too much!
The type of error is a SyntaxError (line 5), which means that Python can't
understand the code – it breaks the rules in some way (in this case, the rule that
strings must be surrounded by quotation marks). We'll see different types of errors
later in this book.
Spelling mistakes
What happens if we misspell the name of the function?
prin("Hello world")
spelling.py
We get a different type of error – a NameError – and the error message is a bit
more helpful:
1 $ python error.py
2 Traceback (most recent call last):
3 File "error.py", line 1, in <module>
4 prin("Hello world")
5 NameError: name 'prin' is not defined
This time, Python doesn't try to show us where on the line the error occurred, it
just shows us the whole line (line 4). The error message tells us which word Python
doesn't understand (line 5), so in this case, it's quite easy to fix.
Hello
World
We might try putting a new line in the middle of our string like this:
print("Hello
World")
but that won't work and we'll get the following error message:
1 $ python error.py
2 File "error.py", line 1
3 print("Hello
4 ^
5 SyntaxError: EOL while scanning string literal
Python finds the error when it gets to the end of the first line of code (line 2 in the
output). The error message (line 5) is a bit more cryptic than the others. EOL stands
for End Of Line, and string literal means a string in quotes. So to put this error
message in plain English: "I started reading a string in quotes, and I got to the end of
the line before I came to the closing quotation mark"
If splitting the line up doesn't work, then how do we get the output we want?
print_newline.py
Notice that there's no need for a space before or after the newline. This newline
character will become very important in the next chapter when we start reading
data from files.
There are a few other useful special characters as well, all of which consist of a
backslash followed by a letter. The only ones which you are likely to need for the
exercises in this book are the tab character (\t) and the carriage return character
(\r). The tab character can sometimes be useful when writing a program that will
produce a lot of output. The carriage return character works a bit like a newline in
that it puts the cursor back to the start of the line, but doesn't actually start a new
line, so you can use it to overwrite output – this is sometimes useful for long-
running programs.
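The print_newline.py code isn't reproduced above; the following is a minimal sketch of how the newline and tab characters behave (the strings and the two-column layout are chosen just for illustration).

```python
# \n starts a new line in the middle of the string
two_lines = "Hello\nworld"
print(two_lines)

# \t inserts a tab, which is handy for lining up columns of output
print("gene\tlength")
print("abc1\t1200")
```

The first print() should produce two lines of output from a single string, with no space needed on either side of the \n.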
The variable my_dna now points to the string "ATGCGTA". We call this assigning a
variable, and once we've done it, we can use the variable name instead of the string
itself – for example, we can use it in a print() statement1:
print_variable.py
Notice that when we use the variable in a print() statement, we don't need any
quotation marks – the quotes are part of the string, so they are already "built in" to
the variable my_dna. Also notice that this example includes a blank line to separate
the different bits and make it easier to read. We are allowed to put as many blank
lines as we like in our programs when writing Python – the computer will ignore
them.
We can change the value of a variable as many times as we like once we've created
it:
1 If it's not clear why this is useful, don't worry – it will become much more apparent when we look at some
longer examples.
my_dna = "ATGCGTA"
print(my_dna)
# change the value of my_dna
my_dna = "TGGTCCA"
# print it again to see the new value
print(my_dna)
Here's a very important point that trips many beginners up: variable names are
arbitrary – that means that we can pick whatever we like to be the name of a
variable. So our code above would work in exactly the same way if we picked a
different variable name:
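A sketch of what that might look like, using the name banana that is discussed in the next paragraph (the code is a reconstruction for illustration, not the book's own example file):

```python
# exactly the same steps as before, but with a different variable name
banana = "ATGCGTA"
print(banana)

# change the value of banana
banana = "TGGTCCA"
print(banana)
```

Python is perfectly happy with this; the only one who might be confused by the name is the person reading the code.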
What makes a good variable name? Generally, it's a good idea to use a variable
name that gives us a clue as to what the variable refers to. In this example, my_dna
is a good variable name, because it tells us that the content of the variable is a DNA
sequence. Conversely, banana is a bad variable name, because it doesn't really tell
us anything about the value that's stored. As you read through the code examples
in this book, you'll get a better idea of what constitutes good and bad variable
names.
This idea – that names for things are arbitrary, and can be anything we like – is a
theme that will occur many times in this book, so it's important to keep it in mind.
Occasionally you will see a variable name that looks like it has some sort of
relationship with the value it points to:
my_file = "my_file.txt"
but don't be fooled! Variable names and strings are separate things.
I said above that variable names can be anything we want, but it's actually not quite
that simple – there are some rules we have to follow. We are only allowed to use
letters, numbers, and underscores, so we can't have variable names that contain
odd characters like £, ^ or %. We are not allowed to start a name with a number
(though we can use numbers in the middle or at the end of a name). Finally, we
can't use a word that's already built in to the Python language like "print".
It's also important to remember that variable names are case-sensitive, so
my_dna, MY_DNA, My_DNA and My_Dna are all separate variables. Technically
this means that you could use all four of those names in a Python program to store
different values, but please don't do this – it is very easy to become confused when
you use very similar variable names.
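If we're ever unsure whether a name follows these rules, we can ask Python. The sketch below is an illustration only, and relies on two tools that aren't covered in this book: the isidentifier() string method (available in Python 3 but not Python 2) and the standard keyword module. Note that Python 3 is a little more permissive than the rules above, since it also accepts many non-English letters, and that the exact list of reserved words differs slightly between Python 2 and Python 3.

```python
import keyword

# isidentifier() checks the letters, numbers and underscores rule
print("my_dna".isidentifier())    # allowed
print("dna_2".isidentifier())     # numbers are fine in the middle or at the end
print("2_dna".isidentifier())     # not allowed: starts with a number
print("my%dna".isidentifier())    # not allowed: contains an odd character

# words that are reserved by the language can't be used as names either
print(keyword.iskeyword("for"))

# names are case-sensitive, so these are two separate variables
my_dna = "ATGC"
MY_DNA = "GGCC"
print(my_dna == MY_DNA)
```

The six lines of output should be True, True, False, False, True and False.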
Concatenation
We can concatenate (stick together) two strings using the + symbol1. This symbol
will join together the string on the left with the string on the right:
print_concatenated.py
AATTGGCC
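The code that produced this output isn't shown above. A minimal sketch that gives the same result, assuming the two strings being joined were "AATT" and "GGCC" (the point at which the sequence is split is a guess), would be:

```python
# join two strings with + and store the result in a variable
my_dna = "AATT" + "GGCC"
print(my_dna)
```

Notice that + doesn't add a space or any other character between the two strings; they are simply joined end to end.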
In the above example, the things being concatenated were strings, but we can also
use variables that point to strings:
upstream = "AAA"
my_dna = upstream + "ATGC"
# my_dna is now "AAAATGC"
upstream = "AAA"
downstream = "GGG"
my_dna = upstream + "ATGC" + downstream
# my_dna is now "AAAATGCGGG"
It's important to realize that the result of concatenating two strings together is
itself a string. So it's perfectly OK to use a concatenation inside a print()
statement:
As we'll see in the rest of the book, using one tool inside another is quite a common
thing to do in Python.
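As a hedged illustration (reusing the upstream variable from the earlier examples rather than the book's own example file), here is a concatenation used directly inside print(), followed by the longer equivalent:

```python
upstream = "AAA"

# the concatenation is carried out first, then the result is printed
print(upstream + "ATGC")

# this is equivalent to storing the result and printing it afterwards
my_dna = upstream + "ATGC"
print(my_dna)
```

Both print() statements should give the same output; the first version just saves us from creating a variable that we only use once.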
that uses len() to calculate the length of a string, the program will run but we
won't see any output:
If we want to actually use the return value, we need to store it in a variable, and
then do something useful with it (like printing it):
dna_length = len("AGTC")
print(dna_length)
print_length.py
There's another interesting thing about the len() function: the result (or return
value) is not a string, it's a number. This is a very important idea so I'm going to
write it out in bold: Python treats strings and numbers differently.
We can see that this is the case if we try to concatenate together a number and a
string. Consider this short program which calculates the length of a DNA sequence
and then prints a message telling us the length:
1 $ python error.py
2 Traceback (most recent call last):
3 File "error.py", line 8, in <module>
4 print("The length of the DNA sequence is " + dna_length)
5 TypeError: cannot concatenate 'str' and 'int' objects
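The program that produced this error isn't reproduced above. The sketch below shows the same mistake in a form that won't crash: it uses try and except, which aren't covered in this chapter, purely so that the example runs to the end. The wording of the message depends on which version of Python you are using, so it may not match the output above exactly.

```python
my_dna = "ATGCGAGT"
dna_length = len(my_dna)

try:
    # this fails because dna_length is a number, not a string
    print("The length of the DNA sequence is " + dna_length)
    concatenation_worked = True
except TypeError as error:
    # show the message that Python gives us
    print("TypeError: " + str(error))
    concatenation_worked = False
```

The way to avoid the error is to turn the number into a string with the str() function before concatenating it: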
my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("The length of the DNA sequence is " + str(dna_length))
print_dna_length.py
The only thing we have changed is that we've replaced dna_length with
str(dna_length) inside the print() statement2. Notice that because we're
using one function (str()) inside another function (print()), our statement now
ends with two closing parentheses.
To finish our discussion of the str() function, here's a formal description of it,
with all the technical terms in italics:
str() is a function which takes one argument (whose type is number), and returns a
value (whose type is string) representing that number.
If you're unsure about the meanings of any of the words in italics, skip back to the
earlier parts of this chapter where we discussed them. Understanding how types
work is key to avoiding many of the frustrations which new programmers typically
encounter, so make sure the idea is clear in your mind before moving on with the
rest of this book.
Changing case
We can convert a string to lower case by using a new type of syntax – a method that
belongs to strings. A method is like a function, but instead of being built in to the
Python language, it belongs to a particular type1. The method we are talking about
here is called lower(), and we say that it belongs to the string type. Here's how we
use it:
my_dna = "ATGC"
# print my_dna in lower case
print(my_dna.lower())
print_lower.py
Notice how using a method looks different to using a function. When we use a
function like print() or len(), we write the function name first and the
arguments go in parentheses:
1 The chapter on object-oriented programming in Advanced Python for Biologists gives you the full details on
types.
print("ATGC")
len(my_dna)
When we use a method, we write the name of the variable first, followed by a
period, then the name of the method, then the method arguments in parentheses.
For the example we're looking at here, lower(), there is no argument, so the
opening and closing parentheses are right next to each other.
It's important to notice that the lower() method does not actually change the
variable; instead it returns a copy of the variable in lower case. We can prove that it
works this way by printing the variable before and after running lower(). Here's
the code to do so:
my_dna = "ATGC"
# print the variable
print("before: " + my_dna)
# run the lower method and store the result
lowercase_dna = my_dna.lower()
# print the variable again
print("after: " + my_dna)
print_before_and_after.py
before: ATGC
after: ATGC
Just like the len() function, in order to actually do anything useful with the
lower() method, we need to store the result (or print it right away).
Because the lower() method belongs to the string type, we can only use it on
variables that are strings. If we try to use it on a number:
my_number = len("AGTC")
# my_number is 4
print(my_number.lower())
The error message is a bit cryptic, but hopefully you can grasp the meaning:
something that is a number (an int, or integer) does not have a lower() method.
This is a good example of the importance of types in Python code: we can only use
methods on the type that they belong to.
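The error output itself isn't reproduced above. To see the problem without the program stopping, we can use the sketch below; it relies on try and except, which aren't covered in this chapter, and the exact wording of the message may vary between Python versions.

```python
my_number = len("AGTC")

try:
    # numbers don't have a lower() method, so this line fails
    print(my_number.lower())
except AttributeError as error:
    # show the message that Python gives us
    print("AttributeError: " + str(error))
```

The type of error here is an AttributeError, which is Python's way of saying that the thing on the left of the period doesn't have a method with that name.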
Before we move on, let's just mention that there is another method that belongs to
the string type called upper() – you can probably guess what it does!
Replacement
Here's another example of a useful method that belongs to the string type:
replace(). replace() is slightly different from anything we've seen before – it
takes two arguments (both strings) and returns a copy of the variable where all
occurrences of the first string are replaced by the second string. That's quite a long-
winded description, so here are a few examples to make things clearer:
protein = "vlspadktnv"
replace.py
ylspadktny
ymtpadktnv
vlspadktnv
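Most of the code for this example is missing above. The following sketch is consistent with the three lines of output, assuming the replacements were valine to tyrosine and "vls" to "ymt" (the comments are illustrative):

```python
protein = "vlspadktnv"

# replace valine with tyrosine
print(protein.replace("v", "y"))

# we can replace more than one character at a time
print(protein.replace("vls", "ymt"))

# the original variable is not affected
print(protein)
```

Just like lower(), replace() returns a modified copy and leaves the original string unchanged, which is why the last line of output is the same as the starting sequence.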
We'll take a look at more tools for carrying out string replacement in chapter 7.
protein = "vlspadktnv"
# if we use a stop position beyond the end, it's the same as using the end
print(protein[0:60])
print_substrings.py
pa
vlspad
vlspadktnv
There are two important things to notice here. Firstly, we actually start counting
from position zero, rather than one – in other words, position 3 is actually the
fourth character1. This explains why the first character of the first line of output is
p and not s as you might think. Secondly, the positions are inclusive at the start,
but exclusive at the stop. In other words, the expression protein[3:5] gives us
everything starting at the fourth character, and stopping just before the sixth
character (i.e. characters four and five).
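Only part of the code for this example survives above. Here is a sketch of the first two substrings, assuming the start and stop positions were 3 to 5 and 0 to 6 (which is what the output suggests):

```python
protein = "vlspadktnv"

# positions 3 and 4, i.e. the fourth and fifth characters
print(protein[3:5])

# positions start at zero, so this gives the first six characters
print(protein[0:6])

# as long as both positions are inside the string, the number of
# characters we get back is the stop position minus the start position
print(len(protein[3:5]))
```

The last line should print 2, which is a handy way to check that we've picked the right positions.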
If we just give a single number in the square brackets, we'll just get a single
character:
protein = "vlspadktnv"
first_residue = protein[0]
1 This seems very annoying when you first encounter it, but we'll see later why it's necessary.
We'll learn a lot more about this type of notation, and what we can do with it, in
chapter 4.
protein = "vlspadktnv"
# count amino acid residues
valine_count = protein.count('v')
lsp_count = protein.count('lsp')
tryptophan_count = protein.count('w')
count_amino_acids.py
valines: 2
leucines: 1
tryptophans: 0
protein = "vlspadktnv"
print(str(protein.find('p')))
print(str(protein.find('kt')))
print(str(protein.find('w')))
find_amino_acids.py
3
6
-1
Notice the behaviour of find() when we ask it to locate a substring that doesn't
exist – we get back the answer -1.
Both count() and find() have a pretty serious limitation: you can only search
for exact substrings. If you need to count the number of occurrences of a variable
protein motif, or find the position of a variable transcription factor binding site,
they will not help you. The whole of chapter 7 is devoted to tools that can do those
kinds of jobs.
Of the tools we've discussed in this section, three – replace(), count() and
find() – require at least two strings to work, so be careful that you don't get
confused about the order – remember that:
my_dna.count(my_motif)
is not the same as:
my_motif.count(my_dna)
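To see why the order matters, here is a small sketch using a made-up sequence and motif (both are assumptions chosen for illustration):

```python
my_dna = "ATGCGAATTCAT"   # hypothetical example sequence
my_motif = "GAATTC"       # hypothetical motif

# how many times does the motif occur in the sequence?
motif_in_dna = my_dna.count(my_motif)
# how many times does the whole sequence occur in the motif?
dna_in_motif = my_motif.count(my_dna)

print(motif_in_dna)
print(dna_in_motif)
```

The second call is perfectly legal Python, so we get no error message to warn us: we just get a count of zero, because the long sequence cannot occur inside the short motif.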
Recap
We started this chapter talking about strings and how to work with them, but along
the way we had to take a lot of diversions, all of which were necessary to
understand how the different string tools work. Thankfully, that means that we've
covered most of the nuts and bolts of the Python language, which will make future
chapters go much more smoothly.
We've learned about some general features of the Python programming language
like
• the difference between functions, statements and arguments
• the importance of comments and how to use them
• how to use Python's error messages to fix bugs in our programs
Exercises
Reminder: the descriptions of the exercises are deliberately terse and may be
somewhat ambiguous (just like requirements for programs you will write in real
life). See the solutions for in-depth discussions of the exercises.
Calculating AT content
Here's a short DNA sequence:
ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT
Write a program that will print out the AT content of this DNA sequence. Hint: you
can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide
(/) and parentheses to carry out calculations on numbers in Python.
Reminder: if you're using Python 2 rather than Python 3, include this line at the
top of your program:
from __future__ import division
Complementing DNA
Here's a short DNA sequence:
ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT
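Write a program that will print the complement of this sequence.
Restriction fragment lengths
Here's a short DNA sequence:
ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT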
The sequence contains a recognition site for the EcoRI restriction enzyme, which
cuts at the motif G*AATTC (the position of the cut is indicated by an asterisk).
Write a program which will calculate the size of the two fragments that will be
produced when the DNA sequence is digested with EcoRI.
Solutions
Calculating AT content
This exercise is going to involve a mixture of strings and numbers. Let's remind
ourselves of the formula for calculating AT content:
AT content = (A + T) / length
There are three numbers we need to figure out: the number of As, the number of Ts,
and the length of the sequence. We know that we can get the length of the
sequence using the len() function, and we can count the number of As and Ts
using the count() method. Here are a few lines of code that we think will
calculate the numbers we need:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
At this point, it seems sensible to check these lines before we go any further. So
rather than diving straight in and doing some calculations, let's print out these
numbers so that we can eyeball them and see if they look approximately right. We'll
have to remember to turn the numbers into strings using str() so that we can
print them:
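my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
print("length: " + str(length))
print("A count: " + str(a_count))
print("T count: " + str(t_count))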
length: 54
A count: 16
T count: 21
That looks about right, but how do we know if it's exactly right? We could go
through the sequence manually base by base, and verify that there are sixteen As
and twenty-one Ts, but that doesn't seem like a great use of our time: also, what
would we do if the sequence were 54 kilobases rather than 54 bases? A better idea is
to run the exact same code with a much shorter test sequence, to verify that it
works before going ahead and running it on the larger sequence.
Here's a version that uses a very short test sequence with one of each of the four
bases:
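test_dna = "ATGC"
length = len(test_dna)
a_count = test_dna.count('A')
t_count = test_dna.count('T')
print("length: " + str(length))
print("A count: " + str(a_count))
print("T count: " + str(t_count))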
length: 4
A count: 1
T count: 1
Everything looks OK – we can probably go ahead and run the code on the long
sequence. But wait; we know that the next step is going to involve doing some
calculations using the numbers. If we switch back to the long sequence now, then
we'll be in the same position as we were before – we'll end up with an answer for
the AT content, but we won't know if it's the right one.
A better plan is to stick with the short test sequence until we've written the whole
program, and check that we get the right answer for the AT content (we can easily
see by glancing at the test sequence that the AT content is 0.5). Here goes – we'll
use the add and divide symbols from the exercise hint:
test_dna = "ATGC"
length = len(test_dna)
a_count = test_dna.count('A')
t_count = test_dna.count('T')
at_content = a_count + t_count / length
print("AT content is " + str(at_content))
AT content is 1.25
That doesn't look right. Looking back at the code we can see what has gone wrong –
in the calculation, the division has taken precedence over the addition, so what we
have actually calculated is:
A + (T / length)
To fix it, all we need to do is add some parentheses around the addition, so that the
line becomes:
at_content = (a_count + t_count) / length
Now we get the correct output for the test sequence:
AT content is 0.5
and we can go ahead and run the program using the longer sequence, confident
that the code is working and that the calculations are correct. Here's the final
version:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))
at_content.py
AT content is 0.6851851851851852
Complementing DNA
This one seems pretty straightforward – we need to take our sequence and replace
A with T, T with A, C with G, and G with C. We'll have to make four separate calls to
replace(), and use the return value for each one as the input for the next one.
Let's try it:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
# replace A with T
replacement1 = my_dna.replace('A', 'T')
# replace T with A
replacement2 = replacement1.replace('T', 'A')
# replace C with G
replacement3 = replacement2.replace('C', 'G')
# replace G with C
replacement4 = replacement3.replace('G', 'C')
# print the result of the final replacement
print(replacement4)
ACACAACCAAAACCAAAACAAAAACCAAACAAACAAAAAAAACCAACCCAACAA
We can see just by looking at the original sequence that the first letter is A, so the
first letter of the printed sequence should be its complement, T. But instead the
first letter is A. In fact, all of the bases in the printed sequence are either A or C.
This is definitely not what we want!
Let's try and track the problem down by printing out all the intermediate steps as
well:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
replacement1 = my_dna.replace('A', 'T')
print(replacement1)
replacement2 = replacement1.replace('T', 'A')
print(replacement2)
replacement3 = replacement2.replace('C', 'G')
print(replacement3)
replacement4 = replacement3.replace('G', 'C')
print(replacement4)
The output from this program makes it clear what the problem is:
TCTGTTCGTTTTCGTTTTGTTTTTGCTTTCTTTCTTTTTTTTCGTTGCGTTCTT
ACAGAACGAAAACGAAAAGAAAAAGCAAACAAACAAAAAAAACGAAGCGAACAA
AGAGAAGGAAAAGGAAAAGAAAAAGGAAAGAAAGAAAAAAAAGGAAGGGAAGAA
ACACAACCAAAACCAAAACAAAAACCAAACAAACAAAAAAAACCAACCCAACAA
The first replacement (the result of which is shown in the first line of the output)
works fine – all the As have been replaced with Ts (for example, look at the first
character – it's A in the original sequence and T in the first line of the output).
The second replacement is where it starts to go wrong: all the Ts are replaced by As,
including those that were there as a result of the first replacement. So during
the first two replacements, the first character is changed from A to T and then
straight back to A again.
How are we going to get round this problem? One option is to pick a temporary
alphabet of four letters and do each replacement twice:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
replacement1 = my_dna.replace('A', 'H')
replacement2 = replacement1.replace('T', 'J')
replacement3 = replacement2.replace('C', 'K')
replacement4 = replacement3.replace('G', 'L')
replacement5 = replacement4.replace('H', 'T')
replacement6 = replacement5.replace('J', 'A')
replacement7 = replacement6.replace('K', 'G')
replacement8 = replacement7.replace('L', 'C')
print(replacement8)
This gets us the result we are looking for. It avoids the problem with the previous
program by using another letter to stand in for each base while the replacements
are being done. For example, A is first converted to H and then later on H is
converted to T.
Here's a slightly more elegant way of doing it. We can take advantage of the fact
that the replace() method is case-sensitive, and make all the replaced bases
lower case. Then, once all the replacements have been carried out, we can simply
call upper() and change the whole sequence back to upper case. Let's take a look
at how this works:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
replacement1 = my_dna.replace('A', 't')
print(replacement1)
replacement2 = replacement1.replace('T', 'a')
print(replacement2)
replacement3 = replacement2.replace('C', 'g')
print(replacement3)
replacement4 = replacement3.replace('G', 'c')
print(replacement4)
print(replacement4.upper())
complement_dna.py
The output lets us see exactly what's happening – notice that in this version of the
program we print the final string twice, once as it is and then once converted to
upper case:
tCTGtTCGtTTtCGTtTtGTtTTTGCTtTCtTtCtTtTtTtTCGtTGCGTTCtT
tCaGtaCGtaatCGatatGataaaGCataCtatCtatatataCGtaGCGaaCta
tgaGtagGtaatgGatatGataaaGgatagtatgtatatatagGtaGgGaagta
tgactagctaatgcatatcataaacgatagtatgtatatatagctacgcaagta
TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA
We can see that as the program runs, each base in turn is replaced by its
complement in lower case. Since the next replacement is only looking for upper
case characters, bases don't get changed back as they did in the first version of our
program.
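As an aside, Python strings also have a translate() method that swaps several characters in a single pass, which avoids the problem of bases being changed back altogether. This is not the approach used in this chapter; the sketch below assumes Python 3 (where str.maketrans() is available), and the name complement_table is an illustrative choice:

```python
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

# build a table that maps each base to its complement (Python 3 syntax)
complement_table = str.maketrans("ATCG", "TAGC")

# every character is looked up once, so nothing is replaced twice
complement = my_dna.translate(complement_table)
print(complement)
```

Because each character is translated exactly once, there is no need for a temporary alphabet or a change of case.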
          1         2         3         4         5
0123456789012345678901234567890123456789012345678901234
ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT
Since the EcoRI enzyme cuts the DNA between the G and first A, we can figure out
that the first fragment will run from position 0 to position 21, and the second
fragment from position 22 to the last position, 54. Therefore the lengths of the two
fragments are 22 and 33.
Writing a program to figure out the lengths is just a question of applying the same
logic. We'll use the find() method to figure out the position of the start of the
EcoRI motif, then add one to account for the fact that the positions start counting
from zero – this will give us the length of the first fragment. From there we can get
the length of the second fragment by finding the length of the input sequence and
subtracting the length of the first fragment:
my_dna = "ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT"
frag1_length = my_dna.find("GAATTC") + 1
frag2_length = len(my_dna) - frag1_length
print("length of fragment one is " + str(frag1_length))
print("length of fragment two is " + str(frag2_length))
fragment_lengths.py
The output from this program confirms that it agrees with the answer we got
manually:
length of fragment one is 22
length of fragment two is 33
If we wanted to run the same program using a different restriction enzyme, we'd
have to change both the string that we used in the find() method call, and the
number that we add in order to take account of the cut site.
It's worth noting that this program assumes that the DNA sequence definitely does
contain the restriction site we're looking for. If we try the same program using a
DNA sequence which doesn't contain the site, it will report a fragment of length 0
and a fragment whose length is equal to the total length of the DNA sequence.
While this is not strictly wrong, it's a little misleading – if we were going to use this
program for real-life work, we'd probably prefer to have slightly different behaviour
depending on whether or not the DNA sequence contained the motif we're looking
for. We'll talk about how to implement that type of behaviour in chapter 6.
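As a preview, here is a minimal sketch of what that different behaviour might look like. It relies on the if statement, which has not been introduced yet, and the message text is an assumption made for illustration. The sequence used is the one from the first exercise, which has no EcoRI site:

```python
# this sequence does not contain the EcoRI motif
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"

motif_position = my_dna.find("GAATTC")
if motif_position == -1:
    # find() returns -1 when the motif is absent, so there is nothing to cut
    message = "no EcoRI site found"
else:
    frag1_length = motif_position + 1
    frag2_length = len(my_dna) - frag1_length
    message = "fragment lengths are " + str(frag1_length) + " and " + str(frag2_length)

print(message)
```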
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
Because the sequence is quite long, this single statement actually runs over three
lines – although, of course, if you open up this code in a text editor it might look
different depending on your set-up.
1 We know that that's not really how a spliceosome works, but it's fine as a conceptual model.
The next step in solving this exercise is to extract the two exons from our DNA
sequence. We'll have to use the substring notation from earlier in the chapter, and
we'll need to take care with the numbers.
The first bit of the sequence goes from the first character to the sixty-third
character, so we might be tempted to write a line like this:
exon1 = my_dna[1:63]
However, remember that when we take a substring like this the numbers are
inclusive at the start, but exclusive at the end, so our stop position needs to be one
higher:
exon1 = my_dna[1:64]
The second exon starts at the ninety-first base and goes to the end of the DNA
sequence. There are a number of different ways we could express this. One is to
figure out the position of the last character by using the len() function to get the
length of the DNA sequence:
exon2 = my_dna[91:len(my_dna)]
Another is simply to use a very large number as the stop position:
exon2 = my_dna[91:10000]
taking advantage of the fact that giving a stop position beyond the end of the
my_dna string will cause the substring to run to the end. In fact, we can do even
better by leaving off the stop position entirely – this code will do the same:
exon2 = my_dna[91:]
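We can convince ourselves that the three forms are interchangeable with a quick sketch. The short sequence and the start position of 5 are made up for illustration:

```python
my_dna = "ATCGATCGTAGCTAGCATCG"  # short hypothetical sequence

# three ways to take everything from position 5 to the end
to_end_with_len = my_dna[5:len(my_dna)]
to_end_with_big_stop = my_dna[5:10000]
to_end_with_no_stop = my_dna[5:]

print(to_end_with_no_stop)
print(to_end_with_len == to_end_with_big_stop == to_end_with_no_stop)
```

Leaving off the stop position is usually the safest choice, since it keeps working however long the sequence is.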
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[1:64]
exon2 = my_dna[91:]
print(exon1 + exon2)
TCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCATCGATCGAT
ATCGATGCATCGACTACTAT
This looks plausible at first glance, but when we look more closely we can see that something is not right. The printed
coding sequence is supposed to start at the very first character of the input
sequence, but it's starting at the second. We have forgotten to take into account the
fact that Python starts counting from zero, so our numbers are all too high by one.
Let's try again:
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[0:63]
exon2 = my_dna[90:]
print(exon1 + exon2)
introns1.py
Now the output looks correct – the coding sequence starts at the very beginning of
the input sequence:
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCATCGATCGA
TATCGATGCATCGACTACTAT
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[0:63]
exon2 = my_dna[90:10000]
coding_length = len(exon1 + exon2)
total_length = len(my_dna)
print(coding_length / total_length)
0.780487804878
We have calculated the coding proportion as a fraction, but the exercise called for a
percentage. We can easily fix this by multiplying by 100. Notice that the symbol for
multiplication is not x, as you might think, but *. The final code:
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[0:63]
exon2 = my_dna[90:]
coding_length = len(exon1 + exon2)
total_length = len(my_dna)
print(100 * coding_length / total_length)
introns2.py
78.0487804878
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]
print(exon1 + intron.lower() + exon2)
introns3.py
Looking at the output, we see an upper case DNA sequence with a lower case
section in the middle, as expected:
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAtcgatcgatcg
atcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT
When we are applying several transformations to text, as in this exercise, there are
usually a number of different ways we can write the program. For example, we
could store the lower case version of the intron, rather than converting it to lower
case when printing:
intron = my_dna[63:90].lower()
Or we could avoid using variables for the introns and exons altogether, and do
everything in one big print() statement:
print(my_dna[0:63] + my_dna[63:90].lower() + my_dna[90:])
This last option is very concise, but a bit harder to read than the more verbose way.
As the exercises in this book get longer, you'll notice that there are more and more
different ways to write the code – you may end up with solutions that look very
different to the example solutions. When trying to choose between different ways
to write a program, always favour the solution that is clearest in intent and easiest
to read.
56 Chapter 3: Reading and writing files
1 i.e. files which you can open in a text editor and read, as opposed to binary files which cannot be read
directly.
2 In this book we'll mostly be talking about FASTA format as it's the simplest and most common format, but
there are many more.
Another reason for our interest in file input/output is the need for our Python
programs to work as part of a pipeline or work flow involving other, existing tools.
When it comes to using Python in the real world, we often want Python to either
accept data from, or provide data to, another program. Often the easiest way to do
this is to have Python read, or write, files in a format that the other program
already understands.
my_file = open("dna.txt")
A file object is a new type which we haven't encountered before, and it's a little more
complicated than the string and number types that we saw in the previous chapter.
With strings and numbers it was easy to understand what they represented – a
single bit of text, or a single number. A file object, in contrast, represents
something a bit less tangible – it represents a file on your computer's hard drive.
The way that we use file objects is a bit different to strings and numbers as well. If
you glance back at the examples from the previous chapter you'll see that most of
the time when we want to use a variable containing a string or number we just use
the variable name:
my_string = 'abcdefg'
print(my_string)
my_number = 42
print(my_number + 1)
In contrast, when we're working with file objects most of our interaction will be
through methods. This style of programming will seem unusual at first, but as we'll
see in this chapter, the file type has a well thought-out set of methods which let us
do lots of useful things.
The first thing we need to be able to do is to read the contents of the file. The file
type has a read() method which does this. It doesn't take any arguments, and the
return value is a string, which we can store in a variable. Once we've read the file
contents into a variable, we can treat them just like any other string – for example,
we can print them:
my_file = open("dna.txt")
file_contents = my_file.read()
print(file_contents)
print_file_contents.py
1 my_file_name = "dna.txt"
2 my_file = open(my_file_name)
3 my_file_contents = my_file.read()
What's going on here? On line 1, we store the string dna.txt in the variable
my_file_name. On line 2, we use the variable my_file_name as the argument
to the open() function, and store the resulting file object in the variable
my_file. On line 3, we call the read() method on the variable my_file, and
store the resulting string in the variable my_file_contents.
The important thing to understand about this code is that there are three separate
variables which have different types and which are storing three very different
things:
• my_file_name is a string, and it stores the name of a file on disk.
• my_file is a file object, and it represents the file itself.
• my_file_contents is a string, and it stores the text that is in the file.
Remember that variable names are arbitrary – the computer doesn't care what you
call your variables. So this piece of code is exactly the same as the previous
example:
apple = "dna.txt"
banana = open(apple)
grape = banana.read()
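A common error is to call read() on the wrong thing. read() is a method of file
objects, so if we try to use it on the file name instead:
my_file_name = "dna.txt"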
my_contents = my_file_name.read()
read_error.py
we'll get an AttributeError – Python will complain that strings don't have a
read() method1:
AttributeError: 'str' object has no attribute 'read'
Another common error is to use the file object when we meant to use the file
contents. If we try to print the file object:
my_file_name = "dna.txt"
my_file = open(my_file_name)
print(my_file)
print_file_object.py
We won't discuss the meaning of this line now: just remember that if you try to
print the contents of a file but instead you get some output that looks like the
above, you have almost definitely printed the file object rather than the file
contents.
1 From now on, I'll just show the relevant bits of output when discussing error messages.
single line with a short DNA sequence. Open the file up in a text editor and take a
look at it.
We're going to write a simple program to read the DNA sequence from the file and
print it out along with its length. Putting together the file functions and methods
from this chapter, and the material we saw in the previous chapter, we get the
following code:
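my_file = open("dna.txt")
my_dna = my_file.read()
dna_length = len(my_dna)
print("sequence is " + my_dna + " and length is " + str(dna_length))
print_seq_and_length.py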
When we look at the output, we can see that there are two things wrong.
sequence is ACTGTACGTGCACTGATC
and length is 19
Firstly, the output has been split over two lines, even though we didn't ask for it.
And secondly, the length is wrong – there are only 18 characters in the DNA string.
Both of these problems have the same explanation: Python has included the
newline character at the end of the file as part of the contents. In other words, the
variable my_dna has a newline character at the end of it. If we could view the
my_dna variable directly1, we would see that it looks like this:
'ACTGTACGTGCACTGATC\n'
1 In fact, we can do this – there's a function called repr() that returns a representation of a variable.
This explains why the output from our program is split over two lines – the newline
character is part of the string we are trying to print. It also explains why the length
is wrong – Python is including the newline character when it counts the number of
characters in the string.
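We can reproduce both symptoms without a file at all. In this sketch the string is typed in directly, on the assumption that it matches what read() returns for our one-line file:

```python
# the file contents as read() would return them, newline included
my_dna = "ACTGTACGTGCACTGATC\n"

# repr() shows the newline as the two characters \n rather than a line break
print(repr(my_dna))

# len() counts the newline as one extra character
print(len(my_dna))
```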
The solution is simple. Because this is such a common problem, strings have a
method for removing newline characters from the end of them. The method is
called rstrip(), and it takes one string argument which is the character that you
want to remove. In this case, we want to remove the newline character (\n). Here's
a modified version of the code – note that the argument to rstrip() is itself a
string so needs to be enclosed in quotes:
my_file = open("dna.txt")
my_file_contents = my_file.read()
# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
dna_length = len(my_dna)
print("sequence is " + my_dna + " and length is " + str(dna_length))
print_length_and_seq2.py
In the code above, we first read the file contents and then removed the newline, in
two separate steps:
my_file_contents = my_file.read()
my_dna = my_file_contents.rstrip("\n")
but it's more common to read the contents and remove the newline all in one go,
like this:
my_dna = my_file.read().rstrip("\n")
This is a bit tricky to read at first as we are using two different methods (read()
and rstrip()) in the same statement. The key is to read it from left to right – we
take the my_file variable and use the read() method on it, then we take the
output of that method (which we know is a string) and use the rstrip() method
on it. The result of the rstrip() method is then stored in the my_dna variable.
If you find it difficult to write the whole thing as one statement like this, just break it
up and do the two things separately – your programs will run just as well.
Missing files
What happens if we try to read a file that doesn't exist?
my_file = open("nonexistent.txt")
missing_file.py
If you encounter this error, you've probably got the filename wrong.
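For the curious, here is a sketch of how a program could notice the problem itself rather than stopping with an error. It uses try and except, which have not been covered at this point, and it assumes Python 3, where the error is called FileNotFoundError (Python 2 reports an IOError instead). The temporary folder is only there to guarantee that the file really is missing:

```python
import os
import tempfile

# a brand-new empty folder, so the file below cannot exist
missing_path = os.path.join(tempfile.mkdtemp(), "nonexistent.txt")

try:
    my_file = open(missing_path)
    outcome = "file opened"
    my_file.close()
except FileNotFoundError:
    # open() could not find the file, most likely because the name is wrong
    outcome = "could not open file"

print(outcome)
```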
Printing to the screen has a few drawbacks, however, when writing code that we might want to use in
real life.
Printing output to the screen only really works well when there isn't very much of
it1. It's great for short programs and status messages, but quickly becomes
cumbersome for large amounts of output. Some terminals struggle with large
amounts of text, or worse, have a limited scrollback capability which can cause the
first bit of your output to disappear. It's not easy to search in output that's being
displayed at the terminal, and long lines tend to get wrapped. Also, for many
programs we want to send different bits of output to different files, rather than
having it all dumped in the same place.
Most importantly, terminal output vanishes when you close your terminal program.
For small programs like the examples in this book, that's not a problem – if you
want to see the output again you can just re-run the program. If you have a
program that requires a few hours to run, that's not such a great option.
1 Linux users may be aware that we can redirect terminal output to a file using shell redirection, which can
get around some of these problems.
2 We call this the mode of the file.
3 These are the most commonly-used options – there are a few others.
whatever data we write to it. If we open an existing file with the mode "a", it will add
new data onto the end of the file, but will not remove any existing content. If there
doesn't already exist a file with the specified name, then "w" and "a" behave
identically – they will both create a new file to hold the output.
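The difference between the two modes can be sketched as follows. The example leans on the write() and close() methods, which are described later in this chapter, and it works in a temporary folder (an assumption made so that no real files are overwritten):

```python
import os
import tempfile

# work in a throwaway folder so no real files are touched
path = os.path.join(tempfile.mkdtemp(), "out.txt")

first = open(path, "w")      # the file doesn't exist yet, so "w" creates it
first.write("AAA")
first.close()

second = open(path, "a")     # "a" adds to the end of the existing content
second.write("CCC")
second.close()

reader = open(path)
after_append = reader.read()
reader.close()

third = open(path, "w")      # "w" throws away the existing content
third.write("GGG")
third.close()

reader = open(path)
after_write = reader.read()
reader.close()

print(after_append)
print(after_write)
```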
Quite a lot of Python functions and methods have these optional arguments. For
the purposes of this book, we will only mention them when they are directly
relevant to what we're doing. If you want to see all the optional arguments for a
particular method or function, the best place to look is the official Python
documentation – see chapter 1 for details.
Once we've opened a file for writing, we can use the file write() method to write
some text to it. write() works a lot like print() – it takes a single string
argument – but instead of printing the string to the screen it writes it to the file.
Here's how we use open() with a second argument to open a file and write a single
line of text to it:
write.py
Because the output is being written to the file in this example, you won't see any
output on the screen if you run it. To check that the code has worked, you'll have to
run it, then open up the file out.txt in your text editor and check that its contents
are what you expect1.
Remember that with write(), just like with print(), we can use any string as
the argument. This also means that we can use any method or function that
returns a string. The following are all perfectly OK:
1 .txt is the standard filename extension for a plain text file. Later in this book, when we generate output files
with a particular format, we'll use different filename extensions.
# write "abcdef"
my_file.write("abc" + "def")
# write "8"
my_file.write(str(len('AGTGCTAG')))
# write "TTGC"
my_file.write("ATGC".replace('A', 'T'))
# write "atgc"
my_file.write("ATGC".lower())
Closing files
There's one more important file method to look at before we finish this chapter –
close(). Unsurprisingly, this is the opposite of open() (but note that it's a
method, whereas open() is a function). We should call close() after we're done
reading or writing to a file – we won't go into the details here, but it's a good habit
to get into as it avoids some types of bugs that can be tricky to track down 1.
close() is an unusual method as it takes no arguments (so it's called with an
empty pair of parentheses) and doesn't return any useful value:
close_file.py
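Here is a minimal sketch of the whole cycle of opening, writing, closing and reading back. The temporary folder, the text that is written and the variable names result and contents are all assumptions made for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")  # throwaway location

my_file = open(path, "w")
my_file.write("ATGC")
result = my_file.close()   # no arguments, and nothing useful comes back

# once the file is closed, the text is on disk and can be read back
check_file = open(path)
contents = check_file.read()
check_file.close()

print(result)
print(contents)
print(my_file.closed)
```

The closed attribute of a file object tells us whether close() has already been called on it.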
1 Specifically, it helps to ensure that output to a file is flushed, which is necessary when we want to make a
file available to another program as part of our work flow.
my_file = open("/home/martin/myfolder/myfile.txt")
my_file = open(r"c:\windows\Desktop\myfolder\myfile.txt")
my_file = open("/Users/martin/Desktop/myfolder/myfile.txt")
Recap
We've taken a whole chapter to introduce the various ways of reading and writing
to files, because it's such an important part of building programs that are useful in
biology. We've seen how working with file contents is always a two-step process –
we must open a file before reading or writing – and looked at several common
pitfalls.
We've also introduced a couple of new concepts that are more widely-applicable.
We've encountered our first example of an optional argument in the open()
function (we'll see more of these in future chapters). We've also encountered the
first example of a complex data type – the file object – and seen how we can do
useful things with it by calling its various methods, in contrast to the simple strings
and numbers that we've been working with in the previous chapter. In future
chapters, we'll learn about more of these complex data types and how to use them.
1 The extra r character before the string is necessary to prevent Python from trying to interpret the
backslash in the file path; see chapter 7 for an explanation.
Exercises
>sequence_name
ATCGACTGATCGATCGTACGAT
>sequence_one
ATCGATCGATCGATCGAT
>sequence_two
ACTAGCTAGCTAGCATCG
>sequence_three
ACTGCATCGATCGTACCT
Write a program that will create a FASTA file for the following three sequences –
make sure that all sequences are in upper case and only contain the bases A, T, G
and C.
Solutions
my_dna =
"ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATC
GATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]
print(exon1 + intron.lower() + exon2)
What changes do we need to make? Firstly, we need to read the DNA sequence from
a file instead of writing it in the code:
dna_file = open("genomic_dna.txt")
my_dna = dna_file.read()
Secondly, we need to create two new file objects to hold the output:
Finally, we need to concatenate the two exon sequences and write them to the
coding DNA file, and write the intron sequence to the non-coding DNA file:
coding_file.write(exon1 + exon2)
noncoding_file.write(intron)
Let's put it all together, with some blank lines to separate out the different parts of
the program:
genomic_dna.py
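One way the assembled program might look is sketched below. Several details are assumptions rather than part of the exercise: the output file names coding_dna.txt and noncoding_dna.txt, the use of a temporary folder, the made-up 120-base input sequence that is written first so that the sketch can run on its own, and the rstrip() call that removes the trailing newline:

```python
import os
import tempfile

# set-up for this sketch only: create a small input file in a throwaway folder
folder = tempfile.mkdtemp()
genomic_path = os.path.join(folder, "genomic_dna.txt")
setup_file = open(genomic_path, "w")
setup_file.write("ATCG" * 30 + "\n")   # hypothetical 120-base sequence
setup_file.close()

# open the input file and read the sequence
dna_file = open(genomic_path)
my_dna = dna_file.read().rstrip("\n")
dna_file.close()

# extract the different bits of DNA sequence
exon1 = my_dna[0:63]
intron = my_dna[63:90]
exon2 = my_dna[90:]

# open the two output files (the names are assumed)
coding_path = os.path.join(folder, "coding_dna.txt")
noncoding_path = os.path.join(folder, "noncoding_dna.txt")
coding_file = open(coding_path, "w")
noncoding_file = open(noncoding_path, "w")

# write the sequences to the output files, then close them
coding_file.write(exon1 + exon2)
noncoding_file.write(intron)
coding_file.close()
noncoding_file.close()
```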
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"
FASTA format has alternating lines of header and sequence, so before we try any
sequence manipulation, let's try to write a program that produces the lines in the
right order. Rather than writing to a file, we'll print the output to the screen for now
– that will make it easier to see the output right away. Once we've got it working,
we'll switch over to file output. Here are a few lines which will print data to the
screen:
print(header_1)
print(seq_1)
print(header_2)
print(seq_2)
print(header_3)
print(seq_3)
ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456
actgatcgacgatcgatcgatcacgact
HIJ789
ACTGAC-ACTGT--ACTGTA----CATGTG
Not far off – the lines are in the right order, but we forgot to include the greater-
than symbol at the start of the header. Also, we don't really need to print the
header and the sequence separately for each sequence – we can include a newline
character in the print string in order to get them on separate lines. Here's an
improved version of the code:
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
actgatcgacgatcgatcgatcacgact
>HIJ789
ACTGAC-ACTGT--ACTGTA----CATGTG
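A minimal sketch of code that would produce the output above, using the header and sequence variables defined earlier. The record_1, record_2 and record_3 variable names are just for illustration.

```python
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# each record is a greater-than symbol, the header, a newline, then the sequence
record_1 = ">" + header_1 + "\n" + seq_1
record_2 = ">" + header_2 + "\n" + seq_2
record_3 = ">" + header_3 + "\n" + seq_3

print(record_1)
print(record_2)
print(record_3)
```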
Next, let's tackle the problems with the sequences. The second sequence is in lower
case, and it needs to be in upper case – we can fix that using the upper() string
method. The third sequence has a bunch of gaps that we need to remove. We
haven't come across a remove method... but we do know how to replace one
character with another. If we replace all the gap characters with an empty string, it
will be the same as removing them1. Here's a version that fixes both sequences:
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG
>DEF456
ACTGATCGACGATCGATCGATCACGACT
>HIJ789
ACTGACACTGTACTGTACATGTG
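One way to get the output above is sketched here, assuming the same variables as before. Only the second and third records need changing, and the record_ variable names are again just for illustration.

```python
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

record_1 = ">" + header_1 + "\n" + seq_1
# upper() gives us an upper case copy of the second sequence
record_2 = ">" + header_2 + "\n" + seq_2.upper()
# replacing each hyphen with an empty string removes the gaps
record_3 = ">" + header_3 + "\n" + seq_3.replace("-", "")

print(record_1)
print(record_2)
print(record_3)
```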
The final step is to switch from printed output to writing to a file. We'll open a new
file, and change the three print() lines to write():
1 An empty string is just a pair of open and close quotation marks with nothing in between them.
After making these changes the code doesn't produce any output on the screen, so
to see what's happened we'll need to take a look at the sequences.fasta file:
>ABC123
ATCGTACGATCGATCGATCGCTAGACGTATCG>DEF456
ACTGATCGACGATCGATCGATCACGACT>HIJ789
ACTGACACTGTACTGTACATGTG
This doesn't look right – the second and third lines have been joined together, as
have the fourth and fifth. What has happened?
It looks like we've uncovered a difference between the print() function and the
write() method. print() automatically puts a newline at the end of the string,
whereas write() doesn't. This means we've got to be careful when switching
between them! The fix is quite simple: we'll just add a newline onto the end of each
string that gets written to the file:
The arguments for the write statements are getting quite complicated, but they are
all made up of simple building blocks. For example the last one, if we translated it
into English, would read "a greater-than symbol, followed by the variable header_3,
followed by a newline, followed by the variable seq_3 with all hyphens replaced with
nothing, followed by another newline".
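To see those building blocks in action, here is a small sketch that writes just the third record and then reads the file back. Writing into a temporary folder is an assumption made so that the example runs on its own.

```python
import os
import tempfile

header_3 = "HIJ789"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# Hypothetical location for the output file
fasta_path = os.path.join(tempfile.mkdtemp(), "sequences.fasta")

output = open(fasta_path, "w")
# write() adds nothing of its own, so both newlines are spelled out
output.write(">" + header_3 + "\n" + seq_3.replace("-", "") + "\n")
output.close()

# read the file back to check what was written
check_file = open(fasta_path)
written = check_file.read()
check_file.close()
print(written)
```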
Here's the final code, including the variable definition at the beginning, with blank
lines and comments:
# write the header and sequence for seq3 with hyphens removed
output.write('>' + header_3 + '\n' + seq_3.replace('-', '') + '\n')
writing_a_fasta_file.py
There's an exercise that uses different techniques to solve a very similar problem at
the end of the chapter on functional programming in Advanced Python for Biologists
– if you find yourself carrying out this type of process in real life code, then it's
probably worth a look.
Remember, the first argument to open is a string, so it's fine to use a concatenation
because we know that the result of concatenating two strings is also a string.
We'll also change the write() statements so that we have one for each of the
output files. We need to be careful with the number here in order to make sure that
we get the right sequence in each file. Here's the final code, with comments.
writing_multiple_fasta_files.py
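A minimal sketch of this approach follows. It assumes that each output file is named after its header with ".fasta" added on the end, and it writes into a temporary folder (the out_dir variable is hypothetical) so that the example runs on its own.

```python
import os
import tempfile

# set the values of all the header variables
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"

# set the values of all the sequence variables
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# Hypothetical output folder
out_dir = tempfile.mkdtemp()

# make three files to hold the output; each name is built by concatenation
output_1 = open(os.path.join(out_dir, header_1 + ".fasta"), "w")
output_2 = open(os.path.join(out_dir, header_2 + ".fasta"), "w")
output_3 = open(os.path.join(out_dir, header_3 + ".fasta"), "w")

# write one sequence to each output file, taking care to match the numbers
output_1.write(">" + header_1 + "\n" + seq_1 + "\n")
output_2.write(">" + header_2 + "\n" + seq_2.upper() + "\n")
output_3.write(">" + header_3 + "\n" + seq_3.replace("-", "") + "\n")

output_1.close()
output_2.close()
output_3.close()
```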
Looking at the code above, it seems like there's a lot of redundancy there. Each of
the four sections of code – setting the header values, setting the sequence values,
creating the output files, and writing data to the output files – consists of three
nearly-identical statements. Although the solution works, it seems to involve a lot
1 We know that files are slightly different to strings and numbers because they can store a lot of information,
but each file object still only refers to a single file.
81 Chapter 4: Lists and loops
The limitations of this approach became clear quite quickly as we looked at the
solution code – it only worked because the number of sequences was small, and we
knew the number in advance. If we were to repeat the exercise with three hundred
or three thousand sequences, the vast majority of the code would be given over to
storing variables and it would become completely unmanageable. And if we were
to try and write a program that could process an unknown number of input
sequences (for instance, by reading them from a file), we wouldn't be able to do it.
To make our programs able to process multiple pieces of data, we need an entirely
new type of structure which can hold many pieces of information at the same time
– a list.
We've also dealt exclusively with programs whose statements are executed from top
to bottom in a very straightforward way. This has great advantages when first
starting to think about programming – it makes it very easy to follow the flow of a
program. The downside of this sequential style of programming, however, is that it
leads to very redundant code like we saw at the end of the previous chapter:
Again, it was only possible to solve the exercise in this manner because we knew in
advance the number of output files we were going to need. Looking at the code, it's
clear that these three lines consist of essentially the same statement being
executed multiple times, with some slight variations. This idea of repetition-with-
variation is incredibly common in programming problems, and Python has a built-
in tool for expressing it – loops.
Each individual item in a list is called an element. To get a single element from the
list, write the variable name followed by the index of the element you want in
square brackets:
create_list.py
If we want to go in the other direction – i.e. we know which element we want but
we don't know the index – we can use the index() method, which takes an
argument of any type and returns the position of the argument in the list:
Remember that in Python we start counting from zero rather than one, so the first
element of a list is always at index zero. If we give a negative number, Python starts
counting from the end of the list rather than the beginning – so it's easy to get the
last element from a list:
last_ape = apes[-1]
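The following sketch pulls these ideas together. The contents of the two lists are assumptions chosen purely for illustration.

```python
# Example data (assumed for illustration)
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]

# indexing starts at zero, so index 0 is the first element
first_site = conserved_sites[0]
second_ape = apes[1]

# index() goes the other way: from an element to its position
chimp_index = apes.index("Pan troglodytes")

# negative indices count from the end of the list
last_ape = apes[-1]

print(first_site)
print(second_ape)
print(chimp_index)
print(last_ape)
```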
What if we want to get more than one element from a list? We can give a start and
stop position, separated by a colon, to specify a range of elements:
sublist.py
Does this look familiar? It's the exact same notation that we used to get substrings
back in chapter 2, and it works in exactly the same way – numbers are inclusive at
the start and exclusive at the end. The fact that we use the same notation for
strings and lists hints at a deeper relationship between the two types. In fact, what
we were doing when extracting substrings in chapter 2 was treating a string as
though it were a list of characters. This idea – that we can treat a variable as
though it were a list when it's not – is a powerful one in Python and we'll come back
to it later in this chapter (and also in the chapter on iterators in Advanced Python
for Biologists).
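Here is a short sketch showing the same start and stop notation applied to a list and to a string. The list of taxonomic ranks is an assumption chosen for illustration.

```python
ranks = ["kingdom", "phylum", "class", "order", "family"]
name = "python"

# start is inclusive, stop is exclusive: this takes elements 2, 3 and 4
lower_ranks = ranks[2:5]

# exactly the same notation takes characters 2, 3 and 4 of a string
middle = name[2:5]

print(lower_ranks)
print(middle)
```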
append.py
We can get the length of a list by using the len() function, just like we did for
strings:
list_length.py
The output shows that the number of elements in apes really has changed:
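A minimal sketch of appending to a list and checking its length before and after. The species names are assumptions chosen for illustration.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
length_before = len(apes)
print("There are " + str(length_before) + " apes")

# append() adds a single element to the end of the existing list
apes.append("Pan paniscus")
length_after = len(apes)
print("Now there are " + str(length_after) + " apes")
```

Notice that append() changes the list itself, so there is no need to store its result in a new variable.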
We can concatenate two lists just as we did with strings, by using the plus symbol:
concatenate_lists.py
As we can see from the output, this doesn't change either of the two original lists –
it makes a brand new list which contains elements from both:
3 apes
2 monkeys
5 primates
If we want to add elements from a list onto the end of an existing list, changing it
in the process, we can use the extend() method. extend() behaves like append()
but takes a list as its argument rather than a single element.
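The difference between the two approaches can be sketched like this. The species names are assumptions chosen so that the counts match the output shown above.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]

# concatenation builds a brand new list and leaves the originals alone
primates = apes + monkeys
print(str(len(apes)) + " apes")
print(str(len(monkeys)) + " monkeys")
print(str(len(primates)) + " primates")

# extend() changes the list it is called on
apes.extend(monkeys)
print("after extend() the apes list has " + str(len(apes)) + " elements")
```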
Here are two more list methods that change the variable they're used on:
reverse() and sort(). Both reverse() and sort() work by changing the
order of the elements in the list. If we want to print out a list to see how this works,
we need to use str() (just as we did when printing out numbers):
reverse_and_sort.py
If we take a look at the output, we can see how the order of the elements in the list
is changed by these two methods:
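A small sketch of both methods at work. The list of taxonomic ranks and the reversed_ranks variable are assumptions used for illustration.

```python
ranks = ["kingdom", "phylum", "class", "order", "family"]
print("at the start : " + str(ranks))

# reverse() flips the order of the elements in place
ranks.reverse()
reversed_ranks = list(ranks)  # keep a copy so we can compare later
print("after reversing : " + str(ranks))

# sort() puts the elements into alphabetical order in place
ranks.sort()
print("after sorting : " + str(ranks))
```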
Writing a loop
Imagine we wanted to take our list of apes:
but this is very repetitive and relies on us knowing the number of elements in the
list. What we need is a way to say something along the lines of "for each element in
the list of apes, print out the element, followed by the words ' is an ape'". Python's loop
syntax allows us to express those instructions like this:
loop.py
1 We can sort in other ways too – take a look at the functional programming chapter in Advanced Python for
Biologists.
Let's take a moment to look at the different parts of this loop. We start by writing
for x in y, where y is the name of the list we want to process and x is the name
we want to use for the current element each time round the loop.
x is just a variable name (so it follows all the rules that we've already learned about
variable names), but it behaves slightly differently to all the other variables we've
seen so far. In all previous examples, we create a variable and store something in it,
and then the value of that variable doesn't change unless we change it ourselves. In
contrast, when we create a variable to be used in a loop, we don't set its value – the
value of the variable will be automatically set to each element of the list in turn,
and it will be different each time round the loop.
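A minimal sketch of such a loop, assuming a list of three species names. Here ape plays the role of x and apes plays the role of y.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

# ape is set to each element of apes in turn; we never assign it ourselves
for ape in apes:
    print(ape + " is an ape")
```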
This first line of the loop ends with a colon, and all the subsequent lines (just one,
in this case) are indented. Indented lines can start with any number of tab or space
characters, but they must all be indented in the same way. This pattern – a line
which ends with a colon, followed by some indented lines – is very common in
Python, and we'll see it in several more places throughout this book. A group of
indented lines is often called a block of code1.
In this case, we refer to the indented block as the body of the loop, and the lines
inside it will be executed once for each element in the list. To refer to the current
element, we use the variable name that we wrote in the first line. The body of the
loop can contain as many lines as we like, and can include all the functions and
methods that we've learned about, with one important exception: we should not
change the list while inside the body of the loop2.
Here's an example of a loop with a more complicated body:
1 If you're familiar with any other programming languages, you might know code blocks as things that are
surrounded with curly brackets – the indentation does the same job in Python
2 Changing the list while looping can cause Python to become confused about which elements have already
been processed and which are yet to come.
complex_loop.py
The body of the loop in the code above has four statements, two of which are
print() statements, so each time round the loop we'll get two lines of output. If
we look at the output we can see all six lines:
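A sketch of what such a loop might look like: four statements in the body, two of which are print() statements. The species names and the exact wording of the messages are assumptions.

```python
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")
```

With three elements in the list and two print() statements in the body, running this gives six lines of output.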
Why is the above approach better than printing out these six lines in six separate
statements? Well, for one thing, there's much less redundancy – here we only
needed to write two print() statements. This also means that if we need to make
a change to the code, we only have to make it once rather than three separate
times. Another benefit of using a loop here is that if we want to add some elements
to the list, we don't have to touch the loop code at all. Consequently, it doesn't
matter how many elements are in the list, and it's not a problem if we don't know
how many are going to be in it at the time when we write the code.
Many problems that can be solved with loops can also be solved using a tool called
list comprehensions – see the chapter on comprehensions in Advanced Python for
Biologists.
Indentation errors
Unfortunately, introducing tools like loops that require an indented block of code
also introduces the possibility of a new type of error – an IndentationError.
Notice what happens when the indentation of one of the lines in the block does not
match the others:
indentation_error.py
When we run this code, we get an error message before the program even starts to
run:
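We can watch this happen without crashing our own program by handing Python some badly indented code as a string. This is only a sketch: using compile() in this way is an illustrative device rather than something we'd normally do, and the exact wording of the error message depends on the Python version.

```python
# Source code for a loop whose second body line is indented differently
bad_source = (
    'apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]\n'
    'for ape in apes:\n'
    '    name_length = len(ape)\n'
    '  print(ape + " has " + str(name_length) + " letters")\n'
)

try:
    # compile() checks the code without running it
    compile(bad_source, "indentation_error.py", "exec")
    error_name = None
except IndentationError as error:
    error_name = type(error).__name__
    print(error_name + ": " + str(error))
```

Because the problem is found when Python reads the code, none of the statements in the badly indented program get a chance to run.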
treats each character in the string as a separate element. This allows us to very
easily process a string one character at a time:
name = "python"
for character in name:
print("one character is " + character)
string_as_list.py
one character is p
one character is y
one character is t
one character is h
one character is o
one character is n
The process of repeating a set of instructions for each element of a list (or
character in a string) is called iteration, and we often talk about iterating over a list
or string.
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(str(species))
split.py
We can see from the output that the string has been split wherever there was a
comma, leaving us with a list of strings:
['melanogaster', 'simulans', 'yakuba', 'ananassae']
Of course, once we've created a list in this way we can iterate over it using a loop,
just like any other list.
file = open("some_input.txt")
for line in file:
    # do something with the line
    pass
Notice that in this example we are iterating over the file object, not over the file
contents. If we iterate over the file contents like this:
file = open("some_input.txt")
contents = file.read()
for line in contents:
    # warning: line contains just a single character!
    pass
then each time round the loop we will be dealing with a single character, which is
probably not what we want. A good way to avoid this mistake is to ask yourself,
whenever you open a file, whether you want to get the contents as one big string (in
which case you should use read()) or line-by-line (in which case you should
iterate over the file object).
Another common pitfall is to iterate over the same file object twice:
file = open("some_input.txt")
If we run this code, we'll find that the second for loop never gets executed. The
reason for this is that file objects are exhaustible. Once we have iterated over a file
object, Python "remembers" that it is already at the end of the file, so when we try
to iterate over it again, there are no lines remaining to be read. One way round this
problem is to close and re-open the file each time we want to iterate over it:
A better approach is to read the lines of the file into a list, then iterate over the list
(which we can safely do multiple times). The file object readlines() method
returns a list of all the lines in a file, and we can use it like this:
readlines.py
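The sketch below shows both the pitfall and the readlines() approach. The three-line input file is made up, and it is written to a temporary folder only so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical setup: create a small three-line input file
input_path = os.path.join(tempfile.mkdtemp(), "some_input.txt")
setup_file = open(input_path, "w")
setup_file.write("ATGC\nGGCC\nTTAA\n")
setup_file.close()

# the pitfall: iterating over the same file object twice
file = open(input_path)
first_pass = 0
for line in file:
    first_pass = first_pass + 1
second_pass = 0
for line in file:
    # never reached, because the file object is already exhausted
    second_pass = second_pass + 1
file.close()
print(str(first_pass) + " lines, then " + str(second_pass) + " lines")

# the safer approach: read the lines into a list first
file = open(input_path)
all_lines = file.readlines()
file.close()

# a list can be iterated over as many times as we like
for line in all_lines:
    print("first time: " + line.rstrip("\n"))
for line in all_lines:
    print("second time: " + line.rstrip("\n"))
```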
protein = "vlspadktnv"
and we want to print out the first three residues, then the first four residues, etc.
etc.:
vls
vlsp
vlspa
vlspad
...etc...
One way to tackle the problem would be to use a loop – we could extract a substring
from the protein sequence and print it in the body of the loop, and the only thing
that would need to change is the stop position in the substring. But what are we
going to iterate over? We can't just iterate over the protein string, because that will
give us individual residues, which is not what we want. We can manually assemble a
list of stop positions, and loop over that:
stop_positions = [3,4,5,6,7,8,9,10]
for stop in stop_positions:
    substring = protein[0:stop]
    print(substring)
but this seems cumbersome, and only works if we know the length of the protein
sequence in advance.
A better solution is to use the range() function. range() is a built-in Python
function that generates sequences of numbers for us to loop over (in Python 3 it
returns a range object rather than an actual list, but we can loop over it in just
the same way). The behaviour of range() depends on how many arguments we give it.
Below are a few examples, with the output following directly after the code.
With a single argument, range will count up from zero to that number, excluding
the number itself:
range1.py
0
1
2
3
4
5
With two numbers, range will count up from the first number (inclusive1) to the
second (exclusive):
range2.py
3
4
5
6
7
With three numbers, range will count up from the first to the second with the step
size given by the third:
range3.py
2
6
10
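The three forms can be sketched together as follows. The arguments are assumptions chosen to reproduce the outputs shown above (in the last case, any stop value from 11 to 14 would give the same numbers). Wrapping each range in list() is just a convenient way to see all of its numbers at once. The final loop applies range() to the protein problem from earlier.

```python
# one argument: count from zero up to, but not including, 6
one_argument = list(range(6))
print(one_argument)

# two arguments: count from 3 (inclusive) to 8 (exclusive)
two_arguments = list(range(3, 8))
print(two_arguments)

# three arguments: count from 2 up to 14 in steps of 4
three_arguments = list(range(2, 14, 4))
print(three_arguments)

# back to the protein problem: the stop positions run from 3 to the
# length of the sequence, so we add one to make the length inclusive
protein = "vlspadktnv"
for stop in range(3, len(protein) + 1):
    substring = protein[0:stop]
    print(substring)
```

Because the stop positions are calculated from len(protein), this version works for a protein sequence of any length.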
Recap
In this chapter we've seen several tools that work together to allow our programs to
deal elegantly with multiple pieces of data. Lists let us store many elements in a
single variable, and loops let us process those elements one by one. In learning
1 The rules for ranges are the same as for array notation – inclusive on the low end, exclusive on the high end
– so you only have to memorize them once!
about loops, we've also been introduced to the block syntax and the importance of
indentation in Python.
We've also seen several useful ways in which we can use the notation we've learned
for working with lists with other types of data. Depending on the circumstances, we
can use strings, files, and ranges as if they were lists. This is a very helpful feature of
Python, because once we've become familiar with the syntax for working with lists,
we can use it in many different places. Learning about these tools has also helped us
make sense of some interesting behaviour that we observed in earlier chapters.
Lists are the first example we've encountered of structures that can hold multiple
pieces of data. We'll encounter another such structure – the dict – in chapter 8. In
fact, Python has several more such data types – you'll find a full survey of them in
the chapter on complex data structures in Advanced Python for Biologists.
Exercises
Note: all the files mentioned in these exercises can be found in the chapter_4 folder
of the exercises download.
Solutions
file = open("input.txt")
for dna in file:
    print(dna)
We can see from the output that we've forgotten to remove the newlines from the
ends of the DNA sequences – there is a blank line between each:
ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATTCGATTATAAGCATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ATTCGATTATAAGCACTATCGATGATCTAGCTACGATCGTAGCTGTA
ATTCGATTATAAGCACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA
but we'll ignore that for now. The next step is to remove the first 14 bases of each
sequence. We know that we want to take a substring from each sequence, starting
at the fifteenth character, and continuing to the end. Unfortunately, the sequences
are all different lengths, so the stop position is going to be different for all of them.
We can use the same trick we used in chapter 2 to make the substring run to the
end of the string, by simply leaving out the stop position.
Here's what the code looks like with the substring part added:
file = open("input.txt")
for dna in file:
    trimmed_dna = dna[14:]
    print(trimmed_dna)
As before, we are simply printing the trimmed DNA sequence to the screen, and
from the output we can confirm that the first 14 bases have been removed from
each sequence:
TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ACTATCGATGATCTAGCTACGATCGTAGCTGTA
ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA
Now that we know our code is working, we'll switch from printing to the screen to
writing to a file. We'll have to open the file before the loop, then write the trimmed
sequences to the file inside the loop:
file = open("input.txt")
output = open("trimmed.txt", "w")
for dna in file:
    trimmed_dna = dna[14:]
    output.write(trimmed_dna)
Opening up the trimmed.txt file, we can see that the result looks good. It didn't
matter that we never removed the newlines, because they appear in the correct
place in the output file anyway:
TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ACTATCGATGATCTAGCTACGATCGTAGCTGTA
ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA
Now the final step – printing the lengths to the screen – requires just one more line
of code. Here's the final program in full, with comments:
remove_adapter.py
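A sketch of how the complete program might look. Several things here are assumptions: the two made-up input sequences (written to a temporary folder so that the example runs on its own), the wording of the printed message, the lengths list used for checking, and the choice to strip the trailing newline before counting so that it isn't included in the length.

```python
import os
import tempfile

# Hypothetical setup: two made-up sequences that start with the adapter
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
setup_file = open(input_path, "w")
setup_file.write("ATTCGATTATAAGCTCGATCGATC\n")
setup_file.write("ATTCGATTATAAGCACTGAT\n")
setup_file.close()

# open the input file
file = open(input_path)

# open the output file
trimmed_path = os.path.join(workdir, "trimmed.txt")
output = open(trimmed_path, "w")

lengths = []
# go through the input file one line at a time
for dna in file:
    # take the substring from the fifteenth character to the end
    trimmed_dna = dna[14:]
    # the newline is still attached, so it lands in the right place
    output.write(trimmed_dna)
    # strip the newline before counting, so it isn't included
    length = len(trimmed_dna.rstrip("\n"))
    lengths.append(length)
    print("processed sequence with length " + str(length))

file.close()
output.close()
```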
exon_locations = open("exons.txt")
for line in exon_locations:
    print(line)
This gives us a loop in which we are dealing with a different exon each time round.
If we look at the output, we can see that we still have a newline at the end of each
line, but we'll not worry about that for now:
5,58
72,133
190,276
340,398
Now we have to split up each line into a start and stop position. The split()
method is probably a good choice for this job – let's see what happens when we
split each line using a comma as the delimiter:
exon_locations = open("exons.txt")
for line in exon_locations:
    positions = line.split(',')
    print(positions)
The output shows that each line, when split, turns into a list of two elements:
['5', '58\n']
['72', '133\n']
['190', '276\n']
['340', '398\n']
The second element of each list has a newline on the end, because we haven't
removed them. Let's try assigning the start and stop position to sensible variable
names, and printing them out individually:
exon_locations = open("exons.txt")
for line in exon_locations:
    positions = line.split(',')
    start = positions[0]
    stop = positions[1]
    print("start is " + start + ", stop is " + stop)
The output shows that this approach works – the start and stop variables take
different values each time round the loop:
start is 5, stop is 58
Now let's try putting these variables to use. We'll read the genomic sequence from
the file all in one go using read() – there's no need to process each line separately,
as we just want the entire contents. Then we'll use the exon coordinates to extract
one exon each time round the loop, and print it to the screen:
1  genomic_dna = open("genomic_dna.txt").read()
2  exon_locations = open("exons.txt")
3  for line in exon_locations:
4      positions = line.split(',')
5      start = positions[0]
6      stop = positions[1]
7      exon = genomic_dna[start:stop]
8      print("exon is: " + exon)
What has gone wrong? Recall that the result of using split() on a string is a list
of strings – this means that the start and stop variables in our program are
also strings (because they're just individual elements of the positions list). The
problem comes when we try to use them as numbers in line 7. Fortunately, it's
easily fixed – we just have to use the int() function to turn our strings into
numbers:
start = int(positions[0])
stop = int(positions[1])
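The sketch below reproduces the problem and the fix on a single line of input. The genomic sequence and the coordinates are made up for illustration.

```python
genomic_dna = "ATCGATCGATCGTACGCTGA"  # made-up sequence
line = "5,12\n"                      # made-up line from the exon file
positions = line.split(",")

# using the strings directly as positions fails with a TypeError
try:
    exon = genomic_dna[positions[0]:positions[1]]
except TypeError as error:
    print("slicing with strings fails: " + str(error))

# int() turns each string into a number; it ignores the trailing newline
start = int(positions[0])
stop = int(positions[1])
exon = genomic_dna[start:stop]
print("exon is: " + exon)
```

A handy side effect is that int() ignores whitespace around the number, so we don't need to remove the newline from the stop position first.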
but instead we have a single exon variable that stores one exon at a time. Here's
one way to get the complete coding sequence: before the loop starts we'll create a
new variable called coding_sequence and assign it to an empty string. Then,
each time round the loop, we'll add the current exon on to the end, and store the
result back in the same variable. When the loop has finished, the variable will
contain all the exons. This is what the code looks like (with line numbers as the
program is getting quite long):
1  genomic_dna = open("genomic_dna.txt").read()
2  exon_locations = open("exons.txt")
3  coding_sequence = ""
4  for line in exon_locations:
5      positions = line.split(',')
6      start = int(positions[0])
7      stop = int(positions[1])
8      exon = genomic_dna[start:stop]
9      coding_sequence = coding_sequence + exon
10     print("coding sequence is : " + coding_sequence)
On line 3 we create the coding_sequence variable, and on line 9, inside the loop,
we add the current exon on to the end. This is an unusual type of variable
assignment, because the coding_sequence variable is on both the left and right
side of the equals sign. The trick to understanding line 9 is to read the right-hand
side of the statement first i.e. "concatenate the current coding_sequence and the
current exon, then store the result of that concatenation in coding_sequence".
On line 10, instead of printing the exon, we're printing the coding sequence, and we
can see from the output how the coding sequence is gradually built up as we go
round the loop:
coding sequence is :
CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGA
TATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
coding sequence is :
CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGA
TATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGA
TCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
coding sequence is :
CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGA
TATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGA
TCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTACGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGTAGCTAGCTACGATCG
The final step is to save the coding sequence to a file. We can do this at the end of
the program with three lines of code. Here's the final code with comments:
write_coding_sequence.py
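A sketch of how the complete program might look. The input data is made up and written to a temporary folder so that the example runs on its own, and the output file name coding_sequence.txt is an assumption.

```python
import os
import tempfile

# Hypothetical setup: a made-up genomic sequence and three exon locations
workdir = tempfile.mkdtemp()
genomic_path = os.path.join(workdir, "genomic_dna.txt")
exons_path = os.path.join(workdir, "exons.txt")
setup_file = open(genomic_path, "w")
setup_file.write("AAAACCCCGGGGTTTTAAAACCCC")
setup_file.close()
setup_file = open(exons_path, "w")
setup_file.write("0,4\n8,12\n20,24\n")
setup_file.close()

# read the genomic sequence in one go
genomic_file = open(genomic_path)
genomic_dna = genomic_file.read()
genomic_file.close()

# open the exon locations file
exon_locations = open(exons_path)

# create a variable to hold the coding sequence
coding_sequence = ""

# go through each line in the exon locations file
for line in exon_locations:
    # split the line using a comma and turn the positions into numbers
    positions = line.split(",")
    start = int(positions[0])
    stop = int(positions[1])
    # extract the exon and add it to the end of the coding sequence
    exon = genomic_dna[start:stop]
    coding_sequence = coding_sequence + exon
exon_locations.close()

# write the coding sequence to an output file (three lines of code)
output_path = os.path.join(workdir, "coding_sequence.txt")
output = open(output_path, "w")
output.write(coding_sequence)
output.close()
```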
106 Chapter 5: Writing our own functions
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT"
length = len(my_dna)
a_count = my_dna.count('A')
t_count = my_dna.count('T')
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))
If we discount the first line (whose job is to store the input sequence) and the last
line (whose job is to print the result), we can see that it takes four lines of code to
calculate the AT content1. This means that every place in our code where we want
to calculate the AT content of a sequence, we need these same four lines – and we
have to make sure we copy them exactly, without any mistakes.
It would be much simpler if Python had a built-in function (let's call it
get_at_content()) for calculating AT content. If that were the case, then we
could just run get_at_content() in the same way we run print(), or len(),
or open(). Although, sadly, Python does not have such a built-in function, it does
have the next best thing – a way for us to create our own functions.
There are some obvious benefits to creating our own function to carry out a
particular job. For example, it allows us to re-use the same code many times within
a program without having to copy it out each time. We can even re-use code across
multiple different programs.
Putting code into functions has other benefits that are not so obvious. Functions
allow us to split up our code into logical units, which makes it possible to work on
different bits of the code independently. This kind of logical separation is what
makes it possible to work on very large programs without getting confused. In this
chapter, we'll take a look at the basics of writing functions and see how to make
them useful and flexible. At the end of the chapter, we'll also talk about how to
write functions in a way that maximises your ability to deal with complex
problems.
Defining a function
Let's go ahead and create our get_at_content() function. Before we start
typing, we need to figure out what the inputs (the number and types of the function
arguments) and outputs (the type of the return value) are going to be. For this
function, that seems pretty obvious – the input is going to be a single DNA
sequence, and the output is going to be a decimal number. To translate these into
Python terms: the function will take a single argument of type string, and will
return a value of type number1. Here's the code:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
define_function.py
1 In fact, we can be a little bit more specific: we can say that the return value will be of type float – a
floating point number (i.e. one with a decimal point).
Reminder: if you're using Python 2 rather than Python 3, include this line at the
top of your program:
from __future__ import division
The first line of the function definition contains several different elements. We
start with the word def, which is short for define (writing a function is called
defining it). Following that we write the name of the function, followed by the
names of the argument variables in parentheses. Just like we saw before with
normal variables, the function name and the argument names are arbitrary – we
can use whatever we like.
The first line ends with a colon, just like the first line of the loops that we were
looking at in the previous chapter. And just like loops, this line is followed by a
block of indented lines – the function body. The function body can have as many
lines of code as we like, as long as they all have the same indentation. Within the
function body, we can refer to the arguments by using the variable names from the
first line. In this case, the variable dna refers to the sequence that was passed in as
the argument to the function.
The last line of the function causes it to return the AT content that was calculated
in the function body. To return from a function, we simply write return followed
by the value that the function should output.
There are a couple of important things to be aware of when writing functions.
Firstly, we need to make a clear distinction between defining a function, and
running it (we refer to running a function as calling it). The code we've written
above will not cause anything to happen when we run it, because we've not actually
asked Python to execute the get_at_content() function – we have simply
defined what it is. The code in the function will not be executed until we call the
function like this:
get_at_content("ATGACTGGACCA")
If we simply call the function like that, however, then the AT content will vanish
once it's been calculated. In order to use the function to do something useful, we
must either store the result in a variable:
at_content = get_at_content("ATGACTGGACCA")
Or use it directly:
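As a small sketch of both options, using the function defined above. The wording of the printed message is an assumption.

```python
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

# option one: store the result in a variable for later use
at_content = get_at_content("ATGACTGGACCA")

# option two: use the result directly, without storing it
print("AT content is " + str(get_at_content("ATGACTGGACCA")))
```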
Secondly, it's important to understand that the argument variable dna does not
hold any particular value when the function is defined 1. Instead, its job is to hold
whatever value is given as the argument when the function is called. In this way it's
analogous to the loop variables we saw in the previous chapter: loop variables hold
a different value each time round the loop, and function argument variables hold a
different value each time the function is called.
Finally, be aware that any variables that we create as part of the function only exist
inside the function, and cannot be accessed outside. If we try to use a variable that's
created inside the function from outside:
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content
print(a_count)
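Python responds to code like this with a NameError, because a_count does not exist outside the function. The sketch below catches the error so that we can see it without the program stopping. The try and except lines and the outcome variable are only there for the demonstration.

```python
def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

# calling the function creates a_count, but only inside the function
get_at_content("ATGACTGGACCA")

try:
    print(a_count)
    outcome = "no error"
except NameError as error:
    outcome = "NameError"
    print("NameError: " + str(error))
```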
1 Indeed, it doesn't actually exist when it's defined, only when it runs.
1  def get_at_content(dna):
2      length = len(dna)
3      a_count = dna.count('A')
4      t_count = dna.count('T')
5      at_content = (a_count + t_count) / length
6      return at_content
7
8  my_at_content = get_at_content("ATGCGCGATCGATCGAATCG")
9  print(str(my_at_content))
10 print(get_at_content("ATGCATGCAACTGTAGC"))
11 print(get_at_content("aactgtagctagctagcagcgta"))
calling_function.py
Looking at the output, we can see that the first function call works fine – the AT
content is calculated to be 0.45, is stored in the variable my_at_content, then
printed. However, the output for the next two calls is not so great. The call at line
10 produces a number with way too many figures after the decimal point, and the
call at line 11, with the input sequence in lower case, gives a result of 0.0, which is
definitely not correct:
0.45
0.5294117647058824
0.0
def get_at_content(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, 2)

my_at_content = get_at_content("ATGCGCGATCGATCGAATCG")
print(str(my_at_content))
print(get_at_content("ATGCATGCAACTGTAGC"))
print(get_at_content("aactgtagctagctagcagcgta"))
improved_function.py
0.45
0.53
0.52
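We can make the function even better though: why not allow it to be called with
the number of significant figures as an argument¹? In the above code, we've picked
two significant figures, but there might be situations where we want to see more.
(Strictly speaking, round() counts digits after the decimal point rather than
significant figures; for values of at least 0.1 and below 1 the two coincide.)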
1 An even better solution would be to specify the number of significant figures in the string representation of
the number when it's printed.
Adding the second argument is easy; we just add it to the argument variable list on
the first line of the function definition, and then use the new argument variable in
the call to round(). We'll throw in a few calls to the new version of the function
with different arguments to check that it works:
test_dna = "ATGCATGCAACTGTAGC"
print(get_at_content(test_dna, 1))
print(get_at_content(test_dna, 2))
print(get_at_content(test_dna, 3))
two_arguments.py
0.5
0.53
0.529
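A version of the function that would produce the output above might look like this. The argument name sig_figs matches the name used later in this chapter; the rest of the body is assumed to be unchanged from the improved function.

```python
def get_at_content(dna, sig_figs):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    # the second argument is passed straight through to round()
    return round(at_content, sig_figs)

test_dna = "ATGCATGCAACTGTAGC"
print(get_at_content(test_dna, 1))
print(get_at_content(test_dna, 2))
print(get_at_content(test_dna, 3))
```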
I've written that last sentence in bold, because it's incredibly important. It's no
exaggeration to say that understanding the implications of that sentence is the key
to being able to write larger, useful programs. The reason it's so important is that it
describes a programming phenomenon that we call encapsulation. Encapsulation
just means dividing up a complex program into little bits which we can work on
independently. In the example above, the code is divided into two parts – the part
where we define the function, and the part where we use it – and we can make
changes to one part without worrying about the effects on the other part.
This is a very powerful idea, because without it, the size of programs we can write is
limited to the number of lines of code we can hold in our head at one time. Some of
the example code in the solutions to exercises in the previous chapter was starting
to push at this limit already, even for relatively simple problems. By contrast, using
functions allows us to build up a complex program from small building blocks, each
of which individually is small enough to understand in its entirety.
Because using functions is so important, future solutions to exercises will use them
when appropriate, even when it's not explicitly mentioned in the problem text. I
encourage you to get into the habit of using functions in your solutions too.
def get_a_number():
    return 42
but such functions tend not to be very useful. For example, we can write a version
of get_at_content() that doesn't require any arguments by setting the value of
the dna variable inside the function:
def get_at_content():
    dna = "ACTGATGCTAGCTA"
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, 2)
but that's obviously not very useful, since it will always operate on the same DNA
sequence. Occasionally you may be tempted to write a no-argument function that
works like this:
1 def get_at_content():
2     length = len(dna)
3     a_count = dna.upper().count('A')
4     t_count = dna.upper().count('T')
5     at_content = (a_count + t_count) / length
6     return round(at_content, 2)
7
8 dna = "ACTGATCGATCG"
9 print(get_at_content())
At first this seems like a good idea – it works because the function gets the value of
the dna variable that is set on line 8¹. However, this breaks the encapsulation that
we worked so hard to achieve. The function now only works if there is a variable
called dna set in the bit of the code where the function is called, so the two pieces
of code are no longer independent.
If you find yourself writing code like this, it's usually a good idea to identify which
variables from outside the function are being used inside it, and turn them into
arguments.
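As a minimal sketch of that advice, here is the same calculation with the outside variable turned back into an argument. The variable name my_dna is just an illustration.

```python
# the sequence is passed in explicitly rather than read from outside
def get_at_content(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, 2)

# the calling code can now use any variable name it likes
my_dna = "ACTGATCGATCG"
print(get_at_content(my_dna))
```

The function no longer cares what the calling code names its variables, so the two pieces of code are independent again.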
1 It doesn't matter that the variable is set after the function is defined – all that matters is that it's set before
the function is called on line 9.
def print_at_content(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    print(str(round(at_content, 2)))
When you first start writing functions, it's very tempting to do this kind of thing.
You think "OK, I need to calculate and print the AT content – I'll write a function that
does both". The trouble with this approach is that it results in a function that is less
flexible. Right now you want to print the AT content to the screen, but what if you
later discover that you want to write it to a file, or use it as part of some other
calculation? You'll have to write more functions to carry out these tasks.
The key to designing flexible functions is to recognize that the job "calculate and
print the AT content" is actually two separate jobs – calculating the AT content, and
printing it. Try to write your functions in such a way that they just do one job. You
can then easily write code to carry out more complicated jobs by using your simple
functions as building blocks.
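To make the building-block idea concrete, here is a sketch in which a single calculating function is reused for three different jobs. The output filename, the use of a temporary folder and the gc_content calculation are assumptions chosen for illustration.

```python
import os
import tempfile

def get_at_content(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, 2)

my_dna = "ATGCGCGATCGATCGAATCG"

# job 1: print the result to the screen
print(get_at_content(my_dna))

# job 2: write the result to a file (a scratch location is used here)
output_path = os.path.join(tempfile.gettempdir(), "at_content.txt")
output = open(output_path, "w")
output.write(str(get_at_content(my_dna)) + "\n")
output.close()

# job 3: use the result in another calculation
# (this treats every base that isn't A or T as G or C)
gc_content = 1 - get_at_content(my_dna)
```

None of these three uses required any change to the function itself.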
we need to know that the DNA sequence comes first, followed by the number of
significant figures.
There's a feature in Python called keyword arguments which allows us to call
functions in a slightly different way. Instead of giving a list of arguments in
parentheses:
get_at_content("ATCGTGACTCG", 2)
we can supply a list of argument variable names and values joined by equals signs:
get_at_content(dna="ATCGTGACTCG", sig_figs=2)
This style of calling functions¹ has several advantages. It doesn't rely on the order
of arguments, so we can use whichever order we prefer. These two statements
behave identically:
get_at_content(dna="ATCGTGACTCG", sig_figs=2)
get_at_content(sig_figs=2, dna="ATCGTGACTCG")
It's also clearer to read what's happening when the argument names are given
explicitly.
We can even mix and match the two styles of calling – the following are all
identical:
get_at_content("ATCGTGACTCG", 2)
get_at_content(dna="ATCGTGACTCG", sig_figs=2)
get_at_content("ATCGTGACTCG", sig_figs=2)
keyword_arguments.py
1 It works with many methods too, although some built-in ones (for example the string method count()) only accept their arguments by position.
However, we're not allowed to start off with keyword arguments and then switch back
to positional ones – this will cause a SyntaxError:
get_at_content(dna="ATCGTGACTCG", 2)
Keyword arguments can be particularly useful for functions and methods that have
a lot of arguments, and we'll use them where appropriate in the examples and
exercise solutions in the rest of this book.
The only change that we've made to the code is to add =2 after the sig_figs
variable in the definition line. Now we have the best of both worlds. If the function
is called with two arguments, it will use the number of significant figures specified;
if it's called with one argument, it will use the default value of two significant
figures. Let's see some examples:
print(get_at_content("ATCGTGACTCG"))
print(get_at_content("ATCGTGACTCG", 3))
print(get_at_content("ATCGTGACTCG", sig_figs=4))
default_argument_values.py
The function takes care of filling in the default value for sig_figs for the first
function call where none is supplied:
0.45
0.455
0.4545
Function argument defaults allow us to write very flexible functions which can
have varying numbers of arguments. It only makes sense to use them for arguments
where a sensible default can be chosen – there's no point specifying a default for
the dna argument in our example. They are particularly useful for functions where
some of the options are only going to be used infrequently.
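A minimal sketch of a definition with a default value, assuming the function body is otherwise the same as the two-argument version:

```python
def get_at_content(dna, sig_figs=2):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

print(get_at_content("ATCGTGACTCG"))              # default of 2 is filled in
print(get_at_content("ATCGTGACTCG", 3))           # positional second argument
print(get_at_content("ATCGTGACTCG", sig_figs=4))  # keyword second argument
```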
Testing functions
When writing code of any type, it's important to periodically check that your code
does what you intend it to do. If you look back over the solutions to exercises from
the first few chapters, you can see that we generally test our code at each step by
printing some output to the screen and checking that it looks OK. For example, in
chapter 2 when we were first calculating AT content, we used a very short test
sequence to verify that our code worked before running it on the real input.
The reason we used a test sequence was that, because it was so short, we could
easily work out the answer by eye and compare it to the answer given by our code.
This idea – running code on a test input and comparing the result to an answer
that we know to be correct¹ – is such a useful one that Python has a built-in tool
for expressing it: assert. An assertion consists of the word assert, followed by a
call to our function, then two equals signs, then the result that we expect¹.
For example, we know that if we run our get_at_content() function on the DNA
sequence "ATGC" we should get an answer of 0.5. This assertion will test whether
that's the case:
Notice the two equals signs – we'll learn the reason behind that in the next chapter.
The way that assertion statements work is very simple; if an assertion turns out to
be false (i.e. if Python executes our function on the input "ATGC" and the answer
isn't 0.5) then the program will stop and we will get an AssertionError.
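Here is a short sketch of both outcomes. The first assertion is the one described above; the second uses a deliberately wrong expected value, and is wrapped in try/except (an addition for illustration) so that we can see the failure without the program stopping.

```python
def get_at_content(dna, sig_figs=2):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

# this assertion is true, so nothing happens and the program carries on
assert get_at_content("ATGC") == 0.5

# this assertion is false, so Python raises an AssertionError
try:
    assert get_at_content("ATGC") == 0.75
    failed = False
except AssertionError:
    failed = True
    print("the assertion failed")
```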
Assertions are useful in a number of ways. They provide a means for us to check
whether our functions are working as intended and therefore help us track down
errors in our programs. If we get some unexpected output from a program that uses
a particular function, and the assertion tests for that function all pass, then we can
be confident that the error doesn't lie in the function but in the code that calls it.
They also let us modify a function and check that we haven't introduced any errors.
If we have a function that passes a series of assertion tests, and we make some
changes to it, we can re-run the assertion tests and, assuming they all pass, be
confident that we haven't broken the function².
Assertions are also useful as a form of documentation. By including a collection of
assertion tests alongside a function, we can show exactly what output is expected
from a given input.
1 In fact, assertions can include any conditional statement; we'll learn about those in the next chapter.
2 This idea is very similar to a process in software development called regression testing.
Finally, we can use assertions to test the behaviour of our function for unusual
inputs. For example, what is the expected behaviour of get_at_content() when
given a DNA sequence that includes unknown bases (usually represented as N)? A
sensible way to handle unknown bases would be to exclude them from the AT
content calculation – in other words, the AT content for a given sequence shouldn't
be affected by adding a bunch of unknown bases. We can write an assertion that
expresses this rule:
This assertion fails for the current version of get_at_content(). However, we can
easily modify the function to remove all N characters before carrying out the
calculation:
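One way this might look is sketched below, together with an assertion expressing the rule that unknown bases shouldn't affect the answer. Converting to upper case before removing the N characters (so that lower case n is handled too) is an assumption of this sketch.

```python
def get_at_content(dna, sig_figs=2):
    # remove unknown bases before doing any counting
    dna = dna.upper().replace('N', '')
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)

# adding a bunch of unknown bases should not change the AT content
assert get_at_content("ATGCNNNNNNNNNN") == 0.5
```

Note that a sequence made up entirely of unknown bases would now have a length of zero after the replacement, so this version would stop with a ZeroDivisionError for that input.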
assert get_at_content("A") == 1
assert get_at_content("G") == 0
assert get_at_content("ATGC") == 0.5
assert get_at_content("AGG") == 0.33
assert get_at_content("AGG", 1) == 0.3
assert get_at_content("AGG", 5) == 0.33333
test_function.py
Recap
In this chapter, we've seen how packaging up code into functions helps us to
manage the complexity of large programs and promote code reuse. We learned how
to define and call our own functions along with various new ways to supply
arguments to functions. We also looked at a couple of things that are possible in
Python, but rarely advisable – writing functions without arguments or return
values. Finally, we explored the use of assertions to test our functions, and
discussed how we can use them to catch errors before they become a problem.
This chapter has covered the basics of writing and using functions, but there's
much more we can do with them – in fact, there's a whole style of programming
(functional programming) which revolves around the manipulation of functions.
You'll find a discussion of this in the chapter in Advanced Python for Biologists
called, unsurprisingly, functional programming.
The remaining chapters in this book will make use of functions in both the
examples and the exercise solutions, so make sure you are comfortable with the
new ideas from this chapter before moving on.
Exercises
Reminder: if you're using Python 2 rather than Python 3, include this line at the
top of your program:
from __future__ import division
Solutions
protein = "MSRSLLLRFLLFLLLLPPLP"
aa = "R"
aa_count = protein.count(aa)
protein_length = len(protein)
percentage = aa_count * 100 / protein_length
print(percentage)
Now we'll make this code into a function by turning the two variables protein and
aa into arguments, and returning the percentage rather than printing it. We'll add
in the assertions at the end of the program to test if the function is doing its job:
Running the code shows that one of the assertions is failing – the error message
tells us which assertion is the failed one:
Our function fails to work when the protein sequence is in upper case, but the
amino acid residue code is in lower case. Looking at the assertions, we can make an
educated guess that the next one (with the protein in lower case and the amino
acid in upper case) is probably going to fail as well. Let's try to fix both of these
problems by converting both the protein and the amino acid string to upper case at
the start of the function. We'll use the same trick as we did before of converting a
string to upper case and then storing the result back in the same variable:
    aa_count = protein.count(aa)
    protein_length = len(protein)
    percentage = aa_count * 100 / protein_length
    return percentage
amino_acids1.py
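Putting the pieces described above together, a complete version of the function might look like the sketch below. The function name get_aa_percentage and the particular assertions are assumptions for illustration; the body follows the steps described in the text.

```python
def get_aa_percentage(protein, aa):
    # convert both inputs to upper case so that case differences don't matter
    protein = protein.upper()
    aa = aa.upper()
    aa_count = protein.count(aa)
    protein_length = len(protein)
    percentage = aa_count * 100 / protein_length
    return percentage

# a few checks covering the different combinations of case
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert get_aa_percentage("msrslllrfllfllllpplp", "L") == 50
assert get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", "Y") == 0
```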
times they occur in the protein sequence, to get a total count. Or, we can treat the
protein sequence string as a list (as described in the previous chapter) and ask, for
each position, whether the character at that position is a member of the list of
amino acid residues. We'll use the first method here; in the next chapter we'll learn
about the tools necessary to implement the second.
We'll need some way to keep a running total of matching amino acids as we go
round the loop, so we'll create a new variable outside the loop and update it each
time round. The code inside the loop will be quite similar to that from the previous
exercise. Here's the code with some print() statements so we can see exactly
what is happening:
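protein = "MSRSLLLRFLLFLLLLPPLP"
aa_list = ['M', 'L', 'F']
# the total variable will hold the total number of matching residues
total = 0
for aa in aa_list:
    print("counting number of " + aa)
    aa = aa.upper()
    aa_count = protein.count(aa)
    # add the count for this residue to the running total
    total = total + aa_count
    print("running total is " + str(total))
percentage = total * 100 / len(protein)
print("final percentage is " + str(percentage))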
When we run the code, we can see how the running total increases each time round
the loop:
counting number of M
running total is 1
counting number of L
running total is 11
counting number of F
running total is 13
final percentage is 65.0
Now let's take the code and, just like before, turn the protein string and the amino
acid list into arguments to create a function:
This function passes all the assertion tests except the last one, which tests the
behaviour when run with only one argument. In fact, Python never even gets as far
as testing the result from running the function, as we get an error indicating that
the function didn't complete:
Fixing the error takes only one change: we add a default value for aa_list in the
first line of the function definition:
    protein = protein.upper()
    protein_length = len(protein)
    total = 0
    for aa in aa_list:
        aa = aa.upper()
        aa_count = protein.count(aa)
        total = total + aa_count
    percentage = total * 100 / protein_length
    return percentage
amino_acids2.py
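For completeness, here is a sketch of the whole function including a definition line with a default value. The function name and the residues in the default list are placeholders (a set of hydrophobic residues picked for illustration); substitute whichever default the exercise calls for.

```python
# the default list below is an assumption, not necessarily the exercise's list
def get_aa_percentage(protein, aa_list=['A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V']):
    protein = protein.upper()
    protein_length = len(protein)
    total = 0
    for aa in aa_list:
        aa = aa.upper()
        aa_count = protein.count(aa)
        total = total + aa_count
    percentage = total * 100 / protein_length
    return percentage

# two arguments: only the residues we ask for are counted
print(get_aa_percentage("MSRSLLLRFLLFLLLLPPLP", ['M', 'L', 'F']))

# one argument: the default list is used instead of causing a TypeError
print(get_aa_percentage("MSRSLLLRFLLFLLLLPPLP"))
```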
6: Conditional tests
1 print(3 == 5)
2 print(3 > 5)
3 print(3 <= 5)
4 print(len("ATGC") > 5)
5 print("GAATTC".count("T") > 1)
6 print("ATGCTT".startswith("ATG"))
7 print("ATGCTT".endswith("TTT"))
8 print("ATGCTT".isupper())
9 print("ATGCTT".islower())
10 print("V" in ["V", "W", "L"])
print_conditions.py
If we look at the output, we can use the line numbers to match up each
condition with its result:
1 False
2 False
3 True
4 False
5 True
6 True
7 False
8 True
9 False
10 True
But what's actually being printed here? At first glance, it looks like we're printing
the strings "True" and "False", but those strings don't appear anywhere in our code.
What is actually being printed is the special built-in values that Python uses to
represent true and false – they are capitalized so that we know they're these special
values.
We can show that these values are special by trying to print them. The following
code runs without errors (note the absence of quotation marks):
130 Chapter 6: Conditional tests
print(True)
print(False)
whereas trying to print an arbitrary unquoted word:
print(Hello)
causes a NameError.
There's a wide range of things that we can include in conditions, and it would be
impossible to give an exhaustive list here. The basic building blocks are:
• equals (represented by ==)
• greater and less than (represented by > and <)
• greater and less than or equal to (represented by >= and <=)
• not equal (represented by !=)
• is a value in a list (represented by in)
• are two objects the same¹ (represented by is)
Many data types also provide methods that return True or False values, which are
often a lot more convenient to use than the building blocks above. We've already
seen a few in the code sample above: for example, strings have a startswith()
method that returns True if the string starts with the string given as an argument.
We'll mention these true/false methods when they come up.
Notice that the test for equality is two equals signs, not one. Forgetting the second
equals sign will cause an error.
Now that we know how to express tests as conditions, let's see what we can do with
them.
1 A discussion of what this actually means in Python is beyond the scope of this book, so we'll avoid using
this comparison for the chapter.
if statements
The simplest kind of conditional statement is an if statement. Hopefully the syntax
is fairly simple to understand:
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")
We write the word if, followed by a condition, and end the first line with a colon.
There follows a block of indented lines of code (the body of the if statement), which
will only be executed if the condition is true. This colon-plus-block pattern should
be familiar to you from the chapters on loops and functions.
Most of the time, we want to use an if statement to test a property of some
variable whose value we don't know at the time when we are writing the program.
The example above is obviously useless, as the value of the expression_level
variable is not going to change!
Here's a slightly more interesting example: we'll define a list of gene accession
names and print out just the ones that start with "a":
print_accessions.py
ab56
ay93
ap97
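A program that produces this output might look like the sketch below. The contents of the list are an assumption, chosen so that three of the names start with "a".

```python
# a made-up list of accession names for illustration
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
    # the print() line only runs when the condition is true
    if accession.startswith('a'):
        print(accession)
```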
If you take a close look at the code above, you'll see something interesting – the
lines of code inside the loop are indented (just as we've seen before), but the line of
code inside the if statement is indented twice – once for the loop, and once for
the if. This is the first time we've seen multiple levels of indentation, but it's very
common once we start working with larger programs – whenever we have one loop
or if statement nested inside another, we'll have this type of indentation.
Python is quite happy to have as many levels of indentation as needed, but you'll
need to keep careful track of which lines of code belong at which level. If you find
yourself writing a piece of code that requires more than three levels of indentation,
it's generally an indication that that piece of code should be turned into a function.
else statements
Closely related to the if statement is the else statement. The examples above use
a yes/no type of decision-making: should we print the gene accession number or
not? Often we need an either/or type of decision, where we have two possible
actions to take. To do this, we can add on an else clause after the end of the body
of an if statement:
expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")
else:
    print("gene is lowly expressed")
The else statement doesn't have any condition of its own – rather, the else
statement body is executed when the condition of the if statement to which it's
attached is false.
Here's an example which uses if and else to split up a list of accession names into
two different files – accessions that start with "a" go into the first file, and all other
accessions go into the second file:
write_accessions.py
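A sketch of such a program is shown below. The accession list, the file names and the use of a temporary folder for the output are all assumptions made so that the example is safe to run anywhere.

```python
import os
import tempfile

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names
out_dir = tempfile.mkdtemp()  # scratch folder so nothing gets overwritten

file1 = open(os.path.join(out_dir, "one.txt"), "w")
file2 = open(os.path.join(out_dir, "two.txt"), "w")
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    else:
        file2.write(accession + "\n")
file1.close()
file2.close()
```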
Notice how there are multiple indentation levels as before, but that the if and
else statements are at the same level.
elif statements
What if we have more than two possible branches? For example, say we want three
files of accession names: ones that start with "a", ones that start with "b", and all
others. We could have a second if statement nested inside the else clause of the
first if statement:
write_accessions_nested.py
This works, but is difficult to read – we can quickly see that we need an extra level
of indentation for every additional choice we want to include. To get round this,
Python has an elif statement, which merges together else and if and allows us
to rewrite the above example in a much more elegant way:
write_accessions_elif.py
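Here is a sketch of the three-way split written with elif. As before, the accession list, the file names and the temporary output folder are assumptions for illustration.

```python
import os
import tempfile

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names
out_dir = tempfile.mkdtemp()  # scratch folder so nothing gets overwritten

file1 = open(os.path.join(out_dir, "one.txt"), "w")
file2 = open(os.path.join(out_dir, "two.txt"), "w")
file3 = open(os.path.join(out_dir, "three.txt"), "w")
for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    else:
        file3.write(accession + "\n")
file1.close()
file2.close()
file3.close()
```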
Notice how this version of the code only needs two levels of indentation. In fact,
using elif we can have any number of branches and still only require a single
extra level of indentation:
Note the order of the statements in the example above; we always start with an if
and end with an else, and all the elif statements go in the middle. This kind of
if/elif/else structure is very useful when we have several mutually-exclusive
options. In the example above, only one branch can be true for each accession
number – a string can't start with both "a" and "b". If we have a situation where the
branches are not mutually exclusive – i.e. where more than one branch can be
taken – then we simply need a series of if statements:
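A small sketch of the non-exclusive case, using a made-up accession list and two made-up tests. Because each if statement is independent, a single accession can trigger both of them.

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names
notes = []
for accession in accs:
    # both tests are always carried out, so an accession can match both
    if accession.startswith('a'):
        notes.append(accession + " starts with a")
    if accession.endswith('6'):
        notes.append(accession + " ends with 6")

for note in notes:
    print(note)
```

With this list, ab56 appears twice in the output, once for each test that it passes.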
while loops
Here's one final thing we can do with conditions: use them to determine when to
exit a loop. In chapter 4 we learned about loops that iterate over a collection of
items (like a list, a string or a file). Python has another type of loop called a while
loop. Rather than running a set number of times, a while loop runs until some
condition is met. For example, here's a bit of code that increments a count variable
by one each time round the loop, stopping when the count variable reaches ten:
count = 0
while count < 10:
    print(count)
    count = count + 1
Because normal loops in Python are so powerful¹, while loops are used much less
frequently than in other languages, so we won't discuss them further.
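To illustrate the point, here is the counting example written both ways. The second version uses the built-in range() function with an ordinary for loop and prints the same numbers; the variable name number is just an illustration.

```python
# while loop version: we have to manage the count variable ourselves
count = 0
while count < 10:
    print(count)
    count = count + 1

# for loop version: range(10) supplies the numbers 0 to 9 for us
for number in range(10):
    print(number)
```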
but this brings in an extra, unneeded level of indentation. A better way is to join up
the two conditions with and to make a complex expression:
accessions_and.py
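A sketch of two conditions joined with and. The accession list and the particular pair of conditions (starts with "a", ends with "3") are assumptions for illustration, and the matches list is only there so that we can inspect the result afterwards.

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names
matches = []
for accession in accs:
    # both simple conditions must be true for the body to run
    if accession.startswith('a') and accession.endswith('3'):
        print(accession)
        matches.append(accession)
```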
This version is better in two ways: it doesn't require the extra level of indentation,
and the condition reads in a very natural way. We can also use or to join up two
conditions, to produce a complex condition that will be true if either of the two
simple conditions is true:
1 E.g. the example code here could be better accomplished with a range.
accessions_or.py
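A sketch of two conditions joined with or, again using a made-up accession list and made-up conditions (starts with "a", starts with "b"):

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names
matches = []
for accession in accs:
    # the body runs if at least one of the simple conditions is true
    if accession.startswith('a') or accession.startswith('b'):
        print(accession)
        matches.append(accession)
```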
We can even join up complex conditions to make more complex conditions – here's
an example which prints accessions if they start with either "a" or "b" and end with
"4":
accessions_complex.py
(X or Y) and Z
X and (Y or Z)
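The effect of the parentheses can be seen in the sketch below, which runs the same three simple conditions with and without them. The accession list is made up. Python evaluates and before or, so the version without parentheses behaves like X or (Y and Z).

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names

with_parentheses = []
without_parentheses = []
for accession in accs:
    # (starts with a OR starts with b) AND ends with 4
    if (accession.startswith('a') or accession.startswith('b')) and accession.endswith('4'):
        with_parentheses.append(accession)
    # starts with a OR (starts with b AND ends with 4)
    if accession.startswith('a') or accession.startswith('b') and accession.endswith('4'):
        without_parentheses.append(accession)

print(with_parentheses)
print(without_parentheses)
```

Adding the parentheses explicitly, even where they aren't strictly needed, makes the intended grouping obvious to anyone reading the code.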
Finally, we can negate any type of condition by prefixing it with the word not. This
example will print out accessions that start with "a" and don't end with "6":
accessions_not.py
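A sketch of a negated condition, using the same made-up accession list as an assumption:

```python
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']  # made-up names
matches = []
for accession in accs:
    # not flips the result of the endswith() test
    if accession.startswith('a') and not accession.endswith('6'):
        print(accession)
        matches.append(accession)
```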
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    if at_content > 0.65:
        return True
    else:
        return False
boolean_function.py
print(is_at_rich("ATTATCTACTA"))
print(is_at_rich("CGGCAGCGCT"))
The output shows that the function returns True or False just like the other
conditions we've been looking at:
True
False
if is_at_rich(my_dna):
    # do something with the sequence
Because the last four lines of our function are devoted to evaluating a condition
and returning True or False, we can write a slightly more compact version. In this
example we evaluate the condition, and then return the result right away:
def is_at_rich(dna):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return at_content > 0.65
This is a little more concise, and also easier to read once you're familiar with the
idiom.
Recap
In this chapter, we've dealt with two things: conditions, and the statements that
use them.
We've seen how simple conditions can be joined together to make more complex
ones, and how the concepts of truth and falsehood are built in to Python on a
fundamental level. We've also seen how we can incorporate True and False in our
own functions in a way that allows them to be used as part of conditions.
We've been introduced to four different tools that use conditions – if, else, elif,
and while – in approximate order of usefulness. You'll probably find, in the
programs that you write and in your solutions to the exercises in this book, that
you use if and else very frequently, elif occasionally, and while almost never.
Exercises
In the chapter_6 folder in the exercises download, you'll find a text file called
data.csv, containing some made-up data for a number of genes. Each line contains
the following fields for a single gene in this order: species name, sequence, gene
name, expression level. The fields are separated by commas (hence the name of the
file – csv stands for Comma Separated Values). Think of it as a representation of a
table in a spreadsheet – each line is a row, and each field in a line is a column. All
the exercises for this chapter use the data read from this file.
Reminder: if you're using Python 2 rather than Python 3, include this line at the
top of your programs:
from __future__ import division
Several species
Print out the gene names for all genes belonging to Drosophila melanogaster or
Drosophila simulans.
Length range
Print out the gene names for all genes between 90 and 110 bases long.
AT content
Print out the gene names for all genes whose AT content is less than 0.5 and whose
expression level is greater than 200.
Complex condition
Print out the gene names for all genes whose name begins with "k" or "h" except
those belonging to Drosophila melanogaster.
Solutions
Several species
These exercises are somewhat more complicated than previous ones, and they're
going to require material from multiple different chapters to solve. The first
problem is to deal with the format of the data file. Open it up in a text editor and
take a look before continuing.
We know that we're going to have to open the file (chapter 3) and process the
contents line-by-line (chapter 4). To deal with each line, we'll have to split it to
make a list of columns (chapter 4), then apply the condition (this chapter) in order
to figure out whether or not we should print it. Here's a program that will read each
line from the file, split it using commas as the delimiter, then assign each of the
four columns to a variable and print the gene name:
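data = open("data.csv")
for line in data:
    columns = line.rstrip("\n").split(",")
    species = columns[0]
    sequence = columns[1]
    name = columns[2]
    expression = columns[3]
    print(name)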
Notice that we use rstrip() to remove the newline from the end of the current
line before splitting it. We know the order of the fields in the line because they were
mentioned in the exercise description, so we can easily assign them to the four
variables. This program doesn't do anything useful, but we can check the output to
confirm that it gets the names right:
kdy647
jdg766
kdy533
hdt739
hdu045
teg436
Now we can add in the condition. We want to print the name if the species is either
Drosophila melanogaster or Drosophila simulans. If the species name is neither of
those two, then we don't want to do anything. This is a yes/no type decision, so we
need an if statement:
data = open("data.csv")
for line in data:
    columns = line.rstrip("\n").split(",")
    species = columns[0]
    sequence = columns[1]
    name = columns[2]
    expression = columns[3]
    if species == "Drosophila melanogaster" or species == "Drosophila simulans":
        print(name)
several_species.py
The line containing the if statement is quite long and may wrap onto a second line
when displayed, but it's still just a single line in the program file. We can check the
output we get:
kdy647
jdg766
kdy533
against the contents of the file, and confirm that the program is working.
Length range
We can re-use a large part of the code from the previous exercise to help solve this
one. We have another complex condition: we only want to print names for genes
whose length is between 90 and 110 bases – in other words, genes whose length is
greater than 90 and less than 110. We'll have to calculate the length using the
len() function. Once we've done that the rest of the program is quite
straightforward:
data = open("data.csv")
for line in data:
    columns = line.rstrip("\n").split(",")
    species = columns[0]
    sequence = columns[1]
    name = columns[2]
    expression = columns[3]
    if len(sequence) > 90 and len(sequence) < 110:
        print(name)
length_range.py
AT content
This exercise has a complex condition like the others, but it also requires us to do a
bit more calculation – we need to be able to calculate the AT content of each
sequence. Rather than starting from scratch, we'll simply use the function that we
wrote in the previous chapter and include it at the start of the program. Once we've
done that, it's just a case of using the output from get_at_content() as part of
the condition. We must be careful to convert the fourth column – the expression
level – into an integer so that it can be properly compared:
data = open("data.csv")
for line in data:
    columns = line.rstrip("\n").split(",")
    species = columns[0]
    sequence = columns[1]
    name = columns[2]
    expression = int(columns[3])
    if get_at_content(sequence) < 0.5 and expression > 200:
        print(name)
at_content.py
Complex condition
There are no calculations to carry out for this exercise – the complexity comes from
the fact that there are three components to the condition, and they have to be
joined together in the right way:
data = open("data.csv")
for line in data:
    columns = line.rstrip("\n").split(",")
    species = columns[0]
    sequence = columns[1]
    name = columns[2]
    expression = columns[3]
    if (name.startswith('k') or name.startswith('h')) and species != "Drosophila melanogaster":
        print(name)
complex_condition.py
The line containing the if statement is quite long and may wrap onto a second line
when displayed, but it's still just a single line in the program file. There are two
different ways to express the requirement that the species is not Drosophila
melanogaster. In the above example we've used the not-equals sign (!=) but we
could also have used the not boolean operator:
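The two forms of the condition are compared in the sketch below. To keep it self-contained, the conditions are wrapped in two small helper functions and run on a few hand-written examples rather than on data.csv; the helper names and the pairing of gene names with species are invented for illustration.

```python
def keep_using_not_equals(name, species):
    return (name.startswith('k') or name.startswith('h')) and species != "Drosophila melanogaster"

def keep_using_not(name, species):
    # not applies to the whole equality test that follows it
    return (name.startswith('k') or name.startswith('h')) and not species == "Drosophila melanogaster"

# invented name/species pairs, not taken from the real data file
examples = [
    ("kdy647", "Drosophila melanogaster"),
    ("hdt739", "Drosophila yakuba"),
    ("teg436", "Drosophila ananassae"),
]
for name, species in examples:
    print(name, keep_using_not_equals(name, species), keep_using_not(name, species))
```

Both helpers give the same answer for every input, so the choice between != and not is a matter of which reads more naturally.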
data = open("data.csv")
for line in data:
    columns = line.rstrip("\n").split(",")
    species = columns[0]
    sequence = columns[1]
    name = columns[2]
    expression = columns[3]
    if get_at_content(sequence) > 0.65:
        print(name + " has high AT content")
    elif get_at_content(sequence) < 0.45:
        print(name + " has low AT content")
    else:
        print(name + " has medium AT content")
This general type of problem – filtering data based on a set of criteria – is very
common in programming. There's a similar exercise at the end of the chapter on
functional programming in Advanced Python for Biologists which illustrates a
different approach to solving it.
149 Chapter 7: Regular expressions
7: Regular expressions
1 Note that although many of the things in this list are numerical data, they're still read in to Python
programs as strings and need to be manipulated as such.
In previous chapters, we've looked at some programming tasks that involve pattern
recognition in strings. We've seen how to count individual amino acid residues (and
even groups of amino acid residues) in protein sequences (chapter 5), and how to
identify restriction enzyme cut sites in DNA sequences (chapter 2). We've also seen
how to examine parts of gene names and match them against individual characters
(chapter 6).
The common theme among all these problems is that they involve searching for a
fixed set of characters. But there are many problems that we want to solve that
require more flexible patterns. For example:
• Given a DNA sequence, what's the length of the poly-A tail?
• Given a gene accession name, extract the part between the third character
and the underscore
• Given a protein sequence, determine if it contains this highly-redundant
domain motif
Because these types of problems crop up in so many different fields, there's a
standard set of tools in Python for dealing with them: regular expressions. Regular
expressions are a topic that might not be covered in a general-purpose
programming book, but because they're so useful in biology, we're going to devote
the whole of this chapter to looking at them. The chapter is split into two sections.
In the first part, we'll learn about the concept of regular expression patterns and
in the second part we'll see how to use them.
Although the tools for dealing with regular expressions are built in to Python, they
are not made automatically available when you write a program. In order to use
them we must first talk about modules.
Modules in Python
The functions and data types that we've discussed so far in this book have been
ones that are likely to be needed in pretty much every program – tools for dealing
with strings and numbers, for reading and writing files, and for manipulating lists
of data. As such, they are automatically made available when we start to create a
Python program. If we want to open a file, we simply write a statement that uses
the open() function.
However, there's another category of tools in Python which are more specialized,
and it's in this category that regular expressions fall. In fact, there's a large list of
specialized tools which are very useful when you need them, but are not likely to be
needed for the majority of programs. Examples include tools for doing advanced
mathematical calculations, for downloading data from the web, for running
external programs, and for manipulating date/time information. Each collection of
specialized tools – really just a collection of specialized functions and data types – is
called a module.
For reasons of efficiency, Python doesn't automatically make these modules
available in each new program, as it does with the more basic tools. Instead, we
have to explicitly load each module of specialized tools that we want to use inside
our program. To load a module we use the import statement [1]. For example, the
module that deals with regular expressions is called re, so if we want to write a
program that uses the regular expression tools we must include the line:
import re
at the top of our program. When we then want to use one of the tools from a
module, we have to prefix it with the module name [2]. For example, to use the regular
expression search() function (which we'll discuss later in this chapter) we have to
write:
re.search(pattern, string)
rather than simply:
search(pattern, string)
If we forget to import the module which we want to use, or forget to include the
module name as part of the function call, we will get a NameError.
[1] This is the reason for the from __future__ import division statement that we have to include if
we're using Python 2.
[2] There are ways round this, but we won't consider them in this book.
We'll encounter various other modules in the rest of this book. For the rest of this
chapter specifically, all code examples will require the import re statement in
order to work. For clarity, we won't include it, so if you want to try running any of the
code in this chapter, you'll need to add it at the top.
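To see the module prefix rule in action, here is a small sketch. It uses try and except to catch the error so that the program keeps running; that is an assumption made for illustration and is not something this chapter relies on:

```python
import re

dna = "ATCGCGAATTCAC"

# With the module name as a prefix, the call works
found = re.search(r"GAATTC", dna) is not None
print(found)

# Without the prefix, Python doesn't know what "search" refers to
try:
    search(r"GAATTC", dna)
    error_message = "no error"
except NameError as e:
    error_message = str(e)
print(error_message)
```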
Raw strings
Writing regular expression patterns, as we'll see in the very next section of this
chapter, requires us to type a lot of special characters. Recall from chapter 2 that
certain combinations of characters are interpreted by Python to have special
meaning. For example, \n means start a new line, and \t means insert a tab
character.
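Regular expression patterns often need these backslash combinations to be passed
through exactly as typed. To stop Python from interpreting them, we can put the
letter r immediately before the opening quotation mark of the string: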
print(r"\t\n")
The r stands for raw, which is Python's description for a string where special
characters are ignored. Notice that the r goes outside the quotation marks – it is
not part of the string itself. We can see from the output that the above code prints
out the string just as we've written it:
\t\n
without any tabs or newlines. You'll see this special raw notation used in all the
regular expression code examples in this chapter.
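One way to convince ourselves of the difference is to compare the lengths of a normal string and a raw string that are typed identically. This is just a sketch; the variable names are made up:

```python
normal = "\t\n"   # two characters: a tab and a newline
raw = r"\t\n"     # four characters: backslash, t, backslash, n

print(len(normal))
print(len(raw))
print(raw)
```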
dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")
ecor1.py
Notice that we've used the raw notation for the pattern, even though it's not strictly
necessary as it doesn't contain any special characters. Because the EcoRI enzyme
cuts at a fixed sequence (i.e. there is no variation in the recognition site), we didn't
strictly need regular expressions for this job; we could have just used the count()
method like this:
dna = "ATCGCGAATTCAC"
if dna.count("GAATTC") > 0:
    print("restriction site found!")
Alternation
Now that we've seen a simple example of how to use re.search(), let's look at
something a bit more interesting. This time, we'll check for the presence of an AvaII
recognition site, which can have two different sequences: GGACC and GGTCC. One
way to do this would be to use the techniques we learned in the previous chapter to
make a complex condition using or:
dna = "ATCGCGAATTCAC"
if re.search(r"GGACC", dna) or re.search(r"GGTCC", dna):
    print("restriction site found!")
But a better way is to capture the variation in the AvaII site using a regular
expression. One useful feature of regular expressions is called alternation. To
represent a number of different alternatives, we write the alternatives inside
parentheses separated by a pipe character. In the case of AvaII, there are two
alternatives for the third base – it can be either A or T – so the pattern looks like
this:
GG(A|T)CC
Writing the pattern as a raw string and putting it inside a call to re.search()
gives us the code:
dna = "ATCGCGAATTCAC"
if re.search(r"GG(A|T)CC", dna):
    print("restriction site found!")
ava2.py
Notice the power of what we've done here; we've written a single pattern which
captures all the variation in the sequence in one string.
Character groups
The BisI restriction enzyme cuts at an even wider range of motifs – the pattern is
GCNGC, where N represents any base. We can use the same alternation technique
to represent this pattern:
GC(A|T|G|C)GC
However, there's another regular expression feature that lets us write the pattern
more concisely. A pair of square brackets with a list of characters inside them can
represent any one of these characters. So the pattern GC[ATGC]GC will match
GCAGC, GCTGC, GCGGC and GCCGC. Here's a program that checks for the presence of
a BisI restriction site using character groups:
dna = "ATCGCGAATTCAC"
if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")
bis1.py
Taken together, alternation and character groups do a pretty good job of capturing
the kind of variation that we're interested in for biological programming. Before we
move on, here are two shortcuts that deal with specific, common scenarios.
If we want a character in a pattern to match any character in the input, we can use
a period or dot. For example, the pattern GC.GC would match all four possibilities
in the BisI example above. However, the period would also match any character
which is not a DNA base, or even a letter. Therefore, the whole pattern would also
match GCFGC, GC&GC and GC9GC, which may not be what we want, so be careful
when using this feature.
Sometimes it's easier, rather than listing all the acceptable characters, to specify
the characters that we don't want to match. Putting a caret ^ at the start of a
character group like this [^XYZ] will negate it, and match any character that isn't in
the group.
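The difference between a character group, a dot and a negated group is easiest to see side by side. In this sketch the helper function matching() and the list of candidate strings are made up for illustration:

```python
import re

def matching(pattern, strings):
    # return the strings that contain a match for the pattern
    result = []
    for s in strings:
        if re.search(pattern, s):
            result.append(s)
    return result

candidates = ["GCAGC", "GCTGC", "GCFGC", "GC&GC", "GC9GC"]

group_hits = matching(r"GC[ATGC]GC", candidates)     # only a DNA base in the middle
dot_hits = matching(r"GC.GC", candidates)            # any character in the middle
negated_hits = matching(r"GC[^ATGC]GC", candidates)  # anything except a DNA base

print(group_hits)
print(dot_hits)
print(negated_hits)
```

The dot accepts all five strings, whereas the character group and its negated version split the list cleanly between them.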
Quantifiers
The regular expression features discussed above let us describe variation in the
individual characters of patterns. Another group of features, quantifiers, let us
describe variation in the number of times a section of a pattern is repeated.
A question mark immediately following a character means that that character is
optional – it can match either zero or one times. So in the pattern GAT?C the T is
optional, and the pattern will match either GATC or GAC. If we want to apply a
question mark to more than one character, we can group the characters in
parentheses. For example, in the pattern GGG(AAA)?TTT the group of three As is
optional, so the pattern will match either GGGAAATTT or GGGTTT. Notice that we
now have two different roles for parentheses in regular expressions; they act both
to surround alternations, and to surround sections of patterns for use with
quantifiers.
A plus sign immediately following a character or group means that the character or
group must be present but can be repeated any number of times – in other words,
it will match one or more times. For example, the pattern GGGA+TTT will match
three Gs, followed by one or more As, followed by three Ts. In other words, it will
match GGGATTT, GGGAATTT, GGGAAATTT, etc. but not GGGTTT.
An asterisk immediately following a character or group means that the character or
group is optional, but can also be repeated. In other words, it will match zero or
more times. For example, the pattern GGGA*TTT will match three Gs, followed by
zero or more As, followed by three Ts. So it will match GGGTTT, GGGATTT,
GGGAATTT, etc.
If we want to specify a specific number of repeats, we can use curly brackets.
Following a character or group with a single number inside curly brackets will
match exactly that number of repeats. For example, the pattern GA{5}T will match
GAAAAAT but not GAAAAT or GAAAAAAT. Following a character or group with a
pair of numbers inside curly brackets separated with a comma allows us to specify
an acceptable range of number of repeats. For example, the pattern GA{2,4}T will
match GAAT, GAAAT and GAAAAT but not GAT or GAAAAAT.
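The quantifier examples above can be checked with a short loop. This sketch uses re.fullmatch(), which only succeeds when the pattern describes the whole string; that function is not covered in this chapter and is used here purely as a convenient way to test each example:

```python
import re

# (pattern, string) pairs taken from the examples in the text
checks = [
    (r"GAT?C", "GATC"),
    (r"GAT?C", "GAC"),
    (r"GGG(AAA)?TTT", "GGGTTT"),
    (r"GGGA+TTT", "GGGAATTT"),
    (r"GGGA+TTT", "GGGTTT"),
    (r"GGGA*TTT", "GGGTTT"),
    (r"GA{5}T", "GAAAAAT"),
    (r"GA{5}T", "GAAAAT"),
    (r"GA{2,4}T", "GAAAT"),
    (r"GA{2,4}T", "GAT"),
]

results = []
for pattern, string in checks:
    # True if the pattern matches the entire string
    matched = re.fullmatch(pattern, string) is not None
    results.append(matched)
    print(pattern + " vs " + string + ": " + str(matched))
```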
Positions
The final set of regular expression tools we're going to look at don't represent
characters at all, but rather positions in the input string. The caret symbol ^
matches the start of a string, and the dollar symbol $ matches the end of a string.
The pattern ^AAA will match AAA, but only at the start of a string; it will match
AAATTT but not GGGAAATTT. The pattern GGG$ will match GGG, but only at the
end of a string; it will match AAAGGG but not AAAGGGCCC.
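A quick sketch that checks the four position examples from the paragraph above (the variable names are made up):

```python
import re

start_hit = re.search(r"^AAA", "AAATTT") is not None      # AAA at the start
start_miss = re.search(r"^AAA", "GGGAAATTT") is not None  # AAA present, but not at the start
end_hit = re.search(r"GGG$", "AAAGGG") is not None        # GGG at the end
end_miss = re.search(r"GGG$", "AAAGGGCCC") is not None    # GGG present, but not at the end

print(start_hit, start_miss, end_hit, end_miss)
```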
Combining
The real power of regular expressions comes from combining these tools. Using
alternation, character groups, quantifiers and positions together allows us to
specify very flexible patterns. For example, here's a complex pattern to identify full-
length eukaryotic messenger RNA sequences:
^AUG[AUGC]{30,1000}A{5,10}$
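Reading the pattern from left to right: AUG at the very start of the string, then between 30 and 1000 bases, then a run of 5 to 10 As at the very end. Here is a sketch that tries it on three made-up sequences (the sequences are assumptions for illustration, not real transcripts):

```python
import re

mrna_pattern = r"^AUG[AUGC]{30,1000}A{5,10}$"

full_length = "AUG" + "GCAU" * 10 + "AAAAAA"  # start codon, 40 bases, poly-A tail
too_short = "AUG" + "GCAU" * 2 + "AAAAAA"     # only 14 bases after the start codon
no_start = "CCC" + "GCAU" * 10 + "AAAAAA"     # doesn't begin with AUG

full_length_ok = re.search(mrna_pattern, full_length) is not None
too_short_ok = re.search(mrna_pattern, too_short) is not None
no_start_ok = re.search(mrna_pattern, no_start) is not None

print(full_length_ok, too_short_ok, no_start_ok)
```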
pattern if it matches the entire string. Most of the time we want the former
behaviour.
1 If a match isn't found, then the same thing applies; the function doesn't return False, but a different
built-in value – None – that evaluates as False. If this doesn't make sense, don't worry.
dna = "ATGACGTACGTACGACTG"
extract_match.py
In the above code, we're searching inside a DNA sequence for GA, followed by any
three bases, followed by AC. Notice the difference in how we're using re.search()
compared to the previous examples – rather than using re.search() in an if
statement, we are storing the result in a variable. By calling the group() method
on the resulting match object, we can see the part of the DNA sequence that
matched, and figure out what the middle three bases were:
GACGTAC
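The listing for this example appears to be cut short in this copy. A minimal sketch consistent with the description, assuming the pattern is GA followed by a three-base character group and then AC (the pattern is reconstructed from the prose, so treat it as an assumption):

```python
import re

dna = "ATGACGTACGTACGACTG"

# store the match object in the variable m
m = re.search(r"GA[ATGC]{3}AC", dna)
print(m.group())
```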
What if we want to extract more than one bit of the pattern? Say we want to match
this pattern:
GA[ATGC]{3}AC[ATGC]{2}AC
That's GA, followed by three bases, followed by AC, followed by two bases, followed
by AC again. We can surround the bits of the pattern that we want to extract with
parentheses – this is called capturing it:
GA([ATGC]{3})AC([ATGC]{2})AC
We can now refer to the captured bits of the pattern by supplying an argument to
the group method. group(1) will return the bit of the string matched by the
section of the pattern in the first set of parentheses, group(2) will return the bit
matched by the second, etc.:
dna = "ATGACGTACGTACGACTG"
extract_groups.py
The output shows that the three bases in the first variable section were CGT, and
the two bases in the second variable section were GT:
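A sketch of how the captured sections can be pulled out, using the capturing pattern given above. The wording of the printed labels is an assumption:

```python
import re

dna = "ATGACGTACGTACGACTG"

# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))
```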
If you're keeping count, you'll realize that we now have three different roles for
parentheses in regular expressions:
• surrounding the alternatives in an alternation
• grouping parts of a pattern for use with a quantifier
• defining parts of a pattern to be extracted after the match
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start: " + str(m.start()))
print("end: " + str(m.end()))
positions.py
Remember that we start counting from zero, so in this case, the match starting at
the third base has a start position of two:
start: 2
end: 13
We can get the start and end positions of individual groups by supplying a number
as the argument to start() and end():
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
group_positions.py
In this particular case, we could figure out the start and end positions of the
individual groups from the start and end positions of the whole pattern:
start: 2
end: 13
group one start: 4
group one end: 7
group two start: 9
group two end: 11
but that might not always be possible for patterns that have variable length
repeats.
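A sketch that produces the group positions shown above by passing a group number to start() and end(). The print labels are copied from the output; everything else follows from the pattern already given:

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

print("start: " + str(m.start()))
print("end: " + str(m.end()))
print("group one start: " + str(m.start(1)))
print("group one end: " + str(m.end(1)))
print("group two start: " + str(m.start(2)))
print("group two end: " + str(m.end(2)))
```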
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split(r"[^ATGC]", dna)
print(runs)
split.py
The output shows how the function works – the return value is a list of strings:
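Running the split on the sequence above gives six runs of unambiguous bases. The sketch below repeats the call and also prints the length of each run (the length calculation is an addition for illustration):

```python
import re

dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"

# split wherever there is a character that is not A, T, G or C
runs = re.split(r"[^ATGC]", dna)
print(runs)

run_lengths = []
for run in runs:
    run_lengths.append(len(run))
print(run_lengths)
```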
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[AT]{5,100}", dna)
print(runs)
findall.py
Notice that the return value of the findall() function is not a match object – it is
a straightforward list of strings:
['ATTATAT', 'AAATTATA']
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{5,100}", dna)
finditer.py
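Unlike findall(), finditer() gives us a sequence of match objects, so we can ask each one for its position. The listing above stops before the loop; a sketch of how it might continue, with the message wording and the positions list being assumptions:

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{5,100}", dna)

positions = []
for match in runs:
    # each item is a match object, so start() and end() are available
    run_start = match.start()
    run_end = match.end()
    print("AT rich region from " + str(run_start) + " to " + str(run_end))
    positions.append((run_start, run_end))
```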
Recap
Just as in the previous chapter we learned about two distinct concepts (conditions,
and the statements that use them), in this chapter we learned about regular
expressions and the functions that use them.
We started with a brief introduction to two concepts that, while not part of the
regular expression tools, are necessary in order to use them – modules and raw
strings. We got a far-from-complete overview of features that can be used in regular
expression patterns, and a quick look at the range of different things we can do
with them. Just as regular expressions themselves can range from simple to
complex, so can their uses. We can use regular expressions for simple tasks like
determining whether or not a sequence contains a particular motif, or for
complicated ones like identifying messenger RNA sequences by using complex
patterns.
Before we move on to the exercises, it's important to recognize that for any given
pattern, there are probably multiple ways to describe it using a regular expression.
Near the start of the chapter, we came up with the pattern GG(A|T)CC to describe
the AvaII restriction enzyme recognition site, but it could also be written as:
• GG[AT]CC
• (GGACC|GGTCC)
• (GGA|GGT)CC
• G{2}[AT]C{2}
As with other situations where there are multiple different ways to write the same
thing, it's best to be guided by what is clearest to read.
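We can check that the five versions of the AvaII pattern behave identically by running each of them against the same strings. The four test sequences here are made up: two genuine sites and two near misses.

```python
import re

patterns = [
    r"GG(A|T)CC",
    r"GG[AT]CC",
    r"(GGACC|GGTCC)",
    r"(GGA|GGT)CC",
    r"G{2}[AT]C{2}",
]
sequences = ["GGACC", "GGTCC", "GGCCC", "GGGCC"]

all_hits = []
for pattern in patterns:
    hits = []
    for sequence in sequences:
        if re.search(pattern, sequence):
            hits.append(sequence)
    print(pattern + " matches " + str(hits))
    all_hits.append(hits)
```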
Exercises
Accession names
Here's a list of made-up gene accession names:
xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp
Write a program that will print only the accession names that satisfy the following
criteria – treat each criterion separately:
• contain the number 5
• contain the letter d or e
• contain the letters d and e in that order
• contain the letters d and e in that order with a single letter between them
• contain both the letters d and e in any order
• start with x or y
• start with x or y and end with e
• contain three or more numbers in a row
• end with d followed by either a, r or p
Double digest
In the chapter_7 folder inside the exercises folder, there's a file called dna.txt which
contains a made-up DNA sequence. Predict the fragment lengths that we will get if
we digest the sequence with two made-up restriction enzymes – AbcI, whose
recognition site is ANT*AAT, and AbcII, whose recognition site is GCRW*TG
(asterisks indicate the position of the cut site).
Solutions
Accession names
Obviously, the bulk of the work here is going to be coming up with the regular
expression patterns to select each subset of the accession names. Here's the easy
bit – storing the accession names in a list and then processing them in a loop (the
first line wraps round because it's too long to fit on the page):
Now we can tackle the different criteria one by one. For each example, the code
(bordered by solid lines) is followed immediately by the output (bordered by dotted
lines). Rather than create separate solution files for each part, you'll find solutions
to all the parts of this exercise in a single file called accession_names.py.
The first criterion is straightforward – accessions that contain the number 5. We
don't even have to use any fancy regular expression features:
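The listings for this solution are missing from this copy, so here is a sketch of the list, the loop and the first test. The selected list is an addition that makes the result easy to check; the accession names are the ones given in the exercise:

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

selected = []
for accession in accessions:
    # the pattern is just the single character we are looking for
    if re.search(r"5", accession):
        print(accession)
        selected.append(accession)
```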
xkn59438
hedle3455
xjhd53e
45da
Now for accessions that contain the letters d or e. We can use either alternation or a
character group. Here's a solution using alternation:
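A sketch of the alternation version, with the character group version alongside to show that they select the same names (the two result lists are additions for illustration):

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

with_alternation = []
with_group = []
for accession in accessions:
    if re.search(r"(d|e)", accession):
        print(accession)
        with_alternation.append(accession)
    # the same test written as a character group
    if re.search(r"[de]", accession):
        with_group.append(accession)
```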
yhdck2
eihd39d9
chdsye847
hedle3455
xjhd53e
45da
de37dp
The next one – accessions that contain both the letters d and e, in that order – is a
bit more tricky. We can't just use a simple alternation or a character group, because
they match any of their constituent parts, and we need both d and e. One way to
think of the pattern is d, followed by some other letters and numbers, followed by e.
We have to be careful with our quantifiers, however – at first glance the pattern
d.+e looks good, but it will fail to match the accession where e follows d directly.
To allow for the fact that d might be immediately followed by e, we need to use the
asterisk:
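A sketch comparing the two quantifiers discussed above. The only accession that separates them is de37dp, where e follows d directly:

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

with_asterisk = []
with_plus = []
for accession in accessions:
    # d, then zero or more of any character, then e
    if re.search(r"d.*e", accession):
        print(accession)
        with_asterisk.append(accession)
    # d, then one or more of any character, then e
    if re.search(r"d.+e", accession):
        with_plus.append(accession)
```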
chdsye847
hedle3455
xjhd53e
de37dp
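The next criterion in the exercise asks for d and e in that order with a single letter between them. The explanation and listing for it are missing from this copy; a sketch, assuming the pattern uses a dot for the middle character (a dot matches any character, so [a-z] would be a stricter choice for "letter"):

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

selected = []
for accession in accessions:
    # d, then exactly one character, then e
    if re.search(r"d.e", accession):
        print(accession)
        selected.append(accession)
```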
hedle3455
The next requirement – d and e in any order – is more difficult. We could do it with
an alternation using the pattern (d.*e|e.*d), which translates as d then e, or e
then d. In this case, I think it's clearer to carry out two separate regular expression
searches and combine them into a complex condition:
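A sketch of the two-search approach: one search for d, one for e, joined with and. The alternation pattern from the paragraph above is included for comparison, and the result lists are additions for checking:

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

two_searches = []
one_alternation = []
for accession in accessions:
    # both letters must be present, in either order
    if re.search(r"d", accession) and re.search(r"e", accession):
        print(accession)
        two_searches.append(accession)
    if re.search(r"(d.*e|e.*d)", accession):
        one_alternation.append(accession)
```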
eihd39d9
chdsye847
hedle3455
xjhd53e
de37dp
xkn59438
yhdck2
xjhd53e
We can modify this quite easily to add the requirement that the accession ends
with e. As before, we need to use .* in the middle to match any number of any
character, resulting in quite a complex pattern:
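A sketch of both anchored patterns: first names that start with x or y, then the same pattern extended so that the name must also end with e. The listings are missing from this copy, so the exact patterns are an assumption based on the description:

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

starts_xy = []
starts_xy_ends_e = []
for accession in accessions:
    # x or y at the start of the string
    if re.search(r"^(x|y)", accession):
        starts_xy.append(accession)
    # x or y at the start, anything in the middle, e at the end
    if re.search(r"^(x|y).*e$", accession):
        starts_xy_ends_e.append(accession)

print(starts_xy)
print(starts_xy_ends_e)
```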
xjhd53e
To match three or more numbers in a row, we need a more specific quantifier – the
curly brackets – and a character group which contains all the numbers:
xkn59438
chdsye847
hedle3455
We can actually make this a bit more concise. The character group of all digits is
such a common one that there's a built-in shorthand for it: \d. We can also take
advantage of a shorthand in the curly bracket quantifier – if we leave off the upper
bound, then it matches with no upper limit. The more concise version:
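A sketch showing the long and the concise version of the digit pattern side by side. The upper bound of 100 in the long version is an arbitrary assumption; any number larger than the longest accession would do:

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

long_version = []
concise_version = []
for accession in accessions:
    # explicit character group and explicit upper bound
    if re.search(r"[0123456789]{3,100}", accession):
        long_version.append(accession)
    # \d shorthand and no upper bound
    if re.search(r"\d{3,}", accession):
        concise_version.append(accession)

print(long_version)
print(concise_version)
```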
xkn59438
chdsye847
hedle3455
The final requirement is quite simple and only requires a character group and an
end-of-string anchor to solve:
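A sketch of the final test, using a character group for the last letter and a dollar sign to anchor the match to the end of the name:

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

selected = []
for accession in accessions:
    # d followed by a, r or p, right at the end of the string
    if re.search(r"d[arp]$", accession):
        print(accession)
        selected.append(accession)
```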
45da
de37dp
Double digest
This is a hard problem, and there are several ways to approach it. Let's simplify it by
first figuring out what the fragment lengths would be if we digested the sequence
with just a single restriction enzyme [1]. We'll open and read the file all in one go
(there's no need to process it line-by-line as it's just a single sequence), then we'll
use re.finditer() to figure out the positions of all the cut sites.
The patterns themselves are relatively simple: N means any base, so the pattern for
the AbcI site is A[ATGC]TAAT. The ambiguity code R means A or G and the code W
means A or T, so the pattern for AbcII is GC[AG][AT]TG. Here's the code to
calculate the start positions of the matches for AbcI:
dna = open("dna.txt").read().rstrip("\n")
[1] For the purposes of this exercise, we are of course ignoring all the interesting chemical kinetics of
restriction enzymes and assuming that all enzymes cut with complete specificity and efficiency.
but it's not quite right – it's telling us the positions of the start of each match, but
the enzyme actually cuts 3 base pairs downstream of the start. To get the position of
the cut site, we need to add three to the start of each match:
dna = open("dna.txt").read().rstrip("\n")
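The rest of this listing is missing from this copy. Here is a sketch of the loop that the text describes. Since dna.txt isn't available here, it uses a short made-up sequence containing two AbcI sites; the sequence is an assumption for illustration only:

```python
import re

# made-up sequence standing in for the contents of dna.txt
dna = ("CCGG" + "ACTAAT" + "CCGGCC" + "GCAATG" + "GGCCGGCC"
       + "AGTAAT" + "CCGG" + "GCGTTG" + "CC")

starts = []
cuts = []
for match in re.finditer(r"A[ATGC]TAAT", dna):
    # the match starts at the first base of the recognition site...
    starts.append(match.start())
    # ...but the enzyme cuts three bases further along
    cuts.append(match.start() + 3)

print(starts)
print(cuts)
```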
Now we've got the cut positions, how are we going to work out the fragment sizes?
One way is to go through each cut site in order and measure the distance between
it and the previous one – that will give us the length of a single fragment. To make
this work we'll have to add "imaginary" cut sites at the very start and end of the
sequence:
dna = open("dna.txt").read().rstrip("\n")

all_cuts = [0]
for match in re.finditer(r"A[ATGC]TAAT", dna):
    all_cuts.append(match.start() + 3)
all_cuts.append(len(dna))

print(all_cuts)
Now we can write a second loop to go through the all_cuts list and, for each cut
position, work out the size of the fragment that will be created by figuring out the
distance to the previous cut site (i.e. the previous element in the list). To make this
work, however, we can't just use a normal loop – we have to start at the second
element of the list (because the first element has no previous element) and we have
to work with the index of each element, rather than the element itself. We'll use the
range() function to generate the list of indexes that we want to process – we need
to go from index 1 (i.e. the second element of the list) to the last index (which is
one less than the length of the list; range() stops just before its second argument,
so we pass the length itself):
1 for i in range(1,len(all_cuts)):
2     this_cut_position = all_cuts[i]
3     previous_cut_position = all_cuts[i-1]
4     fragment_size = this_cut_position - previous_cut_position
5     print("one fragment size is " + str(fragment_size))
The loop variable i is used to store each value that is generated by the range
function (line 1). For each value of i we get the cut position at that index (line 2)
and the cut position at the previous index (line 3) and then figure out the distance
between them (line 4). The output shows how, for two cuts, we get three
fragments:
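The output itself is missing from this copy. To see the idea in action, here is a sketch that runs both loops on a short made-up sequence with two AbcI sites (the sequence stands in for dna.txt and is an assumption; the fragment_sizes list is added so the result can be checked):

```python
import re

# made-up sequence standing in for the contents of dna.txt
dna = ("CCGG" + "ACTAAT" + "CCGGCC" + "GCAATG" + "GGCCGGCC"
       + "AGTAAT" + "CCGG" + "GCGTTG" + "CC")

# imaginary cut at the start, real cuts in the middle, imaginary cut at the end
all_cuts = [0]
for match in re.finditer(r"A[ATGC]TAAT", dna):
    all_cuts.append(match.start() + 3)
all_cuts.append(len(dna))

fragment_sizes = []
for i in range(1, len(all_cuts)):
    this_cut_position = all_cuts[i]
    previous_cut_position = all_cuts[i-1]
    fragment_size = this_cut_position - previous_cut_position
    print("one fragment size is " + str(fragment_size))
    fragment_sizes.append(fragment_size)
```

Two real cuts give three fragments, and the fragment sizes add up to the length of the sequence.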
Now for the final part of the solution: how do we do the same thing for two
different enzymes? We can add in the second enzyme pattern with the appropriate
cut site offset and append the cut positions to the all_cuts variable:
dna = open("dna.txt").read().rstrip("\n")
all_cuts = [0]
print(all_cuts)
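The two loops are missing from the listing above in this copy. A sketch of the complete version, again on a made-up sequence in place of dna.txt. The offset of four for AbcII follows from the position of the asterisk in its recognition site GCRW*TG:

```python
import re

# made-up sequence standing in for the contents of dna.txt
dna = ("CCGG" + "ACTAAT" + "CCGGCC" + "GCAATG" + "GGCCGGCC"
       + "AGTAAT" + "CCGG" + "GCGTTG" + "CC")

all_cuts = [0]

# AbcI cuts three bases after the start of its site
for match in re.finditer(r"A[ATGC]TAAT", dna):
    all_cuts.append(match.start() + 3)

# AbcII cuts four bases after the start of its site
for match in re.finditer(r"GC[AG][AT]TG", dna):
    all_cuts.append(match.start() + 4)

all_cuts.append(len(dna))
print(all_cuts)
```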
We get zero, then the two cut positions for the first enzyme in ascending order,
then the two cut positions for the second enzyme in ascending order, then the
position of the end of the sequence. The method for turning a list of cut positions
into fragment sizes that we developed above isn't going to work with this list,
because it relies on the list of positions being in ascending order. If we try it with
the list of cut positions produced by the above code, we'll end up with obviously
incorrect fragment sizes:
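To illustrate the problem, here is a sketch that feeds an unsorted list of cut positions into the fragment size loop. The list is a hypothetical example (two cuts per enzyme in a 48 base sequence), not the result for dna.txt:

```python
# hypothetical cut positions: start, two AbcI cuts, two AbcII cuts, end
all_cuts = [0, 7, 33, 20, 44, 48]

fragment_sizes = []
for i in range(1, len(all_cuts)):
    this_cut_position = all_cuts[i]
    previous_cut_position = all_cuts[i-1]
    fragment_size = this_cut_position - previous_cut_position
    print("one fragment size is " + str(fragment_size))
    fragment_sizes.append(fragment_size)
```

The negative fragment size is the giveaway that the positions are being processed in the wrong order.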
Happily, Python's built-in sorted() function can come to the rescue. All we need to
do is sort the list of cut positions before processing it, and we get the right answers.
Here's the complete, final code:
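import re
dna = open("dna.txt").read().rstrip("\n")
print(str(len(dna)))
all_cuts = [0]
# add cut positions for AbcI (cuts three bases after the start of the match)
for match in re.finditer(r"A[ATGC]TAAT", dna):
    all_cuts.append(match.start() + 3)
# add cut positions for AbcII (cuts four bases after the start of the match)
for match in re.finditer(r"GC[AG][AT]TG", dna):
    all_cuts.append(match.start() + 4)
# add the final position
all_cuts.append(len(dna))
sorted_cuts = sorted(all_cuts)
for i in range(1,len(sorted_cuts)):
    this_cut_position = sorted_cuts[i]
    previous_cut_position = sorted_cuts[i-1]
    fragment_size = this_cut_position - previous_cut_position
    print("one fragment size is " + str(fragment_size))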
double_digest.py
178 Chapter 8: Dictionaries
8: Dictionaries
dna = "ATGATCGATCGAGTGA"
a_count = dna.count("A")
How will our code change if we want to generate a complete list of base counts for
the sequence? We'll add a new variable for each base:
dna = "ATGATCGATCGAGTGA"
a_count = dna.count("A")
t_count = dna.count("T")
g_count = dna.count("G")
c_count = dna.count("C")
and now our code is starting to look rather repetitive. It's not too bad for the four
individual bases, but what if we want to generate counts for the 16 dinucleotides:
dna = "ATGATCGATCGAGTGA"
aa_count = dna.count("AA")
at_count = dna.count("AT")
ag_count = dna.count("AG")
...etc. etc.
or the 64 trinucleotides:
dna = "ATGATCGATCGAGTGA"
aaa_count = dna.count("AAA")
aat_count = dna.count("AAT")
aag_count = dna.count("AAG")
...etc. etc.
For trinucleotides and longer, the situation is particularly bad. The DNA sequence
is 20 bases long, so it only contains 18 overlapping trinucleotides in total:
ATCGATCGATCGTACGCTGA
ATC
 TCG
  CGA
   GAT
...etc.
So there can be, at most, 18 unique trinucleotides in the sequence (and for a
repetitive sequence, many fewer unique trinucleotides). This means that at least 46
out of our 64 variables will hold the value zero.
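The arithmetic can be checked by sliding a three-base window along the 20-base sequence shown above. This is a sketch; the variable names are made up, and the list of unique trinucleotides is built with a simple loop:

```python
dna = "ATCGATCGATCGTACGCTGA"

# one window starts at every position except the last two
trinucleotides = []
for start in range(len(dna) - 2):
    trinucleotides.append(dna[start:start + 3])

# keep the first occurrence of each trinucleotide
unique = []
for trinucleotide in trinucleotides:
    if trinucleotide not in unique:
        unique.append(trinucleotide)

print(len(trinucleotides))
print(len(unique))
print(64 - len(unique))
```

For this particular sequence the repeats at the start mean that even fewer than 18 of the 64 possible trinucleotides appear.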
One possible way round this is to store the values in a list. Let's look at an example
involving dinucleotides. If we create a list of the 16 possible dinucleotides we can
iterate over it, calculate the count for each one, and store all the counts in a list [1].
Take a look at the code – the list of dinucleotides is quite long so it's been split over
four lines to make it easier to read:
[1] For this example, we are just going to write out the dinucleotides as a list in the code in order to keep
things simple. For a discussion of how to generate lists of DNA sequences of any length – not just
dinucleotides! – see the start of the chapter on recursion in Advanced Python for Biologists.
dna = "ATGATCGATCGAGTGA"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = []
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    print("count is " + str(count) + " for " + dinucleotide)
    all_counts.append(count)
print(all_counts)
count_dinucleotides.py
Although the code above is quite compact, and doesn't require huge numbers of
variables, the output shows two problems with this approach:
count is 0 for AA
count is 3 for AT
count is 1 for AG
count is 0 for AC
count is 0 for TA
count is 0 for TT
count is 2 for TG
count is 2 for TC
count is 4 for GA
count is 1 for GT
count is 0 for GG
count is 0 for GC
count is 0 for CA
count is 0 for CT
count is 2 for CG
count is 0 for CC
[0, 3, 1, 0, 0, 0, 2, 2, 4, 1, 0, 0, 0, 0, 2, 0]
Firstly, the data are still incredibly sparse – more than half of the counts are zero.
Secondly, the counts themselves are now disconnected from the dinucleotides. If
we want to look up the count for a single dinucleotide – for example, TG – we first
have to figure out that TG was the 7th dinucleotide in our list. Only
then can we get the element at the correct index:
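The lookup itself is missing from this copy. A sketch of what it involves: because we start counting from zero, the 7th dinucleotide is at index 6. The lists are rebuilt here so that the example runs on its own:

```python
dna = "ATGATCGATCGAGTGA"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']

all_counts = []
for dinucleotide in dinucleotides:
    all_counts.append(dna.count(dinucleotide))

# TG is the 7th dinucleotide in the list, so its count is at index 6
print(all_counts[6])
```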
We can try various tricks to get round this problem. What if we used the index()
method to figure out the position of the dinucleotide we are looking for in the
dinucleotides list?
i = dinucleotides.index('TG')
print(all_counts[i])
This works because we have two lists of the same length, with a one-to-one
correspondence between the elements:
print(dinucleotides)
print(all_counts)
['AA', 'AT', 'AG', 'AC', 'TA', 'TT', 'TG', 'TC', 'GA', 'GT', 'GG', 'GC',
'CA', 'CT', 'CG', 'CC']
[0, 3, 1, 0, 0, 0, 2, 2, 4, 1, 0, 0, 0, 0, 2, 0]
This is a little bit nicer, but still has major drawbacks. We're still storing all those
zeros, and now we have two lists to keep track of. We need to be incredibly careful
when manipulating either of the two lists to make sure that they stay perfectly
synchronized – if we make any change to one list but not the other, then there will
no longer be a one-to-one correspondence between elements and we'll get the
wrong answer when we try to look up a count.
This approach is also slow. (As a rule, we've avoided talking about performance in this
book, but we'll break the rule in this case.) To find the index of a given dinucleotide in
the dinucleotides list, Python has to look at each element one at a time until it
finds the one we're looking for. This means that as the size of the list grows [1], the
time taken to look up the count for a given element will grow alongside it.
If we take a step back and think about the problem in more general terms, what we
need is a way of storing pairs of data (in this case, dinucleotides and their counts)
in a way that allows us to efficiently look up the count for any given dinucleotide.
This problem of storing paired data is incredibly common in programming. We
might want to store:
• protein sequence names and their sequences
• DNA restriction enzyme names and their motifs
• codons and their associated amino acid residues
• colleagues' names and their email addresses
• sample names and their co-ordinates
• words and their definitions
All these are examples of what we call key-value pairs. In each case we have pairs of
keys and values:
Key              Value
trinucleotide    count
name             protein sequence
name             restriction enzyme motif
codon            amino acid residue
sample           coordinates
word             definition
[1] For instance, imagine carrying out the same exercise with the approximately one million unique 10-mers.
The last example in this table – words and their definitions – is an interesting one
because we have a tool in the physical world for storing this type of data – a
dictionary. Python's tool for solving this type of problem is also called a dictionary
(usually abbreviated to dict) and in this chapter we'll see how to create and use
them.
Creating a dictionary
The syntax for creating a dict is similar to that for creating a list, but we use curly
brackets rather than square ones. Each pair of data, consisting of a key and a value,
is called an item. When storing items in a dict, we separate them with commas.
Within an individual item, we separate the key and the value with a colon. Here's a
bit of code that creates a dict of restriction enzymes (using data from the previous
chapter) with three items:
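The one-line listing is missing from this copy. A sketch of what it looks like, using the same three enzymes and patterns that appear in the multi-line version below (the call to len() is added just to confirm the number of items):

```python
enzymes = { 'EcoRI' : r'GAATTC', 'AvaII' : r'GG(A|T)CC', 'BisI' : r'GC[ATGC]GC' }

# three key-value pairs, so three items
print(len(enzymes))
```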
In this case, the keys and values are both strings [1]. Splitting the dict definition over
several lines makes it easier to read:
enzymes = {
'EcoRI' : r'GAATTC',
'AvaII' : r'GG(A|T)CC',
'BisI' : r'GC[ATGC]GC'
}
but doesn't affect the code at all. To retrieve a bit of data from the dict – i.e. to look
up the motif for a particular enzyme – we write the name of the dict, followed by
the key in square brackets:
1 The values are actually raw strings, but that's not important.
print(enzymes['BisI'])
The code looks very similar to using a list, but instead of giving the index of the
element we want, we're giving the key for the value that we want to retrieve. Notice
that this looks very different from the "two lists" approach that we sketched out
earlier. When we want to retrieve a value from the dict, we don't have to iterate over
all the items until we find the one we want – we just give the name of the key and
get back the value.
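To make the contrast concrete, here is a minimal sketch that looks up the same motif both ways. The list names enzyme_names and enzyme_motifs are hypothetical, and the use of index() is an assumption about how a "two lists" lookup might be written:

```python
# "two lists" approach: find the position of the name, then use it as an index
enzyme_names = ['EcoRI', 'AvaII', 'BisI']
enzyme_motifs = [r'GAATTC', r'GG(A|T)CC', r'GC[ATGC]GC']
position = enzyme_names.index('BisI')    # scans the list from the start
motif_from_lists = enzyme_motifs[position]

# dict approach: the key leads straight to the value
enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI'  : r'GC[ATGC]GC'
}
motif_from_dict = enzymes['BisI']

print(motif_from_lists == motif_from_dict)
```

Both lookups give the same motif, but only the list version has to keep two variables in step with each other.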
Before we dive in and start learning about what we can do with dictionaries, we
need to take note of a couple of restrictions. The only types of data we are allowed
to use as keys are strings and numbers1, so we can't, for example, create a
dictionary where the keys are file objects. Values can be whatever type of data we
like. Also, keys must be unique – we can't store multiple values for the same key.
You might think that this makes dicts less useful, but there are ways round the
problem of storing multiple values – we won't need them for the examples in this
chapter, but the chapter on complex data structures in Advanced Python for
Biologists gives details.
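A minimal sketch of both restrictions in action. The placeholder values and the list used as a key are made up for illustration, and the try/except block is only there so that the example runs to the end:

```python
# keys are unique: assigning to an existing key replaces the old value
enzymes = {}
enzymes['EcoRI'] = 'first value'     # placeholder value for illustration
enzymes['EcoRI'] = 'second value'    # overwrites rather than adding a second item
print(len(enzymes))
print(enzymes['EcoRI'])

# a list can be changed after it is created, so Python refuses to use it as a key
try:
    enzymes[['Eco', 'RI']] = 'GAATTC'
    key_problem = None
except TypeError as error:
    key_problem = str(error)
print(key_problem)
```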
In real-life programs, it's relatively rare that we'll want to create a dict all in one go
like in the example above. More often, we'll want to create an empty dict, then add
key/value pairs to it (just as we often create an empty list and then add elements to
it).
To create an empty dict we simply write a pair of curly brackets on their own, and
to add elements, we use the square-brackets notation on the left-hand side of an
assignment. Here's a bit of code that stores the restriction enzyme data one item at
a time:
1 Not strictly true; we can use any immutable type, but that is beyond the scope of this book.
enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII'] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'
We can delete a key from a dict using the pop() method. pop() actually returns
the value and deletes the key at the same time:
enzymes = {
'EcoRI' : r'GAATTC',
'AvaII' : r'GG(A|T)CC',
'BisI' : r'GC[ATGC]GC'
}
# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')
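To see both effects of pop() at once, we can store what it returns and then check the dict. This is a small sketch; the variable name removed_motif is just for illustration:

```python
enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI'  : r'GC[ATGC]GC'
}
# pop() hands back the value belonging to the key it removes
removed_motif = enzymes.pop('EcoRI')
print(removed_motif)
# the key is no longer in the dict, which now holds two items
print('EcoRI' in enzymes)
print(len(enzymes))
```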
Let's take another look at the dinucleotide count example from the start of the
chapter. Here's how we store the dinucleotides and their counts in a dict:
dna = "AATGATGAACGAC"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = {}
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    print("count is " + str(count) + " for " + dinucleotide)
    all_counts[dinucleotide] = count
print(all_counts)
dinucleotide_dict.py
We can see from the output that the dinucleotides and their counts are stored
together in one variable:
We still have a lot of repetitive counts of zero, but looking up the count for a
particular dinucleotide is now very straightforward:
print(all_counts['TA'])
We no longer have to worry about either "memorizing" the order of the counts or
maintaining two separate lists.
Let's now see if we can find a way of avoiding storing all those zero counts. We can
add an if statement that ensures that we only store a count if it's greater than
zero:
dna = "AATGATGAACGAC"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = {}
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    if count > 0:
        all_counts[dinucleotide] = count
print(all_counts)
nonzero_dinucleotides.py
When we look at the output from the above code, we can see that the amount of
data we're storing is much smaller – just the dinucleotides whose counts are
greater than zero:
Now we have a new problem to deal with. Looking up the count for a given
dinucleotide works fine when the count is positive:
print(all_counts['TG'])
But when the count is zero, the dinucleotide doesn't appear as a key in the dict:
print(all_counts['TC'])
KeyError: 'TC'
There are two possible ways to fix this. We can check for the existence of a key in a
dict (just like we can check for the existence of an element in a list), and only try to
retrieve it once we know it exists:
if 'TC' in all_counts:
    print(all_counts['TC'])
Alternatively, we can use the dict's get() method. For a key that is present in the
dict, get() works just like using square brackets: the following two lines do
exactly the same thing:
print(all_counts['TG'])
print(all_counts.get('TG'))
The thing that makes get() really useful, however, is that it can take an optional
second argument, which is the default value to be returned if the key isn't present
in the dict. In this case, we know that if a given dinucleotide doesn't appear in the
dict then its count is zero, so we can give zero as the default value and use get()
to print out the count for any dinucleotide:
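A minimal sketch of such lookups, assuming the all_counts dict is built as in nonzero_dinucleotides.py and that the four dinucleotides queried are the ones shown in the output that follows:

```python
dna = "AATGATGAACGAC"
dinucleotides = ['AA','AT','AG','AC',
                 'TA','TT','TG','TC',
                 'GA','GT','GG','GC',
                 'CA','CT','CG','CC']
all_counts = {}
for dinucleotide in dinucleotides:
    count = dna.count(dinucleotide)
    if count > 0:
        all_counts[dinucleotide] = count

# zero is returned for any dinucleotide that was never stored
print("count for TG is " + str(all_counts.get('TG', 0)))
print("count for TT is " + str(all_counts.get('TT', 0)))
print("count for GC is " + str(all_counts.get('GC', 0)))
print("count for CG is " + str(all_counts.get('CG', 0)))
```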
As we can see from the output, we now don't have to worry about whether or not
any given dinucleotide appears in the dict – get() takes care of everything and
returns zero when appropriate:
count for TG is 2
count for TT is 0
count for GC is 0
count for CG is 1
More generally, assuming we have a dinucleotide string stored in the variable dn, we
can run a line of code like this:
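One way such a line might be written is sketched here. The dict is typed in directly with the non-zero counts for "AATGATGAACGAC" so that the example stands alone, and 'GT' is an arbitrary choice for dn:

```python
# non-zero dinucleotide counts for the sequence "AATGATGAACGAC"
all_counts = {'AA': 2, 'AT': 2, 'AC': 2, 'TG': 2, 'GA': 3, 'CG': 1}

dn = 'GT'   # any dinucleotide string will do here
print("count for " + dn + " is " + str(all_counts.get(dn, 0)))
```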
1 Strictly speaking, in this example there's no need to build a dict at all – we could just check the count and
print a line if it's equal to two – but most programs that use dicts will be a bit more complex.
AA
AT
AC
TG
For this example, this approach works because we have a list of the dinucleotides
already written as part of the program. Most of the time when we create a dict,
however, we'll do it using some other method which doesn't require an explicit list
of the keys. For example, here's a different way to generate a dict of dinucleotide
counts which uses two nested for loops to enumerate all the possible
dinucleotides:
dna = "AATGATGAACGAC"
bases = ['A','T','G','C']
all_counts = {}
for base1 in bases:
    for base2 in bases:
        dinucleotide = base1 + base2
        count = dna.count(dinucleotide)
        if count > 0:
            all_counts[dinucleotide] = count
loops_dinucleotides.py
The resulting dict is just the same as in our previous examples, but because we
haven't got a list of dinucleotides handy, we have to take a different approach to
find all the dinucleotides where the count is two. Fortunately, the information we
need – the list of dinucleotides that occur at least once – is stored in the dict as the
keys.
print(all_counts.keys())
Looking at the output1 confirms that this is the list of dinucleotides we want to
consider (remember that we're looking for dinucleotides with a count of two, so we
don't need to consider ones that aren't in the dict as we already know that they
have a count of zero):
To find all the dinucleotides that occur exactly twice in the DNA sequence we can
take the output of keys() and iterate over it, keeping the body of the loop the
same as before:
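A sketch of what that loop might look like, assuming the loop body is a simple test for a count of two followed by a print:

```python
dna = "AATGATGAACGAC"
bases = ['A','T','G','C']
all_counts = {}
for base1 in bases:
    for base2 in bases:
        dinucleotide = base1 + base2
        count = dna.count(dinucleotide)
        if count > 0:
            all_counts[dinucleotide] = count

# visit each key and print those whose stored count is exactly two
for dinucleotide in all_counts.keys():
    if all_counts.get(dinucleotide) == 2:
        print(dinucleotide)
```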
iterate_over_keys.py
This version prints exactly the same set of dinucleotides as the approach that used
our list:
AA
AC
AT
TG
1 If you're using Python 3 you might see slightly different output here, but all the code examples will work
just the same
Before we move on, take a moment to compare the output immediately above this
paragraph with the output from the version that used the list from earlier in this
section. You'll notice that while the set of dinucleotides is the same, the order in
which they appear is different. This illustrates an important point about dicts in
the versions of Python this book was written for – they are inherently unordered.
That means that when we use the keys() method to iterate over a dict, we can't
rely on processing the items in the same order that we added them. (From Python
3.7 onwards dicts do remember insertion order, so on a recent interpreter the two
outputs may well match.) This is in contrast to lists, which always maintain the
same order when looping. If we want to control the order in which keys are printed
we can use the sorted() function to sort the list of keys before processing it:
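A minimal sketch, with the dict of non-zero counts for "AATGATGAACGAC" typed in directly so that the example stands alone:

```python
all_counts = {'AA': 2, 'AT': 2, 'AC': 2, 'TG': 2, 'GA': 3, 'CG': 1}

# sorted() returns the keys in alphabetical order, whatever order the dict holds them in
for dinucleotide in sorted(all_counts.keys()):
    if all_counts.get(dinucleotide) == 2:
        print(dinucleotide)
```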
We can use the items() method to iterate over pairs of data, rather than just keys:
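As a small sketch, here is such a loop over the restriction enzyme dict from earlier in the chapter; the loop variable names name and motif are our own choice:

```python
enzymes = {
    'EcoRI' : r'GAATTC',
    'AvaII' : r'GG(A|T)CC',
    'BisI'  : r'GC[ATGC]GC'
}
# each time round the loop we receive one key and its matching value
for name, motif in enzymes.items():
    print(name + " recognises " + motif)
```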
The items() method does something slightly different from all the other methods
we've seen so far in this book; rather than returning a single value, or a list of
values, it returns a list of pairs of values1. That's why we have to give two variable
names at the start of the loop. Here's how we can use the items() method to
process our dict of dinucleotide counts just like before:
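One possible version, again assuming that "just like before" means printing the dinucleotides whose count is exactly two, with the counts typed in directly:

```python
all_counts = {'AA': 2, 'AT': 2, 'AC': 2, 'TG': 2, 'GA': 3, 'CG': 1}

# the count arrives alongside the key, so no separate lookup is needed
for dinucleotide, count in all_counts.items():
    if count == 2:
        print(dinucleotide)
```

Iterating with items() suits cases like this one, where every pair has to be examined.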
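If all we want is the value for one particular key, it's tempting to loop over all the
items until we reach that key, and this will work, but it's completely unnecessary
(and slow). Instead, simply use the get() method to ask for the value associated
with the key you want: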
print(all_counts.get('AT'))
1 Each pair is actually a tuple – see the chapter on complex data structures in Advanced Python for Biologists
for a full explanation.
Recap
We started this chapter by examining the problem of storing paired data in Python.
After looking at a couple of unsatisfactory ways to do it using tools that we've
already learned about, we introduced a new type of data structure – the dict –
which offers a much nicer solution to the problem of storing paired data.
Later in the chapter, we saw that the real benefit of using dictionaries is the
efficient lookup they provide. We saw how to create dictionaries and manipulate
the items in them, and several different ways to look up values for known keys. We
also saw how to iterate over all the items in a dictionary.
In the process, we uncovered a few restrictions on what dictionaries are capable of
– we're only allowed to use a couple of different data types for keys, they must be
unique, and we can't rely on their order. Just as a physical dictionary allows us to
rapidly look up the definition for a word but not the other way round, Python
dictionaries allow us to rapidly look up the value associated with a key, but not the
reverse.
Because they can look up a value very rapidly given its key, dicts are
extremely useful when storing complex data. Take a look at the last few sections of
the chapter on complex data structures in Advanced Python for Biologists for some
examples of this technique.
Exercises
DNA translation
Write a program that will translate a DNA sequence into protein. Your program
should use the standard genetic code which can be found at this URL:
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1
Solutions
DNA translation
The description of this exercise is very short, but it hides quite a bit of complexity!
To translate a DNA sequence we need to carry out a number of different steps. First,
we have to split up the sequence into codons. Then, we need to go through each
codon and translate it into the corresponding amino acid residue. Finally, we need
to create a protein sequence by adding all the amino acid residues together.
We'll start off by figuring out how to split a DNA sequence into codons. Because this
exercise is quite tricky, we'll pick a very short test DNA sequence to work on – just
three codons:
dna = "ATGTTCGGT"
How are we going to split up the DNA sequence into groups of three bases? It's
tempting to try to use the split() method, but remember that split() only
works if the things you want to split are separated by a delimiter. In our case,
there's nothing separating the codons, so split() will not help us.
Something that might be able to help us is substring notation. We know that this
allows us to extract part of a string, so we can do something like this:
dna = "ATGTTCGGT"
codon1 = dna[0:3]
codon2 = dna[3:6]
codon3 = dna[6:9]
print(codon1, codon2, codon3)
but it's not a great solution, as we have to fill in the numbers manually. Since the
numbers follow a very predictable pattern, it should be possible to generate them
automatically. The start position for each substring is initially zero, then goes up by
three for each successive codon. The stop position is just the start position plus
three.
Recall that the job of the range() function is to generate sequences of numbers.
In order to generate the sequence of substring start positions, we need to use the
three-argument version of range(), where the first argument is the number to
start at, the second argument is the number to finish at, and the third argument is
the step size. For our DNA sequence above, the number to start at is zero, and the
step size is three. The number to finish at is not six but seven, because ranges are
exclusive at the finish. This bit of code shows how we can use the range()
function to generate the list of start positions:
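A minimal version, using the start, finish and step values just described:

```python
# start at 0, stop before 7, go up in steps of 3
for start in range(0,7,3):
    print(start)
```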
0
3
6
To find the stop position for a given start position we just add three, so now we can
easily split our DNA into codons using a loop:
dna = "ATGTTCGGT"
for start in range(0,7,3):
    codon = dna[start:start+3]
    print("one codon is " + codon)
This works fine for our test DNA sequence, but if we give it a shorter sequence we
will get incomplete and empty codons:
dna = "ATGTT"
for start in range(0,7,3):
    codon = dna[start:start+3]
    print(codon)
and if we give it a longer sequence, we will miss out the fourth and subsequent
codons:
dna = "ATGTTCGGTGAAGCGGGCTAGAT"
for start in range(0,7,3):
    codon = dna[start:start+3]
    print("one codon is " + codon)
Clearly we need to modify the second argument to range() – the position to finish
the sequence of numbers – in order to take into account the length of the DNA
sequence. At this point, we have to confront the problem of what to do if we're
given a DNA sequence whose length is not an exact multiple of three. Clearly, we
cannot translate an incomplete codon, so we want range() to stop before the
length of the DNA sequence minus two. Because the finish argument is exclusive,
the final codon start is then at most the length minus three, which guarantees that
there will always be two more characters following the position of the final codon
start – i.e. enough for a complete codon.
Here's the modified code:
dna = "ATGTTCGGT"
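Spelled out in full, and assuming the same last_codon_start calculation that the later listings use, the loop might look like the sketch below. The function name split_into_codons is hypothetical, chosen only so that we can try the idea on sequences of different lengths:

```python
def split_into_codons(dna):
    # stop before len(dna) - 2 so that every start position has a full codon after it
    last_codon_start = len(dna) - 2
    codons = []
    for start in range(0,last_codon_start,3):
        codon = dna[start:start+3]
        codons.append(codon)
    return codons

print(split_into_codons("ATGTTCGGT"))
print(split_into_codons("ATGTT"))
print(split_into_codons("ATGTTCGGTGAAGCGGGCTAGAT"))
```

The short sequence now gives a single complete codon, and the long one gives all seven complete codons while leaving out the trailing two bases.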
Now that we know how to split a DNA sequence up into codons, let's turn our
attention to the problem of translating those codons. If we pull up the URL from
the exercise description in a web browser, we can see the standard codon
translation table in various formats. Storing this translation table seems like a
perfect job for a dictionary: we have codons (keys) and amino acid residues (values)
and we want to be able to look up the amino acid for a given codon.
Here's a bit of code – it's actually a single statement, spread out over multiple lines
– which creates a dictionary to hold the translation table:
gencode = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}
We can look up the amino acid for a given codon using either of the two methods
that we learned about:
print(gencode['CAT'])
print(gencode.get('GTC'))
H
V
If we look up the amino acid for each codon inside the loop of our original code, we
can print both the codon and the amino acid translation1:
dna = "ATGTTCGGT"
last_codon_start = len(dna) - 2
for start in range(0,last_codon_start,3):
    codon = dna[start:start+3]
    aa = gencode.get(codon)
    print("one codon is " + codon)
    print("the amino acid is " + aa)
This is starting to look promising. The final step is to actually do something with
the amino acid residues rather than just printing them. A nice idea is to take our
cue from the way that a ribosome behaves and add each new amino acid residue
onto the end of a protein to create a gradually-growing string:
1 From now on, we won't include the statement which creates the dictionary in our code samples as it takes
up too much room, so if you want to try running these yourself you'll need to add it back at the top.
1 dna = "ATGTTCGGT"
2 last_codon_start = len(dna) - 2
3 protein = ""
4 for start in range(0,last_codon_start,3):
5     codon = dna[start:start+3]
6     aa = gencode.get(codon)
7     protein = protein + aa
8 print("protein sequence is " + protein)
In the above code, we create a new variable to hold the protein sequence
immediately before we start the loop (line 3), then add a single character onto the
end of that variable each time round the loop (line 7). By the time we exit the loop,
we have built up the complete protein sequence and we can print it out (line 8):
This looks like a very useful bit of code, so let's turn it into a function. Our function
will take one argument – the DNA sequence as a string – and will return a string
containing the protein sequence1:
def translate_dna(dna):
    last_codon_start = len(dna) - 2
    protein = ""
    for start in range(0,last_codon_start,3):
        codon = dna[start:start+3]
        aa = gencode.get(codon)
        protein = protein + aa
    return protein
We can now test our function by printing out the protein translation for a few more
test sequences:
1 You'll notice that this function relies on the gencode variable which is defined outside the function –
something that I told you not to do in chapter 5. This is an exception to the rule: defining the gencode
variable inside the function means that it would have to be created anew each time we wanted to translate
a DNA sequence.
print(translate_dna("ATGTTCGGT"))
print(translate_dna("ATCGATCGATCGTTGCTTATCGATCAG"))
print(translate_dna("actgatcgtagcttgcttacgtatcgtat"))
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))
The output from this code shows that we run into a problem with the third
sequence:
MFG
IDRSLLIDQ
Traceback (most recent call last):
  File "dna_translation.py", line 30, in <module>
    print(translate_dna("actgatcgtagcttgcttacgtatcgtat"))
  File "dna_translation.py", line 25, in translate_dna
    protein = protein + aa
TypeError: cannot concatenate 'str' and 'NoneType' objects
The problem occurs when we try to look up the amino acid for the first codon of the
third sequence – "act". Because the third sequence is in lower case but the
translation table dictionary is in upper case, the key isn't found, the get() method
returns None, and we get an error. Fixing it is straightforward – we just need to
convert the codon to upper case before looking up the amino acid:
def translate_dna(dna):
    last_codon_start = len(dna) - 2
    protein = ""
    for start in range(0,last_codon_start,3):
        codon = dna[start:start+3]
        aa = gencode.get(codon.upper())
        protein = protein + aa
    return protein
Now the output shows that the first three sequences are fine, but that our function
has a problem translating the fourth sequence:
MFG
IDRSLLIDQ
TDRSLLTYR
Traceback (most recent call last):
  File "dna_translation.py", line 31, in <module>
    print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))
  File "dna_translation.py", line 25, in translate_dna
    protein = protein + aa
TypeError: cannot concatenate 'str' and 'NoneType' objects
Glancing at the input sequences, it's not clear what the problem is. Let's try
printing the codons as they're translated in order to identify the one that's causing
the error:
def translate_dna(dna):
    last_codon_start = len(dna) - 2
    protein = ""
    for start in range(0,last_codon_start,3):
        codon = dna[start:start+3]
        print("about to translate codon: " + codon)
        aa = gencode.get(codon.upper())
        protein = protein + aa
    return protein
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))
There is an unknown base in the middle of the DNA sequence, which causes our
function to try to look up the amino acid for the codon NAC, which causes an error
because that codon isn't in the dictionary. How should we fix this? We could add an
if statement to the function which only translates the DNA sequence if it doesn't
contain any ambiguous bases, but that seems a little too conservative – there are
plenty of situations in which we might want to generate a protein sequence for a
DNA sequence that has unknown bases. We could add an if statement inside the
loop which only translates a given codon if it doesn't contain any ambiguous
bases, but that would lead to protein translations of an incorrect length – we know
that the codon NAC will translate to an amino acid, we just don't know which one it
will be.
The most sensible solution seems to be to translate any codon with an unknown
base into the symbol for an unknown amino acid residue, which is X. The optional
second argument to the get() method makes it very easy to do just that:
def translate_dna(dna):
    last_codon_start = len(dna) - 2
    protein = ""
    for start in range(0,last_codon_start,3):
        codon = dna[start:start+3]
        aa = gencode.get(codon.upper(), 'X')
        protein = protein + aa
    return protein
and now we can translate all four of our test sequences correctly:
print(translate_dna("ATGTTCGGT"))
print(translate_dna("ATCGATCGATCGTTGCTTATCGATCAG"))
print(translate_dna("actgatcgtagcttgcttacgtatcgtat"))
print(translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG"))
MFG
IDRSLLIDQ
TDRSLLTYR
TIDRXVRSYS
At this point, it's a good idea to turn these test sequences into assert statements
– that way, we can easily re-test the function if we make some changes to it in the
future:
assert translate_dna("ATGTTCGGT") == "MFG"
assert translate_dna("ATCGATCGATCGTTGCTTATCGATCAG") == "IDRSLLIDQ"
assert translate_dna("actgatcgtagcttgcttacgtatcgtat") == "TDRSLLTYR"
assert translate_dna("ACGATCGATCGTNACGTACGATCGTACTCG") == "TIDRXVRSYS"