PYTHON FOR
BIOINFORMATICS
SECOND EDITION
CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series

Aims and scope:


This series aims to capture new developments and summarize what is known
over the entire spectrum of mathematical and computational biology and
medicine. It seeks to encourage the integration of mathematical, statistical,
and computational methods into biology by publishing a broad range of
textbooks, reference works, and handbooks. The titles included in the
series are meant to appeal to students, researchers, and professionals in the
mathematical, statistical and computational sciences, fundamental biology
and bioengineering, as well as interdisciplinary researchers involved in the
field. The inclusion of concrete examples and applications, and programming
techniques and examples, is highly encouraged.

Series Editors

N. F. Britton
Department of Mathematical Sciences
University of Bath

Xihong Lin
Department of Biostatistics
Harvard University

Nicola Mulder
University of Cape Town
South Africa

Maria Victoria Schneider
European Bioinformatics Institute

Mona Singh
Department of Computer Science
Princeton University

Anna Tramontano
Department of Physics
University of Rome La Sapienza

Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
3 Park Square, Milton Park
Abingdon, Oxfordshire OX14 4RN
UK
Published Titles

An Introduction to Systems Biology: Design Principles of Biological Circuits
Uri Alon

Glycome Informatics: Methods and Applications
Kiyoko F. Aoki-Kinoshita

Computational Systems Biology of Cancer
Emmanuel Barillot, Laurence Calzone, Philippe Hupé, Jean-Philippe Vert, and Andrei Zinovyev

Python for Bioinformatics, Second Edition
Sebastian Bassi

Quantitative Biology: From Molecular to Cellular Systems
Sebastian Bassi

Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby
Jules J. Berman

Chromatin: Structure, Dynamics, Regulation
Ralf Blossey

Computational Biology: A Statistical Mechanics Perspective
Ralf Blossey

Game-Theoretical Models in Biology
Mark Broom and Jan Rychtář

Computational and Visualization Techniques for Structural Bioinformatics Using Chimera
Forbes J. Burkowski

Structural Bioinformatics: An Algorithmic Approach
Forbes J. Burkowski

Spatial Ecology
Stephen Cantrell, Chris Cosner, and Shigui Ruan

Cell Mechanics: From Single Scale-Based Models to Multiscale Modeling
Arnaud Chauvière, Luigi Preziosi, and Claude Verdier

Bayesian Phylogenetics: Methods, Algorithms, and Applications
Ming-Hui Chen, Lynn Kuo, and Paul O. Lewis

Statistical Methods for QTL Mapping
Zehua Chen

An Introduction to Physical Oncology: How Mechanistic Mathematical Modeling Can Improve Cancer Therapy Outcomes
Vittorio Cristini, Eugene J. Koay, and Zhihui Wang

Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems
Qiang Cui and Ivet Bahar

Kinetic Modelling in Systems Biology
Oleg Demin and Igor Goryanin

Data Analysis Tools for DNA Microarrays
Sorin Drăghici

Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition
Sorin Drăghici

Computational Neuroscience: A Comprehensive Approach
Jianfeng Feng

Biological Sequence Analysis Using the SeqAn C++ Library
Andreas Gogol-Döring and Knut Reinert

Gene Expression Studies Using Affymetrix Microarrays
Hinrich Göhlmann and Willem Talloen

Handbook of Hidden Markov Models in Bioinformatics
Martin Gollery

Meta-analysis and Combining Information in Genetics and Genomics
Rudy Guerra and Darlene R. Goldstein

Differential Equations and Mathematical Biology, Second Edition
D.S. Jones, M.J. Plank, and B.D. Sleeman

Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle

Introduction to Proteins: Structure, Function, and Motion
Amit Kessel and Nir Ben-Tal

RNA-seq Data Analysis: A Practical Approach
Eija Korpelainen, Jarno Tuimala, Panu Somervuo, Mikael Huss, and Garry Wong

Introduction to Mathematical Oncology
Yang Kuang, John D. Nagy, and Steffen E. Eikenberry

Biological Computation
Ehud Lamm and Ron Unger

Optimal Control Applied to Biological Models
Suzanne Lenhart and John T. Workman

Clustering in Bioinformatics and Drug Discovery
John D. MacCuish and Norah E. MacCuish

Spatiotemporal Patterns in Ecology and Epidemiology: Theory, Models, and Simulation
Horst Malchow, Sergei V. Petrovskii, and Ezio Venturino

Stochastic Dynamics for Systems Biology
Christian Mazza and Michel Benaïm

Statistical Modeling and Machine Learning for Molecular Biology
Alan M. Moses

Engineering Genetic Circuits
Chris J. Myers

Pattern Discovery in Bioinformatics: Theory & Algorithms
Laxmi Parida

Exactly Solvable Models of Biological Invasion
Sergei V. Petrovskii and Bai-Lian Li

Computational Hydrodynamics of Capsules and Biological Cells
C. Pozrikidis

Modeling and Simulation of Capsules and Biological Cells
C. Pozrikidis

Cancer Modelling and Simulation
Luigi Preziosi

Introduction to Bio-Ontologies
Peter N. Robinson and Sebastian Bauer

Dynamics of Biological Systems
Michael Small

Genome Annotation
Jung Soh, Paul M.K. Gordon, and Christoph W. Sensen

Niche Modeling: Predictions from Statistical Distributions
David Stockwell

Algorithms for Next-Generation Sequencing
Wing-Kin Sung

Algorithms in Bioinformatics: A Practical Introduction
Wing-Kin Sung

Introduction to Bioinformatics
Anna Tramontano

The Ten Most Wanted Solutions in Protein Bioinformatics
Anna Tramontano

Combinatorial Pattern Matching Algorithms in Computational Biology Using Perl and R
Gabriel Valiente

Managing Your Biological Data with Python
Allegra Via, Kristian Rother, and Anna Tramontano

Cancer Systems Biology
Edwin Wang

Stochastic Modelling for Systems Biology, Second Edition
Darren J. Wilkinson

Big Data Analysis for Bioinformatics and Biomedical Discoveries
Shui Qing Ye

Bioinformatics: A Practical Approach
Shui Qing Ye

Introduction to Computational Proteomics
Golan Yona

PYTHON FOR
BIOINFORMATICS
SECOND EDITION

SEBASTIAN BASSI
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the
accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products
does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular
use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170626

International Standard Book Number-13: 978-1-1380-3526-3 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Bassi, Sebastian, author.


Title: Python for bioinformatics / Sebastian Bassi.
Description: Second edition. | Boca Raton : CRC Press, 2017. | Series:
Chapman & Hall/CRC mathematical and computational biology | Includes
bibliographical references and index.
Identifiers: LCCN 2017014460| ISBN 9781138035263 (pbk. : alk. paper) |
ISBN 9781138094376 (hardback : alk. paper) | ISBN 9781315268743 (ebook) |
ISBN 9781351976961 (ebook) | ISBN 9781351976954 (ebook) |
ISBN 9781351976947 (ebook)
Subjects: LCSH: Bioinformatics. | Python (Computer program language)
Classification: LCC QH324.2 .B387 2017 | DDC 570.285--dc23
LC record available at https://lccn.loc.gov/2017014460

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

List of Figures
List of Tables
Preface to the First Edition
Preface to the Second Edition
Acknowledgments

Section I Programming

Chapter 1 Introduction
1.1 WHO SHOULD READ THIS BOOK
1.1.1 What the Reader Should Already Know
1.2 USING THIS BOOK
1.2.1 Typographical Conventions
1.2.2 Python Versions
1.2.3 Code Style
1.2.4 Get the Most from This Book without Reading It All
1.2.5 Online Resources Related to This Book
1.3 WHY LEARN TO PROGRAM?
1.4 BASIC PROGRAMMING CONCEPTS
1.4.1 What Is a Program?
1.5 WHY PYTHON?
1.5.1 Main Features of Python
1.5.2 Comparing Python with Other Languages
1.5.3 How Is It Used?
1.5.4 Who Uses Python?
1.5.5 Flavors of Python
1.5.6 Special Python Distributions
1.6 ADDITIONAL RESOURCES

Chapter 2 First Steps with Python
2.1 INSTALLING PYTHON
2.1.1 Learn Python by Using It
2.1.2 Install Python Locally
2.1.3 Using Python Online
2.1.4 Testing Python
2.1.5 First Use
2.2 INTERACTIVE MODE
2.2.1 Baby Steps
2.2.2 Basic Input and Output
2.2.3 More on the Interactive Mode
2.2.4 Mathematical Operations
2.2.5 Exit from the Python Shell
2.3 BATCH MODE
2.3.1 Comments
2.3.2 Indentation
2.4 CHOOSING AN EDITOR
2.4.1 Sublime Text
2.4.2 Atom
2.4.3 PyCharm
2.4.4 Spyder IDE
2.4.5 Final Words about Editors
2.5 OTHER TOOLS
2.6 ADDITIONAL RESOURCES
2.7 SELF-EVALUATION

Chapter 3 Basic Programming: Data Types
3.1 STRINGS
3.1.1 Strings Are Sequences of Unicode Characters
3.1.2 String Manipulation
3.1.3 Methods Associated with Strings
3.2 LISTS
3.2.1 Accessing List Elements
3.2.2 List with Multiple Repeated Items
3.2.3 List Comprehension
3.2.4 Modifying Lists
3.2.5 Copying a List
3.3 TUPLES
3.3.1 Tuples Are Immutable Lists
3.4 COMMON PROPERTIES OF THE SEQUENCES
3.5 DICTIONARIES
3.5.1 Mapping: Calling Each Value by a Name
3.5.2 Operating with Dictionaries
3.6 SETS
3.6.1 Unordered Collection of Objects
3.6.2 Set Operations
3.6.3 Shared Operations with Other Data Types
3.6.4 Immutable Set: Frozenset
3.7 NAMING OBJECTS
3.8 ASSIGNING A VALUE TO A VARIABLE VERSUS BINDING A NAME TO AN OBJECT
3.9 ADDITIONAL RESOURCES
3.10 SELF-EVALUATION

Chapter 4 Programming: Flow Control
4.1 IF-ELSE
4.1.1 Pass Statement
4.2 FOR LOOP
4.3 WHILE LOOP
4.4 BREAK: BREAKING THE LOOP
4.5 WRAPPING IT UP
4.5.1 Estimate the Net Charge of a Protein
4.5.2 Search for a Low-Degeneration Zone
4.6 ADDITIONAL RESOURCES
4.7 SELF-EVALUATION

Chapter 5 Handling Files
5.1 READING FILES
5.1.1 Example of File Handling
5.2 WRITING FILES
5.2.1 File Reading and Writing Examples
5.3 CSV FILES
5.4 PICKLE: STORING AND RETRIEVING THE CONTENTS OF VARIABLES
5.5 JSON FILES
5.6 FILE HANDLING: OS, OS.PATH, SHUTIL, AND PATH.PY MODULE
5.6.1 path.py Module
5.6.2 Consolidate Multiple DNA Sequences into One FASTA File
5.7 ADDITIONAL RESOURCES
5.8 SELF-EVALUATION

Chapter 6 Code Modularizing
6.1 INTRODUCTION TO CODE MODULARIZING
6.2 FUNCTIONS
6.2.1 Standard Way to Make Python Code Modular
6.2.2 Function Parameter Options
6.2.3 Generators
6.3 MODULES AND PACKAGES
6.3.1 Using Modules
6.3.2 Packages
6.3.3 Installing Third-Party Modules
6.3.4 Virtualenv: Isolated Python Environments
6.3.5 Conda: Anaconda Virtual Environment
6.3.6 Creating Modules
6.3.7 Testing Modules
6.4 ADDITIONAL RESOURCES
6.5 SELF-EVALUATION

Chapter 7 Error Handling
7.1 INTRODUCTION TO ERROR HANDLING
7.1.1 Try and Except
7.1.2 Exception Types
7.1.3 Triggering Exceptions
7.2 CREATING CUSTOMIZED EXCEPTIONS
7.3 ADDITIONAL RESOURCES
7.4 SELF-EVALUATION

Chapter 8 Introduction to Object Orienting Programming (OOP)
8.1 OBJECT PARADIGM AND PYTHON
8.2 EXPLORING THE JARGON
8.3 CREATING CLASSES
8.4 INHERITANCE
8.5 SPECIAL METHODS
8.5.1 Create a New Data Type Using a Built-in Data Type
8.6 MAKING OUR CODE PRIVATE
8.7 ADDITIONAL RESOURCES
8.8 SELF-EVALUATION

Chapter 9 Introduction to Biopython
9.1 WHAT IS BIOPYTHON?
9.1.1 Project Organization
9.2 INSTALLING BIOPYTHON
9.3 BIOPYTHON COMPONENTS
9.3.1 Alphabet
9.3.2 Seq
9.3.3 MutableSeq
9.3.4 SeqRecord
9.3.5 Align
9.3.6 AlignIO
9.3.7 ClustalW
9.3.8 SeqIO
9.3.9 AlignIO
9.3.10 BLAST
9.3.11 Biological Related Data
9.3.12 Entrez
9.3.13 PDB
9.3.14 PROSITE
9.3.15 Restriction
9.3.16 SeqUtils
9.3.17 Sequencing
9.3.18 SwissProt
9.4 CONCLUSION
9.5 ADDITIONAL RESOURCES
9.6 SELF-EVALUATION

Section II Advanced Topics

Chapter 10 Web Applications
10.1 INTRODUCTION TO PYTHON ON THE WEB
10.2 CGI IN PYTHON
10.2.1 Configuring a Web Server for CGI
10.2.2 Testing the Server with Our Script
10.2.3 Web Program to Calculate the Net Charge of a Protein (CGI version)
10.3 WSGI
10.3.1 Bottle: A Python Web Framework for WSGI
10.3.2 Installing Bottle
10.3.3 Minimal Bottle Application
10.3.4 Bottle Components
10.3.5 Web Program to Calculate the Net Charge of a Protein (Bottle Version)
10.3.6 Installing a WSGI Program in Apache
10.4 ALTERNATIVE OPTIONS FOR MAKING PYTHON-BASED DYNAMIC WEB SITES
10.5 SOME WORDS ABOUT SCRIPT SECURITY
10.6 WHERE TO HOST PYTHON PROGRAMS
10.7 ADDITIONAL RESOURCES
10.8 SELF-EVALUATION

Chapter 11 XML
11.1 INTRODUCTION TO XML
11.2 STRUCTURE OF AN XML DOCUMENT
11.3 METHODS TO ACCESS DATA INSIDE AN XML DOCUMENT
11.3.1 SAX: cElementTree Iterparse
11.4 SUMMARY
11.5 ADDITIONAL RESOURCES
11.6 SELF-EVALUATION

Chapter 12 Python and Databases
12.1 INTRODUCTION TO DATABASES
12.1.1 Database Management: RDBMS
12.1.2 Components of a Relational Database
12.1.3 Database Data Types
12.2 CONNECTING TO A DATABASE
12.3 CREATING A MYSQL DATABASE
12.3.1 Creating Tables
12.3.2 Loading a Table
12.4 PLANNING AHEAD
12.4.1 PythonU: Sample Database
12.5 SELECT: QUERYING A DATABASE
12.5.1 Building a Query
12.5.2 Updating a Database
12.5.3 Deleting a Record from a Database
12.6 ACCESSING A DATABASE FROM PYTHON
12.6.1 PyMySQL Module
12.6.2 Establishing the Connection
12.6.3 Executing the Query from Python
12.7 SQLITE
12.8 NOSQL DATABASES: MONGODB
12.8.1 Using MongoDB with PyMongo
12.9 ADDITIONAL RESOURCES
12.10 SELF-EVALUATION

Chapter 13 Regular Expressions
13.1 INTRODUCTION TO REGULAR EXPRESSIONS (REGEX)
13.1.1 REGEX Syntax
13.2 THE RE MODULE
13.2.1 Compiling a Pattern
13.2.2 REGEX Examples
13.2.3 Pattern Replace
13.3 REGEX IN BIOINFORMATICS
13.3.1 Cleaning Up a Sequence
13.4 ADDITIONAL RESOURCES
13.5 SELF-EVALUATION

Chapter 14 Graphics in Python
14.1 INTRODUCTION TO BOKEH
14.2 INSTALLING BOKEH
14.3 USING BOKEH
14.3.1 A Simple X-Y Plot
14.3.2 Two Data Series Plot
14.3.3 A Scatter Plot
14.3.4 A Heatmap
14.3.5 A Chord Diagram

Section III Python Recipes with Commented Source Code

Chapter 15 Sequence Manipulation in Batch
15.1 PROBLEM DESCRIPTION
15.2 PROBLEM ONE: CREATE A FASTA FILE WITH RANDOM SEQUENCES
15.2.1 Commented Source Code
15.3 PROBLEM TWO: FILTER NOT EMPTY SEQUENCES FROM A FASTA FILE
15.3.1 Commented Source Code
15.4 PROBLEM THREE: MODIFY EVERY RECORD OF A FASTA FILE
15.4.1 Commented Source Code

Chapter 16 Web Application for Filtering Vector Contamination
16.1 PROBLEM DESCRIPTION
16.1.1 Commented Source Code
16.2 ADDITIONAL RESOURCES

Chapter 17 Searching for PCR Primers Using Primer3
17.1 PROBLEM DESCRIPTION
17.2 PRIMER DESIGN FLANKING A VARIABLE LENGTH REGION
17.2.1 Commented Source Code
17.3 PRIMER DESIGN FLANKING A VARIABLE LENGTH REGION, WITH BIOPYTHON
17.4 ADDITIONAL RESOURCES

Chapter 18 Calculating Melting Temperature from a Set of Primers
18.1 PROBLEM DESCRIPTION
18.1.1 Commented Source Code
18.2 ADDITIONAL RESOURCES

Chapter 19 Filtering Out Specific Fields from a GenBank File
19.1 EXTRACTING SELECTED PROTEIN SEQUENCES
19.1.1 Commented Source Code
19.2 EXTRACTING THE UPSTREAM REGION OF SELECTED PROTEINS
19.2.1 Commented Source Code
19.3 ADDITIONAL RESOURCES

Chapter 20 Inferring Splicing Sites
20.1 PROBLEM DESCRIPTION
20.1.1 Infer Splicing Sites with Commented Source Code
20.1.2 Sample Run of Estimate Intron Program

Chapter 21 Web Server for Multiple Alignment
21.1 PROBLEM DESCRIPTION
21.1.1 Web Interface: Front-End. HTML Code
21.1.2 Web Interface: Server-Side Script. Commented Source Code
21.2 ADDITIONAL RESOURCES

Chapter 22 Drawing Marker Positions Using Data Stored in a Database
22.1 PROBLEM DESCRIPTION
22.1.1 Preliminary Work on the Data
22.1.2 MongoDB Version with Commented Source Code

Section IV Appendices

Appendix A Collaborative Development: Version Control with GitHub
A.1 INTRODUCTION TO VERSION CONTROL
A.2 VERSION YOUR CODE
A.3 SHARE YOUR CODE
A.4 CONTRIBUTE TO OTHER PROJECTS
A.5 CONCLUSION
A.6 METHODS
A.7 ADDITIONAL RESOURCES

Appendix B Install a Bottle App in PythonAnywhere
B.1 PYTHONANYWHERE
B.1.1 What Is PythonAnywhere
B.1.2 Installing a Web App in PythonAnywhere

Appendix C Scientific Python Cheat Sheet
C.1 PURE PYTHON
C.2 VIRTUALENV
C.3 CONDA
C.4 IPYTHON
C.5 NUMPY
C.6 MATPLOTLIB
C.7 SCIPY
C.8 PANDAS

Index

List of Figures

2.1 Anaconda install in macOS.
2.2 Anaconda Python interactive terminal.
2.3 PyCharm Edu welcome screen.

3.1 Intersection.
3.2 Union.
3.3 Difference.
3.4 Symmetric difference.
3.5 Case 1.
3.6 Case 2.

5.1 Excel formatted spreadsheet called sampledata.xlsx.

8.1 IUPAC nucleic acid notation table.

9.1 Anatomy of a BLAST result.

10.1 Our first CGI.
10.2 CGI accessed from local disk instead of from a web server.
10.3 greeting.html: A very simple form.
10.4 Output of CGI program that processes greeting.html.
10.5 Form protcharge.html ready to be submitted.
10.6 Net charge CGI result.
10.7 Hello World program made in Bottle, as seen in a browser.
10.8 Form for the web app to calculate the net charge of a protein.

11.1 Screenshot of XML viewer.
11.2 Codebeautify, a web based XML viewer.

12.1 Screenshot of phpMyAdmin.
12.2 Creating a new database using phpMyAdmin.
12.3 Creating a new table using phpMyAdmin.
12.4 View of the Student table.
12.5 An intentionally faulty “Grades” table.
12.6 A better “Grades” table.
12.7 Courses table: A lookup table.
12.8 Modified “Grades” table.
12.9 Screenshot of SQLite manager.
12.10 View from a MongoDB cloud provider.

14.1 A circle with Bokeh.
14.2 Four circles with Bokeh.
14.3 A simple plot with Bokeh.
14.4 A two data series plot with Bokeh.
14.5 Scatter plot graphics.
14.6 A heatmap out of a microarray experiment.
14.7 A chord diagram.

16.1 HTML form for sequence filtering.
16.2 HTML form for sequence filtering.

21.1 Muscle Web interface.

22.1 Product of Listing 22.2, using the demo dataset (NODBDEMO).

A.1 The git add/commit process.
A.2 Working with a local repository.
A.3 Working with both a local and remote repository as a single user.
A.4 Contributing to open source projects.

B.1 “Consoles” tab.
B.2 The “Web” tab.
B.3 Upgrading domain type option.
B.4 Select a web framework screen, select Bottle.
B.5 Select a Python and Bottle version.
B.6 Form to enter the path of the web app.
B.7 The sample web app is ready to use.
B.8 The “File” tab.
B.9 Form to create a new directory in PythonAnywhere.
B.10 View and upload files into your account.
B.11 Front-end of the program to calculate charge of a protein using Bottle and hosted in PythonAnywhere.

List of Tables

2.1 Arithmetic-Style Operators

3.1 Common List Operations
3.2 Methods Associated with Dictionaries

9.1 Sequence and Alignment Formats
9.2 Blast programs
9.3 eUtils

10.1 Frameworks for Web Development

12.1 Students in Python University
12.2 Table with primary key
12.3 MySQL Data Types

13.1 REGEX Special Sequences

A.1 Resources

Preface to the First Edition

This book is a result of the experience accumulated during several years of working
for an agricultural biotechnology company. As a genomic database curator, I gave
support to staff scientists with a broad range of bioinformatics needs. Some of them
just wanted to automate the same procedure they were already doing by hand, while
others would come to me with biological problems to ask if there were bioinformatics
solutions. Most cases had one thing in common: Programming knowledge was
necessary for finding a solution to the problem. The main purpose of this book is to
help those scientists solve their biological problems by teaching them the basics
of programming. To this end, I have attempted to avoid
taking for granted any programming-related concepts. The chosen language for this
task is Python.
Python is an easy-to-learn computer language that is gaining traction among
scientists. This is likely because it is easy to use, yet powerful enough to accomplish
most programming goals. With Python the reader can start doing real programming
very quickly. Journals such as Computing in Science and Engineering, Briefings
in Bioinformatics, and PLOS Computational Biology have published introductory
articles about Python. Scientists are using Python for molecular visualization,
genomic annotation, data manipulation, and countless other applications.
In the particular case of the life sciences, the development of Python has been
very important; the best example is the Biopython package. For this reason, Section
II is devoted to Biopython. That said, I don’t claim that Biopython is the solution to
every biology problem in the world. Sometimes a simple custom-made solution may
better fit the problem at hand. There are other packages like BioNEB and CoreBio
that the reader may want to try.
The book begins with the very basics, with Section I (“Programming”) teaching
the reader the principles of programming. From the very beginning, I place a special
emphasis on practice, since I believe that programming is something that is best
learned by doing. That is why there are code fragments spread throughout the book. The
reader is expected to experiment with them, and attempt to internalize them. There
are also a few sparse comparisons with other languages; they are included only when
doing so illuminates the current topic. I believe that most language comparisons do
more harm than good when teaching a new language. They introduce information
that is incomprehensible and irrelevant for most readers.
In an attempt to keep the interest of the reader, most examples are related
to biology in some way. Even so, these examples can be followed even if the reader
doesn’t have any specific knowledge in that field.
To reinforce the practical nature of this book, and also to serve as reference
material, Section IV, called “Python Recipes with Commented Source Code,”
provides complete programs.
These programs can be used as is, but are intended to be used as a basis for other
projects. Readers may find that some examples are very simple; they do their job
without too many bells and whistles. This is intentional. The main reason for this
is to illustrate a particular aspect of the application without distracting the reader
with unnecessary features, as well as to avoid discouraging the reader with complex
programs. There will always be time to add features and customizations once the
basics have been learned.
The title of Section III (“Advanced Topics”) may seem intimidating, but in
this case, advanced doesn’t necessarily mean difficult. Eventually, everyone will
use the chapters in this section, especially those on relational database management
systems (RDBMS) and XML. An important part of bioinformatics work is building
and querying databases, which is why I consider knowing an RDBMS like MySQL
to be a relevant part of the bioinformatics skill set. Integrating data from different
sources is one of the tasks most frequently performed in bioinformatics. The tool of
choice for this task is XML. This standard is becoming a widely used platform for
data interchange between applications. Python has several XML parsers and we
explain most of them in this book.
Appendix B, “Selected Papers,” provides introductory level papers on Python.
Although there is some overlap among their subjects, this is intentional: it shows
several points of view of the same subject.
Researchers are not the only ones for whom this book will be beneficial. It has
also been structured to be used as a university textbook. Students can use it for
programming classes, especially in the new bioinformatics majors.
Preface to the Second Edition

The first edition of Python for Bioinformatics was written in 2008 and published
in 2009. Even after eight years, the lessons in this book are still valuable. This is
quite an accomplishment in a field that evolves at such a fast pace. In spite of its
usefulness, the book is showing its age and would greatly benefit from a second
edition.
The predominant Python version is 3.6, although Python 2.7 is still in use in
production systems. Since there are incompatibilities between these versions, a lot of
effort was made to make all code in the book Python 3 compatible.
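
As a rough illustration of the kind of incompatibility mentioned above, the sketch below shows three well-known differences between Python 2 and Python 3. The snippets and variable names are illustrative assumptions, not code taken from the book.

```python
# Illustrative sketch only; not taken from the book's listings.

# 1. print is a function in Python 3.
#    (Python 2 also accepted the statement form: print "ATG")
codon = "ATG"
print(codon)

# 2. "/" is true division in Python 3; "//" is floor division in both versions.
gc_count, seq_length = 7, 20
gc_fraction = gc_count / seq_length   # a float in Python 3; Python 2 would give 0
whole_codons = seq_length // 3        # integer result in both versions

# 3. Text (str) and binary data (bytes) are distinct types in Python 3,
#    so bytes read from a file or socket must be decoded before use as text.
raw = b"ATGC"
text = raw.decode("ascii")
```

Code written with these three points in mind tends to need far fewer changes when moved from Python 2.7 to Python 3.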
Not only has the software changed in these past eight years, but enterprise attitude
and support toward Open Source Software in general and Python in particular
have changed dramatically. There are also new computing paradigms that can’t be
ignored, such as collaborative development and cloud computing.
In the original book, Chapter 14 was called “Collaborative Development: Version
Control” and was based on Bazaar, software that follows the currently used
distributed development workflow but is not what most developers use
today. By far, most software development is now done with Git on GitHub. This
chapter was rewritten to focus on current practices.
Web development is another area that changed significantly. Although this is
not a book about web development, the chapter “Web Applications” now reflects
current usage of long-running processes and frameworks instead of CGI/WSGI and
middleware-based applications. Frameworks were discussed as a side note in this
chapter, but now the chapter is based around a framework (Bottle) and leaves the
old method as a historical footnote.
In databases, NoSQL gained a lot of traction. From being a bullet point in
the first edition, it now has its own section using MongoDB, and a Python recipe
was changed to use this NoSQL database.
Graphical libraries have improved since 2009, and there are high-quality competing
graphics libraries available for Python. There is a whole chapter devoted to
Bokeh, a free interactive visualization library.
Another change that is reflected in this book is the usage of Anaconda and
Jupyter Notebooks (with all code in a cloud notebook provided by Microsoft
Azure¹).

¹ See https://notebooks.azure.com/py4bio/libraries/py3.us


Regarding source code, there is a GitHub repository at
https://github.com/Serulab/Py4Bio where you can download all the code and sample
files used in this book.
There are corrections in every chapter. Sometimes there were actual mistakes,
but most of the corrections were related to the Python 3 upgrade and to keeping
up with current good practices. Since I expect that this book may itself
need corrections, I made a web page where readers can get updates. Please
take a look at http://py3.us and subscribe to the low-volume mailing list while
you are at it.
Apart from software evolution and paradigm shifts, I also gained development
experience and changed my views on pedagogical matters. During these years I
worked on a genome sequencing project at an international consortium and as a
senior software developer in an NYSE-listed company (Globant). In the last 5 years
I worked for several well-known clients such as Salesforce and National Geographic.
I am currently working at PLOS (Public Library of Science).
By request of MATLAB, I include their contact information:
MATLAB® is a registered trademark of The MathWorks, Inc. For product
information please contact: The MathWorks, Inc. 3 Apple Hill Drive Natick, MA,
01760-2098 USA Tel: 508-647-7000 Fax: 508-647-7001 E-mail: info@mathworks.com
Web: www.mathworks.com
Regarding the Biopython logo, which is used on the cover, here is its usage
license (this covers all Biopython files, including its logo):
Biopython is currently released under the "Biopython License Agreement"
(given in full below). Unless stated otherwise in individual file headers, all
Biopython’s files are under the "Biopython License Agreement".
Some files are explicitly dual licensed under your choice of the "Biopython
License Agreement" or the "BSD 3-Clause License" (both given in full below). This
is with the intention of later offering all of Biopython under this dual licensing
approach.

Biopython License Agreement


Permission to use, copy, modify, and distribute this software and its documentation
with or without modifications and for any purpose and without fee is hereby
granted, provided that any copyright notices appear in all copies and that both
those copyright notices and this permission notice appear in supporting
documentation, and that the names of the contributors or copyright holders not be
used in advertising or publicity pertaining to distribution of the software without
specific prior permission.
THE CONTRIBUTORS AND COPYRIGHT HOLDERS OF THIS SOFTWARE DISCLAIM ALL
WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE CONTRIBUTORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL
DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR
PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.

BSD 3-Clause License


Copyright (c) 1999-2017, The Biopython Contributors
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

• Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.

• Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or other
materials provided with the distribution.

• Neither the name of the copyright holder nor the names of its contributors may
be used to endorse or promote products derived from this software without specific
prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
Acknowledgments

A project such as this book couldn’t be done by just one person. For this reason,
there is a long list of people who deserve my thanks. Although the average reader
doesn’t care about the names, and at the risk of leaving someone out,
I would like to acknowledge the following people: my wife Virginia Gonzalez (Vicky)
and my son Maximo Bassi, who had to contend with my virtual absence for
more than a year. Vicky also assisted me in countless ways during manuscript
preparation. My parents and professors taught me important lessons. My family
(Oscar, Graciela, and Ramiro) helped me with the English copyediting, along with
Hugo and Lucas Bejar. Vicky, Griselda, and Eugenio also helped by providing a
development abstraction layer, which is needed for writers and developers.
I would like to thank the people in the local Python community
(http://www.python.org.ar): Facundo Batista, Lucio Torre, Gabriel Genellina, John
Lenton, Alejandro J. Cura, Manuel Kaufmann, Gabriel Patiño, Alejandro Weil, Marcelo
Fernandez, Ariel Rossanigo, Mariano Draghi, and Buanzo. I would choose Python
again just for this great community. I also thank the people at Biopython: Jeffrey
Chang, Brad Chapman, Peter Cock, Michiel de Hoon, and Iddo Friedberg. Peter
Cock deserves special thanks for his comments on the Biopython chapter. I also thank
Shashi Kumar and Pablo Di Napoli, who helped me with the LaTeX2ε issues, and
Sunil Nair, who believed in me from the first moment. I also thank the people at
Globant who trusted me, such as Guido Barosio, Josefina Chausovsky, Lucas Campos,
Pablo Brenner, and Guibert Englebienne; Globant co-workers such as Pedro Mourelle,
Chris DeBlois, Rodrigo Obi-Wan Iloro, Carlos Del Rio, and Alejandro Valle; and, at
PLOS, Jeffrey Gray and Nick Peterson.

I
Programming

CHAPTER 1

Introduction
CONTENTS
1.1 Who Should Read This Book
    1.1.1 What the Reader Should Already Know
1.2 Using this Book
    1.2.1 Typographical Conventions
    1.2.2 Python Versions
    1.2.3 Code Style
    1.2.4 Get the Most from This Book without Reading It All
    1.2.5 Online Resources Related to This Book
1.3 Why Learn to Program?
1.4 Basic Programming Concepts
    1.4.1 What Is a Program?
1.5 Why Python?
    1.5.1 Main Features of Python
    1.5.2 Comparing Python with Other Languages
          Readability
          Speed
    1.5.3 How Is It Used?
    1.5.4 Who Uses Python?
    1.5.5 Flavors of Python
    1.5.6 Special Python Distributions
1.6 Additional Resources

The most effective way to do it, is to do it.

Amelia Earhart

1.1 WHO SHOULD READ THIS BOOK


This book is for the life science researcher who wants to learn how to program.
Readers may have previous exposure to computer programming, but this is not
necessary to understand this book (although it surely helps).
This book is designed to be useful to several separate but related audiences:
students, graduates, postdocs, and staff scientists, since all of them can benefit
from knowing how to program.

Exposing students to programming at early stages in their career helps to boost
their creativity and logical thinking, and both skills can be applied in research. In
order to ease the learning process for students, all subjects are introduced with the
minimal prerequisites. There are also questions at the end of each chapter. They
can be used for self-assessing how much you’ve learned. The answers are available
to teachers in a separate guide.
Graduates and staff scientists with actual programming needs should find the
book’s many real-world examples and abundant reference material extremely valuable.

1.1.1 What the Reader Should Already Know


Since this book is called Python for Bioinformatics, it has been written with the
following assumptions in mind:

• No programming knowledge is assumed, but the reader is required to have
minimum computer proficiency to be able to use a text editor and handle basic
tasks in the operating system (OS). Since Python is multi-platform, most
instructions from this book will apply to the most common operating systems
(Windows, macOS and Linux); when there is a command or a procedure that
applies only to a specific OS, it will be clearly noted.

• The reader should be working (or at least planning to work) with bioinformatics
tools. Experience with even small manual tasks, such as using NCBI BLAST
to identify a sequence, aligning proteins, searching for primers, or estimating a
phylogenetic tree, will be useful for following the examples. The more familiar the
reader is with bioinformatics, the better he or she will be able to apply the
concepts learned in this book.

1.2 USING THIS BOOK


1.2.1 Typographical Conventions
There are some typographical conventions I have tried to use in a uniform way
throughout the book. They should aid readability and were chosen to tell apart
user-made names (or variables) from language keywords. This comes in handy when
learning a new computer language.
Bold: Objects provided by Python and by third-party modules. With this notation
it should be clear that round is part of the language and not a user-defined
name. Bold is also used to highlight parts of the text. There is no way to confuse
one bold usage with the other.
Mono-spaced font: User-declared variables, code, and filenames. For example:
sequence = 'MRVLLVALALLALAASATS'.
Italics: In commands, it is used to denote a variable that can take different
values. For example, in len(iterable), “iterable” can take different values. Used in
text, it marks a new word or concept. For example, “One such fundamental data
structure is a dictionary.”
The content of lines starting with $ (dollar sign) is meant to be typed in your
operating system console (also called command prompt in Windows or terminal
in macOS).
↩ : Line break. Some lines are longer than the available space on a printed
page, so this symbol is inserted to mean that what is on the next line on the page
belongs to the same line on the computer screen. Inside code, the symbol used is
<=.

1.2.2 Python Versions


The current version of Python at this moment is 3.6.1. There is a 2.7.12 version that
is maintained¹ because there is still a sizable number of applications in production
using the 2.7 branch. Versions 3.x and 2.x differ enough to be incompatible.
Python 3 is more efficient than Python 2 in many aspects.
Large websites such as Instagram migrated from Python 2.7 to Python 3.6 to save
up to 30% in CPU and memory consumption. This book uses Python 3.6.
The only scenario where you may need to use Python 2.7, apart from maintenance
of old code, is when a specific library is not available for Python 3. In this case,
before starting a project in Python 2.7, try to search for a replacement
library. For example, suppose you want to connect to a MySQL database and you
are told to use MySQLdb. Since this package is not Python 3 compatible, instead
of using Python 2.7, use mysqlclient or mysql-connector-python; both work
with Python 3.
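
Because the two branches are incompatible, a script can check which interpreter is running it before doing anything else. The following is only a minimal sketch and not code from the book: the function name is_supported and the (3, 6) threshold are assumptions chosen to match the version used here.

```python
import sys


def is_supported(version_info=sys.version_info, minimum=(3, 6)):
    """Return True when the interpreter version is at least `minimum`.

    Both the function name and the (3, 6) default are illustrative
    assumptions, not part of the book's code.
    """
    return tuple(version_info[:2]) >= minimum


# Report whether the running interpreter is recent enough.
print(is_supported())
```

Passing a version tuple explicitly, as in is_supported((2, 7, 12)), makes it easy to see how the comparison behaves for an older interpreter.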

1.2.3 Code Style


Python source code that appears in this book is presented as listings. Each line of
these listings is numbered. These numbers are not intended to be typed; they are
used to reference each line in the text. You don’t need to copy the code from the
book, since it can be downloaded from the GitHub repository at
https://github.com/Serulab/Py4Bio.
Code can be formatted in several ways and still be valid to the Python interpreter.
The following code is syntactically correct:

def GetAverage(X):
    avG=sum(X)/len(X)
    " Calculate the average "
    return avG

Also this one:

def get_average(items):
    """ Calculate the average
    """
    average = sum(items) / len(items)
    return average

¹ Python 2.7.x has an end-of-life date in 2020. There will be no Python 2.8. For more
information see https://www.python.org/dev/peps/pep-0373/.

The latter code sample follows most accepted coding styles for Python.²
Throughout the book you will find mostly code formatted as the second sample.
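
One practical difference between the two samples goes beyond naming and spacing. The sketch below, which simply repeats both functions, suggests that Python only treats a string as documentation when it is the first statement of the function body; the checks on __doc__ are an addition for illustration and not part of the book's listings.

```python
def GetAverage(X):
    avG=sum(X)/len(X)
    " Calculate the average "
    return avG


def get_average(items):
    """ Calculate the average
    """
    average = sum(items) / len(items)
    return average


# Both functions compute the same value.
print(GetAverage([1, 2, 3]) == get_average([1, 2, 3]))

# The string in GetAverage comes after an assignment, so it is not
# stored as the function's docstring.
print(GetAverage.__doc__)

# In get_average the string is the first statement, so it is stored.
print(get_average.__doc__)
```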
Some code in the book will not follow accepted coding styles for the following
reasons:

• There are some instances where the most didactic way to show a particular
piece of code conflicts with the style guide. On those few occasions, I choose
to deviate from the style guide in favor of clarity.

• Due to size limitation in a printed book, some names were shortened and
other minor drifts from the coding styles have been introduced.

• To show that there is more than one way to write the same code. Coding
style is a guideline, and enforcement is not made at a language level, so some
programmers don’t follow it thoroughly. You should be able to read “bad”
code, since sooner or later you will have to read other people’s code.

1.2.4 Get the Most from This Book without Reading It All
• If you want to learn how to program, read the first section, from Chapter
1 to Chapter 8. The Regular Expressions (REGEX) chapter (Chapter 13) can
be skipped if you don’t need to deal with REGEX.

• If you know Python and just want to know about Biopython, first read
Chapter 9 (from page 158 to page 209). It is about Biopython modules and
functions. Then try to follow the programs found in Section III (from page 315
to page 363).

• There are three appendixes that can be read independently. Appendix
A (Collaborative Development: Version Control with GitHub) reproduces a
paper called “A Quick Introduction to Version Control with Git and GitHub.”
Appendix B shows how to install a web application using Python Anywhere.
Appendix C is reference material that can be used as a cheat sheet when
you need a quick answer without having to read a chapter.

² The official Python style guide is located at https://www.python.org/dev/peps/pep-0008,
and an easier-to-read style guide is located at
http://docs.python-guide.org/en/latest/writing/style.

1.2.5 Online Resources Related to This Book


The book website is at http://py3.us. On this site you will find errata, a mailing
list to keep you updated about Python, and links to source code repositories.
The official source code repository of this book is on GitHub
(https://github.com/Serulab/Py4Bio). From this site you can inspect online
or download all the code used in this book. To download all scripts, press the
green “Clone or download” button. If you have Git installed on
your machine (and know how to use it³), clone the repository using this address:
git@github.com:Serulab/Py4Bio.git. Another alternative is to click on
“Download ZIP”. Once you have the repository on your machine, go to the code
folder, which contains a set of folders, each holding the scripts related to one
chapter. Each script in the book has a name that corresponds to its filename.
There is another folder called notebooks, which contains Jupyter notebooks
that can be run locally. For more information on how to run a Jupyter
notebook, please see
http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/execute.html.
Another online resource is the set of Jupyter Notebooks available at the Microsoft
Azure Notebooks website (https://notebooks.azure.com/py4bio/libraries/py3.us).
The same notebooks that are in the book repository can be used online on this site.

1.3 WHY LEARN TO PROGRAM?


Many of the tasks that a researcher performs with his or her computer are repetitive:
collecting data from a web page, converting files from one format to another, running
or interpreting hundreds of BLAST searches, designing primers, looking for restriction
enzymes, etc. In many cases it is evident that these are tasks that can be performed
with a computer, with less effort on our part and without the possibility of errors
caused by tiredness or distractions.
An important consideration when you’re evaluating whether or not to create a
program is the apparent time lost in the definition and formulation of the problem,
implementing it with code, and then debugging it (correcting errors). It is incorrect
to consider problem definition and evaluation as a waste of time. It is generally at
this precise point in the process where we understand thoroughly the problem that
we face. It is common that during the attempt to formulate a problem, we realize
that many of our initial assumptions were mistaken. It also helps us to detect when
it is necessary to restart the planning process. When this happens, it is better that
it happens at the planning stage than when we are in the middle of the project. In
these cases, the planning of the program represents time saved. Another advantage
to take into account is that the time that is invested to create a program once is
compensated by the speed with which the tasks are performed every time we run
it.
³ In Appendix A there is a tutorial on how to use GitHub.

Not only can a program automate the procedures that we do manually, but it can
also do things that would otherwise not be possible.
Sometimes it is not very clear if a particular task can be done by a program.
Reading a book such as this one (including the examples) will help you identify
which tasks are feasible to automate with software and which ones are better done
manually.

1.4 BASIC PROGRAMMING CONCEPTS


Before installing Python, let’s review some programming fundamentals. If you have
some previous programming experience, you may want to skip this section and jump
straight to Chapter 2 “Installing Python.” This section introduces basic concepts
such as instructions, data types, variables, and some other related terminology that
is used throughout this book.

1.4.1 What Is a Program?


Computers only know what you tell them. The way to tell them to do something
is with a program. A program is a set of ordered instructions designed to command
the computer to do something. The word “ordered” is there because it is not enough
to declare what to do; the actual order of the directions must also be stated.⁴
A program is often characterized as a recipe. A typical recipe consists of a
list of ingredients followed by step-by-step instructions on how to prepare a dish.
This analogy is reflected in several programming websites and tutorials with the
words “recipe” and “cookbook.” A laboratory protocol is another useful analogy. A
protocol is defined as a “predefined written procedural method in the design and
implementation of experiments.”
Here is a typical protocol, followed almost every day in several molecular labo-
ratories:

Listing 1.1: Protocol for Lambda DNA digestion

Restriction Digestion of Lambda DNA

Materials

 5.0 mcL Lambda DNA (0.1 g/L)
 2.5 mcL 10x buffer
16.5 mcL H2O
 1.0 mcL EcoRI

Procedure

Incubate the reagents at 37°C for 1 hr.
Add 2.5 mcL loading dye and incubate for another 15 minutes.
Load 20 mcL of the digestion mixture onto a minigel

⁴ There are declarative languages that state what the program should accomplish, rather than
describing how to accomplish it. Most computer languages (Python included) are imperative instead
of declarative.

There are at least two components of a protocol: materials or ingredients, and
procedures. A procedure gives specific orders such as incubate, add, mix, load, and
many others. The same goes for a computer program. The programmer gives specific
orders to the computer: print, read, write, add, multiply, round, and others.
While protocol procedures correlate with program instructions, materials are
the data. In protocols, procedures are applied to materials: Mix 2.5 µL of buffer
with 5 µL of Lambda DNA and 16.5 µL of H2O, load 20 µL onto a minigel. In a
program, instructions are applied to data: print the text string “Hello”, add two
integer numbers, round a float number.
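
The analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names (materials, reaction_volume, and so on) are hypothetical and do not come from the book; the volumes are the ones listed in the protocol above.

```python
# The three instructions mentioned in the text, each applied to some data.
print("Hello")          # print a text string
print(2 + 3)            # add two integer numbers
print(round(2.9))       # round a float number

# Materials of the protocol represented as data (volumes in microliters).
# The dictionary and its name are illustrative assumptions.
materials = {
    "Lambda DNA": 5.0,
    "10x buffer": 2.5,
    "H2O": 16.5,
    "EcoRI": 1.0,
}

# Procedures are instructions applied to that data.
reaction_volume = sum(materials.values())
loading_dye = 2.5
final_volume = reaction_volume + loading_dye

print("Reaction volume (mcL):", reaction_volume)
print("Volume after adding loading dye (mcL):", final_volume)
```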
Just as a protocol can be written in different languages (like English, Spanish, or
French), there are different languages to program a computer. In science, English is
the de facto language. Due to historical, commercial and practical reasons, there is
no such equivalent in computer science. There are several languages, each with its
own strong points and weaknesses. For reasons that will make sense shortly, Python
was the computer language chosen for this book.
Let’s see a simple Python program:

Listing 1.2: sample.py: Sample Python Program

1 seq_1 = 'Hello,'
2 seq_2 = ' you!'
3 total = seq_1 + seq_2
4 seq_size = len(total)
5 print(seq_size)

Note: The numbers at the beginning of each line are for reference only;
they are not meant to be typed.
This small program can be read as “Name the string 'Hello,' as seq_1. Name
the string ' you!' as seq_2. Add the strings named seq_1 and seq_2 and call the
result total. Get the length of the string called total and name this value
seq_size. Print the value of seq_size.” This program prints the number 11.
As shown, there are different types of data (often called “data types” or just
“types”). Numbers (integers or floats), text strings, and other data types are covered
in Chapter 3. In print(seq_size), the instruction is print and seq_size is the
name of the data. Data is often represented as variables. A variable is a name
that stands for a value that may vary during program execution. With variables,
a programmer can represent a generic command like “round n” instead of “round
2.9.” This way the programmer can take into account a non-fixed (hence variable)
value. When the program is executed, “n” should take a specific value since there is
no way to “round n.” This can be done by assigning a value to a variable or by
binding a name to a value.⁵ The difference between “assign a value to a variable”
and “bind a name to a value” is explained in detail in Chapter 3 (from page 64). In
any case, it is expressed as:

variable_name = value

Note that this is not an equality as seen in mathematics. In an equality,
terms can be interchanged, but in programming, the term on the right (value)
takes the name of the term on the left (variable_name). For example,

seq_1 = 'Hello'

After this assignment, the variable seq_1 can be used, for example:

print(seq_1)

This is translated as “print out the value called seq_1.” This command prints
Hello because this is the value of this variable.
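
The ideas of this section can be tried out in a short session. The following is a hedged sketch rather than a listing from the book: the names n, greeting, and seq_size are chosen only for illustration. It shows a name taking different values over time, two names bound to the same value, and the type of each value.

```python
# "round n" only makes sense once n stands for a specific value.
n = 2.9
print(round(n))            # 3

# The same name can later stand for a different value.
n = 7.2
print(round(n))            # 7

# Binding a second name to the same string object.
seq_1 = 'Hello'
greeting = seq_1
print(greeting is seq_1)   # True

# Different kinds of data have different types.
seq_size = len(seq_1)
print(type(seq_1).__name__, type(seq_size).__name__, type(n).__name__)
```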

1.5 WHY PYTHON?


Let’s have a look at some Python features worth pointing out.

1.5.1 Main Features of Python


• Readability: When we talk about readability, we refer as much to the original
programmer as any other person interested in understanding the code. It is
not an uncommon occurrence for someone to write some code then return
to it a month later and find it difficult to understand. Sometimes Python is
called a “human-readable language.”

• Built-in features: Python comes with “batteries included.” It has a rich and
versatile standard library that is immediately available, without the user having
to download separate packages. With Python you can, in a few lines, read
and write XML and JSON files, parse and generate email messages, extract
files from a zip archive, open a URL as if it were a file, and do many other things
that would require a third-party library in other languages.

• Availability of third-party modules for a broad spectrum of activities. Data
visualization⁶ and plotting, PDF generation, bioinformatics analysis,⁷ image
processing,⁸ machine learning,⁹ game development, interface with popular
databases,¹⁰ and application software are only a handful of examples of modules
that can be installed to extend Python functionality.

⁵ In Python the latter form is used.
⁶ Matplotlib (http://matplotlib.org/) and Bokeh (http://bokeh.pydata.org/en/latest/)
are the most used.
⁷ Biopython library to make your own bioinformatics applications (http://biopython.org/).

• High-level built-in data structures: Dictionaries, sets, lists, tuples, and others.
These are very useful to model real-world data. Third-party modules such as
NumPy and SciPy can also extend the structures to kd-trees, n-dimensional
arrays, matrix operations, time-series, image objects, and more.

• Multiparadigm: Python can be used as a “classic” procedural language or as
a “modern” object-oriented programming (OOP) language. Most programmers
start writing code in a procedural way and when they need to, they upgrade
to OOP. Python doesn’t force programmers to write OOP code when they
just want to write a simple script.

• Extensibility: If the built-in methods and available third-party modules are
not enough for your needs, you can easily extend Python, even in other
programming languages. There are some applications written mostly in Python
but with processor-demanding routines in C or FORTRAN. Python can also
be extended by connecting it to specialized high-level languages like R or
MATLAB¹¹.

• Open source: Python has a liberal open source license that makes it freely
usable and distributable, even for commercial use.

• Cross platform: A program made in Python can be run on any computer
that has a Python interpreter. This way, a program made under Windows 10
can run unmodified in Linux or macOS. Python interpreters are available for
most computers and operating systems, and even for some devices with embedded
computers like the Raspberry Pi.

• Thriving community: Python is nowadays the programming language to use
for scientists and researchers.¹² This translates into more libraries for your
projects and people you can go to for support.
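
Some of the items in the list above can be tried with nothing but a standard Python installation. The sketch below is not from the book; the record contents, file name, and e-mail addresses are made-up placeholders. It uses only the standard library modules json, zipfile, io, and email to round-trip a dictionary through JSON, store it in an in-memory zip archive, and build an e-mail message from it.

```python
import io
import json
import zipfile
from email.message import EmailMessage

# A built-in dictionary holding a made-up record (placeholder data).
record = {
    "id": "seq1",
    "sequence": "MRVLLVALALLALAASATS",
    "tags": ["signal", "peptide"],
}

# Write the record as JSON text and read it back.
text = json.dumps(record)
restored = json.loads(text)

# Store the JSON text in a zip archive kept in memory, then extract it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("record.json", text)
with zipfile.ZipFile(buffer) as archive:
    extracted = archive.read("record.json").decode("utf-8")

# Generate an e-mail message carrying the extracted text.
message = EmailMessage()
message["Subject"] = "Sequence record"
message["From"] = "sender@example.org"
message["To"] = "reader@example.org"
message.set_content(extracted)

print(restored == record)
print(message["Subject"])
```

Nothing is written to disk or sent over the network here; the archive lives in a BytesIO buffer and the message is only constructed, which keeps the example safe to run anywhere.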

1.5.2 Comparing Python with Other Languages


You may be wondering why you should use Python, and not more well-known
languages like Java, PHP, or C++. It is a good question. A programming language
can be regarded as a tool, and choosing the best tool for the job makes a lot of
sense.

⁸ Scikit-image paper: http://peerj.com/articles/453
⁹ scikit-learn website: http://scikit-learn.org/stable/
¹⁰ https://wiki.python.org/moin/DatabaseProgramming
¹¹ MATLAB® is a registered trademark of The MathWorks, Inc. For product information please
contact: The MathWorks, Inc. 3 Apple Hill Drive Natick, MA, 01760-2098 USA. Tel: 508-647-7000.
Fax: 508-647-7001. E-mail: info@mathworks.com. Web: www.mathworks.com.
¹² http://www.nature.com/news/programming-pick-up-python-1.16833

Readability
Nonprofessional programmers tend to value the learning curve as much as the
legibility of the code (both aspects are tightly related).
A simple “hello world” program in Python looks like this:

print("Hello world!")

Compare it with the equivalent code in Java:

public class Hello
{
    public static void main(String[] args) {
        System.out.printf("Hello world!");
    }
}

Let’s see a code sample in C language. The following program reads a file
(input.txt) and copies its contents into another file (output.txt):

#include <stdio.h>
int main(int argc, char **argv) {
    FILE *in, *out;
    int c;
    in = fopen("input.txt", "r");
    out = fopen("output.txt", "w");
    while ((c = fgetc(in)) != EOF) {
        fputc(c, out);
    }
    fclose(out);
    fclose(in);
}

The same program in Python is shorter and easier to read:

with open("input.txt") as input_file:
    with open("output.txt", "w") as output_file:
        output_file.writelines(input_file)

Let’s see a Perl program that calculates the average of a series of numbers:

sub avg(@_) {
    $sum += $_ foreach @_;
    return $sum / @_ unless @_ == 0;
    return 0;
}
print avg((1..5))."\n";

The equivalent program in Python is:

def avg(data):
    if len(data)==0:
        return 0
    else:
        return sum(data)/len(data)
print(avg([1,2,3,4,5]))

The purpose of this Python program could be almost fully understood by just
knowing English.
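
For comparison, the standard library already ships an averaging function. The sketch below is an addition and not part of the book's example: it places the avg function next to statistics.mean to show that both agree on the sample data. One difference worth knowing is that statistics.mean raises a StatisticsError for empty input instead of returning 0.

```python
from statistics import StatisticsError, mean


def avg(data):
    """Average of a sequence of numbers; returns 0 for empty input."""
    if len(data) == 0:
        return 0
    return sum(data) / len(data)


numbers = [1, 2, 3, 4, 5]
print(avg(numbers))
print(mean(numbers))

# The two approaches differ on empty input.
print(avg([]))
try:
    mean([])
except StatisticsError as error:
    print("mean() refuses empty data:", error)
```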
Python is designed to be a highly readable language.¹³ The use of English keywords,
and the use of spaces to limit code blocks and its internal logic (indentation),
contribute to this end. It’s possible to write hard-to-read code in Python, but it
requires a deliberate effort to obfuscate the code.¹⁴

Speed
Another criterion to consider when choosing a programming language is execution
speed. In the early days of computer programming, computers were so slow that
some differences due to language implementation were very significant. It could take
a week for a program to be executed in an interpreted language, while the same
code in a compiled language could be executed in a day. This performance difference
between interpreted and compiled languages still has the same proportion, but it
is less relevant. This is because a program that once took a week to run now takes less
than ten seconds, while the compiled one takes about one second. Although the
difference seems important (at least one order of magnitude), it is not so relevant
if we consider the time it takes to develop it.
13
Other languages are regarded as “write only,” since once their code is written, it is
very difficult to understand.
14
A simple print('Hello World') program could be written, if you are so inclined, as
print(''.join([chr((L>=65 and L<=122) and (((((L>=97) and (L-96) or (L-64))-1)+13)%26+((L>=97) and 97 or 65)) or L) for L in [ord(C) for C in 'Uryyb Jbeyq!']]))
(py3.us/1).

This does not mean that execution speed does not need to be considered. A 10X
speed difference can be crucial in some high-performance computing operations.
Sometimes a lot of improvement can be achieved by writing optimized code. If the
code is written with speed optimization in mind, it is possible to obtain results quite
similar to the ones that could be obtained in a compiled language. In the cases where
the programmer is not satisfied with the speed obtained by Python, it is possible
to link to an external library written in another language (like C or Fortran). This
way, we can get the best of both worlds: the ease of Python programming with the
speed of a compiled language.
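One way to see the effect of optimized code is to time two versions of the same computation. The sketch below is an illustration under stated assumptions: the function name loop_sum, the data size, and the repetition count are arbitrary choices made here, and the timings will vary from machine to machine. It compares an explicit loop with the built-in sum function, which in CPython is implemented in C.

```python
import timeit


def loop_sum(values):
    """Add the values one at a time in interpreted Python bytecode."""
    total = 0
    for value in values:
        total += value
    return total


data = list(range(100_000))

# Both approaches must agree on the result before timing them.
assert loop_sum(data) == sum(data)

# timeit.timeit calls the function `number` times and returns total seconds.
loop_seconds = timeit.timeit(lambda: loop_sum(data), number=20)
builtin_seconds = timeit.timeit(lambda: sum(data), number=20)

print(f"explicit loop: {loop_seconds:.4f} s")
print(f"built-in sum:  {builtin_seconds:.4f} s")
```

Measuring before optimizing is usually worthwhile, since the slow part of a program is not always where one expects it to be.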

1.5.3 How Is It Used?


Python has a wide range of applications. From cell phones to web servers, there
are thousands of Python applications in the most diverse fields. There is Python
code powering Wikipedia robots, helping design next generation special effects at
Industrial Light & Magic,15 embedded in D-Link modems and routers,16 and serving as
the scripting language of the OpenOffice suite.17
Some languages are strong in one niche (like PHP for web applications, Java for
desktop programs), but Python can’t be typecast easily.
With a single code base, Python desktop applications run with a native look
and feel on multiple platforms. Well-known examples of this category include the
BitTorrent p2p client/server, Calibre (an e-book manager), Sage Math (a mathematics
software system), the Dropbox client, and more.
As a language for building web applications, Python can be found in high-traffic
sites like Reddit, National Geographic, Instagram, and NASA. There is specialized
software for building web sites (called web frameworks) in Python, such as Django,
Web2Py, Pyramid, Flask, and Bottle.
Python provides a broad range of tools for tasks ranging from system administration
to data analysis:

• Generic Operating System Services (os, io, time, curses)

• File and Directory Access (os.path, glob, tempfile, shutil)

• Data Compression and Archiving (zipfile, gzip, bz2)

• Interprocess Communication and Networking (subprocess, socket, ssl)

• Internet (email, urllib, http)

• String Services (string, re, codecs, unicodedata)
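To give a feeling for how these modules work together, here is a short sketch that combines several of the ones listed above (tempfile, os.path, shutil, gzip and glob). The file name sequence.txt and the DNA string are made up for the illustration.

```python
import glob
import gzip
import os.path
import shutil
import tempfile

# Work inside a temporary directory that is deleted automatically.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "sequence.txt")
    with open(original, "w") as handle:
        handle.write("ATGGCCATTGTAATGGGCCGC\n")

    # Compress the file with gzip; shutil.copyfileobj streams the bytes.
    compressed = original + ".gz"
    with open(original, "rb") as source, gzip.open(compressed, "wb") as target:
        shutil.copyfileobj(source, target)

    # glob finds files by pattern; os.path.basename strips the directory.
    pattern = os.path.join(workdir, "*")
    found = sorted(os.path.basename(path) for path in glob.glob(pattern))
    print(found)

    # Read the compressed copy back as text to confirm the round trip.
    with gzip.open(compressed, "rt") as handle:
        restored = handle.read()
    print(restored.strip())
```

Everything used here ships with Python, so the script needs no installation step beyond the interpreter itself.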

Python is gaining momentum as the default computer language for the scientific
community. There are several libraries and distributions oriented toward scientific
users, such as SciPy18 and Anaconda.19 Both integrate modules for linear algebra,
signal processing, optimization, statistics, genetic algorithms, interpolation, ODE
solvers, special functions, etc.
15
https://www.python.org/about/success/ilm/
16
https://www.python.org/about/success/dlink/
17
http://wiki.services.openoffice.org/wiki/Python
18
https://www.scipy.org
19
https://www.continuum.io/anaconda-overview
Python has support for parallel programming with pyMPI and 2D/3D scientific
data plotting.
Python is known to be used in wide and diverse fields like engineering, electron-
ics, astronomy, biology, paleomagnetism, geography, and many more.

1.5.4 Who Uses Python?


Python is used by several companies, from small and unknown shops up to big
players in their fields like Google, National Geographic, Disney, NASA, NYSE, and
many more.
It is one of the four “official languages” of Google, along with Java, C++ and Go.
They have web sites made in Python, stand-alone programs and even hosting so-
lutions.20 As a confirmation that Google is taking Python seriously, in December
2005 they hired Guido van Rossum, the creator of Python. It may not be Google’s
main language, but this shows that they are a strong supporter of it.
Even Microsoft, a company not known for its support of open source programs,
developed a version of Python to run on its “.Net” platform (IronPython)
and also developed the Python Tools for Visual Studio,21 a free, open source
plugin that turns Visual Studio into a Python IDE.
Many well-known Linux distributions already use Python in their key tools.
Ubuntu Linux “prefers the community to contribute work in Python.” Python is so
tightly integrated into Linux that some distributions won’t run without a working
copy of Python.

1.5.5 Flavors of Python


Although in this book I refer to Python as a programming language, Python is
actually a language definition. What we use most of the time is a specific imple-
mentation, CPython, that is the Python language definition implemented in C.
Since this implementation is the most used, we simply refer to the CPython
implementation as Python.
The most relevant Python implementations are: CPython, PyPy,22 Stackless,23
Jython24 and IronPython.25 This book will focus on the standard Python version
(CPython), but it is worth knowing about the different versions.

20
https://cloud.google.com/appengine/
21
https://www.visualstudio.com/vs/python/
22
http://codespeak.net/pypy/dist/pypy/doc/home.html
23
http://www.stackless.com
24
http://www.jython.org/Project
25
http://ironpython.net

• CPython: The most used Python version, so the terms CPython and Python
are used interchangeably. It is made mostly in C (with some modules made
in Python) and is the version that is available from the official Python Web
site (http://www.python.org).
• PyPy: A Python version made in Python. It was conceived to allow program-
mers to experiment with the language in a flexible way (to change Python
code without knowing C). It is mostly an experimental platform.
• Stackless: Another experimental Python implementation. Unlike PyPy, this
implementation does not focus on flexibility; instead, it provides advanced
features not available in the “standard” Python version. This is done in
order to overcome some design decisions taken early in Python development
history. Stackless allows custom-designed Python applications to scale better
than their CPython counterparts. This implementation is being used in the EVE
Online massively multi-player online game, Civilization IV, Second Life, and
Twisted.
• Jython: A Python version written in Java. It works in a JVM (Java Virtual
Machine). One application of Jython is to add the Jython libraries to a
Java system to allow users to add functionality to the application. A very
well-known 3D programming learning environment (Alice26) uses Jython to
let the users program their own scripts.
• IronPython: Python version adapted by Microsoft to run on the “.Net” and
“Mono” platforms. .Net is a technology that competes with Java regarding
“write once, run everywhere.”
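If you are unsure which of these implementations is running your code, the standard library can tell you. The following is a minimal sketch; the exact strings printed depend on the interpreter and version installed on your machine.

```python
import platform
import sys

# python_implementation() returns a name such as "CPython", "PyPy",
# "Jython", or "IronPython", depending on the interpreter running the code.
implementation = platform.python_implementation()
version = platform.python_version()

print(f"Running {implementation} {version}")

# sys.implementation.name holds a lowercase identifier, e.g. "cpython".
print(sys.implementation.name)
```

A check like this can be handy when a script relies on a module that is only available in one implementation.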

1.5.6 Special Python Distributions


Apart from Python implementations, there are some special adaptations of the
original CPython that are packaged for specific purposes. They are called Python
bundles or distributions. Most of them bring to the table third-party software such as
editors, visualization modules and the Jupyter Notebook, a web application
that allows you to create and share documents that contain live code, equations,
visualizations and explanatory text. Here is a list of the most useful distributions:27

• ActivePython:28 Aimed at enterprise users, ActiveState provides a precompiled,
supported, quality-assured Python distribution that makes it easy for
corporations to comply with policy requirements to have supported open
source products. From a technical standpoint, it offers all modern Python
versions with the most used external modules already pre-installed. It also has its
own package management and external modules repository (PyPM29).
26
Alice is available for free at http://www.alice.org.
27
For a complete list of Python implementations and distributions see
https://www.python.org/download/alternatives
28
http://www.activestate.com/activepython
29
https://code.activestate.com/pypm/

• Enthought Canopy:30 Another all-in-one Python solution. Includes over
450 core scientific analytic and Python packages, like NumPy, SciPy, IPython,
2D and 3D visualization, database adapters, and others. Also includes a code
editor with Jupyter Notebook support. It has some add-ons, such as a graphical
package manager that notifies you of updates, installs with one click and
helps you roll back package versions. Everything is available as a single-click
installer for the three major operating systems. This bundle is suitable for
scientific users, and it is made by the same people who made NumPy and
SciPy. There are different licenses, like a free academic one and various paid
commercial enterprise licenses.

• WinPython:31 It defines itself as a free open-source portable distribution
of the Python programming language for Windows 7/8/10 and scientific and
educational usage. It includes packages suitable for scientists, data scientists,
and education (NumPy, SciPy, Sympy, Matplotlib, Pandas, pyqtgraph,
etc.). It uses Spyder (Scientific PYthon Development EnviRonment) as the default
editor, and it is portable in the sense that the user can move the WinPython
directory and all settings are kept. You can have multiple copies of isolated
and self-consistent WinPython installations.

• Anaconda:32 A Python and R distribution for scientific computing. Includes
over 720 packages for data preparation, data analysis, data visualization, machine
learning, and interactive data science. It shares the objective and user
type with Enthought Canopy. It also comes with Spyder as the default code
editor. It has several products that differentiate it from other Python distributions,
like Repository, Accelerate, Scale, Mosaic, Notebooks and Fusion.
Most of these services are available only with the expensive subscriptions. If
you don’t use any of these services, you still get an excellent all-in-one Python
distribution. Continuum, the company behind Anaconda, is an institutional
partner of Project Jupyter, which means that it supports the development
of Jupyter Notebook, a web application to run Python code in a browser.

You may be wondering which one to use (or whether to just use the standard “plain
vanilla” Python). There is no single correct answer to this question, since it will depend
on your needs, work habits, budget, and personal preferences. Personally, I tend
to use the standard Python on servers and Anaconda on the computers I use for
software development.
30
https://www.enthought.com/products/canopy/
31
http://winpython.github.io/
32
https://www.continuum.io/anaconda-overview

1.6 ADDITIONAL RESOURCES

• Interactive notebooks: Sharing the code. The free IPython notebook makes data
analysis easier to record, understand and reproduce. Helen Shen. Nature 515,
151–152 (06 November 2014) doi:10.1038/515151a
https://goo.gl/HfBJ12

• Python for feature film:
http://dgovil.com/blog/2016/11/30/python-for-feature-film/

• Alternative Python implementations:
https://www.python.org/download/alternatives/

• IPython: an interactive computing environment.
http://ipython.org/

• bpython: A fancy interface to the Python interpreter for Unix-like operating
systems:
https://www.bpython-interpreter.org

• Python history, a blog by Guido van Rossum:
http://python-history.blogspot.com
CHAPTER 2

First Steps with Python


CONTENTS
2.1 Installing Python
    2.1.1 Learn Python by Using It
    2.1.2 Install Python Locally
        Installing Anaconda
    2.1.3 Using Python Online
    2.1.4 Testing Python
    2.1.5 First Use
2.2 Interactive Mode
    2.2.1 Baby Steps
    2.2.2 Basic Input and Output
        Output: Print
        Input: input
    2.2.3 More on the Interactive Mode
    2.2.4 Mathematical Operations
        Division
    2.2.5 Exit from the Python Shell
2.3 Batch Mode
    2.3.1 Comments
    2.3.2 Indentation
2.4 Choosing an Editor
    2.4.1 Sublime Text
    2.4.2 Atom
    2.4.3 PyCharm
    2.4.4 Spyder IDE
    2.4.5 Final Words about Editors
2.5 Other Tools
2.6 Additional Resources
2.7 Self-Evaluation

The journey of a thousand miles begins with one step.

Lao Tzu
