REAL-WORLD PYTHON
A Hacker's Guide to Solving Problems with Code
by Lee Vaughan
San Francisco
REAL-WORLD PYTHON. Copyright © 2021 by Lee Vaughan.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any information storage or retrieval
system, without the prior written permission of the copyright owner and the publisher.
The following images are reproduced with permission: Figure 3-3 from istockphoto.com; Figure 5-1 courtesy
of Lowell Observatory Archives; Figures 5-2, 6-2, 7-6, 7-7, 8-18, and 11-2 courtesy of Wikimedia Commons;
Figures 7-2, 7-9, 7-17, 8-20, and 11-1 courtesy of NASA; Figure 8-1 photo by Evan Clark; Figure 8-4 photo by
author; Figure 9-5 from pixelsquid.com; Figure 11-9 photo by Hannah Vaughan
For information on distribution, translations, or bulk sales, please contact No Starch Press, Inc. directly:
No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1-415-863-9900; [email protected]
www.nostarch.com
No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other prod-
uct and company names mentioned herein may be the trademarks of their respective owners. Rather than use
a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial
fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The information in this book is distributed on an “As Is” basis, without warranty. While every precaution
has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any
liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or
indirectly by the information contained in it.
For my uncle, Kenneth P. Vaughan.
He brightened every room he entered.
About the Author
Lee Vaughan is a programmer, pop culture enthusiast, educator, and author
of Impractical Python Projects (No Starch Press, 2018). As an executive-level
scientist at ExxonMobil, he constructed and reviewed computer models,
developed and tested software, and trained geoscientists and engineers.
He wrote both Impractical Python Projects and Real-World Python to help
self-learners hone their Python skills and have fun doing it!
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
CONTENTS IN DETAIL
ACKNOWLEDGMENTS xvii
INTRODUCTION xix
Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
Why Python? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
What’s in This Book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx
Python Version, Platform, and IDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Installing Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii
Running Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiv
Using a Virtual Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxv
Onward! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxv
1
SAVING SHIPWRECKED SAILORS WITH BAYES’ RULE 1
Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Project #1: Search and Rescue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Installing the Python Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
The Bayes Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Playing the Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Challenge Project: Smarter Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Challenge Project: Finding the Best Strategy with MCS . . . . . . . . . . . . . . . . . . . . . . . . 25
Challenge Project: Calculating the Probability of Detection . . . . . . . . . . . . . . . . . . . . . . 25
2
ATTRIBUTING AUTHORSHIP WITH STYLOMETRY 27
Project #2: The Hound, The War, and The Lost World . . . . . . . . . . . . . . . . . . . . . . . . . 28
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Installing NLTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
The Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
The Stylometry Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Practice Project: Hunting the Hound with Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Practice Project: Punctuation Heatmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Challenge Project: Fixing Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3
SUMMARIZING SPEECHES WITH NATURAL
LANGUAGE PROCESSING 51
Project #3: I Have a Dream . . . to Summarize Speeches! . . . . . . . . . . . . . . . . . . . . . . 52
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Web Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
The “I Have a Dream” Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Project #4: Summarizing Speeches with gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Installing gensim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
The Make Your Bed Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Project #5: Summarizing Text with Word Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
The Word Cloud and PIL Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
The Word Cloud Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Fine-Tuning the Word Cloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Challenge Project: Game Night . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Challenge Project: Summarizing Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Challenge Project: Summarizing a Novel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Challenge Project: It’s Not Just What You Say,
It’s How You Say It! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4
SENDING SUPER-SECRET MESSAGES WITH A BOOK CIPHER 77
The One-Time Pad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
The Rebecca Cipher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Project #6: The Digital Key to Rebecca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
The Encryption Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Sending Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Practice Project: Charting the Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Practice Project: Sending Secrets the WWII Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5
FINDING PLUTO 95
Project #7: Replicating a Blink Comparator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
The Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
The Blink Comparator Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Using the Blink Comparator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Project #8: Detecting Astronomical Transients with Image Differencing . . . . . . . . . . . . 112
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
The Transient Detector Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Using the Transient Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Practice Project: Plotting the Orbital Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Practice Project: What’s the Difference? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Challenge Project: Counting Stars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6
WINNING THE MOON RACE WITH APOLLO 8 123
Understanding the Apollo 8 Mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
The Free Return Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
The Three-Body Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Project #9: To the Moon with Apollo 8! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Using the turtle Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
The Apollo 8 Free Return Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Running the Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Practice Project: Simulating a Search Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Practice Project: Start Me Up! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Practice Project: Shut Me Down! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Challenge Project: True-Scale Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Challenge Project: The Real Apollo 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7
SELECTING MARTIAN LANDING SITES 151
How to Land on Mars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
The MOLA Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Project #10: Selecting Martian Landing Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
The Site Selector Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Practice Project: Confirming That Drawings Become Part of an Image . . . . . . . . . . . . . 172
Practice Project: Extracting an Elevation Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Practice Project: Plotting in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Practice Project: Mixing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Challenge Project: Making It Three in a Row . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Challenge Project: Wrapping Rectangles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8
DETECTING DISTANT EXOPLANETS 177
Transit Photometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Project #11: Simulating an Exoplanet Transit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
The Transit Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Experimenting with Transit Photometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Project #12: Imaging Exoplanets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
The Pixelator Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Practice Project: Detecting Alien Megastructures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Practice Project: Detecting Asteroid Transits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Practice Project: Incorporating Limb Darkening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Practice Project: Detecting Starspots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Practice Project: Detecting an Alien Armada . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Practice Project: Detecting a Planet with a Moon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Practice Project: Measuring the Length of an Exoplanet’s Day . . . . . . . . . . . . . . . . . . . 201
Challenge Project: Generating a Dynamic Light Curve . . . . . . . . . . . . . . . . . . . . . . . . 202
9
IDENTIFYING FRIEND OR FOE 203
Detecting Faces in Photographs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Project #13: Programming a Robot Sentry Gun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
The Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Detecting Faces from a Video Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Practice Project: Blurring Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Challenge Project: Detecting Cat Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10
RESTRICTING ACCESS WITH FACE RECOGNITION 225
Recognizing Faces with Local Binary Pattern Histograms . . . . . . . . . . . . . . . . . . . . . . 226
The Face Recognition Flowchart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Extracting Local Binary Pattern Histograms . . . . . . . . . . . . . . . . . . . . . . . . . 228
Project #14: Restricting Access to the Alien Artifact . . . . . . . . . . . . . . . . . . . . . . . . . . 231
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Supporting Modules and Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
The Video Capture Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
The Face Trainer Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
The Face Predictor Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Challenge Project: Adding a Password and Video Capture . . . . . . . . . . . . . . . . . . . . 242
Challenge Project: Look-Alikes and Twins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Challenge Project: Time Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
11
CREATING AN INTERACTIVE ZOMBIE ESCAPE MAP 245
Project #15: Visualizing Population Density with a Choropleth Map . . . . . . . . . . . . . . 246
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
The Python Data Analysis Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
The bokeh and holoviews Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Installing pandas, bokeh, and holoviews . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Accessing the County, State, Unemployment, and Population Data . . . . . . . . . 250
Hacking holoviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
The Choropleth Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Planning the Escape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Challenge Project: Mapping US Population Change . . . . . . . . . . . . . . . . . . . . . . . . . 266
12
ARE WE LIVING IN A COMPUTER SIMULATION? 269
Project #16: Life, the Universe, and Yertle’s Pond . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
The Pond Simulation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Implications of the Pond Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Measuring the Cost of Crossing the Lattice . . . . . . . . . . . . . . . . . . . . . . . . . 275
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
The Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Moving On . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Challenge Project: Finding a Safe Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Challenge Project: Here Comes the Sun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Challenge Project: Seeing Through a Dog’s Eyes . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Challenge Project: Customized Word Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Challenge Project: Simplifying a Celebration Slideshow . . . . . . . . . . . . . . . . . . . . . . . 281
Challenge Project: What a Tangled Web We Weave . . . . . . . . . . . . . . . . . . . . . . . . 281
Challenge Project: Go Tell It on the Mountain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
APPENDIX
PRACTICE PROJECT SOLUTIONS 283
Chapter 2: Attributing Authorship with Stylometry . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Hunting the Hound with Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Punctuation Heatmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Chapter 4: Sending Super-Secret Messages with a Book Cipher . . . . . . . . . . . . . . . . . 285
Charting the Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Sending Secrets the WWII Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
Chapter 5: Finding Pluto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Plotting the Orbital Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
What’s the Difference? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Chapter 6: Winning the Moon Race with Apollo 8 . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Simulating a Search Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Start Me Up! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Shut Me Down! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Chapter 7: Selecting Martian Landing Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Confirming That Drawings Become Part of an Image . . . . . . . . . . . . . . . . . . 298
Extracting an Elevation Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
Plotting in 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Mixing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Chapter 8: Detecting Distant Exoplanets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Detecting Alien Megastructures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Detecting Asteroid Transits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Incorporating Limb Darkening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Detecting an Alien Armada . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Detecting a Planet with a Moon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Measuring the Length of an Exoplanet’s Day . . . . . . . . . . . . . . . . . . . . . . . . 311
Chapter 9: Identifying Friend or Foe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Blurring Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Chapter 10: Restricting Access with Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . 312
Challenge Project: Adding a Password and Video Capture . . . . . . . . . . . . . . 312
INDEX 315
ACKNOWLEDGMENTS
Why Python?
Python is a high-level, interpreted, general-purpose programming language.
It’s free, highly interactive, and portable across all major platforms and micro-
controllers such as the Raspberry Pi. Python supports both functional and
object-oriented programming and can interact with code written in many
other programming languages, such as C++.
Because Python is accessible to beginners and useful to experts, it has
penetrated schools, universities, large corporations, financial institutions,
and most, if not all, fields of science. As a result, it’s now the most popular
language for machine learning, data science, and artificial intelligence
applications.
cool display for advertising or promotional material. Gain experience
with BeautifulSoup, Requests, regex, NLTK, Collections, wordcloud, and
matplotlib.
Chapter 4: Sending Super-Secret Messages with a Book Cipher Share
unbreakable ciphers with your friends by digitally reproducing the one-
time pad approach used in Ken Follett's best-selling spy novel, The Key to
Rebecca. Gain experience with the Collections module.
Chapter 5: Finding Pluto Reproduce the blink comparator device
used by Clyde Tombaugh to discover Pluto in 1930. Then use modern
computer vision techniques to automatically find and track subtle tran-
sients, such as comets and asteroids, moving against a starfield. Gain
experience with OpenCV and NumPy.
Chapter 6: Winning the Moon Race with Apollo 8 Take the gamble
and help America win the moon race with Apollo 8. Plot and execute
the clever free return flight path that convinced NASA to go to the moon
a year early and effectively killed the Soviet space program. Gain expe-
rience using the turtle module.
Chapter 7: Selecting Martian Landing Sites Scope out potential land-
ing sites for a Mars lander based on realistic mission objectives. Display
the candidate sites on a Mars map, along with a summary of site statis-
tics. Gain experience with OpenCV, the Python Imaging Library, NumPy,
and tkinter.
Chapter 8: Detecting Distant Exoplanets Simulate an exoplanet’s
passing before its sun, plot the resulting changes in relative brightness,
and estimate the diameter of the planet. Finish by simulating the direct
observation of an exoplanet by the new James Webb Space Telescope,
including estimating the length of the planet’s day. Use OpenCV, NumPy,
and matplotlib.
Chapter 9: Identifying Friend or Foe Program a robot sentry gun
to visually distinguish between Space Force Marines and evil mutants.
Gain experience with OpenCV, NumPy, playsound, pyttsx3, and datetime.
Chapter 10: Restricting Access with Face Recognition Restrict access
to a secure lab using face recognition. Use OpenCV, NumPy, playsound,
pyttsx3, and datetime.
Chapter 11: Creating an Interactive Zombie Escape Map Build a pop-
ulation density map to help the survivors in the TV show The Walking Dead
escape Atlanta for the safety of the American West. Gain experience
with pandas, bokeh, holoviews, and webbrowser.
Chapter 12: Are We Living in a Computer Simulation? Identify a way
for simulated beings—perhaps us—to find evidence that they’re living
in a computer simulation. Use turtle, statistics, and perf_counter.
Each chapter ends with at least one practice or challenge project. You
can find solutions to the practice projects in the appendix or online. These
aren’t the only solutions, or necessarily the best ones; you may come up with
better ones on your own.
When it comes to the challenge projects, however, you’re on your own.
It’s sink or swim, which is a great way to learn! My hope is that this book
motivates you to create new projects, so think of the challenge projects as
seeds for the fertile ground of your own imagination.
You can download all of the book's code, including solutions to the
practice projects, from the book's website at https://nostarch.com/real-world-python/.
You'll also find the errata sheet there, along with any other updates.
It’s almost impossible to write a book like this without some initial errors.
If you see a problem, please pass it on to the publisher at [email protected].
We’ll add any necessary corrections to the errata and include the fix in future
printings of the book, and you will gain eternal glory.
Installing Python
You can choose to install Python directly on your machine or through a
distribution. To install directly, find the installation instructions for your
operating system at https://www.python.org/downloads/. Linux and macOS
machines usually come with Python preinstalled, but you may want to upgrade
this installation. With each new Python release, some features are added and
some are deprecated, so I recommend upgrading if your version predates
Python v3.6.
The download button on the Python site (Figure 1) may install 32-bit
Python by default.
Figure 1: Downloads page for Python.org, with the “easy button” for the Windows platform
If you want the 64-bit version, scroll down to the listing of specific
releases (Figure 2) and click the link with the same version number.
Clicking the specific release will take you to the screen shown in
Figure 3. From here, click the 64-bit executable installer, which will launch
an installation wizard. Follow the wizard directions and take the default
suggestions.
Some of the projects in this book call for nonstandard packages that
you’ll need to install individually. This isn’t difficult, but you can make
things easier by installing a Python distribution that efficiently loads and
manages hundreds of Python packages. Think of this as one-stop shopping.
The package managers in these distributions will automatically find and
download the latest version of a package, including all of its dependencies.
Anaconda is a popular free distribution of Python provided by
Continuum Analytics. You can download it from https://www.anaconda.com/.
Another is Enthought Canopy, though only the basic version is free. You
can find it at https://www.enthought.com/product/canopy/. Whether you install
Python and its packages individually or through a distribution, you should
encounter no problems working through the projects in the book.
Running Python
After installation, Python should show up in your operating system’s list of
applications. When you launch it, the shell window should appear (shown
in the background of Figure 4). You can use this interactive environment
to run and test code snippets. But to write larger programs, you’ll use a text
editor, which lets you save your code, as shown in Figure 4 (foreground).
Figure 4: The native Python shell window (background) and text editor (foreground)
To create a new file in the IDLE text editor, click File▸New File.
To open an existing file, click File▸Open or File▸Recent Files. From
here, you can run your code by clicking Run▸Run Module or by pressing
F5 after clicking in the editor window. Note that your environment may
look different from Figure 4 if you chose to use a package manager like
Anaconda or an IDE like PyCharm.
You can also start a Python program by typing the program name in
PowerShell or Terminal. You’ll need to be in the directory where your Python
program is located. For example, if you didn’t launch the Windows PowerShell
from the proper directory, you’ll need to change the directory path using
the cd command (Figure 5).
Figure 5: Changing directories and running a Python program in the Windows PowerShell
Onward!
Many of the projects in this book rely on statistical and scientific concepts
that are hundreds of years old but impractical to apply by hand. But with
the introduction of the personal computer in 1975, our ability to store, pro-
cess, and share information has increased by many orders of magnitude.
In the 200,000-year history of modern humans, only those of us living
in the last 45 years have had the privilege of using this magical device and
realizing dreams long out of reach. To quote Shakespeare, “We few. We
happy few.”
Let’s make the most of the opportunity. In the pages that follow, you’ll
easily accomplish tasks that frustrated past geniuses. You’ll scratch the surface
of some of the amazing feats we’ve recently achieved. And you might even
start to imagine discoveries yet to come.
1
SAVING SHIPWRECKED SAILORS WITH BAYES' RULE
Bayes’ Rule
Bayes’ rule helps investigators determine the probability that something is
true given new evidence. As the great French mathematician Laplace put it,
“The probability of a cause—given an event—is proportional to the prob-
ability of the event—given its cause.” The basic formula is
P(A|B) = P(B|A) × P(A) / P(B)

In terms of the cancer-test example, the formula reads:

Probability of cancer given a positive test =
(Probability of a positive test among cancer patients × Probability of having cancer) / Probability of a positive test
Figure 1-1: Bayes' rule with terms defined and related to the cancer test example. In the figure, the posterior P(A|B) is the probability being estimated (the probability of cancer given a positive test), the likelihood P(B|A) is the probability of seeing the new data given the initial hypothesis (a positive test given cancer), the prior P(A) is the probability of the hypothesis with no new data (cancer with no test results), and the marginal likelihood P(B) is the overall probability of seeing the new data (anyone getting a positive test).
To illustrate further, let’s consider a woman who has lost her reading
glasses in her house. The last time she remembers wearing them, she was in
her study. She goes there and looks around. She doesn’t see her glasses, but
she does see a teacup and remembers that she went to the kitchen. At this
point, she must make a choice: search the study more thoroughly or leave
and check the kitchen. She decides to go to the kitchen. She has unknow-
ingly made a Bayesian decision.
She went to the study first because she felt it offered the highest prob-
ability for success. In Bayesian terms, this initial probability of finding the
glasses in the study is called the prior. After a cursory search, she changed
her decision based on two new bits of information: she did not easily find
the glasses, and she saw the teacup. This represents a Bayesian update, in
which a new posterior estimate (P(A/B) in Figure 1-1) is calculated as more
evidence becomes available.
Let’s imagine that the woman decided to use Bayes’ rule for her search.
She would assign actual probabilities both to the likelihood of the glasses
being in either the study or the kitchen and to the effectiveness of her searches
in the two rooms. Rather than intuitive hunches, her decisions are now
grounded in mathematics that can be continuously updated if future
searches fail.
Figure 1-2 illustrates the woman’s search for her glasses with these prob-
abilities assigned.
Figure 1-2: Initial probabilities for the location of the glasses and search effectiveness
(left) versus updated target probabilities for the glasses (right). The recoverable values
from the two floor plans are: Study 85% / 95%, updated to 22%; Kitchen 10% / 0%,
updated to 52%; Bath, Lounge, and Dining 1% / 0% each, updated to 5% each.
The left diagram represents the initial situation; the right diagram is
updated with Bayes’ rule. Initially, let’s say there was an 85 percent chance
of finding the glasses in the study and a 10 percent chance that the glasses
are in the kitchen. Other possible rooms are given 1 percent because Bayes’
rule can’t update a target probability of zero (plus there’s always a small
chance the woman left them in one of the other rooms).
Each number after a slash in the left diagram represents the search
effectiveness probability (SEP). The SEP is an estimate of how effectively you’ve
searched an area. Because the woman has searched only in the study at
this point, this value is zero for all other rooms. After the Bayesian update
(the discovery of the teacup), she can recalculate the probabilities based
on the search results, shown on the right. The kitchen is now the most likely
place to look, but the probability for the other rooms increases as well.
Human intuition tells us that if something isn’t where we think it is, the
odds that it is someplace else go up. Bayes’ rule takes this into account, and
thus the probability that the glasses are in other rooms increases. But this
can happen only if there was a chance of them being in the other room in
the first place.
The formula used for calculating the probability that the glasses are in
a given room, given the search effectiveness, is
P(G|E) = P(E|G) × P_prior(G) / Σ P(E|G′) × P_prior(G′)

where G is the probability that the glasses are in a room, E is the search effectiveness, and P_prior is the prior, or initial, probability estimate before receiving the new evidence.
You can obtain the updated probability that the glasses are in the study by
inserting the target and search effectiveness probabilities into the equation
as follows:

(0.85 × (1 − 0.95)) / (0.85 × (1 − 0.95) + 0.1 × (1 − 0) + 0.01 × (1 − 0) + 0.01 × (1 − 0) + 0.01 × (1 − 0) + 0.01 × (1 − 0) + 0.01 × (1 − 0)) ≈ 0.22
As you can see, the simple math behind Bayes’ rule can quickly get
tedious if you do it by hand. Fortunately for us, we live in the wondrous
age of computers, so we can let Python handle the boring stuff!
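As a quick illustration (this snippet is not part of the book's bayes.py program), the update above takes only a few lines of Python, using the seven target probabilities and search effectiveness values from the worked equation:

# Initial target probabilities (study, kitchen, and five other rooms)
priors = [0.85, 0.10, 0.01, 0.01, 0.01, 0.01, 0.01]
# Search effectiveness so far: the study is 95 percent searched, the rest untouched
sep = [0.95, 0, 0, 0, 0, 0, 0]

unsearched = [p * (1 - e) for p, e in zip(priors, sep)]   # P(E|G) x P_prior(G)
total = sum(unsearched)                                   # the denominator
posteriors = [value / total for value in unsearched]
print([round(p, 2) for p in posteriors])
# [0.22, 0.52, 0.05, 0.05, 0.05, 0.05, 0.05]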
THE OBJECTIVE
Create a search and rescue game that uses Bayes’ rule to inform player choices on how
to conduct a search.
The Strategy
Searching for the sailor is like looking for the lost glasses in our previous
example. You’ll start with initial target probabilities for the sailor’s location
and update them for the search results. If you achieve an effective search of
an area but find nothing, the probability that the sailor is in another area
will increase.
But just as in real life, there are two ways things could go wrong: you
thoroughly search an area but still miss the sailor, or your search goes
poorly, wasting a day’s effort. To equate this to search effectiveness scores,
in the first case, you might get an SEP of 0.85, but the sailor is in the
remaining 15 percent of the area not searched. In the second case, your
SEP is 0.2, and you’ve left 80 percent of the area unsearched!
You can see the dilemma real commanders face. Do you go with your gut
and ignore Bayes? Do you stick with the pure, cold logic of Bayes because
you believe it’s the best answer? Or do you act expediently and protect your
career and reputation by going with Bayes even when you doubt it?
To aid the player, you’ll use the OpenCV library to build an interface for
working with the program. Although the interface can be something simple,
like a menu built in the shell, you’ll also want a map of the cape and the
search areas. You’ll use this map to display the sailor’s last known position
and his position when found. The OpenCV library is an excellent choice for
this game since it lets you display images and add drawings and text.
images like humans. OpenCV began as an Intel Research initiative in 1999
and is now maintained by the OpenCV Foundation, a nonprofit foundation
which provides the software for free.
OpenCV is written in C++, but there are bindings in other languages,
such as Python and Java. Although aimed primarily at real-time computer
vision applications, OpenCV also includes common image manipulation
tools such as those found in the Python Imaging Library. As of this writing,
the current version is OpenCV 4.1.
OpenCV requires both the Numerical Python (NumPy) and SciPy pack-
ages to perform numerical and scientific computing in Python. OpenCV
treats images as three-dimensional NumPy arrays (Figure 1-4). This allows for
maximum interoperability with other Python scientific libraries.
Figure 1-4: An image represented as a three-dimensional NumPy array of stacked red, green, and blue color channels
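Once OpenCV is installed (see below), you can see this relationship for yourself with a generic snippet like the following (not from the book; substitute any image file you have on disk):

import cv2 as cv

img = cv.imread('some_image.png')   # hypothetical filename; use any image on disk
print(type(img))     # <class 'numpy.ndarray'>
print(img.shape)     # (rows, columns, channels); 3 channels for a color image
print(img[0, 0])     # one pixel as its [blue, green, red] intensity values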
$ python -m pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
If you have both Python 2 and 3 installed, use python3 in place of python.
To verify that NumPy has been installed and is available for OpenCV,
open a Python shell and enter the following:

>>> import numpy

To check that everything loaded properly, enter the following in the shell:

>>> import cv2
No error means you’re good to go! If you get an error, read the trouble-
shooting list at https://pypi.org/project/opencv-python/.
Importing Modules
Listing 1-1 starts the bayes.py program by importing the required modules
and assigning some constants. We’ll look at what these modules do as we
implement them in the code.
import sys
import random
import itertools
import numpy as np
import cv2 as cv

MAP_FILE = 'cape_python.png'
Listing 1-1: Importing modules and assigning constants used in the bayes.py program
You’ll draw the search areas on the image as rectangles. OpenCV will
define each rectangle by the pixel number at the corner points, so assign a
variable to hold these four points as a tuple. The required order is upper-
left x, upper-left y, lower-right x, and lower-right y. Use SA in the variable
name to represent “search area.”
bayes.py, part 2 class Search():
    """Bayesian Search & Rescue game with 3 search areas."""

    def __init__(self, name):
        self.name = name
        self.area_actual = 0
        self.sailor_actual = [0, 0]  # As "local" coords within search area
        self.p1 = 0.2
        self.p2 = 0.5
        self.p3 = 0.3
        self.sep1 = 0
        self.sep2 = 0
        self.sep3 = 0
Start by defining a class called Search. According to PEP8, the first letter
of a class name should be capitalized.
Next, define a method that sets up the initial attribute values for your
object. In OOP, an attribute is a named value associated with an object.
If your object is a person, an attribute might be their weight or eye color.
Methods are attributes that also happen to be functions, which are passed
a reference to their instance when they run. The __init__() method is a
special built-in function that Python automatically invokes as soon as a new
object is created. It binds the attributes of each newly created instance of a
class. In this case, you pass it two arguments: self and the name you want to
use for your object.
The self parameter is a reference to the instance of the class that is
being created, or that a method was invoked on, technically referred to as a
context instance. For example, if you create a battleship named the Missouri,
then for that object, self becomes Missouri, and you can call a method for
that object, like one for firing the big guns, with dot notation:
Missouri.fire_big_guns(). By giving objects unique names when they are instantiated, the
scope of each object’s attributes is kept separate from all others. This way,
damage taken by one battleship isn’t shared with the rest of the fleet.
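As a small illustration of these ideas (the Battleship class below is a toy example, not part of the game code):

class Battleship():
    """Toy class to illustrate __init__(), self, and dot notation."""
    def __init__(self, name):
        self.name = name     # attribute bound to this particular instance
        self.damage = 0

    def fire_big_guns(self):
        print(f"{self.name} opens fire!")

missouri = Battleship('Missouri')
missouri.fire_big_guns()   # prints: Missouri opens fire!
missouri.damage = 3        # affects only this ship, not the rest of the fleet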
Drawing the Map
Inside the Search class, you’ll use functionality within OpenCV to create a
method that displays the base map. This map will include the search areas,
a scale bar, and the sailor’s last known position (Figure 1-6).
Listing 1-3 defines the draw_map() method that displays the initial map.
Define the draw_map() method with self and the sailor’s last known
coordinates (last_known) as its two parameters. Then use OpenCV’s line()
method to draw a scale bar. Pass it the base map image, a tuple of the
left and right (x, y) coordinates, a line color tuple, and a line width as
arguments.
Use the putText() method to annotate the scale bar. Pass it the attribute
for the base map image and then the actual text, followed by a tuple of the
coordinates of the bottom-left corner of the text. Then add the font name,
font scale, and color tuple.
Now draw a rectangle for the first search area. As usual, pass the base
map image, then the variables representing the four corners of the box,
and finally a color tuple and a line weight. Use putText() again to place the
search area number just inside the upper-left corner. Repeat these steps
for search areas 2 and 3.
Use putText() to post a + at the sailor's last known position. Note that
the symbol is red, but the color tuple reads (0, 0, 255), instead of (255, 0, 0).
This is because OpenCV uses a Blue-Green-Red (BGR) color format, not
the more common Red-Green-Blue (RGB) format.
Continue by placing text for a legend that describes the symbols for
the last known position and actual position, which should display when a
player’s search finds the sailor. Use blue for the actual position marker.
Complete the method by showing the base map, using OpenCV’s
imshow() method. Pass it a title for the window and the image.
To avoid the base map and interpreter windows interfering with each
other as much as possible, force the base map to display in the upper-right
corner of your monitor (you may need to adjust the coordinates for your
machine). Use OpenCV’s moveWindow() method and pass it the name of the
window, 'Search Area', and the coordinates for the top-left corner.
Finish by using the waitKey() method, which introduces a delay of
n milliseconds while rendering images to windows. Pass it 500, for 500 milli-
seconds. This should result in the game menu appearing a half-second after
the base map.
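As a rough sketch of how these calls fit together inside the Search class (this assumes import cv2 as cv, a self.img base-map attribute, and the SA1_CORNERS constant described earlier; the pixel coordinates, labels, and fonts below are placeholders rather than the book's exact values):

    def draw_map(self, last_known):
        """Display the base map with scale bar, search areas, and last known position."""
        # Scale bar (pixel coordinates here are illustrative placeholders)
        cv.line(self.img, (20, 370), (70, 370), (0, 0, 0), 2)
        cv.putText(self.img, '0', (8, 370), cv.FONT_HERSHEY_PLAIN, 1, (0, 0, 0))
        cv.putText(self.img, '50 Nautical Miles', (71, 370),
                   cv.FONT_HERSHEY_PLAIN, 1, (0, 0, 0))

        # Search area 1 rectangle and label; areas 2 and 3 follow the same pattern
        cv.rectangle(self.img, (SA1_CORNERS[0], SA1_CORNERS[1]),
                     (SA1_CORNERS[2], SA1_CORNERS[3]), (0, 0, 0), 1)
        cv.putText(self.img, '1', (SA1_CORNERS[0] + 3, SA1_CORNERS[1] + 15),
                   cv.FONT_HERSHEY_PLAIN, 1, (0, 0, 0))

        # Last known position and legend; OpenCV colors are (Blue, Green, Red)
        cv.putText(self.img, '+', last_known, cv.FONT_HERSHEY_PLAIN, 1, (0, 0, 255))
        cv.putText(self.img, '+ = Last Known Position', (274, 355),
                   cv.FONT_HERSHEY_PLAIN, 1, (0, 0, 255))
        cv.putText(self.img, '* = Actual Position', (275, 370),
                   cv.FONT_HERSHEY_PLAIN, 1, (255, 0, 0))

        cv.imshow('Search Area', self.img)       # show the map in a named window
        cv.moveWindow('Search Area', 750, 10)    # nudge it toward the upper right
        cv.waitKey(500)                          # give the window 500 ms to render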
bayes.py, part 4
    def sailor_final_location(self, num_search_areas):
        """Return the actual x,y location of the missing sailor."""
        # Find sailor coordinates with respect to any Search Area subarray.
        self.sailor_actual[0] = np.random.choice(self.sa1.shape[1], 1)
        self.sailor_actual[1] = np.random.choice(self.sa1.shape[0], 1)

        # Pick a search area with a triangular distribution (low and high endpoints).
        area = int(random.triangular(1, num_search_areas + 1))

        if area == 1:
            x = self.sailor_actual[0] + SA1_CORNERS[0]
            y = self.sailor_actual[1] + SA1_CORNERS[1]
            self.area_actual = 1
        elif area == 2:
            x = self.sailor_actual[0] + SA2_CORNERS[0]
            y = self.sailor_actual[1] + SA2_CORNERS[1]
            self.area_actual = 2
        elif area == 3:
            x = self.sailor_actual[0] + SA3_CORNERS[0]
            y = self.sailor_actual[1] + SA3_CORNERS[1]
            self.area_actual = 3
        return x, y
Listing 1-4: Defining a method to randomly choose the sailor’s actual location
>>> print(np.shape(self.sa1))
(50, 50, 3)
The shape attribute for a NumPy array must be a tuple with as many elements
as dimensions in the array. And remember that, for an array in OpenCV, the
order of elements in the tuple is rows, columns, and then channels.
Each of the existing search areas is a three-dimensional array 50×50 pixels
in size. So, internal coordinates for both x and y will range from 0 to 49.
Selecting [0] with random.choice() means that rows are used, and the final
argument, 1, selects a single element. Selecting [1] chooses from columns.
The coordinates generated by random.choice() will range from 0 to 49.
To use these with the full base map image, you first need to pick a search
area. Do this with the random module, which you imported at the start of
the program. According to the SAROPS output, the sailor is most likely in
area 2, followed by area 3. Since these initial target probabilities are guesses
that won’t correspond directly to reality, use a triangular distribution to
choose the area containing the sailor. The arguments are the low and high
endpoints. If a final mode argument is not provided, the mode defaults to the midpoint between the low and high values.
NOTE In real life, the sailor would drift along, and the odds of his moving into area 3 would
increase with each search. I chose to use a static location, however, to make the logic
behind Bayes’ rule as clear as possible. As a result, this scenario behaves more like a
search for a sunken submarine.
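To get a feel for how a triangular distribution with low and high endpoints of 1 and 4 behaves, you can run a quick throwaway experiment in the shell (this is not part of the game code):

import random
from collections import Counter

picks = [int(random.triangular(1, 4)) for _ in range(100_000)]
print(Counter(picks))
# The default mode is the midpoint (2.5), so area 2 turns up most often
# (roughly 56 percent of draws); areas 1 and 3 each get roughly 22 percent.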
Listing 1-5: Defining methods to randomly choose search effectiveness and conduct search
Start by defining the search effectiveness method. The only parameter
needed is self. For each of the search effectiveness attributes, such as E1,
randomly choose a value between 0.2 and 0.9. These are arbitrary values
that mean you will always search at least 20 percent of the area but never
more than 90 percent.
You could argue that the search effectiveness attributes for the three
search areas are dependent. Fog, for example, might affect all three areas,
yielding uniformly poor results. On the other hand, some of your helicop-
ters may have infrared imaging equipment and would fare better. At any
rate, making these independent, as you’ve done here, makes for a more
dynamic simulation.
Next, define a method for conducting a search. Necessary parameters
are the object itself, the area number (chosen by the user), the subarray for
the chosen area, and the randomly chosen search effectiveness value.
You’ll need to generate a list of all the coordinates within a given search
area. Name a variable local_y_range and assign it a range based on the first
index from the array shape tuple, which represents rows. Repeat for the
x_range value.
To generate the list of all coordinates in the search area, use the
itertools module. This module is a group of functions in the Python
Standard Library that create iterators for efficient looping. The product()
function returns tuples of all the permutations-with-repetition for a given
sequence. In this case, you’re finding all the possible ways to combine x and y
in the search area. To see it in action, type the following in the shell:
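(The ranges below are small, made-up examples; in the program, x_range and y_range cover the 50×50 search area.)

>>> import itertools
>>> x_range = [0, 1, 2]
>>> y_range = [0, 1]
>>> coords = list(itertools.product(x_range, y_range))
>>> coords
[(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]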
As you can see, the coords list contains every possible paired combination
of the elements in the x_range and y_range lists.
Next, shuffle the list of coordinates. This is so you won’t keep searching
the same end of the list with each search event. In the next line, use index
slicing to trim the list based on the search effectiveness probability. For
example, a poor search effectiveness of 0.3 means that only one-third of
the possible locations in an area are included in the list. As you’ll check the
sailor’s actual location against this list, you’ll effectively leave two-thirds of
the area “unsearched.”
Assign a local variable, loc_actual, to hold the sailor's actual location.
Then use a conditional to check that the sailor has been found. If the user
chose the correct search area and the shuffled and trimmed coords list con-
tains the sailor’s (x, y) location, return a string stating the sailor has been
found, along with the coords list. Otherwise, return a string stating the sailor
has not been found and the coords list.
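Putting those steps together, a conduct_search() method along these lines might look like the following sketch (attribute and variable names follow the description above, but the details are illustrative rather than the book's exact listing):

    def conduct_search(self, area_num, area_array, effectiveness_prob):
        """Return the search result and a list of searched coordinates."""
        local_y_range = range(area_array.shape[0])
        local_x_range = range(area_array.shape[1])
        coords = list(itertools.product(local_x_range, local_y_range))
        random.shuffle(coords)
        # Keep only the fraction of locations actually covered by this search.
        coords = coords[:int(len(coords) * effectiveness_prob)]
        # Local (within-area) coordinates of the sailor, as plain integers.
        loc_actual = (int(self.sailor_actual[0]), int(self.sailor_actual[1]))
        if area_num == self.area_actual and loc_actual in coords:
            return 'Found in Area {}.'.format(area_num), coords
        return 'Not Found', coords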
def draw_menu(search_num):
    """Print menu of choices for conducting area searches."""
    print('\nSearch {}'.format(search_num))
    print(
        """
Choose next areas to search:
0 - Quit
1 - Search Area 1 twice
2 - Search Area 2 twice
3 - Search Area 3 twice
4 - Search Areas 1 & 2
5 - Search Areas 1 & 3
6 - Search Areas 2 & 3
7 - Start Over
"""
    )
Listing 1-6: Defining ways to apply Bayes’ rule and draw a menu in the Python shell
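A sketch of the Bayes' rule portion of this step, using the p1–p3 and sep1–sep3 attributes from Listing 1-2 and the update formula from the start of the chapter (the method name and exact layout here are assumptions; this method belongs in the Search class):

    def revise_target_probs(self):
        """Update the three area target probabilities based on search effectiveness."""
        denom = (self.p1 * (1 - self.sep1) + self.p2 * (1 - self.sep2)
                 + self.p3 * (1 - self.sep3))
        self.p1 = self.p1 * (1 - self.sep1) / denom
        self.p2 = self.p2 * (1 - self.sep2) / denom
        self.p3 = self.p3 * (1 - self.sep3) / denom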
Listing 1-7: Defining the start of the main() function, used to run the program
        if choice == "0":
            sys.exit()

        # ...elif blocks handling menu choices 1 through 7 (omitted here)...

        else:
            print("\nSorry, but that isn't a valid choice.", file=sys.stderr)
            continue
Listing 1-8: Using a loop to evaluate menu choices and run the game
Start a while loop that will run until the user chooses to exit. Immediately
use dot notation to call the method that calculates the effectiveness of the
search. Then call the function that displays the game menu and pass it
the search number. Finish the preparatory stage by asking the user to make
a choice, using the input() function.
The player’s choice will be evaluated using a series of conditional state-
ments. If they choose 0, exit the game. Exiting uses the sys module you
imported at the beginning of the program.
If the player chooses 1, 2, or 3, it means they want to commit both
search teams to the area with the corresponding number. You’ll need to call
the conduct_search() method twice to generate two sets of results and coor-
dinates. The tricky part here is determining the overall SEP, since each
search has its own SEP. To do this, add the two coords lists together and con-
vert the result to a set to remove any duplicates. Get the length of the set
and then divide it by the number of pixels in the 50×50 search area. Since
you didn’t search the other areas, set their SEPs to 0.
Repeat and tailor the previous code for search areas 2 and 3. Use an
elif statement since only one menu choice is valid per loop. This is more
efficient than using additional if statements, as all elif statements below a
true response will be skipped.
If the player chooses a 4, 5, or 6, it means they want to divide their teams
between two areas. In this case, there's no need to recalculate the SEP.
If the player finds the sailor and wants to play again or just wants to
restart, call the main() function. This will reset the game and clear the
map.
If the player makes a nonvalid choice, like “Bob”, let them know with
a message and then use continue to skip back to the start of the loop and
request the player’s choice again.
if __name__ == '__main__':
    main()
Figure 1-8: Base map image for a successful search result
After search 2, with only one search left, the target probabilities are so
similar they provide little guidance for where to search next. In this case,
it’s best to divide your searches between two areas and hope for the best.
Summary
In this chapter, you learned about Bayes’ rule, a simple statistical theorem
with broad applications in our modern world. You wrote a program that
used the rule to take new information—in the form of estimates of search
effectiveness—and update the probability of finding a lost sailor in each
area being searched.
You also loaded and used multiple scientific packages, like NumPy and
OpenCV, that you’ll implement throughout the book. And you applied the
useful itertools, sys, and random modules from the Python Standard Library.
Further Reading
The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code,
Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries
of Controversy (Yale University Press, 2011), by Sharon Bertsch McGrayne,
recounts the discovery and controversial history of Bayes’ rule. The appendix
includes several example applications of Bayes’ rule, one of which inspired
the missing-sailor scenario used in this chapter.
A major source of documentation for NumPy is https://docs.scipy.org/doc/.
Challenge Project: Finding the Best Strategy with MCS
Monte Carlo simulation (MCS) uses repeated random sampling to predict
different outcomes under a specified range of conditions. Create a version
of bayes.py that automatically chooses menu items and keeps track of thou-
sands of results, allowing you to determine the most successful search strat-
egy. For example, have the program choose menu item 1, 2, or 3 based on
the highest Bayesian target probability and then record the search number
when the sailor is found. Repeat this procedure 10,000 times and take the
average of all the search numbers. Then loop again, choosing from menu
item 4, 5, or 6 based on the highest combined target probability. Compare
the final averages. Is it better to double up your searches in a single area or
split them between two areas?
New Planned Search Effectiveness and Target Probabilities (P) for Search 2:
E1 = 0.509, E2 = 0.826, E3 = 0.686
P1 = 0.168, P2 = 0.520, P3 = 0.312
Search 2
0 - Quit
7 - Start Over
Choice:
To combine PoD when searching the same area twice, use this formula:
1 − (1 − PoD)²
Otherwise, just sum the probabilities.
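For example, if a single search of an area has a PoD of 0.3, doubling up on that area gives a combined PoD of 1 − (1 − 0.3)² = 0.51, whereas splitting the teams between two areas with PoDs of 0.3 and 0.2 gives 0.3 + 0.2 = 0.5 overall.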
When calculating the actual SEP for an area, constrain it somewhat to
the expected value. This considers the general accuracy of weather reports
made only a day in advance. Replace the random.uniform() method with a
distribution, such as triangular, built around the planned SEP value. For a
list of available distribution types, see https://docs.python.org/3/library/random
.html#real-valued-distributions. Of course, the actual SEP for an unsearched
area will always be zero.
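As a minimal sketch, assuming a planned SEP stored in planned_sep and an arbitrary spread of ±0.2, the draw might look like this:

import random

planned_sep = 0.8  # hypothetical forecast value
low, high = planned_sep - 0.2, planned_sep + 0.2
actual_sep = random.triangular(low, high, planned_sep)  # mode at the plan
actual_sep = min(actual_sep, 1.0)  # an SEP can't exceed 1.0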
How does incorporating planned SEPs affect gameplay? Is it easier or
harder to win? Is it harder to grasp how Bayes’ rule is being applied? If you
oversaw a real search, how would you deal with an area with a high target
probability but a low planned SEP due to rough seas? Would you search
anyway, call off the search, or move the search to an area with a low target
probability but better weather?
2
ATTRIBUTING AUTHORSHIP
WITH STYLOMETRY
Project #2: The Hound, The War, and The Lost World
Sir Arthur Conan Doyle (1859–1930) is best known for the Sherlock Holmes
stories, considered milestones in the field of crime fiction. H. G. Wells
(1866–1946) is famous for several groundbreaking science fiction novels
including The War of The Worlds, The Time Machine, The Invisible Man, and
The Island of Dr. Moreau.
In 1912, the Strand Magazine published The Lost World, a serialized version
of a science fiction novel. It told the story of an Amazon basin expedition,
led by zoology professor George Edward Challenger, that encountered living
dinosaurs and a vicious tribe of ape-like creatures.
Although the author of the novel is known, for this project, let’s pretend
it’s in dispute and it’s your job to solve the mystery. Experts have narrowed
the field down to two authors, Doyle and Wells. Wells is slightly favored
because The Lost World is a work of science fiction, which is his purview. It
also includes brutish troglodytes redolent of the morlocks in his 1895 work
The Time Machine. Doyle, on the other hand, is known for detective stories
and historical fiction.
THE OBJECTIVE
Write a Python program that uses stylometry to determine whether Sir Arthur Conan
Doyle or H. G. Wells wrote the novel The Lost World.
The Strategy
The science of natural language processing (NLP) deals with the interactions
between the precise and structured language of computers and the nuanced,
frequently ambiguous “natural” language used by humans. Example uses
for NLP include machine translations, spam detection, comprehension of
search engine questions, and predictive text recognition for cell phone users.
The most common NLP tests for authorship analyze the following fea-
tures of a text:
Word length A frequency distribution plot of the length of words
in a document
Stop words A frequency distribution plot of stop words (short,
noncontextual function words like the, but, and if )
Parts of speech A frequency distribution plot of words based on their
syntactic functions (such as nouns, pronouns, verbs, adverbs, adjectives,
and so on)
Most common words A comparison of the most commonly used
words in a text
Jaccard similarity A statistic used for gauging the similarity and
diversity of a sample set
If Doyle and Wells have distinctive writing styles, these five tests should
be enough to distinguish between them. We’ll talk about each test in more
detail in the coding section.
To capture and analyze each author’s style, you’ll need a representative
corpus, or a body of text. For Doyle, use the famous Sherlock Holmes novel
The Hound of the Baskervilles, published in 1902. For Wells, use The War of the
Worlds, published in 1898. Both these novels contain more than 50,000 words,
more than enough for a sound statistical sampling. You’ll then compare
each author’s sample to The Lost World to determine how closely the writing
styles match.
To perform stylometry, you’ll use the Natural Language Toolkit (NLTK), a
popular suite of programs and libraries for working with human language
data in Python. It’s free and works on Windows, macOS, and Linux. Created
in 2001 as part of a computational linguistics course at the University of
Pennsylvania, NLTK has continued to develop and expand with the help of
dozens of contributors. To learn more, check out the official NLTK website
at http://www.nltk.org/.
Installing NLTK
You can find installation instructions for NLTK at http://www.nltk.org/install
.html. To install NLTK on Windows, open PowerShell and install it with
Preferred Installer Program (pip).
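A typical command (the exact invocation may vary with your setup) is:

pip install nltk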
To check that the installation was successful, open the Python interac-
tive shell and enter the following:
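>>> import nltk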
If you don’t get an error, you’re good to go. Otherwise, follow the instal-
lation instructions at http://www.nltk.org/install.html.
Note that you can also download NLTK packages directly in the shell.
Here’s an example:
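>>> import nltk
>>> nltk.download('punkt')

This particular example fetches the Punkt tokenizer models used by word_tokenize(); you can pass the name of any other package in the same way.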
You’ll also need access to the Stopwords Corpus, which can be down-
loaded in a similar manner.
Downloading the Stopwords Corpus
Click the Corpora tab in the NLTK Downloader window and download the
Stopwords Corpus, as shown in Figure 2-2.
Let’s download one more package to help you analyze parts of speech,
like nouns and verbs. Click the All Packages tab in the NLTK Downloader
window and download the Averaged Perceptron Tagger.
To use the shell, enter the following:
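>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')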
The Corpora
You can download the text files for The Hound of the Baskervilles (hound.txt),
The War of the Worlds (war.txt), and The Lost World (lost.txt), along with the
book’s code, from https://nostarch.com/real-world-python/.
These came from Project Gutenberg (http://www.gutenberg.org/), a great
source for public domain literature. So that you can use these texts right
away, I’ve stripped them of extraneous material such as table of contents,
chapter titles, copyright information, and so on.
def main():
    strings_by_author = dict()
    strings_by_author['doyle'] = text_to_string('hound.txt')
    strings_by_author['wells'] = text_to_string('war.txt')
    strings_by_author['unknown'] = text_to_string('lost.txt')
    print(strings_by_author['doyle'][:300])

    words_by_author = make_word_dict(strings_by_author)
    len_shortest_corpus = find_shortest_corpus(words_by_author)

    word_length_test(words_by_author, len_shortest_corpus)
    stopwords_test(words_by_author, len_shortest_corpus)
    parts_of_speech_test(words_by_author, len_shortest_corpus)
    vocab_test(words_by_author)
    jaccard_test(words_by_author, len_shortest_corpus)
The strings_by_author dictionary maps each author to the full text of that author's corpus:

{'doyle': 'Mr. Sherlock Holmes, who was usually very late in the mornings --snip--'}
Immediately after populating the dictionary, print the first 300 characters
of the doyle entry to ensure things went as planned. This should produce the
following printout:
Mr. Sherlock Holmes, who was usually very late in the mornings, save
upon those not infrequent occasions when he was up all night, was seated
at the breakfast table. I stood upon the hearth-rug and picked up the
stick which our visitor had left behind him the night before. It was a
fine, thick piec
With the corpora loaded correctly, the next step is to tokenize the
strings into words. Currently, Python doesn’t recognize words but instead
works on characters, such as letters, numbers, and punctuation marks. To
remedy this, you’ll use the make_word_dict() function to take the strings_
by_author dictionary as an argument, split out the words in the strings, and
return a dictionary called words_by_author with the authors as keys and a list
of words as values.
Stylometry relies on word counts, so it works best when each corpus is
the same length. There are multiple ways to ensure apples-to-apples com-
parisons. With chunking, you divide the text into blocks of, say, 5,000 words,
and compare the blocks. You can also normalize by using relative frequen-
cies, rather than direct counts, or by truncating to the shortest corpus.
Let’s explore the truncation option. Pass the words dictionary to
another function, find_shortest_corpus(), which calculates the number of
words in each author’s list and returns the length of the shortest corpus.
Table 2-1 shows the length of each corpus.
Corpus Length
Hound (Doyle) 58,387
War (Wells) 59,469
World (Unknown) 74,961
def make_word_dict(strings_by_author):
    """Return dictionary of tokenized words by corpus by author."""
    words_by_author = dict()
    for author in strings_by_author:
        tokens = nltk.word_tokenize(strings_by_author[author])
        words_by_author[author] = ([token.lower() for token in tokens
                                    if token.isalpha()])
    return words_by_author
First, define the text_to_string() function to load a text file. The built-in
read() function reads the whole file as an individual string, allowing rela-
tively easy file-wide manipulations. Use with to open the file so that it will
be closed automatically regardless of how the block terminates. Just like
putting away your toys, closing files is good practice. It prevents bad things
from happening, like running out of file descriptors, locking files from
further access, corrupting files, or losing data if writing to files.
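A minimal version of the function, consistent with the description above, looks like this:

def text_to_string(filename):
    """Read a text file and return it as a single string."""
    with open(filename) as infile:
        return infile.read()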
Some users may encounter a UnicodeDecodeError when loading the text.
If that happens, try adding encoding and errors arguments to the open()
call, as follows:
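with open(filename, encoding='utf-8', errors='ignore') as infile:
    return infile.read()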
You can ignore errors because these text files were downloaded as UTF-8
and have already been tested using this approach. For more on UTF-8, see
https://docs.python.org/3/howto/unicode.html.
Next, define the make_word_dict() function that will take the dictionary
of strings by author and return a dictionary of words by author. First, ini-
tialize an empty dictionary named words_by_author. Then, loop through the
keys in the strings_by_author dictionary. Use NLTK’s word_tokenize() method
and pass it the string dictionary’s key. The result will be a list of tokens that
will serve as the dictionary value for each author. Tokens are just chopped
up pieces of a corpus, typically sentences or words.
Once you have the tokens, populate the words_by_author dictionary using
list comprehension. List comprehension is a shorthand way to execute loops
in Python. You need to surround the code with square brackets to indicate
a list. Convert the tokens to lowercase and use the built-in isalpha() method,
which returns True if all the characters in a token are part of the alphabet
and False otherwise. This will filter out numbers and punctuation. It will
also filter out hyphenated words or names. Finish by returning the words_by_
author dictionary.
Define the function that takes the words_by_author dictionary as an argu-
ment. Immediately start an empty list to hold a word count.
Loop through the authors (keys) in the dictionary. Get the length
of the value for each key, which is a list object, and append the length to
the word_count list. The length here represents the number of words in the
corpus. For each pass through the loop, print the author’s name and the
length of his tokenized corpus.
When the loop ends, use the built-in min() function to get the lowest
count and assign it to the len_shortest_corpus variable. Print the answer and
then return the variable.
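A sketch of the function, following the steps just described, might look like this:

def find_shortest_corpus(words_by_author):
    """Return the length of the shortest corpus."""
    word_count = []
    for author in words_by_author:
        word_count.append(len(words_by_author[author]))
        print('\nNumber of words for {} = {}\n'.
              format(author, len(words_by_author[author])))
    len_shortest_corpus = min(word_count)
    print('length shortest corpus = {}\n'.format(len_shortest_corpus))
    return len_shortest_corpus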
Figure 2-3: The NLTK cumulative plot (left) versus the default frequency plot (right)
Finish by calling the plt.show() method to display the plot, but leave it
commented out. If you want to see the plot immediately after coding this
function, you can uncomment it. Also note that if you launch this program
via Windows PowerShell, the plots may close immediately unless you use the
block flag: plt.show(block=True). This will keep the plot up but halt execution
of the program until the plot is closed.
Based solely on the word length frequency plot in Figure 2-3, Doyle’s
style matches the unknown author’s more closely, though there are seg-
ments where Wells matches as well or better. Now let's run some other
tests to see whether we can confirm that finding.
Define a function that takes the words dictionary and the length of the
shortest corpus variables as arguments. Then initialize a dictionary to hold
the frequency distribution of stop words for each author. You don’t want to
cram all the plots in the same figure, so start a new figure named 2.
Both Doyle and the unknown author use stop words in a similar man-
ner. At this point, two analyses have favored Doyle as the most likely author
of the unknown text, but there’s still more to do.
Table 2-2: Parts of Speech with Tag Values
The taggers are typically trained on large datasets like the Penn Treebank
or Brown Corpus, making them highly accurate though not perfect. You can
also find training data and taggers for languages other than English. You
don’t need to worry about all these various terms and their abbreviations.
As with the previous tests, you’ll just need to compare lines in a chart.
Listing 2-6 defines a function to plot the frequency distribution of POS
in the three corpora.
['NN', 'NNS', 'WP', 'VBD', 'RB', 'RB', 'RB', 'IN', 'DT', 'NNS', --snip--]
Next, make a frequency distribution of the POS list and with each loop
plot the curve, using the top 35 samples. Note that there are only 36 POS
tags and several, such as list item markers, rarely appear in novels.
This is the final plot you’ll make, so call plt.show() to draw all the
plots to the screen. As pointed out in the discussion of Listing 2-4, if you’re
using Windows PowerShell to launch the program, you may need to use
plt.show(block=True) to keep the plots from closing automatically.
The previous plots, along with the current one (Figure 2-5), should
appear after about 10 seconds.
Once again, the match between the Doyle and unknown curves is clearly
better than the match of unknown to Wells. This suggests that Doyle is the
author of the unknown corpus.
Comparing Author Vocabularies
To compare the vocabularies among the three corpora, you’ll use the
chi-squared random variable ( X 2), also known as the test statistic, to measure
the “distance” between the vocabularies employed in the unknown corpus
and each of the known corpora. The closest vocabularies will be the most
similar. The formula is
X² = Σ (Oi − Ei)² / Ei, summed from i = 1 to n
where O is the observed word count and E is the expected word count
assuming the corpora being compared are both by the same author.
If Doyle wrote both novels, they should both have the same—or a
similar—proportion of the most common words. The test statistic lets you
quantify how similar they are by measuring how much the counts for each
word differ. The lower the chi-squared test statistic, the greater the similar-
ity between two distributions.
Listing 2-7 defines a function to compare vocabularies among the three
corpora.
The vocab_test() function needs the word dictionary but not the length
of the shortest corpus. Like the previous functions, however, it starts by cre-
ating a new dictionary to hold the chi-squared value per author and then
loops through the word dictionary.
[('the', 7778), ('of', 4112), ('and', 3713), ('i', 3203), ('a', 3195), --snip--]
Next, you get the observed count per author from the word dictionary.
For Doyle, this would be the count of the most common words in the corpus
of The Hound of the Baskervilles. Then, you get the expected count, which for
Doyle would be the count you would expect if he wrote both The Hound of
the Baskervilles and the unknown corpus. To do this, multiply the number
of counts in the combined corpus by the previously calculated author’s pro-
portion. Then apply the formula for chi-squared and add the result to the
dictionary that tracks each author's chi-squared score. Display the result
for each author.
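The loop body described above might look roughly like the following sketch; the chi_by_author dictionary name, the cutoff of 1,000 most common words, and the print formatting are illustrative choices rather than requirements:

chi_by_author = dict()
for author in words_by_author:
    if author != 'unknown':
        combined_corpus = (words_by_author[author] +
                           words_by_author['unknown'])
        author_proportion = (len(words_by_author[author]) /
                             len(combined_corpus))
        combined_freq_dist = nltk.FreqDist(combined_corpus)
        most_common_words = list(combined_freq_dist.most_common(1000))
        chisquared = 0
        for word, combined_count in most_common_words:
            observed = words_by_author[author].count(word)
            expected = combined_count * author_proportion
            chisquared += (observed - expected)**2 / expected
        chi_by_author[author] = chisquared
        print('Chi-squared for {} = {:.1f}'.format(author, chisquared))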
To find the author with the lowest chi-squared score, call the built-in
min() function and pass it the dictionary and dictionary key, which you
obtain with the get() method. This will yield the key corresponding to the
minimum value. This is important. If you omit this last argument, min() will
return the minimum key based on the alphabetical order of the names, not
their chi-squared score! You can see this mistake in the following snippet:
>>> mydict = {'doyle': 100, 'wells': 5}
>>> print(mydict)
{'doyle': 100, 'wells': 5}
>>> minimum = min(mydict)
>>> print(minimum)
'doyle'
>>> minimum = min(mydict, key=mydict.get)
>>> print(minimum)
'wells'
It’s easy to assume that the min() function returns the minimum numer-
ical value, but as you saw, it looks at dictionary keys by default.
Complete the function by printing the most likely author based on the
chi-squared score.
Chi-squared for doyle = 4744.4
Chi-squared for wells = 6856.3
Most-likely author by vocabulary is doyle
Figure 2-6: The area of overlap and the area of union for two sample sets
The more overlap there is between sets created from two texts, the
more likely they were written by the same author. Listing 2-8 defines a func-
tion for gauging the similarity of sample sets.
if __name__ == '__main__':
main()
Like most of the previous tests, the jaccard_test() function takes the
word dictionary and length of the shortest corpus as arguments. You’ll also
need a dictionary to hold the Jaccard coefficient for each author.
Jaccard similarity works with unique words, so you’ll need to turn the
corpora into sets to remove duplicates. First, you’ll build a set from the
unknown corpus. Then you’ll loop through the known corpora, turning them
into sets and comparing them to the unknown set. Be sure to truncate all
the corpora to the length of the shortest corpus when making the sets.
Prior to running the loop, use a generator expression to get the names
of the authors, other than unknown, from the words_by_author dictionary. A
generator expression is a function that returns an object that you can iterate over
one value at a time. It looks a lot like list comprehension, but instead of square
brackets, it’s surrounded by parentheses. And instead of constructing a poten-
tially memory-intensive list of items, the generator yields them in real time.
Generators are useful when you have a large set of values that you need to use
only once. I use one here as an opportunity to demonstrate the process.
When you assign a generator expression to a variable, all you get is a
type of iterator called a generator object. Compare this to making a list, as
shown here:
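>>> mylist = [i for i in range(4)]
>>> mylist
[0, 1, 2, 3]
>>> mygen = (i for i in range(4))
>>> mygen
<generator object <genexpr> at 0x...>

You can get the same one-value-at-a-time behavior from a full generator function, which uses the yield keyword in place of return: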
def generator(my_range):
    for i in range(my_range):
        yield i
Whereas the return statement ends a function, the yield statement sus-
pends the function’s execution and sends a value back to the caller. Later,
the function can resume where it left off. When a generator reaches its end,
it’s “empty” and can’t be called again.
Back to the code, start a for loop using the authors generator. Find the
unique words for each known author, just as you did for unknown. Then use
the built-in intersection() function to find all the words shared between the
current author’s set of words and the set for unknown. The intersection of two
given sets is the largest set that contains all the elements that are common
to both. With this information, you can calculate the Jaccard similarity
coefficient.
Update the jaccard_by_author dictionary and print each outcome in the
interpreter window. Then find the author with the maximum Jaccard value
and print the results.
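Pulling those steps together, a jaccard_test() function along these lines would do the job (the exact print formatting is up to you):

def jaccard_test(words_by_author, len_shortest_corpus):
    """Calculate Jaccard similarity of each known corpus to the unknown corpus."""
    jaccard_by_author = dict()
    unique_words_unknown = set(words_by_author['unknown'][:len_shortest_corpus])
    authors = (author for author in words_by_author if author != 'unknown')
    for author in authors:
        unique_words_author = set(words_by_author[author][:len_shortest_corpus])
        shared_words = unique_words_author.intersection(unique_words_unknown)
        jaccard_sim = (len(shared_words) /
                       (len(unique_words_author) + len(unique_words_unknown)
                        - len(shared_words)))
        jaccard_by_author[author] = jaccard_sim
        print('Jaccard Similarity for {} = {}'.format(author, jaccard_sim))
    most_likely_author = max(jaccard_by_author, key=jaccard_by_author.get)
    print('Most-likely author by similarity is {}'.format(most_likely_author))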
Summary
The true author of The Lost World is Doyle, so we’ll stop here and declare
victory. If you want to explore further, a next step might be to add more
known texts to doyle and wells so that their combined length is closer to
that for The Lost World and you don’t have to truncate it. You could also test
for sentence length and punctuation style or employ more sophisticated
techniques like neural nets and genetic algorithms.
You can also refine existing functions, like vocab_test() and jaccard_
test(), with stemming and lemmatization techniques that reduce words to
their root forms for better comparisons. As the program is currently writ-
ten, talk, talking, and talked are all considered completely different words
even though they share the same root.
At the end of the day, stylometry can’t prove with absolute certainty that
Sir Arthur Conan Doyle wrote The Lost World. It can only suggest, through
weight of evidence, that he is the more likely author than Wells. Framing
the question very specifically is important, since you can’t evaluate all pos-
sible authors. For this reason, successful authorship attribution begins with
good old-fashioned detective work that trims the list of candidates to a
manageable length.
Figure 2-7: Dispersion plot for major characters in The Hound of the Baskervilles
Note the bimodal distribution of Mortimer and the late-story overlap of
Barrymore, Selden, and the hound.
Dispersion plots can have more practical applications. For example,
as the author of technical books, I need to define a new term when it first
appears. This sounds easy, but sometimes the editing process can shuffle
whole chapters, and issues like this can fall through the cracks. A dispersion
plot, built with a long list of technical terms, can make finding these first
occurrences a lot easier.
For another use case, imagine you’re a data scientist working with para-
legals on a criminal case involving insider trading. To find out whether the
accused talked to a certain board member just prior to making the illegal
trades, you can load the subpoenaed emails of the accused as a continuous
string and generate a dispersion plot. If the board member’s name appears
as expected, case closed!
For this practice project, write a Python program that reproduces
the dispersion plot shown in Figure 2-7. If you have problems loading the
hound.txt corpus, revisit the discussion of Unicode on page 35. You can
find a solution, practice_hound_dispersion.py, in the appendix and online.
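One lightweight approach uses NLTK's built-in Text.dispersion_plot() method; the character list below is just a plausible starting point, not necessarily the one used for Figure 2-7:

import nltk

with open('hound.txt', encoding='utf-8', errors='ignore') as infile:
    text = infile.read()

tokens = nltk.word_tokenize(text)
hound = nltk.Text(tokens)
characters = ['Holmes', 'Watson', 'Mortimer', 'Henry', 'Barrymore',
              'Stapleton', 'Selden', 'hound']
hound.dispersion_plot(characters)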
Figure 2-8: Heatmap of semicolon use (dark squares) for Wells (left) and Doyle (right)
3
SUMMARIZING SPEECHES WITH
NATURAL LANGUAGE PROCESSING
THE OBJECTIVE
Write a Python program that summarizes a speech using NLP text extraction.
The Strategy
The Natural Language Toolkit includes the functions you’ll need to sum-
marize Dr. King’s speech. If you skipped Chapter 2, see page 29 for installa-
tion instructions.
To summarize the speech, you’ll need a digital copy. In previous
chapters, you manually downloaded files you needed from the internet.
This time you’ll use a more efficient technique, called web scraping, which
allows you to programmatically extract and save large amounts of data from
websites.
Once you’ve loaded the speech as a string, you can use NLTK to split
out and count individual words. Then, you’ll “score” each sentence in the
speech by summing the word counts within it. You can use those scores to
print the top-ranked sentences, based on how many sentences you want in
your summary.
Web Scraping
Scraping the web means using a program to download and process content.
This is such a common task that prewritten scraping programs are freely
available. You’ll use the requests library to download files and web pages,
and you’ll use the Beautiful Soup (bs4) package to parse HTML. Short for
Hypertext Markup Language, HTML is the standard format used to create
web pages.
To install the two modules, use pip in a terminal window or Windows
PowerShell (see page 8 in Chapter 1 for instructions on installing and
using pip):
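pip install requests
pip install beautifulsoup4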
To check the installation, open the shell and import each module as
shown next. If you don’t get an error, you’re good to go!
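>>> import requests
>>> import bs4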
def main():
    url = 'http://www.analytictech.com/mb021/mlk.htm'
    page = requests.get(url)
    page.raise_for_status()
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    p_elems = [element.text for element in soup.find_all('p')]
    speech = ' '.join(p_elems)
Start by importing Counter from the collections module to help you keep
track of the sentence scoring. The collections module is part of the Python
Standard Library and includes several container data types. A Counter is
a dictionary subclass for counting hashable objects. Elements are stored
as dictionary keys, and their counts are stored as dictionary values.
Next, to clean up the speech prior to summarizing its contents, import
the re module. The re stands for regular expressions, also referred to as regexes,
which are sequences of characters that define a search pattern. This mod-
ule will help you clean up the speech by allowing you to selectively remove
bits that you don’t want.
Finish the imports with the modules for scraping the web and doing
natural language processing. The last module brings in the list of functional
stop words (such as if, and, but, for) that contain no useful information.
You’ll remove these from the speech prior to summarization.
Next, define a main() function to run the program. To scrape the speech
off the web, provide the url address as a string. You can copy and paste
this from the website from which you want to extract text.
The requests library abstracts the complexities of making HTTP
requests in Python. HTTP, short for HyperText Transfer Protocol, is the
foundation of data communication using hyperlinks on the World Wide
Web. Use the requests.get() method to fetch the url and assign the output
to the page variable, which references the Response object the web page
returned for the request. This object’s text attribute holds the web page,
including the speech, as a string.
To check that the download was successful, call the Response object’s
raise_for_status() method. This does nothing if everything goes okay but
otherwise will raise an exception and halt the program.
At this point, the data is in HTML, as shown here:
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<title>Martin Luther King Jr.'s 1962 Speech</title>
</head>
--snip--
<p>I am happy to join with you today in what will go down in
history as the greatest demonstration for freedom in the history
of our nation. </p>
--snip--
As you can see, HTML has a lot of tags, such as <head> and <p>, that let
your browser know how to format the web page. The text between starting
and closing tags is called an element. For example, the text “Martin Luther
King Jr.’s 1962 Speech” is a title element sandwiched between the starting
tag <title> and the closing tag </title>. Paragraphs are formatted using <p>
and </p> tags.
Because these tags are not part of the original text, they should be
removed prior to any natural language processing. To remove the tags,
call the bs4.BeautifulSoup() method and pass it the string containing the
HTML. Note that I've explicitly specified html.parser. The program will
run without this but complain bitterly with warnings in the shell.
The soup variable now references a BeautifulSoup object, which means
you can use the object’s find_all() method to locate the speech buried in
the HTML document. In this case, to find the text between paragraph tags
(<p>), use list comprehension and find_all() to make a list of just the para-
graph elements.
Finish by turning the speech into a continuous string. Use the join()
method to turn the p_elems list into a string. Set the “joiner” character to a
space, designated by ' '.
Note that with Python, there is usually more than one way to accom-
plish a task. The last two lines of the listing can also be written as follows:
p_elems = [element.text for element in soup.select('p')]
speech = ' '.join(p_elems)
while True:
    max_words = input("Enter max words per sentence for summary: ")
    num_sents = input("Enter number of sentences for summary: ")
    if max_words.isdigit() and num_sents.isdigit():
        break
    else:
        print("\nInput must be in whole numbers.\n")

speech_edit_no_stop = remove_stop_words(speech_edit)
word_freq = get_word_freq(speech_edit_no_stop)
sent_scores = score_sentences(speech, word_freq, max_words)

counts = Counter(sent_scores)
summary = counts.most_common(int(num_sents))
print("\nSUMMARY:")
for i in summary:
    print(i[0])
NOTE According to research by the American Press Institute, comprehension is best with
sentences of fewer than 15 words. Similarly, the Oxford Guide to Plain English
recommends using sentences that average 15 to 20 words over a full document.
Listing 3-3: Defining a function to remove stop words from the speech
If you don’t convert the words to lowercase, one and One are considered
distinct elements. For counting purposes, every instance of one regardless of
its case should be treated as the same word. Otherwise, the contribution
of one to the document will be diluted.
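A compact version of the function, assuming nltk and the stop word list (from nltk.corpus import stopwords) are imported as described earlier, could look like this:

def remove_stop_words(speech_edit):
    """Remove stop words from string and return string."""
    stop_words = set(stopwords.words('english'))
    speech_edit_no_stop = ''
    for word in nltk.word_tokenize(speech_edit):
        if word.lower() not in stop_words:
            speech_edit_no_stop += word + ' '
    return speech_edit_no_stop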
Scoring Sentences
Listing 3-5 defines a function that scores sentences based on the frequency
distribution of the words they contain. It returns a dictionary with each sen-
tence as the key and its score as the value.
def score_sentences(speech, word_freq, max_words):
    """Return dictionary of sentence scores based on word frequency."""
    sent_scores = dict()
    sentences = nltk.sent_tokenize(speech)
    for sent in sentences:
        sent_scores[sent] = 0
        words = nltk.word_tokenize(sent.lower())
        sent_word_count = len(words)
        if sent_word_count <= int(max_words):
            for word in words:
                if word in word_freq.keys():
                    sent_scores[sent] += word_freq[word]
            sent_scores[sent] = sent_scores[sent] / sent_word_count
    return sent_scores
if __name__ == '__main__':
main()
SUMMARY:
From every mountainside, let freedom ring.
Let freedom ring from Lookout Mountain in Tennessee!
Let freedom ring from every hill and molehill in Mississippi.
Let freedom ring from the curvaceous slopes of California!
Let freedom ring from the snow capped Rockies of Colorado!
But one hundred years later the Negro is still not free.
From the mighty mountains of New York, let freedom ring.
From the prodigious hilltops of New Hampshire, let freedom ring.
And I say to you today my friends, let freedom ring.
I have a dream today.
It is a dream deeply rooted in the American dream.
Free at last!
Thank God almighty, we're free at last!"
We must not allow our creative protest to degenerate into physical violence.
This is the faith that I go back to the mount with.
Not only does the summary capture the title of the speech, it captures
the main points.
But if you run it again with 10 words per sentence, a lot of the sentences
are clearly too long. Because there are only 7 sentences in the whole speech
with 10 or fewer words, the program can’t honor the input requirements. It
defaults to printing the speech from the beginning until the sentence count
is at least what was specified in the num_sents variable.
Now, rerun the program and try setting the word count limit to 1,000.
SUMMARY:
From every mountainside, let freedom ring.
Let freedom ring from Lookout Mountain in Tennessee!
Let freedom ring from every hill and molehill in Mississippi.
Let freedom ring from the curvaceous slopes of California!
Let freedom ring from the snow capped Rockies of Colorado!
But one hundred years later the Negro is still not free.
From the mighty mountains of New York, let freedom ring.
From the prodigious hilltops of New Hampshire, let freedom ring.
And I say to you today my friends, let freedom ring.
I have a dream today.
But not only there; let freedom ring from the Stone Mountain of Georgia!
It is a dream deeply rooted in the American dream.
With this faith we will be able to work together, pray together; to struggle
together, to go to jail together, to stand up for freedom forever, knowing
that we will be free one day.
Free at last!
One hundred years later the life of the Negro is still sadly crippled by the
manacles of segregation and the chains of discrimination.
Although longer sentences don’t dominate the summary, a few slipped
through, making this summary less poetic than the previous one. The lower
word count limit forces the previous version to rely more on shorter phrases
that act like a chorus.
THE OBJECTIVE
Write a Python program that uses the gensim module to summarize a speech.
Installing gensim
The gensim module runs on all the major operating systems but is dependent
on NumPy and SciPy. If you don’t have them installed, go back to Chapter 1
and follow the instructions in “Installing the Python Libraries” on page 6.
To install gensim on Windows, use pip install -U gensim. To install it in a
terminal, use pip install --upgrade gensim. For conda environments, use conda
install -c conda-forge gensim. For more on gensim, go to https://radimrehurek
.com/gensim/.
url = ('https://jamesclear.com/great-speeches/make-your-bed-by-admiral'
       '-william-h-mcraven')
page = requests.get(url)
page.raise_for_status()
soup = bs4.BeautifulSoup(page.text, 'html.parser')
p_elems = [element.text for element in soup.find_all('p')]
speech = ' '.join(p_elems)
You’ll test gensim on the raw speech scraped from the web, so you won’t
need modules for cleaning the text. The gensim module will also do any
counting internally, so you don’t need Counter, but you will need gensim’s
summarize() function to summarize the text. The only other change is to
the url address.
Listing 3-7: Running gensim, removing duplicate lines, and printing the summary
Start by printing a header for your summary. Then, call the gensim
summarize() function to summarize the speech in 225 words. This word
count will produce about 15 sentences, assuming the average sentence has
15 words. In addition to a word count, you can pass summarize() a ratio, such
as ratio=0.01. This will produce a summary whose length is 1 percent of the
full document.
Ideally, you could summarize the speech and print the summary in
one step.
print(summarize(speech, word_count=225))
Unfortunately, gensim sometimes duplicates sentences in summaries,
and that occurs here:
To avoid duplicating text, you first need to break out the sentences in
the summary variable using the NLTK sent_tokenize() function. Then make
a set from these sentences, which will remove duplicates. Finish by printing
the results.
Because sets are unordered, the arrangement of the sentences may
change if you run the program multiple times.
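Putting that together, the end of main() might look something like this (the header text is just an example):

print("\nSummary of 'Make Your Bed' speech:")
summary = summarize(speech, word_count=225)
sentences = nltk.sent_tokenize(summary)
sents = set(sentences)
print(' '.join(sents))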
Figure 3-1: Word cloud made from 1993 State of the Union address
by Bill Clinton
Less than 10 years later, George W. Bush’s word cloud reveals a focus on
security (Figure 3-2).
Figure 3-2: Word cloud made from 2002 State of the Union address
by George W. Bush
Another use for word clouds is to extract keywords from customer feed-
back. If words like poor, slow, and expensive dominate, you’ve got a problem!
Writers can also use the clouds to compare chapters in a book or scenes in a
screenplay. If the author is using very similar language for action scenes and
romantic interludes, some editing is needed. If you’re a copywriter, clouds can
help you check your keyword density for search engine optimization (SEO).
There are lots of ways to generate word clouds, including free websites
like https://www.wordclouds.com/ and https://www.jasondavies.com/wordcloud/.
But if you want to fully customize your word cloud or embed the generator
within another program, you need to do it yourself. In this project, you’ll
use a word cloud to make a promotional flyer for a school play based on the
Sherlock Holmes story The Hound of the Baskervilles.
Instead of using the basic rectangle shown in Figures 3-1 and 3-2, you’ll
fit the words into an outline of Holmes’s head (Figure 3-3).
THE OBJECTIVE
Use the wordcloud module to generate a shaped word cloud for a novel.
Listing 3-8: Importing modules and loading text, image, and stop words
Begin by importing NumPy and PIL. PIL will open the image, and NumPy
will turn it into a mask. You started using NumPy in Chapter 1; in case you
skipped it, see the “Installing the Python Libraries” section on page 6.
Note that the pillow module continues to use the acronym PIL for backward
compatibility.
You’ll need matplotlib, which you downloaded in the “Installing the
Python Libraries” section of Chapter 1, to display the word cloud. The
wordcloud module comes with its own list of stop words, so import STOPWORDS
along with the cloud functionality.
Next, load the novel's text file and store it in a variable named text.
As described in the discussion of Listing 2-2 in Chapter 2, you may encounter
a UnicodeDecodeError when loading the text.
In this case, try modifying the open() function by adding encoding and
errors arguments.
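For example:

with open('hound.txt', encoding='utf-8', errors='ignore') as infile:
    text = infile.read()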
With the text loaded, use PIL’s Image.open() method to open the image
of Holmes and use NumPy to turn it into an array. If you’re using the iStock
image of Holmes, change the image’s filename as appropriate.
Assign the STOPWORDS set imported from wordcloud to the stopwords vari-
able. Then update the set with a list of additional words that you want to
exclude. These will be words like said and now that dominate the word
cloud but add no useful content. Determining what they are is an iterative
process. You generate the word cloud, remove words that you don’t think
contribute, and repeat. You can comment out this line to see the benefit.
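In code, the stop word setup might look like the following; beyond said and now, the extra words listed are only examples of the kind you might prune:

stopwords = STOPWORDS
stopwords.update(['said', 'one', 'now', 'will'])  # iteratively chosen extras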
NOTE To update a container like STOPWORDS, you need to know whether it’s a list, dictionary,
set, and so on. Python’s built-in type() function returns the class type of any object
passed as an argument. In this case, print(type(STOPWORDS)) yields <class 'set'>.
wc_hound.py, part 2

wc = WordCloud(max_words=500,
               relative_scaling=0.5,
               mask=mask,
               background_color='white',
               stopwords=stopwords,
               margin=2,
               random_state=7).generate(text)
colors = wc.to_array()
Figure 3-4: Example of masked word cloud with an outline (left) versus without (right)
wc_hound.py, part 3

plt.figure()
plt.title("Chamberlain Hunt Academy Senior Class Presents:\n",
          fontsize=15, color='brown')
plt.text(-10, 0, "The Hound of the Baskervilles",
         fontsize=20, fontweight='bold', color='brown')
plt.suptitle("7:00 pm May 10-12 McComb Auditorium",
             x=0.52, y=0.095, fontsize=15, color='brown')
plt.imshow(colors, interpolation="bilinear")
plt.axis('off')
plt.show()
##plt.savefig('hound_wordcloud.png')
You can change the size of the display by adding an argument when you
initialize the figure. Here’s an example: plt.figure(figsize=(50, 60)).
There are many other ways to change the results. For example, setting
the margin parameter to 10 yields a sparser word cloud (Figure 3-6).
Figure 3-6: The word cloud generated with margin=10
Summary
In this chapter, you used extraction-based summarization techniques to
produce a synopsis of Martin Luther King Jr.’s “I Have a Dream” speech.
You then used a free, off-the-shelf module called gensim to summarize
Admiral McRaven’s “Make Your Bed” speech with even less code. Finally,
you used the wordcloud module to create an interesting design with words.
Figure 3-8: Word clouds for two movies released in 2010: How to
Train Your Dragon and Prince of Persia
If you’re not into movies, pick something else. Alternatives include
famous novels, Star Trek episodes, and song lyrics (Figure 3-9).
Figure 3-9: Word cloud made from song lyrics (Donald Fagen’s “I.G.Y.”)
Board games have seen a resurgence in recent years, so you could follow
this trend and print the word clouds on card stock. Alternatively, you could
keep things digital and present the player with multiple-choice answers for
each cloud. The game should keep track of the number of correct answers.
SUMMARY:
Gensim is implemented in Python and Cython.
Gensim is an open-source library for unsupervised topic modeling and natural
language processing, using modern statistical machine learning.
[12] Gensim is commercially supported by the company rare-technologies.com,
who also provide student mentorships and academic thesis projects for Gensim
via their Student Incubator programme.
The software has been covered in several new articles, podcasts and
interviews.
Gensim is designed to handle large text collections using data streaming and
incremental online algorithms, which differentiates it from most other machine
learning software packages that target only in-memory processing.
Next, try the gensim version from Project 4 on those boring services
agreements no one ever reads. An example Microsoft agreement is available
at https://www.microsoft.com/en-us/servicesagreement/default.aspx. Of course, to
evaluate the results, you’ll have to read the full agreement, which almost no
one ever does! Enjoy the catch-22!
chapter_elems = soup.select('div[class="chapter"]')
chapters = chapter_elems[2:]
--snip--
Chapter 3:
"Besides, besides—" "Why do you hesitate?” "There is a realm in which the most
acute and most experienced of detectives is helpless." "You mean that the
thing is supernatural?" "I did not positively say so." "No, but you evidently
think it." "Since the tragedy, Mr. Holmes, there have come to my ears several
incidents which are hard to reconcile with the settled order of Nature." "For
example?" "I find that before the terrible event occurred several people had
seen a creature upon the moor which corresponds with this Baskerville demon,
and which could not possibly be any animal known to science.
--snip--
Chapter 6:
"Bear in mind, Sir Henry, one of the phrases in that queer old legend which
Dr. Mortimer has read to us, and avoid the moor in those hours of darkness
when the powers of evil are exalted." I looked back at the platform when we
had left it far behind and saw the tall, austere figure of Holmes standing
motionless and gazing after us.
Chapter 7:
I feared that some disaster might occur, for I was very fond of the old man,
and I knew that his heart was weak." "How did you know that?" "My friend
Mortimer told me." "You think, then, that some dog pursued Sir Charles, and
that he died of fright in consequence?" "Have you any better explanation?" "I
have not come to any conclusion." "Has Mr. Sherlock Holmes?" The words took
away my breath for an instant but a glance at the placid face and steadfast
eyes of my companion showed that no surprise was intended.
--snip--
Chapter 14:
"What’s the game now?" "A waiting game." "My word, it does not seem a very
cheerful place," said the detective with a shiver, glancing round him at the
gloomy slopes of the hill and at the huge lake of fog which lay over the
Grimpen Mire.
Far away on the path we saw Sir Henry looking back, his face white in the
moonlight, his hands raised in horror, glaring helplessly at the frightful
thing which was hunting him down.
--snip--
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Next, convert the letters in your short message into numbers:
H E R E K I T T Y K I T T Y original message
08 05 18 05 11 09 20 20 25 11 09 20 20 25 convert letters to numbers
Starting at the upper left of the one-time pad sheet and reading left to
right, assign a number pair (key) to each letter and add it to the number
value of the letter. You’ll want to work with base 10 number pairs, so if your
sum is 100 or greater, use modular arithmetic to truncate the value to
the last two digits (103 becomes 03). The numbers in shaded cells in the
following diagrams are the result of modular arithmetic.
H E R E K I T T Y K I T T Y original message
08 05 18 05 11 09 20 20 25 11 09 20 20 25 convert letters to numbers
73 98 39 15 43 74 55 60 12 83 24 32 58 86 from sender’s OTP
81 03 57 20 54 83 75 80 37 94 33 52 78 11 ciphertext
The last row in this diagram represents the ciphertext. Note that KITTY,
duplicated in the plaintext, is not repeated in the ciphertext. Each encryp-
tion of KITTY is unique.
To decrypt the ciphertext back to plaintext, the recipient uses the same
sheet from their identical one-time pad. They place their number pairs
below the ciphertext pairs and subtract. When this results in a negative
number, they use modular subtraction (adding 100 to the ciphertext value
before subtracting). They finish by converting the resulting number pairs
back to letters.
81 03 57 20 54 83 75 80 37 94 33 52 78 11 ciphertext
73 98 39 15 43 74 55 60 12 83 24 32 58 86 from recipient’s OTP
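Subtracting each key (adding 100 first whenever the result would be negative) recovers the original numbers, which convert back to the plaintext letters:

08 05 18 05 11 09 20 20 25 11 09 20 20 25 recovered numbers
H E R E K I T T Y K I T T Y recovered message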
To ensure that no keys are repeated, the number of letters in the mes-
sage can’t exceed the number of keys on the pad. This forces the use of
short messages, which have the advantage of being easier to encrypt and
decrypt and which offer a cryptanalyst fewer opportunities to decipher the
message. Some other guidelines include the following:
The last item is important. In the novel, the British intelligence offi-
cer finds a copy of Rebecca at a captured German outpost. Through simple
deductive reasoning he recognizes it as a substitute for a one-time pad.
With a digital approach, this would have been much more difficult. In fact,
the novel could be kept on a small, easily concealed device such as an SD card.
This would make it similar to a one-time pad, which is often no bigger than
a postage stamp.
A digital approach does have one disadvantage, however: the program
is a discoverable item. Whereas a spy could simply memorize the rules for a
one-time pad, with a digital approach the rules must be ensconced in the
software. This weakness can be minimized by writing the program so that
it looks innocent—or at least cryptic—and having it request input from the
user for the message and the name of the code book.
THE OBJECTIVE
Write a Python program that encrypts and decrypts messages using a digital novel as a
one-time pad.
The Strategy
Unlike the spy, you won’t need all the rules used in the novel, and many
wouldn’t work anyway. If you’ve ever used any kind of ebook, you know that
page numbers are meaningless. Changes to screen sizes and text sizes ren-
der all such page numbers nonunique. And because you can choose letters
from anywhere in the book, you don’t necessarily need special rules for rare
letters or for discounting numbers in a count.
So, you don’t need to focus on perfectly reproducing the Rebecca
cipher. You just need to produce something similar and, ideally, better.
Luckily, Python iterables, such as lists and tuples, use numerical indexes
to keep track of every single item within them. By loading a novel as a list,
you can use these indexes as unique starting keys for each character. You
can then shift the indexes based on the day of the year, emulating the spy’s
methodology in The Key to Rebecca.
Unfortunately, Rebecca is not yet in the public domain. In its place, we’ll
substitute the text file of Sir Arthur Conan Doyle’s The Lost World that you
used in Chapter 2. This novel contains 51 distinct characters that occur
421,545 times, so you can randomly choose indexes with little chance of
duplication. This means you can use the whole book as a one-time pad each
time you encrypt a message, rather than restrict yourself to a tiny collection
of numbers on a single one-time pad sheet.
NOTE You can download and use a digital version of Rebecca if you want. I just can’t
provide you with a copy for free!
Because you’ll be reusing the book, you’ll need to worry about both
message-to-message and in-message duplication of keys. The longer the mes-
sage, the more material the cryptanalyst can study, and the easier it is to
crack the code. And if each message is sent with the same encryption key,
all the intercepted messages can be treated as a single large message.
import sys
import os
import random
from collections import defaultdict, Counter

def main():
    message = input("Enter plaintext or ciphertext: ")
    process = input("Enter 'encrypt' or 'decrypt': ")
    while process not in ('encrypt', 'decrypt'):
        process = input("Invalid process. Enter 'encrypt' or 'decrypt': ")
    shift = int(input("Shift value (1-366) = "))
    while not 1 <= shift <= 366:
        shift = int(input("Invalid value. Enter digit from 1 to 366: "))
    infile = input("Enter filename with extension: ")
    if not os.path.exists(infile):
        print("File {} not found. Terminating.".format(infile), file=sys.stderr)
        sys.exit(1)
    text = load_file(infile)
    char_dict = make_dict(text, shift)
    if process == 'encrypt':
        ciphertext = encrypt(message, char_dict)
        if check_for_fail(ciphertext):
            print("\nProblem finding unique keys.", file=sys.stderr)
            print("Try again, change message, or change code book.\n",
                  file=sys.stderr)
            sys.exit()
        print("\nCharacter and number of occurrences in char_dict: \n")
        print("{: >10}{: >10}{: >10}".format('Character', 'Unicode', 'Count'))
        for key in sorted(char_dict.keys()):
            print('{:>10}{:>10}{:>10}'.format(repr(key)[1:-1],
                                              str(ord(key)),
                                              len(char_dict[key])))
        print('\nNumber of distinct characters: {}'.format(len(char_dict)))
        print("Total number of characters: {:,}\n".format(len(text)))
        for i in ciphertext:
            print(text[i - shift], end='', flush=True)
Start by importing sys and os, two modules that let you interface with
the operating system; then the random module; and then defaultdict and
Counter from the collections module.
The collections module is part of the Python Standard Library and
includes several container data types. You can use defaultdict to build a
dictionary on the fly. If defaultdict encounters a missing key, it will supply
a default value rather than throw an error. You’ll use it to build a dictionary
of the characters in The Lost World and their corresponding index values.
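For example, a defaultdict built with list as its factory happily accepts a key it has never seen:

>>> from collections import defaultdict
>>> char_dict = defaultdict(list)
>>> char_dict['a'].append(42)  # no KeyError, even though 'a' is a new key
>>> char_dict['a']
[42]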
NOTE Python’s random module does not produce truly random numbers but rather pseudo-
random numbers that can be predicted. Any cipher using pseudorandom numbers
can potentially be cracked by a cryptanalyst. For maximum security when generating
random numbers, you should use Python’s os.urandom() function.
Now, print the contents of the character dictionary so you can see how
many times the various characters occur in the novel. This will help
guide what you put in messages, though The Lost World contains a healthy
helping of useful characters.
Loading a File and Making a Dictionary
Listing 4-2 defines functions to load a text file and make a dictionary of
characters in the file and their corresponding indexes.
If you encounter a UnicodeDecodeError when loading the file, try adding the
encoding and errors arguments to the open() call.
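A sketch of the two functions, consistent with the approach described in this chapter (the lowercasing matches what encrypt() will expect), might look like this:

def load_file(infile):
    """Read and return text file as a string of lowercase characters."""
    with open(infile, encoding='utf-8', errors='ignore') as f:
        loaded_string = f.read().lower()
    return loaded_string

def make_dict(text, shift):
    """Return dictionary of characters with shifted indexes as values."""
    char_dict = defaultdict(list)
    for index, char in enumerate(text):
        char_dict[char].append(index + shift)
    return char_dict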
The encrypt() function will take the message and char_dict as arguments.
Start it by creating an empty list to hold the ciphertext. Next, start loop-
ing through the characters in message and converting them to lowercase to
match the characters in char_dict.
If the number of indexes associated with the character is greater than 1,
the program uses the random.choice() method to choose one of the character’s
indexes at random.
If a character occurs only once in char_dict, random.choice() will throw
an error. To handle this, the program uses a conditional and hardwires the
choice of the index, which will be at position [0].
If the character doesn’t exist in The Lost World, it won’t be in the diction-
ary, so use a conditional to check for this. If it evaluates to True, print an
alert for the user and use continue to return to the start of the loop without
choosing an index. Later, when the quality control steps run on the cipher-
text, a space will appear in the decrypted plaintext where this character
should be.
If continue is not called, then the program appends the index to the
encrypted list. When the loop ends, you return the list to end the function.
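Putting those pieces together, encrypt() might look roughly like this:

def encrypt(message, char_dict):
    """Return list of indexes representing characters in a message."""
    encrypted = []
    for char in message.lower():
        if len(char_dict[char]) > 1:
            index = random.choice(char_dict[char])
        elif len(char_dict[char]) == 1:  # random.choice() needs more than one item
            index = char_dict[char][0]
        else:  # character not found in the code book
            print("\nCharacter {} cannot be encrypted.".format(repr(char)))
            continue
        encrypted.append(index)
    return encrypted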
To see how this works, let’s look at the first message the Nazi spy sends
in The Key to Rebecca, shown here:
HAVE ARRIVED. CHECKING IN. ACKNOWLEDGE.
Using this message and a shift value of 70 yielded the following ran-
domly generated ciphertext:
[125711, 106950, 85184, 43194, 45021, 129218, 146951, 157084, 75611, 122047,
121257, 83946, 27657, 142387, 80255, 160165, 8634, 26620, 105915, 135897,
22902, 149113, 110365, 58787, 133792, 150938, 123319, 38236, 23859, 131058,
36637, 108445, 39877, 132085, 86608, 65750, 10733, 16934, 78282]
Your results may differ due to the stochastic nature of the algorithm.
if __name__ == '__main__':
main()
Listing 4-5: Defining a function to check for duplicate indexes and calling main()
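A minimal check_for_fail() function, returning True when any index appears more than once in the ciphertext, could be written like this:

def check_for_fail(ciphertext):
    """Return True if ciphertext contains any duplicate keys."""
    check = [key for key in ciphertext if ciphertext.count(key) > 1]
    if len(check) > 0:
        return True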
Sending Messages
The following message is based on a passage from The Key to Rebecca. You
can find it in the downloadable Chapter_4 folder as allied_attack_plan.txt.
As a test, try sending it with a shift of 70. Use your operating system’s Select
All, Copy, and Paste commands to transfer the text when asked for input. If
it doesn’t pass the check_for_fail() test, run it again!
Allies plan major attack for Five June. Begins at oh five twenty with
bombardment from Aslagh Ridge toward Rommel east flank. Followed by tenth
Indian Brigade infantry with tanks of twenty second Armored Brigade on Sidi
Muftah. At same time, thirty second Army Tank Brigade and infantry to charge
north flank at Sidra Ridge. Three hundred thirty tanks deployed to south and
seventy to north.
The nice thing about this technique is that you can use proper punc-
tuation, at least if you type the message into the interpreter window. Text
copied in from outside may need to be stripped of the newline character
(such as \r\n or \n), placed wherever the carriage return was used.
Of course, only characters that occur in The Lost World can be encrypted.
The program will warn you of exceptions and then replace missing charac-
ters with a space.
To be sneaky, you don’t want to save plaintext or ciphertext messages to
a file. Cutting and pasting from the shell is the way to go. Just remember to
copy something new when you’re finished so you don’t leave incriminating
evidence on your clipboard!
If you want to get fancy, you can copy and paste text to the clipboard
straight from Python using pyperclip, written by Al Sweigart. You can learn
more at https://pypi.org/project/pyperclip/.
Summary
In this chapter, you got to work with defaultdict and Counter from the collec-
tions module; choice() from the random module; and replace(), enumerate(),
ord(), and repr() from the Python Standard Library. The result was an
encryption program, based on the one-time pad technique, that produces
unbreakable ciphertext.
Further Reading
The Key to Rebecca (Penguin Random House, 1980), by Ken Follett, is an
exciting novel noted for its depth of historical detail, accurate descriptions
of Cairo in World War II, and thrilling espionage storyline.
The Code Book: The Science of Secrecy from Ancient Egypt to Quantum
Cryptography (Anchor, 2000), by Simon Singh, is an interesting review of
cryptography through the ages, including a discussion of the one-time pad.
If you enjoy working with ciphers, check out Cracking Codes with Python
(No Starch Press, 2018), by Al Sweigart. Aimed at beginners in both cryp-
tography and Python programming, this book covers many cipher types,
including reverse, Caesar, transposition, substitution, affine, and Vigenère.
Figure 4-2: Frequency of occurrence of characters in the digital version of The Lost World
Note that the most common character is a space. This makes it easy to
encrypt spaces, further confounding any cryptanalysis!
You can find a solution, practice_barchart.py, in the appendix and on the
book’s website.
Practice Project: Sending Secrets the WWII Way
According to the Wikipedia article on Rebecca (https://en.wikipedia.org/wiki/
Rebecca_(novel)), the Germans in North Africa in World War II really did
attempt to use the novel as the key to a book code. Rather than encode the
message letter by letter, sentences would be made using single words in the
book, referred to by page number, line, and position in the line.
Copy and edit the rebecca.py program so that it uses words rather than
letters. To get you started, here’s how to load the text file as a list of words,
rather than characters, using list comprehension:
with open('lost.txt') as f:
    words = [word.lower() for line in f for word in line.split()]
    words_no_punct = ["".join(char for char in word if char.isalpha())
                      for word in words]
['i', 'have', 'wrought', 'my', 'simple', 'plan', 'if', 'i', 'give', 'one',
'hour', 'of', 'joy', 'to', 'the', 'boy', 'whos', 'half', 'a', 'man']
encrypted ciphertext =
[23371, 7491]
decrypted plaintext =
with ten
encrypted ciphertext =
[29910, 70641, 30556, 60850, 72292, 32501, 6507, 18593, 41777, 23831, 41833,
16667, 32749, 3350, 46088, 37995, 12535, 30609, 3766, 62585, 46971, 8984,
44083, 43414, 56950]
decrypted plaintext =
a a so if do in my under for to all he the the with ten a a tell all night
kind so the the
There are 1,864 occurrences of a and 4,442 of the in The Lost World. If
you stick to short messages, you shouldn’t duplicate keys. Otherwise, you
may need to use multiple flag characters or disable the check_for_fail()
function and accept some duplicates.
Feel free to come up with your own method for handling problem
words. As consummate planners, the Germans surely had something in mind
or they wouldn’t have considered a book code in the first place!
You can find a simple first-letter solution, practice_WWII_words.py, in the
appendix or online at https://nostarch.com/real-world-python/.
5
FINDING PLUTO
NOTE In 2006, the International Astronomical Union reclassified Pluto as a dwarf planet.
This was based on the discovery of other Pluto-sized bodies in the Kuiper Belt, including
one—Eris—that is volumetrically smaller but 27 percent more massive than Pluto.
Figure 5-2: A blink comparator
For this technique to work, the photos need to be taken with the same
exposure and under similar viewing conditions. Most importantly, the stars
in the two images must line up perfectly. In Tombaugh’s day, technicians
achieved this through painstaking manual labor; they carefully guided the
telescope during the hour-long exposures, developed the photographic
plates, and then shifted them in the blink comparator to fine-tune the
alignment. Because of this exacting work, it would sometimes take Tombaugh
a week to examine a single pair of plates.
In this project, you’ll digitally duplicate the process of aligning the plates
and blinking them on and off. You’ll work with bright and dim objects, see
the impact of different exposures between photos, and compare the use of
positive images to the negative ones that Tombaugh used.
THE OBJECTIVE
Write a Python program that aligns two nearly identical images and displays each one in
rapid succession in the same window.
The Strategy
The photos for this project are already taken, so all you need to do is align
them and flash them on and off. Aligning images is often referred to as
image registration. This involves making a combination of vertical, horizon-
tal, or rotational transformations to one of the images. If you’ve ever taken
a panorama with a digital camera, you’ve seen registration at work.
Image registration follows these steps:
1. Locate distinctive features, called keypoints, in each image.
2. Describe each keypoint numerically with a descriptor.
3. Use the numerical descriptors to match identical features in each
image.
4. Warp one image so that matched features share the same pixel locations
in both images.
For this to work well, the images should be the same size and cover close to
the same area.
Fortunately, the OpenCV Python package ships with algorithms that
perform these steps. If you skipped Chapter 1, you can read about OpenCV
on page 6.
Once the images are registered, you’ll need to display them in the same
window so that they overlay exactly and then loop through the display a
set number of times. Again, you can easily accomplish this with the help of
OpenCV.
The Data
The images you’ll need are in the Chapter_5 folder in the book’s supporting
files, downloadable from https://nostarch.com/real-world-python/. The folder
structure should look like Figure 5-3. After downloading the folders, don’t
change this organizational structure or the folder contents and names.
The night_1 and night_2 folders contain the input images you’ll use to
get started. In theory, these would be images of the same region of space
taken on different nights. The ones used here are the same star field image
to which I’ve added an artificial transient. A transient, short for transient
astronomical event, is a celestial object whose motion is detectable over rela-
tively short time frames. Comets, asteroids, and planets can all be consid-
ered transients, as their movement is easily detected against the more static
background of the galaxy.
Table 5-1 briefly describes the contents of the night_1 folder. This folder
contains files with left in their filenames, which means they should go on
the left side of a blink comparator. The images in the night_2 folder contain
right in the filenames and should go on the other side.
Table 5-1: Files in the night_1 folder
Filename Description
1_bright_transient_left.png Contains a large, bright transient
Figure 5-4 is an example of one of the images. The arrow points to the
transient (but isn’t part of the image file).
Importing Modules and Assigning a Constant
Listing 5-1 imports the modules you’ll need to run the program and assigns
a constant for the minimum number of keypoint matches to accept. Also
called interest points, keypoints are interesting features in an image that you
can use to characterize the image. They’re usually associated with sharp
changes in intensity, such as corners or, in this case, stars.
blink_comparator.py, part 1:

import os
from pathlib import Path

import numpy as np
import cv2 as cv

MIN_NUM_KEYPOINT_MATCHES = 50
Listing 5-1: Importing modules and assigning a constant for keypoint matches
Start by importing the operating system module, which you’ll use to list
the contents of folders. Then import pathlib, a handy module that simplifies
working with files and folders. Finish by importing NumPy and cv (OpenCV)
for working with images. If you skipped Chapter 1, you can find installation
instructions for NumPy on page 8.
Assign a constant variable for the minimum number of keypoint
matches to accept. For efficiency, you ideally want the smallest value that
will yield an acceptable registration result. In this project, the algorithm
runs so quickly that you can increase this value without a significant cost.
Listing 5-2: Defining the first part of main(), used to manipulate files and folders
Start by defining main() and then use the os module’s listdir() method
to create a list of the filenames in the night_1 and night_2 folders.
Note that os.listdir() does not impose an order on the files when
they’re returned. The underlying operating system determines the order,
meaning macOS will return a different list than Windows! To ensure that
the lists are consistent and the files are paired correctly, wrap os.listdir()
with the built-in sorted() function. This function will return the files in
numerical order, based on the first character in the filename.
Next, assign path names to variables using the pathlib Path class. The
first two variables will point to the two input folders, and the third will
point to an output folder to hold the registered images.
The pathlib module, introduced in Python 3.4, is an alternative to
os.path for handling file paths. The os module treats paths as strings, which
can be cumbersome and requires you to use functionality from across the
Standard Library. Instead, the pathlib module treats paths as objects and
gathers the necessary functionality in one place. The official documentation
for pathlib is at https://docs.python.org/3/library/pathlib.html.
For the first part of the directory path, use the cwd() class method to get
the current working directory. If you have at least one Path object, you can
use a mix of objects and strings in the path designation. You can join the
string, representing the folder name, with the / symbol. This is similar to
using os.path.join(), if you’re familiar with the os module.
Note that you will need to execute the program from within the project
directory. If you call it from elsewhere in the filesystem, it will fail.
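A minimal sketch of the setup just described, using the folder names from the book's file structure and the variable names referenced later (night1_files, path1, and so on):

def main():
    night1_files = sorted(os.listdir('night_1'))
    night2_files = sorted(os.listdir('night_2'))
    path1 = Path.cwd() / 'night_1'
    path2 = Path.cwd() / 'night_2'
    path3 = Path.cwd() / 'night_1_registered'  # Output folder for registered images.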
Looping in main()
Listing 5-3, still in the main() function, runs the program with a big for
loop. This loop will take a file from each of the two “night” folders, load
them as grayscale images, find matching keypoints in each image, use the
keypoints to warp (or register) the first image to match the second, save the
registered image, and then compare (or blink) the registered first image
with the original second image. I’ve also included a few optional quality
control steps that you can comment out once you’re satisfied with the results.
Begin the loop by enumerating the night1_files list. The enumerate()
built-in function adds a counter to each item in the list and returns this
counter along with the item. Since you only need the counter, use a single
underscore (_) for the list item. By convention, the single underscore indi-
cates a temporary or insignificant variable. It also keeps code-checking pro-
grams, such as Pylint, happy. Were you to use a variable name here, such as
infile, Pylint would complain about an unused variable.
Next, load the image, along with its pair from the night2_files list, using
OpenCV. Note that you have to convert the path to a string for the imread()
method. You’ll also want to convert the image to grayscale. This way, you’ll
need to work with only a single channel, which represents intensity. To keep
track of what’s going on during the loop, print a message to the shell indi-
cating which files are being compared.
Now, find the keypoints and their best matches. The find_best_matches()
function, which you’ll define later, will return these values as three variables:
kp1, which represents the keypoints for the first loaded image; kp2, which
represents the keypoints for the second; and best_matches, which represents
a list of the matching keypoints.
So you can visually check the matches, draw them on img1 and img2
using OpenCV’s drawMatches() method. As arguments, this method takes
each image with its keypoints, the list of best matching keypoints, and an
output image. In this case, the output image argument is set to None, as
you’re just going to look at the output, not save it to a file.
To distinguish between the two images, draw a vertical white line down
the right side of img1. First get the height and width of the image using
shape. Next, call OpenCV’s line() method and pass it the image on which
you want to draw, the start and end coordinates, the line color, and the
thickness. Note that this is a color image, so to represent white, you need
the full BGR tuple (255, 255, 255) rather than the single intensity value (255)
used in grayscale images.
Now, call the quality control function—which you’ll define later—to
display the matches. Figure 5-5 shows an example output. You may want to
comment out this line after you confirm the program is behaving correctly.
With the best keypoint matches found and checked, it’s time to register
the first image to the second. Do this with a function you’ll write later. Pass the
function the two images, the keypoints, and the list of best matches.
The blink comparator, named blink(), is another function that you’ll
write later. Call it here to see the effect of the registration process on the
first image. Pass it the original and registered images, a name for the display
window, and the number of blinks you want to perform. The function will
flash between the two images. The amount of “wiggle” you see will depend
on the amount of warping needed to match img2. This is another line you
may want to comment out after you’ve confirmed that the program runs as
intended.
Next, save the registered image into a folder named night_1_registered,
which the path3 variable points to. Start by assigning a filename variable
that references the original filename, with _registered.png appended to the
end. So you don’t repeat the file extension in the name, use index slicing
([:-4]) to remove it before adding the new ending. Finish by using imwrite()
to save the file. Note that this will overwrite existing files with the same
name without warning.
You’ll want an uncluttered view when you start looking for transients, so
call the method to destroy all the current OpenCV windows. Then call the
blink() function again, passing it the registered image, the second image,
a window name, and the number of times to loop through the images.
The first images are shown side by side in Figure 5-6. Can you find the
transient?
Figure 5-6: Blink Comparator windows for first image in night_1_registered and night_2 folders
match keypoints, generate a list of the matches, and then truncate that list by
the constant for the minimum number of acceptable keypoints. The function
returns the list of keypoints for each image and the list of best matches.
Listing 5-4: Defining the function to find the best keypoint matches
The accompanying figure (not reproduced here) shows a keypoint patch overlain by a sampling pattern, with lines connecting the paired points used to build the feature vectors.
Some example feature vectors are shown next. I’ve shortened the list of
vectors, because ORB usually compares and records 512 pairs of samples!
V1 = [010010110100101101100--snip--]
V2 = [100111100110010101101--snip--]
V3 = [001101100011011101001--snip--]
--snip--
Figure 5-8: OpenCV can match keypoints despite differences in scale and orientation.
When you create the ORB object, you can specify the number of key-
points to examine. The method defaults to 500, but 100 will be more than
enough for the image registration needed in this project.
Next, using the orb.detectAndCompute() method, find the keypoints and
their descriptors. Pass it img1 and then repeat the code for img2.
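In code, those two steps might look like the following sketch (the keypoint and descriptor variable names are assumptions):

orb = cv.ORB_create(nfeatures=100)                  # 100 keypoints is plenty here.
kp1, desc1 = orb.detectAndCompute(img1, mask=None)  # Keypoints and descriptors, image 1.
kp2, desc2 = orb.detectAndCompute(img2, mask=None)  # Keypoints and descriptors, image 2.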
With the keypoints located and described, the next step is to find the
keypoints common to both images. Start this process by creating a BFMatcher
object that includes a distance measurement. The brute-force matcher takes
the descriptor of one feature in the first image and compares it to all the
features in the second image using the Hamming distance. It returns the
closest feature.
For two strings of equal length, the Hamming distance is the number of
positions, or indexes, at which the corresponding values are different. For
the following feature vectors, the positions that don’t match are shown in
bold, and the Hamming distance is 3:
1001011001010
1100111001010
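If you want to check a Hamming distance by hand, a tiny helper like this one (not part of the project code) will do:

def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance('1001011001010', '1100111001010'))  # Prints 3.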
The bf variable will be a BFMatcher object. Call the match() method and
pass it the descriptors for the two images . Assign the returned list of
DMatch objects to a variable named matches.
The best matches will have the lowest Hamming distance, so sort the
objects in ascending order to move these to the start of the list. Note that
you use a lambda function along with the object’s distance attribute. A
lambda function is a small, one-off, unnamed function defined on the fly.
Words and characters that directly follow lambda are parameters. Expressions
come after the colon, and returns are automatic.
Since you only need the minimum number of keypoint matches
defined at the start of the program, create a new list by slicing the matches
list. The best matches are at the start, so slice from the start of matches up to
the value specified in MIN_NUM_KEYPOINT_MATCHES.
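Pulled together, the matching, sorting, and slicing might look like this sketch, where desc1 and desc2 are the descriptor arrays returned by detectAndCompute():

bf = cv.BFMatcher(cv.NORM_HAMMING, crossCheck=True)
matches = bf.match(desc1, desc2)                     # List of DMatch objects.
matches = sorted(matches, key=lambda x: x.distance)  # Smallest distances first.
best_matches = matches[:MIN_NUM_KEYPOINT_MATCHES]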
At this point, you're still dealing with arcane DMatch objects rather than anything you can view directly, so the next step is a quality control function that displays the matches.
Define the function with one parameter: the matched image. This
image was generated by the main() function in Listing 5-3. It consists of the
left and right images with the keypoints drawn as colored circles and with
colored lines connecting corresponding keypoints.
Next, call OpenCV’s imshow() method to display the window. You can
use the format() method when naming the window. Pass it the constant for
the number of minimum keypoint matches.
Complete the function by giving the user 2.5 seconds to view the window.
Note that the waitKey() method doesn’t destroy the window; it just suspends
the program for the allocated amount of time. After the wait period, new
windows will appear as the program resumes.
Registering Images
Listing 5-6 defines the function to register the first image to the second
image.
        height, width = img2.shape  # Get dimensions of image 2.
        img1_warped = cv.warpPerspective(img1, h_array, (width, height))
        return img1_warped
    else:
        print("WARNING: Number of keypoint matches < {}\n"
              .format(MIN_NUM_KEYPOINT_MATCHES))
        return img1
Define a function that takes the two input images, their keypoint lists,
and the list of DMatch objects returned from the find_best_matches() function
as arguments. Next, load the location of the best matches into NumPy arrays.
Start with a conditional to check that the list of best matches equals or
exceeds the MIN_NUM_KEYPOINT_MATCHES constant. If it does, then initialize two
NumPy arrays with as many rows as there are best matches.
The np.zeros() NumPy method returns a new array of a given shape and
data type, filled with zeros. For example, the following snippet produces a
zero-filled array three rows tall and two columns wide:
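A minimal reconstruction of that snippet, assuming a float32 data type:

import numpy as np

arr = np.zeros((3, 2), dtype=np.float32)
print(arr)
# [[0. 0.]
#  [0. 0.]
#  [0. 0.]]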
In the actual code, the arrays will be at least 50×2, since you stipulated
a minimum of 50 matches.
Now, enumerate the matches list and start populating the arrays with
actual data. For the source points, use the queryIdx.pt attribute to get the
index of the descriptor in the list of descriptors for kp1. Repeat this for the
next set of points, but use the trainIdx.pt attribute. The query/train termi-
nology is a bit confusing but basically refers to the first and second images,
respectively.
The next step is to apply homography. Homography is a transformation,
using a 3×3 matrix, that maps points in one image to corresponding points
in another image. Two images can be related by a homography if both are
viewing the same plane from a different angle or if both images are taken
from the same camera rotated around its optical axis with no shift. To run
correctly, homography needs at least four corresponding points in two
images.
Homography assumes that the matching points really are correspond-
ing points. But if you look carefully at Figures 5-5 and 5-8, you’ll see that
the feature matching isn’t perfect. In Figure 5-8, around 30 percent of the
matches are incorrect!
Fortunately, OpenCV includes a findHomography() method with an outlier
detector called random sample consensus (RANSAC). RANSAC takes random
samples of the matching points, finds a mathematical model that explains
their distribution, and favors the model that predicts the most points. It
then discards outliers. For example, consider the points in the “Raw data”
box in Figure 5-9.
As you can see, you want to fit a line through the true data points
(called the inliers) and ignore the smaller number of spurious points (the
outliers). Using RANSAC, you randomly sample a subset of the raw data
points, fit a line to these, and then repeat this process a set number of times.
Each line-fit equation would then be applied to all the points. The line that
passes through the most points is used for the final line fit. In Figure 5-9,
this would be the line in the rightmost box.
To run findHomography(), pass it the source and destination points and
call the RANSAC method. This returns a NumPy array and a mask. The mask
specifies the inlier and outlier points or the good matches and bad matches,
respectively. You can use it to do tasks like draw only the good matches.
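In code, the point arrays and the RANSAC-based homography call described here might look like this sketch (the array names src_pts and dst_pts are assumptions):

src_pts = np.zeros((len(best_matches), 2), dtype=np.float32)
dst_pts = np.zeros((len(best_matches), 2), dtype=np.float32)
for i, match in enumerate(best_matches):
    src_pts[i, :] = kp1[match.queryIdx].pt  # Matched point in the first image.
    dst_pts[i, :] = kp2[match.trainIdx].pt  # Corresponding point in the second image.
h_array, mask = cv.findHomography(src_pts, dst_pts, cv.RANSAC)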
The final step is to warp the first image so that it perfectly aligns with
the second. You’ll need the dimensions of the second image, so use shape()
to get the height and width of img2 . Pass this information, along with img1
and the homography h_array, to the warpPerspective() method. Return the
registered image, which will be a NumPy array.
If the number of keypoint matches is less than the minimum number you
stipulated at the start of the program, the image may not be properly aligned.
So, print a warning and return the original, nonregistered image. This will
allow the main() function to continue looping through the folder images
uninterrupted. If the registration is poor, the user will be aware something is
wrong as the problem pair of images won’t be properly aligned in the blink
comparator window. An error message will also appear in the shell.
Comparing 2_dim_transient_left.png to 2_dim_transient_right.png.
WARNING: Number of keypoint matches < 50

if __name__ == '__main__':
    main()
Define the blink() function with four parameters: two image files, a
window name, and the number of blinks to perform. Start a for loop with
a range set to the number of blinks. Since you don’t need access to the
running index, use a single underscore (_) to indicate the use of an insig-
nificant variable. As mentioned previously in this chapter, this will prevent
code-checking programs from raising an “unused variable” warning.
Now call OpenCV’s imshow() method and pass it the window name and
the first image. This will be the registered first image. Then pause the pro-
gram for 330 milliseconds, the amount of time recommended by Clyde
Tombaugh himself.
Repeat the previous two lines of code for the second image. Because the
two images are aligned, the only thing that will change in the window are tran-
sients. If only one image contains a transient, it will appear to blink on and off.
If both images capture the transient, it will appear to dance back and forth.
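A compact sketch of blink(), following that description:

def blink(image_1, image_2, window_name, num_loops):
    """Flash two images back and forth in the same window."""
    for _ in range(num_loops):
        cv.imshow(window_name, image_1)
        cv.waitKey(330)  # Pause 330 milliseconds.
        cv.imshow(window_name, image_2)
        cv.waitKey(330)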
End the program with the standard code that lets it run in stand-alone
mode or be imported as a module.
The third loop will show the same small transient, only this time the
second image will be brighter overall than the first. You should still be
able to find the transient, but it will be much more difficult. This is why
Tombaugh had to carefully take and develop the images to a consistent
exposure.
The fourth loop contains a single transient, shown in the left image. It
should blink on and off rather than dance back and forth as in the previous
images.
The fifth image pair represents control images with no transients. This
is what the astronomer would see almost all the time: disappointing static
star fields.
The final loop uses negative versions of the first image pair. The bright
transient appears as flashing black dots. This is the type of image Clyde
Tombaugh used, as it saved time. Since a black dot is as easy to spot as a
white one, he felt no need to print positive images for each negative.
If you look along the left side of the registered negative image, you’ll
see a black stripe that represents the amount of translation needed to align
the images (Figure 5-10). You won’t notice this on the positive images
because it blends in with the black background.
In all the loops, you may notice a dim star blinking in the upper-left
corner of each image pair. This is not a transient but a false positive caused
by an edge artifact. An edge artifact is a change to an image caused by image
misalignment. An experienced astronomer would ignore this dim star for two
reasons: it occurs very close to the edge of the image, and the possible transient
doesn't move between images but just dims.
You can see the cause of this false positive in Figure 5-11. Because only
part of a star is captured in the first frame, its brightness is reduced relative
to the same star in the second image.
Figure 5-11: Registering a truncated star in Image 1 results in a noticeably dimmer star
than in Image 2 (panels: Image 1, Image 1 registered, and Image 2)
THE OBJECTIVE
Write a Python program that takes two registered images and highlights any differences
between them.
The Strategy
Instead of an algorithm that blinks the images, you now want one that
automatically finds the transients. This process will still require registered
images, but for convenience, just use the ones already produced in Project 7.
Detecting differences between images is a common enough practice
that OpenCV ships with an absolute difference method, absdiff(), dedicated
to this purpose. It takes the per-element difference between two arrays.
But just detecting the differences isn’t enough. Your program will need to
recognize that a difference exists and show the user only the images con-
taining transients. After all, astronomers have more important things to do,
like demoting planets!
Because the objects you’re looking for rest on a black background and
matching bright objects are removed, any bright object remaining after
differencing is worth noting. And since the odds of having more than one
transient in a star field are astronomically low, flagging one or two differ-
ences should be enough to get an astronomer’s attention.
transient_detector.py, part 1:

import os
from pathlib import Path
import cv2 as cv

PAD = 5  # Ignore differences this close to the image edge.
Listing 5-8: Importing modules and assigning a constant to manage edge effects
You’ll need all the modules used in the previous project except for
NumPy, so import them here. Set the pad distance to 5 pixels. This value may
change slightly with different datasets. Later, you’ll draw a rectangle around
the edge space within the image so you can see how much area this param-
eter is excluding.
Figure 5-12: Difference image derived from the “bright transient”
input images
transient_detector.py, part 3:

def main():
    night1_files = sorted(os.listdir('night_1_registered_transients'))
    night2_files = sorted(os.listdir('night_2'))
    path1 = Path.cwd() / 'night_1_registered_transients'
    path2 = Path.cwd() / 'night_2'
    path3 = Path.cwd() / 'night_1_2_transients'
Listing 5-10: Defining main(), listing the folder contents, and assigning path variables
Define the main() function. Then, just as you did in Listing 5-2 on page
100, list the contents of the folders containing the input images and assign
their paths to variables. You’ll use an existing folder to hold images contain-
ing identified transients.
temp = diff_imgs1_2.copy()
transient1, transient_loc1 = find_transient(img1, temp, PAD)
cv.circle(temp, transient_loc1, 10, 0, -1)
Listing 5-11: Looping through the images and finding the transients
Start a for loop that iterates through the images in the night1_files list.
The program is designed to work on positive images, so use list slicing
([:-1]) to exclude the negative image. Use enumerate() to get a counter;
name it i, rather than _, since you’ll use it as an index later.
To find the differences between images, just call the cv.absdiff()
method and pass it the variables for the two images. Show the results for
two seconds before continuing the program.
Since you’re going to blank out the brightest transient, first make a
copy of diff_imgs1_2. Name this copy temp, for temporary. Now, call the
find_transient() function you wrote earlier. Pass it the first input image,
the difference image, and the PAD constant. Use the results to update the
transient variable and to create a new variable, transient_loc1, that records
the location of the brightest pixel in the difference image.
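The definition of find_transient() isn't reproduced above; a sketch consistent with its description, using OpenCV's minMaxLoc() to grab the brightest pixel and the PAD constant to ignore edge artifacts, might look like this:

def find_transient(image, diff_image, pad):
    """Circle the brightest spot in a difference image; return a flag and its location."""
    transient = False
    height, width = diff_image.shape
    cv.rectangle(image, (pad, pad), (width - pad, height - pad), 255, 1)
    min_val, max_val, min_loc, max_loc = cv.minMaxLoc(diff_image)
    if pad < max_loc[0] < width - pad and pad < max_loc[1] < height - pad:
        cv.circle(image, max_loc, 10, 255, 1)
        transient = True
    return transient, max_loc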
The transient may or may not have been captured in both images taken
on successive nights. To see if it was, obliterate the bright spot you just found
by covering it with a black circle. Do this on the temp image by using black
as the color and a line width of –1, which tells OpenCV to fill the circle.
Continue to use a radius of 10, though you can reduce this if you’re con-
cerned the two transients will be very close together.
Call the find_transient() function again but use a single underscore for
the location variable, as you won’t be using it again. It’s unlikely there’ll be
more than two transients present, and finding even one will be enough to
open the images up to further scrutiny, so don’t bother looking for more.
        out_filename = '{}_DETECTED.png'.format(night1_files[i][:-4])
        cv.imwrite(str(path3 / out_filename), blended)  # Will overwrite!
    else:
        print('\nNo transient detected between {} and {}\n'
              .format(night1_files[i], night2_files[i]))
if __name__ == '__main__':
main()
Listing 5-12: Showing the circled transients, logging the results, and saving the results
TRANSIENT DETECTED between 3_diff_exposures_left_registered.png and 3_diff_exposures_right.png
Note the white rectangle near the edges of the image. This represents
the PAD distance. Any transients outside this rectangle were ignored by the
program.
Save the blended image using the filename of the current input image
plus "DETECTED". The dim transient in Figure 5-13 would be saved as
1_bright_transient_left_registered_DETECTED.png. Write it to the night_1_2
_transients folder, using the path3 variable.
If no transients were found, document the result in the shell window.
Then end the program with the code to run it as a module or in stand-
alone mode.
Summary
In this chapter, you replicated an old-time blink comparator device and
then updated the process using modern computer vision techniques. Along
the way, you used the pathlib module to simplify working with directory
paths, and you used a single underscore for insignificant, unused variable
names. You also used OpenCV to find, describe, and match interesting
features in images, align the features with homography, blend the images
together, and write the result to a file.
Further Reading
Out of the Darkness: The Planet Pluto (Stackpole Books, 2017), by Clyde Tombaugh
and Patrick Moore, is the standard reference on the discovery of Pluto, told
in the discoverer’s own words.
Chasing New Horizons: Inside the Epic First Mission to Pluto (Picador, 2018),
by Alan Stern and David Grinspoon, records the monumental effort to
finally send a spacecraft—which, incidentally, contained Clyde Tombaugh’s
ashes—to Pluto.
You can find a solution, practice_orbital_path.py, in the appendix and in
the Chapter_5 folder.
Figure 5-14: Spot the difference between the left and right images.
The starting images can be found in the montages folder in the Chapter_5
folder, downloadable from the book’s website. These are color images that
you’ll need to convert to grayscale and align prior to object detection. You
can find solutions, practice_montage_aligner.py and practice_montage_difference
_finder.py, in the appendix and in the montages folder.
To estimate large numbers of stars, astronomers survey small regions
of the sky, use a computer program to count the stars, and then extrapolate
the results to larger areas. For this challenge project, pretend you’re an
assistant at Lowell Observatory and you’re on a survey team. Write a Python
program that counts the number of stars in the image 5_no_transient_left.png,
used in Projects 7 and 8.
For hints, search online for how to count dots in an image with Python and
OpenCV. For a solution using Python and SciPy, see http://prancer.physics.louisville.edu/astrowiki/index.php/Image_processing_with_Python_and_SciPy.
You may find your results improve if you divide the image into smaller parts.
6
WINNING THE MOON
R ACE WITH APOLLO 8
Figure 6-1: The Apollo 8 insignia, with the circumlunar free return trajectory
serving as the mission number
In the fall of 1968, the CSM engine had been tested in the earth’s orbit
only, and there were legitimate concerns about its reliability. To orbit the
moon, the engine would have to fire twice, once to slow the spacecraft to enter
lunar orbit and then again to leave orbit. With the free return trajectory, if the
first maneuver failed, the astronauts could still coast home. As it turned out,
the engine fired perfectly both times, and Apollo 8 orbited the moon 10 times.
(The ill-fated Apollo 13, however, made great use of its free return trajectory!)
This 2D simulation of the free return uses a few key values: the starting
position of the CSM (R0), the velocity and orientation of the CSM (V0), and
the phase angle between the CSM and the moon (g0). The phase angle, also
called the lead angle, is the change in the orbital time position of the CSM
required to get from a starting position to a final position. The translunar
injection velocity (V0) is a propulsive maneuver used to set the CSM on a
trajectory to the moon. It’s achieved from a parking orbit around the earth,
where the spacecraft performs internal checks and waits until the phase
angle with the moon is optimal. At this point, the third stage of the Saturn V
rocket fires and falls away, leaving the CSM to coast to the moon.
Project #9: To the Moon with Apollo 8!
As a summer intern at NASA, you’ve been asked to create a simple simulation
of the Apollo 8 free return trajectory for consumption by the press and
general public. As NASA is always strapped for cash, you’ll need to use open
source software and complete the project quickly and cheaply.
THE OBJECTIVE
Write a Python program that graphically simulates the free return trajectory proposed for
the Apollo 8 mission.
Figure 6-4: Standard turtle shapes provided with the turtle module: arrow, square,
turtle, triangle, circle, and classic
As the turtle moves, you can choose to draw a line behind it to trace its
movement (Figure 6-5).
You can use Python functionality with turtle to write more concise
code. For example, you can use a for loop to create the same pattern.
Here, steve moves forward 50 pixels and then turns to the left at a right
angle. These steps are repeated three times by the for loop.
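Here's a minimal sketch of that loop, assuming a turtle named steve as in the earlier example:

import turtle

steve = turtle.Turtle()
for _ in range(3):
    steve.forward(50)  # Move 50 pixels in the current heading.
    steve.left(90)     # Turn 90 degrees counterclockwise.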
Other turtle methods let you change the shape of the turtle, change its
color, lift the pen so no path is drawn, “stamp” its current position on the
screen, set the heading of the turtle, and get its position on the screen. Figure
6-6 shows this functionality, which is described in the script that follows.
Figure 6-6: More examples of turtle behaviors. Numbers in the figure refer to script annotations.
Leave another stamp and then put the pen down to once more draw a
path behind the turtle . Move steve forward 50 spaces and then change
his shape to a triangle . That completes the drawing.
Don’t be fooled by the simplicity of what we’ve done so far. With the
right commands, you can draw intricate designs, such as the Penrose tiling
in Figure 6-7.
Figure 6-7: A Penrose tiling produced by the turtle module demo, penrose.py
The turtle module is part of the Python Standard Library, and you can
find the official documentation at https://docs.python.org/3/library/turtle.html?highlight=turtle#module-turtle. For a quick tutorial, do an online search for
Al Sweigart’s Simple Turtle Tutorial for Python.
The Strategy
We’ve now made a strategic decision to use turtle to draw the simulation,
but how should the simulation look? For convenience, I’d suggest basing it
on Figure 6-3. You’ll start with the CSM in the same parking orbit position
around the earth (R0) and the moon at the same approximate phase angle
(g0). You can use images to represent the earth and the moon and custom
turtle shapes to build the CSM.
Another big decision at this point is whether to use procedural or
object-oriented programming (OOP). When you plan to generate multiple
objects that behave similarly and interact with each other, OOP is a good
choice. You can use an OOP class as a blueprint for the earth, the moon,
and the CSM objects and automatically update the object attributes as the
simulation runs.
You can run the simulation using time steps. Basically, each program
loop will represent one unit of dimensionless time. With each loop, you’ll
need to calculate each object’s position and update (redraw) it on the
screen. This requires solving the three-body problem. Fortunately, not only
has someone done this already, they’ve done it using turtle.
Python modules often include example scripts to show you how to use
the product. For instance, the matplotlib gallery includes code snippets and
tutorials for making a huge number of charts and plots. Likewise, the turtle
module comes with turtle-example-suite, which includes demonstrations of
turtle applications.
One of the demos, planet_and_moon.py, provides a nice “recipe” for han-
dling a three-body problem in turtle (Figure 6-8). To see the demos, open
a PowerShell or terminal window and enter python -m turtledemo. Depending
on your platform and how many versions of Python you have installed, you
may need to use python3 -m turtledemo.
You’ll need to import four helper classes from turtle. You’ll use the
Shape class to make a custom turtle that looks like the CSM. The Screen sub-
class makes the screen, called a drawing board in turtle parlance. The Turtle
subclass creates the turtle objects. The Vec2D import is a two-dimensional
vector class. It will help you define velocity as a vector of magnitude and
direction.
Next, assign some variables that the user may want to tweak later.
Start with the gravitational constant, used in Newton’s gravity equations
to ensure the units come out right. Assign it 8, the value used in the turtle
demo. Think of this as a scaled gravitational constant. You can’t use the true
constant, as the simulation doesn’t use real-world units.
You’ll run the simulation in a loop, and each iteration will represent a
time step. With each step, the program will recalculate the position of the
CSM as it moves through the gravity fields of the earth and the moon. The
value of 4100, arrived at by trial and error, will stop the simulation just after
the spacecraft arrives back on the earth.
In 1968, a round-trip to the moon took about six days. Since you’re
incrementing the time unit by 0.001 with each loop and running 4,100 loops,
this means a time step in the simulation represents about two minutes of
time in the real world. The longer the time step, the faster the simulation
but the less accurate the results, as small errors compound over time. In
actual flight path simulations, you can optimize the time step by first running
a small step, for maximum accuracy, and then using the results to find the
largest time step that yields a similar result.
The next two variables, Ro_X and Ro_Y, represent the (x, y) coordinates of
the CSM at the time of the translunar injection (see Figure 6-3). Likewise,
Vo_X and Vo_Y represent the x- and y-direction components of the translunar
injection velocity, which is applied by the third stage of the Saturn V rocket.
These values started out as best guesses and were refined with repeated
simulations.
class GravSys():
    def __init__(self):
        self.bodies = []
        self.t = 0
        self.dt = 0.001
Listing 6-2: Defining a class to manage the bodies in the gravity system
The GravSys class defines how long the simulation will run, how much
time will pass between time steps (loops), and what bodies will be involved.
It also calls the step() method of the Body class you’ll define in Listing 6-3.
This method will update each body’s position as a result of gravitational
acceleration.
Define the initialization method and, as per convention, pass it self as
a parameter. The self parameter represents the GravSys object you’ll create
later in the main() function.
Create an empty list named bodies to hold the earth, the moon, and the
CSM objects. Then assign attributes for when the simulation starts and the
amount to increment time with each loop, known as delta time or dt. Set the
starting time to 0 and set the dt time step to 0.001. As discussed in the previ-
ous section, this time step will correspond to about two minutes in the real
world and will produce a smooth, accurate, and fast simulation.
The last method controls the time steps in the simulation . It uses a for
loop with the range set to the NUM_LOOPS variable. Use a single underscore (_)
rather than i to indicate the use of an insignificant variable (see Listing 5-3
in Chapter 5 for details).
With each loop, increment the gravity system’s time variable by dt. Then,
apply the time shift to each body by looping through the list of bodies and
calling the body.step() method, which you’ll define later within the Body
class. This method updates the position and velocity of the bodies due to
gravitational attraction.
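A sketch of that method, consistent with the description:

    def sim_loop(self):
        """Loop the bodies in the gravity system through the time steps."""
        for _ in range(NUM_LOOPS):
            self.t += self.dt
            for body in self.bodies:
                body.step()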
        gravsys.bodies.append(self)
        #self.resizemode("user")
        #self.pendown()  # Uncomment to draw path behind object.
Listing 6-3: Defining a class to create objects for the earth, the moon, and the CSM
Define a new class by using the Turtle class as its ancestor. This means
the Body class will conveniently inherit all the Turtle class’s methods and
attributes.
Next, define an initializer method for the body object. You’ll use this
to create new Body objects in the simulation, a process called instantiation
in OOP. As parameters, the initialize method takes itself, a mass attribute,
a starting location, a starting velocity, the gravity system object, and a shape.
The super() function lets you invoke the method of a superclass to gain
access to inherited methods from the ancestor class. This allows your Body
objects to use attributes from the prebuilt Turtle class. Pass it the shape attri-
bute, which will allow you to pass a custom shape or image to your bodies
when you build them in the main() function.
Next, assign an instance attribute for the gravsys object. This will allow
the gravity system and body to interact. Note that it’s best to initialize attri-
butes through the __init__() method, as we do in this case, since it’s the first
method called after the object is created. This way, these attributes will be
immediately available to any other methods in the class, and other develop-
ers can see a list of all the attributes in one place.
The following penup() method of the Turtle class will remove the drawing
pen so the object doesn’t leave a path behind it as it moves. This gives you
the option of running the simulation with and without visible orbital paths.
Initialize a mass attribute for the body. You’ll need this to calculate the
force of gravity. Next, assign the body’s starting position using the setpos()
method of the Turtle class. The starting position of each body will be an
(x, y) tuple. The origin point (0, 0) will be at the center of the screen. The
x-coordinate increases to the right, and the y-coordinate increases upward.
Assign an initialization attribute for velocity. This will hold the starting
velocity for each object. For the CSM, this value will change throughout the sim-
ulation as the ship moves through the gravity fields of the earth and the moon.
As each body is instantiated, use dot notation to append it to the list
of bodies in the gravity system. You’ll create the gravsys object from the
GravSys() class in the main() function.
The final two lines, commented out, allow the user to change the simu-
lation window size and choose to draw a path behind each object. Start
out with a full-screen display and keep the pen in the up position to let the
simulation run quickly.
Still within the Body class, define the acceleration method, called acc(),
and pass it self. Within the method, name a local variable a, again for accel-
eration, and assign it to a vector tuple using the Vec2D helper class. A 2D
vector is a pair of real numbers (a, b), which in this case represent x and y
components, respectively. The Vec2D helper class enforces rules that permit
easy mathematical operations using vectors, as follows:
• (a, b) + (c, d) = (a + c, b + d)
• (a, b) – (c, d) = (a – c, b – d)
• (a, b) × (c, d) = ac + bd
Next, start looping through the items in the bodies list, which contains
the earth, the moon, and the CSM. You’ll use the gravitational force of
each body to determine the acceleration of the object for which you’re call-
ing the acc() method. It doesn’t make sense for a body to accelerate itself, so
exclude the body if it’s the same as self.
To calculate gravitational acceleration (stored in the g variable) at a
point in space, you’ll use the following formula:
g = \frac{GM}{r^2}\,\hat{r}
where M is the mass of the attracting body, r is the distance (radius) between
bodies, G is the gravitational constant you defined earlier, and r̂ is the unit
vector from the center of mass of the attracting body to the center of mass
of the body being accelerated. The unit vector, also known as the direction
vector or normalized vector, can be described as r/|r|.
Within the loop, you have to calculate the distance between bodies by using the turtle pos()
method to get each body’s current position as a Vec2D vector. As described
previously, this is a tuple of the (x, y) coordinates.
You’ll then input that tuple into the acceleration equation. Each time
you loop through a new body, you’ll change the a variable based on the
gravitational pull of the body being examined. For example, while the
earth’s gravity may slow the CSM, the moon’s gravity may pull in the oppo-
site direction and cause it to speed up. The a variable will capture the net
effect at the end of the loop. Complete the method by returning a.
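Putting the loop and the formula together, acc() might look like the following sketch; it assumes the gravitational constant is named G, the mass attribute is self.mass, and Vec2D was imported as Vec, as in the moon instantiation shown later:

    def acc(self):
        """Return the net gravitational acceleration on this body from all other bodies."""
        a = Vec(0, 0)
        for body in self.gravsys.bodies:
            if body != self:
                r = body.pos() - self.pos()            # Vector from this body to the other.
                a += (G * body.mass / abs(r)**3) * r   # G*M/r**2 in the direction of r.
        return a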
Listing 6-5: Applying the time step and rotating the CSM
Defining main(), Setting Up the Screen, and Instantiating the Gravity System
You used object-oriented programming to build the gravity system and
the bodies within it. To run the simulation, you’ll return to procedural
programming and use a main() function. This function sets up the turtle
graphics screen, instantiates objects for the gravity system and the three
bodies, builds a custom shape for the CSM, and calls the gravity system’s
sim_loop() method to walk through the time steps.
Listing 6-6 defines main() and sets up the screen. It also creates a gravity
system object to manage your mini solar system.
gravsys = GravSys()
Listing 6-6: Setting up the screen and making a gravsys object in main()
image_moon = 'moon_27x27.gif'
screen.register_shape(image_moon)
moon = Body(32000, (344, 42), Vec(-27, 147), gravsys, image_moon)
moon.pencolor('gray')
Start by assigning the image of the earth, which is included in the folder
for this project, to a variable. Note that images should be gif files and cannot
be rotated to show the turtle’s heading. So that turtle recognizes the new
shape, add it to the TurtleScreen shapelist using the screen.register_shape()
method. Pass it the variable that references the earth image.
Now it’s time to instantiate the turtle object for the earth. You call the
Body class and pass it the arguments for mass, starting position, starting
velocity, gravity system, and turtle shape—in this case, the image. Let’s talk
about each of these arguments in more detail.
You’re not using real-world units here, so mass is an arbitrary number. I
started with the value used for the sun in the turtle demo planet_and_moon.py,
on which this program is based.
The starting position is an (x, y) tuple that places the earth near the
center of the screen. It’s biased downward 25 pixels, however, as most of the
action will take place in the upper quadrant of the screen. This placement
will provide a little more room in that region.
The starting velocity is a simple (x, y) tuple provided as an argument to
the Vec2D helper class. As discussed previously, this will allow later methods
to alter the velocity attribute using vector arithmetic. Note that the earth’s
velocity is not (0, 0), but (0, -2.5). In real life and in the simulation, the
moon is massive enough to affect the earth so that the center of gravity
between the two is not at the center of the earth, but farther out. This will
cause the earth turtle to wobble and shift positions in a distracting man-
ner during the simulation. Because the moon will be in the upper part of
the screen during simulation, shifting the earth downward a small amount
each time step will dampen the wobbling.
The last two arguments are the gravsys object you instantiated in the
previous listing and the image variable for the earth. Passing gravsys means
the earth turtle will be added to the list of bodies and included in the sim_loop()
class method.
Note that if you don’t want to use a lot of arguments when instantiat-
ing an object, you can change an object’s attributes after it’s created. For
example, when defining the Body class, you could’ve set self.mass = 0, rather
than using an argument for mass. Then, after instantiating the earth body,
you could reset the mass value using earth.mass = 1000000.
Because the earth wobbles a little, its orbital path will form a tight circle
at the top of the planet. To hide it in the polar cap, use the turtle pencolor()
method and set the line color to white.
Finish the earth turtle with code that delays the start of the simula-
tion and prevents the various turtles from flashing on the screen as the
program first draws and resizes them. The getscreen() method returns the
TurtleScreen object the turtle is drawing on. TurtleScreen methods can then
be called for that object. In the same line, call the tracer() method that
turns the turtle animation on or off and sets a delay for drawing updates.
The n parameter determines the number of times the screen updates. A
value of 0 means the screen updates with every loop; larger values progres-
sively repress the updates. This can be used to accelerate the drawing of
complex graphics, but at the cost of image quality. The second argument
sets a delay value, in milliseconds, between screen updates. Increasing the
delay slows the animation.
You’ll build the moon turtle in a similar fashion to the one for the earth.
Start by assigning a new variable to hold the moon image . The moon’s mass
is only a few percent of the earth’s mass, so use a much smaller value for the
moon. I started out with a mass of around 16,000 and tweaked the value until
the CSM’s flight path produced a visually pleasing loop around the moon.
The moon’s starting position is controlled by the phase angle shown
in Figure 6-3. Like this figure, the simulation you’re creating here is not to
scale. Although the earth and moon images will have the correct relative
sizes, the distance between the two is smaller than the actual distance, so
the phase angle will need to be adjusted accordingly. I’ve reduced the dis-
tance in the model because space is big. Really big. If you want to show the
simulation to scale and fit it all on your computer monitor, then you must
settle for a ridiculously tiny earth and moon (Figure 6-10).
Figure 6-10: Earth and moon system at closest approach, or perigee, shown to scale
To keep the two bodies recognizable, you’ll instead use larger, properly
scaled images but reduce the distance between them (Figure 6-11). This
configuration will be more relatable to the viewer and still allow you to
replicate the free return trajectory.
Because the earth and the moon are closer together in the simulation,
the moon’s orbital velocity will be faster than in real life, as per Kepler’s
second law of planetary motion. To compensate for this, the moon’s starting
position is designed to reduce the phase angle compared to that shown in
Figure 6-3.
Finally, you’ll want the option to draw a line behind the moon to trace
its orbit. Use the turtle pencolor() method and set the line color to gray.
NOTE Parameters such as mass, initial position, and initial velocity are good candidates
for global constants. Despite this, I chose to enter them as method arguments to avoid
overloading the user with too many input variables at the start of the program.
Name a variable csm and call the turtle Shape class. Pass it 'compound',
indicating you want to build the shape using multiple components.
The first component will be the command module. Name a variable cm
and assign it to a tuple of coordinate pairs, known as a polygon type in turtle.
These coordinates build a triangle, as shown in Figure 6-12.
Figure 6-12: Vertex coordinates of the CSM compound shape, including (0, 30), (–60, 30),
(–90, 20), and (–90, –20)
Add this triangle component to the csm shape using the addcomponent()
method, called with dot notation. Pass it the cm variable, a fill color, and an
outline color. Good fill colors are white, silver, gray, or red.
Repeat this general process for the service module rectangle. Set the
outline color to black when you add the component to delineate the service
and command modules (see Figure 6-12).
Use another triangle for the nozzle, also called the engine bell. Add the
component and then register the new csm compound shape to the screen.
Pass the method a name for the shape and then the variable referencing
the shape.
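A sketch of those shape-building calls inside main(); the vertex coordinates below are placeholders rather than the book's exact values:

    csm = Shape('compound')
    cm = ((0, 30), (0, -30), (30, 0))                        # Command module triangle (placeholder).
    csm.addcomponent(cm, 'white', 'red')
    sm = ((0, 30), (0, -30), (-60, -30), (-60, 30))          # Service module rectangle (placeholder).
    csm.addcomponent(sm, 'white', 'black')
    nozzle = ((-60, 10), (-90, 20), (-90, -20), (-60, -10))  # Engine bell (placeholder).
    csm.addcomponent(nozzle, 'silver', 'black')
    screen.register_shape('csm', csm)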
    gravsys.sim_loop()

if __name__ == '__main__':
    main()
Listing 6-9: Instantiating a CSM turtle, calling the simulation loop and main()
Create a turtle named ship to represent the CSM. The starting position
is an (x, y) tuple that places the CSM in a parking orbit directly below the
earth on the screen. I first approximated the proper height for the parking
orbit (R0 in Figure 6-3) and then fine-tuned it by repeatedly running the
screen.bye()
Figure 6-13: The simulation run with the pen up and the CSM approaching the moon
To trace the journey of the CSM, go to the definition of the Body class
and uncomment this line:
You should now see the figure-eight shape of the free return trajectory
(Figure 6-14).
Figure 6-14: The simulation run with the pen down and the CM
at splashdown in the Pacific
Figure 6-15: The gravitational slingshot maneuver achieved with Vo_X = 520
Summary
In this chapter, you learned how to use the turtle drawing program, includ-
ing how to make customized turtle shapes. You also learned how to use
Python to simulate gravity and solve the famous three-body problem.
Further Reading
Apollo 8: The Thrilling Story of the First Mission to the Moon (Henry Holt and
Co., 2017), by Jeffrey Kluger, covers the historic Apollo 8 mission from its
unlikely beginning to its “unimaginable triumph.”
An online search for PBS Nova How Apollo 8 Left Earth Orbit should
return a short video clip on the Apollo 8 translunar injection maneuver,
marking the first time humans left the earth’s orbit and traveled to another
celestial body.
NASA Voyager 1 & 2 Owner’s Workshop Manual (Haynes, 2015), by
Christopher Riley, Richard Corfield, and Philip Dolling, provides interest-
ing background on the three-body problem and Michael Minovitch’s many
contributions to space travel.
The Wikipedia Gravity assist page contains lots of interesting anima-
tions of various gravity-assist maneuvers and historic planetary flybys that
you can reproduce with your Apollo 8 simulation.
Chasing New Horizons: Inside the Epic First Mission to Pluto (Picador, 2018),
by Alan Stern and David Grinspoon, documents the importance—and
ubiquity—of simulations in NASA missions.
Figure 6-16: Two screenshots from practice_search_pattern.py
For fun, add a helicopter turtle and orient it properly for each pass.
Also add a randomly positioned sailor turtle, stop the simulation when the
sailor is found, and post the joyous news to the screen (Figure 6-17).
Figure 6-19: The moon and CSM cross orbits, and the moon slows and turns the CSM.
Challenge Project: True-Scale Simulation
Rewrite apollo_8_free_return.py so that the earth, the moon, and the distance
between them are all accurately scaled, as shown in Figure 6-10. Use colored
circles, rather than images, for the earth and the moon and make the CSM
invisible (just draw a line behind it). Use Table 6-2 to help determine the
relative sizes and distances to use.
Figure 7-1: Scaled comparison of the 1997 Mars Pathfinder landing footprint (200 × 100 km, left) with Southern California (right)
The 2018 InSight lander had a landing ellipse of only 130 km × 27 km.
The probability of the probe landing somewhere within that ellipse was
around 99 percent.
The MOLA Map
To identify suitable landing spots, you’ll need a map of Mars. Between 1997
and 2001, a tool aboard the Mars Global Surveyor (MGS) spacecraft shined
a laser on Mars and timed its reflection 600 million times. From these mea-
surements, researchers led by Maria Zuber and David Smith produced a
detailed global topography map known as MOLA (Figure 7-2).
THE OBJECTIVE
Write a Python program that uses an image of the MOLA map to choose the 20 safest
670 km × 335 km regions near the Martian equator from which to select landing ellipses
for the Orpheus lander.
The Strategy
First, you’ll need a way to divide the MOLA digital map into rectangular
regions and extract statistics on elevation and surface roughness. This means
you’ll be working with pixels, so you’ll need imaging tools. And since NASA
is always containing costs, you’ll want to use free, open source libraries like
OpenCV, the Python Imaging Library (PIL), tkinter, and NumPy. For an over-
view and installation instructions, see “Installing the Python Libraries” on
page 6 for OpenCV and NumPy, and see “The Word Cloud and PIL Modules”
on page 65 for PIL. The tkinter module comes preinstalled with Python.
To honor the elevation constraints, you can simply calculate the aver-
age elevation for each region. For measuring how smooth a surface is at
a given scale, you have lots of choices, some of them quite sophisticated.
Besides basing smoothness on elevation data, you can look for differential
shadowing in stereo images; the amount of scattering in radar, laser, and
microwave reflections; thermal variations in infrared images; and so on.
Many roughness estimates involve tedious analyses along transects, which
are lines drawn on the planet’s surface along which variations in height are
measured and scrutinized. Since you’re not really a summer intern with
three months to burn, you’re going to keep things simple and use two com-
mon measurements that you’ll apply to each rectangular region: standard
deviation and peak-to-valley.
Standard deviation, also called root-mean-square by physical scientists, is
a measure of the spread in a set of numbers. A low standard deviation indi-
cates that the values in a set are close to the average value; a high standard
deviation indicates they are spread out over a wider range. A map region with
a low standard deviation for elevation means that the area is flattish, with
little variance from the average elevation value.
Technically, the standard deviation for a population of samples is the
square root of the average of the squared deviations from the mean, repre-
sented by the following formula:
\[ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(h_i - h_0\right)^2} \]
where h_i is the elevation of sample i, h_0 is the mean elevation, and N is the number of samples.
The peak-to-valley statistic is the difference between the highest and lowest
elevation values for the surface. This is important, as a surface may have a
relatively low standard deviation—suggesting smoothness—yet contain a
significant hazard, as shown in the cross section in Figure 7-3.
Figure 7-3: A surface profile (black line) with standard deviation (StD = 0.694) and peak-to-valley
(PV = 6) statistics
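As a quick check on these two statistics, NumPy computes both directly. Here's a minimal sketch with made-up profile heights:

import numpy as np

profile = np.array([1, 2, 3, 2, 1, 0, -1, -3, -4, -2, 1])  # made-up elevation samples
print(np.std(profile))  # standard deviation: spread about the mean
print(np.ptp(profile))  # peak-to-peak (peak-to-valley): max minus min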
You can use the standard deviation and peak-to-valley statistics as compar-
ative metrics. For each rectangular region, you’re looking for the lowest values
of each statistic. And because each statistic records something slightly differ-
ent, you’ll find the best 20 rectangular regions based on each statistic and
then select only the rectangles that overlap to find the best rectangles overall.
Figure 7-4: Mars MGS MOLA Digital Elevation Model 463m v2 (mola_1024x501.png)
NOTE The MOLA map comes in multiple file sizes and resolutions. You’re using the smallest
size here to speed up the download and run times.
Load the MOLA image twice: once with the cv.IMREAD_GRAYSCALE flag, for the
analysis, and once without the flag to load the color image for the final
display. Then add constants for the rectangle size. In the next listing,
you'll convert these dimensions to pixels for use with the map image.
Next, to ensure the rectangles target smooth areas at low elevations, you
should limit the search to lightly cratered, flat terrain. These regions are
believed to represent old ocean bottoms. Thus, you’ll want to set the maxi-
mum elevation limit to a grayscale value of 55, which corresponds closely to
the areas thought to be remnants of ancient shorelines (see Figure 7-5).
Figure 7-5: MOLA map with pixel values ≤ 55 colored black to represent
ancient Martian oceans
screen = tk.Tk()
canvas = tk.Canvas(screen, width=IMG_WIDTH, height=IMG_HT + 130)
Listing 7-2: Assigning derived constants and setting up the tkinter screen
Latitude values start at 0° at the equator and end at 90° at the poles.
To find 30° north, all you need to do is divide the image height by 3 . To
get to 30° south, double the number of pixels it took to get to 30° north.
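In code, the derived latitude constants described here might look like the following sketch (constant names are assumptions; IMG_HT is the pixel height of the grayscale MOLA image):

LAT_30_N = int(IMG_HT / 3)   # 30 degrees north is one-third of the way down the map
LAT_30_S = LAT_30_N * 2      # 30 degrees south is two-thirds of the way down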
Restricting the search to the equatorial region of Mars has a beneficial
side effect. The MOLA map you’re using is based on a cylindrical projection,
used to transfer the surface of a globe onto a flat plane. This causes converg-
ing lines of longitude to be parallel, badly distorting features near the poles.
You may have noticed this on wall maps of the earth, where Greenland looks
like a continent and Antarctica is impossibly huge (see Figure 7-7).
Fortunately, this distortion is minimized near the equator, so you don’t
have to factor it into the rectangle dimensions. You can verify this by check-
ing the shape of craters on the MOLA map. So long as they’re nice and
circular—rather than oval—projection-related effects can be ignored.
Figure 7-7: Forcing lines of longitude to be parallel distorts the size of features
near the poles.
The program will draw this first rectangle, number it, and calculate the
elevation statistics within it. It will then move the rectangle eastward and
repeat the process. How far you move the rectangle each time is defined by
the STEP_X and STEP_Y constants and depends on something called aliasing.
Aliasing is a resolution issue. It occurs when you don’t take enough
samples to identify all the important surface features in an area. This can
cause you to “skip over” a feature, such as a crater, and fail to recognize
it. For example, in Figure 7-9A, there’s a suitably smooth landing ellipse
between two large craters. However, as laid out in Figure 7-9B, no rectan-
gular region corresponds to this ellipse; both rectangles in the vicinity
partially sample a crater rim. As a result, none of the drawn rectangles
contains a suitable landing ellipse, even though one exists in the vicinity.
Acceptable
A landing ellipse
Rectangular region
C shifted 1/2 rect width
Rectangular region
B
The rule of thumb to avoid aliasing effects is to make the step size less
than or equal to half the width of the smallest feature you want to identify.
For this project, use half the rectangle width so the displays don’t become
too busy.
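Assuming the rectangle dimensions have already been converted to pixels, the step constants might be set like this sketch (names are assumptions):

STEP_X = int(RECT_WIDTH / 2)   # slide the rectangle east by half its width
STEP_Y = int(RECT_HT / 2)      # drop down by half its height per row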
Now it’s time to look ahead to the final display. Create a screen instance
of the tkinter Tk() class . The tkinter application is Python’s wrapper of
the GUI toolkit Tk, originally written in a computer language called TCL.
It needs the screen window to link to an underlying tcl/tk interpreter that
translates tkinter commands into tcl/tk commands.
Next, create a tkinter canvas object. This is a rectangular drawing area
designed for complex layouts of graphics, text, widgets, and frames. Pass it
the screen object, set its width equal to the MOLA image, and set its height
equal to the height of the MOLA image plus 130. The extra padding beneath
the image will hold the text summarizing the statistics for the displayed
rectangles.
It’s more typical to place the tkinter code just described at the end of
programs, rather than at the beginning. I chose to put it near the top to
make the code explanation easier to follow. You can also embed this code
within the function that makes the final display. However, this can cause
problems for macOS users. For macOS 10.6 or newer, the Apple-supplied
Tcl/Tk 8.5 has serious bugs that can cause application crashes (see https://
www.python.org/download/mac/tcltk/).
Defining and Initializing a Search Class
Listing 7-3 defines a class that you’ll use to search for suitable rectangular
regions. It then defines the class’s __init__() initialization method, used
to instantiate new objects. For a quick overview of OOP, see “Defining the
Search Class” on page 10, where you also define a search class.
Define a class called Search. Then define the __init__() method used to
create new objects. The name parameter will allow you to give a personalized
name to each object when you create it later in the main() function.
Now you’re ready to start assigning attributes. Start by linking the
object’s name with the argument you’ll provide when you create the object.
Then assign four empty dictionaries to hold important statistics for each
rectangle . These include the rectangle’s corner-point coordinates and its
mean elevation, peak-to-valley, and standard deviation statistics. For a key,
all these dictionaries will use consecutive numbers, starting with 1. You’ll
want to filter the statistics to find the lowest values, so set up two empty lists
to hold these . Note that I use the term ptp, rather than ptv, to represent
the peak-to-valley statistic. That’s to be consistent with the NumPy built-in
method for this calculation, which is called peak-to-peak.
At the end of the program, you’ll place rectangles that occur in both
the sorted standard deviation and peak-to-valley lists in a new list named
high_graded_rects. This list will contain the numbers of the rectangles with
the lowest combined scores. These rectangles will be the best places to look
for landing ellipses.
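Putting that description together, a sketch of the class and its __init__() method might look like this (attribute names follow the description; the exact listing may differ):

class Search():
    """Read image and identify landing rectangles based on input criteria."""

    def __init__(self, name):
        self.name = name
        # Statistics for each rectangle, keyed by consecutive rectangle numbers.
        self.rect_coords = {}
        self.rect_means = {}
        self.rect_ptps = {}
        self.rect_stds = {}
        # Rectangles with the lowest peak-to-valley and standard deviation values.
        self.ptp_filtered = []
        self.std_filtered = []
        self.high_graded_rects = []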
while True:
    rect_img = IMG_GRAY[ul_y : lr_y, ul_x : lr_x]
    self.rect_coords[rect_num] = [ul_x, ul_y, lr_x, lr_y]
    if np.mean(rect_img) <= MAX_ELEV_LIMIT:
        self.rect_means[rect_num] = np.mean(rect_img)
        self.rect_ptps[rect_num] = np.ptp(rect_img)
        self.rect_stds[rect_num] = np.std(rect_img)
    rect_num += 1
    ul_x += STEP_X
    lr_x = ul_x + RECT_WIDTH
    if lr_x > IMG_WIDTH:
        ul_x = 0
        ul_y += STEP_Y
        lr_x = RECT_WIDTH
        lr_y += STEP_Y
        if lr_y > LAT_30_S + STEP_Y:
            break
If the rectangle passes the elevation test, populate the three dictionar-
ies with the coordinates, peak-to-valley, and standard deviation statistics,
as appropriate. Note that you can perform the calculation as part of the
process, using np.ptp for peak-to-valley and np.std for standard deviation.
Next, advance the rect_num variable by 1 and move the rectangle. Move
the upper-left x-coordinate by the step size and then shift the lower-right
x-coordinate by the width of the rectangle. You don’t want the rectangle
to extend past the right side of the image, so check whether lr_x is greater
than the image width . If it is, set the upper-left x-coordinate to 0 to move
the rectangle back to the starting position on the left side of the screen.
Then move its y-coordinates down so that the new rectangles move along a
new row. If the bottom of this new row is more than half a rectangle height
below 30° south latitude, you’ve fully sampled the search area and can end
the loop .
Between 30° north and south latitude, the image is bounded on both
sides by relatively high, cratered terrain that isn’t suitable for a landing site
(see Figure 7-6). Thus, you can ignore the final step that shifts the rectan-
gle by one-half its width. Otherwise, you would need to add code that wraps
a rectangle from one side of the image to the other and calculates the statis-
tics for each part. We’ll take a closer look at this situation in the final chal-
lenge project at the end of the chapter.
NOTE When you draw something on an image, such as a rectangle, the drawing becomes
part of the image. The altered pixels will be included in any NumPy analyses you run,
so be sure to calculate any statistics before you annotate the image.
Listing 7-5: Drawing all the rectangles on the MOLA map as a quality control step
--snip--
If you compare Figure 7-10 to Figure 7-8, you may notice that the rect-
angles appear smaller than expected. This is because you stepped the
rectangles across and down the image using half the rectangle width and
height so that they overlap each other.
Listing 7-6: Sorting and high grading the rectangles based on their statistics
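Conceptually, the sorting and high grading boil down to something like this sketch (the method name and the NUM_CANDIDATES constant, presumably 20, are assumptions):

    def sort_stats(self):
        """Sort dictionaries by value and keep the rectangles in both top lists."""
        ptp_sorted = sorted(self.rect_ptps.items(), key=lambda x: x[1])
        std_sorted = sorted(self.rect_stds.items(), key=lambda x: x[1])
        self.ptp_filtered = [k for k, v in ptp_sorted[:NUM_CANDIDATES]]
        self.std_filtered = [k for k, v in std_sorted[:NUM_CANDIDATES]]
        # Keep only the rectangles that appear in both filtered lists.
        for k in self.std_filtered:
            if k in self.ptp_filtered:
                self.high_graded_rects.append(k)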
return img_copy
Listing 7-7: Drawing filtered rectangles and latitude lines on MOLA map
Start by defining the method, which in this case takes multiple argu-
ments. Besides self, the method will need a loaded image and a list of
rectangle numbers. Use a local variable to copy the image and then start
looping through the rectangle numbers in the filtered_rect_list. With each
loop, draw a rectangle by using the rectangle number to access the corner
coordinates in the rect_coords dictionary.
So you can tell one rectangle from another, use OpenCV's putText()
method to post the rectangle number in the bottom-left corner of each
rectangle. It needs the image, the text (as a string), the coordinates of the
rectangle's bottom-left corner (its upper-left x- and lower-right y-values),
a font, a scale, a color, and a line width.
Next, draw the annotated latitude limits, starting with the text for 30°
north . Then draw the line using OpenCV’s line() method. It takes as
arguments an image, a pair of (x, y) coordinates for the start and end of the
line, a color, and a thickness. Repeat these basic instructions for 30° south
latitude.
End the method by returning the annotated image. The best rect-
angles, based on the peak-to-valley and standard deviation statistics, are
shown in Figures 7-11 and 7-12, respectively.
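Assembled from those steps, the method might look something like this sketch (not the exact listing; LAT_30_N and LAT_30_S are assumed constants for the latitude-limit pixel rows):

    def draw_filtered_rects(self, img, filtered_rect_list):
        """Draw rectangles and latitude limits on an image; return annotated copy."""
        img_copy = img.copy()
        for k in filtered_rect_list:
            cv.rectangle(img_copy,
                         (self.rect_coords[k][0], self.rect_coords[k][1]),
                         (self.rect_coords[k][2], self.rect_coords[k][3]),
                         (255, 0, 0), 1)
            cv.putText(img_copy, str(k),
                       (self.rect_coords[k][0] + 1, self.rect_coords[k][3] - 1),
                       cv.FONT_HERSHEY_PLAIN, 0.65, (255, 0, 0), 1)
        # Annotate the 30 degree north and south latitude limits.
        cv.putText(img_copy, '30 N', (10, LAT_30_N + 7),
                   cv.FONT_HERSHEY_PLAIN, 1, 255)
        cv.line(img_copy, (0, LAT_30_N), (IMG_WIDTH, LAT_30_N), (255, 0, 0), 1)
        cv.putText(img_copy, '30 S', (10, LAT_30_S + 7),
                   cv.FONT_HERSHEY_PLAIN, 1, 255)
        cv.line(img_copy, (0, LAT_30_S), (IMG_WIDTH, LAT_30_S), (255, 0, 0), 1)
        return img_copy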
These two figures show the top 20 rectangles for each statistic. That
doesn’t mean they always agree. The rectangle with the lowest standard
deviation may not appear in the peak-to-valley figure due to the presence of
a single small crater. To find the flattest, smoothest rectangles, you need to
identify the rectangles that appear in both figures and show them in their
own display.
Figure 7-11: The 20 rectangles with the lowest peak-to-valley scores
img_color_rects = self.draw_filtered_rects(IMG_COLOR,
                                           self.high_graded_rects)

txt_x = 5
txt_y = IMG_HT + 20
for k in self.high_graded_rects:
    canvas.create_text(txt_x, txt_y, anchor='w', font=None,
                       text="rect={} mean elev={:.1f} std={:.2f} ptp={}"
                       .format(k, self.rect_means[k], self.rect_stds[k],
                               self.rect_ptps[k]))
    txt_y += 15
    if txt_y >= int(canvas.cget('height')) - 10:
        txt_x += 300
        txt_y = IMG_HT + 20

canvas.pack()
screen.mainloop()
Listing 7-8: Making the final display using the color MOLA map
After defining the method, give the tkinter screen window a title that
links to the name of your search object.
Then, to make the final color image for display, name a local variable
img_color_rects and call the draw_filtered_rects() method. Pass it the color
MOLA image and the list of high-graded rectangles. This will return the
colored image with the final rectangles and latitude limits.
Before you can post this new color image in the tkinter canvas, you need
to convert the colors from OpenCV’s Blue-Green-Red (BGR) format to the
Red-Green-Blue (RGB) format used by tkinter. Do this with the OpenCV
cvtColor() method. Pass it the image variable and the COLOR_BGR2RGB flag .
Name the result img_converted.
At this point, the image is still a NumPy array. To convert to a tkinter-
compatible photo image, you need to use the PIL ImageTk module’s PhotoImage
class and the Image module’s fromarray() method. Pass the method the RGB
image variable you created in the previous step.
With the image finally tkinter ready, place it in the canvas using the
create_image() method. Pass the method the coordinates of the upper-left
corner of the canvas (0, 0), the converted image, and a northwest anchor
direction.
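In code, those three steps might look like this sketch (it assumes from PIL import Image, ImageTk near the top of the program):

img_converted = cv.cvtColor(img_color_rects, cv.COLOR_BGR2RGB)
img_converted = ImageTk.PhotoImage(Image.fromarray(img_converted))
canvas.create_image(0, 0, image=img_converted, anchor=tk.NW)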
Now all that’s left is to add the summary text. Start by assigning coor-
dinates for the bottom-left corner of the first text object . Then begin
looping through the rectangle numbers in the high-graded rectangle list.
Use the create_text() method to place the text in the canvas. Pass it a pair
of coordinates, a left-justified anchor direction, the default font, and a text
string. Get the statistics by accessing the different dictionaries using the
rectangle number, designated k for “key.”
Increment the text box’s y-coordinate by 15 after drawing each text
object. Then write a conditional to check that the text is greater than or
within 10 pixels of the bottom of the canvas . You can obtain the height of
the canvas using the cget() method.
If the text is too close to the bottom of the canvas, you need to start
a new column. Shift the txt_x variable over by 300 and reset txt_y to the
image height plus 20.
Finish the method definition by packing the canvas and then calling the
screen object’s mainloop(). Packing optimizes the placement of objects in the
canvas. The mainloop() is an infinite loop that runs tkinter, waits for an event
to occur, and processes the event until the window is closed.
NOTE The height of the color image (506 pixels) is slightly larger than that of the grayscale
image (501 pixels). I chose to ignore this, but if you’re a stickler for accuracy, you
can use OpenCV to shrink the height of the color image using IMG_COLOR =
cv.resize(IMG_COLOR, (1024, 501), interpolation=cv.INTER_AREA).
if __name__ == '__main__':
    main()
Listing 7-9: Defining and calling the main() function used to run the program
Start by instantiating an app object from the Search class. Name it 670x335
km to document the size of the rectangular regions being investigated. Next,
call the Search methods in order. Run the statistics on the rectangles and
draw the quality control rectangles. Sort the statistics from smallest to larg-
est and then draw the rectangles with the best peak-to-valley and standard
deviation statistics. Show the results and finish the function by making
the final summary display.
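A sketch of main(), with method names inferred from that description (the actual names in the book's listing may differ):

def main():
    app = Search('670x335 km')
    app.run_rect_stats()
    app.draw_qc_rects()
    app.sort_stats()
    ptp_img = app.draw_filtered_rects(IMG_GRAY, app.ptp_filtered)
    std_img = app.draw_filtered_rects(IMG_GRAY, app.std_filtered)
    cv.imshow('Sorted by ptp', ptp_img)
    cv.waitKey(3000)
    cv.imshow('Sorted by std', std_img)
    cv.waitKey(3000)
    app.make_final_display()  # ends with the tkinter mainloop() call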
Figure 7-13: Final display with high-graded rectangles and summary statistics sorted by
standard deviation
Results
After you’ve made the final display, the first thing you should do is perform
a sanity check. Make sure that the rectangles are within the allowed latitude
and elevation limits and that they appear to be in smooth terrain. Likewise,
the rectangles based on the peak-to-valley and standard deviation statistics,
shown in Figures 7-11 and 7-12, respectively, should match the constraints
and mostly pick the same rectangles.
As noted previously, the rectangles in Figures 7-11 and 7-12 don’t per-
fectly overlap. That’s because you’re using two different metrics for smooth-
ness. One thing you can be sure of, though, is that the rectangles that do
overlap will be the smoothest of all the rectangles.
While all the rectangle locations look reasonable in the final display, the
concentration of rectangles on the far-west side of the map is particularly
encouraging. This is the smoothest terrain in the search area (Figure 7-14),
and your program clearly recognized it.
This project focused on safety concerns, but scientific objectives drive
site selection for most missions. In the practice projects at the end of the
chapter, you’ll get a chance to incorporate an additional constraint—
geology—into the site selection equation.
Figure 7-14: The very smooth terrain west of the Olympus Mons lava fields
Summary
In this chapter, you used Python, OpenCV, the Python Imaging Library,
NumPy, and tkinter to load, analyze, and display an image. Because OpenCV
treats images as NumPy arrays, you can easily extract information from parts
of an image and evaluate it with Python’s many scientific libraries.
The dataset you used was quick to download and fast to run. While a
real intern would have used a larger and more rigorous dataset, such as one
composed of millions of actual elevation measurements, you got to see how
the process works with little effort and reasonable results.
Further Reading
The Jet Propulsion Laboratory has several short and fun videos about land-
ing on Mars. Find them with online searches for Mars in a Minute: How Do
You Choose a Landing Site?, Mars in a Minute: How Do You Get to Mars?, and
Mars in a Minute: How Do You Land on Mars?.
Mapping Mars: Science, Imagination, and the Birth of a World (Picador,
2002), by Oliver Morton, tells the story of the contemporary exploration of
Mars, including the creation of the MOLA map.
The Atlas of Mars: Mapping Its Geography and Geology (Cambridge
University Press, 2019), by Kenneth Coles, Kenneth Tanaka, and Philip
Christensen, is a spectacular all-purpose reference atlas of Mars that
includes maps of topography, geology, mineralogy, thermal properties,
near-surface water-ice, and more.
The data page for the MOLA map used in Project 10 is at
https://astrogeology.usgs.gov/search/map/Mars/GlobalSurveyor/MOLA/Mars_MGS
_MOLA_DEM_mosaic_global_463m/.
Detailed Martian datasets are available on the Mars Orbital Data
Explorer site produced by the PDS Geoscience Node at Washington
University in St. Louis (https://ode.rsl.wustl.edu/mars/index.aspx).
Use the Mars MGS MOLA - MEX HRSC Blended DEM Global 200m v2 map
shown in Figure 7-15. This version has better lateral resolution than the one
you used for Project 10. It also uses the full elevation range in the MOLA
data. You can find a copy, mola_1024x512_200mp.jpg, in the Chapter_7 folder,
downloadable from the book’s website. A solution, practice_profile_olympus.py,
is available in the same folder and in the appendix.
Practice Project: Plotting in 3D
Mars is an asymmetrical planet, with the southern hemisphere dominated
by ancient cratered highlands and the north characterized by smooth, flat
lowlands. To make this more apparent, use the 3D plotting functionality in
matplotlib to display the mola_1024x512_200mp.jpg image you used in the
previous practice project (Figure 7-16).
Figure 7-16: 3D relief plot of Mars with labeled features: Olympus Mons, Elysium Mons, the Southern Highlands, the Northern Lowlands, Argyre Planitia, Hellas Planitia, and the polar cap
With matplotlib, you can make 3D relief plots using points, lines, con-
tours, wireframes, and surfaces. Although the plots are somewhat crude,
you can generate them quickly. You can also use the mouse to interactively
grab the plot and change the viewpoint. They are particularly useful for
people who have trouble visualizing topography from 2D maps.
In Figure 7-16, the exaggerated vertical scale makes the elevation
difference from south to north easy to see. It’s also easy to spot the tallest
mountain (Olympus Mons) and the deepest crater (Hellas Planitia).
You can reproduce the plot in Figure 7-16—sans annotation—with the
practice_3d_plotting.py program in the appendix or Chapter_7 folder, down-
loadable from the book’s website. The map image can be found in the same
folder.
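If you want to roll your own before peeking at the solution, a minimal surface-plot sketch looks something like this (the stride values are just to keep the plot responsive):

import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # enables the 3D projection

img_gray = cv.imread('mola_1024x512_200mp.jpg', cv.IMREAD_GRAYSCALE).astype(float)
x, y = np.meshgrid(np.arange(img_gray.shape[1]), np.arange(img_gray.shape[0]))
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_surface(x, y, img_gray, rstride=10, cstride=10, cmap='gist_earth')
ax.set_title('Mars Topography (MOLA)')
plt.show()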
Since the Tharsis Montes region lies at a high altitude, focus on find-
ing the flattest and smoothest parts of the volcanic deposits, rather than
targeting the lowest elevations. To isolate the volcanic deposits, consider
thresholding a grayscale version of the map. Thresholding is a segmentation
technique that partitions an image into a foreground and a background.
With thresholding, you convert a grayscale image into a binary image
where pixels above or between specified threshold values are set to 1 and all
others are set to 0. You can use this binary image to filter the MOLA map,
as shown in Figure 7-18.
Figure 7-18: Filtered MOLA map over the Tharsis Montes region, with ptp (left) and std
(right) rectangles
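A minimal thresholding sketch using OpenCV's inRange() method (the grayscale band here is a placeholder; you'll need to pick values that isolate the volcanic deposits):

import cv2 as cv

img_gray = cv.imread('mola_1024x512_200mp.jpg', cv.IMREAD_GRAYSCALE)
mask = cv.inRange(img_gray, 95, 153)               # placeholder elevation band
filtered = cv.bitwise_and(img_gray, img_gray, mask=mask)
cv.imshow('Filtered MOLA map', filtered)
cv.waitKey(0)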
You can find the geological map, Mars_Global_Geology_Mariner9_1024.jpg,
in the Chapter_7 folder, downloadable from the book’s website. The volcanic
deposits will be light pink in color. For the elevation map, use mola_1024x512
_200mp.jpg from the “Extracting an Elevation Profile” practice project on
page 172.
A solution, contained in practice_geo_map_step_1of2.py and practice_geo
_map_step_2of2.py, can be found in the same folder and in the appendix.
Run the practice_geo_map_step_1of2.py program first to generate the filter
for step 2.
Figure 7-19: Diagonal profile through the three volcanoes on Tharsis Montes, with Valles Marineris and Hellas Planitia also labeled
Of course, for efficiency, you don’t have to duplicate the whole map. You
only need a strip along the eastern margin wide enough to accommodate
the final overlapping rectangle.
8
DETECTING DISTANT EXOPLANETS
Transit Photometry
In astronomy, a transit occurs when a relatively small celestial body passes
directly between the disc of a larger body and an observer. When the small
body moves across the face of the larger body, the larger body dims slightly.
The best-known transits are those of Mercury and Venus against our own
sun (Figure 8-1).
Figure 8-1: Clouds and Venus (the black dot) passing before the sun in June 2012
Figure 8-2: A planet transiting a star at five numbered positions, with the brightness observations tracing out a light curve over time
In Figure 8-2, the dots on the light curve graph represent measure-
ments of the light given off by a star. When a planet is not positioned over
the star, the measured brightness is at a maximum. (We'll ignore light
reflected off the exoplanet as it goes through its phases, which would very
slightly increase the apparent brightness of the star.) As the leading edge of
a planet moves onto the disc, the emitted light progressively dims, form-
ing a ramp in the light curve. When the entire planet is visible against the
disc, the light curve flattens, and it remains flat until the planet begins
exiting the far side of the disc. This creates another ramp, which rises
until the planet passes completely off the disc. At that point, the light
curve flattens at its maximum value, as the star is no longer obscured.
Because the amount of light blocked during a transit is proportional
to the size of the planet’s disc, you can calculate the radius of the planet
using the following formula:
\[ R_p = R_s\sqrt{\text{depth}} \]
where R_s is the radius of the star.
Figure 8-3: Depth represents the total change in brightness observed in a light curve; the transit time spans the dip.
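As a quick worked example with made-up numbers (not output from the program):

import math

STAR_RADIUS = 165                  # star radius in pixels
depth = 1.0 - 0.99825              # hypothetical minimum relative brightness of 0.99825
exo_radius = STAR_RADIUS * math.sqrt(depth)
print(f"Estimated exoplanet radius = {exo_radius:.2f} pixels")  # about 6.9 pixels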
Of course, these calculations assume that the whole exoplanet, not just
part of it, moved over the face of the star. The latter may occur if the exoplanet
skims either the top or bottom of the star (from our point of view). We’ll look
at this case in “Experimenting with Transit Photometry” on page 182.
Figure 8-4: Diamond ring effect at the end of totality, 2017 solar eclipse
THE OBJECTIVE
Write a Python program that simulates an exoplanet transit, plots the resulting light curve,
and calculates the radius of the exoplanet.
The Strategy
To generate a light curve, you need to be able to measure changes in bright-
ness. You can do this by performing mathematical operations on pixels,
such as finding mean, minimum, and maximum values, with OpenCV.
Instead of using an image of a real transit and star, you’ll draw circles
on a black rectangle, just as you drew rectangles on the Mars map in the
previous chapter. To plot the light curve, you can use matplotlib, Python’s
main plotting library. You installed matplotlib in “Installing NumPy and
Other Scientific Packages with pip” on page 8 and began using it to make
graphs in Chapter 2.
The Transit Code
The transit.py program uses OpenCV to generate a visual simulation of an
exoplanet transiting a star, plots the resulting light curve with matplotlib,
and estimates the size of the planet using the planetary radius equation
from page 179. You can enter the code yourself or download it from
https://nostarch.com/real-world-python/.
import math
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

IMG_HT = 400
IMG_WIDTH = 500
BLACK_IMG = np.zeros((IMG_HT, IMG_WIDTH, 1), dtype='uint8')
STAR_RADIUS = 165
EXO_RADIUS = 7
EXO_DX = 3
EXO_START_X = 40
EXO_START_Y = 230
NUM_FRAMES = 145
Import the math module for the planetary radius equation, NumPy for cal-
culating the brightness of the image, OpenCV for drawing the simulation,
and matplotlib for plotting the light curve. Then start assigning constants
that will represent user-input values.
Start with a height and width for the simulation window. The window
will be a black, rectangular image built using the np.zeros() method, which
returns an array of a given shape and type filled with zeros.
Recall that OpenCV images are NumPy arrays and items in the arrays
must have the same type. The uint8 data type represents an unsigned integer
from 0 to 255. You can find a useful listing of other data types and their
descriptions at https://numpy.org/devdocs/user/basics.types.html.
Next, assign radius values, in pixels, for the star and exoplanet. OpenCV
will use these constants when it draws circles representing them.
The exoplanet will move across the face of the star, so you need to define
how quickly it will move. The EXO_DX constant will increment the exoplanet’s
x position by three pixels with each programming loop, causing the exoplanet
to move left to right.
Assign two constants to set the exoplanet’s starting position. Then assign
a NUM_FRAMES constant to control the number of simulation updates. Although
you can calculate this number (IMG_WIDTH/EXO_DX), assigning it lets you fine-
tune the duration of the simulation.
transit.py, part 3

def record_transit(exo_x, exo_y):
    """Draw planet transiting star and return list of intensity changes."""
    intensity_samples = []
    for _ in range(NUM_FRAMES):
        temp_img = BLACK_IMG.copy()
        cv.circle(temp_img, (int(IMG_WIDTH / 2), int(IMG_HT / 2)),
                  STAR_RADIUS, 255, -1)
        cv.circle(temp_img, (exo_x, exo_y), EXO_RADIUS, 0, -1)
        intensity = temp_img.mean()
        cv.putText(temp_img, 'Mean Intensity = {}'.format(intensity), (5, 390),
                   cv.FONT_HERSHEY_PLAIN, 1, 255)
        cv.imshow('Transit', temp_img)
        cv.waitKey(30)
        intensity_samples.append(intensity)
        exo_x += EXO_DX
    return intensity_samples
Listing 8-3: Drawing the simulation, calculating the image intensity, and returning it as a list
After showing the image, use the OpenCV waitKey() method to update
it every 30 milliseconds. The lower the number passed to waitKey(), the
faster the exoplanet will move across the star.
Append the intensity measurement to the intensity_samples list and then
advance the exoplanet circle by incrementing its exo_x value by the EXO_DX
constant . Finish the function by returning the list of mean intensity
measurements.
def plot_light_curve(rel_brightness):
    """Plot changes in relative brightness vs. time."""
    plt.plot(rel_brightness, color='red', linestyle='dashed',
             linewidth=2, label='Relative Brightness')
    plt.legend(loc='upper center')
    plt.title('Relative Brightness vs. Time')
    plt.show()

if __name__ == '__main__':
    main()
Listing 8-4: Calculating relative brightness, plotting the light curve, and calling main()
Light curves display the relative brightness over time so that an un-
obscured star has a value of 1.0 and a totally eclipsed star has a value of 0.0.
To convert the mean intensity measurements to relative values, define the
calc_rel_brightness() function, which takes a list of mean intensity measure-
ments as an argument.
Within the function, start an empty list to hold the converted values
and then use Python’s built-in max() function to find the maximum value in
the intensity_samples list. To get relative brightness, loop through the items
in this list and divide them by the maximum value. Append the result to the
rel_brightness list as you go. End the function by returning the new list.
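A sketch of the function, following that description line by line:

def calc_rel_brightness(intensity_samples):
    """Return list of relative brightness values from mean intensity values."""
    rel_brightness = []
    max_brightness = max(intensity_samples)
    for intensity in intensity_samples:
        rel_brightness.append(intensity / max_brightness)
    return rel_brightness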
Define a second function to plot the light curve and pass it the rel
_brightness list . Use the matplotlib plot() method and pass it the list, a
line color, a line style, a line width, and a label for the plot legend. Add the
legend and plot title and then show the plot. You should see the chart in
Figure 8-6.
The brightness variation on the plot might seem extreme at first glance,
but if you look closely at the y-axis, you’ll see that the exoplanet diminished
the star’s brightness by only 0.175 percent! To see how this looks on a plot
of the star’s absolute brightness (Figure 8-7), add the following line just
before plt.show():
plt.ylim(0, 1.2)
The deflection in the light curve caused by the transit is subtle but
detectable. Still, you don’t want to go blind squinting at a light curve, so
continue to let matplotlib automatically fit the y-axis as in Figure 8-6.
Finish the program by calling the main() function . In addition to the
light curve, you should see the estimated radius of the exoplanet in the shell.
Figure 8-7: Light curve from Figure 8-6 with rescaled y-axis
Figure 8-8: Light curve for an exoplanet with a radius of 7 that only partly crosses its star
If you run the simulation again with an exoplanet radius of 5 and let the
exoplanet pass fully over the face of the star, you get the graph in Figure 8-9.
Figure 8-9: Light curve for an exoplanet with a radius of 5 that fully crosses its star
When an exoplanet skims the side of a star, never fully passing over it,
the overlapping area changes constantly, generating the U-shaped curve in
Figure 8-8. If the entire exoplanet passes over the face of the star, the base
of the curve is flatter, as in Figure 8-9. And because you never see the plan-
et’s full disc against the star in a partial transit, you have no way to measure
its true size. Thus, size estimates should be taken with a grain of salt if your
light curve lacks a flattish bottom.
If you run a range of exoplanet sizes, you’ll see that the light curve
changes in predictable ways. As size increases, the curve deepens, with lon-
ger ramps on either side, because a larger fraction of the star’s brightness is
diminished (Figures 8-10 and 8-11).
[Figures 8-10 and 8-11 annotations: long smooth ramp; increasing depth; "V" shape]
THE OBJECTIVE
Write a Python program that pixelates images of Earth and plots the intensity of the red,
green, and blue color channels.
The Strategy
To demonstrate that you can capture different surface features and cloud
formations with a single saturated pixel, you need only two images: one of
the western hemisphere and one of the eastern. Conveniently, NASA has
already photographed both hemispheres of Earth from space (Figure 8-12).
Figure 8-12: The earth_west.png and earth_east.png images of Earth's western and eastern hemispheres
The size of these images is 474×474 pixels, a resolution far too high for a
future exoplanet image, where the exoplanet is expected to occupy 9 pixels,
with only the center pixel fully covered by the planet (Figure 8-13).
Figure 8-13: The earth_west.png and earth_east.png images overlaid with a 9-pixel grid
You’ll need to degrade the Earth images by mapping them into a 3×3
array. Since OpenCV uses NumPy, this will be easy to do. To detect changes
in the exoplanet’s surface, you’ll need to extract the dominant colors (blue,
green, and red). OpenCV will let you average these color channels. Then
you can display the results with matplotlib.
Listing 8-5: Importing modules and loading, degrading, and showing images
Import NumPy and OpenCV to work with the images and use matplotlib
to plot their color components as pie charts. Then start a list of filenames
containing the two images of Earth.
Now start looping through the files in the list and use OpenCV to load
them as NumPy arrays. Recall that OpenCV loads color images by default, so
you don’t need to add an argument for this.
Your goal is to reduce the image of Earth into a single saturated pixel
surrounded by partially saturated pixels. To degrade the images from their
original 474×474 size to 3×3, use OpenCV’s resize() method. First, name the
new image pixelated and pass the method the current image, the new width
and height in pixels, and an interpolation method. Interpolation occurs when
you resize an image and use known data to estimate values at unknown
points. The OpenCV documentation recommends the INTER_AREA interpolation
method for shrinking images (see the geometric image transformations at
https://docs.opencv.org/4.3.0/da/d54/group__imgproc__transform.html).
At this point, you have a tiny image that’s too small to visualize, so resize
it again to 300×300 so you can check the results. Use either INTER_NEAREST
or INTER_AREA as the interpolation method, as these will preserve the pixel
boundaries.
Show the image (Figure 8-14) and delay the program for two seconds
using waitKey().
Note that you can’t restore the images to their original state by resizing
them to 474×474. Once you average the pixel values down to a 3×3 matrix,
all the detailed information is lost forever.
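A sketch of the loading and degrading steps just described (variable names are assumptions and may differ from the book's listing):

import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

files = ['earth_west.png', 'earth_east.png']

for file in files:
    img_ini = cv.imread(file)  # color (BGR) load by default
    pixelated = cv.resize(img_ini, (3, 3), interpolation=cv.INTER_AREA)
    img = cv.resize(pixelated, (300, 300), interpolation=cv.INTER_NEAREST)
    cv.imshow('Pixelated {}'.format(file), img)
    cv.waitKey(2000)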
Averaging the Color Channels and Making the Pie Charts
Still in the for loop, Listing 8-6 makes and displays pie charts of the blue,
green, and red color components of each pixelated image. You can com-
pare these to make inferences about the planet’s weather, landmasses, rota-
tion, and so on.
plt.show()
Listing 8-6: Splitting out and averaging color channels and making a pie chart of colors
Use OpenCV’s split() method to break out the blue, green, and red
color channels in the pixelated image and unpack the results into b, g, and
r variables. These are arrays, and if you call print(b), you should see this
output:
[[ 49  93  22]
 [124 108  65]
 [ 52 118  41]]
Each Text object has (x, y) coordinates and a percent value as a text
string. These will still post in black, so you need to loop through the objects
and change the color to white using their set_color() method. Now all
you need to do is set the chart title to the filename and show the plots
(Figure 8-15).
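Still inside the loop, the splitting, averaging, and plotting steps might look like this sketch (the figure size and exact labels are assumptions):

    b, g, r = cv.split(pixelated)
    color_aves = [np.average(b), np.average(g), np.average(r)]
    labels = 'Blue', 'Green', 'Red'
    colors = ['blue', 'green', 'red']
    fig, ax = plt.subplots(figsize=(3.5, 3.3))
    _, _, autotexts = ax.pie(color_aves, labels=labels, colors=colors,
                             autopct='%1.1f%%')
    for autotext in autotexts:
        autotext.set_color('white')  # percent labels post in black by default
    plt.title(file)
    plt.show()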
Although the pie charts are similar, the differences are meaningful.
If you compare the original color images, you’ll see that the earth_west.png
photograph includes more ocean and should produce a larger blue wedge.
This code represents an edited copy of pixelator.py, with the lines that
change annotated. You can find a digital copy in the Chapter_8 folder as
pixelator_saturated_only.py.
plt.show()
Listing 8-7: Plotting pie charts for the colors in the center pixel of the pixelated image
The four lines of code in Listing 8-6 that split the image and averaged
the color channels can be replaced with one line . The pixelated vari-
able is a NumPy array, and [1, 1] represents row 1, column 1 in the array.
Remember that Python starts counting at 0, so these values correspond to
the center of a 3×3 array. If you print the color_values variable, you’ll see
another array.
These are the blue, green, and red color channel values for the center
pixel, and you can pass them directly to matplotlib . For clarity, change
the plot title so it indicates that you’re analyzing the center pixel only .
Figure 8-16 shows the resulting plots.
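Building on the previous sketch, the center-pixel variant could replace the split-and-average lines with a single lookup (again, an approximation of the listing, reusing the labels and colors defined earlier):

    color_values = pixelated[1, 1]   # BGR values of the 3x3 array's center pixel
    fig, ax = plt.subplots(figsize=(3.5, 3.3))
    _, _, autotexts = ax.pie(color_values, labels=labels, colors=colors,
                             autopct='%1.1f%%')
    for autotext in autotexts:
        autotext.set_color('white')
    plt.title('{} Saturated Center Pixel Only'.format(file))
    plt.show()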
Summary
In this chapter, you used OpenCV, NumPy, and matplotlib to create images
and measure their properties. You also resized images to different resolu-
tions and plotted image intensity and color channel information. With
short and simple Python programs, you simulated important methods that
astronomers use to discover and study distant exoplanets.
Further Reading
How to Search for Exoplanets, by the Planetary Society (https://www.planetary.org/),
is a good overview of the techniques used to search for exoplanets, including
the strengths and weaknesses of each method.
“Transit Light Curve Tutorial,” by Andrew Vanderburg, explains the
basics of the transit photometry method and provides links to Kepler Space
Observatory transit data. You can find it at https://www.cfa.harvard.edu
/~avanderb/tutorial/tutorial.html.
“NASA Wants to Photograph the Surface of an Exoplanet” (Wired, 2020),
by Daniel Oberhaus, describes the effort to turn the sun into a giant cam-
era lens for studying exoplanets.
“Dyson Spheres: How Advanced Alien Civilizations Would Conquer
the Galaxy” (Space.com, 2014), by Karl Tate, is an infographic on how an
advanced civilization could capture the power of a star using vast arrays of
solar panels.
Ringworld (Ballantine Books, 1970), by Larry Niven, is one of the classic
novels of science fiction. It tells the story of a mission to a massive abandoned
alien construct—the Ringworld—that encircles an alien star.
[Figure: light curve of Tabby's Star, with normalized flux on the y-axis dropping from 1.00 to about 0.80]
Besides the dramatic drop in brightness, the light curve was asym-
metrical and included weird bumps that aren’t seen in typical planetary
transits. Proposed explanations posited that the light curve was caused by
the consumption of a planet by the star, the transit of a cloud of disintegrat-
ing comets, a large ringed planet trailed by swarms of asteroids, or an alien
megastructure.
Scientists speculated that an artificial structure of this size was most
likely an attempt by an alien civilization to collect energy from its sun. Both
science literature and science fiction describe these staggeringly large solar
panel projects. Examples include Dyson swarms, Dyson spheres, ringworlds,
and Pokrovsky shells (Figure 8-18).
We now know that whatever is orbiting Tabby’s Star allows some wave-
lengths of light to pass, so it can’t be a solid object. Based on this behavior
and the wavelengths it absorbed, scientists believe dust is responsible for the
weird shape of the star’s light curve. Other stars, however, like HD 139139
in the constellation Libra, have bizarre light curves that remain unexplained
at the time of this writing.
Practice Project: Detecting Asteroid Transits
Asteroid fields may be responsible for some bumpy and asymmetrical
light curves. These belts of debris often originate from planetary collisions
or the creation of a solar system, like the Trojan asteroids in Jupiter’s
orbit (Figure 8-20). You can find an interesting animation of the Trojan
asteroids on the web page “Lucy: The First Mission to the Trojan Asteroids”
at https://www.nasa.gov/.
Figure 8-20: More than one million Trojan asteroids share Jupiter’s orbit.
Use your modified program to revisit “Experimenting with Transit
Photometry” on page 186, where you analyzed the light curves produced
by partial transits. You should see that, compared to partial transits, full
transits still produce broader dips with flattish bottoms (Figure 8-24).
Figure 8-24: Limb-darkened light curves for full transits (R = 3 and R = 7) and a partial transit (R = 7), where R is the exoplanet radius
If the full transit of a small planet occurs near the edge of a star, limb
darkening may make it difficult to distinguish from the partial transit of a
larger planet. You can see this in Figure 8-25, where arrows denote the loca-
tion of the planets.
Figure 8-25: Partial transit of planet with a radius of 8 pixels versus full transit
of planet with a radius of 5 pixels
Starspots
Figure 8-26: An exoplanet (arrow, left image) passing over a starspot produces a bump in
the light curve.
Write a Python program that simulates multiple spaceships transiting
a star. Give the ships different sizes, shapes, and speeds (such as those in
Figure 8-27).
Compare the resultant light curves to those from Tabby’s Star (Figure 8-17)
and the asteroids practice project. Do the ships produce distinctive curves,
or can you get similar patterns from asteroid swarms, starspots, or other
natural phenomena?
You can find a solution, practice_alien_armada.py, in the appendix and in
the Chapter_8 folder, downloadable from the book’s website.
9
IDENTIFYING FRIEND OR FOE
Figure 9-1: Example of some consistently bright and dark regions in a face
You can extract these patterns using templates like those in Figure 9-2.
These yield Haar features, a fancy name for the attributes of digital images
used in object recognition. To calculate a Haar feature, place one of the
templates on a grayscale image, add up the grayscale pixels that overlap
with the white part, and subtract them from the sum of the pixels that over-
lap the black part. Thus, each feature consists of a single intensity value.
We can use a range of template sizes to sample all possible locations on the
image, making the system scale invariant.
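Here's a toy illustration of a single two-rectangle feature (this is just arithmetic on a made-up patch, not OpenCV's implementation):

import numpy as np

patch = np.random.randint(0, 256, (24, 24))   # a made-up grayscale window
white_half = patch[:, :12]                    # pixels under the template's white region
black_half = patch[:, 12:]                    # pixels under the template's black region
haar_feature = black_half.sum() - white_half.sum()   # a single intensity value
print(haar_feature)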
To apply the classifier, the algorithm uses a sliding window approach.
A small rectangular area is incrementally moved across the image and
evaluated using a cascade classifier consisting of multiple stages of filters.
The filters at each stage are combinations of Haar features. If the window
region fails to pass the threshold of a stage, it’s rejected, and the window
slides to the next position. Quickly rejecting nonface regions, like the one
shown in the right inset in Figure 9-3, helps speed up the overall process.
Figure 9-3: Images are searched for faces using a rectangular sliding window.
Figure 9-4: Security camera footage of a mutated scientist (left) and marine (right)
Unfortunately, the squad’s sentry gun was damaged during landing,
so the transponders no longer function. Worse, the requisitions corporal
forgot to download the software that visually interrogates targets. With the
transponder sensor down, there’s no way to positively identify marines and
civilians. You’ll need to get this fixed as quickly as possible, because your
buddies are badly outnumbered and the mutants are on the move!
Fortunately, planet LV-666 has no indigenous life forms, so you need to
distinguish between humans and mutants only. Since the mutants are basi-
cally faceless, a face detection algorithm is the logical solution.
THE OBJECTIVE
Write a Python program that disables the sentry gun’s firing mechanism when it detects
human faces in an image.
The Strategy
In situations like this, it’s best to keep things simple and leverage existing
resources. This means relying on OpenCV’s face detection functionality
rather than writing customized code to recognize the humans on the base.
But you can’t be sure how well these canned procedures will work, so you’ll
need to guide your human targets to make the job as easy as possible.
The sentry gun’s motion detector will handle the job of triggering the
optical identification process. To permit humans to pass unharmed, you’ll
need to warn them to stop and face the camera. They’ll need a few seconds
to do this and a few seconds to proceed past the gun after they’re cleared.
You’ll also want to run some tests to ensure OpenCV’s training set
is adequate and you’re not generating any false positives that would let a
mutant sneak by. You don’t want to kill anyone with friendly fire, but you
can’t be too cautious, either. If one mutant gets by, everyone could perish.
NOTE In real life, the sentry guns would use a video feed. Since I don’t have my own film
studio with special effects and makeup departments, you’ll work off still photos
instead. You can think of these as individual video frames. Later in the chapter,
you’ll get a chance to detect your own face using your computer’s video camera.
The Code
The sentry.py code will loop through a folder of images, identify human
faces in the images, and show the image with the faces outlined. It will then
either fire or disable the gun depending on the result. You’ll use the images
in the corridor_5 folder in the Chapter_9 folder, downloadable from https://
nostarch.com/real-world-python/. As always, don’t move or rename any files
after downloading and launch sentry.py from the folder in which it’s stored.
You’ll also need to install two modules, playsound and pyttsx3. The first
is a cross-platform module for playing WAV and MP3 format audio files.
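You can typically install both with pip (the exact commands or pinned versions may vary by platform):

pip install playsound
pip install pyttsx3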
You may need to restart the Python shell and editor following this
installation.
For more on playsound, see https://pypi.org/project/playsound/. The docu-
mentation for pyttsx3 can be found at https://pyttsx3.readthedocs.io/en/latest/
and https://pypi.org/project/pyttsx3/.
If you don’t already have OpenCV installed, see “Installing the Python
Libraries” on page 6.
Importing Modules, Setting Up Audio, and Referencing the Classifier Files and Corridor Images
Listing 9-1 imports modules, initializes and sets up the audio engine, assigns
the classifier files to variables, and changes the directory to the folder con-
taining the corridor images.
import os
import time
from datetime import datetime
from playsound import playsound
import pyttsx3
import cv2 as cv

engine = pyttsx3.init()
engine.setProperty('rate', 145)
engine.setProperty('volume', 1.0)

root_dir = os.path.abspath('.')
gunfire_path = os.path.join(root_dir, 'gunfire.wav')
tone_path = os.path.join(root_dir, 'tone.wav')

path = "C:/Python372/Lib/site-packages/cv2/data/"
face_cascade = cv.CascadeClassifier(path +
                                    'haarcascade_frontalface_default.xml')
eye_cascade = cv.CascadeClassifier(path + 'haarcascade_eye.xml')

os.chdir('corridor_5')
contents = sorted(os.listdir())
Listing 9-1: Importing modules, setting up the audio, and locating the classifier files and
corridor images
Except for datetime, playsound, and pyttsx3, the imports should be familiar
to you if you’ve worked through the earlier chapters . You’ll use datetime to
record the exact time at which an intruder is detected in the corridor.
To use pyttsx3, initialize a pyttsx3 object and assign it to a variable
named, by convention, engine . According to the pyttsx3 docs, an applica-
tion uses the engine object to register and unregister event callbacks, pro-
duce and stop speech, get and set speech engine properties, and start and
stop event loops.
In the next two lines, set the rate of speech and volume properties.
The rate of speech value used here was obtained through trial and error.
It should be fast but still clearly understandable. The volume should be set
to the maximum value (1.0) so any humans stumbling into the corridor can
easily hear the warning instructions.
The default voice on Windows is male, but other voices are available.
For example, on a Windows 10 machine, you can switch to a female voice
using the following voice ID:
engine.setProperty('voice',
'HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-US_ZIRA_11.0')
    face_rect_list = []
    face_rect_list.append(face_cascade.detectMultiScale(image=img_gray,
                                                        scaleFactor=1.1,
                                                        minNeighbors=5))
Listing 9-2: Looping through images, issuing a verbal warning, and searching for faces
Start looping through the images in the folder. Each new image repre-
sents a new intruder in the corridor. Print a log of the event and the time
at which it occurred . Note the f before the start of the string. This is the
new f-string format introduced with Python 3.6 (https://www.python.org/dev
/peps/pep-0498/). An f-string is a literal string that contains expressions,
such as variables, strings, mathematical operations, and even function calls,
inside curly braces. When the program prints the string, it replaces the
expressions with their values. These are the fastest and most efficient string
formats in Python, and we certainly want this program to be fast!
Assume every intruder is a mutant and prepare to discharge the
weapon. Then, verbally warn the intruder to stop and be scanned.
Use the pyttsx3 engine object’s say() method to speak . It takes a string
as an argument. Follow this with the runAndWait() method. This halts pro-
gram execution, flushes the say() queue, and plays the audio.
NOTE For some macOS users, the program may exit with the second call to runAndWait().
If this occurs, download the sentry_for_Mac_bug.py code from the book’s website.
This program uses the operating system’s text-to-speech functionality in place of
pyttsx3. You’ll still need to update the Haar cascade path variable in this program,
as you did at in Listing 9-1.
Next, use the time module to pause the program for three seconds. This
gives the intruder time to squarely face the gun’s camera.
At this point, you’d make a video capture, except we’re not using video.
Instead, load the images in the corridor_5 folder. Call the cv.imread() method
with the IMREAD_GRAYSCALE flag .
Use the image’s shape attribute to get its height and width in pixels. This
will come in handy later, when you post text on the images.
Face detection works only on grayscale images, but OpenCV will convert
color images behind the scenes when applying the Haar cascades. I chose
to use grayscale from the start as the results look creepier when the images
display. If you want to see the images in color, just change the two previous
lines as follows:
img_gray = cv.imread(image)
height, width = img_gray.shape[:2]
Next, show the image prior to face detection, keep it up for two seconds
(input as milliseconds), and then destroy the window. This is for quality
control to be sure all the images are being examined. You can comment out
these steps later, after you’re satisfied everything is working as planned.
Create an empty list to hold any faces found in the current image .
OpenCV treats images as NumPy arrays, so the items in this list are the corner-
point coordinates (x, y, width, height) of a rectangle that frames the face.
The scale pyramid resizes the image downward a set number of times.
For example, a scaleFactor of 1.2 means the image will be scaled down in
increments of 20 percent. The sliding window will repeat its movement
across this smaller image and check again for Haar features. This shrink-
ing and sliding will continue until the scaled image reaches the size of the
images used for training. This is 20×20 pixels for the Haar cascade classifier
(you can confirm this by opening one of the .xml files). Windows smaller
than this can’t be detected, so the resizing ends at this point. Note that the
scale pyramid will only downscale images, as upscaling can introduce arti-
facts in the resized image.
With each rescaling, the algorithm calculates lots of new Haar features,
resulting in lots of false positives. To weed these out, use the minNeighbors
parameter.
To see how this process works, look at Figure 9-7. The rectangles in
this figure represent faces detected by the haarcascade_frontalface_alt2.xml
classifier, with the scaleFactor parameter set to 1.05 and minNeighbors set to 0.
The rectangles have different sizes depending on which scaled image—
determined by the scaleFactor parameter—was in use when the face was
detected. Although there are many false positives, the rectangles tend to
cluster around the true face.
Figure 9-7: Detected face rectangles with minNeighbors=0
To see why, check out Figure 9-10. Despite using a minNeighbor value of 5,
the toe region of the mutant is incorrectly identified as a face. With a little
imagination, you can see two dark eyes and a bright nose at the top of the
rectangle, and a dark, straight mouth at the base. This could allow the
mutant to pass unharmed, earning you a dishonorable discharge at best
and an excruciatingly painful death at worst.
Detecting Eyes and Disabling the Weapon
Still in the for loop through the corridor images, Listing 9-3 uses OpenCV’s
built-in eye cascade classifier to search for eyes in the list of detected face
rectangles. Searching for eyes reduces false positives by adding a second
verification step. And because mutants don’t have eyes, if at least one eye is
found, you can assume a human is present and disable the sentry gun’s fir-
ing mechanism to let them pass.
Listing 9-3: Detecting eyes in face rectangles and disabling the weapon
Print the name of the image being searched and start a loop through
the rectangles in the face_rect_list. If a rectangle is present, start looping
through the tuple of coordinates. Use these coordinates to make a subarray
from the image, in which you’ll search for eyes .
Call the eye cascade classifier on the subarray. Because you’re now
searching a much smaller area, you can reduce the minNeighbors argument.
Like the cascade classifiers for faces, the eye cascade returns coordi-
nates for a rectangle. Start a loop through these coordinates, naming them
with an e on the end, which stands for “eye,” to distinguish them from the
face rectangle coordinates .
Next, draw a circle around the first eye you find. This is just for your
own visual confirmation; as far as the algorithm’s concerned, the eye is
already found. Calculate the center of the rectangle and then calculate
a radius value that’s slightly larger than an eye. Use OpenCV’s circle()
method to draw a white circle on the rect_4_eyes subarray.
Now, draw a rectangle around the face by calling OpenCV’s rectangle()
method and passing it the img_gray array. Show the image for two seconds
and then destroy the window. Because the rect_4_eyes subarray is part of
img_gray, the circle will show up even though you didn’t explicitly pass the
subarray to the imshow() method (Figure 9-11).
With a human identified, disable the weapon and break out of the
for loop. You need to identify only one eye to confirm that you have a face,
so it’s time to move on to the next face rectangle.
    else:
        print(f"No face in {image}. Discharging weapon!")
        cv.putText(img_gray, 'FIRE!', (int(width / 2) - 20, int(height / 2)),
                   cv.FONT_HERSHEY_PLAIN, 3, 255, 3)
        playsound(gunfire_path, block=False)
        cv.imshow('Mutant', img_gray)
        cv.waitKey(2000)
        cv.destroyWindow('Mutant')
        time.sleep(3)

engine.stop()
Listing 9-4: Determining the course of action if the gun is disabled or enabled
Use a conditional to check whether the weapon is disabled. You set the
discharge_weapon variable to True when you chose the current image from the
corridor_5 folder (see Listing 9-2). If the previous listing found an eye in a
face rectangle, it changed the state to False.
If the weapon is disabled, show the positive detection image (such as
in Figure 9-11) and play the tone. First, call playsound, pass it the tone_path
string, and set the block argument to False. By setting block to False, you
allow playsound to run at the same time as OpenCV displays the image. If
you set block=True, you won’t see the image until after the tone audio has
completed. Show the image for two seconds and then destroy it and pause
the program for five seconds using time.sleep().
If discharge_weapon is still True, print a message to the shell that the gun
is firing. Use OpenCV’s putText() method to announce this in the center of
the image and then show the image (see Figure 9-12).
Now play the gunfire audio. Use playsound, passing it the gunfire_path
string and setting the block argument to False. Note that you have the
option of removing the root_dir and gunfire_path lines of code in Listing 9-1
if you provide the full path when you call playsound. For example, I would
use the following on my Windows machine:
playsound('C:/Python372/book/mutants/gunfire.wav', block=False)
Show the window for two seconds and then destroy it. Sleep the pro-
gram for three seconds to pause between showing the mutant and display-
ing the next image in the corridor_5 folder. When the loop completes, stop
the pyttsx3 engine.
Mutants might also trigger the firing mechanism with humans in the
corridor, assuming the humans look away from the camera at the wrong
moment (Figure 9-14).
I’ve seen enough sci-fi and horror movies to know that in a real scenario,
I’d program the gun to shoot anything that moved. Fortunately, that’s a
moral dilemma I’ll never have to face!
Detecting Faces from a Video Stream
You can also detect faces in real time using video cameras. This is easy to
do, so we won’t make it a dedicated project. Enter the code in Listing 9-5 or
use the digital version, video_face_detect.py, in the Chapter_9 folder download-
able from the book’s website. You’ll need to use your computer’s camera or
an external camera that works through your computer.
path = "C:/Python372/Lib/site-packages/cv2/data/"
face_cascade = cv.CascadeClassifier(path + 'haarcascade_frontalface_alt.xml')
cap = cv.VideoCapture(0)
while True:
_, frame = cap.read()
face_rects = face_cascade.detectMultiScale(frame, scaleFactor=1.2,
minNeighbors=3)
cv.imshow('frame', frame)
if cv.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv.destroyAllWindows()
After importing OpenCV, set up your path to the Haar cascade classifi-
ers as you did in Listing 9-1. I use the haarcascade_frontalface_alt.xml file
here as it has higher precision (fewer false positives) than the haarcascade
_frontalface_default.xml file you used in the previous project. Next, instantiate
a VideoCapture class object, called cap for “capture.” Pass the constructor the
index of the video device you want to use. If you have only one camera,
such as your laptop’s built-in camera, then the index of this device should
be 0.
To keep the camera and face detection process running, use a while
loop. Within the loop, you’ll capture each video frame and analyze it for
faces, just as you did with the static images in the previous project. The face
detection algorithm is fast enough to keep up with the continuous stream,
despite all the work it must do!
To load the frames, call the cap object’s read() method. It returns a tuple
consisting of a Boolean return code and a NumPy ndarray object representing
the current frame. The return code is used to check whether you’ve run out
of frames when reading from a file. Since we’re not reading from a file here,
assign it to an underscore to indicate an insignificant variable.
If the video window is too large for your screen, you can shrink it by resizing each frame before displaying it. The fx and fy arguments are scaling factors for the frame's x and y dimensions; using 0.5 will halve the default size of the window.
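A minimal version of that resize call, placed just before cv.imshow() inside the loop, might look like this:

    frame = cv.resize(frame, None, fx=0.5, fy=0.5)  # Halve both frame dimensions.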
The program should have no trouble tracking your face unless you do
something crazy, like tilt your head slightly to the side. That’s all it takes to
break detection and make the rectangle disappear (Figure 9-15).
The Haar cascade classifiers can handle a bit of tilt (Figure 9-16), so you
could rotate the image by 5 degrees or so with each pass and have a good
chance of getting a positive result.
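Here's a rough sketch of that idea using OpenCV's rotation routines (the rotate_frame() helper and the 5-degree step are illustrative, not code from the project):

import cv2 as cv

def rotate_frame(frame, angle_deg):
    """Return a copy of frame rotated about its center by angle_deg degrees."""
    h, w = frame.shape[:2]
    matrix = cv.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv.warpAffine(frame, matrix, (w, h))

# Try small clockwise and counterclockwise rotations before giving up on a frame:
# for angle in (-5, 5):
#     face_rects = face_cascade.detectMultiScale(rotate_frame(frame, angle),
#                                                scaleFactor=1.2, minNeighbors=3)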
Summary
In this chapter, you got to work with OpenCV’s Haar cascade classifier for
detecting human faces; playsound, for playing audio files; and pyttsx3, for
text-to-speech audio. Thanks to these useful libraries, you were able to
quickly write a face detection program that also issued audio warnings and
instructions.
In this example, you replace the value of a given pixel in image with
the average of all the pixels in a 20×20 square centered on that pixel. This
operation repeats for every pixel in image.
You can find a solution, practice_blur.py, in the appendix and in the
Chapter_9 folder downloadable from the book’s website.
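For a quick sanity check of your own solution, OpenCV's built-in mean filter performs the same kind of neighborhood averaging; here's a minimal sketch (the filename is a placeholder, and this is not the practice_blur.py solution):

import cv2 as cv

image = cv.imread('my_image.jpg', cv.IMREAD_GRAYSCALE)  # Placeholder filename.
blurred = cv.blur(image, (20, 20))  # Average each pixel over a 20x20 neighborhood.
cv.imshow('Blurred', blurred)
cv.waitKey(3000)
cv.destroyAllWindows()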
[Figure: Webcam images of each subject, such as Bobby (ID = 1) and Lloyd (ID = 2), are captured and stored in the database folder]
The next step in the capture process is to detect the face in the image,
draw a rectangle around it, crop the image to the rectangle, resize the
cropped images to the same dimensions (depending on the algorithm),
and convert them to grayscale. The algorithms typically keep track of faces
using integers, so each subject will need a unique ID number. Once pro-
cessed, the faces are stored in a single folder, which we’ll call the database.
The next step is to train the face recognizer (Figure 10-2). The algorithm
—in our case, LBPH—analyzes each of the training images and then writes
the results to a YAML (.yml) file, a human-readable data-serialization lan-
guage used for data storage. YAML originally meant “Yet Another Markup
Language” but now stands for “YAML Ain’t Markup Language” to stress
that it’s more than just a document markup tool.
Figure 10-2: Training the face recognizer and writing the results to a file
[Figure 10-3: The recognizer compares each webcam face to trainer.yml, assigns an ID and a confidence value, and classifies faces whose confidence exceeds the threshold as unknown]
Note that the recognizer will make a prediction about the identity of
every face. If there’s only one trained face in the YAML file, the recognizer
will assign every face the trained face’s ID number. It will also output a
confidence factor, which is really a measurement of the distance between the
new face and the trained face. The larger the number, the worse the match.
We’ll talk about this more in a moment, but for now, know that you’ll use a
threshold value to decide whether the predicted face is correct. If the con-
fidence exceeds the accepted threshold value, the program will discard the
match and classify the face as “unknown” (see Figure 10-3).
Figure 10-4: Example 3×3 pixel sliding window used to capture local binary patterns
The next step is to convert the pixels into a binary number, using the
central value (in this case 90) as a threshold. You do this by comparing the
eight neighboring values to the threshold. If a neighboring value is equal
to or higher than the threshold, assign it 1; if it’s lower than the threshold,
assign it 0. Next, ignoring the central value, concatenate the binary values
line by line (some methods use a clockwise rotation) to form a new binary
value (11010111). Finish by converting this binary number into a decimal
number (215) and storing it at the central pixel location.
Continue sliding the window until all the pixels have been converted to
LBP values. In addition to using a square window to capture neighboring
pixels, the algorithm can use a radius, a process called circular LBP.
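Here's a small worked version of that calculation. The top two rows of the window come from Figure 10-4; the bottom-row values are made up (any neighbors of 90 or more there reproduce the 215 result):

import numpy as np

window = np.array([[160, 110,  50],
                   [180,  90,  50],
                   [ 95, 120, 200]])   # Bottom row assumed for illustration.

center = window[1, 1]                             # Threshold value (90).
bits = (window.flatten() >= center).astype(int)   # Compare neighbors to center.
bits = np.delete(bits, 4)                         # Ignore the central pixel.
binary_string = ''.join(str(b) for b in bits)     # Concatenate line by line.
print(binary_string, int(binary_string, 2))       # 11010111 215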
Now it’s time to extract histograms from the LBP image produced in the
previous step. To do this, you use a grid to divide the LBP image into rectan-
gular regions (Figure 10-5). Within each region, you construct a histogram
of the LBP values (labeled “Local Region Histogram” in Figure 10-5).
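Here's a rough sketch of that step, assuming an 8×8 grid and 256-bin histograms:

import numpy as np

def regional_histograms(lbp_image, grid_x=8, grid_y=8):
    """Return concatenated 256-bin histograms for each grid region of an LBP image."""
    histograms = []
    for band in np.array_split(lbp_image, grid_y, axis=0):
        for region in np.array_split(band, grid_x, axis=1):
            hist, _ = np.histogram(region, bins=256, range=(0, 256))
            histograms.append(hist)
    return np.concatenate(histograms)  # One feature vector per face.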
Project #14: Restricting Access to the Alien Artifact
Your squad has fought its way to the lab containing the portal-producing
alien artifact. Captain Demming orders it locked down immediately, with
access restricted to just him. Another technician will override the current
system with a military laptop. Captain Demming will gain access through
this laptop using two levels of security: a typed password and face verifica-
tion. Aware of your skills with OpenCV, he’s ordered you to handle the
facial verification part.
THE OBJECTIVE
The Strategy
You’re pressed for time and working under adverse conditions, so you want
to use a fast and easy tool with a good performance record, like OpenCV’s
LBPH face recognizer. You’re aware that LBPH works best under controlled
conditions, so you’ll use the same laptop webcam to capture both the train-
ing images and the face of anyone trying to access the lab.
In addition to pictures of Demming’s face, you’ll want to capture
some faces that don’t belong to Captain Demming. You’ll use these faces
to ensure that all the positive matches really belong to the captain. Don’t
worry about setting up the password, isolating the program from the user,
or hacking into the current system; the other technician will handle these
tasks while you go out and blast some mutants.
Importing Modules and Setting Up Audio, a Webcam, Instructions, and File Paths
Listing 10-1 imports modules, initializes and sets up the audio engine
and the Haar cascade classifier, initializes the camera, and provides user
instructions. You need the Haar cascades because you must detect a face
before you can recognize it. For a refresher on Haar cascades and face
detection, see “Detecting Faces in Photographs” on page 204.
1_capture.py, part 1

import os
import pyttsx3
import cv2 as cv
from playsound import playsound

engine = pyttsx3.init()
engine.setProperty('rate', 145)
engine.setProperty('volume', 1.0)

root_dir = os.path.abspath('.')
tone_path = os.path.join(root_dir, 'tone.wav')

path = "C:/Python372/Lib/site-packages/cv2/data/"
face_detector = cv.CascadeClassifier(path +
                                     'haarcascade_frontalface_default.xml')

cap = cv.VideoCapture(0)
if not cap.isOpened():
    print("Could not open video device.")
cap.set(3, 640)  # Frame width.
cap.set(4, 480)  # Frame height.
Listing 10-1: Importing modules and setting up audio and detector files, a webcam, and
instructions
The imports are the same as those used to detect faces in the previous
chapter. You’ll use the operating system (via the os module) to manipulate
file paths, pyttsx3 to play text-to-speech audio instructions, cv to work with
images and run the face detector and recognizer, and playsound to play a tone
that lets users know when the program has finished capturing their image.
Next, set up the text-to-speech engine. You’ll use this to tell the user
how to run the program. The default voice is dependent on your particular
operating system. The engine’s rate parameter is currently optimized for
the American "David" voice on Windows. You may want to edit the argu-
ment if you find the speech to be too fast or too slow. If you want to change
the voice, see the instructions accompanying Listing 9-1 on page 209.
You’ll use a tone to alert the user that the video capture process has
ended. Set up the path to the tone.wav audio file as you did in Chapter 9.
frame_count = 0
while True:
    # Capture frame-by-frame for total of 30 frames.
    _, frame = cap.read()
    gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
    face_rects = face_detector.detectMultiScale(gray, scaleFactor=1.2,
                                                minNeighbors=5)
    for (x, y, w, h) in face_rects:
        frame_count += 1
        cv.imwrite(str(name) + '.' + str(user_id) + '.'
                   + str(frame_count) + '.jpg', gray[y:y+h, x:x+w])
        cv.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv.imshow('image', frame)
    cv.waitKey(400)
    if frame_count >= 30:
        break
2_train.py

import os
import numpy as np
import cv2 as cv

cascade_path = "C:/Python372/Lib/site-packages/cv2/data/"
face_detector = cv.CascadeClassifier(cascade_path +
                                     'haarcascade_frontalface_default.xml')
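# Rough reconstruction of the training-loop setup described in the text
# below; the folder names follow the prose, but these are not the book's
# verbatim lines.
#train_path = './trainer'
train_path = './demming_trainer'
image_paths = [os.path.join(train_path, f) for f in os.listdir(train_path)]
images, labels = [], []

for image in image_paths:
    train_image = cv.imread(image, cv.IMREAD_GRAYSCALE)
    label = int(os.path.split(image)[-1].split('.')[1])
    name = os.path.split(image)[-1].split('.')[0]
    frame_num = os.path.split(image)[-1].split('.')[2]
    faces = face_detector.detectMultiScale(train_image)
    for (x, y, w, h) in faces: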
        images.append(train_image[y:y + h, x:x + w])
        labels.append(label)
        print(f"Preparing training images for {name}.{label}.{frame_num}")
        cv.imshow("Training Image", train_image[y:y + h, x:x + w])
        cv.waitKey(50)
cv.destroyAllWindows()

recognizer = cv.face.LBPHFaceRecognizer_create()
recognizer.train(images, np.array(labels))
recognizer.write('lbph_trainer.yml')
print("Training complete. Exiting...")
You’ve seen the imports and the face detector code before. Although
you’ve already cropped the training images to face rectangles in 1_capture.py,
it doesn’t hurt to repeat this procedure. Since 2_train.py is a stand-alone
program, it’s best not to take anything for granted.
Next, you must choose which set of training images to use: the ones you
captured yourself in the trainer folder or the set provided in the demming
_trainer folder. Comment out or delete the line for the one you don't use.
Remember, because you’re not providing a full path to the folder, you’ll
need to launch your program from the folder containing it, which should
be one level above the trainer and demming_trainer folders.
Create a list named image_paths using list comprehension. This will hold
the directory path and filename for each image in the training folder. Then
create empty lists for the images and their labels.
Start a for loop through the image paths. Read the image in grayscale;
then extract its numeric label from the filename and convert it to an inte-
ger. Remember that the label corresponds to the user ID input through
1_capture.py right before it captured the video frames.
Let’s take a moment to unpack what’s happening in this extraction and
conversion process. The os.path.split() method takes a directory path and
returns a tuple of the directory path and the filename, as shown in the fol-
lowing snippet:
>>> import os
>>> path = 'C:\demming_trainer\demming.1.5.jpg'
>>> os.path.split(path)
('C:\\demming_trainer', 'demming.1.5.jpg')
You can then select the last item in the tuple, using an index of -1, and
split it on the dot. This yields a list with four items (the user’s name, user
ID, frame number, and file extension).
>>> os.path.split(path)[-1].split('.')
['demming', '1', '5', 'jpg']
To extract the label value, you choose the second item in the list using
index 1.
Repeat this process to extract the name and frame_num for each image.
These are all strings at this point, which is why you need to turn the user ID
into an integer for use as a label.
Now, call the face detector on each training image. This will return a
numpy.ndarray, which you’ll call faces. Start looping through the array, which
contains the coordinates of the detected face rectangles. Append the part of
the image in the rectangle to the images list you made earlier. Also append the
image’s user ID to the labels list.
Let the user know what’s going on by printing a message in the shell.
Then, as a check, show each training image for 50 milliseconds. If you’ve
ever seen Peter Gabriel’s popular 1986 music video for “Sledgehammer,”
you’ll appreciate this display.
It’s time to train the face recognizer. Just as you do when using OpenCV’s
face detector, you start by instantiating a recognizer object. Next, you call
the train() method and pass it the images list and the labels list, which you
turn into a NumPy array on the fly.
You don’t want to train the recognizer every time someone verifies their
face, so write the results of the training process to a file called lbph_trainer.yml.
Then let the user know the program has ended.
import os
from datetime import datetime
import cv2 as cv

cascade_path = "C:/Python372/Lib/site-packages/cv2/data/"
face_detector = cv.CascadeClassifier(cascade_path +
                                     'haarcascade_frontalface_default.xml')
recognizer = cv.face.LBPHFaceRecognizer_create()
recognizer.read('lbph_trainer.yml')

#test_path = './tester'
test_path = './demming_tester'
image_paths = [os.path.join(test_path, f) for f in os.listdir(test_path)]
Listing 10-4: Importing modules and preparing for face detection and recognition
Listing 10-5: Running face recognition and updating the access log file
Start by looping through the images in the test folder. This will be
either the demming_tester folder or the tester folder. Read each image in as
grayscale and assign the resulting array to a variable named predict_image.
Then run the face detector on it.
Now loop through the face rectangles, as you’ve done before. Print a
message about access being requested; then use OpenCV to resize the face
subarray to 100×100 pixels. This is close to the dimensions of the train-
ing images in the demming_trainer folder. Synchronizing the size of the
images isn’t strictly necessary but helps to improve results in my experience.
If you’re using your own images to represent Captain Demming, you should
check that the training image and test image dimensions are similar.
Now it’s time to predict the identity of the face. Doing so takes only one
line. Just call the predict() method on the recognizer object and pass it the
face subarray. This method will return an ID number and a distance value.
The lower the distance value, the more likely the predicted face has
been correctly identified. You can use the distance value as a threshold: all
images that are predicted to be Captain Demming and score at or below the
threshold will be positively identified as Captain Demming. All the others
will be assigned to 'unknown'.
To apply the threshold, use an if statement. If you're using your own
training and test images, set the distance value to 1,000 the first time you
run the program. Review the distance values for all the images in the test
folder, both known and unknown. Find a threshold value below which all
the faces are correctly identified as Captain Demming. This will be your
discriminator going forward. For the images in the demming_trainer and
demming_tester folders, the threshold distance should be 95.
Next, get the name for the image by using the predicted_id value as
a key in the names dictionary. Print a message in the shell stating that the
image has been identified and include the image filename, the name from
the dictionary, and the distance value.
For the log, print a message indicating that name (in this case, Captain
Demming) has been granted access to the lab and include the time using
the datetime module.
You’ll want to keep a persistent file of people’s comings and goings.
Here’s a neat trick for doing so: just write to a file using the print() func-
tion. Open the lab_access_log.txt file and include the a parameter for
“append.” This way, instead of overwriting the file for each new image,
you’ll add a new line at the bottom. Here’s an example of the file contents:
If the conditional is not met, set name to 'unknown' and print a message to
that effect. Then draw a rectangle around the face and post the user’s name
using OpenCV’s putText() method. Show the image for two seconds before
destroying it.
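Listing 10-5 isn't reproduced here, but a rough sketch of the loop just described might look like the following (it assumes the imports and variables from Listing 10-4, plus a names dictionary mapping Captain Demming's ID to his name; the window text and messages are placeholders):

names = {1: 'Captain Demming'}   # Assumed mapping of user IDs to names.

for image_path in image_paths:
    predict_image = cv.imread(image_path, cv.IMREAD_GRAYSCALE)
    faces = face_detector.detectMultiScale(predict_image, scaleFactor=1.2,
                                           minNeighbors=5)
    for (x, y, w, h) in faces:
        print(f"\nAccess requested for {image_path}.")
        face = cv.resize(predict_image[y:y + h, x:x + w], (100, 100))
        predicted_id, dist = recognizer.predict(face)
        if predicted_id == 1 and dist <= 95:
            name = names[predicted_id]
            print(f"{image_path} identified as {name} (distance = {dist:.1f}).")
            print(f"Access granted to {name} at {datetime.now()}.",
                  file=open('lab_access_log.txt', 'a'))  # Append to the log file.
        else:
            name = 'unknown'
            print(f"{image_path} is unknown.")
        cv.rectangle(predict_image, (x, y), (x + w, y + h), 255, 2)
        cv.putText(predict_image, name, (x, y - 5),
                   cv.FONT_HERSHEY_PLAIN, 1.5, 255, 2)
        cv.imshow('Access Check', predict_image)
        cv.waitKey(2000)
        cv.destroyWindow('Access Check')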
Results
You can see some example results, from the 20 images in the demming_tester
folder, in Figure 10-8. The predictor code correctly identified the eight images
of Captain Demming with no false positives.
For the LBPH algorithm to be highly accurate, you need to use it under
controlled conditions. Remember that by forcing the user to gain access
through the laptop, you controlled their pose, the size of their face, the
image resolution, and the lighting.
Further Reading
“Local Binary Patterns Applied to Face Detection and Recognition”
(Polytechnic University of Catalonia, 2010), by Laura María Sánchez López,
is a thorough review of the LBPH approach. The PDF can be found online
at sites such as https://www.semanticscholar.org/.
“Look at the LBP Histogram,” on the AURlabCVsimulator site (https://
aurlabcvsimulator.readthedocs.io/en/latest/), includes Python code that lets you
visualize an LBPH image.
If you’re a macOS or Linux user, be sure to check out Adam Geitgey’s
face_recognition library, a simple-to-use and highly accurate face recognition
system that utilizes deep learning. You can find installation instructions and
an overview at the Python Software Foundation site: https://pypi.org/project
/face_recognition/.
“Machine Learning Is Fun! Part 4: Modern Face Recognition with Deep
Learning” (Medium, 2016), by Adam Geitgey, is a short and enjoyable over-
view of modern face recognition using Python, OpenFace, and dlib.
“Liveness Detection with OpenCV” (PyImageSearch, 2019), by Adrian
Rosebrock, is an online tutorial that teaches you how to protect your face
recognition system against spoofing by fake faces, such as a photograph of
Captain Demming held up to the webcam.
Cities and colleges around the world have begun banning facial recog-
nition systems. Inventors have also gotten into the act, designing clothing
that can confound the systems and protect your identity. “These Clothes
Use Outlandish Designs to Trick Facial Recognition Software into Thinking
You’re Not Human” (Business Insider, 2020), by Aaron Holmes, and “How
to Hack Your Face to Dodge the Rise of Facial Recognition Tech” (Wired,
2019), by Elise Thomas, review some recent practical—and impractical—
solutions to the problem.
“OpenCV Age Detection with Deep Learning” (PyImageSearch, 2020)
by Adrian Rosebrock, is an online tutorial for using OpenCV to predict a
person’s age from their photograph.
dynamically recognizes faces in the webcam’s video stream. The face rect-
angle and name should appear in the video frame as they do on the folder
images.
To start the program, have the user enter a password that you verify. If
it’s correct, add audio instructions telling the user to look at the camera.
If the program positively identifies Captain Demming, use audio to announce
that access is granted. Otherwise, play an audio message stating that access
is denied.
If you need help with identifying the face from the video stream, see
the challenge_video_recognize.py program in the appendix. Note that you may
need to use a higher confidence value for the video frame than the value
you used for the still photographs.
So that you can keep track of who has tried to enter the lab, save a
single frame to the same folder as the lab_access_log.txt file. Use the logged
results from datetime.now() as the filename so you can match the face to the
access attempt. Note that you’ll need to reformat the string returned from
datetime.now() so that it only contains characters acceptable for filenames,
as defined by your operating system.
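For example, a timestamp made filename-safe with strftime() (the format string is just one reasonable choice):

from datetime import datetime

timestamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')  # No colons or spaces.
out_filename = f"{timestamp}.jpg"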
Figure 11-2: Choropleth map of the 2016 US presidential election results
(light gray = Democrat, dark gray = Republican)
To determine the best routes through the counties, the survivors can
use state highway maps like the ones found in service stations and welcome
centers. These paper maps include county and parish outlines, making it
easy to relate their network of cities and roads to a page-sized printout of
the choropleth map.
THE OBJECTIVE
Create an interactive map of the conterminous United States (the 48 adjoining states) that
displays population density by county.
The Strategy
Like all data visualization exercises, this task consists of the following basic
steps: finding and cleaning the data, choosing the type of plot and the tool
with which to show the data, preparing the data for plotting, and drawing
the data.
Finding the data is easy in this case, as the US census population data
is made readily available to the public. You still need to clean it, however, by
finding and handling bogus data points, null values, and formatting issues.
Ideally you would also verify the accuracy of the data, a difficult job that
data scientists probably skip far too often. The data should at least pass a
sanity check, something that may have to wait until the data is drawn. New
York City should have a greater population density than Billings, Montana,
for example.
Index    Value
0           25
1          432
2         –112
3           99
Unlike the indexes of Python list items, the indexes in a series don’t have
to be integers. In Table 11-2, the indexes are the names of people, and the
values are their ages.
As with a list or NumPy array, you can slice a series or select individual
elements by specifying an index. You can manipulate the series many ways,
such as filtering it, performing mathematical operations on it, and merging
it with other series.
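Here's a small illustration of those ideas (the names and ages are made-up stand-ins for the kind of data in Table 11-2):

import pandas as pd

ages = pd.Series([25, 432, -112, 99])                        # Default integer index, as in the table above.
people = pd.Series({'Javier': 68, 'Carol': 73, 'Lora': 25})  # Labels as the index.
print(people['Carol'])        # Select a single element by label.
print(people[people > 30])    # Filter the series.
print(people + 1)             # Perform math on every value.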
A dataframe is a more complex structure comprising two dimensions.
It has a tabular structure similar to a spreadsheet, with columns, rows, and
data (Table 11-3). You can think of it as an ordered collection of columns
with two indexing arrays.
Index  Country  State    County   Population
0      USA      Alabama  Autauga  54,571
1      USA      Alabama  Baldwin  182,265
2      USA      Alabama  Barbour  27,457
3      USA      Alabama  Bibb     22,915
The first index, for the rows, works much like the index array in a series.
The second keeps track of the series of labels, with each label representing
a column header. Dataframes also resemble dictionaries; the column names
form the keys, and the series of data in each column forms the values. This
structure lets you easily manipulate dataframes.
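Here's a sketch of building the dataframe in Table 11-3 from a dictionary of columns:

import pandas as pd

data = {'Country': ['USA', 'USA', 'USA', 'USA'],
        'State': ['Alabama', 'Alabama', 'Alabama', 'Alabama'],
        'County': ['Autauga', 'Baldwin', 'Barbour', 'Bibb'],
        'Population': [54571, 182265, 27457, 22915]}
df = pd.DataFrame(data)
print(df)             # Row index 0-3, column labels from the dictionary keys.
print(df['County'])   # Each column behaves like a series.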
Covering all the functionality in pandas would require a whole book, and
you can find plenty online! We’ll defer additional discussion until the code
section, where we’ll look at specific examples as we apply them.
As you can see, the program will tell you where on your machine it’s
putting the data so that bokeh can automatically find it. Your path will differ
from mine. For more on downloading the sample data, see https://docs.bokeh
.org/en/latest/docs/reference/sampledata.html.
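If you haven't downloaded the sample data yet, the call looks like this:

import bokeh.sampledata

bokeh.sampledata.download()   # Prints the folder where the sample files are stored.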
Look for US_Counties.csv and unemployment09.csv in the folder of down-
loaded files. These plaintext files use the popular comma-separated values
(CSV) format, in which each line represents a data record with multiple
fields separated by commas. (Good luck saying “CSV” right if you regularly
shop at a CVS pharmacy!)
The unemployment file is instructive of the plight of the data scientist.
If you open it, you’ll see that there are no column names describing the data
(Figure 11-3), though it’s possible to guess what most of the fields represent.
We’ll deal with this later.
If you open the US counties file, you’ll see lots of columns, but at least
they have headers (Figure 11-4). Your challenge will be to relate the un-
employment data in Figure 11-3 to the geographical data in Figure 11-4 so
that you can do the same later with the census data.
At this point, you have all the Python libraries and data files you need
to generate a population density choropleth map in theory. Before you can
write the code, however, you need to know how you’re going to link the
population data to the geographical data so that you can place the correct
county data in the correct county shape.
Hacking holoviews
Learning to adapt existing code for your own use is a valuable skill for a
data scientist. This may require a bit of reverse engineering. Because open
source software is free, it’s sometimes poorly documented, so you have to
figure out how it works on your own. Let’s take a moment and apply this
skill to our current problem.
In previous chapters, we took advantage of the gallery examples provided
by open source modules such as turtle and matplotlib. The holoviews library
also has a gallery (http://holoviews.org/gallery/index.html), and it includes
Texas Choropleth Example, a choropleth map of the Texas unemployment
rate in 2009 (Figure 11-6).
Figure 11-6: Choropleth map of the 2009 Texas unemployment rate from the holoviews gallery
Listing 11-1 contains the code provided by holoviews for this map.
You’ll build your project based on this example, but to do so, you’ll have
to address two main differences. First, you plan to plot population density
rather than unemployment rate. Second, you want a map of the contermi-
nous United States, not just Texas.
choropleth.opts(opts.Polygons(logz=True,
                              tools=['hover'],
                              xaxis=None, yaxis=None,
                              show_grid=False,
                              show_frame=False,
                              width=500, height=500,
                              color_index='Unemployment',
                              colorbar=True, toolbar='above',
                              line_color='white'))
Listing 11-1: holoviews gallery code for generating the Texas choropleth
The code imports the data from the bokeh sample data. You'll need to
know the format and content of both the unemployment and counties variables.
The unemployment rate is accessed later using the unemployment variable and
an index or key of cid, which may stand for "county ID." The program
selects Texas, rather than the whole United States, based on a conditional
using a state code.
Let’s investigate this in the Python shell.
Start by importing the bokeh sample data using the syntax from the gal-
lery example. Next, use the type() built-in function to check the data type
of the unemployment variable. You'll see that it's a dictionary.
Now, use dictionary comprehension to make a new dictionary comprising
the first two items in unemployment. Print the results, and you'll see that the
keys are tuples and the values are numbers, presumably the unemployment
rate in percent. Check the data type for the numbers in the key. They're
integers rather than strings.
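A rough reconstruction of that shell session (the exact prompts may differ from the book's) looks like this:

>>> from bokeh.sampledata.us_counties import data as counties
>>> from bokeh.sampledata.unemployment import data as unemployment
>>> type(unemployment)
<class 'dict'>
>>> first_two = {key: unemployment[key] for key in list(unemployment)[:2]}
>>> print(first_two)
{(1, 1): 9.7, (1, 3): 9.1}
>>> type(list(first_two)[0][0])
<class 'int'>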
Compare this output to the first two rows in the CSV file in
Figure 11-3. The first number in the key tuple, presumably a state code,
comes from column B. The second number in the tuple, presumably a
county code, comes from column C. The unemployment rate is obviously
stored in column I.
Now compare the contents of unemployment to Figure 11-4, representing
the county data. The STATE num (column J) and COUNTY num (column K)
obviously hold the components of the key tuple.
So far so good, but if you look at the population data file in Figure 11-5,
you won’t find a state or county code to direct into a tuple. There are numbers
in column E, however, that match those in the last column of the county
data, labeled FIPS formula in Figure 11-4. These FIPS numbers seem to
relate to the state and county codes.
As it turns out, a Federal Information Processing Series (FIPS) code is
basically a ZIP code for a county. The FIPS code is a five-digit numeric
code assigned to each county by the National Institute of Standards and
Technology. The first two digits represent the county’s state, and the final
three digits represent the county (Table 11-4).
Congratulations, you now know how to link the US census data to the
county shapes in the bokeh sample data. It’s time to write the final code!
Importing Modules and Data and Constructing a Dataframe
Listing 11-2 imports modules and the bokeh county sample data that
includes coordinates for all the US county polygons. It also loads and cre-
ates a dataframe object to represent the population data. Then it begins the
process of cleaning and preparing the data for use with the county data.
df = pd.read_csv('census_data_popl_2010.csv', encoding="ISO-8859-1")
df = pd.DataFrame(df,
                  columns=['Target Geo Id2',
                           'Geographic area.1',
                           'Density per square mile of land area - Population'])
df.rename(columns={'Target Geo Id2': 'fips',
                   'Geographic area.1': 'County',
                   'Density per square mile of land area - Population': 'Density'},
          inplace=True)
Listing 11-2: Importing modules and data, creating a dataframe, and renaming columns
Start by importing abspath from the operating system library. You’ll use
this to find the absolute path to the choropleth map HTML file after it’s
created. Then import the webbrowser module so you can launch the HTML
file. You need this because the holoviews library is designed to work with a
Jupyter Notebook and won’t automatically display the map without some help.
Next, import pandas and repeat the holoviews imports from the gallery
example in Listing 11-1. Note that you must specify bokeh as the holoviews
extension, or backend. This is because holoviews can work with other plot-
ting libraries, such as matplotlib, and needs to know which one to use.
You brought in the geographical data with the imports. Now load the
population data using pandas. This module includes a set of input/output
API functions to facilitate reading and writing data. These readers and writers
address major formats such as comma-separated values (read_csv, to_csv),
Excel (read_excel, to_excel), Structured Query Language (read_sql, to_sql),
HyperText Markup Language (read_html, to_html), and more. In this project,
you’ll work with the CSV format.
If you try to read the file with a plain call and the default encoding, like this:

df = pd.read_csv('census_data_popl_2010.csv')

you'll get a UnicodeDecodeError. That's because the file contains characters
encoded with Latin-1, also known as ISO-8859-1, rather than the default UTF-8
encoding. Adding the encoding argument will fix the problem.
Now, turn the population data file into a tabular dataframe by calling
the DataFrame() constructor. You don’t need all the columns in the original
file, so pass the names of the column you want to keep to the constructor.
These represent columns E, G, and M in Figure 11-5, or the FIPS code,
county name (without the state name), and population density, respectively.
Next, use the rename() dataframe method to make the column labels
shorter and more meaningful. Call them fips, County, and Density.
Finish the listing by printing the first few rows of the dataframe using
the head() method and by printing the shape of the dataframe using its
shape attribute. By default, the head() method prints the first five rows. If you
want to see more rows, you can pass it the number as an argument, such as
head(20). You should see the following output in the shell:
Notice that the first two rows (rows 0 and 1) are not useful. In fact,
you can glean from this output that each state will have a row for its name,
which you’ll want to delete. You can also see from the shape attribute that
there are 3,274 rows in the dataframe.
Removing Extraneous State Name Rows and Preparing the State and County Codes
Listing 11-3 removes all rows whose FIPS code is less than or equal to 100.
These are header rows that indicate where a new state begins. It then cre-
ates new columns for the state and county codes, which it derives from the
existing column of FIPS codes. You’ll use these later to select the proper
county outline from the bokeh sample data.
choropleth.py, part 2

df = df[df['fips'] > 100]
print(f"Popl data with non-county rows removed:\n {df.head()}")
print(f"Shape of df = {df.shape}\n")
Listing 11-3: Removing extraneous rows and preparing the state and county codes
To display the population density data in the county polygons, you need
to turn it into a dictionary where the keys are a tuple of the state code and
county code and the values are the density data. But as you saw previously,
the population data does not include separate columns for the state and
county codes; it has only the FIPS codes. So, you’ll need to split out the state
and county components.
First, get rid of all the noncounty rows. If you look at the previous shell
output (or rows 3 and 4 in Figure 11-5), you’ll see that these rows do not
include a four- or five-digit FIPS code. You can thus use the fips column to
make a new dataframe, still named df, that preserves only rows with a fips
value greater than 100. To check that this worked, repeat the printout from
the previous listing, as shown here:
The two “bad” rows at the top of the dataframe are now gone, and
based on the shape attribute, you’ve lost a total of 53 rows. These represent
the header rows for the 50 states, United States, District of Columbia (DC),
and Puerto Rico. Note that DC has a FIPS code of 11001 and Puerto Rico
uses a state code of 72 to go with the three-digit county code for its 78
municipalities. You’ll keep DC but remove Puerto Rico later.
Next, create columns for state and county code numbers. Name the
first new column state_id. Dividing by 1,000 using floor division (//)
returns the quotient with the digits after the decimal point removed. Since
the last three numbers of the FIPS code are reserved for county codes, this
leaves you with the state code.
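Here's a sketch of both derived columns; the modulo step for cid is an assumption that mirrors the floor division:

df['state_id'] = (df['fips'] // 1000).astype(int)    # First two digits of the FIPS code.
df['cid'] = (df['fips'] % 1000).astype(int)          # Last three digits of the FIPS code.
print("df info:")
print(df.info())    # info() prints its summary and returns None.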
df info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3221 entries, 2 to 3273
Data columns (total 5 columns):
fips 3221 non-null float64
County 3221 non-null object
Density 3221 non-null float64
state_id 3221 non-null int64
cid 3221 non-null int64
dtypes: float64(2), int64(2), object(1)
memory usage: 151.0+ KB
None
As you can see from the columns and information summary, the state_id
and cid numbers are integer values.
The state codes in the first five rows are all single-digit numbers, but
it’s possible for state codes to have double digits, as well. Take the time to
check the state codes of later rows. You can do this with the dataframe's loc
indexer, passing it a high row number. This will
let you check double-digit state codes.
The fips, state_id, and cid all look reasonable. This completes the data
preparation. The next step is to turn this data into a dictionary that holoviews
can use to make the choropleth map.
(1, 1) : 9.7
(1, 3) : 9.1
--snip--
To create a similar dictionary for the population data, first use the
pandas tolist() method to create separate lists of the dataframe’s state_id,
cid, and Density columns. Then, use the built-in zip() function to merge
the state and county code lists as tuple pairs. Create the final dictionary,
popl_dens_dict, by zipping this new tuple_list with the density list. (The
name tuple_list is misleading; technically, it’s a tuple_tuple.) That’s it for
the data preparation.
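A sketch of that dictionary-building step might look like this:

state_ids = df.state_id.tolist()
cids = df.cid.tolist()
den = df.Density.tolist()
tuple_list = tuple(zip(state_ids, cids))          # ((state, county), ...) pairs.
popl_dens_dict = dict(zip(tuple_list, den))       # {(state, county): density}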
The Walking Dead survivors will be lucky to get out of Atlanta. Let’s
forget about them reaching Alaska. Make a tuple, named EXCLUDED, of states
and territories that are in the bokeh county data but aren’t part of the con-
terminous United States. These include Alaska, Hawaii, Puerto Rico, Guam,
Virgin Islands, Northern Mariana Islands, and American Samoa. To reduce
typing, you can use the abbreviations provided as a column in the county
dataset (see Figure 11-4).
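A rough adaptation of the gallery code for this step might look like the following (the 'detailed name' dimension follows the gallery example):

EXCLUDED = ('ak', 'hi', 'pr', 'gu', 'vi', 'mp', 'as')

counties = [dict(county, Density=popl_dens_dict[cid])
            for cid, county in counties.items()
            if county['state'] not in EXCLUDED]

choropleth = hv.Polygons(counties, ['lons', 'lats'],
                         vdims=[hv.Dimension('detailed name'), 'Density'])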
The Density key-value pair now replaces the unemployment rate pair
used in the holoviews gallery example. Next up, plotting the map!
choropleth.opts(opts.Polygons(logz=True,
                              tools=['hover'],
                              xaxis=None, yaxis=None,
                              show_grid=False, show_frame=False,
                              width=1100, height=700,
                              colorbar=True, toolbar='above',
                              color_index='Density', cmap='Greys', line_color=None,
                              title='2010 Population Density per Square Mile of Land Area'
                              ))
Figure 11-7: Choropleth map with the hover feature active
Next, set the options for the map. First, permit use of a logarithmic
color bar by setting the logz argument to True.
The holoviews window will come with a set of default tools such as pan,
zoom, save, refresh, and so on (see the upper-right corner of Figure 11-7).
Use the tools argument to add the hover feature to this list. This allows you
to query the map and get both the county name and detailed information on
the population density.
You’re not making a standard plot with an annotated x-axis and y-axis,
so set these to None. Likewise, don’t show a grid or frame around the map.
Set the width and height of the map in pixels. You may want to adjust
this for your monitor. Next set colorbar to True and place the toolbar at the
top of the display.
Since you want to color the counties based on population density, set
the color_index argument to Density, which represents the values in popl_dens
_dict. For the fill colors, use the Greys cmap. If you want to use brighter colors,
you can find a list of available colormaps at http://build.holoviews.org/user_guide
/Colormaps.html. Be sure to choose one with “bokeh” in the name. Finish
the color scheme by selecting a line color for the county outlines. Good
choices for a gray colormap are None, 'white', or 'black'.
Complete the options by adding a title. The choropleth map is now
ready for plotting.
To save your map in the current directory, use the holoviews save() method
and pass it the choropleth variable, a file name with the .html extension, and
the name of the plotting backend being used. As mentioned previously,
holoviews is designed for use with a Jupyter Notebook. If you want the map
to automatically pop up on your browser, first assign the full path to the
saved map to a url variable. Then use the webbrowser module to open url and
display the map (Figure 11-8).
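A sketch of that save-and-display step (the filename is a placeholder):

hv.save(choropleth, 'choropleth.html', backend='bokeh')
url = abspath('choropleth.html')
webbrowser.open(url)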
You can use the toolbar at the top of the map to pan, zoom (using a
box or lasso), save, refresh, or hover. The hover tool, shown in Figure 11-7,
will help you find the least populated counties in places where the map
shading makes the difference hard to distinguish visually.
NOTE The Box Zoom tool permits a quick view of a rectangular area but may stretch or
squeeze the map axes. To preserve the map’s aspect ratio when zooming, use a combi-
nation of the Wheel Zoom and Pan tools.
Figure 11-9: The Chisos Mountains of west Texas (left) with 3D relief map representation (right)
With your choropleth map, you can quickly plan a route to this natural
fortress far, far away. But first, you need to escape Atlanta. The shortest route
out of the metropolitan area is a narrow passage squeezed between the cities
of Birmingham and Montgomery in Alabama (Figure 11-10). You can skirt
the next big city, Jackson, Mississippi, by going either north or south. To
choose the best route, however, you need to look farther ahead.
[Figure 11-10: Map of the route out of Atlanta, showing Birmingham, Montgomery, Jackson, Monroe, Little Rock, and Memphis]
The southerly route around Jackson is shorter but forces the survivors
to pass over the highly developed I-35 corridor, anchored by San Antonio in
the south and Dallas–Fort Worth (DFW) in the north (Figure 11-11). This
creates a potentially dangerous choke point at Hill County, Texas (circled
in Figure 11-11).
[Figure 11-11: Map of the southerly and northerly routes, showing the Red River, DFW, Austin, San Antonio, Houston, Big Bend, Jackson, Monroe, Birmingham, and Atlanta]
Alternatively, the northerly route through the Red River Valley, between
Oklahoma and Texas, would be longer but safer, especially if you took
advantage of the navigable river. Once west of Fort Worth, the survivors
could cross the river and turn south to salvation.
This type of planning would be even simpler if holoviews provided a
slider tool that allowed you to interactively alter the color bar. For example,
you could filter out or change the shading of counties by simply dragging
your cursor up and down the legend. This would make it easier to find
connected routes through the lowest population counties.
Unfortunately, a slider tool isn't one of the holoviews window options.
Since you know pandas, though, that won’t stop you. Simply add the following
snippet of code after the line that prints the information at location 500:
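# Rough reconstruction of the snippet described here: set any density of 65
# or more to a constant value of 1000.
df.loc[df.Density >= 65, ['Density']] = 1000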
This will change the population density values in the dataframe, setting
those greater than or equal to 65 to a constant value of 1000. Run the pro-
gram again, and you’ll get the plot in Figure 11-12. With the new values,
the San Antonio–Austin–Dallas barrier becomes more apparent, as does the
relative safety of the Red River Valley that forms the northern border of
east Texas.
You may be wondering, where did the survivors go in the TV show?
They went nowhere! They spent the first four seasons in the vicinity of
Atlanta, first camping at Stone Mountain and then holed up in a prison
near the fictional town of Woodbury (Figure 11-13).
Figure 11-12: Counties with more than 65 people per square mile shaded black
Figure 11-13: Location of Stone Mountain and the fictional town of Woodbury
Further Reading
“If the Zombie Apocalypse Happens, Scientists Say You Should Run for the
Hills” (Business Insider, 2017), by Kevin Loria, describes the application of
standard disease models to infection rates in a theoretical zombie outbreak.
“What to Consider When Creating Choropleth Maps” (Chartable, 2018),
by Lisa Charlotte Rost, provides useful guidelines for making choropleth maps.
You can find it at https://blog.datawrapper.de/choroplethmaps/.
“Muddy America: Color Balancing the Election Map—Infographic”
(STEM Lounge, 2019) by Larry Weru, demonstrates ways to increase the
useful detail in choropleth maps, using the iconic red-blue United States
election map as an example.
Python Data Science Handbook: Essential Tools for Working with Data
(O’Reilly Media, 2016), by Jake VanderPlas, is a thorough reference for
important Python data science tools, including pandas.
Beneath the Window: Early Ranch Life in the Big Bend Country (Iron Mountain
Press, 2003), by Patricia Wilson Clothier, is an engaging recollection of
growing up in the early 20th century on a vast ranch in the Big Bend coun-
try of Texas, before it became a national park. It provides insight into how
apocalypse survivors might deal with life in the harsh country.
Game Theory: Real Tips for SURVIVING a Zombie Apocalypse (7 Days to Die)
(The Game Theorists, 2016) is a video on the best place in the world to escape
a zombie apocalypse. Unlike The Walking Dead, the video assumes that the
zombie virus can be transmitted by mosquitoes and ticks, and it selects the
location with this in mind. It’s available online.
>>> pop_2020 = {'county': ['Autauga', 'Baldwin', 'Barbour', 'Bibb'],
...             'popl': [52910, 258321, 29073, 29881]}
>>>
>>> df_2010 = pd.DataFrame(pop_2010)
>>> df_2020 = pd.DataFrame(pop_2020)
>>> df_diff = df_2020.copy() # Copy the 2020 dataframe to a new df
>>> df_diff['diff'] = df_diff['popl'].sub(df_2010['popl']) # Subtract popl columns
>>> print(df_diff.loc[:4, ['county', 'diff']])
county diff
0 Autauga -1661
1 Baldwin 76056
2 Barbour 1616
3 Bibb 6966
THE OBJECTIVE
Identify a feature of a computer simulation that might be detectable by those being simulated.
import turtle

pond = turtle.Screen()
pond.setup(600, 400)
pond.bgcolor('light blue')
pond.title("Yertle's Pond")
mud = turtle.Turtle('circle')
mud.shapesize(stretch_wid=5, stretch_len=5, outline=None)
mud.pencolor('tan')
mud.fillcolor('tan')
Listing 12-1: Importing the turtle module and drawing a pond and mud island
knot = turtle.Turtle()
knot.hideturtle()
knot.speed(0)
knot.penup()
knot.setpos(245, 5)
knot.begin_fill()
yertle = turtle.Turtle('turtle')
yertle.color('green')
yertle.speed(1) # Slowest.
yertle.fd(200)
yertle.lt(180)
yertle.fd(200)
yertle.rt(176)
yertle.fd(200)
Listing 12-2: Drawing a log and a turtle and then moving the turtle around
Figure 12-1: Screenshot of completed simulation
In the turtle world, pixels are true atoms: indivisible. A line can’t be
shorter than one pixel. Movement can occur only as integers of pixels
(though you can input float values without raising an error). The smallest
object possible is one pixel in size.
An implication of this is that the simulation’s grid determines the
smallest feature you can observe. Since we can observe incredibly small sub-
atomic particles, our grid, assuming we’re a simulation, must be incredibly
fine. This leads many scientists to seriously doubt the simulation conjecture,
since it would require a staggering amount of computer memory. Still, who
knows what our distant descendants, or aliens, are capable of?
Besides setting a limit on the size of objects, a simulation grid might
force a preferred orientation, or anisotropy, on the fabric of the cosmos.
Anisotropy is the directional dependence of a material, such as the way
wood splits more easily along its grain rather than across it. If you look
closely at Yertle’s paths in the turtle simulation (Figure 12-3), you can see
evidence of anisotropy. His upper, slightly angled path zigzags, while the
lower, east-west path is perfectly straight.
Figure 12-4: Movement along rows or columns (left) requires simpler arithmetic than moving across them (right)
turtle.setup(1200, 600)
screen = turtle.Screen()
line_ave = statistics.mean(times)
print("Angle {} degrees: average time for {} runs at speed {} = {:.5f}"
      .format(angle, NUM_RUNS, SPEED, line_ave))
Listing 12-3: Drawing a straight line and an angled line and recording the runtimes for each
Move the turtle forward by 962 pixels, sandwiching this command
between calls to perf_counter() to time the movement. Subtract the start
time from the end time and append the result to the times list.
Finish by using the statistics.mean() function to find the average runtime
for each line. Print the results to five decimal places. After the program runs,
the turtle screen should look like Figure 12-5.
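A sketch of that timing pattern (the turtle's variable name is a placeholder):

from time import perf_counter

start = perf_counter()
t.fd(962)                      # Move the turtle forward 962 pixels.
end = perf_counter()
times.append(end - start)      # Elapsed seconds for this run.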
Because you used a Pythagorean triple, the angled line truly ends
on a pixel. It doesn’t just snap to the nearest pixel. Consequently, you
can be confident that the straight and angled lines have the same length
and that you’re comparing apples to apples when it comes to the timing
measurements.
Results
If you draw each line 500 times and then compare the results, you should
see that it takes roughly 2.4 times as long to draw the angled line as the
straight line.
Your times will likely differ slightly, as they’re affected by other pro-
grams you may have running concurrently on your computer. As noted pre-
viously, CPU scheduling will manage all these processes so that your system
is fast, efficient, and fair.
If you repeat the exercise for 1,000 runs, you should get similar results.
(If you decide to do so, you’ll want to get yourself a cup of coffee and some
of that good pie.) The angled line will take about 2.7 times as long to draw.
Clearly, moving across the pixel grid requires more work than moving
along it.
The Strategy
The goal of this project was to identify a way for simulated beings, perhaps
us, to find evidence of the simulation. At this point, we know at least two
things. First, if we’re living in a simulation, the grid is extremely small, as
we can observe subatomic particles. Second, if these small particles cross
the simulation’s grid at an angle, we should expect to find computational
resistance that translates into something measurable. This resistance might
look like a loss of energy, a scattering of particles, a reduction in velocity, or
something similar.
In 2012, physicists Silas R. Beane, from the University of Bonn, and
Zohreh Davoudi and Martin J. Savage, from the University of Washington,
published a paper arguing exactly this point. According to the authors,
if the laws of physics, which appear continuous, are superimposed on
a discrete grid, the grid spacing might impose a limitation on physical
processes.
They proposed investigating this by observing ultra-high energy cosmic
rays (UHECRs). UHECRs are the fastest particles in the universe, and they
are affected by increasingly smaller features as they get more energetic. But
there’s a limit to how much energy these particles can have. Known as the
GZK cutoff and confirmed by experiments in 2007, this limit is consistent
with the kind of boundary a simulation grid might cause. Such a boundary
should also cause UHECRs to travel preferentially along the grid’s axes and
scatter particles that try to cross it.
Not surprisingly, there are many potential obstacles to this approach.
UHECRs are rare, and anomalous behavior might not be obvious. If the
spacing of the grid is significantly smaller than 10⁻¹² femtometers, we prob-
ably can’t detect it. There may not even be a grid, at least as we understand
it, as the technology in use may far exceed our own. And, as the philoso-
pher Preston Greene pointed out in 2019, there may be a moral obstacle to
the entire project. If we live in a simulation, our discovery of it may trigger
its end!
Summary
From a coding standpoint, building Yertle’s simulated world was simple.
But a big part of coding is solving problems, and the small amount of work
you did had major implications. No, we didn’t make the leap to cosmic rays,
but we started the right conversation. The basic premise that a computer
simulation requires a grid that could imprint observable signatures on the
universe is an idea that transcends nitty-gritty details.
In the book Harry Potter and the Deathly Hallows, Harry asks the wizard
Dumbledore, “Tell me one last thing. Is this real? Or has this been happen-
ing inside my head?” Dumbledore replies, “Of course it is happening inside
your head, Harry, but why on Earth should that mean that it is not real?”
Even if our world isn’t located at the “fundamental level of reality,” as
Nick Bostrom postulates, you can still take pleasure in your ability to solve
problems such as this. As Descartes might’ve said, had he lived today, “I
code, therefore I am.” Onward!
Further Reading
“Are We Living in a Simulated Universe? Here’s What Scientists Say” (NBC
News, 2019), by Dan Falk, provides an overview of the simulation hypothesis.
“Neil deGrasse Tyson Says ‘It’s Very Likely’ the Universe Is a Simulation”
(ExtremeTech, 2016), by Graham Templeton, is an article with an embedded
video of the Isaac Asimov Memorial Debate, hosted by astrophysicist Neil
deGrasse Tyson, that addresses the possibility that we’re living in a simulation.
“Are We Living in a Computer Simulation? Let’s Not Find Out” (New
York Times, 2019), by Preston Greene, presents a philosophical argument
against investigating the simulation hypothesis.
“We Are Not Living in a Simulation. Probably.” (Fast Company, 2018), by
Glenn McDonald, argues that the universe is too big and too detailed to be
simulated computationally.
Moving On
There’s never enough time in life to do all the things we want, and that goes
double for writing a book. The challenge projects that follow represent the
ghosts of chapters not yet written. There was no time to finish these (or in
some cases, even start them), but you might have better luck. As always, the
book provides no solutions for challenge projects—not that you’ll need them.
This is the real world, baby, and you’re ready for it.
Challenge Project: Seeing Through a Dog’s Eyes
Use your knowledge of computer vision to write a Python program that
takes an image and simulates what a dog would see. To get started, check
out https://www.akc.org/expert-advice/health/are-dogs-color-blind/ and https://
dog-vision.andraspeter.com/.
PRACTICE PROJECT SOLUTIONS
corpus = file_loader.text_to_string('hound.txt')
tokens = nltk.word_tokenize(corpus)
tokens = nltk.Text(tokens) # NLTK wrapper for automatic text analysis.
dispersion = tokens.dispersion_plot(['Holmes',
                                     'Watson',
                                     'Mortimer',
                                     'Henry',
                                     'Barrymore',
                                     'Stapleton',
                                     'Selden',
                                     'hound'])
Punctuation Heatmap
PUNCT_SET = set(punctuation)
def main():
    # Load text files into dictionary by author.
    strings_by_author = dict()
    strings_by_author['doyle'] = text_to_string('hound.txt')
    strings_by_author['wells'] = text_to_string('war.txt')
    strings_by_author['unknown'] = text_to_string('lost.txt')
def make_punct_dict(strings_by_author):
    """Return dictionary of tokenized punctuation by corpus by author."""
    punct_by_author = dict()
    for author in strings_by_author:
        tokens = nltk.word_tokenize(strings_by_author[author])
        punct_by_author[author] = ([token for token in tokens
                                    if token in PUNCT_SET])
        print("Number punctuation marks in {} = {}"
              .format(author, len(punct_by_author[author])))
    return punct_by_author
if __name__ == '__main__':
    main()
def load_file(infile):
    """Read and return text file as string of lowercase characters."""
    with open(infile) as f:
        text = f.read().lower()
    return text
text = load_file(infile)
if __name__ == '__main__':
    main()
def main():
    message = input("Enter plaintext or ciphertext: ")
    process = input("Enter 'encrypt' or 'decrypt': ")
    shift = int(input("Shift value (1-365) = "))
    infile = input("Enter filename with extension: ")
    if not os.path.exists(infile):
        print("File {} not found. Terminating.".format(infile), file=sys.stderr)
        sys.exit(1)
    word_list = load_file(infile)
    word_dict = make_dict(word_list, shift)
    letter_dict = make_letter_dict(word_list)
    if process == 'encrypt':
        ciphertext = encrypt(message, word_dict, letter_dict)
        count = Counter(ciphertext)
def load_file(infile):
    """Read and return text file as a list of lowercase words."""
    with open(infile, encoding='utf-8') as file:
        words = [word.lower() for line in file for word in line.split()]
    words_no_punct = ["".join(char for char in word if char not in
                              string.punctuation) for word in words]
    return words_no_punct
def make_letter_dict(word_list):
    firstLetterDict = defaultdict(list)
encrypted.append(random.choice(word_dict['the']))
encrypted.append(random.choice(word_dict['the']))
continue
encrypted.append(index)
return encrypted
if __name__ == '__main__':
    main()
practice_orbital_path.py

import os
from pathlib import Path
import cv2 as cv

def main():
    night1_files = sorted(os.listdir('night_1_registered_transients'))
    night2_files = sorted(os.listdir('night_2'))
    path1 = Path.cwd() / 'night_1_registered_transients'
    path2 = Path.cwd() / 'night_2'
    path3 = Path.cwd() / 'night_1_2_transients'
    if transient1 or transient2:
        print('\nTRANSIENT DETECTED between {} and {}\n'
              .format(night1_files[i], night2_files[i]))
        font = cv.FONT_HERSHEY_COMPLEX_SMALL
        cv.putText(img1, night1_files[i], (10, 25),
                   font, 1, (255, 255, 255), 1, cv.LINE_AA)
        cv.putText(img1, night2_files[i], (10, 55),
                   font, 1, (255, 255, 255), 1, cv.LINE_AA)
        if transient1 and transient2:
            cv.line(img1, transient_loc1, transient_loc2, (255, 255, 255),
                    1, lineType=cv.LINE_AA)
        out_filename = '{}_DETECTED.png'.format(night1_files[i][:-4])
        cv.imwrite(str(path3 / out_filename), blended)  # Will overwrite!
    else:
        print('\nNo transient detected between {} and {}\n'
              .format(night1_files[i], night2_files[i]))
if __name__ == '__main__':
    main()
practice_montage_aligner.py
MIN_NUM_KEYPOINT_MATCHES = 150
orb = cv.ORB_create(nfeatures=700)
cv.namedWindow('Matches', cv.WINDOW_NORMAL)
img3_resize = cv.resize(img3, (699, 700))
cv.imshow('Matches', img3_resize)
cv.waitKey(7000) # Keeps window open 7 seconds.
cv.destroyWindow('Matches')
cv.imwrite('montage_left_registered.JPG', img1_warped)
cv.imwrite('montage_right_gray.JPG', img2)
else:
print("\n{}\n".format('WARNING: Number of keypoint matches < 10!'))
cv.namedWindow('Difference', cv.WINDOW_NORMAL)
diff_imgs1_2_resize = cv.resize(diff_imgs1_2, (699, 700))
cv.imshow('Difference', diff_imgs1_2_resize)
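The difference image shown here, diff_imgs1_2, isn't computed in this excerpt; with OpenCV it would normally be the absolute per-pixel difference of the registered pair, something like the line below (an assumption based on the chapter's approach):
diff_imgs1_2 = cv.absdiff(img1_warped, img2)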
# Setup screen.
screen = turtle.Screen()
screen.setup(width=SA_X, height=SA_Y)
turtle.resizemode('user')
screen.title("Search Pattern")
rand_x = random.randint(0, int(SA_X / 2)) * random.choice([-1, 1])
rand_y = random.randint(0, int(SA_Y / 2)) * random.choice([-1, 1])
break
Start Me Up!
practice_grav_assist_stationary.py
"""gravity_assist_stationary.py
Moon approaches stationary ship, which is swung around and flung away.
"""
# User input:
G = 8 # Gravitational constant used for the simulation.
NUM_LOOPS = 4100 # Number of time steps in simulation.
Ro_X = 0 # Ship starting position x coordinate.
Ro_Y = -50 # Ship starting position y coordinate.
Vo_X = 0 # Ship velocity x component.
Vo_Y = 0 # Ship velocity y component.
MOON_MASS = 1_250_000
class GravSys():
"""Runs a gravity simulation on n-bodies."""
def __init__(self):
self.bodies = []
self.t = 0
self.dt = 0.001
def sim_loop(self):
"""Loop bodies in a list through time steps."""
for _ in range(NUM_LOOPS):
self.t += self.dt
for body in self.bodies:
body.step()
class Body(Turtle):
"""Celestial object that orbits and projects gravity field."""
def __init__(self, mass, start_loc, vel, gravsys, shape):
super().__init__(shape=shape)
self.gravsys = gravsys
self.penup()
self.mass = mass
self.setpos(start_loc)
self.vel = vel
gravsys.bodies.append(self)
self.pendown() # uncomment to draw path behind object
def acc(self):
"""Calculate combined force on body and return vector components."""
a = Vec(0,0)
for body in self.gravsys.bodies:
if body != self:
r = body.pos() - self.pos()
a += (G * body.mass / abs(r)**3) * r # units dist/time^2
return a
def main():
# Setup screen
screen = Screen()
screen.setup(width=1.0, height=1.0) # for fullscreen
screen.bgcolor('black')
screen.title("Gravity Assist Example")
# Instantiate Planet
image_moon = 'moon_27x27.gif'
screen.register_shape(image_moon)
moon = Body(MOON_MASS, (500, 0), Vec(-500, 0), gravsys, image_moon)
moon.pencolor('gray')
gravsys.sim_loop()
if __name__ == '__main__':
main()
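The acc() method in these gravity listings applies Newton's law of gravitation in vector form: the term G * body.mass / abs(r)**3 * r has magnitude G·m/r² and points from the ship toward the moon. A quick stand-alone sanity check with turtle's Vec2D, using the starting positions from the listing (this check is not part of the solution):
from turtle import Vec2D

G = 8
MOON_MASS = 1_250_000
r = Vec2D(500, 0) - Vec2D(0, -50)    # vector from ship (0, -50) to moon (500, 0)
a = (G * MOON_MASS / abs(r)**3) * r  # acceleration of the ship toward the moon
print(round(abs(a) * abs(r)**2 / (G * MOON_MASS), 6))  # 1.0: magnitude is G*m/r**2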
practice_grav_assist_intersecting.py
"""gravity_assist_intersecting.py
Moon and ship cross orbits and moon slows and turns ship.
"""
# User input:
G = 8 # Gravitational constant used for the simulation.
NUM_LOOPS = 7000 # Number of time steps in simulation.
Ro_X = -152.18 # Ship starting position x coordinate.
Ro_Y = 329.87 # Ship starting position y coordinate.
Vo_X = 423.10 # Ship translunar injection velocity x component.
Vo_Y = -512.26 # Ship translunar injection velocity y component.
MOON_MASS = 1_250_000
class GravSys():
"""Runs a gravity simulation on n-bodies."""
def __init__(self):
self.bodies = []
self.t = 0
self.dt = 0.001
def sim_loop(self):
"""Loop bodies in a list through time steps."""
for index in range(NUM_LOOPS): # stops simulation after while
self.t += self.dt
for body in self.bodies:
body.step()
class Body(Turtle):
"""Celestial object that orbits and projects gravity field."""
def __init__(self, mass, start_loc, vel, gravsys, shape):
super().__init__(shape=shape)
self.gravsys = gravsys
self.penup()
self.mass = mass
self.setpos(start_loc)
self.vel = vel
gravsys.bodies.append(self)
self.pendown() # uncomment to draw path behind object
def step(self):
"""Calculate position, orientation, and velocity of a body."""
dt = self.gravsys.dt
a = self.acc()
self.vel = self.vel + dt * a
xOld, yOld = self.pos() # for orienting ship
self.setpos(self.pos() + dt * self.vel)
xNew, yNew = self.pos() # for orienting ship
if self.gravsys.bodies.index(self) == 1: # the CSM
dir_radians = math.atan2(yNew-yOld,xNew-xOld) # for orienting ship
dir_degrees = dir_radians * 180 / math.pi # for orienting ship
self.setheading(dir_degrees+90) # for orienting ship
def main():
# Setup screen
screen = Screen()
screen.setup(width=1.0, height=1.0) # for fullscreen
screen.bgcolor('black')
screen.title("Gravity Assist Example")
# Instantiate Planet
image_moon = 'moon_27x27.gif'
screen.register_shape(image_moon)
moon = Body(MOON_MASS, (-250, 0), Vec(500, 0), gravsys, image_moon)
moon.pencolor('gray')
if __name__ == '__main__':
main()
def run_stats(image):
"""Run stats on a numpy array made from an image."""
print('mean = {}'.format(np.mean(image)))
print('std = {}'.format(np.std(image)))
print('ptp = {}'.format(np.ptp(image)))
print()
cv.imshow('img', IMG)
cv.waitKey(1000)
Plotting in 3D
practice_3d_plotting.py
"""Plot Mars MOLA map image in 3D. Credit Eric T. Mortenson."""
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
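# IMG_GRAY (used below to build the elevation surface) is assumed to be the
# MOLA map loaded as a grayscale array earlier in the script, for example:
# IMG_GRAY = cv.imread('mola_1024x512.png', cv.IMREAD_GRAYSCALE)  # placeholder filename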
x = np.linspace(1023, 0, 1024)
y = np.linspace(0, 511, 512)
X, Y = np.meshgrid(x, y)
Z = IMG_GRAY[0:512, 0:1024]
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 150, cmap='gist_earth') # 150=number of contours
ax.auto_scale_xyz([1023, 0], [0, 511], [0, 500])
plt.show()
practice_geo_map_step_1of2.py
"""Threshold a grayscale image using pixel values and save to file."""
import cv2 as cv
IMG_GEO = cv.imread('Mars_Global_Geology_Mariner9_1024.jpg',
cv.IMREAD_GRAYSCALE)
cv.imshow('map', IMG_GEO)
cv.waitKey(1000)
img_copy = IMG_GEO.copy()
lower_limit = 170 # Lowest grayscale value for volcanic deposits
upper_limit = 185 # Highest grayscale value for volcanic deposits
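# The thresholding step itself is omitted from this excerpt; one way to do it
# (an assumption) is OpenCV's inRange(), which turns pixels inside the
# volcanic-deposit range white (255) and everything else black (0):
img_copy = cv.inRange(img_copy, lower_limit, upper_limit)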
cv.imwrite('geo_thresh.jpg', img_copy)
cv.imshow('thresh', img_copy)
cv.waitKey(0)
practice_geo_map_step_2of2.py
"""Select Martian landing sites based on surface smoothness and geology."""
import tkinter as tk
from PIL import Image, ImageTk
import numpy as np
import cv2 as cv
#------------------------------------------------------------------------------
class Search():
"""Read image and identify landing sites based on input criteria."""
def run_rect_stats(self):
"""Define rectangular search areas and calculate internal stats."""
ul_x, ul_y = 0, LAT_30_N
lr_x, lr_y = RECT_WIDTH, LAT_30_N + RECT_HT
rect_num = 1
while True:
rect_img = IMG_GRAY_GEO[ul_y : lr_y, ul_x : lr_x]
self.rect_coords[rect_num] = [ul_x, ul_y, lr_x, lr_y]
if MAX_ELEV_LIMIT >= np.mean(rect_img) >= MIN_ELEV_LIMIT:
self.rect_means[rect_num] = np.mean(rect_img)
self.rect_ptps[rect_num] = np.ptp(rect_img)
self.rect_stds[rect_num] = np.std(rect_img)
rect_num += 1
def draw_qc_rects(self):
"""Draw overlapping search rectangles on image as a check."""
img_copy = IMG_GRAY_GEO.copy()
rects_sorted = sorted(self.rect_coords.items(), key=lambda x: x[0])
print("\nRect Number and Corner Coordinates (ul_x, ul_y, lr_x, lr_y):")
for k, v in rects_sorted:
print("rect: {}, coords: {}".format(k, v))
cv.rectangle(img_copy,
(self.rect_coords[k][0], self.rect_coords[k][1]),
(self.rect_coords[k][2], self.rect_coords[k][3]),
(255, 0, 0), 1)
cv.imshow('QC Rects {}'.format(self.name), img_copy)
cv.waitKey(3000)
cv.destroyAllWindows()
def sort_stats(self):
"""Sort dictionaries by values and create lists of top N keys."""
ptp_sorted = (sorted(self.rect_ptps.items(), key=lambda x: x[1]))
self.ptp_filtered = [x[0] for x in ptp_sorted[:NUM_CANDIDATES]]
std_sorted = (sorted(self.rect_stds.items(), key=lambda x: x[1]))
self.std_filtered = [x[0] for x in std_sorted[:NUM_CANDIDATES]]
return img_copy
def make_final_display(self):
"""Use Tk to show map of final rects & printout of their statistics."""
screen.title('Sites by MOLA Gray STD & PTP {} Rect'.format(self.name))
# Draw the high-graded rects on the colored elevation map.
img_color_rects = self.draw_filtered_rects(IMG_COLOR,
self.high_graded_rects)
# Convert image from CV BGR to RGB for use with Tkinter.
img_converted = cv.cvtColor(img_color_rects, cv.COLOR_BGR2RGB)
img_converted = ImageTk.PhotoImage(Image.fromarray(img_converted))
canvas.create_image(0, 0, image=img_converted, anchor=tk.NW)
# Add stats for each rectangle at bottom of canvas.
txt_x = 5
txt_y = IMG_HT + 15
for k in self.high_graded_rects:
canvas.create_text(txt_x, txt_y, anchor='w', font=None,
text=
"rect={} mean elev={:.1f} std={:.2f} ptp={}"
.format(k, self.rect_means[k],
self.rect_stds[k],
self.rect_ptps[k]))
txt_y += 15
if txt_y >= int(canvas.cget('height')) - 10:
txt_x += 300
txt_y = IMG_HT + 15
canvas.pack()
screen.mainloop()
def main():
app = Search('670x335 km')
app.run_rect_stats()
app.draw_qc_rects()
app.sort_stats()
ptp_img = app.draw_filtered_rects(IMG_GRAY_GEO, app.ptp_filtered)
std_img = app.draw_filtered_rects(IMG_GRAY_GEO, app.std_filtered)
if __name__ == '__main__':
main()
IMG_HT = 400
IMG_WIDTH = 500
BLACK_IMG = np.zeros((IMG_HT, IMG_WIDTH), dtype='uint8')
STAR_RADIUS = 165
EXO_START_X = -250
EXO_START_Y = 150
EXO_DX = 3
NUM_FRAMES = 500
def main():
intensity_samples = record_transit(EXO_START_X, EXO_START_Y)
rel_brightness = calc_rel_brightness(intensity_samples)
plot_light_curve(rel_brightness)
def calc_rel_brightness(intensity_samples):
"""Return list of relative brightness from list of intensity values."""
rel_brightness = []
max_brightness = max(intensity_samples)
for intensity in intensity_samples:
rel_brightness.append(intensity / max_brightness)
return rel_brightness
def plot_light_curve(rel_brightness):
"""Plot changes in relative brightness vs. time."""
plt.plot(rel_brightness, color='red', linestyle='dashed',
linewidth=2)
plt.title('Relative Brightness vs. Time')
plt.xlim(-150, 500)
plt.show()
if __name__ == '__main__':
main()
STAR_RADIUS = 165
BLACK_IMG = np.zeros((400, 500, 1), dtype="uint8")
NUM_ASTEROIDS = 15
NUM_LOOPS = 170
class Asteroid():
"""Draws a circle on an image that represents an asteroid."""
def record_transit(start_image):
"""Simulate transit of asteroids over star and return intensity list."""
asteroid_list = []
intensity_samples = []
for i in range(NUM_ASTEROIDS):
asteroid_list.append(Asteroid(i))
for _ in range(NUM_LOOPS):
temp_img = start_image.copy()
# Draw star.
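cv.circle(temp_img, (250, 200), STAR_RADIUS, 255, -1)
# Sketch of the rest of the omitted loop body (an assumption, mirroring the
# alien armada version later in this appendix): move each asteroid across the
# star, then record the frame's mean pixel intensity.
for ast in asteroid_list:
    ast.move_asteroid(temp_img)  # hypothetical mover, like Ship.move_ship()
intensity_samples.append(temp_img.mean())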
def calc_rel_brightness(image):
"""Calculate and return list of relative brightness samples."""
rel_brightness = record_transit(image)
max_brightness = max(rel_brightness)
for i, j in enumerate(rel_brightness):
rel_brightness[i] = j / max_brightness
return rel_brightness
def plot_light_curve(rel_brightness):
"Plot light curve from relative brightness list."""
plt.plot(rel_brightness, color='red', linestyle='dashed',
linewidth=2, label='Relative Brightness')
plt.legend(loc='upper center')
plt.title('Relative Brightness vs. Time')
plt.show()
relative_brightness = calc_rel_brightness(BLACK_IMG)
plot_light_curve(relative_brightness)
practice_limb_darkening.py
"""Simulate transit of exoplanet, plot light curve, estimate planet radius."""
import cv2 as cv
import matplotlib.pyplot as plt
IMG_HT = 400
IMG_WIDTH = 500
BLACK_IMG = cv.imread('limb_darkening.png', cv.IMREAD_GRAYSCALE)
EXO_RADIUS = 7
EXO_START_X = 40
EXO_START_Y = 230
EXO_DX = 3
NUM_FRAMES = 145
def main():
intensity_samples = record_transit(EXO_START_X, EXO_START_Y)
relative_brightness = calc_rel_brightness(intensity_samples)
plot_light_curve(relative_brightness)
def calc_rel_brightness(intensity_samples):
"""Return list of relative brightness from list of intensity values."""
rel_brightness = []
max_brightness = max(intensity_samples)
for intensity in intensity_samples:
rel_brightness.append(intensity / max_brightness)
return rel_brightness
def plot_light_curve(rel_brightness):
"""Plot changes in relative brightness vs. time."""
plt.plot(rel_brightness, color='red', linestyle='dashed',
linewidth=2, label='Relative Brightness')
plt.legend(loc='upper center')
plt.title('Relative Brightness vs. Time')
## plt.ylim(0.995, 1.001)
plt.show()
if __name__ == '__main__':
main()
STAR_RADIUS = 165
BLACK_IMG = np.zeros((400, 500, 1), dtype="uint8")
NUM_SHIPS = 5
NUM_LOOPS = 300 # Number of simulation frames to run
class Ship():
"""Draws and moves a ship object on an image."""
def record_transit(start_image):
"""Runs simulation and returns list of intensity measurements per frame."""
ship_list = []
intensity_samples = []
for i in range(NUM_SHIPS):
ship_list.append(Ship(i))
for _ in range(NUM_LOOPS):
temp_img = start_image.copy()
cv.circle(temp_img, (250, 200), STAR_RADIUS, 255, -1) # The star.
for ship in ship_list:
ship.move_ship(temp_img)
intensity = temp_img.mean()
cv.putText(temp_img, 'Mean Intensity = {}'.format(intensity),
(5, 390), cv.FONT_HERSHEY_PLAIN, 1, 255)
cv.imshow('Transit', temp_img)
intensity_samples.append(intensity)
cv.waitKey(50)
cv.destroyAllWindows()
return intensity_samples
def calc_rel_brightness(image):
"""Return list of relative brightness measurments for planetary transit."""
rel_brightness = record_transit(image)
max_brightness = max(rel_brightness)
for i, j in enumerate(rel_brightness):
rel_brightness[i] = j / max_brightness
return rel_brightness
def plot_light_curve(rel_brightness):
"""Plots curve of relative brightness vs. time."""
plt.plot(rel_brightness, color='red', linestyle='dashed',
linewidth=2, label='Relative Brightness')
relative_brightness = calc_rel_brightness(BLACK_IMG)
plot_light_curve(relative_brightness)
IMG_HT = 500
IMG_WIDTH = 500
BLACK_IMG = np.zeros((IMG_HT, IMG_WIDTH, 1), dtype='uint8')
STAR_RADIUS = 200
EXO_RADIUS = 20
EXO_START_X = 20
EXO_START_Y = 250
MOON_RADIUS = 5
NUM_DAYS = 200 # number days in year
def main():
intensity_samples = record_transit(EXO_START_X, EXO_START_Y)
relative_brightness = calc_rel_brightness(intensity_samples)
print('\nestimated exoplanet radius = {:.2f}\n'
.format(STAR_RADIUS * math.sqrt(max(relative_brightness)
-min(relative_brightness))))
plot_light_curve(relative_brightness)
return intensity_samples
def calc_rel_brightness(intensity_samples):
"""Return list of relative brightness from list of intensity values."""
rel_brightness = []
max_brightness = max(intensity_samples)
for intensity in intensity_samples:
rel_brightness.append(intensity / max_brightness)
return rel_brightness
def plot_light_curve(rel_brightness):
"""Plot changes in relative brightness vs. time."""
plt.plot(rel_brightness, color='red', linestyle='dashed',
linewidth=2, label='Relative Brightness')
plt.legend(loc='upper center')
plt.title('Relative Brightness vs. Time')
plt.show()
if __name__ == '__main__':
main()
Figure A-1: Light curve for planet and moon where moon passes behind planet
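The radius estimate printed in main() above uses the transit-depth relation: the fractional dip in relative brightness equals (R_planet / R_star)**2, so R_planet = R_star * sqrt(depth). A round-number check (the 1 percent dip here is only an illustration):
import math

STAR_RADIUS = 200
depth = 1.0 - 0.99                       # relative brightness dips from 1.0 to 0.99
exo_radius = STAR_RADIUS * math.sqrt(depth)
print(round(exo_radius, 1))              # 20.0, consistent with EXO_RADIUS = 20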
practice_length_of_day.py
"""Read-in images, calculate mean intensity, plot relative intensity vs time."""
import os
from statistics import mean
import cv2 as cv
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal # See Chap. 1 to install scipy.
Blurring Faces
path = "C:/Python372/Lib/site-packages/cv2/data/"
face_cascade = cv.CascadeClassifier(path + 'haarcascade_frontalface_alt.xml')
cap = cv.VideoCapture(0)
while True:
_, frame = cap.read()
face_rects = face_cascade.detectMultiScale(frame, scaleFactor=1.2,
minNeighbors=3)
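# Blurring step (a sketch; this excerpt stops short of it): replace each
# detected face rectangle with a heavily blurred copy of itself.
for (x, y, w, h) in face_rects:
    frame[y:y + h, x:x + w] = cv.blur(frame[y:y + h, x:x + w], (25, 25))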
cv.imshow('frame', frame)
if cv.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv.destroyAllWindows()
while True:
_, frame = cap.read()
gray = cv.cvtColor(frame, cv.COLOR_BGR2GRAY)
face_rects = detector.detectMultiScale(gray,
scaleFactor=1.2,
minNeighbors=5)
cap.release()
cv.destroyAllWindows()
A B
absdiff() method, 113 backends, 255
absolute runtimes, 275 bar charts, creating, 92
abstraction, 52 Bayes’ rule, 1–5
Adding a Password and Video Capture applying, 18–19
project, 242–243, 312–313 basic formula, 2
algorithms, 226–227 terms in, 2–3
face recognition, 205, 227 Bayes, Thomas, 1
LBP, 228 Bayesian updates, 3
LBPH, 230 Beane, Silas R., 278
PageRank, 61 Beautiful Soup (bs4) package, 53, 55
performance, 243 BFMatcher objects, 105
TextRank, 61 Binary Robust Independent
aliasing, 159–160 Elementary Features
alien megastructures, 195 (BRIEF), 104
All Packages tab, in the Natural blink comparators
Language Toolkit (NLTK), 31 building, 110
Anaconda, xxiv, 7 defining, 103
ancestors, using the Turtle class as, 135 using, 96–97, 110–112
ANGLE constants, 272 blink() function, 103, 110
anisotropy, 274 blink microscopes, 96
arguments Blue-Green-Red (BGR) format, 168
color_index, 261 converting to Red-Green-Blue
fx, 220 (RGB) format, 168
fy, 220 used by OpenCV, 14
representations, 144 blur() method, 222
using when instantiating an Blurring Faces project, 222–223, 312
object, 140 Body class, 134
arrays, 12, 158 bokeh module, 248–250
looping through, 238 extension, 255
ORB, 104 installing, 250
series, 248 sample data, 250, 254
ASCII, 35 Bostrom, Nick, 269
astype() method, 258 Box Zoom tool, 262
Atom, xxii BRIEF (Binary Robust Independent
attributes, 11–12 Elementary Features), 104
changing, 140 brute-force matchers, 105
self.area_actual, 16 bs4.BeautifulSoup() method, 53, 55
audio recordings, 209, 217
autotexts, 192
ax.pie() method, 191–192
C check_for_fail() function, 90
chi-squared random variable (X²), 43–44
calc_rel_brightness() function, 182
choice() method, 15, 88
Calculating the Probability of choropleth maps, 246–247, 252, 260–265
Detection project, 25–26 chunking, 33
canvas, 169
ciphertext, 79
cascade classifiers, 205, 209–210 circle() method, 215, 272
central processing units (CPUs), 275 circular LBP, 229
cget() method, 169
classes, 10–12
challenge projects defining, 10–12, 133–134
Adding a Password and Video object-oriented programming
Capture, 242–243, 312–313 (OOP), 10
Calculating the Probability of collections module, 54, 83
Detection, 25–26 color, 141–143, 168–169
Customized Word Search, 281 Blue-Green-Red (BGR) format, 168
Detecting Cat Faces, 223 converting to Red-Green-Blue
Finding a Safe Space, 279–280 (RGB) format, 168
Finding the Best Strategy with used by OpenCV, 14
MCS, 25 channels, 191
Fixing Frequency, 50 color_index argument, 261
Game Night, 72–73 images, height of, 169
Generating a Dynamic Light Red-Green-Blue (RGB) format,
Curve, 202 14, 168
Go Tell It on the Mountain, 281–282 converting to Blue-Green-Red
Here Comes the Sun, 280 (BGR) format, 168
It’s Not Just What You Say, It’s How selecting for setting up the
You Say It! 75 screen, 139
Look-alikes and Twins, 243 tuple data
Making It Three in a Row, 175 Blue-Green-Red (BGR) format,
Mapping US Population Change, 14, 102
266–267 using, 14
The Real Apollo 8, 149 colormaps, 68
Seeing Through a Dog’s Eyes, 281 comma-separated values (CSV),
Simplifying a Celebration 251–252, 255–256
Slideshow, 281 common words, analyzed by natural
Smarter Searches, 24 language processing
Summarizing a Novel, 74–75 (NLP), 29
Summarizing Summaries, 73 comprehension, 57
Time Machine, 243 computer vision, 6–7
True-Scale Simulation, 149 conditionals
What a Tangled Web We Weave, 281 errors, 88
Wrapping Rectangles, 175–176 using, 59, 216
see also projects conduct_search() method, 21
characters, 33 confidence factors, 228
Charting the Characters project, 92, Confirming That Drawings Become
285–286 Part of an Image project,
charts 172, 298
bar, 92 constants, 9, 33, 142, 157–158
dataframes, 249 ANGLE, 272
histograms, 229 assigning, 100, 113–114, 132–133,
pie, 191–192 181–182
derived, 157–158 Detecting a Planet with a Moon project,
SIDE, 272 201, 309–310
user input, 156 Detecting Alien Megastructures
context instances, 11 project, 195–196, 304–305
Corpora tab, in the Natural Language Detecting an Alien Armada project,
Toolkit (NLTK), 31 200–201, 307–309
corpus Detecting Asteroid Transits project,
analyzed by natural language 197–198, 305–306
processing (NLP), 29 Detecting Astronomical Transients
loading, 33 with Image Differencing
normalizing, 36 project, 112–119
Counter() method, 54, 57, 84
Counting Stars project, 120–121 Detecting Starspots project, 200
counts, 36 detectMultiscale() method, 212, 235
CP-1252, 35 dictionaries, making, 87–88
CPUs (central processing units), 275 dictionary comprehension, 254
cryptanalysis techniques, effectiveness, 78 difference maps, images, 114
CSM (command and service modules), diff_image variable, pixels, 114
124, 143 The Digital Key to Rebecca project, 80–91
CSV format, 255–256 direction vectors, formula, 136
cumulative parameter, 38 discoverable items, Rebecca cipher, 81
Customized Word Search project, 281 dispersion plots, 48–49
cv module, 100 applying, 49
cwd() class method, 101 documentation
cylindrical projections, 158 holoviews module, 260
NumPy (Numerical Python) package, 24
OpenCV, 190
D pathlib module, 101
data formats, 255 playsound module, 208
comma-separated values (CSV), pyttsx3 module, 208
255–256 tkinter module, 156
Excel, 255 turtle module, 131
HTML (Hypertext Markup dot() method, 273
Language), 255 dot notation, using to call generate()
elements used in, 55 method, 68
graphics and, 249 downscaling images, 212
parsing, 53 Doyle, Sir Arthur Conan, 28
tags used in, 55 draw_map() method, defining, 14
using requests library with, 54 drawMatches() method, 102
Structured Query Language, 255 draw_menu() function, 18
dataframes, 248–249, 255 drawing boards, 133
datetime module, 209 dt (delta time), 134
Davoudi, Zohreh, 278
decoding, defined, 35
decrypt() function, 89 E
decryption, and encryption of data, edge artifacts, 111
78–79 elements, 55
defaultdict() function, 83, 87 elevation profiles, 172
delta time (dt), 134 encoding, defined, 35
derived constants, 157–158 encrypt() function, 88
descriptors, 104 end_fill() method, 272
Enthought Canopy, xxiv, 7 forward modeling, 194
enumerate() function, 38, 88, 102 forward slash (/), 101
errata, xxii free return trajectory, plotting, 124
etaoin mnemonic, for remembering FreqDist() method, 38
the most common letters in frequency, 36, 50
English, 92 fromarray() method, 168
Excel, 255 f-string format, 211
exoplanets, 177–194 fx arguments, 220
extensions, 255 fy arguments, 220
Extracting an Elevation Profile project,
172, 298–299
extraction, 52
G
Game Night project, 72–73
Geany, xxii
F generate() method, dot notation, 68
face recognition, 203–205 Generating a Dynamic Light Curve
algorithm, 205, 227 project, 202
face classifiers, 204–205 generator expressions, 46
flowchart, 226–228 defined, 46
Haar feature approach, 221 generator objects, 46
LBP cascade classifier, 221 defined, 46
photographs, 204–205 gensim, 61–64
sliding window approach, 205 installing, 61
video feeds, 219–221 running, 62
false positives, 111, 213, 215 summarize() function, 62
feature vectors, ORB, 104 get() method, obtaining, 44
Federal Information Processing Series getscreen() method, 144
(FIPS) code, 254 turtle module, 141
files get_word_freq() function, 57–58
.html extension, 261 gif files, 140
audio recordings, 209 global constants, 142
loading, 87–88 Go Tell It on the Mountain project,
preparing, 115–116 281–282
structure, 232 graphics
find_all() method, 53 bar charts, 92
find_best_matches() function, 102 charts, 92, 190–192, 229
Finding a Safe Space project, 279–280 choropleth maps, 246–247, 252,
Finding the Best Strategy with MCS 260–265
project, 25 dataframes, 249
find_transient() function diff_image variable, and pixels, 114
calling, 117 dispersion plots, 48–49
parameters, 114 histograms, 229
Fixing Frequency project, 50 HTML (Hypertext Markup
floor division returns (//), 257–258 Language) and, 249
folders images, 66, 140, 163, 189–194
images, 239 analyzing, 188
preparing, 115–116 blurring, 222
fonts, 118 calculating absolute differences
for loops, 101–102, 116 between, 116–117
turtle module, 128 difference, 116
Format Specification Mini-Language, 86 downscaling, 189, 212
gif files, 140 I
importing, 66
I Have a Dream . . . to Summarize
looping through, 116–117
Speeches! project, 52–61
positive, 116
IDLE (integrated development and
registering, 108
learning environment) text
saving, 103, 117–119
editor, xxii
and tuple data, 114
Idle, Eric, xxii
pie charts, 190–192
if statements, 240
pixelated images, 190
Image and ImageTK modules, 156
pixels, 192–193, 273–274
images, 66, 140, 163, 189–194
blurring, 222
analyzing, 188
and the diff_image variable, 114
calculating absolute differences
turtle module, 129
between, 116–117
see also Python Imaging Library (PIL)
choropleth maps, 246–247, 252,
gravity propulsion, 145
260–265
simulation, 145
dataframes, 249
Greene, Preston, 278
difference, 116
GZK cutoff, 278
diff_image variable, and pixels, 114
downscaling, 189, 212
H drawing on, 163–164
Haar features, 203–205 gif files, 140
example templates, 204 importing, 66
face recognition, 221 loading, 210
Windows OS, 234 looping through, 116–117
Hamming distance, string length, 105 negative, 111, 116
heatmaps, 49–50 orthogonal patterns, 273–275
defined, 49 pixelated images, 190
Here Comes the Sun project, 280 pixels, 192–193, 273–274
hideturtle() method, 272 blurring, 222
histograms, 229 and the diff_image variable, 114
holoviews module, 248–250, 252–254 positive, 111, 116
documentation, 260 registering
extension, 255 applying homography, 108
installing, 250 and defining a function, 108
Windows OS, 250 rescaling, 212
homography, 108 saving, 103, 117–119
findHomography() method, 109 scale pyramids, 212
Hound of the Baskervilles, The, 32 stored by OpenCV, 158, 162
The Hound, The War, and The Lost storing, 237
World project, 28–47 tuple data, 114
HTML (Hypertext Markup Language) see also Python Imaging Library (PIL)
elements used in, 55 Imaging Exoplanets project, 188–194
file extensions, 261 imread() method, 102, 156
graphics and, 249 imshow() method, 14, 107
parsing, 53 imwrite() method, 235
tags used in, 55 Incorporating Limb Darkening project,
using requests library with, 54 198–200, 306–307
Hunting the Hound with Dispersion indexes, 81, 248–249
project, 48–49, 283–284 info() method, 258
Hutchinson, Georgia, 280
__init__() method, 11–12 landing ellipses, 152
and attributes, 135 latin-1, 35
inliers, 109 LBP cascade classifier, 221
in-message duplication of keys, lead angles, defined, 125
Rebecca cipher, 81–82 lemmatization, 47
installing Python, xxii–xxiv libraries, 6–8
instantiation, 10, 135 installing, 6–7
intensity, 182 Life, the Universe, and Yertle’s Pond
International Data Corporation, 51 project, 270–278
interpolation, 190 limb darkening, 198–199
intersection() function, 47 line() method, 14, 102, 166
intersection over union, 45–47 Linux, xxii
isalpha() method, 36 list comprehension, 36
isdigit() function, 56 using to compare parts of speech, 42
iterables, 81 list item markers, 42
itertools module, 9, 17 load_file() function, 87
It’s Not Just What You Say, It’s How You local binary pattern histograms
Say It! project, 75 (LBPHs), 221, 226–230
loc() method, 258
Look-alikes and Twins project, 243
J loops
Jaccard similarity delta time (dt), 134
analyzed by natural language images, 116–117
processing (NLP), 29 for loops, 101–102, 116, 210–211
calculating, 45–47 in the main() function, 101–103
Jaccard similarity coefficient, 45–47 simulation loops, 144
jaccard_test() function, defining, 46 using, 37
join() method, 53 while loops, 19–20, 219
turning elements into a string Lost World, The, 28
with, 55 Low, George, 123
Lowell, Percival, 95
lowercase and uppercase letters,
K
handling with natural
keypoints, 100, 103–107 language processing
finding matches, 103–104, 107 (NLP), 58
keys, 44, 79
checking for duplicates, 84
printing, 86 M
keyword metadata macOS, xxii
extracting, 65 Tcl/Tk 8.5 bugs, 160
in word clouds, 64 main() function, 19–22
keywords, 65 calling, 21–22, 90, 143
completing, 56–57
defining, 19, 32–34, 54–55, 83,
L
100–101, 139, 182
labels, 192 finishing, 21–22
indexes, 248–249 looping in, 101–103
integers, 236 running programs with, 169
numeric, 192 make_dict() function, 87
lambda functions, 164 make_word_dict() function, 33
using, 106 defining, 35
Making It Three in a Row project, 175 re, 54
Mapping US Population Change statistics, 276
project, 266–267 supporting, 231–232
margin parameter, 68 sys, 9, 83
Mars Global Surveyor (MGS), 153 time, 211, 276
Mars Orbiter Laser Altimeter (MOLA) tkinter, 156
map, 151, 153 turtle, 127–131, 270–271
math module, 181 webbrowser, 261
matplotlib, 7, 67 modulo operator (%), 258
plotting light curves with, 180 Monte Carlo simulation (MCS), 25
matrices, ORB, 104 most_common() method, 44, 57
max() function, 185 using to compare vocabularies, 44
maximum values, 185 moveWindow() method, 14
Measuring the Length of an Musk, Elon, 269
Exoplanet’s Day project,
201–202, 311
menu choices, evaluating, 19–21
N
message-to-message duplication of keys, __name__ variable, 22
Rebecca cipher, 81–82 natural language processing (NLP),
metadata 28–29
extracting, 65 handling uppercase and lowercase
in word clouds, 64 letters, 58
methods, 11 using, 28, 51
defining, 14–15 Natural Language Toolkit (NLTK),
min() function, 37 29–31
using to compare vocabularies, 44 All Packages tab, 31
minMaxLoc() method, 115 Corpora tab, 31
Minovitch, Michael, 126 installing, 29–32
Mixing Maps project, 173–175, 300–303 downloading the stopwords
mkdir() method, 235 corpus, 31
modules downloading the tokenizer,
bokeh, 248–250 30, 33
collections, 54, 83 Punkt Tokenizer Models, 30
cv, 100 sent_tokenize() method, 63
datetime, 209 truncation option, 33–34
holoviews, 248–250, 252–254 word_tokenize() method, 35
Image and ImageTK, 156 n-dimensional arrays, 158
importing, 9–10, 32–34, 54–55, 62, Newton, Isaac, 126
66, 82–83, 100, 113–114, normalization, 59
156, 181–182, 238, 255 word counts, 59
itertools, 9 normalized vectors, formula, 136
math, 181 np.zeros() method, 108, 181
order for importing, 9–10 numeric labels, 192
os, 83 numerical arrays, ORB, 104
pathlib, 100–101 NumPy (Numerical Python) package,
pillow, 66 7–8, 24, 67
playsound, 207–208, 217 documentation, 24
pyttsx3, 207–209 importing, 67
random, 9, 83 random.choice() method, 15
O pandas (Python Data Analysis Library),
248–249
object-oriented programming (OOP),
installing, 250
10, 131
paragraph elements (p_elems),
classes, 10
selecting, 74
using, 131
paragraphs, formatting with <p> and
objects, 10
</p> tags, 55
one-time pads, 77–79
parking orbits, 125
OpenCV, 6–8, 168
part-of-speech (POS), 40–41
absdiff() method, 113
parts of speech
addWeighted() method, 118
analyzed by natural language
blur() method, 222
processing (NLP), 29
documentation, 190
comparing, 40–42
drawMatches() method, 102
parts_of_speech_test() function,
findHomography() method, 109
defining, 41
images stored by, 158, 162
patched conic method, assumptions
imshow() method, 220
of, 126
installing with pip, 8–9
patches, ORB, 104
LBP cascade classifier, 221
pathlib module, 100–101
line() method, 14, 102, 166
documentation, 101
ORB_create() method, 104
patterns, orthogonal, 273–275
putText() method, 118, 183
peak-to-valley statistics
random sample consensus
overview, 154–155
(RANSAC), 109
calculating, 161
rectangle() method, 114, 215
sorting, 165
split() method, 191
pencolor() method, 141–142
use of Blue-Green-Red (BGR)
Penrose tiling, 130
format, 14
penup() method, 130
VideoCapture() method, 219, 234
using, 135
waitKey() method, 184, 220
PEP8 Python style guide, 9, 115
open() function, 35, 67
PerceptronTagger, comparing parts of
ORB_create() method, 104
speech using, 40–41
orb.detectAndCompute() method, 105
perf_counter, 276–277
ord() function, 86
phase angles, defined, 125
using, 86
photospheres, 198
Oriented FAST and Rotated BRIEF
pie charts, 191–192
(ORB), 104
pillow module, 66
orthogonal patterns, 273–274
pip (preferred installer program), 8
os.listdir() method, 100
installing, 8
os module, 83
pixelated images, 190
os.path.exists() method, 84
pixels, 273–274
os.path.split() method, 237
blurring, 222
outliers, 109
diff_image variable, 114
plotting, 192–193
P plaintext messages, encryption, 88
playsound module, 207–208, 217
<p> and </p> tags, formatting
audio recordings, 217
paragraphs with, 55
documentation, 208
page variable, Response objects, 54–55
plot() method, 38, 185
PageRank algorithm, 61
Plotting in 3D project, 173, 299
Plotting the Orbital Path project, Incorporating Limb Darkening,
119–120, 289–290 198–200, 306–307
plt.ion() method, 38 Life, the Universe, and Yertle’s
plt.legend() method, 38 Pond, 270–278
plt.show() method, 38 Measuring the Length of an
Pokrovsky shells Exoplanet’s Day, 201–202, 311
overview, 195–196 Mixing Maps, 173–175, 300–303
rings around, 196 Plotting in 3D, 173, 299
Polygons() class, 260 Plotting the Orbital Path, 119–120,
polygon types, 143 289–290
position() method, 130 Programming a Robot Sentry Gun,
pos_tag() method, 42 205–220
PowerShell, xxv, 42 Punctuation Heatmap, 49–50,
predict() method, 240 284–285
preferred installer program (pip), 8 Replicating a Blink Comparator,
print() function, 19 96–112
priors, 3 Restricting Access to the Alien
probabilities, 2 Artifact, 231–241
calculating, 4–5 Search and Rescue, 5–24
probability of detection (PoD), 25–26 Selecting Martian Landing Sites,
Programming a Robot Sentry Gun 153–171
project, 205–220 Sending Secrets the WWII Way,
Project Gutenberg, 32 93–94, 286–289
projects Shut Me Down! 148, 296–298
Blurring Faces, 222–223, 312 Simulating a Search Pattern,
Charting the Characters, 92, 285–286 146–147, 292–293
Confirming That Drawings Become Simulating an Exoplanet Transit,
Part of an Image, 172, 298 179–188
Counting Stars, 120–121 Start Me Up! 147–148, 293–295
Detecting a Planet with a Moon, Summarizing Speeches with
201, 309–310 gensim, 61–64
Detecting Alien Megastructures, Summarizing Text with Word
195–196, 304–305 Clouds, 64–71
Detecting an Alien Armada, 200– To the Moon with Apollo 8! 127–146
201, 307–309 Visualizing Population Density with
Detecting Asteroid Transits, 197– a Choropleth Map, 246–265
198, 305–306 What’s the Difference? 120, 290–292
Detecting Astronomical Transients see also challenge projects
with Image Differencing, pseudorandom numbers, 84
112–119 random module, 84
Detecting Starspots, 200 Punctuation Heatmap project, 49–50,
The Digital Key to Rebecca, 80–91 284–285
Extracting an Elevation Profile, Punkt Tokenizer Models, 30
172, 298–299 putText() method, 14, 183, 241
The Hound, The War, and The PyCharm, xxii
Lost World, 28–47 Pylint, unused variables, 102
Hunting the Hound with pyperclip, copying and pasting text to
Dispersion, 48–49, 283–284 the clipboard with, 91
I Have a Dream . . . to Summarize PyScripter, xxii
Speeches! 52–61 Pythagorean triple, 275
Imaging Exoplanets, 188–194
Python R
built-in functions
raise_for_status() method, 55
enumerate(), 38, 88, 102
random.choice() method, 15, 88
__init__(), 11–12, 135
random module, 9, 83
intersection(), 47
pseudorandom numbers, 84
isalpha(), 36
random sample consensus
isdigit(), 56
(RANSAC), 109
max(), 185
random_state parameter, 68
min(), 37, 44
Raspberry Pi, xx
open(), 35
readers, 255
ord(), 86
read() method, 235
read(), 34
The Real Apollo 8 project, 149
repr(), 86
Rebecca cipher, 80
sorted(), 101
record_transit() function, 183
type(), 67
rectangle() method, 114, 164
Format Specification
Red-Green-Blue (RGB) format, 14, 168
Mini-Language, 86
converting to Blue-Green-Red
IDE, xxii
(BGR) format, 168
installing, xxii–xxiv
regex syntax, 56
PEP8 style guide, 115
registration, of images, 97–98
platform, xxii
regular expressions, 54
running, xxiv–xxv
defined, 54
split() function, 36, 89, 191
relative_scaling parameter, 68
version, xxii
release() method, 220
visualization tools, 248
re module, 54
Python Data Analysis Library (pandas),
remove_stop_words() function, 57
248–249
replace() function, 90
Python Imaging Library (PIL), 66, 156
using, 90
importing, 67
Replicating a Blink Comparator
Image module, 156
project, 96–112
Image.open() method, 67
repr() function, 86
ImageTK module, 156
requests, importing, 53
word clouds, 65–66
requests.get() method, 54
Python Standard Library
requests library, 53–54
collections module, 93
Response object, referencing with page
functions, 17
variable, 54–55
pyttsx3 module
Restricting Access to the Alien Artifact
documentation, 208
project, 231–241
initializing objects, 209
re.sub() function, 56
installing, 207–209
return statements, ending functions
say() method, 211
with, 47
revise_target_probs() method, 18, 22
Q robot sentry, 206
Q key, 220 root-mean-square
quality control applying, 154–155
function, 102 formula, 154
steps, 101 rotate() method, 130
queryIdx.pt attribute, 108 runAndWait() method, 211
quit, 220 run_rect_stats() method, 162
S Simulating a Search Pattern project,
146–147, 292–293
Savage, Martin J., 278
Simulating an Exoplanet Transit
save() method, 261
project, 179–188
say() method, 211
simulation hypothesis, 269
scale pyramids, 212
simulation loops, 144
scaled gravitational constants, 133
sliding window approach, 205
SciPy package, 7–8
slingshot maneuver, 145
score_sentences() function, 57–59
simulation, 145
scraping the web, 53, 62
Smarter Searches project, 24
screen, setting up, 139
Smith, David, 153
Screen subclass, 133
sorted() function, 101
screen updates, 144
sound
search
audio recordings, 209, 217
calculating effectiveness, 16–17
files, 209
conducting, 16–17
playsound module, 217
Search and Rescue project, 5–24
split() function, 36, 89, 191
search classes, 10–12
tokens, 36
defining, 10–12, 161
standard deviation
initializing, 161
applying, 154–155
search effectiveness probability (SEP), 4
formula, 154
search engine optimization (SEO), 65
sorting, 165
Seeing Through a Dog’s Eyes project, 281
starspots, 200
Selecting Martian Landing Sites
Start Me Up! project, 147–148, 293–295
project, 153–171
statistics.mean() function, 277
select() method, 53
statistics module, 276
limits, 55
stemming, 47
self.area_actual attribute, 16
step() method, defining, 137
self attributes, 12
stop words, 39–40, 57–58, 67
self parameter
analyzed by natural language
function, 134
processing (NLP), 28
using, 136
comparing, 39–40
Sending Secrets the WWII Way project,
examples, 54
93–94, 286–289
functional, 54
sentry guns
importing, 66
automated, 206
removing, 57–58
use of video feeds, 207
Stopwords Corpus, 30–31
sent_tokenize() method, 63
string.replace() method, 56, 89
series, 248–249
strings
setpos() method, 135, 272
f-string format, 211
setup() method, 271
Hamming distance, and string
Shape class, 133
length, 105
shape() function, 109
join() method, turning elements
shapes, 142–143
into a string with, 55
building, 142–143
length, 105
shift value, 84
ord() function, 220
overview, 84
string.replace() method, 56, 89
Shut Me Down! project, 148, 296–298
string.split() method, 89
SIDE constants, 272
text_to_string() function, 33, 34
sim_loop() method, 144
Structured Query Language, 255
Simplifying a Celebration Slideshow
project, 281
stylometry, 27–28 tolist() method, 259
performing, 29 Tombaugh, Clyde, 95–96, 110–111
subarrays, 12 To the Moon with Apollo 8! project,
Sublime Text, xxii 127–146
subplots() method, 191 tracer() method, 141, 144
summarize() function, 62 trainIdx.pt attribute, 108
Summarizing a Novel project, 74–75 train() method, 238
Summarizing Speeches with gensim transects, 154
project, 61–64 transient astronomical events,
Summarizing Summaries project, 73 defined, 98
Summarizing Text with Word Clouds transients, detecting and circling, 114–115
project, 64–71 transits and transit photometry, 178–179
super() function, using, 135 experimenting with, 186–188
suptitle() method, 70 translunar injection velocity (V0), 125
sys module, 9, 83 triangular distribution, 15
sysconfig module, 209 Trojan asteroids, 197
True-Scale Simulation project, 149
truncation option, 33
T tuple data type
Tabby’s Star, 195–196 overview, 7
taggers, 40–41 Blue-Green-Red (BGR) color
tags, 55 format, 14, 102
values, 64 elements used in
Tcl/Tk 8.5 bugs, 160 channels, 15
test statistic, 43–44 columns, 15
text rows, 15, 17
adding, 168–169 images, 114
files, 66 list, 57
fonts, 118 most_common() method, 44
titles, formatting with <title> and shapes, 17
</title> tags, 55 vectors, 136, 137
TextRank algorithm, 61 turtle module, 127–131, 270–271
text_to_string() function, 33 addcomponent() method, 142–143
defining, 34 assigning constants, 132–133
three-body problem, 126 documentation, 131
3D plotting, 173 example suite, 131
thresholding getscreen() method, 141
defined, 174 graphics, 129
using, 174–175 hideturtle() method, 272
Time Machine project, 243 importing, 132–133, 270–271
time module, 211, 276 polygon type, 143
time steps, 131 pencolor() method, 141
titles, formatting with <title> and screen.register_shape() method, 140
</title> tags, 55 shapes provided with, 127
tkinter module, 156 shapesize() method, 144
creating canvas objects, 160 tracer() method, 141
documentation, 156 using, 127–131
placement of code, 160 bk() method, 130
Windows OS, 156 penup() method, 130
to_array() method, 68 position() method, 130
calling, 68 rotate() method, 130
tokens, 30, 35 stamp() method, 129
type() function, 67 Visualizing Population Density with a
Tyson, Neil DeGrasse, 269 Choropleth Map project,
246–265
vocab_test() function, defining, 43
U vocabularies, analyzed by natural
ultra-high energy cosmic rays language processing (NLP),
(UHECRs), 278 43–45
unconstrained faces, 221 voices
underscore (_), 102 changing, 209
Unicode Transformational Format Windows OS
(UTF), 35, 86 American “David,” 233
UnicodeDecodeError, 35 default, 233
while loading text, 67 default voice, 209
unit vectors, formula, 136 female, 209
unknown, 44, 47 male, 209
unstructured data, 51 Vonnegut, Kurt, 48
unused variables, 102
Pylint, 102
uppercase and lowercase letters, W
handling with natural waitKey() method, 14, 107, 184, 235
language processing War of the Worlds, The, 32
(NLP), 58 warpPerspective() method, 109
UTF (Unicode Transformational webbrowser module, 255, 261
Format), 35 web scraping, 53, 62
Wells, H. G., 28
What a Tangled Web We Weave
V project, 281
values What’s the Difference? project, 120,
max() function, 185 290–292
maximum values, 185 while loops, 19–20, 56, 219
variables, 17, 33, 102, 115 Windows OS, xxii
assigning, local, 17, 162 character encoding, 35
built-in, 22 CP-1252, 35
chi-squared random variable (X²), Haar features, 209, 234
43–44 holoviews module, 250
diff_image variable, 114 PyScripter, xxii
excessive, 142 pyttsx3 module, 208
global, 12 tkinter module, 156
__name__ variable, 22 using PowerShell, 42
naming, 68 voices on, 209, 233
page variable, 54–55 word clouds, 64–71
Pylint, 102 displaying, 70
unused, 102 fine-tuning, 70–71
Vec2D, 133, 136 generating, 67–68
vectors, ORB, 104 plotting, 69–70
video feeds word length
capturing, 232–236 analyzed by natural language
streams, 219–221 processing (NLP), 28
virtual environments, xxv comparing, 37–39
using, xxv word_tokenize() method, 30, 35, 57
Visual Studio Code, xxii Wrapping Rectangles project, 175–176
writers, 255
Y Z
YAML (.yml) files zip() function, 259
overview, 227 Zuber, Maria, 153
loading, 239
yield statements, 47
suspending functions with, 47
Real-World Python is set in New Baskerville, Futura, Dogma, and TheSansMono
Condensed.
RESOURCES
Visit https://nostarch.com/real-world-python/ for errata and more information.
ISBN 978-1-59327-928-8

phone: 800.420.7240 or 415.863.9900
email: [email protected]
web: www.nostarch.com
PROGRAM PYTHON LIKE A PROFESSIONAL
With its emphasis on project-based practice, Real-World Python will take you from playing with syntax to writing complete programs in no time. You'll conduct experiments, explore statistical concepts, and solve novel problems that have frustrated geniuses throughout history, like detecting distant exoplanets, as you continue to build your Python skills.

Chapters begin with a clearly defined project goal and a discussion of ways to attack the problem, followed by a mission designed to make you think like a programmer. You'll direct a Coast Guard search-and-rescue effort, plot and execute a NASA flight to the moon, protect access to a secure lab using facial recognition, and more. Along the way you'll learn how to:

• Use libraries like matplotlib, NumPy, Bokeh, pandas, Requests, Beautiful Soup, and turtle
• Work with Natural Language Processing and computer vision modules like NLTK and OpenCV
• Write a program to detect and track objects moving across a starfield
• Scrape speeches from the internet and auto-summarize them
• Use the Mars Orbiter Laser Altimeter (MOLA) map to select spacecraft landing sites
• Survive a zombie apocalypse with the aid of data-plotting and visualization tools

The book's programs are beginner-friendly, but as you progress you'll learn more sophisticated techniques to help you grow your coding capabilities. Once your missions are accomplished, you'll be ready to solve real-world problems with Python on your own.

ABOUT THE AUTHOR
Lee Vaughan is a programmer, pop culture enthusiast, educator, and author of Impractical Python Projects (No Starch Press). As a former executive-level scientist at ExxonMobil, he spent decades constructing and reviewing complex computer models, developing and testing software, and training geoscientists and engineers.
THE FINEST IN GEEK ENTERTAINMENT™
www.nostarch.com
$34.95 ($45.95 CDN)