Reading Files
Chapter 7
Python for Informatics: Exploring Information
www.pythonlearn.com
Files R Us
• What Next? It is time to go find some Data to mess with!
[Figure: computer architecture diagram labeled Software, Input and Output Devices, Central Processing Unit, Memory, and Secondary Memory, shown with a code fragment (if x < 3: print) and the sample mail message below]
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
...
File Processing
• A text file can be thought of as a sequence of lines
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
http://www.py4inf.com/code/mbox-short.txt
Opening a File
• Before we can read the contents of the file, we must tell Python
which file we are going to work with and what we will be doing with
the file
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform operations
on the file
• Similar to “File -> Open” in a Word Processor
Using open()
• handle = open(filename, mode)
> returns a handle used to manipulate the file
> filename is a string
> mode is optional and should be 'r' if we are planning to read the
file and 'w' if we are going to write to the file
fhand = open('mbox.txt', 'r')
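The two modes can be tried side by side. This is a minimal sketch written for Python 3 (an assumption, since the slides use Python 2); the scratch file demo.txt and its temporary folder are hypothetical names chosen so that mbox.txt is not needed.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not depend on mbox.txt
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Mode 'w' opens the file for writing, creating it if it does not exist
fout = open(path, 'w')
fout.write('first line\nsecond line\n')
fout.close()

# Mode 'r' opens the file for reading; it is also the default mode
fhand = open(path, 'r')
contents = fhand.read()
fhand.close()

print(contents)
```

Note that mode 'w' empties an existing file as soon as it is opened, so it should be used with care.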
What is a Handle?
>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>
When Files are Missing
>>> fhand = open('stuff.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
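The traceback above comes from Python 2. As a rough sketch of the same failure under Python 3 (an assumption about the reader's interpreter), the error is reported as FileNotFoundError, a subclass of OSError. The path below is hypothetical and points into a fresh, empty temporary folder so that the file is certain to be missing.

```python
import os
import tempfile

# Hypothetical path inside a new empty folder, so the file cannot exist
missing = os.path.join(tempfile.mkdtemp(), 'stuff.txt')

try:
    fhand = open(missing)
except FileNotFoundError as err:
    # In Python 3, IOError is an alias of OSError, and
    # FileNotFoundError is the OSError subclass raised for Errno 2
    caught = err
    print('Could not open:', err.filename)
```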
The newline Character
• We use a special character called the “newline” to indicate when a line ends
• We represent it as \n in strings
• Newline is still one character - not two
>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3
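The same point can be checked in a script. This is a small Python 3 sketch (an assumption; the session above is Python 2) that uses repr() to make the newline visible and split() to show that it separates two pieces.

```python
stuff = 'X\nY'

# repr() shows the escape sequence instead of breaking the line
print(repr(stuff))

# The newline counts as a single character
length = len(stuff)
print(length)

# Splitting on the newline gives the text of each line
pieces = stuff.split('\n')
print(pieces)
```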
File Processing
• A text file has newlines at the end of each line
Return-Path: <[email protected]>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: [email protected]\n
From: [email protected]\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
File Handle as a Sequence
• A file handle open for read can be treated as a sequence of strings where each line in the file is a string in the sequence
• We can use the for statement to iterate through a sequence
• Remember - a sequence is an ordered set
xfile = open('mbox.txt')
for cheese in xfile:
    print cheese
Counting Lines in a File
• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines
fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print 'Line Count:', count
$ python open.py
Line Count: 132045
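The counting loop can also be wrapped in a function so it works on any file. This is a sketch in Python 3 (an assumption; the slide code is Python 2). The function name count_lines and the scratch file sample.txt are hypothetical and stand in for mbox.txt.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file by iterating over its handle."""
    fhand = open(path)
    count = 0
    for line in fhand:
        count = count + 1
    fhand.close()
    return count

# Hypothetical three-line scratch file standing in for mbox.txt
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('one\ntwo\nthree\n')
fout.close()

line_count = count_lines(path)
print('Line Count:', line_count)
```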
Reading the *Whole* File
• We can read the whole file (newlines and all) into a single string
>>> fhand = open('mbox-short.txt')
>>> inp = fhand.read()
>>> print len(inp)
94626
>>> print inp[:20]
From stephen.marquar
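A short Python 3 sketch (an assumption; the session above is Python 2) shows that read() keeps the newlines inside the single string it returns. The file contents and the name sample.txt are hypothetical.

```python
import os
import tempfile

# Hypothetical two-line scratch file
text = 'From someone\nSubject: hello\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write(text)
fout.close()

fhand = open(path)
inp = fhand.read()      # the whole file as one string
fhand.close()

print(len(inp))         # the length includes both newline characters
print(inp[:4])          # slicing works as on any other string
newline_count = inp.count('\n')
```

Reading everything at once is convenient for small files; for large files the line-by-line loop uses far less memory.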
Searching Through a File
• We can put an if statement in our for loop to only print lines that meet some criteria
fhand = open('mbox-short.txt')
for line in fhand:
    if line.startswith('From:') :
        print line
OOPS!
What are all these blank lines doing here?
From: [email protected]

From: [email protected]

From: [email protected]

From: [email protected]
...
OOPS!
What are all these blank lines doing here?
• Each line from the file has a newline at the end
• The print statement adds a newline to each line
From: [email protected]\n
\n
From: [email protected]\n
\n
From: [email protected]\n
\n
From: [email protected]\n
\n
...
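The doubled newline can be made visible by capturing what print writes. This is a sketch for Python 3 (an assumption; the slides use the Python 2 print statement), where print is a function that accepts file and end arguments. The sample line is hypothetical.

```python
import io

# Hypothetical line, ending in a newline as it would when read from a file
line = 'From: someone\n'

# Capture the output of print to inspect it character by character
buf = io.StringIO()
print(line, file=buf)            # print appends its own newline
doubled = buf.getvalue()

buf = io.StringIO()
print(line, end='', file=buf)    # end='' suppresses the extra newline
single = buf.getvalue()

print(repr(doubled))
print(repr(single))
```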
Searching Through a File (fixed)
• We can strip the whitespace from the right-hand side of the string using rstrip() from the string library
• The newline is considered “white space” and is stripped
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:') :
        print line
From: [email protected]
From: [email protected]
From: [email protected]
From: [email protected]
....
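A brief sketch of what rstrip() removes and what it leaves alone, written for Python 3 (an assumption; the slide code is Python 2). The sample strings are hypothetical.

```python
# Hypothetical line as read from a file, with a trailing newline
line = 'From: someone\n'
stripped = line.rstrip()
print(repr(stripped))

# rstrip() removes all trailing whitespace (spaces, tabs, newlines)
# but leaves whitespace on the left-hand side untouched
padded = '  From: someone \t\n'
right_stripped = padded.rstrip()
print(repr(right_stripped))
```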
Skipping with continue
• We can conveniently skip a line by using the continue statement
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:') :
        continue
    print line
Using in to select lines
• We can look for a string anywhere in a line as our selection criteria
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not '@uct.ac.za' in line :
        continue
    print line
X-Authentication-Warning: set sender to [email protected] using -f
From: [email protected]
Author: [email protected]
From [email protected] Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to [email protected] using -f
...
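The selection pattern can be tried without a mail file. This Python 3 sketch (an assumption; the slide code is Python 2) runs the same loop over a hypothetical list of lines and collects the matches instead of printing them. It uses the `not in` spelling, which means the same as `not ... in` on the slide.

```python
# Hypothetical lines standing in for the contents of a mail file
lines = [
    'From: a@example.com\n',
    'Author: b@example.org\n',
    'Received: from mail.example.com\n',
]

matches = []
for line in lines:
    line = line.rstrip()
    # Skip any line that does not contain the search string
    if 'example.com' not in line:
        continue
    matches.append(line)

print(matches)
```

The third line is selected even though the search string appears in the middle of it, since `in` matches anywhere in the line.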
Prompt for File Name
fname = raw_input('Enter the file name: ')
fhand = open(fname)
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
Enter the file name: mbox-short.txt
There were 27 subject lines in mbox-short.txt
Bad File Names
fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
Enter the file name: na na boo boo
File cannot be opened: na na boo boo
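One possible variation, sketched in Python 3 (an assumption; the slide code is Python 2), moves the logic into a function and catches only OSError instead of using a bare except. The function name count_subject_lines and the scratch file are hypothetical; returning None for an unreadable file is a design choice of this sketch, not something the slides prescribe.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Return the number of 'Subject:' lines, or None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:
        # Catch only file-related errors rather than every exception
        print('File cannot be opened:', fname)
        return None
    count = 0
    for line in fhand:
        if line.startswith('Subject:'):
            count = count + 1
    fhand.close()
    return count

# Hypothetical scratch data standing in for mbox.txt
folder = tempfile.mkdtemp()
good = os.path.join(folder, 'sample.txt')
fout = open(good, 'w')
fout.write('Subject: one\nFrom: someone\nSubject: two\n')
fout.close()

good_count = count_subject_lines(good)
bad_count = count_subject_lines(os.path.join(folder, 'na na boo boo'))
print(good_count, bad_count)
```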
Summary
• Secondary storage
• Opening a file - file handle
• File structure - newline character
• Reading a file line by line with a for loop
• Searching for lines
• Reading file names
• Dealing with bad files
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (
...
www.dr-chuck.com) of the University of Michigan School of
Information and open.umich.edu and made available under a
Creative Commons Attribution 4.0 License. Please maintain this
last slide in all copies of the document to comply with the
attribution requirements of the license. If you make a change,
feel free to add your name and organization to the list of
contributors on this page as you republish the materials.
Initial Development: Charles Severance, University of Michigan
School of Information
… Insert new Contributors and Translators here