Download Reading Files

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Reading Files
Chapter 7
Python for Informatics: Exploring Information
www.py4inf.com
Unless otherwise noted, the content of this course material is licensed under a Creative
Commons Attribution 3.0 License.
http://creativecommons.org/licenses/by/3.0/.
Copyright 2010, Charles Severance
What
Next?
Software
Input
and Output
Devices
Central
Processing
Unit
if x< 3: print
Main
Memory
It is time to go
find some Data to
mess with!
Secondary
Memory
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
...
File Processing
•
A text file can be thought of as a sequence of lines
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
Opening a File
•
Before we can read the contents of the file we must tell Python which
file we are going to work with and what we will be doing with the file
•
•
This is done with the open() function
•
Kind of like “File -> Open” in a Word Processor
open() returns a “file handle” - a variable used to perform operations
on the file
Using open()
•
handle = open(filename, mode)
•
•
•
fhand = open("mbox.txt", "r")
returns a handle use to manipulate the file
filename is a string
mode is optional and should be “r” if we are planning reading the file
and “w” if we are going to write to the file.
http://docs.python.org/lib/built-in-funcs.html
What is a Handle?
>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>
>>> fhand = open("stuff.txt")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
When Files are Missing
>>> fhand = open("stuff.txt")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
The newline
Character
•
We use a special character to
indicate when a line ends
called the "newline"
•
•
We represent it as \n in strings
Newline is still one character not two
>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3
File Processing
•
A text file can be thought of as a sequence of lines
From [email protected] Sat Jan 5 09:14:16 2008
Return-Path: <[email protected]>
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: [email protected]
From: [email protected]
Subject: [sakai] svn commit: r39772 - content/branches/
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
File Processing
•
A text file has newlines at the end of each line
From [email protected] Sat Jan 5 09:14:16 2008\n
Return-Path: <[email protected]>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: [email protected]\n
From: [email protected]\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
File Handle as a Sequence
•
A file handle open for read can be
treated as a sequence of strings
where each line in the file is a string
in the sequence
•
We can use the for statement to
iterate through a sequence
•
Remember - a sequence is an
ordered set
xfile = open("mbox.txt", "r")
for cheese in xfile:
print cheese
Counting Lines in a File
•
•
•
Open a file read-only
Use a for loop to read each
line
Count the lines and print out
the number of lines
fhand = open("mbox.txt")
count = 0
for line in fhand:
count = count + 1
print "Line Count:", count
python open.py
Line Count: 132045
Reading the *Whole* File
•
We can read the whole file
(newlines and all) into a
single string.
>>> fhand = open("mbox-short.txt")
>>> inp = fhand.read()
>>> print len(inp)
94626
>>> print inp[:20]
From stephen.marquar
Searching Through a File
•
We can put an if statement in
our for loop to only print
lines that meet some criteria
fhand = open("mbox-short.txt")
for line in fhand:
if line.startswith('From:') :
print line
OOPS!
What are all these blank
lines doing here?
From: [email protected]
From: [email protected]
From: [email protected]
From: [email protected]
...
OOPS!
What are all these blank
lines doing here?
The print statement adds a
newline to each line.
Each line from the file also
has a newline at the end.
From: [email protected]\n
\n
From: [email protected]\n
\n
From: [email protected]\n
\n
From: [email protected]\n
...
Searching Through a File (fixed)
•
We can strip the whitespace
from the right hand side of
the string using rstrip() from
the string library
•
The newline is considered
"white space" and is stripped
fhand = open("mbox-short.txt")
for line in fhand:
line = line.rstrip()
if line.startswith('From:') :
print line
From: [email protected]
From: [email protected]
From: [email protected]
From: [email protected]
....
Skipping with continue
•
We can convienently
skip a line by using the
continue statement
fhand = open("mbox-short.txt")
for line in fhand:
line = line.rstrip()
# Skip 'uninteresting lines'
if not line.startswith('From:') :
continue
# Process our 'interesting' line
print line
Using in to select lines
•
We can look for a string
anywhere in a line as our
selection criteria
fhand = open("mbox-short.txt")
for line in fhand:
line = line.rstrip()
if (not '@uct.ac.za' in line) :
continue
print line
From [email protected] Sat Jan 5 09:14:16 2008
X-Authentication-Warning: set sender to [email protected] using -f
From: [email protected]
Author: [email protected]
From [email protected] Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to [email protected] using -f
...
Review: Splitting Lines
•
The split() function breaks
lines based on whitespace
>>> line = 'Have a nice day'
>>> words = line.split()
>>> print words
['Have', 'a', 'nice', 'day']
>>> print len(words)
4
>>> print words[2]
nice
From [email protected] Sat Jan
fhand = open("mbox-short.txt")
for line in fhand:
line = line.rstrip()
if not line.startswith('From ') : continue
words = line.split()
print words[2]
5 09:14:16 2008
Sat
Fri
Fri
Fri
...
>>> line = "From [email protected] Sat Jan 5 09:14:16 2008"
>>> words = line.split()
>>> print words
['From', '[email protected]', 'Sat', 'Jan', '5', '09:14:16', '2008']
>>>
fname = raw_input("Enter the file name: ")
fhand = open(fname)
count = 0
for line in fhand:
if line.startswith('Subject:') :
count = count + 1
print 'There were', count, 'subject lines in', fname
Prompt for
File Name
python search6.py
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
python search6.py
Enter the file name: mbox-short.txt
There were 27 subject lines in mbox-short.txt
Bad File
Names
fname = raw_input("Enter the file name: ")
try:
fhand = open(fname)
except:
print 'File cannot be opened:', fname
exit()
count = 0
for line in fhand:
if line.startswith('Subject:') :
count = count + 1
print 'There were', count, 'subject lines in', fname
python search7.py
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
python search7.py
Enter the file name: na na boo boo
File cannot be opened: na na boo boo
Mystery Problem...
fhand = open("mbox-short.txt")
for line in fhand:
words = line.split()
if words[0] != 'From' : continue
print words[2]
python search8.py
Sat
Traceback (most recent call last):
File "search8.py", line 5, in <module>
if words[0] != 'From' : continue
IndexError: list index out of range
Summary
•
•
•
•
•
•
Secondary storage
Opening a file - file handle
File structure - newline character
Reading a file line-by-line with a for
loop
Reading the whole file as a string
Searching for lines
•
•
•
•
•
•
Stripping white space
Using continue
Using in as an operator
Reading a file and splitting lines
Reading file names
Dealing with bad files
Exercise 7.3
Write a program to read through a file and print the contents
of the file (line by line) all in upper case. Executing the program
will look as follows:
python shout.py
Enter a file name: mbox-short.txt
FROM [email protected] SAT JAN 5 09:14:16 2008
RETURN-PATH: <[email protected]>
RECEIVED: FROM MURDER (MAIL.UMICH.EDU [141.211.14.90])
BY FRANKENSTEIN.MAIL.UMICH.EDU (CYRUS V2.3.8) WITH LMTPA;
SAT, 05 JAN 2008 09:14:16 -0500
Write a program to loop through a mailbox-format file and look for lines of the
form:
X-DSPAM-Confidence: 0.8475
Use find and string slicing to extract the portion of the string after the colon
character and then use the float function to convert the extracted string into a
floating point number. Count these lines and the compute the total of the spam
confidence values from these lines. When you reach the end of the file, print out
the average spam confidence.
Enter the file name: mbox.txt
Average spam confidence: 0.894128046745
Enter the file name: mbox-short.txt
Average spam confidence: 0.750718518519
Exercise 7.4