07-Files
File Processing
• A text file can be thought of as a sequence of lines
Opening a File
• Before we can read the contents of the file, we must tell Python which file we are going to work with and what we will be doing with the file
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform operations on the file
• Similar to “File -> Open” in a Word Processor
Using open()
• handle = open(filename, mode)
> returns a handle used to manipulate the file
> filename is a string
> mode is optional (it defaults to 'r') and should be 'r' if we are planning to read the file and 'w' if we are going to write to the file
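Both modes can be sketched as follows, in Python 3 syntax. The file name demo.txt and the temporary directory are assumptions made for illustration, so the example does not touch any real file.

```python
import os
import tempfile

# Hypothetical file inside a throwaway directory (not part of the course data).
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Mode 'w' creates the file (or empties an existing one) for writing.
handle = open(path, 'w')
handle.write('first line\nsecond line\n')
handle.close()

# Mode 'r' opens the file for reading; it is also what we get when mode is omitted.
handle = open(path, 'r')
contents = handle.read()
handle.close()

print(contents)
```

Note that 'w' overwrites an existing file without warning, so it is worth double-checking the file name before opening for writing.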
What is a Handle
>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>
When Files are Missing
>>> fhand = open('stuff.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
The newline Character
• We use a special character called the “newline” to indicate when a line ends
• We represent it as \n in strings
• Newline is still one character - not two
>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3
File Processing
• A text file can be thought of as a sequence of lines
• A text file has newlines at the end of each line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n
Return-Path: <postmaster@collab.sakaiproject.org>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: source@collab.sakaiproject.org\n
From: stephen.marquard@uct.ac.za\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n
File Handle as a Sequence
• A file handle open for read can be treated as a sequence of strings where each line in the file is a string in the sequence
• We can use the for statement to iterate through a sequence
• Remember - a sequence is an ordered set
xfile = open('mbox.txt')
for cheese in xfile:
    print cheese
Counting Lines in a File
• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines
fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print 'Line Count:', count
$ python open.py
Line Count: 132045
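The same counting loop can be sketched in Python 3 syntax with a with statement, which closes the file automatically. The three-line sample file below is an assumption standing in for mbox.txt, so the count it prints is not the one shown above.

```python
import os
import tempfile

# Hypothetical three-line file in a throwaway directory, standing in for mbox.txt.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write('line one\nline two\nline three\n')

# The with statement closes the file when the indented block ends.
with open(path) as fhand:
    count = 0
    for line in fhand:      # one iteration per line in the file
        count = count + 1

print('Line Count:', count)
```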
Reading the *Whole* file
• We can read the whole file (newlines and all) into a single string
>>> fhand = open('mbox-short.txt')
>>> inp = fhand.read()
>>> print len(inp)
94626
>>> print inp[:20]
From stephen.marquar
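Because read() returns one string, the usual string operations apply to the whole file at once. A minimal Python 3 sketch, assuming a small made-up file in place of mbox-short.txt:

```python
import os
import tempfile

# Made-up four-line text written to a throwaway file (stand-in for mbox-short.txt).
text = 'From: a@example.com\nSubject: hello\n\nBody text\n'
path = os.path.join(tempfile.mkdtemp(), 'tiny.txt')
with open(path, 'w') as out:
    out.write(text)

with open(path) as fhand:
    inp = fhand.read()           # the entire file as a single string

print(len(inp))                  # number of characters, newlines included
print(inp.count('\n'))           # number of newline characters
print(inp[:5])                   # slicing works as on any string

lines = inp.splitlines()         # list of lines with the newlines removed
print(lines)
```

Reading everything at once is convenient for small files; for very large files the line-by-line for loop uses far less memory.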
Searching Through a File
• We can put an if statement in our for loop to only print lines that meet some criteria
fhand = open('mbox-short.txt')
for line in fhand:
    if line.startswith('From:') :
        print line
OOPS!
What are all these blank lines doing here?
From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

...
• Each line from the file has a newline at the end
• The print statement adds a newline to each line
From: stephen.marquard@uct.ac.za\n
\n
From: louis@media.berkeley.edu\n
\n
From: zqian@umich.edu\n
\n
From: rjlowe@iupui.edu\n
\n
...
• We can strip the whitespace from the right-hand side of the string using the string method rstrip()
• The newline is considered “white space” and is stripped
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:') :
        print line
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
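To see exactly what rstrip() removes, it helps to look at the repr() of a line before and after. The sketch below uses Python 3 syntax; the padded line with trailing spaces and a tab is a made-up example.

```python
line = 'From: stephen.marquard@uct.ac.za\n'
stripped = line.rstrip()

print(repr(line))       # the trailing \n is visible in the repr
print(repr(stripped))   # the newline is gone

# With no argument, rstrip() removes ALL trailing whitespace, not only the newline.
padded = 'From: zqian@umich.edu  \t\n'   # made-up line with extra trailing whitespace
all_stripped = padded.rstrip()
newline_only = padded.rstrip('\n')       # passing '\n' removes only trailing newlines

print(repr(all_stripped))
print(repr(newline_only))
```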
Skipping with continue
• We can conveniently skip a line by using the continue statement
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:') :
        continue
    print line
Using in to select lines
• We can look for a string anywhere in a line as our selection criteria
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if '@uct.ac.za' not in line :
        continue
    print line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
X-Authentication-Warning: set sender to stephen.marquard@uct.ac.za using -f
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to david.horwitz@uct.ac.za using -f.
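The in test can be tried without a file at all. This Python 3 sketch runs the same selection over a short list of strings; the list is an assumption, modelled on lines shown in these slides, standing in for the real file.

```python
# A few sample lines standing in for the contents of mbox-short.txt.
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n',
    'Return-Path: <postmaster@collab.sakaiproject.org>\n',
    'From: louis@media.berkeley.edu\n',
    'Author: stephen.marquard@uct.ac.za\n',
]

matches = []
for line in lines:
    line = line.rstrip()
    if '@uct.ac.za' not in line:
        continue              # substring absent anywhere in the line: skip it
    matches.append(line)

for line in matches:
    print(line)
```

Unlike startswith(), the in operator matches the substring at any position, which is why both the 'From' and the 'Author:' lines are selected.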
Bad File Names
Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt
Enter the file name: na na boo boo
File cannot be opened: na na boo boo
fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()
count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname
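A Python 3 variant of the same idea, offered as a sketch rather than the course's version. It assumes a hypothetical helper named count_subject_lines that takes the file name as a parameter instead of prompting, and it catches OSError (the family that includes "no such file" errors) instead of using a bare except, so unrelated mistakes are not silently hidden.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Return how many lines start with 'Subject:', or None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:               # includes FileNotFoundError and PermissionError
        print('File cannot be opened:', fname)
        return None
    with fhand:                   # close the file when the loop is done
        count = 0
        for line in fhand:
            if line.startswith('Subject:'):
                count = count + 1
    return count

# Made-up test file in a throwaway directory, plus a name that does not exist there.
folder = tempfile.mkdtemp()
good = os.path.join(folder, 'tiny-mbox.txt')
with open(good, 'w') as out:
    out.write('From: a@example.com\nSubject: one\n\nSubject: two\n')

good_count = count_subject_lines(good)
missing_count = count_subject_lines(os.path.join(folder, 'na na boo boo'))
print(good_count, missing_count)
```

In an interactive script the file name would come from input('Enter the file name: '), which is the Python 3 name for raw_input.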
CSV File
Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.
mpg : miles per gallon
class : car classification
cty : city mpg
cyl : # of cylinders
displ : engine displacement in liters
drv : f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
hwy : highway mpg
manufacturer : automobile manufacturer
model : model of car
trans : type of transmission
year : model year
import csv
%precision 2
with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))
mpg[:3] # The first three dictionaries in our list.
csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is made up of 234 dictionaries.
len(mpg)
234
keys gives us the column names of our csv.
mpg[0].keys()
odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])
This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.
sum(float(d['cty']) for d in mpg) / len(mpg)
16.86
Similarly this is how to find the average hwy fuel economy across all cars.
sum(float(d['hwy']) for d in mpg) / len(mpg)
23.44
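The need for float() can be checked on a tiny in-memory CSV. The three rows below are made-up values in the same shape as mpg.csv, not rows from the real file, and io.StringIO is used here only so the sketch runs without the data file.

```python
import csv
import io

# Made-up sample with a few of the mpg.csv columns (values are illustrative only).
sample = io.StringIO(
    'manufacturer,model,cyl,cty,hwy\n'
    'audi,a4,4,18,29\n'
    'audi,a4,6,16,26\n'
    'ford,mustang,8,15,21\n'
)
rows = list(csv.DictReader(sample))

first_cty = rows[0]['cty']
print(type(first_cty))     # every value read by DictReader is a str

# Convert each value to float before summing, then divide by the row count.
avg_cty = sum(float(d['cty']) for d in rows) / len(rows)
print(round(avg_cty, 2))
```

Without the float() conversion, sum() would fail with a TypeError, because it starts from the integer 0 and cannot add a string to it.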
Use set to return the unique values for the number of cylinders the cars in our dataset have.
cylinders = set(d['cyl'] for d in mpg)
cylinders
{'4', '5', '6', '8'}
Here's a more complex example where we are grouping the cars by number of cylinders, and finding the average cty mpg for each group.
CtyMpgByCyl = []
for c in cylinders: # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg: # iterate over all dictionaries
        if d['cyl'] == c: # if the cylinder level type matches,
            summpg += float(d['cty']) # add the cty mpg
            cyltypecount += 1 # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')
CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl
[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
Use set to return the unique values for the class types in our dataset.
vehicleclass = set(d['class'] for d in mpg) # what are the class types
vehicleclass
{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}
And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.
HwyMpgByClass = []
for t in vehicleclass: # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg: # iterate over all dictionaries
        if d['class'] == t: # if the vehicle class matches,
            summpg += float(d['hwy']) # add the hwy mpg
            vclasscount += 1 # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')
HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass
[('pickup', 16.88), ('suv', 18.13), ('minivan', 22.36), ('2seater', 24.80), ('midsize', 27.29), ('subcompact', 28.14), ('compact', 28.30)]
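Both grouping examples scan the whole list once for every group. One possible alternative is a single pass that keeps a running total and count per group. The helper name average_by and the sample rows below are assumptions made for illustration; they are not part of the original notebook or of mpg.csv.

```python
from collections import defaultdict

def average_by(rows, group_key, value_key):
    """Average rows[value_key] per distinct rows[group_key], in one pass over rows."""
    totals = defaultdict(float)   # running sum per group
    counts = defaultdict(int)     # number of rows seen per group
    for d in rows:
        totals[d[group_key]] += float(d[value_key])
        counts[d[group_key]] += 1
    # Return (group, average) tuples sorted by group name.
    return sorted((g, totals[g] / counts[g]) for g in totals)

# Made-up rows in the same shape as the dictionaries in mpg.
rows = [
    {'cyl': '4', 'cty': '18', 'hwy': '29', 'class': 'compact'},
    {'cyl': '4', 'cty': '22', 'hwy': '31', 'class': 'compact'},
    {'cyl': '6', 'cty': '16', 'hwy': '26', 'class': 'midsize'},
]

by_cyl = average_by(rows, 'cyl', 'cty')
by_class = average_by(rows, 'class', 'hwy')
print(by_cyl)
print(by_class)
```

With the real data, average_by(mpg, 'cyl', 'cty') would take the place of the first nested loop. The result is sorted by group name, so the hwy example would still need its own sort by average to match the ordering shown above.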