07-Files

File Processing

• A text file can be thought of as a sequence of lines

Opening a File

• Before we can read the contents of the file, we must tell Python which file we are going to work with and what we will be doing with the file

• This is done with the open() function

• open() returns a “file handle” - a variable used to perform operations on the file

• Similar to “File -> Open” in a Word Processor

Using open()

• handle = open(filename, mode)

> returns a handle used to manipulate the file

> filename is a string

> mode is optional and should be 'r' if we are planning to read the file and 'w' if we are going to write to the file
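As a minimal sketch of the two modes, the following writes a small file with 'w' and reads it back with 'r'. The scratch path and the file contents are made up for illustration and are not part of the course files.

```python
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 'w' creates the file (or empties an existing one) for writing.
fout = open(path, 'w')
fout.write('first line\n')
fout.write('second line\n')
fout.close()

# 'r' is the default mode, so open(path) would behave the same way.
fin = open(path, 'r')
lines = fin.readlines()
fin.close()

print(lines)  # ['first line\n', 'second line\n']
```

Note that 'w' discards any existing contents, so it is worth double-checking the file name before opening a file for writing.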

What is a Handle

>>> fhand = open('mbox.txt')
>>> print(fhand)
<_io.TextIOWrapper name='mbox.txt' mode='r' encoding='UTF-8'>

When Files are Missing

>>> fhand = open('stuff.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'stuff.txt'

The newline Character

• We use a special character called the “newline” to indicate when a line ends

• We represent it as \n in strings

• Newline is still one character - not two

>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print(stuff)
Hello
World!
>>> stuff = 'X\nY'
>>> print(stuff)
X
Y
>>> len(stuff)
3
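A short sketch of the "one character, not two" point. The variable names are illustrative; the contrast is between a real newline and an escaped backslash followed by the letter n.

```python
stuff = 'X\nY'

length = len(stuff)         # counts 'X', the newline, and 'Y'
pieces = stuff.split('\n')  # splitting on the newline recovers the two lines
escaped = 'X\\nY'           # a literal backslash followed by 'n', no newline

print(length)        # 3
print(pieces)        # ['X', 'Y']
print(len(escaped))  # 4
```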

File Processing

• A text file can be thought of as a sequence of lines

• A text file has newlines at the end of each line

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n

Return-Path: <postmaster@collab.sakaiproject.org>\n

Date: Sat, 5 Jan 2008 09:12:18 -0500\n

To: source@collab.sakaiproject.org\n

From: stephen.marquard@uct.ac.za\n

Subject: [sakai] svn commit: r39772 - content/branches/\n

\n

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n

File Handle as a Sequence

• A file handle opened for reading can be treated as a sequence of strings, where each line in the file is one string in the sequence

• We can use the for statement to iterate through a sequence

• Remember - a sequence is an ordered set

xfile = open('mbox.txt')
for cheese in xfile:
    print(cheese)
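One detail worth seeing in action: a handle behaves like a sequence that can be walked through only once. The sketch below uses io.StringIO from the standard library as a stand-in for a real file handle, so it runs without mbox.txt; the three sample lines are made up.

```python
import io

# io.StringIO acts like a text file handle opened for reading.
xfile = io.StringIO('alpha\nbeta\ngamma\n')

first_pass = [cheese for cheese in xfile]
second_pass = [cheese for cheese in xfile]  # the handle is already at the end

print(first_pass)   # ['alpha\n', 'beta\n', 'gamma\n']
print(second_pass)  # []
```

To loop over a real file a second time, one option is simply to open it again.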

Counting Lines in a File

• Open a file read-only

• Use a for loop to read each line

• Count the lines and print out the number of lines

fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print('Line Count:', count)

$ python open.py

Line Count: 132045
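The same counting pattern can be wrapped in a small function. This is a sketch, not the course's own code: count_lines and the three-line demo file are hypothetical, and the with statement is used so the file is closed automatically.

```python
import os
import tempfile

def count_lines(path):
    """Return the number of lines in the text file at path."""
    with open(path) as fhand:         # the file is closed when the block ends
        return sum(1 for _ in fhand)  # add 1 for every line read

# Hypothetical three-line file used only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as fout:
    fout.write('one\ntwo\nthree\n')

line_count = count_lines(demo_path)
print('Line Count:', line_count)  # Line Count: 3
```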

Reading the *Whole* file

• We can read the whole file (newlines and all) into a single string

>>> fhand = open('mbox-short.txt')
>>> inp = fhand.read()
>>> print(len(inp))
94626
>>> print(inp[:20])
From stephen.marquar
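A small sketch of what read() returns, using io.StringIO as a stand-in for a file handle and two made-up lines of text. It also shows that a second read() returns an empty string, because the handle is already at the end.

```python
import io

fhand = io.StringIO('From alice@example.com\nSubject: hi\n')

inp = fhand.read()               # the whole contents, newlines included
newline_count = inp.count('\n')  # one newline per line
lines = inp.splitlines()         # split into lines, dropping the newlines
leftover = fhand.read()          # nothing is left to read

print(newline_count)   # 2
print(lines)           # ['From alice@example.com', 'Subject: hi']
print(repr(leftover))  # ''
```

Reading everything at once is convenient for small files; for very large files, looping over the handle line by line keeps memory use low.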

Searching Through a File

• We can put an if statement in our for loop to only print lines that meet some criteria

fhand = open('mbox-short.txt')
for line in fhand:
    if line.startswith('From:'):
        print(line)

OOPS!

What are all these blank lines doing here?

From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

...

• Each line from the file has a newline at the end

• print adds a newline of its own after each line it outputs

From: stephen.marquard@uct.ac.za\n

\n

From: louis@media.berkeley.edu\n

\n

From: zqian@umich.edu\n

\n

From: rjlowe@iupui.edu\n

\n

...

• We can strip the whitespace from the right-hand side of the string using the string method rstrip()

• The newline is considered “white space” and is stripped

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
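A sketch of two ways to avoid the doubled newline, assuming Python 3's print function. The sample strings are illustrative. Called with no argument, rstrip() removes every kind of trailing whitespace; passing '\n' removes only newlines.

```python
line = 'From: stephen.marquard@uct.ac.za\n'

stripped = line.rstrip()  # removes all trailing whitespace, newline included
print(stripped)

# Alternative: keep the line as read and tell print not to add a newline.
print(line, end='')

padded = 'value \t\n'
print(repr(padded.rstrip()))      # 'value'
print(repr(padded.rstrip('\n')))  # 'value \t'
```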

Skipping with continue

• We can conveniently skip a line by using the continue statement

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:'):
        continue
    print(line)

Using in to select lines

• We can look for a string anywhere in a line as our selection criterion

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if '@uct.ac.za' not in line:
        continue
    print(line)

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

X-Authentication-Warning: set sender to stephen.marquard@uct.ac.za using -f

From: stephen.marquard@uct.ac.za

Author: stephen.marquard@uct.ac.za

From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008

X-Authentication-Warning: set sender to david.horwitz@uct.ac.za using -f.
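The selection logic above can be sketched as a reusable generator. The function name matching_lines is hypothetical, and io.StringIO with three sample lines stands in for the real mbox file.

```python
import io

def matching_lines(handle, needle):
    """Yield each line of handle (without trailing whitespace) that contains needle."""
    for line in handle:
        line = line.rstrip()
        if needle not in line:
            continue  # skip lines that do not contain the search string
        yield line

sample = io.StringIO(
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n'
    'From: louis@media.berkeley.edu\n'
    'Author: stephen.marquard@uct.ac.za\n'
)

hits = list(matching_lines(sample, '@uct.ac.za'))
for hit in hits:
    print(hit)
```

With this sample, the first and third lines are printed and the berkeley.edu line is skipped.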

Bad File Names

Enter the file name: mbox.txt

There were 1797 subject lines in mbox.txt

Enter the file name: na na boo boo

File cannot be opened: na na boo boo

fname = input('Enter the file name: ')
try:
    fhand = open(fname)
except OSError:  # raised when the file is missing or cannot be read
    print('File cannot be opened:', fname)
    exit()

count = 0
for line in fhand:
    if line.startswith('Subject:'):
        count = count + 1
print('There were', count, 'subject lines in', fname)
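One possible variation, sketched under the assumption that returning None is an acceptable way to signal "could not open": the check is moved into a function so the caller decides what to print. The name count_subject_lines and the temporary path are hypothetical.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Return the number of 'Subject:' lines, or None if fname cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:  # covers missing files, permission problems, and similar
        return None
    with fhand:      # close the file when counting is done
        return sum(1 for line in fhand if line.startswith('Subject:'))

# A path inside a fresh temporary directory, so it is guaranteed not to exist.
missing = os.path.join(tempfile.mkdtemp(), 'na na boo boo')

result = count_subject_lines(missing)
if result is None:
    print('File cannot be opened:', missing)
else:
    print('There were', result, 'subject lines in', missing)
```

Catching OSError rather than using a bare except means unrelated mistakes, such as a misspelled variable name, still show up as errors instead of being reported as a missing file.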

CSV File

Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

    • mpg : miles per gallon

    • class : car classification

    • cty : city mpg

    • cyl : # of cylinders

    • displ : engine displacement in liters

    • drv : f = front-wheel drive, r = rear-wheel drive, 4 = 4wd

    • fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)

    • hwy : highway mpg

    • manufacturer : automobile manufacturer

    • model : model of car

    • trans : type of transmission

    • year : model year

import csv

%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))

mpg[:3] # The first three dictionaries in our list.

csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is comprised of 234 dictionaries.

len(mpg)

234

keys gives us the column names of our csv.

mpg[0].keys()

odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])
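A self-contained sketch of how csv.DictReader builds these dictionaries, using a tiny made-up CSV held in memory rather than mpg.csv. The header row supplies the keys, an unnamed first column becomes the key '', and every value comes back as a string. On Python 3.8 and later the rows are plain dicts, so keys() displays as dict_keys rather than odict_keys.

```python
import csv
import io

# Made-up data; the first header field is empty, as with an unnamed index column.
sample_csv = io.StringIO(
    ',manufacturer,cty\n'
    '1,audi,18\n'
    '2,ford,11\n'
)

rows = list(csv.DictReader(sample_csv))
keys = list(rows[0].keys())

print(len(rows))       # 2
print(keys)            # ['', 'manufacturer', 'cty']
print(rows[0]['cty'])  # 18 (the string '18', not a number)
```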

This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.

sum(float(d['cty']) for d in mpg) / len(mpg)

16.86

Similarly this is how to find the average hwy fuel economy across all cars.

sum(float(d['hwy']) for d in mpg) / len(mpg)

23.44
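Since the cty and hwy averages use the same expression, it can be sketched as a helper. The name column_mean and the three sample rows are illustrative, not part of the dataset.

```python
import csv
import io

# Three made-up rows standing in for mpg.csv.
sample_csv = io.StringIO(
    'manufacturer,cyl,cty,hwy\n'
    'audi,4,18,29\n'
    'audi,6,16,26\n'
    'ford,8,11,17\n'
)
rows = list(csv.DictReader(sample_csv))

def column_mean(rows, column):
    """Average a column whose values are numeric strings."""
    return sum(float(r[column]) for r in rows) / len(rows)

cty_mean = column_mean(rows, 'cty')
hwy_mean = column_mean(rows, 'hwy')

print(cty_mean)  # 15.0
print(hwy_mean)  # 24.0
```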

Use set to return the unique values for the number of cylinders the cars in our dataset have.

cylinders = set(d['cyl'] for d in mpg)

cylinders

{'4', '5', '6', '8'}

Here's a more complex example where we are grouping the cars by number of cylinders, and finding the average cty mpg for each group.

CtyMpgByCyl = []

for c in cylinders: # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg: # iterate over all dictionaries
        if d['cyl'] == c: # if the cylinder level type matches,
            summpg += float(d['cty']) # add the cty mpg
            cyltypecount += 1 # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')

CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl

[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
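An alternative sketch of the same grouping idea that makes a single pass over the rows, collecting values per group in a collections.defaultdict instead of rescanning the list for each group. The function name mean_by_group and the sample rows are hypothetical; this is one possible approach, not the notebook's own method.

```python
from collections import defaultdict

def mean_by_group(rows, group_col, value_col):
    """Return a sorted list of (group, mean of value_col) tuples."""
    groups = defaultdict(list)
    for r in rows:
        # File values are strings, so convert before collecting.
        groups[r[group_col]].append(float(r[value_col]))
    return sorted((g, sum(vals) / len(vals)) for g, vals in groups.items())

# Made-up rows in the same shape csv.DictReader produces.
sample_rows = [
    {'cyl': '4', 'cty': '18'},
    {'cyl': '4', 'cty': '22'},
    {'cyl': '6', 'cty': '16'},
    {'cyl': '8', 'cty': '11'},
]

cty_by_cyl = mean_by_group(sample_rows, 'cyl', 'cty')
print(cty_by_cyl)  # [('4', 20.0), ('6', 16.0), ('8', 11.0)]
```

The same helper would work for other pairings, for example grouping by 'class' and averaging 'hwy'; sorting the result by its second element would then order the groups by average instead of by name.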

Use set to return the unique values for the class types in our dataset.

vehicleclass = set(d['class'] for d in mpg) # what are the class types

vehicleclass

{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.

HwyMpgByClass = []

for t in vehicleclass: # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg: # iterate over all dictionaries
        if d['class'] == t: # if the vehicle class matches,
            summpg += float(d['hwy']) # add the hwy mpg
            vclasscount += 1 # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')

HwyMpgByClass.sort(key=lambda x: x[1]) # sort by average hwy mpg, lowest first
HwyMpgByClass

[('pickup', 16.88), ('suv', 18.13), ('minivan', 22.36), ('2seater', 24.80), ('midsize', 27.29), ('subcompact', 28.14), ('compact', 28.30)]