Python File Operations

Janne Kemppainen

Knowing how to handle files in your programming language of choice is an essential skill for any developer. After reading this post you should be comfortable doing file operations in Python.

If you’re totally new to working with files, don’t worry! I’ll try to explain everything in enough detail so that you should be able to pick it up and start applying the knowledge in practice.

Open and close a file

Many tutorials on the internet start by telling you to open a file using the built-in function open() and then to call the file.close() method to close the file handle and release system resources. Personally, I think that as a beginner you shouldn’t even need to know about the close method, since the recommended and most Pythonic way to open and close files is to use a context manager.

Simplified, a context manager is a thing that you can use to safely work with resources in Python. When your code moves out of the context manager’s scope, Python will automatically call the needed cleanup functions for you, even if exceptions were thrown. With the file open context this means that your files will always be properly closed.

The context is constructed using the with ... as keywords. The open() function acts as the context manager, and the file handle is named using the as keyword:

with open('myfile.txt') as f:
    # do work, for example read the whole file
    contents = f.read()

The first and only required parameter is the file path. It can be relative to the current directory, or an absolute path. You can use os.path from Python’s standard library to build it.
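As a minimal sketch of building a path with os.path, consider the following. The directory and file names are made up for illustration.

```python
import os.path

# Hypothetical directory and file names, used only for illustration.
relative_path = os.path.join("data", "texts", "myfile.txt")

# Turn the relative path into an absolute one based on the current directory.
absolute_path = os.path.abspath(relative_path)

print(relative_path)
print(absolute_path)
```

Using os.path.join() instead of gluing strings together means that the correct path separator is picked for the operating system the code runs on.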

In this example the file object is now available as the variable f, and you can perform operations on it. For simple cases it is ok to use f as a general variable name. Often your code will become cleaner if you give it a proper name, though. So stop to think if the variable should be called mp3_file or something similar instead.
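To see the cleanup promise of the context manager in action, here is a small sketch that raises an exception on purpose inside the with block. The temporary file and the simulated error are assumptions made only for this demonstration.

```python
import os
import tempfile

# Create a small throwaway file so that the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w") as text_file:
    text_file.write("hello\n")

try:
    with open(path) as text_file:
        first_line = text_file.readline()
        raise ValueError("simulated failure while processing the file")
except ValueError:
    pass

# The with block closed the file even though an exception was raised.
print(text_file.closed)
```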

Files can be opened in different modes. The default mode reads the file as text, but you can also write, append and create new files. Binary files, such as images, need to be opened in the binary mode since their contents cannot be meaningfully interpreted as text.

Mode   Description
"r"    Read only, raise FileNotFoundError if the file doesn’t exist
"w"    Write only, overwrite existing files or create a new file if it doesn’t exist
"a"    Append at the end of an existing file or create a new file
"x"    Create a file, fail if it already exists
"r+"   Read and write, does not create a new file
"w+"   Write and read, overwrite existing file or create a new file
"a+"   Append and read, append to an existing file or create a new file
"b"    Binary mode (images, video, executables, etc.), combine with one of the previous modes to work with binary files

The mode can be defined as the second argument to open(). It needs to be a string containing one of the opening modes, and you can include 'b' to switch to the binary mode, for example:

with open('video.mp4', 'rb') as video_file:
    # work with video.mp4 in binary read mode
    first_bytes = video_file.read(16)

In this case you would need to handle the data as bytes.
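Here is a rough sketch of what “handling the data as bytes” looks like in practice. The file name and its contents are hypothetical; the file is written to a temporary directory first so that the example can run anywhere.

```python
import os
import tempfile

# Hypothetical sample file; the name and contents are for illustration only.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as sample_file:
    sample_file.write(b"\x00\x01\x02hello")

with open(path, "rb") as sample_file:
    data = sample_file.read()

print(type(data))   # <class 'bytes'>
print(data[0])      # indexing a bytes object gives an integer
print(data[3:])     # slicing gives another bytes object
```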

There are some additional arguments that you may want to provide to the open() function, most notably the text encoding in text mode. You can find good explanations for the different options from the official documentation.
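To illustrate why the encoding argument matters, the sketch below writes a file as UTF-8 and then reads it back with both the right and the wrong encoding. The file name is an assumption for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "finnish.txt")  # hypothetical file name

# Write and read with an explicit encoding instead of the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write("päämäärä")

with open(path, encoding="utf-8") as f:
    correct = f.read()

# Reading with the wrong encoding does not fail here, it silently garbles the text.
with open(path, encoding="latin-1") as f:
    garbled = f.read()

print(correct)
print(garbled)
```

Being explicit about the encoding makes the result independent of the default that happens to be configured on the machine running the code.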

Read text data

Reading text files is really easy. Often you need to process a file line by line which can be done with a simple for loop since file objects are iterable. This script would simply print the contents of a file called quotes.txt line by line:

with open('quotes.txt') as quotes:
    for quote in quotes:
        # each line keeps its trailing newline, so don't let print() add another
        print(quote, end='')

As you can see, you don’t even need to know the file object methods for these basic operations. The context manager takes care of closing the file, and the for-loop iterates each line.

When you need more precision you can use the read() method. By default it returns the whole file as a string, but you can also choose the number of characters to be read. This code would read the first four characters of a file:

with open('data.txt') as data:
    header = data.read(4)

Sometimes you may need to load the whole file in memory as a list. In that case you can use the readlines() method:

with open('quotes.txt') as quote_file:
    quotes = quote_file.readlines()

By default, readlines() reads the file until it reaches the end. You can read the file in smaller chunks by passing a size hint, similar to the argument of the read() method. However, this method behaves quite differently: it reads whole lines until the total amount of data read reaches the hint, so the last line is always read to its end. Therefore, it can return more data than you asked for.
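The following sketch shows the size hint in action. The file contents and the hint value of 5 are picked only to make the effect visible.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers.txt")  # hypothetical file name
with open(path, "w") as number_file:
    number_file.write("one\ntwo\nthree\n")

with open(path) as number_file:
    # The hint is 5 characters, but whole lines are returned: after "one\n"
    # only 4 characters have been read, so the second line is read in full too.
    first_batch = number_file.readlines(5)
    rest = number_file.readlines()

print(first_batch)
print(rest)
```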

The singular version readline() reads the file until it encounters a newline character. It also lets you limit the maximum number of characters read at a time. It’s really handy when you need to read a single line at a time and maybe need a bit more control.
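A short sketch of readline() with and without a size limit, using a throwaway file whose contents are made up for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")  # hypothetical file name
with open(path, "w") as line_file:
    line_file.write("hello world\nsecond line\n")

with open(path) as line_file:
    part = line_file.readline(5)     # at most five characters of the first line
    remainder = line_file.readline() # the rest of the first line
    second = line_file.readline()    # the whole second line
    at_end = line_file.readline()    # empty string at the end of the file

print(repr(part), repr(remainder), repr(second), repr(at_end))
```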

Normally you would iterate the file object directly with a for loop, but there is actually an interesting case where you could use readline() together with the new assignment expression (:=), introduced in Python 3.8. If your lines are really long, you may want to read them in smaller chunks:

with open('large-data.txt') as data_file:
    while chunk := data_file.readline(512):
        # process partial data
        print(chunk)

The assignment expression := is also called the walrus operator because of the way it looks. It lets you initialize data and check the while loop condition all on the same line. Reading from the end of the file returns an empty string, which evaluates to false and exits the loop.

As you read the file with the read, readline and readlines methods, the file pointer moves ahead by the amount of data read. Each new read operation will therefore continue from the place where the previous one stopped.
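To make the moving file pointer concrete, this sketch mixes the three methods on one file. The file layout (a four character header followed by lines) is invented for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file name
with open(path, "w") as data:
    data.write("HEADline one\nline two\nline three\n")

with open(path) as data:
    header = data.read(4)          # the first four characters
    first_line = data.readline()   # continues right after the header
    remaining = data.readlines()   # everything that is left, as a list
    nothing = data.read()          # the pointer is at the end, so this is empty

print(header, repr(first_line), remaining, repr(nothing))
```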

Read binary files

Working with binary files doesn’t really differ in principle. Instead of normal strings you’re dealing with byte strings. In this case it might be useful to be familiar with byte operations so that you can access the data that you need.

You can think of binary files as a series of ones and zeros, or bits. But when you read or write to a binary file you must do so in groups of eight, or bytes. So even if you want to store a single boolean value you need to use at least eight bits. With bitwise operations you can store up to eight such values in one byte; these are sometimes called flags.
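As a small sketch of the flag idea, the example below packs three boolean values into a single byte and reads them back. The flag names are hypothetical and only serve as an illustration.

```python
# Hypothetical flag names; each one occupies a single bit.
IS_VISIBLE = 0b00000001
IS_LOCKED = 0b00000010
IS_ARCHIVED = 0b00000100

# Combine flags with bitwise OR and store them in one byte.
flags = IS_VISIBLE | IS_ARCHIVED
packed = bytes([flags])

# Read the byte back and test individual bits with bitwise AND.
restored = packed[0]
print(bool(restored & IS_VISIBLE))   # True
print(bool(restored & IS_LOCKED))    # False
print(bool(restored & IS_ARCHIVED))  # True
```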

Some data types require more space. You need four bytes to store a signed 32-bit integer that can hold a value between -2147483648 and 2147483647. Your program needs to know how to interpret the data it receives.
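One way to experiment with this is the struct module from the standard library. The sketch below is only an illustration (the little-endian byte order is an arbitrary choice), and it shows that the same four bytes give different numbers depending on how they are interpreted.

```python
import struct

# "<i" means a little-endian signed 32-bit integer.
largest = struct.pack("<i", 2147483647)
smallest = struct.pack("<i", -2147483648)
print(len(largest), len(smallest))

# The same four bytes interpreted as signed ("<i") and unsigned ("<I").
as_signed = struct.unpack("<i", smallest)[0]
as_unsigned = struct.unpack("<I", smallest)[0]
print(as_signed, as_unsigned)
```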

I will probably write a separate article about working with binary files in Python so I won’t go too deep in details here.

Let’s do something relatively simple. The PNG specification says the following of the PNG file signature:

The first eight bytes of a PNG file always contain the following values:

   (decimal)              137  80  78  71  13  10  26  10
   (hexadecimal)           89  50  4e  47  0d  0a  1a  0a
   (ASCII C notation)    \211   P   N   G  \r  \n \032 \n

A single byte can be described as a decimal number between 0 and 255, a hexadecimal number between 00 and ff, or, for printable values, as an ASCII character (other values are written as escape sequences such as the octal \211). In Python, we can represent this signature as a byte string in two equivalent ways:

hex_signature = b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a"
ascii_signature = b"\211PNG\r\n\032\n"

A byte string can be created by adding the b character before the quotation marks. Byte strings and normal strings are different types: comparing them with == is always False and joining them with + raises a TypeError, so you will need to do encoding or decoding before any such operations.

If you write these examples in an interactive Python interpreter and print them out you will get b'\x89PNG\r\n\x1a\n' as the result. If you look carefully, this is actually a combination of them both, so you can mix and match formats as you wish. Note that with hexadecimal numbers you need to add \x before each byte value.
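If you’d like to verify this yourself, a quick sketch such as the following confirms that both notations produce the same bytes and shows a conversion between byte strings and normal strings:

```python
hex_signature = b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a"
ascii_signature = b"\211PNG\r\n\032\n"

# Both notations describe exactly the same eight bytes.
print(hex_signature == ascii_signature)

# Converting to a list shows the decimal values from the specification.
print(list(hex_signature))

# Bytes 1..3 are plain ASCII, so they can be decoded to a normal string.
name = hex_signature[1:4].decode("ascii")
print(name)
print(name.encode("ascii") == hex_signature[1:4])
```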

Here is a small Python program, png.py, that checks if the provided file is a PNG image (or starts with the PNG file signature).

import sys

def main():
    try:
        filename = sys.argv[1]
    except IndexError:
        print("Please provide a filename")
        return

    with open(filename, "rb") as img:
        signature = img.read(8)
        if signature == b"\x89PNG\r\n\x1a\n":
            print("This is a PNG file")
        else:
            print("This is not a PNG file")

if __name__ == "__main__":
    main()

You can test it with different file types:

>> python3 png.py test.png
This is a PNG file
>> python3 png.py test.jpg
This is not a PNG file

The script interprets the first argument as the filename and opens the file in the read binary mode, "rb". Then it reads the first eight bytes, compares them against the known PNG signature, and prints the result to the console.

The same methods that you use to read text files can be used with binary files too. If you read a binary file line by line the result will be split on newline characters \n. Depending on the filetype that you’re dealing with this may or may not have a meaning.

Quite often with binary files you stick to the read() method and read a specific number of bytes at a time. You may need to read a few bytes of header data first to determine the size of the actual data.
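The header-then-data pattern could look roughly like this. The format here is entirely made up for the example (a 4-byte big-endian length followed by the payload), and io.BytesIO stands in for a real file opened in "rb" mode.

```python
import io
import struct

# Hypothetical format: a 4-byte big-endian length followed by that many payload bytes.
payload = b"hello binary world"
record = struct.pack(">I", len(payload)) + payload + b"trailing data"

# io.BytesIO stands in for a file opened with open(..., "rb").
with io.BytesIO(record) as data_file:
    # Read the fixed-size header first to learn how long the payload is.
    (size,) = struct.unpack(">I", data_file.read(4))
    body = data_file.read(size)
    leftover = data_file.read()

print(size, body, leftover)
```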

Jump between file locations

You don’t always process a file in order. Sometimes you need to jump from one file location to another to find the pieces of data that you’re interested in. In this case it might not make sense to read the whole file contents in memory.

As you read a file the file pointer will move forward. You can also move this pointer manually using the seek() method.

Seeking allows you to jump to a specific point in a file. This is an especially common requirement with binary files.

By default, the seek function uses absolute file positions. Therefore, you can seek to the beginning of the file really easily:

f.seek(0)

What if you’d like to seek to the end of the file? Then you need to pass a second argument to the seek function that defines where the seeking should start from. The available values are:

  • 0 absolute position from the start of the file (default),
  • 1 relative position from the current position, and
  • 2 relative position from the end of the file.

So in order to seek to the end of the file you’d need to move zero bytes backwards from the end:

f.seek(0, 2)
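A common use for this is finding out the size of a file. The sketch below uses a temporary file with ten bytes of made-up content, and the named constants from the os module, which have the same values as the plain numbers 0, 1 and 2.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "digits.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "rb") as f:
    f.seek(0, os.SEEK_END)   # same as f.seek(0, 2)
    size = f.tell()          # the position at the end equals the file size
    f.seek(0, os.SEEK_SET)   # same as f.seek(0)
    first = f.read(3)

print(size, first)
```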

These examples work with both text and binary files. But if you want to do more advanced movements there are some distinctions that you need to be aware of.

Text files support seeking only to absolute positions counted from the beginning of the file. Any attempt to do a non-zero seek relative to the current position or the end of the file will cause an exception:

with open("testfile.txt") as f:
    f.seek(10, 2) 
# Traceback (most recent call last):
#   File "<stdin>", line 2, in <module>
# io.UnsupportedOperation: can't do nonzero end-relative seeks

This exception is basically telling you that zero is the only allowed value if you want to seek relative to the file’s end.

You also need to be aware that the seek position is not a predictable number in text mode. Because of text encoding, the characters in your file may not match the bytes, so you cannot directly jump to an arbitrary position.

You can get the current file position with the tell() method. In text mode, the result from this function is the only number in addition to zero that you should use. Let me demonstrate this with another example. Let’s say that testfile.txt looks like this:

päämäärä
objective

Let’s read the first eight characters and use tell() to print the current position:

with open("testfile.txt", encoding="utf-8") as f:
    print(f.read(8))
    print(f.tell())

If you run the program with a UTF-8 encoded file it’ll print 13, not 8 like you might’ve expected. This is because each ä takes two bytes to encode in UTF-8!

So remember that with text files you can use tell() and seek() to go back to the file positions that you’ve already visited.
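The bookmark pattern could be sketched like this, assuming a UTF-8 encoded test file that is created in a temporary directory for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("päämäärä\nobjective\n")

with open(path, encoding="utf-8") as f:
    f.readline()              # skip the first line
    bookmark = f.tell()       # remember where the second line starts
    first_pass = f.readline()
    f.seek(bookmark)          # safe: the value came from tell()
    second_pass = f.readline()

print(repr(first_pass), repr(second_pass))
```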

Binary mode, in contrast, gives you the full power to move to any location you want. You can still use tell() to get the current position, and it will always be the number of bytes from the start of the file. With the seek mode 1 you can move relative to the current position. It’s up to you to make sure that you move the correct number of bytes.
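Here is a minimal sketch of relative seeking in binary mode. The file contains the byte values 0 to 9, chosen so that each byte’s value matches its position.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bytes.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.write(bytes(range(10)))   # bytes 0x00 .. 0x09

with open(path, "rb") as f:
    f.read(2)                   # now at position 2
    f.seek(3, 1)                # skip three bytes forward, now at position 5
    middle = f.read(1)
    f.seek(-2, 2)               # two bytes before the end of the file
    tail = f.read()
    position = f.tell()

print(middle, tail, position)
```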

Read and write files

Adding the plus sign + to the mode string opens the file in read/write mode, where both read and write operations are allowed. Take another look at the table at the beginning of this post to understand how the modes differ from each other.

For simplicity, I’m going to use the "r+" mode which opens an existing file with read and write access. It fails if the file doesn’t exist, and it doesn’t remove the file contents.
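Both properties of "r+" can be checked with a quick sketch. The file names below are hypothetical and live in a temporary directory.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
missing = os.path.join(workdir, "missing.txt")    # never created
existing = os.path.join(workdir, "existing.txt")

# "r+" refuses to open a file that doesn't exist.
try:
    with open(missing, "r+") as f:
        pass
    opened = True
except FileNotFoundError:
    opened = False

with open(existing, "w") as f:
    f.write("original contents\n")

# Opening with "r+" keeps the existing contents intact.
with open(existing, "r+") as f:
    before = f.read()

print(opened, repr(before))
```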

Assume that we have the following text file, python.txt:

Python is a programming language.
It was created by Guido van Rossum.
Programming with Python is fun!

Now let’s say that we want to write every occurrence of the word “Python” in uppercase. How could we do that? Here’s one possible solution:

with open("python.txt", "r+") as f:
    lines = []
    for line in f:
        lines.append(line.replace("Python", "PYTHON"))
    f.seek(0)
    f.writelines(lines)

First we open the file in read/write mode. Because we’re using r+ the file pointer starts from the beginning of the file and we can start reading immediately.

We assume that the file is small enough to fit in memory, so we create a list that will hold the final result. Then we read the file line by line in a for loop, replace any occurrences of the word “Python”, and append the lines to the list.

Because the for loop has now moved the file pointer to the end of the file, we need to seek to the beginning. Then we can use the writelines() method to save the modified data.

Note that if we didn’t seek to the beginning the end result would contain the original contents, followed by the modified data!

This worked well because we replaced the file with an equal amount of data. But what would happen if you created a fun little program that changes all occurrences of “Python” to “C”? If you make the required change on the fourth line in the example the result would look like this:

C is a programming language.
It was created by Guido van Rossum.
Programming with C is fun!
n is fun!

What is happening here is that you’re overwriting the file contents, but since your new file is actually smaller than the original it still contains parts of the old data. You should use the truncate() method to get rid of the extra text. The final program would now look like this:

with open("python.txt", "r+") as f:
    lines = []
    for line in f:
        lines.append(line.replace("Python", "C"))
    f.seek(0)
    f.writelines(lines)
    f.truncate()

Truncate ends the file at the current file position and drops any remaining data.

This also demonstrates how you can replace data in-place, but you cannot actually insert data in the middle. If you need to insert something in the middle of a file, then you should also read the rest of the file contents and store them temporarily so that you can write them back after your additions.
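One possible way to do such an insert is sketched below. It works in binary mode so that the seek position is a plain byte offset, and the file contents and the offset 6 are assumptions made for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "greeting.txt")  # hypothetical file name
with open(path, "wb") as f:
    f.write(b"Hello world\n")

with open(path, "r+b") as f:
    f.seek(6)                        # the insert position, just before "world"
    rest = f.read()                  # store the remaining contents temporarily
    f.seek(6)                        # go back to the insert position
    f.write(b"beautiful " + rest)    # write the addition, then the stored tail

with open(path, "rb") as f:
    result = f.read()

print(result)
```

This approach assumes that the tail of the file fits in memory, just like the earlier examples.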

You also need to be aware of the encoding, since different characters can take up a different number of bytes. Strings that have equal length in Python might have a different size on the file system. This can be a source of unexpected bugs!
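A quick sketch of the difference between string length and encoded size, using the Finnish word from the earlier example:

```python
word = "päämäärä"

characters = len(word)                        # length as Python sees it
utf8_size = len(word.encode("utf-8"))         # ä takes two bytes in UTF-8
latin1_size = len(word.encode("latin-1"))     # ä takes one byte in Latin-1
ascii_only = len("objective".encode("utf-8")) # ASCII text is one byte per character

print(characters, utf8_size, latin1_size, ascii_only)
```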

Conclusion

I hope you learned something new about file operations in Python, and if not then you’re already a pro!

Stay tuned for more Python content, comment on Twitter, and subscribe to the newsletter if you’d like to receive occasional emails from me. See you next time!
