Why sys.setdefaultencoding() will break code

I know wiser and more experienced Python coders have written to python-dev about this before, but every time I’ve needed to reference one of those messages for someone else I have trouble finding one. This time, the most relevant entry my Google search turned up was a post from myself to the yum-devel mailing list in 2011. Since I know I’ll need to prove why setdefaultencoding() is to be avoided again in the future, I figured I should post the reasoning here so that I don’t have to search the web next time.

Some Background

15 years ago: Creating a Unicode Aware Python

In Python 2 it is possible to mix byte strings (str type) and text strings (unicode type) together to a limited extent. For instance:

>>> u'Toshio' == 'Toshio'
True
>>> print(u'Toshio' + ' Kuratomi')
Toshio Kuratomi

When you perform these operations Python sees that you have a unicode type on one side and a str type on the other. It takes the str value, decodes it to the unicode type, and then performs the operation. The encoding it uses to interpret the bytes is what we’re going to call Python’s defaultencoding (named after sys.getdefaultencoding(), which lets you see what this value is set to).
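
Roughly speaking, the implicit decode above is the same as explicitly decoding the byte string with the defaultencoding; here’s a quick sketch on a stock Python 2 install:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'Toshio' == 'Toshio'.decode(sys.getdefaultencoding())
True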

When the Python developers were first experimenting with a unicode-aware text type that was distinct from byte strings it was unclear what the value of defaultencoding should be. So they created a function to set the defaultencoding when Python started in order to experiment with different settings. The function they created was sys.setdefaultencoding() and the Python authors would modify their individual site.py files to gain experience with how different encodings would change the experience of coding in Python.

Eventually, in October of 2000 (fourteen and a half years prior to my writing this), that experimental version of Python became Python-2.0, and the Python authors decided that the sensible setting for defaultencoding should be ascii.

I know it’s easy to second-guess the ascii decision today, but remember that 14 years ago the encoding landscape was a lot more cluttered. New programming languages and new APIs were emerging that optimized for fixed-width 2-byte encodings of unicode. 1-byte, non-unicode encodings for specific natural languages were even more popular then than they are now. Many pieces of data (even more than today!) could include non-ascii text without specifying what encoding to interpret that data as. In that environment anyone venturing outside of the ascii realm needed to be warned that they were entering a world where encoding dragons roamed freely. The ascii defaultencoding helped to warn people that they were entering a land where their code had to take special precautions, by throwing an error in many of the cases where the boundary was crossed.

However, there was one oversight about the unicode functionality that went into Python-2.0 that the Python authors grew to realize was a bad idea: they did not remove the setdefaultencoding() function. They had taken some steps to prevent it from being used outside of initialization (in the site.py file) by deleting the reference to it from the sys module after Python initialized, but it still existed for people to modify the defaultencoding in site.py.
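
You can see the result of that deletion for yourself; on a stock Python 2 install the function is simply gone once the interpreter has finished starting up:

$ python2
>>> import sys
>>> sys.setdefaultencoding('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'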

The rise of the sys.setdefaultencoding() hack

As time went on, the utf-8 encoding emerged as the dominant encoding of both Unix-like systems and the Internet. Many people who only had to deal with utf-8 encoded text were tired of getting errors when they mixed byte strings and text strings together. Seeing that there was a function called setdefaultencoding(), people started trying to use it to get rid of the errors they were seeing.

At first, those with the ability to do so tried modifying their Python installation’s global site.py to make sys.setdefaultencoding() do its thing. This is what the Python documentation suggests is the proper way to use it, and it seemed to work on the users’ own machines. Unfortunately, the users often turned out to be coders. And it turned out that what these coders were doing was writing programs that had to work on machines run by other people: the IT department, customers, and users all over the Internet. That meant that applying the change in their site.py often left them in a worse position than before: they would code something which would appear to work on their machines but which would fail for the people who were actually using their software.

Since the coders’ concern was confined to whether people would be able to run their software, the coders figured that if their software could set the defaultencoding as part of its initialization, that would take care of things. They wouldn’t have to force other people to modify their Python install; their software could make that decision for them when the software was invoked. So they took another look at sys.setdefaultencoding(). Although the Python authors had done their best to make the function unavailable after Python started up, these coders hit upon a recipe to get at the functionality anyway:

import sys
reload(sys)  # reloading sys restores the setdefaultencoding() attribute that was deleted at startup
sys.setdefaultencoding('utf-8')

Once this was run in the coders’ software, the default encoding for coercing byte strings to text strings was utf-8. This meant that when utf-8 encoded byte strings were mixed with unicode text strings, Python would successfully convert the str type data to the unicode type and combine the two into one unicode string. This is what this new generation of coders was expecting from the majority of their data, so the idea of solving their problem with just these few lines of (admittedly very hacky) code was very attractive to them. Unfortunately, there are non-obvious drawbacks to doing this….
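
To see what the hack bought them, here’s a small before-and-after sketch in an interactive session, assuming the byte string really is utf-8 encoded:

>>> u'Café' == u'Café'.encode('utf-8')
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> u'Café' == u'Café'.encode('utf-8')
True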

Why sys.setdefaultencoding() will break your code

(1) Write once, change everything

The first problem with sys.setdefaultencoding() isn’t obviously a problem at first glance. When you call sys.setdefaultencoding() you are telling Python to change the defaultencoding for all of the code that it is going to run. Your software’s code, the code in the stdlib, and third-party library code all end up running with your setting for defaultencoding. That means that code you weren’t responsible for, which relied on the defaultencoding being ascii, would now stop throwing errors and could start creating garbage values instead. For instance, let’s say one of the libraries you rely on does this:

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        # detect_encoding() is this hypothetical library's own encoding-guessing helper
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Prior to setting the defaultencoding, this code would be unable to decode the “Å” in the ascii encoding and would enter the exception handler, which guesses the encoding and properly turns the byte string into unicode, printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8, the code finds that the byte_string can be interpreted as utf-8, so it mangles the data and returns this instead: Angstrom (Ů) runs your business.
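
If you’re wondering where the Ů comes from: the two latin-1 bytes for “Å®” also happen to form a single valid utf-8 sequence, so a utf-8 decode silently produces a different character. A quick sketch:

>>> data = u"Å®".encode("latin-1")
>>> data
'\xc5\xae'
>>> print(data.decode("utf-8"))   # the same two bytes, read as one utf-8 character
Ů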

Naturally, if this was your code, in your piece of software, you’d be able to fix it to deal with the defaultencoding being set to utf-8. But if it’s in a third party library that luxury may not exist for you.

(2) Let’s break dictionaries!

The most important problem with setting defaultencoding to the utf-8 encoding is that it will break certain assumptions about dictionaries. Let’s write a little code to show this:

def key_in_dict(key, dictionary):
    if key in dictionary:
        return True
    return False

def key_found_in_dict(key, dictionary):
    for dict_key in dictionary:
        if dict_key == key:
            return True
    return False

Would you assume that given the same inputs the output of both functions will be the same? In Python, if you don’t hack around with sys.setdefaultencoding(), your assumption would be correct:

>>> # Note: the following is the same as d = {'Café': 'test'} on
>>> #       systems with a utf-8 locale
>>> d = { u'Café'.encode('utf-8'): 'test' }
>>> key_in_dict('Café', d)
True
>>> key_found_in_dict('Café', d)
True
>>> key_in_dict(u'Café', d)
False
>>> key_found_in_dict(u'Café', d)
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

But what happens if you call sys.setdefaultencoding('utf-8')? Answer: the assumption breaks:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> d = { u'Café'.encode('utf-8'): 'test' }
>>> key_in_dict('Café', d)
True
>>> key_found_in_dict('Café', d)
True
>>> key_in_dict(u'Café', d)
False
>>> key_found_in_dict(u'Café', d)
True

This happens because the in operator on a dictionary hashes the key and looks for a matching hash, only falling back to an equality check when the hashes match. In Python 2, only byte strings made up entirely of ascii characters hash to the same values as the equivalent unicode text strings. For all other characters the hash of the byte string and the hash of the unicode text string will be different values, so in never finds a match. The comparison operator (==), on the other hand, converts the byte string to a unicode type and then compares the results. When you call setdefaultencoding('utf-8') you allow the byte string to be transformed into a unicode type; the two text strings are then compared and found to be equal. The ramification of this is that containment tests with in now yield different results than equality testing of individual entries via ==. This is a pretty big difference in behaviour to get used to and for most people would count as having broken a fundamental assumption of the language.
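
You can see the hash side of this directly in Python 2 (the exact hash values are an implementation detail; only whether they match matters):

>>> hash(u'Cafe') == hash(u'Cafe'.encode('utf-8'))   # ascii-only: byte string and text string hash the same
True
>>> hash(u'Café') == hash(u'Café'.encode('utf-8'))   # non-ascii: the hashes differ, so `in` never even calls ==
False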

So how does Python 3 fix this?

You may have heard that in Python 3 the default encoding has been switched from ascii to utf-8. How does it get away with that without encountering the equality versus containment problem? The answer is that python3 does not perform implicit conversions between byte strings (python3 bytes type) and text strings (python3 str type). Since the two types are now entirely separate, comparing them via both equality and containment will always yield False:

$ python3
>>> a = {'A': 1}
>>> b'A' in a
False
>>> b'A' == list(a.keys())[0]
False

At first, coming from python2, where an ascii byte string and the equivalent text string compared as equal, this might look a little funny. But just remember that bytes are really a sequence of numbers, and you wouldn’t expect this to work either:

>>> a = {'1': 'one'}
>>> 1 in a
False
>>> 1 == list(a.keys())[0]
False