Why sys.setdefaultencoding() will break code

I know wiser and more experienced Python coders have written to python-dev about this before, but every time I’ve needed to reference one of those messages for someone else I have trouble finding one. This time when I did my Google search, the most relevant entry was a post from myself to the yum-devel mailing list in 2011. Since I know I’ll need to prove why setdefaultencoding() is to be avoided again in the future, I figured I should post the reasoning here so that I don’t have to search the web next time.

Some Background

15 years ago: Creating a Unicode Aware Python

In Python 2 it is possible to mix byte strings (str type) and text strings (unicode type) together to a limited extent. For instance:

>>> u'Toshio' == 'Toshio'
True
>>> print(u'Toshio' + ' Kuratomi')
Toshio Kuratomi

When you perform these operations Python sees that you have a unicode type on one side and a str type on the other. It takes the str value, decodes it to a unicode type, and then performs the operation. The encoding it uses to interpret the bytes is what we’re going to call Python’s defaultencoding (named after sys.getdefaultencoding(), which lets you see what this value is set to).
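To make that coercion concrete, here is a minimal Python 3 sketch of what Python 2 did implicitly; the DEFAULT_ENCODING constant and the coerced_concat() helper are illustrative names for this post, not real APIs:

```python
DEFAULT_ENCODING = 'ascii'  # what sys.getdefaultencoding() reports in Python 2

def coerced_concat(text, raw):
    # Sketch of Python 2's implicit unicode + str concatenation:
    # decode the byte string with the defaultencoding, then join text to text.
    return text + raw.decode(DEFAULT_ENCODING)

# Pure-ascii data combines cleanly, just like u'Toshio' + ' Kuratomi':
print(coerced_concat('Toshio', b' Kuratomi'))  # Toshio Kuratomi

# Non-ascii bytes cannot be decoded as ascii, so the operation blows up --
# Python 2's way of warning you that you crossed the encoding boundary:
try:
    coerced_concat('Café ', 'über'.encode('utf-8'))
except UnicodeDecodeError:
    print('UnicodeDecodeError')
```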

When the Python developers were first experimenting with a unicode-aware text type that was distinct from byte strings it was unclear what the value of defaultencoding should be. So they created a function to set the defaultencoding when Python started in order to experiment with different settings. The function they created was sys.setdefaultencoding() and the Python authors would modify their individual site.py files to gain experience with how different encodings would change the experience of coding in Python.

Eventually, in October of 2000 (fourteen and a half years prior to me writing this), that experimental version of Python became Python-2.0, and the Python authors had decided that the sensible setting for defaultencoding was ascii.

I know it’s easy to second-guess the ascii decision today, but remember that 14 years ago the encoding landscape was a lot more cluttered. New programming languages and new APIs were emerging that optimized for fixed-width 2-byte encodings of unicode. 1-byte, non-unicode encodings for specific natural languages were even more popular then than they are now. Many pieces of data (even more than today!) could include non-ascii text without specifying what encoding to interpret that data as. In that environment anyone venturing outside of the ascii realm needed to be warned that they were entering a world where encoding dragons roamed freely. The ascii defaultencoding helped to warn people that they were entering a land where their code had to take special precautions, by throwing an error in many of the cases where the boundary was crossed.

However, there was one oversight in the unicode functionality that went into Python-2.0 that the Python authors grew to realize was a bad idea: not removing the setdefaultencoding() function. They had taken some steps to prevent it from being used outside of initialization (in the site.py file) by deleting the reference to it from the sys module after Python initialized, but it still existed for people to modify the defaultencoding in site.py.

The rise of the sys.setdefaultencoding() hack

As time went on, the utf-8 encoding emerged as the dominant encoding of both Unix-like systems and the Internet. Many people who only had to deal with utf-8 encoded text were tired of getting errors when they mixed byte strings and text strings together. Seeing that there was a function called setdefaultencoding(), people started trying to use it to get rid of the errors they were seeing.

At first, those with the ability to do so tried modifying their Python installation’s global site.py to make sys.setdefaultencoding do its thing. This is what the Python documentation suggests is the proper way to use it, and it seemed to work on the users’ own machines. Unfortunately, the users often turned out to be coders. And it turned out that what these coders were doing was writing programs that had to work on machines run by other people: the IT department, customers, and users all over the Internet. That meant that applying the change in their own site.py often left them in a worse position than before: they would code something which would appear to work on their machines but which would fail for the people who were actually using their software.

Since the coders’ concern was confined to whether people would be able to run their software, the coders figured that if their software could set the defaultencoding as part of its initialization, that would take care of things. They wouldn’t have to force other people to modify their Python install; their software could make that decision for them when the software was invoked. So they took another look at sys.setdefaultencoding(). Although the Python authors had done their best to make the function unavailable after Python started up, these coders hit upon a recipe to get at the functionality anyway:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

Once this was run in the coders’ software, the default encoding for coercing byte strings to text strings was utf-8. This meant that when utf-8 encoded byte strings were mixed with unicode text strings, Python would successfully convert the str type data to the unicode type and combine the two into one unicode string. This is what this new generation of coders were expecting from the majority of their data, so the idea of solving their problem with just these few lines of (admittedly very hacky) code was very attractive to them. Unfortunately, there are non-obvious drawbacks to doing this….

Why sys.setdefaultencoding() will break your code

(1) Write once, change everything

The first problem with sys.setdefaultencoding() is not obviously a problem at first glance. When you call sys.setdefaultencoding() you are telling Python to change the defaultencoding for all of the code that it is going to run. Your software’s code, the code in the stdlib, and third-party library code all end up running with your setting for defaultencoding. That means that code you weren’t responsible for, which relied on the defaultencoding being ascii, would now stop throwing errors and could silently start producing garbage values. For instance, let’s say one of the libraries you rely on does this:

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Prior to setting the defaultencoding, this code would be unable to decode the “Å” in the ascii encoding, and would then enter the exception handler to guess the encoding and properly turn it into unicode, printing: Angstrom (Å®) runs your business. Once you’ve set the defaultencoding to utf-8, the code will find that the byte_string can be interpreted as utf-8, and so it will mangle the data and return this instead: Angstrom (Ů) runs your business.

Naturally, if this was your code, in your piece of software, you’d be able to fix it to deal with the defaultencoding being set to utf-8. But if it’s in a third party library that luxury may not exist for you.
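The mangling itself is easy to reproduce in any Python 3 interpreter, because it is purely a property of the bytes involved, no setdefaultencoding required: the latin-1 bytes for “Å®” happen to form a valid utf-8 sequence, so decoding with the wrong encoding succeeds silently:

```python
raw = 'Angstrom (Å®)'.encode('latin-1')

# Decoded with the encoding the data was actually written in:
print(raw.decode('latin-1'))  # Angstrom (Å®)

# Decoded as utf-8 there is *no error*: the bytes 0xC5 0xAE form a
# valid two-byte utf-8 sequence for U+016E, so the data is mangled:
print(raw.decode('utf-8'))    # Angstrom (Ů)
```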

(2) Let’s break dictionaries!

The most important problem with setting defaultencoding to the utf-8 encoding is that it will break certain assumptions about dictionaries. Let’s write a little code to show this:

def key_in_dict(key, dictionary):
    if key in dictionary:
        return True
    return False

def key_found_in_dict(key, dictionary):
    for dict_key in dictionary:
        if dict_key == key:
            return True
    return False

Would you assume that given the same inputs the output of both functions will be the same? In Python, if you don’t hack around with sys.setdefaultencoding(), your assumption would be correct:

>>> # Note: the following is the same as d = {'Café': 'test'} on
>>> #       systems with a utf-8 locale
>>> d = { u'Café'.encode('utf-8'): 'test' }
>>> key_in_dict('Café', d)
True
>>> key_found_in_dict('Café', d)
True
>>> key_in_dict(u'Café', d)
False
>>> key_found_in_dict(u'Café', d)
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False

But what happens if you call sys.setdefaultencoding('utf-8')? Answer: the assumption breaks:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> d = { u'Café'.encode('utf-8'): 'test' }
>>> key_in_dict('Café', d)
True
>>> key_found_in_dict('Café', d)
True
>>> key_in_dict(u'Café', d)
False
>>> key_found_in_dict(u'Café', d)
True

This happens because the in operator hashes the key and uses that hash to look it up; a stored key is only compared for equality if its hash matches. In the utf-8 encoding, only byte strings consisting purely of ascii characters hash to the same value as the corresponding unicode text string. For all other strings the hash of the byte string and the hash of the unicode text string will be different values. The equality operator (==), on the other hand, converts the byte string to a unicode type and then compares the results. When you call setdefaultencoding('utf-8') you allow the byte string to be successfully transformed into a unicode type; the two text strings are then compared and found to be equal. The ramification of this is that containment tests with in now yield different values than equality testing individual entries via ==. This is a pretty big difference in behaviour to get used to, and for most people it would count as having broken a fundamental assumption of the language.
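The same breakage can be demonstrated in any Python version with a deliberately broken key class. BadKey below is a hypothetical illustration, not anything from the stdlib: its == coerces to plain text but its hash does not keep up, which is exactly the situation setdefaultencoding('utf-8') created for Python 2’s strings:

```python
class BadKey(object):
    """A key whose __eq__ coerces to plain text but whose __hash__
    does not follow suit -- breaking the rule that objects which
    compare equal must also hash equal."""
    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        other_text = other.text if isinstance(other, BadKey) else other
        return self.text == other_text

    def __hash__(self):
        # Deliberately different from hash(self.text):
        return hash(('BadKey', self.text))

d = {BadKey('Café'): 'test'}
print('Café' in d)                  # False: the hash lookup never finds it
print(any(k == 'Café' for k in d))  # True: a linear equality scan does
```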

So how does Python 3 fix this?

You may have heard that in Python 3 the default encoding has been switched from ascii to utf-8. How does it get away with that without encountering the equality-versus-containment problem? The answer is that python3 does not perform implicit conversions between byte strings (the python3 bytes type) and text strings (the python3 str type). Since the two are now entirely separate types, comparing them via both equality and containment will always yield False:

$ python3
>>> a = {'A': 1}
>>> b'A' in a
False
>>> b'A' == list(a.keys())[0]
False

At first, coming from python2, where ascii byte strings and text strings compared equal, this might look a little funny. But just remember that bytes are really a sequence of numbers, and you wouldn’t expect this to work either:

>>> a = {'1': 'one'}
>>> 1 in a
False
>>> 1 == list(a.keys())[0]
False
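In both cases the remedy is the same: convert explicitly before the lookup, so that you, and not the interpreter, decide how the conversion happens. A small sketch:

```python
a = {'1': 'one'}
# Convert the number to text yourself, then the lookup succeeds:
print(str(1) in a)                # True

b = {'A': 1}
# Likewise, decode bytes to text with an explicit encoding first:
print(b'A'.decode('ascii') in b)  # True
```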

6 thoughts on “Why sys.setdefaultencoding() will break code”

  1. In sad old reality, though, dealing with the problems you mention may still be a better option than dealing with the pain of the ASCII default.

    I dunno how well known it is, but anaconda uses the setdefaultencoding hack…and with wider impact, pygtk2 did it as well!

    https://git.gnome.org/browse/pygtk/tree/pangomodule.c#n69

    so anything that used pygtk2 (with pango) had the ‘hack’ applied to it.

    When anaconda moved to gtk3, the default encoding for live image installs went back to being ascii (because the way the hack used to be implemented in anaconda didn’t work for lives, so we were in fact relying on the pygtk2 hack for lives). And we promptly started getting UnicodeEncode/DecodeErrors all over the goddamn place. In a sufficiently complex codebase there’s just so many goddamn different ways you can wind up trying to combine a unicode and a str that no-one much fancied the idea of trying to fix them all ‘properly’ – so we wound up just modifying the hack to apply to lives properly, for Fedora 22.

    • Yeah, I remember when the pygtk2 hack was discovered by someone in python upstream. I hadn’t yet wrapped my head around the ramifications of it so a lot of that conversation went over my head. But years later, I understand what the upstream developers were getting at. What’s happening is that you’re trading an unambiguous message about something being wrong for a message-free experience where bad data can silently creep in and start corrupting your data.

      Anaconda strikes me as a particularly bad place to rely on this for the long term (instead of, say, a stopgap until someone can devote the time to fixing all the things) because the hack is global in nature, as I mentioned, and anaconda has to deal with a lot of encoding-less data (all of the rpm metadata). On the other hand, as you point out, time is a huge issue with a large existing codebase. There are some tools to help you out here: enabling python-unicodenazi during development will greatly aid in finding places where the code is mixing the str and unicode types. python-kitchen’s i18n module gives you a class that avoids tracebacks when using gettext to internationalize your messages, and its text.converters module gives you functions that avoid tracebacks when converting from byte str to unicode and from unicode to byte str. And of course python3 itself won’t automatically convert between bytes and text anymore, so you are alerted to all of those places when they occur rather than when an odd piece of data creeps in.

      But even with all of these tools you’ve only saved the time spent finding and fixing all the places in the code that mix the types together; it doesn’t eliminate that time altogether. So there is a manpower cost that has to be paid eventually, and Fedora 22 probably wasn’t the time to pay it. You just want to avoid the mistake of thinking you never have to pay it; otherwise you amass enough unfixable bugs (for instance, making utf-8 the default still causes tracebacks if the data uses the high bit in a different encoding) and other people’s software addresses enough of your corner cases that people start migrating.

  2. The dictionary breakage is rather generic: it happens whenever your hashing function does not respect your equality operator’s equivalences. Whenever you add a coercion to your equality operator, some kind of special case must be added to the hashing function to make the key-in behavior the same as the key-equality one. I can only assume that Python 2 in fact has such a special case, but unlike equality it does not adjust for the defaultencoding, instead applying only for high-bit-clear bytes and ascii characters. This would be a bug if it weren’t abundantly clear that the defaultencoding is not meant to be changed in the first place. I wonder if, back before the defaultencoding was fixed, the hashing function respected it, or if it had just such a bug.

    As for the welcome_message example, I’m not sure why one would try to coerce automatically, and only upon that causing an error try to decode properly, instead of decoding properly right away. On the other hand, I don’t doubt such code is out there either way.

    Thank you for such a clear explanation.

    • Yeah, the latter is definitely what I’ve run into. If it’s a problem with code I control it’s no problem to fix it and the fix is obvious. Unfortunately there’s a lot of code out there where someone was just trying to hack around a reported UnicodeError traceback and they came up with something pretty strange to fix it.
