Porting Kitchen to Python3: Part 1 — Detecting string types

I’ve spent a good part of the last week working on the python3 port of kitchen. It’s now to the point where I’ve reviewed all of the code and got the unittests passing. I still need to add some deprecation warnings and a gettext object that mirrors the python3 API instead of the python2 API. Then it’ll be ready for an alpha release. Still a lot of work to do before a final release. Most of the documentation will need to be updated to change from unicode + str to str + bytes and the best practices sections will need a major overhaul since a lot of the problems with python2 and unicode have either been fixed, mitigated, or moved to a different level.

It was both an easy and hard undertaking. The easy part was that kitchen is largely a collection of dependent but unrelated functions. So it’s reasonably easy to pick a set of functions, figure out that they don’t depend on anything else in kitchen, and then port them one by one.

The hard part is that a lot of those functions deal with things that are explicitly unicode and things that are explicitly byte strings; an area that has both changed dramatically in python3 and that 2to3 doesn’t handle very well. Here’s a couple of things I ended up doing to help out:

Detecting String Types

Kitchen has several places that need to know whether an object it’s been given is a byte string, unicode string, or a generic string. The python2 idioms for this are:

if isinstance(obj, basestring):
    pass # object is any of the string types
    if isinstance(obj, str):
        pass # object is a byte string
    elif isinstance(obj, unicode):
        pass # object is a unicode string
else:
    pass # object was not a string type

In python3, a couple things have changed.

  • There’s no longer a basestring type as byte strings and unicode strings are no longer meant to be related types.
  • Byte strings now have an immutable (bytes) and mutable (bytearray) type.

With these changes, the python3 idioms equivalent to the python2 ones look something like this:

if isinstance(obj, str) or isinstance(obj, bytes) or isinstance(obj, bytearray):
    pass # any string type
    if isinstance(obj, bytes) or isinstance(obj, bytearray):
        pass # byte string
    elif isinstance(obj, str):
        pass # unicode string

There’s two issues with these changes:

  • code that needs to do this needs to be manually ported when moving from python2 to python3. 2to3 can correctly change all occurrences of isinstance(obj, unicode) to isinstance(obj, str) but occurrences of isinstance(obj, basestring) and isinstance(obj, str) will also be rendered as isinstance(obj, str) in the 2to3 output. This is correct for a lot of normal python2 code that is trying to separate strings from ints, floats, or other types but it is incorrect for code that’s trying to explicitly separate bytes from unicode. So you’ll need to hand-audit and fix your code wherever these idioms are being used.
  • This is more prolix and tedious to write than the python2 version and if your code has to do this sort of differentiation in many places you’ll soon get bored of it.

For kitchen, I added a few helper functions into kitchen.text.misc that encapsulate the python2 and python3 idioms. For instance:

def isbasestring(obj):
    if isinstance(obj, str) or isinstance(obj, bytes) or isinstance(obj, bytearray):
        return True
    return False

and similar for isunicodestring() and isbytestring(). [In case you're curious, I broke with PEP8 style for these function names because of the long history of is* functions and methods in python and other programming languages.] By pushing these into functions, I can use if isbasetring(obj): on both python2 and python3. I only have to change the implementation of the is*string() functions in a single place when porting from python2 to python3.

Now let’s mention some of the caveats to using this:

  • In python, calling a function (isbasestring()) is somewhat expensive. So if you use this in any hot inner loops, you may want to benchmark with the function and with the expanded version to see whether you take a noticable loss of speed.
  • Not every piece of code is going to want to define “string” in the same way. For instance, bytearrays are mutable so maybe your code shouldn’t include those with the “normal” string types.
  • Maybe your code can be changed to only deal with unicode strings (str). In python3 byte strings are not as ubiquitous as they were in python2 so maybe your code can be changed to stop checking for the type of the object altogether or reduced to a single isinstance(obj, str). The language has evolved so when possible, evolve your code to adapt as well.

Next time: Literals

About these ads

One thought on “Porting Kitchen to Python3: Part 1 — Detecting string types

  1. You can instead write:

    isinstance(var, (str, bytes, bytearray))

    But most of the time tou are right, what you really need to do is refactor the code.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s