In the last post, I promised to talk about implementing recursive validation of Python Thrift structures. However, for a number of totally good reasons, I’d like to take a little detour through the fjords of serialization and deserialization first.

Unlike Protocol Buffers, serialization in Thrift is not as simple as calling msg.SerializeToString(). This is because of a wise-but-cumbersome design decision made by the Thrift implementers to decouple Thrift objects from the particulars of their serialization and transport mechanisms. This design makes a ton of sense because in some contexts you want to serialize objects to ultra-compact binary streams while in other situations JSON or XML are much more appropriate (e.g. if you want to read the output or have external clients). It also allows you (the developer) to implement new transport protocols and serialization approaches rather simply. However, an annoying consequence of this design is that actually serializing a Thrift structure to a string of bytes is 4 lines of code instead of 1:

transportOut = TTransport.TMemoryBuffer()
protocolOut = TBinaryProtocol.TBinaryProtocol(transportOut)
msg.write(protocolOut)
serialized = transportOut.getvalue()

As it took me more than a few seconds of Googling to figure out how to do this, I made some convenience methods - SerializeThriftMsg and DeserializeThriftMsg - and put them on GitHub.

Unlike the validate() methods, serialization does do some type checking but it does the checks in a haphazard and frankly slightly dangerous fashion. I’ll use the following Thrift structures to demonstrate what’s going on.

namespace py avi.thrift.validation.example

struct Point {
	1: required double x;
	2: required double y;
}

struct Review {
	1: required i32 rating;
	2: optional string text;
}

struct Place {
	1: required string name;
	2: required Point location;
	3: optional Review review;
}

For reference, these are the same definitions I used last time except that I made the Review into a structure of its own for reasons that will become clear shortly. By playing with the Point structure in ipython, we see that serialization checks that required fields are set and checks their types to boot.

In [1]: from avi.thrift.validation.example.ttypes import Point, Place, Review

In [2]: from util.serialization import SerializeThriftMsg, DeserializeThriftMsg

In [3]: point = Point()

In [4]: SerializeThriftMsg(point)

TProtocolException: Required field x is unset!

In [5]: point = Point(x=1.2)

In [6]: SerializeThriftMsg(point)

TProtocolException: Required field y is unset!

In [7]: point = Point(x=1.2, y="asdf")

In [8]: SerializeThriftMsg(point)

error: required argument is not a float

In [9]: point = Point(x=1.2, y=5.0)

In [10]: SerializeThriftMsg(point)
Out[10]: '\x04\x00\x01?\xf3333333\x04\x00\x02@\x14\x00\x00\x00\x00\x00\x00\x00'

HOWEVER. When we start to play with structures like Place which contain embedded structures, we see some surprising behavior:

In [11]: place = Place(name="avi's place", location=Point(x=1.2, y=5.0))

In [12]: SerializeThriftMsg(place)
Out[12]: "\x0b\x00\x01\x00\x00\x00\x0bavi's place\x0c\x00\x02\x04\x00\x01?\xf3333333\x04\x00\x02@\x14\x00\x00\x00\x00\x00\x00\x00\x00"

In [13]: place.review = Point(x=3.5, y=2.2)

In [14]: SerializeThriftMsg(place)
Out[14]: "\x0b\x00\x01\x00\x00\x00\x0bavi's place\x0c\x00\x02\x04\x00\x01?\xf3333333\x04\x00\x02@\x14\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x04\x00\x01@\x0c\x00\x00\x00\x00\x00\x00\x04\x00\x02@\x01\x99\x99\x99\x99\x99\x9a\x00\x00"

In [15]: place.location = Review(rating=4, text="this place is great")

In [16]: SerializeThriftMsg(place)
Out[16]: "\x0b\x00\x01\x00\x00\x00\x0bavi's place\x0c\x00\x02\x08\x00\x01\x00\x00\x00\x04\x0b\x00\x02\x00\x00\x00\x13this place is great\x00\x0c\x00\x03\x04\x00\x01@\x0c\x00\x00\x00\x00\x00\x00\x04\x00\x02@\x01\x99\x99\x99\x99\x99\x9a\x00\x00"

We can set the required location field, which should be a Point, to a Review object and serialization does not complain. Likewise, we can set the review field to a Point object and we can serialize without any type errors: from Thrift’s perspective, all Thrift structures have the same type (at least during serialization in Python). If you got the impression that serialization was typesafe from the type checks triggered while serializing the Point object, you were fooled (as was I).
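One way to see why the serializer cannot tell a Point from a Review is to decode Out[16] by hand. The sketch below assumes the usual TBinaryProtocol layout (one type byte, a big-endian 16-bit field id, then the value, with a zero byte ending each struct) and handles only the type codes that show up in this post. decode_struct is a hypothetical helper written for illustration; it is not part of Thrift.

```python
import struct

# Binary-protocol type codes for the field types used in this post.
T_STOP, T_DOUBLE, T_I32, T_STRING, T_STRUCT = 0, 4, 8, 11, 12


def decode_struct(data, pos=0):
    """Decode one struct starting at data[pos].

    Returns (fields, next_pos), where fields is a list of
    (field_id, type_code, value) tuples. Illustrative only.
    """
    fields = []
    while True:
        (ftype,) = struct.unpack_from('>B', data, pos)
        pos += 1
        if ftype == T_STOP:
            return fields, pos
        (fid,) = struct.unpack_from('>h', data, pos)
        pos += 2
        if ftype == T_DOUBLE:
            (value,) = struct.unpack_from('>d', data, pos)
            pos += 8
        elif ftype == T_I32:
            (value,) = struct.unpack_from('>i', data, pos)
            pos += 4
        elif ftype == T_STRING:
            (length,) = struct.unpack_from('>i', data, pos)
            pos += 4
            value = data[pos:pos + length]
            pos += length
        elif ftype == T_STRUCT:
            # Nested structs carry no class name, only their own fields.
            value, pos = decode_struct(data, pos)
        else:
            raise ValueError('unhandled type code %d' % ftype)
        fields.append((fid, ftype, value))


# The bytes from Out[16]: a Place whose location holds a Review.
OUT_16 = (b"\x0b\x00\x01\x00\x00\x00\x0bavi's place"
          b"\x0c\x00\x02\x08\x00\x01\x00\x00\x00\x04"
          b"\x0b\x00\x02\x00\x00\x00\x13this place is great\x00"
          b"\x0c\x00\x03\x04\x00\x01@\x0c\x00\x00\x00\x00\x00\x00"
          b"\x04\x00\x02@\x01\x99\x99\x99\x99\x99\x9a\x00\x00")

fields, end = decode_struct(OUT_16)
for fid, ftype, value in fields:
    print('field %d, type %d: %r' % (fid, ftype, value))
```

Fields 2 and 3 both arrive tagged with type code 12 (struct) and nothing else. The bytes record the ids and primitive types of the inner fields, but not which struct class produced them, so at this level a Review sitting in the location slot looks like any other struct.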

Now you might think, as I did, that deserialization should catch all these problems and alert you to the fact that someone is sending around malformed Thrift structures. If you thought that, you are half-right: deserialization does seem to detect these errors, but it swallows them whole and does not report them to anyone.

In [17]: place = Place(name="avi's place", location=Review(rating=4, text="great place"))

In [18]: serialized = SerializeThriftMsg(place)

In [19]: deserialized = DeserializeThriftMsg(Place(), serialized)

In [20]: print deserialized
Place(review=None, name="avi's place", location=Point(y=None, x=None))

Here I incorrectly set the location field to a Review and then serialized and deserialized the Place object. When I print the deserialized Place, you can see that the location field is correctly initialized to a Point object, but that point is empty (and, therefore, invalid) and no error was raised during deserialization. So I guess I’ll end by repeating the title: especially if both ends of your Thrift client/server application are in scripting languages, take care when serializing and deserializing Thrift structures. You never know what you’re gonna get (unless you test it).
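One cheap way to take that care is to check struct-valued fields yourself before handing a message to the serializer. The sketch below is a minimal illustration, not the Thrift API: the Point, Review and Place classes are plain-Python stand-ins for the generated ones (so the example runs without Thrift installed), and EXPECTED_STRUCT_FIELDS and check_struct_fields are hypothetical names for a hand-maintained table and the function that walks it.

```python
class Point(object):
    """Stand-in for the generated Point class (hypothetical)."""
    def __init__(self, x=None, y=None):
        self.x = x
        self.y = y


class Review(object):
    """Stand-in for the generated Review class (hypothetical)."""
    def __init__(self, rating=None, text=None):
        self.rating = rating
        self.text = text


class Place(object):
    """Stand-in for the generated Place class (hypothetical)."""
    def __init__(self, name=None, location=None, review=None):
        self.name = name
        self.location = location
        self.review = review


# Hand-maintained table: struct class -> {field name: (expected class, required)}.
EXPECTED_STRUCT_FIELDS = {
    Place: {'location': (Point, True), 'review': (Review, False)},
}


def check_struct_fields(msg):
    """Raise TypeError if a struct-valued field is missing or holds the wrong class."""
    expected_fields = EXPECTED_STRUCT_FIELDS.get(type(msg), {})
    for name, (expected, required) in expected_fields.items():
        value = getattr(msg, name)
        if value is None:
            if required:
                raise TypeError('%s.%s is required but unset'
                                % (type(msg).__name__, name))
            continue
        if not isinstance(value, expected):
            raise TypeError('%s.%s should be a %s, got a %s'
                            % (type(msg).__name__, name,
                               expected.__name__, type(value).__name__))
        check_struct_fields(value)  # recurse into the nested struct


good = Place(name="avi's place", location=Point(x=1.2, y=5.0))
check_struct_fields(good)  # passes silently

bad = Place(name="avi's place", location=Review(rating=4, text="great place"))
try:
    check_struct_fields(bad)
except TypeError as exc:
    print('rejected: %s' % exc)
```

Calling a check like this right before serializing turns the silent mix-up from In [17] into an immediate error on the sending side. The table has to be kept in sync with the .thrift file by hand, which is the obvious weakness of the approach.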
