Brendel Consulting

Jun 26, 2009

Pitfalls of lazy evaluation in Django's ORM

It's well documented that query sets in Django's ORM are lazily evaluated. The database is not accessed until you actually reference query set results. For example:

    my_set = MyModel.objects.all() # No database access
    print my_set[0]                # DB accessed and model instance created

The thing one needs to be aware of here is that any access to the result array will always cause a new database access and object creation for model instances:

    my_set = MyModel.objects.all() # No database access
    print my_set[0]                # DB access and model instantiation...
    print my_set[0]                # ... and the same again!

It is not uncommon to retrieve a set of results via a query set and then perform some operations on them, for example in a loop. It is tempting to view query sets just as if they were arrays - lists of in-memory objects - and use them in a similar manner. In-memory arrays tend to be quite efficient, query sets not so much: If you have to run over such a loop several times, you will end up retrieving the same object from the database multiple times as well.

If you need to iterate over the result set more than once - or just access the same element a few times - it is therefore more efficient to make copies of your individual results. That way, the query and the creation of model instance objects is done only once:

    my_set = [ m for m in MyModel.objects.all() ] # DB accessed for all
    print my_set[0]                               # No DB access!

In effect, the list you created in the last code snippet becomes a cache for your query results. You could also have just assigned a single result-set element to a variable for the same effect.

Fortunately, when passing a query set to a template it appears as if this caching is taken care of automatically for us. Making copies of query set results therefore is important when more complex operations are necessary on result sets within the view function.

You should follow me on twitter here.

Labels: database django orm python caching lazy evaluation queryset

Now on Twitter as well

So, I thought I try my hand in this thing that everyone has been talking about and got myself a Twitter account. You can now follow me there, if you are so inclined.

I'm still undecided on whether this is going to be enlightening or merely another efficient way of wasting time, but we will see.

Jun 14, 2009

Read-optimize your source code

When you design a software system, database or data structures, you take into consideration its most common use case. For example, you organize your data differently when it is read frequently, but only rarely written to. You read optimize your data.

I am convinced that the same applies to source code. In almost all cases, source code will be read much more often than it is written. Who writes the source code? You. A function is developed over some limited time, and once it is done it changes rarely, unless being refactored or modified to accommodate some changed requirements.

But who has to read your source code?

Well, for starters, your colleagues who have to integrate with or use your code.
New hires who join your team and try to find their way around.
The maintenance (or continued engineering) programmers.
Those who come after you, or inherit your code as part of their responsibility
You.

Your code is read around the time it is written and integrated and possibly for many, many years after you have long forgotten about your code.

We can see that over the life-time of some piece of code, it is very likely to be read much more often then written or modified. Consequently, source code needs to be read optimized just like we might read optimize a data structure or database if called for.

What does it take to read optimize source code? Here are the key points:

Meaningful variable and function names.

The compiler doesn't care whether the function name is 3 characters or 30 characters long. However, a colleague reading the code will be very grateful if the function is called get_daily_rainfall_average(), rather than dra(). Who cares that it takes two seconds more to type it? Many modern IDEs are going to do that for you anyway.

Thoughtful source code documentation.

Explaining in one sentence what some code does and in possibly many more sentences why it does it and why it's needed. The why is often much more important then the how: Explain the rational behind your design or implementation decisions. Everyone can see that a variable is increased in a for-loop, that doesn't have to be documented. But why the for-loop is needed in the first place is much more interesting and illuminating to someone who is new to the code.
Comments like this should be there for each module, class, function and the more complex code blocks within a function. Keep the comments close to the code they refer to. If they are all in the function or module header, they are often forgotten when the internals of code are changed, 15 pages further down.

A legible and consistent coding style.

We can endlessly argue about which style of parenthesis is the right one, but what's more important is that you remain consistent throughout your project.
If you are new to a team, use the coding style they use, even if it's not your favourite one.
Unless absolutely needed for performance reasons, don't try to optimize your doubly-nested loop into a fancy single line statement, exploiting even the most esoteric features of the language. Instead, break it up into a more easily understandable set of spelled out loops.
In general, don't optimize your code until you know you have to! Not only does it waste time if it turns out that the code really isn't performance critical, it is often also much more difficult to read and maintain.
If you can do what you have to do with simple and often used language features then use those.
Use white-space to make code less 'dense' and increase legibility. Many developers have pretty large screens these days, white space doesn't cost money.
Align code. For example, if you have several variable assignments, align the '=' operators directly underneath each other. It's astonishing how much more legible that block of code becomes. If you have to line-break a long function call, indent the arguments of the second line to align with the start of the arguments in the first line. That's just a little example, but there are many cases like this in most programs, where a bit of thoughtful alignment can make a difference.

It is often astonishing what amount of resistance some developers put up against even those few, simple rules. I have heard arguments, such as:

"I don't want to write comments, because then I will have to scroll more to see the code."
"Writing comments in the code takes time."
Formatting the code nicely takes time, especially when I need to change a few things, in which case I then have to re-format the code.
"Code changes, and before you know it the comments are obsolete."

I have absolutely no patience for the first three arguments. There might be some extremely rare situations where an emergency fix needs to be rushed out and short-cuts need to be taken. But for the most part, those arguments are bogus. The only one even remotely credible is the last one about comments getting obsolete, but this can be addressed in a straight forward manner: Keep the comments close to the code, and do code reviews where readability and correctness of the comments are stressed as well.

In fact, I claim that if you don't take those rules to heart in your own source code then you are either unprofessional, lazy, not a team-player, or all of the above. If you as a software developer take pride in your professionalism and quality of your work then you have to consider that it is not only the achieved functionality for which you are being paid: The code you produce in almost all cases becomes property of your employer. Therefore, the code itself also becomes a product you deliver, and actually your most important product.

How useful and usable your code is for the team who has to work with it is what really determines its value in the long run.

You should follow me on twitter here.

Labels: coding style, documentation, programmers, read optimization, source code

Jun 12, 2009

A Django ORM bug when using PostgreSQL

Normally, an ORM such as the one used in Django is expected to 'do the right thing' with the database. I should be able to set up my models, and use the ORM without having to give any further thought to what it does behind the scenes. Switching between different databases should be as simple as changing two or three lines in my settings.py file: Use PostgreSQL instead of sqlite, for example.

At least in theory. It can be quite jarring to come across situations where this is not the case. And that is exactly what happened to me today.

Imagine a simple Django model:

    >>> class Foo(models.Model):
    >>>     date = models.DateField()

Now assign a somewhat unusual date to a model instance:

    >>> f = Foo()
    >>> f.date = datetime.date(1,2,3)
    >>> f.date.year
    1
    >>> f.date.month
    2
    >>> f.date.day
    3

You can see that we are talking about a date a long time ago. Nevertheless, a valid date as we can confirm by querying the individual elements of the date. Now save the instance:

    >>> f.save()

If you are using sqlite and do a select * from foo; in your database shell, you will see the following as the output:

       1-02-03

If you load a Django object from that table, you will find everything in order and as expected.

However, now try the same with PostgreSQL as database. After doing the very same thing in Python, take a look at what's written in the database table:

    2003-01-02

Oops!

Now that shouldn't happen!

The interesting thing is that as soon as you specify a year that has at least 3 digits, PostgreSQL stores the correct date. Do f.date = datetime.date(101,2,3) and after you save the model in the database it will have stored what we expect to see:

    0101-02-03

Also interesting, if you specify a year with two digits and a value more than 12, you get an exception. For example f.date = datetime.date(99,2,3):

    Traceback (most recent call last):
       .....
    DataError: date/time field value out of range: "  99-02-03"
    HINT:  Perhaps you need a different "datestyle" setting.

I checked the DATESTYLE setting for my database, and indeed, it was ISO, which defines a date as YYYY-MM-DD. In theory, therefore, that should work. But obviously, something is not quite lining up.

To summarize, while it all works fine with sqlite, in PostgreSQL I get:

A wrong date stored for years between 1 and 12
An exception for years between 13 and 99
Correct behaviour for years 100 or up

So, in case you need to work with really small dates in your Django app, you might want to keep this in mind.

You should follow me on twitter here.

Labels: python django orm postgres postgresql sqlite bug

Jun 1, 2009

How to do "count" with "group by" in Django

Faced with the 'age old' problem of having to do a group by query in Django, I finally came across a solution.In short, you use the lower-level query interface directly:

    q = MyModel.objects.all()
    q.query.group_by = ['field_name']

When you then iterate over 'q', you get the results grouped by the value of whatever was in 'field_name'. Great! So far, so good.

Well, a word of caution at this point: Using the low-level query API is not really something Django wants you to do, apparently. In fact, try this with sqlite and it works. Try it with PostgreSQL and it does not (all sorts of error messages). So, your mileage may vary, depending on which database you are using. Ok, let's assume that your database is fine with this...

The only challenge for me now was that I had to do a count for this.
If you form your query using the usual methods of Django's ORM, you will be disappointed. For example, this here will not work:

    q = MyModel.objects.all()
    q.query.group_by = ['field_name']
    q.count()

It appears as if this returns only the count of the first group, not a list of counts, as you would expect.

The solution is to wade a bit deeper into the low-level query API. We can instruct the query to add a count-column. In fact, this results in merely a single column being returned, just like COUNT(*). It goes like this:

    q = MyModel.objects.all()
    q.query.group_by = ['field_name']
    q.query.add_count_column()

Since this returns counts, rather than complete objects, we now need to get the individual group counts as a list of values. We add one more line:

    q.values_list('id')

The values_list() function gives you not instantiated objects but instead the tuples of values for each object. The tuples contain only the fields you specify by name in the call to values_list(). Except, we have manually added the count column with the add_count_column() function. That count column is always returned first. So, what you get as result is something like this:

    [ (3,1), (19,8), ... ]

The first value of each tuple is the count, the second value is the value of the 'id' field of the last occurrence of the grouped model in the table. If you specify something other than 'id', you get the same thing: First the count and then the value of that other field.

But that's not what we want, right? We want a list of counts. We could manually extract the first element in each tuple, but Django offers us a shortcut:

    q.values_list('id', flat=True)

Setting flat=True tells the values_list() function to just return the first element of each tuple in a plain list (not a list of tuples). And since the first element now is the count column, we finally get what we want: A list of the counts for each group.

Note that we could have specified any other fields in the call to values_list(), not just 'id'. Because we specified flat those fields are ignored. It seems, though, that at least one field needs to be specified here, even though it won't be considered as part of the output.

You should follow me on twitter here.

Labels: count, django, group by, python