Towards a django.contrib.site_search (Part 1)

June 3, 2009

Somewhat ambitious, I know, but I’ve become convinced that site search is the next frontier for Django. Let’s face it: users like search. Everyone knows how to use it. It’s usually the first thing people with the goal of information-finding turn to. Before site maps, before content hierarchies, before tags...people turn to search. That’s a good thing, if you have a search for them to use and it works well.

Over the next two weeks, I hope to outline an approach to building a site search application that will work well, scale well, be as easy as possible for application builders to implement, and accord with the design goals of Django. And if I’m lucky, perhaps some form of this will make it into django.contrib. There’s certainly no shortage of requests for such a feature.

Before we begin, a warning: although I have the architectural approach pretty well laid out in my head, I’m still writing the code. These posts will not lack for code snippets (well, this one will), but I probably won’t be ready to provide the full version of the app until the final post. So, caveat lector—these posts demonstrate a work in progress. That said, if I later reverse course on something I outline in a post, I promise to note that in the next one. 

For a long time, I’ve felt like Django needed a way to provide a full-text search of models. In fact, this site’s own search is written in Django. I’ll be the first to tell you, it’s not that great. (In my defense, I wrote it in an evening and have never really gotten back to it.) So how do you do search for a Django application? The easiest way is to let Google or Yahoo do it for you with their customized site search widgets. But for many sites, this isn’t a good option: you have no control over the results, and Google or Yahoo branding and styling are slapped everywhere. Alternatively, there are open-source projects like Lucene or Xapian: search engines that run outside your Django project but that you can call from Python.

Doubtless, Lucene is a great piece of software. If you need enterprise-quality search and have the monetary and hardware resources to develop for and run it, Lucene and ancillary products like Solr are a good way to go. It even comes ready to run with Django. The big problem for me is that it requires Java. What’s more, it requires that you be able to run Tomcat, Jetty or some other Java server, and those plus Lucene can be resource intensive. Xapian has similar issues: it has to be compiled for the box you’re running on and maintains its own indices. I’m not knocking either of these products, but Java won’t ever become part of the requirements to use Django. Nor will compiling C++ source code. One of the implicit design decisions in Django is that it’s mostly self-reliant. It could also be argued that adding such a requirement would undermine the explicit Django principles of loose coupling and quick development. In any case, a django.contrib module that requires Java is a non-starter.

One could try writing a search engine in Python (in fact there is one, Whoosh), but writing a good search engine is hard, and there are good reasons to suspect that performance issues mean Python may not be the best language to do it in. That really only leaves us a single viable option: the RDBMS itself. It might sound odd, but most modern RDBMSs support full-text indexing, searching, and result ranking or scoring. This includes MySQL, Postgres, Oracle and MS SQL. They all do it differently, of course (we’re well beyond the limits of ANSI SQL here), but Django already deals with the different ways RDBMSs do things: transactions, for example.

Given that it’s the only option left to us, how do we build a full-text search module for Django that uses the DB? Well, if you’re using SQLite, you don’t. We may as well be up front about the fact that if you’re using SQLite as a production database, you’re not going to be able to do some things. Chances are, you already knew that, though. But for everyone else, site search is possible.

We’ll be looking into RDBMS full-text search capabilities in more detail later, but suffice it to say that the tsvector type and text search functions in Postgres, MySQL’s MATCH...AGAINST syntax, Oracle’s Oracle Text and MS SQL’s full-text search service all provide us with the features and speed we need.
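To make the "they all do it differently" point concrete, here is a rough sketch of how a search app might pick a backend-specific WHERE fragment. It is not code from the app itself: the `FULLTEXT_TEMPLATES` table and the `fulltext_where` helper are hypothetical names made up for illustration, the `'english'` text search configuration is an assumption, and only Postgres and MySQL are covered. The SQL fragments themselves use each database's documented full-text syntax.

```python
# Hypothetical lookup table: backend name -> full-text WHERE fragment.
# "%s" is the DB-API style placeholder for the user's search terms.
FULLTEXT_TEMPLATES = {
    # Postgres: build a tsvector from the column, match it against a tsquery.
    "postgresql": "to_tsvector('english', {col}) @@ plainto_tsquery('english', %s)",
    # MySQL: requires a FULLTEXT index on the column.
    "mysql": "MATCH ({col}) AGAINST (%s IN NATURAL LANGUAGE MODE)",
}


def fulltext_where(vendor, column, terms):
    """Return (sql_fragment, params) for a full-text match on one column.

    `column` is interpolated into the SQL, so it must be a trusted
    identifier; only the search terms are passed as a bound parameter.
    """
    if not column.isidentifier():
        raise ValueError("unsafe column name: %r" % column)
    try:
        template = FULLTEXT_TEMPLATES[vendor]
    except KeyError:
        # Backends without a known template are simply unsupported here.
        raise NotImplementedError("no full-text support for %r" % vendor)
    return template.format(col=column), [terms]


sql, params = fulltext_where("postgresql", "body", "django search")
```

The returned pair could be handed to a cursor's `execute()` as part of a larger query. Oracle and MS SQL would need their own entries in the table.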

I’ll finish up by putting some high-level requirements out there:

  1. Enabling search should be as simple as possible. We’re operating under the “explicit is better than implicit” dictum, so some configuration might be needed, but you shouldn't have to significantly rewire your Django apps to get search working.
  2. Similarly, search should be loosely coupled. For example, if we’re going to need some sort of index DB field (hint, hint: we will), it probably shouldn’t require you to change the table schema of your models by adding new fields. This also means that the interface between searching and the templating of a search result should be loosely coupled. The search should give you something like a query set to work with in the template, and search result items in this set should be uniform, even if they come from different models.
  3. ...which brings us to this: you should be able to search across models and apps. The models in any installed app should be fair game for indexing and searching.
  4. The search index must update in real time. If you publish a blog post, it should be indexed immediately. So we won’t be using a cron job which periodically scrapes your database to build an index and we won’t be using a crawler.
  5. Similarly, the responsibility for keeping our index updated lies with the search app, not the DB. This seemed somewhat heretical to me at first: why not use triggers to keep your DB index up-to-date? But I decided against it because it seems to violate a couple of Django’s principles. The loose coupling that an ORM gives us is lost, and index updating becomes implicit rather than explicit. Performance testing might force me to be more pragmatic, but for now, triggers and the like are out.
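The requirements above can be sketched in plain Python, without Django, to show how they fit together. Everything here is hypothetical and is not the app's design: `SearchIndex`, `SearchResult`, `BlogPost` and `Page` are invented names, the index lives in a dict rather than a DB table, and a naive substring match stands in for real full-text search. The sketch shows explicit registration (1), no changes to the indexed classes (2), uniform results across models (3), and index updates made by the app at save time (4, 5).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """Uniform result item, whatever model it came from."""
    model: str
    pk: int
    title: str
    text: str


class SearchIndex:
    def __init__(self):
        self._extractors = {}  # model class -> (title function, text function)
        self._entries = {}     # (model name, pk) -> SearchResult

    def register(self, model, title, text):
        """Explicitly opt a model in; the model itself is left untouched."""
        self._extractors[model] = (title, text)

    def update(self, instance):
        """Index an instance right away; meant to be called on every save."""
        extractors = self._extractors.get(type(instance))
        if extractors is None:
            return  # unregistered models are ignored
        title_fn, text_fn = extractors
        name = type(instance).__name__
        self._entries[(name, instance.pk)] = SearchResult(
            name, instance.pk, title_fn(instance), text_fn(instance)
        )

    def remove(self, instance):
        """Drop an instance from the index; meant to be called on delete."""
        self._entries.pop((type(instance).__name__, instance.pk), None)

    def search(self, term):
        # Naive case-insensitive substring match, standing in for DB search.
        needle = term.lower()
        return [
            entry for entry in self._entries.values()
            if needle in entry.title.lower() or needle in entry.text.lower()
        ]


# Two unrelated example "models" with different field names.
@dataclass
class BlogPost:
    pk: int
    headline: str
    body: str


@dataclass
class Page:
    pk: int
    name: str
    content: str
```

In a Django app, `update` and `remove` would most likely be wired to the `post_save` and `post_delete` signals, so nothing depends on database triggers. That wiring is an assumption on top of this sketch.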

So that’s it for the high-level requirements. Tomorrow, I’m going to go into the steps I see happening in the search app when a user actually does a search. 
