Towards a django.contrib.site_search (Part 2)

June 17, 2009

Search engine indexing is the foundation of search. Although some RDBMSs (notably Postgres) allow you to build ad-hoc full-text search indices from an arbitrary number of table columns at query time, the best approach, both for universality across DB back-ends and for performance, is to build an index of relevant content that exists alongside that content and is updated with it; in other words, a cache-based search engine. Given that we’re using the capabilities of the supported Django RDBMSs to query these indices, it makes sense to store them in a database table. But where, exactly?

How Content Will Be Indexed

The naive approach would be to store the index of a model in the model itself as an additional field. Though the conceptual simplicity of such an approach is attractive, it has three major failings. Firstly, it’s invasive, changing the table schema of applications and violating Django’s principles of loose coupling. Secondly, it’s inefficient if we wish to search across models: we’d have to perform at least one SQL query for each model. Finally, it leaves us with a thorny dilemma: what should we do with the multiple query sets of search results (one for each model) once we have them? A search result may include different content types (and thus, models), but most users are probably more interested in the overall relevance of a result than in its relevance within a category of content.

Put another way: imagine you search a news site for “mortgage crisis” and back come four distinct columns of data: matching blog entries, news articles, editorials, and video content. There is no way to determine which item is the most relevant of all results, because we’ve indexed and searched each item relative only to like items. It may be that the video at the top of the videos column is the most relevant video, but is far less relevant overall than the most relevant news article. However, because we’ve indexed them separately, we have no way to merge our separate search results into a single column ordered by overall relevance.

The answer to this dilemma is to have a single index of all searchable content that can be queried once and which ranks the relevance of all content from all indexed models as a whole. Additionally, we need some way of connecting a record in the index back to the model, and the record within that model, from which it came. We will want this to get more information about each search result (information that, perhaps, we haven’t indexed) and to allow users to filter content by type if they so desire.
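To make the idea concrete, here is a minimal plain-Python sketch of a single index that is ranked in one pass. It is not Django code and not the indexing scheme this series will end up with: the `IndexEntry` type, the type labels, and the word-counting `score` function are all assumptions made purely for illustration.

```python
from collections import Counter
from typing import NamedTuple


class IndexEntry(NamedTuple):
    content_type: str  # hypothetical label such as "article" or "video"
    object_id: int     # primary key of the source record
    text: str          # the indexed text for that record


def score(entry, query):
    """Toy relevance: count occurrences of the query terms in the text."""
    words = Counter(entry.text.lower().split())
    return sum(words[term] for term in query.lower().split())


def search(index, query, content_type=None):
    """Rank every matching entry together, optionally filtering by type."""
    hits = [
        (score(entry, query), entry)
        for entry in index
        if content_type is None or entry.content_type == content_type
    ]
    hits = [(s, e) for s, e in hits if s > 0]
    # One ordering across all content types, most relevant first.
    return [e for s, e in sorted(hits, key=lambda hit: hit[0], reverse=True)]


index = [
    IndexEntry("video", 7, "mortgage rates explained"),
    IndexEntry("article", 3, "mortgage crisis deepens as crisis spreads to mortgage lenders"),
    IndexEntry("blog", 12, "my garden in spring"),
]

results = search(index, "mortgage crisis")
videos_only = search(index, "mortgage", content_type="video")
```

Because every entry is scored against the same query in the same list, the article and the video end up in one ordering, and filtering by type is just an extra condition rather than a separate search.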

At this point we begin to get an idea of what our index DB table, and thus our Django model, will look like. We’ll want a column to hold the index, first of all (though the type of column will vary from DB to DB; more on that later). Then we’ll need a way to tie the index back to the model and record from which it was derived. If you’ve spent any time with Django, the contenttypes framework in django.contrib will come immediately to mind as a good way to do this:

Django includes a contenttypes application that can track all of the models installed in your Django-powered project, providing a high-level, generic interface for working with your models.

Imagine we have a Django app called site_search. This is what might appear in our model of a search index that uses contenttypes:


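from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models

class Search(models.Model):
    # Placeholder: the real field type depends on the DB back-end.
    index = DB_Backend_Dependent_IndexField()
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey()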

Each record in our Search model will point to a row in one of the models we’re indexing. Using this, we can search against the index, and then look up the model, and the record in that model, for each relevant result.
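The lookup step can be pictured without Django at all. The sketch below is a plain-Python stand-in, not the contenttypes API: the `ARTICLES` and `VIDEOS` dictionaries play the role of model tables, `CONTENT_TYPES` plays the role of the ContentType table, and `resolve` is a hypothetical helper.

```python
# Stand-ins for two model tables, keyed by primary key (hypothetical data).
ARTICLES = {3: {"title": "Mortgage crisis deepens"}}
VIDEOS = {7: {"title": "Mortgage rates explained"}}

# Plays the role of the ContentType table: a type label -> where its rows live.
CONTENT_TYPES = {"article": ARTICLES, "video": VIDEOS}


def resolve(content_type, object_id):
    """Follow a (content_type, object_id) pair back to the source record."""
    return CONTENT_TYPES[content_type][object_id]


# Rows of the search index hold only the pair, not the record itself.
search_rows = [("article", 3), ("video", 7)]
titles = [resolve(ct, pk)["title"] for ct, pk in search_rows]
```

In the Django model above, the same hop is what reading content_object on a Search row does for us.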

What We Need to Know About a Model to Index It

There’s another question we have to answer: how does our search application know which models to index, and which fields and records within each model to index? My feeling is that the models themselves should tell the search app if and how they want to be indexed. This gets a bit tricky, because we want to avoid tightly coupling the model metadata our search app needs to the models themselves. My solution is to add that data to our models via a mixin. So how do we do this? This is not the first time I’ve found myself wishing for the robust component registration system of Zope 3: it’s perfect for these aspect-oriented types of development, where you’re attempting to do something which cuts across a range of concerns without being disruptive. Unlike, say, a blog application, which we could think of as vertical, a search application runs horizontally across a range of other apps, enhancing their functionality. The easiest way of doing this is to literally add the mixin to the classes a model inherits from. Multiple inheritance is usually something you want to avoid in Python (or any language), but we’re on safer ground here because the methods and properties which we mix in to the model of an app for the benefit of our search should not already exist on the model.


class ArticleSearch:
    is_searchable = True

class Article(models.Model, ArticleSearch):
    slug = models.SlugField()
    title = models.CharField(max_length=200)
    pub_date = models.DateTimeField()
    body = models.TextField()
    status = models.CharField(max_length=1, choices=(
        ('p', "Published"),
        ('i', "Unpublished"),
    ))

This solution is OK, but to my mind it’s still too invasive. Furthermore, there is no programmatic way to do this: the mixin has to be written into the model’s class definition. Let’s look at how Django already deals with a similar issue: indicating that a model should be editable via the admin interface.


class ArticleAdmin(admin.ModelAdmin):
    list_display = ('title', 'pub_date')
    list_display_links = ('title',)
    prepopulated_fields = {"slug": ("title",)}
    
admin.site.register(Article, ArticleAdmin)

It would be nice if search could work the same way:

class ArticleSearch:
    is_searchable = True

site_search.register(Article, ArticleSearch)

Well, it can actually:

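# Models that have been registered for indexing.
_ADAPTED_MODEL_CACHE = []

def register(model, mixin=None):
    # Copy the mixin's non-dunder attributes onto the model class.
    if mixin is not None:
        for key, value in mixin.__dict__.items():
            if not key.startswith("__"):
                model.add_to_class(key, value)
    _ADAPTED_MODEL_CACHE.append(model)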

This is admittedly primitive, but it gets the job done, copying the attributes of our mixin class onto our model class. You’ll notice that we exclude those attributes that start with “__”, because otherwise we could overwrite attributes like '__doc__' and '__module__', which we want to leave untouched on our model. You’ll also notice we’re using a method from django.db.models.base.ModelBase called add_to_class, which is useful but not well known.

Finally, we will store the references to the models which have been registered to be indexed in a cache we can access down the road called _ADAPTED_MODEL_CACHE. More on why we would want to do this in a later post. This method isn’t multiple inheritance, but the result is the same.


>>> class ArticleSearch:
...     is_searchable = True
...
>>> class Article(models.Model):
...     slug = models.SlugField()
...     title = models.CharField(max_length=200)
...     pub_date = models.DateTimeField()
...     body = models.TextField()
...     status = models.CharField(max_length=1, choices=(
...         ('p', "Published"),
...         ('i', "Unpublished"),
...     ))
...
>>> site_search.register(Article, ArticleSearch)
>>> Article.is_searchable
True

Unfortunately, this doesn’t really cover what we need to know about a model to index it, so let’s get back to that. Firstly, we likely don’t want to index every field in our model. A field like “status” in the previous Article model is not relevant, as it’s just a single-character code. So, we should describe which fields should be indexed:

class ArticleSearch:
    is_searchable = True
    fields_to_index = ('title', 'pub_date', 'body')

And, just as we will wish to exclude certain columns in our table from being indexed, we’ll wish to exclude certain rows from being indexed, such as rows which correspond to unpublished content:

class ArticleSearch:
    fields_to_index = ('title', 'pub_date', 'body')

    # A per-row check replaces the class-level is_searchable flag,
    # which a method of the same name would shadow anyway.
    def is_searchable(self):
        return self.status == 'p'

At this point, the ArticleSearch class tells us enough about the Article class that we can successfully index it. Next, I’d like to cover the actual mechanics of how we’ll index our models and some additional metadata we’ll be storing in our index.
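As a rough preview of how an indexer might consume this metadata, consider the sketch below. It is only a guess at the shape of things, not the mechanics the next post will describe: `build_index_text` is a hypothetical helper, and `Article` here is a plain Python class that inherits from the mixin for brevity instead of being registered.

```python
class ArticleSearch:
    fields_to_index = ('title', 'pub_date', 'body')

    def is_searchable(self):
        return self.status == 'p'


class Article(ArticleSearch):
    """Plain stand-in for the Django model, for illustration only."""

    def __init__(self, title, pub_date, body, status):
        self.title = title
        self.pub_date = pub_date
        self.body = body
        self.status = status


def build_index_text(instance):
    """Return the text to index for one record, or None to skip it."""
    if not instance.is_searchable():
        return None
    values = (getattr(instance, name) for name in instance.fields_to_index)
    return " ".join(str(value) for value in values)


published = Article("Rates rise", "2009-06-17", "Body text", "p")
draft = Article("Not yet", "2009-06-18", "Draft text", "i")
```

Here build_index_text(published) joins the three listed fields into one string, while build_index_text(draft) returns None, so unpublished rows never reach the index.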
