Towards a django.contrib.site_search (Part 3)

October 28, 2009

OK, so my site_search module is finally done. Or at least it's reached a state where I feel comfortable releasing it. It works, but it's still very alpha in quality and I welcome feedback on it. In this final installment of the series I'll focus on the mechanics of wiring all the separate pieces together in Django. Part of the reason for the delay between this post and the last one was some major refactoring that I finally got around to, the purpose of which was to bring my search app more in line with packages in django.contrib, like auth or admin. I would really like search to be a drop-in component in the way those other components are, where all that is required of the user is some initial configuration and templating.

A quick summary of what I've covered is probably in order. Firstly, I decided that search would be best accomplished via the full-text searching functionality of the RDBMSs that Django supports (sqlite3 excepted). Then I outlined how Django's ContentTypes framework could be used to index content from an unlimited number of models and store it all in the same search index. Next I described what information would need to be known about a record (a DB table row) in order to index it:

  • Whether the parent table should be indexed
  • Whether that particular row should be indexed (ignoring unpublished content, for example)
  • What fields in the row should be indexed

Finally, I showed how we could provide that functionality by modifying the models we wish to index: first via a mixin to the model class and then, programmatically, by modifying the model at run time and copying fields from our mixin to our model.

My search app actually stores a little more information about each record than I outlined above. Here's a sample class which will be registered to an imaginary blog entry model:


class EntrySearchMixin(SearchableMixin):
    """A mix-in for a blog entry"""

    # Model fields whose text is added to the search index.
    fields_to_index = ('headline', 'summary', 'body')

    def is_searchable(self):
        # Only published entries are indexed.
        return self.status in PUBLISHED_ENTRY_STATES

    def get_search_result_title(self):
        return "%s: %s" % (self.blog, self.headline)

    def get_search_result_description(self):
        return self.summary

    def get_search_result_date(self):
        return self.pub_date

site_search.register(Entry, EntrySearchMixin)

The fields_to_index property lists the fields in the model that should be indexed. The is_searchable method determines whether a particular blog entry should be indexed (in this case, only a published entry). The remaining methods control how the search result will display on the page. Every search result should have a title which identifies it. It can also have an optional description. Finally, we can (optionally) associate a date with each item in the index. This gives users a way to sort search results by date as well as by relevance (which is calculated by the RDBMS).
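To make the role of each hook concrete, here is a rough sketch of how an indexer might turn one object into an index entry. The build_index_entry function, the dictionary layout and the FakeEntry stand-in are assumptions for illustration; the real site_search code may organise this differently.

```python
import datetime


def build_index_entry(obj):
    """Return a dict describing obj for the index, or None to skip it."""
    if not obj.is_searchable():
        return None
    # Join the text of every configured field into one searchable blob.
    text = " ".join(str(getattr(obj, name, "")) for name in obj.fields_to_index)
    return {
        "text": text,
        "title": obj.get_search_result_title(),
        "description": obj.get_search_result_description(),
        "date": obj.get_search_result_date(),
    }


class FakeEntry(object):
    """Stand-in for a model that has had the mixin applied."""
    fields_to_index = ("headline", "summary", "body")

    def __init__(self, published=True):
        self.blog = "Dev blog"
        self.headline = "Search"
        self.summary = "Indexing models"
        self.body = "Full text goes here."
        self.pub_date = datetime.date(2009, 10, 28)
        self.published = published

    def is_searchable(self):
        return self.published

    def get_search_result_title(self):
        return "%s: %s" % (self.blog, self.headline)

    def get_search_result_description(self):
        return self.summary

    def get_search_result_date(self):
        return self.pub_date


entry = build_index_entry(FakeEntry())
```

An unpublished entry yields None, so the caller can simply skip it.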

The concluding line of the sample shows how we can register this configuration class to our blog entry model. If it looks familiar, it should; it's similar to how you configure models for use in the Django admin interface:

admin.site.register(Entry, EntryAdmin)

I'm not going to go into great detail about how site_search.register works, since I outlined the basic mechanism in the last post. Suffice it to say that setting up search is as simple as adding a file called site_search.py to every app with models you want to index, creating classes that inherit from SearchableMixin to guide the indexing mechanism, and registering them. Again, the process is very similar to how the admin interface is enabled.
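For the curious, here is one way the per-app site_search.py files could be picked up, loosely modelled on the idea behind the admin's autodiscovery. This is a sketch using only the standard library; the autodiscover function and its arguments are my own illustrative names, and in a real project the list of apps would presumably come from INSTALLED_APPS.

```python
import importlib
import importlib.util


def autodiscover(app_names, module_name="site_search"):
    """Import <app>.<module_name> for every app that ships one.

    Returns the list of apps whose module was found and imported.
    """
    loaded = []
    for app in app_names:
        dotted = "%s.%s" % (app, module_name)
        try:
            spec = importlib.util.find_spec(dotted)
        except ImportError:
            # The app package itself is missing or is not a package.
            spec = None
        if spec is None:
            continue
        # Importing the module runs its register() calls as a side effect.
        importlib.import_module(dotted)
        loaded.append(app)
    return loaded
```

Checking for the module before importing it means that a genuine error inside an app's site_search.py still surfaces, rather than being mistaken for "this app has no search configuration".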

Maintaining The Index

Once we've registered a model, how do we make sure that our index is maintained alongside it? Django's signals framework is the perfect way to do this. (Remember, I specifically decided not to use RDBMS-level features like triggers because they remove control from the Django app, where it belongs, and violate the principle of loose coupling.) Every model emits a post_save signal when a row is created or updated and a post_delete signal when a row is deleted. We can connect listeners to those signals for the models we are indexing: one listener updates the associated row in the index and the other removes it. The site_search application also includes functionality to re-index an entire Django project (or, if you're adding search to an application that already has data, to index that data for the first time).
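Here is a minimal sketch of what such listeners might look like. An in-memory dict stands in for the index table, and the function names are illustrative assumptions rather than the actual site_search code. Dropping a row from the index when it is saved in a non-searchable state is also my assumption about sensible behaviour. The connect calls use Django's real post_save and post_delete signals.

```python
# In-memory dict standing in for the search index table (assumption).
index = {}


def index_key(instance):
    """Identify a row by its model name and primary key."""
    return (type(instance).__name__, instance.pk)


def update_index(sender, instance, **kwargs):
    """post_save listener: (re)index the row, or drop it if not searchable."""
    if instance.is_searchable():
        index[index_key(instance)] = " ".join(
            str(getattr(instance, name)) for name in instance.fields_to_index
        )
    else:
        index.pop(index_key(instance), None)


def remove_from_index(sender, instance, **kwargs):
    """post_delete listener: remove the row from the index."""
    index.pop(index_key(instance), None)


def connect_listeners(model):
    """Attach both listeners to one registered model."""
    # Imported here so the handlers above can be exercised without Django.
    from django.db.models.signals import post_delete, post_save

    post_save.connect(update_index, sender=model)
    post_delete.connect(remove_from_index, sender=model)
```

Passing sender=model means the listeners only fire for models that have actually been registered for search, not for every save in the project.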

Querying The Index

I'll wrap this up by showing how the index is queried. Here is a function which queries our index on PostgreSQL databases:


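from django.contrib.contenttypes.models import ContentType

def query(search_model, query, sort="relevance", app_label=None, model=None):
    # SORT_MAPPINGS maps a sort name to an order_by() argument.
    assert sort in SORT_MAPPINGS
    # The search terms are passed as query parameters, so the database
    # driver escapes them; quote_name() is only meant for identifiers.
    where = ["index @@ plainto_tsquery(%s)"]
    params = [query]
    if app_label and model:
        # Optionally restrict the search to a single model.
        content_type_id = ContentType.objects.filter(app_label=app_label, model=model)[0].id
        where.insert(0, "content_type_id = %s")
        params.insert(0, content_type_id)
    result = search_model.objects.extra(
        select={'relevance': "ts_rank(index, plainto_tsquery(%s))"},
        select_params=(query,),
        where=where,
        params=params
    ).order_by(SORT_MAPPINGS[sort])
    return result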

There's a lot going on here. You can see some of the PostgreSQL functions (e.g. plainto_tsquery) which perform a full-text search of an index. You might also notice that, despite a fairly high level of complexity in the query, we never have to write the whole statement as raw SQL. We can use the powerful "extra" method of Django's ORM to add just the fragments we need instead. By way of contrast, here's the same function for the MySQL-based backend:


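from django.contrib.contenttypes.models import ContentType

def query(search_model, query, sort="relevance", app_label=None, model=None):
    # SORT_MAPPINGS maps a sort name to an order_by() argument.
    assert sort in SORT_MAPPINGS
    # MySQL requires parentheses around both the column list and the search
    # expression; "index" is a reserved word there, hence the backticks.
    where = ["MATCH (`index`) AGAINST (%s)"]
    params = [query]
    if app_label and model:
        # Optionally restrict the search to a single model.
        content_type_id = ContentType.objects.filter(app_label=app_label, model=model)[0].id
        where.insert(0, "content_type_id = %s")
        params.insert(0, content_type_id)
    result = search_model.objects.extra(
        select={'relevance': "MATCH (`index`) AGAINST (%s)"},
        select_params=(query,),
        where=where,
        params=params
    ).order_by(SORT_MAPPINGS[sort])
    return result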
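Since the two functions differ only in their SQL fragments, one could imagine choosing the fragment by database vendor. The sketch below is purely illustrative: MATCH_CLAUSES and build_filter are hypothetical names, and the vendor strings are assumed to match what recent Django versions expose as connection.vendor.

```python
# Hypothetical per-backend SQL fragments, keyed by database vendor name.
MATCH_CLAUSES = {
    "postgresql": "index @@ plainto_tsquery(%s)",
    "mysql": "MATCH (`index`) AGAINST (%s)",
}


def build_filter(vendor, terms, content_type_id=None):
    """Return (where, params) lists suitable for QuerySet.extra()."""
    try:
        where = [MATCH_CLAUSES[vendor]]
    except KeyError:
        raise NotImplementedError("no full-text backend for %r" % vendor)
    params = [terms]
    if content_type_id is not None:
        # Restrict results to a single model.
        where.insert(0, "content_type_id = %s")
        params.insert(0, content_type_id)
    return where, params
```

The search terms always travel in params rather than being formatted into the SQL string, so escaping is left to the database driver.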

At this point, I hope I've given a good high-level view of how search works. There is a lot I've left out, but those details are best appreciated by taking a look at the source code.

So without further ado:
