Explained: Indexing and searching content

Cuyahoga uses DotLucene to index the content that has to be searchable. DotLucene is a .NET port of the well-known Jakarta Lucene engine. It is not a complete search engine but rather a very powerful API that can be integrated into almost any application that needs full-text search capabilities.
All content in DotLucene is a Document. To index content, you feed Document instances to DotLucene, and after a query it returns Document instances again (indirectly, via Hits).
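
To make the "everything is a Document" idea concrete, here is a minimal sketch in Python. It is not DotLucene and does not use its API: the names Document, TinyIndex, add and search are invented for this illustration, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Document:
    """A bag of named text fields, loosely modelled on a Lucene document."""
    fields: dict


class TinyIndex:
    """Toy in-memory inverted index (an assumption for illustration only)."""

    def __init__(self):
        self._docs = []                    # position in the list is the doc id
        self._postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc):
        doc_id = len(self._docs)
        self._docs.append(doc)
        # Index every field value, split on whitespace and lowercased.
        for text in doc.fields.values():
            for term in text.lower().split():
                self._postings[term].add(doc_id)

    def search(self, term):
        """Return the matching documents in the order they were added."""
        ids = sorted(self._postings.get(term.lower(), ()))
        return [self._docs[i] for i in ids]


index = TinyIndex()
index.add(Document({"title": "Indexing content", "path": "/articles/1"}))
index.add(Document({"title": "Searching content", "path": "/articles/2"}))

hits = index.search("searching")
print([doc.fields["path"] for doc in hits])
```

The point of the sketch is the round trip: documents go in, and a query hands documents back, which is the same shape as the Document and Hits flow described above.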

Modules in Cuyahoga that want to have their content indexed need to implement the ISearchable interface. This interface consists of one method (GetAllSearchableContent) that retrieves all documents for that module and three events (ContentCreated, ContentUpdated, ContentDeleted) that the module has to fire after changing content. These events carry the document that has to be indexed or removed from the index.
Implementing this was pretty easy because all module administration pages inherit from the same base class. This base class handles the events and calls the appropriate methods of the index builder.
The advantage of event-based indexing is that all content is indexed immediately. There is no need for a scheduler that crawls the site to update the search index. If anything goes wrong, you can always rebuild the search index manually.
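
The pattern described above can be sketched roughly as follows. Cuyahoga itself is written in C#, so this Python version only mirrors the shape of the contract. The Event helper, the IndexBuilder methods, the wire function and ArticleModule are all hypothetical names, and documents are plain dicts here.

```python
from abc import ABC, abstractmethod


class Event:
    """Minimal stand-in for a .NET event: a list of handler callables."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def fire(self, document):
        for handler in self._handlers:
            handler(document)


class Searchable(ABC):
    """Python rendering of the ISearchable contract: one method, three events."""

    def __init__(self):
        self.content_created = Event()
        self.content_updated = Event()
        self.content_deleted = Event()

    @abstractmethod
    def get_all_searchable_content(self):
        """Return every document this module wants indexed."""


class IndexBuilder:
    """Hypothetical index builder that keeps documents keyed by path."""

    def __init__(self):
        self.documents = {}

    def add(self, doc):
        self.documents[doc["path"]] = doc

    def update(self, doc):
        # Replace whatever was indexed under the same path.
        self.documents[doc["path"]] = doc

    def delete(self, doc):
        self.documents.pop(doc["path"], None)


def wire(module, builder):
    """Roughly what the shared administration base class is described as doing."""
    module.content_created.subscribe(builder.add)
    module.content_updated.subscribe(builder.update)
    module.content_deleted.subscribe(builder.delete)


class ArticleModule(Searchable):
    """Example module that fires an event after every content change."""

    def __init__(self):
        super().__init__()
        self.articles = {}

    def save(self, path, text):
        is_new = path not in self.articles
        self.articles[path] = {"path": path, "text": text}
        event = self.content_created if is_new else self.content_updated
        event.fire(self.articles[path])

    def remove(self, path):
        self.content_deleted.fire(self.articles.pop(path))

    def get_all_searchable_content(self):
        return list(self.articles.values())


module, builder = ArticleModule(), IndexBuilder()
wire(module, builder)
module.save("/a", "first draft")
module.save("/a", "second draft")
print(builder.documents["/a"]["text"])
```

Because the index is updated inside the event handlers, it never lags behind the content, which is the advantage claimed for event-based indexing.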

When rebuilding the entire search index from scratch, Cuyahoga traverses the sites, nodes and sections. When it finds a section whose connected module implements ISearchable, it calls GetAllSearchableContent() to retrieve the documents that have to be indexed.
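
A full rebuild along these lines might look like the sketch below. The Site, Node and Section classes are simplified stand-ins and not Cuyahoga's real domain model. The nested children list on Node is an assumption about how nodes are organised, and the duck-typed method check stands in for the ISearchable interface test.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    module: object = None


@dataclass
class Node:
    sections: list = field(default_factory=list)
    children: list = field(default_factory=list)  # assumed tree structure


@dataclass
class Site:
    root_nodes: list = field(default_factory=list)


def searchable_documents(sites):
    """Yield every document from every searchable module, depth first."""

    def walk(node):
        for section in node.sections:
            # Stand-in for "does the module implement ISearchable?"
            get_all = getattr(section.module, "get_all_searchable_content", None)
            if callable(get_all):
                yield from get_all()
        for child in node.children:
            yield from walk(child)

    for site in sites:
        for node in site.root_nodes:
            yield from walk(node)


def rebuild_index(sites):
    """Start from an empty index and add everything the traversal finds."""
    return list(searchable_documents(sites))


class NewsModule:
    def get_all_searchable_content(self):
        return ["news item 1", "news item 2"]


class ContactFormModule:
    """Not searchable: it has no get_all_searchable_content method."""


site = Site(root_nodes=[Node(
    sections=[Section(NewsModule()), Section(ContactFormModule())],
    children=[Node(sections=[Section(NewsModule())])],
)])
print(len(rebuild_index([site])))
```

Sections whose module is not searchable are simply skipped, so the rebuild needs no knowledge of individual module types.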

2/19/2005 5:31:00 PM Published by Martijn Boland Category Developers Comments 4

  • I was tooling around with DotLucene, trying to find a method to index aspx pages. I think it's a very powerful API, as you call it, but there's sparse documentation. I can't wait to see how you've done it!

    by Andrew Hallock - 2/24/2005 9:24:44 AM
  • There's a lot of documentation for Lucene, the Java version. Since DotLucene is a direct port, all Java documentation also applies to the .NET version.

    For instance, I was very much inspired by an article on theserverside.com (Java community) that explains how indexing and searching work over there.

    by Martijn Boland - 2/25/2005 3:42:25 PM
  • Do you know of any examples (.NET) that show how to use Lucene to index aspx/html pages?

    Basically, I want to be able to use it to index a site by traversing URLs.

    by John Mandia - 3/17/2005 5:23:12 AM
  • As far as I know there are no crawlers yet for DotLucene. Two options:

    - write one yourself: this shouldn't be too hard. HttpWebRequest and some smart regexes to find all the links can give you the desired results. You can easily convert the html from the WebRequest to DotLucene documents.
    - if you can use Java on your server, you can get one of the Java crawlers for Lucene. The index that is being created is fully compatible with DotLucene so you can then use DotLucene to do the querying.

    by Martijn Boland - 3/21/2005 3:58:58 PM
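
The first option in the reply above (fetch pages and find links with a regex) can be sketched as follows. This is an illustration in Python, not DotLucene or .NET code: urllib plays the role of HttpWebRequest, and the names extract_links, fetch and crawl, the page limit and the same-host rule are all assumptions. A regex is good enough for a sketch, but a real crawler would want an HTML parser and error handling.

```python
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

# Captures the href value up to the closing quote or a '#' fragment.
LINK_RE = re.compile(r'href\s*=\s*["\']([^"\'#]+)', re.IGNORECASE)


def extract_links(base_url, html):
    """Return absolute URLs for every href found in the HTML."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]


def fetch(url):
    """Download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the start URL's host.

    Returns a dict mapping each visited URL to its HTML, ready to be
    converted into documents for an index.
    """
    host = urlparse(start_url).netloc
    queue, pages = [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in pages:
            continue
        pages[url] = fetch_page(url)
        for link in extract_links(url, pages[url]):
            if urlparse(link).netloc == host and link not in pages:
                queue.append(link)
    return pages
```

Passing a different fetch_page callable makes it possible to try the crawl logic against canned pages without touching the network.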