I’ve been hacking around with a lot of search-related technology recently. We have a project at my gig where we need to implement site-wide search for a reasonably sized website, and we investigated a bunch of different products and technologies along the way.
Before this most recent search bonanza I was doing a decent amount of work with DotLucene. For those of you unfamiliar with DotLucene, I recommend you check it out. It’s an open source port of the Apache Jakarta Lucene project, billed as a “powerful open source search engine for .NET applications”. I’d highly recommend it to anyone looking to implement search on a site or within an application. By combining DotLucene with a .NET wrapper around the Indexing Service’s IFilter components, you can index not only HTML content but also Microsoft Office documents, PDFs, and any other file type that has an IFilter available.
The other project I’ve come across is Nutch, an open source web crawler that builds indexes of websites. It uses Lucene behind the scenes, and I believe some seriously large indexes are being created with it (I can’t remember where I saw mention of this). Anyway, all of this has sparked my interest in the technology behind search. I’m planning on spending more time in the coming weeks and months playing with DotLucene, looking over the Nutch code base, and thinking about what might come of it all.