Search feedback

Brett Dikeman brett at cloud9.net
Thu Jul 13 17:46:35 EDT 2006


I'm experimenting with a return to using htdig for searching the site
and list archives.

http://www.audifans.com/htdig/search.html

Please give it a whirl and let me know what you think.  Note that  
results are biased towards newer documents, and more heavily biased   
towards the subject line of the message in the archives, so consider  
this good encouragement to use meaningful subject lines :-)  The  
index was last updated 2-3 days ago.

It's somewhat slow, so please be patient.  I timed searches at 10-15
seconds; htdig apparently doesn't scale well up to half a million
documents (we have about 425,000 pages).  Don't expect Google-like
snappiness.

On the plus side, compare these searches on "alternator belt":
Google w/"site:audifans.com": 1200 hits
Google "branded" search (on audifans homepage): 117 hits
htdig: 743 hits

The htdig top hits often had "alternator belt" in the title; the hits  
around page 10 had "alternator belt" somewhere in the message body.

I'm also "all ears" to alternative search engines, for any computer  
people out there that know of them.  Rough criteria:

-scales past half a million documents
-free or open-source
-intelligent (fuzzy) searching, i.e. it won't annoy users with
"alternator" not matching "alternators", etc.  Some control over how
documents are ranked.
-no annoying requirements (Lucene requires we install Java and
Tomcat. I don't mind compiling things. I do mind having to install a
whole new architecture and run a second webserver just for ONE site
feature.)
-documentation.  This rules out "omega", for which it was almost
impossible to find even a -download-.
-no "write your own frontend" BS; i.e. complete packages, not
"libraries".  I want a set of CLI utilities, and a functional HTML
search interface.
-built-in crawler, but must be able to index most but NOT ALL local
documents directly from disk.  htdig is smart enough to let me tell it
/archives = /opt/whatever/archives, etc., but will happily switch over
to hitting the website for other URLs, like the Wiki or marketplace,
which are database-backed.
-incremental updates to the index/databases (see above re half a  
million documents)
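
For anyone wondering what I mean by the local-vs-website point above,
here is a rough sketch in Python of the idea, not of how htdig does it
internally.  The URL prefix, the example paths and the function name
are all made up for illustration; only the /archives to
/opt/whatever/archives mapping comes from the description above.

```python
from typing import Dict, Optional

# Hypothetical mapping of URL prefixes to directories on disk.
# The URL prefix is an assumption; the filesystem side mirrors the
# "/archives = /opt/whatever/archives" example in the list above.
LOCAL_PREFIXES: Dict[str, str] = {
    "http://www.audifans.com/archives/": "/opt/whatever/archives/",
}


def resolve_local(url: str,
                  prefixes: Dict[str, str] = LOCAL_PREFIXES) -> Optional[str]:
    """Return the on-disk path for a URL, or None if it must be fetched
    over HTTP (e.g. database-backed pages such as a wiki)."""
    for url_prefix, fs_prefix in prefixes.items():
        if url.startswith(url_prefix):
            # Swap the URL prefix for the filesystem prefix.
            return fs_prefix + url[len(url_prefix):]
    return None


if __name__ == "__main__":
    for u in ("http://www.audifans.com/archives/1999/msg1.html",
              "http://www.audifans.com/wiki/Main"):
        path = resolve_local(u)
        print(u, "->", path if path else "fetch over HTTP")
```

A crawler built this way reads the static archive pages straight off
the disk and only goes through the webserver for everything else.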

Brett

