Search feedback
Brett Dikeman
brett at cloud9.net
Thu Jul 13 17:46:35 EDT 2006
I'm experimenting with a return to using htdig for searching the site/
list archives.
http://www.audifans.com/htdig/search.html
Please give it a whirl and let me know what you think. Note that
results are biased towards newer documents, and more heavily biased
towards the subject line of the message in the archives, so consider
this good encouragement to use meaningful subject lines :-) The
index was last updated 2-3 days ago.
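To make the effect of that bias concrete, here is a rough sketch of how a subject-heavy, recency-weighted score might behave. This is NOT htdig's actual scoring formula; the weights, the function name and the decay rule are all assumptions made up for illustration.

```python
from datetime import date

# Hypothetical weights chosen for illustration only; htdig's real
# factors and formula are different.
SUBJECT_WEIGHT = 5.0   # a hit in the subject line counts most
BODY_WEIGHT = 1.0      # a hit in the message body counts least
RECENCY_WEIGHT = 2.0   # bonus that shrinks as the message ages


def score(query, subject, body, posted, today):
    """Score one archived message for a query phrase (case-insensitive)."""
    q = query.lower()
    text_score = (SUBJECT_WEIGHT * subject.lower().count(q)
                  + BODY_WEIGHT * body.lower().count(q))
    if text_score == 0:
        return 0.0  # no match at all, so recency is irrelevant
    age_years = max((today - posted).days, 0) / 365.0
    recency = RECENCY_WEIGHT / (1.0 + age_years)
    return text_score + recency
```

With weights like these, a message with "alternator belt" in its subject outranks one that only mentions it in the body, and of two otherwise equal messages the newer one ranks higher.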
It's somewhat slow, so please be patient. I timed searches at 10-15
seconds; htdig apparently doesn't scale well up to half a million
documents (we have about 425,000 pages). Don't expect Google-like
snappiness.
On the plus side, compare these searches on "alternator belt":
Google w/"site:audifans.com": 1200 hits
Google "branded" search (on audifans homepage): 117 hits
htdig: 743 hits
The htdig top hits often had "alternator belt" in the title; the hits
around page 10 had "alternator belt" somewhere in the message body.
I'm also "all ears" for alternative search engines, if any computer
people out there know of them. Rough criteria:
-scales past half a million documents
-free or open-source
-intelligent (fuzzy) searching, i.e. it won't annoy users with
"alternator" not matching "alternators", etc. Some control over how
documents are ranked.
-no annoying requirements (Lucene requires we install Java and
Tomcat. I don't mind compiling things. I do mind having to install a
whole new architecture and run a second webserver just for ONE site
feature.)
-documentation. This rules out "omega", whose -download- was almost
impossible to even find.
-no "write your own frontend" BS; i.e. complete packages, not
"libraries". I want a set of CLI utilities, and a functional html
search interface.
-built-in crawler, but must be able to index most but NOT ALL local
documents directly. Htdig is smart enough to let me tell it
/archives = /opt/whatever/archives, etc., but will happily switch over
to hitting the website for other URLs, like the Wiki or marketplace,
which are database-backed.
-incremental updates to the index/databases (see above re half a
million documents)
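The "index locally where possible, crawl over HTTP otherwise" criterion could be sketched roughly as below. This is only an illustration of the idea, not htdig's implementation; the function name, the full URL prefix and the example paths are assumptions, with /opt/whatever/archives being the same placeholder as above.

```python
# Hypothetical mapping from URL prefix to local directory. The URL
# prefix is an assumption; the local path is a placeholder.
LOCAL_PREFIXES = {
    "http://www.audifans.com/archives/": "/opt/whatever/archives/",
}


def resolve(url, prefixes):
    """Decide how the indexer should fetch a document.

    Returns ("file", path) when the URL falls under a mapped prefix,
    so the document can be read straight from disk, and ("http", url)
    when it has to be fetched through the webserver (for example
    database-backed pages such as a wiki).
    """
    for url_prefix, local_dir in prefixes.items():
        if url.startswith(url_prefix):
            relative = url[len(url_prefix):]
            return ("file", local_dir + relative)
    return ("http", url)
```

An archive URL under the mapped prefix resolves to a file path, while anything else falls through to a normal HTTP fetch.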
Brett