Perl coder needed (was Cataloging Audi Info)
Brett Dikeman
brett at cloud9.net
Tue Mar 19 00:54:33 EST 2002
At 11:47 PM -0500 3/18/02, TM wrote:
>Question for all of you:
>How do you catalog all of the info that you've collected over time?
>
>I have a ton of collected posts and emails that I need to organize in
>some fashion and was wondering how you did it. I'm using Outlook and
>am thinking of just creating a whole new personal folder just for Audi
>stuff and trying to organize everything by subject matter, periodically
>archiving the whole mess to CDR.
I hadn't wanted to discuss it on-list (I wanted it to be a surprise),
but a few friends and I have been working on a much better archive
format. We started work before Christmas, and work fizzled out big
time. So no surprise (we originally thought we'd have something to
show by January 1st and wanted to present it as a New Year's surprise
to the list). At the moment, we're hung up on scripts to do three
things:
a) import a single message received through stdin (called from
/etc/aliases), i.e., keep the archive up to date
b) import an entire MBOX of messages. Both majordomo's archive
program and Mailman's archive program store messages in this
format (in addition to the HTML). This is not -nearly- as simple as
it sounds. All sorts of different header formats (and different
headers), forwarded emails, attachments, wrong dates, etc. all make
the problem pretty messy.
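
To make the two import paths above concrete, here is a minimal sketch. It is written in Python rather than Perl, purely as an illustration using only the standard library, and it is not the project's actual script. The function names (extract_record, import_stdin, import_mbox) and the dictionary keys are hypothetical. It assumes the archive needs the sender, subject, date, message ID, and first plain-text body, and it tolerates a broken Date header instead of rejecting the message.

```python
import mailbox
import sys
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def extract_record(msg):
    """Collect the fields an archive would likely store for one message."""
    raw_date = msg.get("Date")
    try:
        sent = parsedate_to_datetime(raw_date) if raw_date else None
    except (TypeError, ValueError):
        sent = None  # unparseable Date header: keep the message, drop the date

    # Use the first text/plain part that is not an attachment.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and not part.get_filename():
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "latin-1"
            try:
                body = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                body = payload.decode("latin-1", errors="replace")
            break

    return {
        "message_id": str(msg.get("Message-ID", "")),
        "sender": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "date": sent,
        "body": body,
    }


def import_stdin():
    """Case (a): one message piped in, e.g. from an /etc/aliases entry."""
    return extract_record(message_from_bytes(sys.stdin.buffer.read()))


def import_mbox(path):
    """Case (b): every message in an existing MBOX file."""
    return [extract_record(m) for m in mailbox.mbox(path, create=False)]
```

Both entry points go through the same extract_record function, so a fix for an odd header format only has to be made once. Forwarded messages and attachments would still need handling beyond what is shown here.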
Once we get the data in, -basic- searching isn't a problem, and we've
already got some basic frontend stuff set up...that stuff is pretty
easy.
Basically, I -desperately- need someone who has experience with Perl
coding and some SQL (specifically PostgreSQL, but pgsql is highly
SQL compliant and one of the more complete SQL implementations) to look
over what's already done (the import-one-message script is partially done),
get up to speed on our DB layout, etc., and help us finish both scripts.
There's another problem, and I'll mention it in hopes a lightbulb
goes off in someone's head...we need a full text search engine that
can index content in an SQL database. You wouldn't believe how few
of these things there are, and how much people want for them.
Searching HTML and plaintext files? Free or pennies, even ones that
STORE their indexes in an SQL db. Actually indexing text that is IN
an SQL database? $50,000+. There's -some- stuff partially integrated
into pgsql already, but its feature set doesn't match our needs very
well. While intelligence about word forms isn't really necessary, the
ability to handle odd pseudo-words like "5kstq" is ESSENTIAL for
obvious reasons, and the search system needs to recognize such words
entirely on its own.
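
As a rough illustration of the pseudo-word requirement, here is a minimal Python sketch of a tokenizer and in-memory inverted index. It is not any particular search product and not what pgsql ships. It assumes that treating every run of letters and digits as one token is enough for terms like "5kstq" to be indexed without a dictionary. The function names and the sample messages in the usage note are made up for the example.

```python
import re
from collections import defaultdict

# Any run of letters and digits is one token, so "5kstq" survives intact
# and no word list or stemmer is consulted.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each token to the set of message ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of messages containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With two made-up messages, {1: "My 5kstq needs a new timing belt", 2: "Timing belt on the 4kq"}, a search for "5kstq" matches only message 1, while "timing belt" matches both. A real deployment would keep the token-to-message mapping in a database table rather than in memory.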
Once we get over the import hurdle (even just importing new messages),
I can promise you all you'll really like what you see. It will be
very much a work in progress, but I have a lot of exciting and useful
features in mind...but again...the big problem right now is getting
the stuff into our DB.
Please contact me off-list if you're able to help; it will be -much-
appreciated.
Brett
--
----
"They that give up essential liberty to obtain temporary
safety deserve neither liberty nor safety." - Ben Franklin
http://www.users.cloud9.net/~brett/