Searching the Archives

Brett Dikeman brett at cloud9.net
Wed Feb 26 00:54:28 EST 2003


At 7:21 PM -0800 2/25/03, Ti Kan wrote:
>Michael Pshenishny writes:
>>  I'm pretty sure I'm not imagining things.  I've done a bunch of searches
>>  and nothing from 2002 or 2003 from the MAIN quattro list comes up.  I

>Gee, you're right.  The results I saw were mostly from the sub-lists,
>and nothing recent from the main list.  Maybe somehow Google had purged
>the main list entries from its index?  Any ideas Brett or Dan?

Google is flaky sometimes.  It's a phenomenon lovingly referred to as
"The Google Dance", and it's particularly bad at the end of the
month, when their servers are rebuilding indexes and such.

Google is a bad solution anyway.  It's good at web stuff, piss poor at
mail archives: a) it doesn't understand them, and b) its prioritizing
algorithm (i.e., ranking by the number of links) doesn't work at ALL on
mailing list archives (because they are tightly linked internally and
nobody really links to specific archive posts).

The solution, which I had started on a year ago, was dumping
everything into an SQL database.  I have a Perl script which SORT of
works, but it gets faked out very easily, because pipermail (the
Mailman archiver) does not write proper Unix MBOX format files.

For example, this line will cause my script (which relies on Perl MBOX
modules) to skip the rest of this email:

 From Germany.

...because it starts with the word "From".  No.  I'm not kidding.
(A proper mbox writer would have escaped it as ">From".)
The A8 archive MBOX file, which I was using for testing, has two or
three such messages in it.

I expressed my frustration to the maintainers of the 2-3 Perl modules
that are available for MBOX parsing, and the answer was either
silence or some snotty reply about how I must be doing something
wrong, since their module couldn't possibly be that stupid.  One guy's
code was so #$@!$ difficult to understand, I gave up trying to do so.

Fortunately, pipermail is consistent about a few things: the "From" line
always has the list name in it, it's always preceded by a blank line,
and the following line is a) never blank and b) a mail header.  If
someone is willing to write a Perl script that splits an mbox
file, based on those rules and sanity checks, into a
one-file-per-message folder, or to try to fix one of the Perl MBOX
modules, I'd be thrilled.
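
To make those three rules concrete, here is a rough sketch of such a
splitter.  It is in Python rather than Perl, it has not been run against
the real archives, and the function names and the list_name parameter
are made up for illustration.  It treats a line as a message separator
only if it starts with "From ", contains the list name, follows a blank
line (or is the first line of the file), and is followed by something
that looks like a mail header.

```python
import re
from pathlib import Path

# A header line looks like "Name: value"; this pattern is an assumption
# about what counts as "a mail header" for the sanity check.
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:")


def split_pipermail_mbox(text, list_name):
    """Split mbox text into messages using the three separator rules."""
    lines = text.splitlines(keepends=True)
    starts = []
    for i, line in enumerate(lines):
        if not line.startswith("From "):
            continue
        # Rule 1: the separator line mentions the list name.
        if list_name not in line:
            continue
        # Rule 2: it is preceded by a blank line (or starts the file).
        if i > 0 and lines[i - 1].strip() != "":
            continue
        # Rule 3: the next line exists and looks like a mail header.
        if i + 1 >= len(lines) or not HEADER_RE.match(lines[i + 1]):
            continue
        starts.append(i)

    messages = []
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(lines)
        messages.append("".join(lines[start:end]))
    return messages


def write_messages(messages, out_dir):
    """Write each message to its own numbered file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n, msg in enumerate(messages, 1):
        (out / f"{n:06d}.eml").write_text(msg, encoding="utf-8")
```

A body line such as "From Germany." fails the first check (no list name)
and so stays inside the message it belongs to.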

That solves importing the messages from since we switched to Mailman.
There are other problems, mainly the Majordomo archives.  Dan keeps
telling me he has plaintext (i.e., mbox or something similar) versions
of the archives on CD 'somewhere'; if he doesn't, someone will have
to write a parser that not only parses the HTML but also traverses all
the various links.

There's one other problem: I never figured out how to do thread
re-assembly (i.e., figuring out which message is in reply to which).  It
is not as simple as it sounds, because a LARGE number of email
clients do not provide ANY threading information, and the ones that
do tend to give it in all sorts of different ways.  Eudora and
Netscape both provide threading/quoting information, but they do it
differently.  Outlook, across the DOZENS of different versions (I am
not exaggerating.  The count might actually be a hundred or more),
may or may not provide any information, depending on the version.
Scores of little has-been email clients don't provide any
quotation/thread info.
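
For the clients that do cooperate, the threading information usually
shows up in the standard In-Reply-To and References headers.  As a
rough sketch (Python, untested against the real archives; parent_id is
a made-up name), a first pass could try In-Reply-To, fall back to the
last entry in References, and give up otherwise, leaving the rest to
some other heuristic.

```python
import re
from email import message_from_string

# Message-IDs are written in angle brackets, e.g. <abc@example.com>.
MSGID_RE = re.compile(r"<[^<>\s]+>")


def parent_id(raw_message):
    """Return the Message-ID this message replies to, or None."""
    msg = message_from_string(raw_message)

    # Prefer In-Reply-To: it names the direct parent when present.
    in_reply_to = MSGID_RE.findall(msg.get("In-Reply-To", ""))
    if in_reply_to:
        return in_reply_to[0]

    # References lists ancestors oldest-first, so the last entry is
    # the closest one to this message.
    references = MSGID_RE.findall(msg.get("References", ""))
    if references:
        return references[-1]

    # No threading headers at all.
    return None
```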

The solution I came up with was Jaccard text similarity
analysis (available from a Perl module), something that wouldn't even
be faked out by subject line changes, as long as SOME text from the
original message was included in the reply...but it would be
incredibly computationally intensive, since you'd have to compare
the message you're trying to import right now against every message
in the archives going, say, a week back.  It also doesn't
address cases where someone replies to a message a
month old or more, which happens occasionally.
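
A minimal sketch of that idea, in Python rather than with the Perl
module, and with assumptions throughout: the Jaccard score here is
computed on sets of lowercased words (shared words divided by total
distinct words), the one-week window and the 0.2 threshold are
arbitrary guesses, and guess_parent is a made-up name.

```python
import re
from datetime import datetime, timedelta


def word_set(text):
    """Lowercased set of word-like tokens; quote markers are dropped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: |A & B| / |A | B| over word sets."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def guess_parent(new_body, new_date, candidates, window_days=7, threshold=0.2):
    """Pick the most similar earlier message within the time window.

    candidates is an iterable of (msg_id, date, body) tuples.
    Returns the best msg_id, or None if nothing reaches the threshold.
    """
    earliest = new_date - timedelta(days=window_days)
    best_id, best_score = None, threshold
    for msg_id, date, body in candidates:
        # Only look at messages from the last window_days days.
        if not (earliest <= date < new_date):
            continue
        score = jaccard(new_body, body)
        if score >= best_score:
            best_id, best_score = msg_id, score
    return best_id
```

Every import costs one comparison per message in the window, which is
where the expense comes from, and a reply to anything older than the
window is simply never matched.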

Brett
--
----
"They that give up essential liberty to obtain temporary
safety deserve neither liberty nor safety." - Ben Franklin
http://www.users.cloud9.net/~brett/


