Date: Sat, 24 Dec 1994 17:02:17 +0900
From: Stephen Turnbull <turnbull AT shako DOT sk DOT tsukuba DOT ac DOT jp>
To: djgpp AT sun DOT soe DOT clarkson DOT edu
Subject: DJGPP list archives

OK, the Yaseppochi-gumi DJGPP mailing list archives are now up-to-date
(well, they will be until this message goes out :-).  I noticed that
the November archive was up to 2.2MB before gzip'ing, and the December
archive to date is already 1.2MB.  I don't see any reason for the
level of traffic to decrease substantially over the next couple of
months, except that a revised FAQ seems likely to come out shortly,
which will help a little.  But most of the traffic seems to be V2 or
high-tech; the newbies are generally scared away by the volume, I
think.  This is unfortunate.  Access to the archives might be a
reasonable alternative to subscribing, and a way to get familiar with
the list.

Gzip'ed, those archives are 555KB and 365KB respectively.  That is a
bit much for anyone downloading over a telephone line, even at
14.4Kbps.

I can see a couple of possibilities for improving this situation.  But
unless they can be easily automated, they're too much like work for me
to be willing to do.  Here's what I know how to automate, and will
start doing shortly:
(1) filter out subscribe and unsubscribe messages (actually, this is
    already done, mostly)
(2) filter out nuisance headers, especially the duplicate set produced
    by RMail
(3) filter out certain well-known Warlord-style .sigs (I'm an
    occasional offender myself, but I don't think you need to see them
    a dozen (or dozen-score) times in an archive; I'll probably put
    them in a separate file so you can just download the bunch once :-)
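
As a rough illustration of filters (1) and (2), here is a minimal
Python sketch.  It is not the script actually used for the archive.
It assumes each message is already available as one string with a
header block, a blank line, and a body.  The names KEEP_HEADERS,
is_admin_request and strip_headers are hypothetical, and the header
whitelist is a guess at what readers want to keep.  It does nothing
specific to the RMail duplicate headers.

```python
import re

# Headers worth keeping; everything else is treated as a nuisance header.
# This whitelist is an assumption, not the list used for the real archive.
KEEP_HEADERS = ("date", "from", "subject", "message-id")

# Matches a line that begins with "subscribe" or "unsubscribe".
ADMIN_RE = re.compile(r"^\s*(un)?subscribe\b", re.IGNORECASE)


def split_message(message):
    """Split a message into (header_lines, body) at the first blank line."""
    head, _, body = message.partition("\n\n")
    return head.split("\n"), body


def is_admin_request(message):
    """Guess whether a message is a stray subscribe/unsubscribe request."""
    headers, body = split_message(message)
    for line in headers:
        if line.lower().startswith("subject:"):
            if ADMIN_RE.match(line[len("subject:"):]):
                return True
    body_lines = [l for l in body.split("\n") if l.strip()]
    # A very short body whose first line is the command itself.
    return 0 < len(body_lines) <= 3 and bool(ADMIN_RE.match(body_lines[0]))


def strip_headers(message, keep=KEEP_HEADERS):
    """Drop every header not named in `keep`, with its continuation lines."""
    headers, body = split_message(message)
    kept, keeping = [], False
    for line in headers:
        if line[:1] in (" ", "\t"):  # continuation of the previous header
            if keeping:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        keeping = name in keep
        if keeping:
            kept.append(line)
    return "\n".join(kept) + "\n\n" + body
```

The heuristic in is_admin_request is deliberately conservative: a
long message that merely mentions subscribing is kept.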

I may also be able to filter duplicates (certainly when they have the
same message ID, as happens when I save both the direct copy sent to
me and the listserv-generated copy).  But there aren't too many of
these.
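
Duplicate filtering by message ID could look something like the
following Python sketch.  It assumes the messages are already split
into strings and that duplicates carry an identical Message-ID header.
The function names are hypothetical.  Messages with no Message-ID are
kept, since there is nothing to compare them on.

```python
import re

# Finds the Message-ID header anywhere in the header block.
MSGID_RE = re.compile(r"^message-id:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def message_id(message):
    """Return the Message-ID from the header block, or None if absent."""
    head = message.partition("\n\n")[0]
    match = MSGID_RE.search(head)
    return match.group(1) if match else None


def drop_duplicates(messages):
    """Keep the first copy of each Message-ID, preserving original order."""
    seen = set()
    unique = []
    for msg in messages:
        mid = message_id(msg)
        if mid is not None:
            if mid in seen:
                continue  # a later copy of a message already kept
            seen.add(mid)
        unique.append(msg)
    return unique
```

Whichever copy comes first in the input is the one that survives, so
the order of the saved files decides whether the direct copy or the
listserv copy is kept.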

I would appreciate any suggestions for
    (a) other easy-to-filter nuisances, either by line or by message
    (b) code (perl or gawk) for doing the job
    (c) other filtering tools (especially AI programs capable of
        filtering replies with 100 quoted lines and 2 lines of new
        content!)
    (d) dividing messages into files by content and *how to recognize
        content* (maybe threading a la newsreaders could sort of be
        done?)
    (e) dividing files into messages.  Currently I plan to use ASCII
        FF (^L), but if there's a good reason to use something else,
        let me know
    (f) tools to use to 'grep' the archives.  Currently I plan to use
        a batch file calling gawk, but if there's a better tool that
        can easily be made message-oriented (I don't know how to do
        that with grep; perl is pretty big compared to gawk), I'd love
        to hear about it.  Needs to be fairly newbie-transparent; this
        isn't for me, it's for people who would otherwise not have
        regexp search available.
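
To make points (e) and (f) concrete, here is a small Python sketch of
a message-oriented search over an archive whose messages are separated
by ASCII FF.  It is only an illustration of the idea, not the planned
gawk batch file.  The function names split_archive and grep_messages
are hypothetical.

```python
import re


def split_archive(text, separator="\f"):
    """Split archive text into messages on the separator (ASCII FF)."""
    return [m.strip("\n") for m in text.split(separator) if m.strip()]


def grep_messages(text, pattern, ignore_case=True):
    """Return each whole message that contains a match for `pattern`."""
    flags = re.MULTILINE  # let ^ and $ match at every line of a message
    if ignore_case:
        flags |= re.IGNORECASE
    regex = re.compile(pattern, flags)
    return [m for m in split_archive(text) if regex.search(m)]


if __name__ == "__main__":
    archive = (
        "Subject: go32 crash\n\nIt dies at startup.\n\f"
        "Subject: FAQ\n\nNew FAQ soon.\n\f"
        "Subject: V2 status\n\nAny news on go32?\n"
    )
    # Print every message mentioning go32, separated by a rule.
    for hit in grep_messages(archive, r"go32"):
        print(hit)
        print("-" * 40)
```

The same record-per-message approach carries over to gawk by setting
RS to the form feed character, so each message becomes one record.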

I would like to know if there is any interest in extending the
archives backward by date; so far no one has requested this, but if
there's a reason to do so and the process is pretty automatic, I'd do
it for historical interest if nothing else.

Let me know about anything else that might make the archives more
useful.  Don't hesitate to suggest things that look burdensome; if I
don't like it, I'll just ignore it ;-)
    --Steve