On Thu, Aug 24, 2006 at 11:46:09AM +1000, Marcus Barczak wrote:
[...]
> Now, I need to somehow fix this - so my question is, is there a
> program out there that already exists that can reconstruct each
> message into mbox format? (ie. work out and insert the "From <sender>
> <date>" header between each message ? Google so far has turned up
> nothing :(
>
> I have started work on a perl script to read the file and try and
> reconstruct the From seperators of messages by looking at the
> Received: headers etc. However it's turning into a world of hurt now
> dealing with MIME, parsing the seperators etc ..
>
> I need to somehow fix this so if anyone has ANY advice i'd really
> appreciate it. Is there an easier way to find where a message starts?

Unfortunately the catenation of a bunch of messages is
itself a valid message (anything that may appear in a
header is also valid in a body) so this is not a
well-posed problem, unless you're lucky enough that
something in your MDA inserts a Content-Length: or Lines:
header, in which case it's trivial. If not....
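
For the lucky case, something like the following Python
sketch might do. It assumes every message carries an
accurate Content-Length: header (a byte count of the body),
that line endings are bare LF, and that nothing has
rewritten the bodies since delivery. The function name is
made up for illustration, and a Lines: variant would have
to count lines instead of bytes.

```python
import re

def split_by_content_length(data: bytes) -> list[bytes]:
    """Split concatenated messages using each message's Content-Length."""
    messages = []
    pos = 0
    while pos < len(data):
        # Skip any blank lines left over between messages.
        while data.startswith(b"\n", pos):
            pos += 1
        if pos >= len(data):
            break
        header_end = data.find(b"\n\n", pos)
        if header_end == -1:
            # No blank line after the headers: keep the rest as one message.
            messages.append(data[pos:])
            break
        headers = data[pos:header_end]
        m = re.search(rb"^Content-Length:[ \t]*(\d+)[ \t]*$",
                      headers, re.IGNORECASE | re.MULTILINE)
        if m is None:
            raise ValueError(f"no Content-Length in headers at offset {pos}")
        body_start = header_end + 2
        body_end = body_start + int(m.group(1))
        messages.append(data[pos:body_end])
        pos = body_end
    return messages
```

Note that the body of a message is skipped without being
inspected, so header-like text inside it cannot confuse
the splitter.
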
You need a heuristic to detect a block of message headers,
and you need to be sure that you don't mistake lines in
message bodies for headers (e.g. quoted headers in a
bounce) or the headers of a MIME part for the headers of a
message. Once you've got that, inserting some sort of From_
line is trivial, and most sane programs don't much care
what's in the From_ line (see JWZ rants passim) so you
don't need to take too much care about reconstructing a
sane value for it. I also don't think you need to be able
to parse MIME messages to get this right (and I don't
think doing so would necessarily help much, because the
catenation of a valid MIME part and any number of complete
messages is itself a valid MIME part).
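
A rough Python sketch of that approach, with no MIME
parsing at all. The heuristic here (a header block starts
after a blank line and must contain both From: and Date:)
is only an assumption for illustration, as are the
function names and the placeholder From_ line. It will
still be fooled by quoted headers in bounces and forwards.

```python
import re

HEADER_LINE = re.compile(r"^[!-9;-~]+:")  # printable field name, then a colon
REQUIRED = {"from", "date"}               # assumed minimum for a real message
PLACEHOLDER = "From MAILER-DAEMON Thu Jan  1 00:00:00 1970"

def header_block_fields(lines, start):
    """Return the field names of the header block at lines[start], or None."""
    fields = set()
    i = start
    while i < len(lines) and lines[i] != "":
        line = lines[i]
        if HEADER_LINE.match(line):
            fields.add(line.split(":", 1)[0].lower())
        elif line[0] in " \t" and i > start:
            pass  # folded continuation of the previous field
        else:
            return None  # not a header line, so not a header block
        i += 1
    return fields

def insert_from_lines(text):
    """Insert a placeholder From_ line before each likely message start."""
    lines = text.split("\n")
    out = []
    for i, line in enumerate(lines):
        if i == 0 or lines[i - 1] == "":
            fields = header_block_fields(lines, i)
            if fields is not None and REQUIRED <= fields:
                out.append(PLACEHOLDER)
        # The damaged file has no real From_ lines, so any line that
        # starts with "From " is body text and gets quoted.
        if line.startswith("From "):
            line = ">" + line
        out.append(line)
    return "\n".join(out)
```
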
So, once you've got a candidate block of headers, how do
you figure out it's not (a) the middle of a message body;
or (b) the headers of a MIME part? (b) is easier --
headers of parts won't include From:, To:, Received:, Date:,
MIME-Version: etc., unless it's a part of type
message/rfc822 or similar, which is really like case (a).
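
The test for (b) could look something like this in Python.
The function name and the three labels are invented for
the example, and the list of top-level fields is just the
ones mentioned above. A message/rfc822 part will come out
as "message-like", which is the point: it has to be
handled as case (a).

```python
TOP_LEVEL = {"from", "to", "received", "date", "mime-version"}

def classify_header_block(block: str) -> str:
    """Guess whether a header block belongs to a message or a MIME part."""
    # Collect field names, ignoring folded continuation lines.
    names = {
        line.split(":", 1)[0].strip().lower()
        for line in block.splitlines()
        if line and line[0] not in " \t" and ":" in line
    }
    if names & TOP_LEVEL:
        return "message-like"
    if names and all(n.startswith("content-") for n in names):
        return "mime-part"
    return "unknown"
```
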
Case (a) is really painful, though. The common cases are
bounces and forwards, and you might be able to
pattern-match on the preceding text (`forwarded message';
`this is a copy of the message, including all the headers'
and all the variants used by other MTAs) to distinguish
these. But maybe not.
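
As a starting point for the pattern-matching, here is a
small Python sketch. The first two phrases are the ones
quoted above; "original message" is my own guess at
another common banner, and the real list would have to be
built up from whatever your mail actually contains. The
function name and the five-line window are arbitrary.

```python
import re

EMBED_MARKERS = re.compile(
    r"forwarded message"
    r"|copy of the message, including all the headers"
    r"|original message",  # assumed extra variant, not exhaustive
    re.IGNORECASE,
)

def looks_embedded(preceding_lines, window=5):
    """True if the text just before a header block suggests a forward or bounce."""
    # Keep the last few non-blank lines and join them, so that a phrase
    # wrapped across two lines is still found.
    recent = [l.strip() for l in preceding_lines if l.strip()][-window:]
    return EMBED_MARKERS.search(" ".join(recent)) is not None
```

A candidate header block for which looks_embedded() is
true would then be left alone, or set aside for a human to
look at, rather than given a From_ line.
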
I suspect you'll need to do quite a lot of testing to get
this to work reliably (and perhaps interactively classify
the difficult cases, assuming you can detect them).
--
But as it is!... my language fails!
Go out and govern New South Wales!
(`Lord Lundy', Hilaire Belloc)