Re: [exim] Counting the number of headers in an incoming ema…

Author: Mike Brudenell
Date:  
To: Exim
Subject: Re: [exim] Counting the number of headers in an incoming email
On 6 January 2016 at 18:00, Always Learning <exim@???> wrote:

> I had thought the third {} (now '{512}') was the first of the 'yes match
> found' 'no match found' Boolean matching result pair. It seems the third
> {} is actually an accumulator of successful matches.
>


The line had me confused for a while too until I copied it into a text file
and spaced things out more…

The third {} isn't really an "accumulator" (which to my mind implies a
counter that increases, counting a number) but a repetition count for part
of the regular expression. Maybe this makes it clearer…

${if match{$message_headers}{\N(\S.+\n(\s.+\n)*){512}\N}}


with a bit of whitespace separation becomes

${if match {$message_headers} {\N(\S.+\n(\s.+\n)*){512}\N} }


Now you can see more clearly that there is no "third {}" — instead the
"match" has its usual two {…} arguments: the first being the expanded
string to consider and the second being the pattern to test against.

Putting some whitespace into that pattern (which of course is just to help
work out what's happening, and in reality would break the pattern match as
it changes the expression!) you can see the expression's individual parts:

\N (\S.+\n(\s.+\n)*){512} \N


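I reckon this means:

- \N ... \N: not part of the regular expression at all. In an Exim
expansion string, \N marks the start and end of a stretch that is copied
verbatim with no further expansion, so the backslashes and braces of the
pattern reach the regex engine untouched
- (\S.+\n(\s.+\n)*){512}: the pattern within the parenthesised
expression repeated exactly 512 times

where the parenthesised expression matches:

- \S: any one character that is not whitespace
- .+: one or more of any character (in PCRE "." does not match a newline
by default, so this stays within the line)
- \n: a newline
- (\s.+\n)*: zero or more instances of the pattern "a whitespace
character, followed by one or more of any character, followed by a
newline", i.e. the continuation lines of a folded header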
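
As a rough illustration of the parenthesised expression on its own, here is a small Python sketch. It is not Exim code: it assumes Python's re module treats this particular pattern the same way PCRE does, and the sample header block and the function name logical_headers are made up for the example. It shows that a folded header (a line plus its whitespace-led continuation lines) is consumed as a single match.

```python
import re

# One logical header: a line starting with a non-whitespace character,
# followed by any number of lines that start with whitespace
# (the continuation lines of a folded header).
ONE_HEADER = re.compile(r"\S.+\n(\s.+\n)*")


def logical_headers(header_block: str) -> list[str]:
    """Return each logical header (including continuation lines) as one string."""
    return [m.group(0) for m in ONE_HEADER.finditer(header_block)]


# Hypothetical header block: two logical headers spread over three lines.
sample = (
    "Subject: a long subject\n"
    " that is folded onto a second line\n"
    "To: someone@example.com\n"
)

for header in logical_headers(sample):
    print(repr(header))
```

The folded Subject header comes back as one item, which is why the {512} repetition counts headers rather than physical lines.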

So the overall expression:

- does *not* match runs of 511 or fewer of the overall pattern (i.e.,
header lines in this case);
- *does* match a run of exactly 512 instances of the overall pattern;
- *does* match the first 512 instances of the overall pattern if there's
a run of more than 512 instances.

As I say, it took me a few minutes to realise that a pattern matching 512
header lines also matches 512 or more: it's just matching the *first* 512
header lines of the entire set. Ingenious! :-)
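
The "exactly N also means N or more" behaviour can be checked with a short Python sketch. This is only an approximation of what Exim does: it assumes Python's re module behaves like PCRE for this pattern, it uses a small threshold of 3 instead of 512 to keep the test data readable, and the helper names and X-Test headers are invented for the example.

```python
import re


def has_at_least_n_headers(header_block: str, n: int) -> bool:
    """True if the unanchored pattern finds n consecutive logical headers."""
    pattern = r"(\S.+\n(\s.+\n)*){%d}" % n
    return re.search(pattern, header_block) is not None


def make_headers(count: int) -> str:
    """Build a hypothetical header block with `count` one-line headers."""
    return "".join(f"X-Test-{i}: value\n" for i in range(count))


for count in (2, 3, 5):
    print(count, has_at_least_n_headers(make_headers(count), 3))

# A folded header is one logical header, so three physical lines
# holding only two headers do not reach a threshold of 3.
folded = "A: one\n continued\nB: two\n"
print(has_at_least_n_headers(folded, 3))
```

Because the pattern is not anchored at the end, anything after the first N headers is simply ignored, so the test behaves as "N or more".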

My PCRE is rusty these days, so for my own information (I'm venturing back
into the world of Exim and PCRE!):

- Would there be any advantage to using "(?: … )" instead of "( … )" to
avoid the cost of storing matched substrings? Or is that not applicable to
a "match", but only an "sg"?
- I seem to recall that patterns like ".*" and ".+" can be expensive in
that they potentially do backtracking when trying to match; is there an
even cleverer pattern that uses the "do not backtrack" matcher? (I think
it's introduced with "(?>" ?)

Cheers,
Mike B-)

--
Systems Administrator & Change Manager
IT Services, University of York, Heslington, York YO10 5DD, UK
Tel: +44-(0)1904-323811

Web: www.york.ac.uk/it-services
Disclaimer: www.york.ac.uk/docs/disclaimer/email.htm