http://bugs.exim.org/show_bug.cgi?id=1315
--- Comment #6 from Philip Hazel <ph10@???> 2012-11-09 12:16:22 ---
Christoph,
Perl-style regular expressions are quite complicated and some of the
ways in which they work are not always intuitive. If you have not
already read it, I highly recommend Jeffrey Friedl's book "Mastering
Regular Expressions" (3rd edition), published by O'Reilly. It discusses
many variants, but in particular includes Perl and PCRE.
On Thu, 8 Nov 2012, Christoph Anton Mitterer wrote:
> > e.g: /a$[^x]b/m matches to a\nb, since $ itself just checks a condition, but
> > does not change the character position.
> Uhm... not yet sure whether I understand...
"a" matches "a". Then "$" is true, because we are now just before a
newline in multiline mode (the /m modifier says "multiline"). Then [^x]
matches \n because \n is a character that is not x.
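As a rough illustration of that step-by-step reading, here is a minimal sketch using Python's re module rather than PCRE itself. I am assuming the two engines agree on "$" and on negated classes for this simple case; the variable names are only for illustration.

```python
import re

# The pattern /a$[^x]b/m from the example, written with Python's MULTILINE flag.
multiline = re.compile(r"a$[^x]b", re.MULTILINE)

# "$" only tests the position; it consumes nothing, so "[^x]" is free
# to match the newline that follows.
with_m = multiline.search("a\nb")
matched_text = with_m.group(0) if with_m else None

# Without multiline mode, "$" is not true in the middle of the subject,
# so the same pattern fails.
without_m = re.search(r"a$[^x]b", "a\nb")

print(repr(matched_text), without_m)
```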
> So ok... that means basically a pattern like \r[^\n] needs at least two
> characters to match.
Yes. A class [...] always matches exactly one character.
> Am I right then, that \r\n is _not_ matched here, because \n doesn’t appear
> at all as a character (because it’s considered not a character in that sense
> but the "mark up" for line separating)?
No, you are not right; \n is a character. The reason that \r\n does not
match \r[^\n] is because [^\n] matches any character that is not \n.
Using \r and \n is perhaps confusing; this is no different in the way it
works to the pattern "a[^b]", which matches "a", followed by any
character that is not "b".
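A small sketch of the same point, again in Python's re module as a stand-in for PCRE (an assumption on my part; the subject strings are made up for illustration):

```python
import re

# CR followed by exactly one character that is not LF.
cr_then_not_lf = re.compile(r"\r[^\n]")

# Subject strings chosen for illustration only.
results = {
    "CRLF": bool(cr_then_not_lf.search("\r\n")),   # second character is LF
    "CR X": bool(cr_then_not_lf.search("\rX")),    # second character is not LF
    "CR only": bool(cr_then_not_lf.search("\r")),  # no second character at all
}
print(results)
```

Only the "CR X" subject matches: the class must consume one character, and that character must not be \n.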
> Ah.... ok so [...] always means there needs to be a character...
Yes!
> and if I put in ^$ it just says "at that position, there must be a
> character, but the end-of-line condition must NOT be met.
No! Because $ is _not_ a metacharacter when it is part of a character
class [...]. It is just an ordinary character there, so [^$] matches any
character that is not a dollar.
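To make that concrete, here is a sketch in Python's re module, which as far as I know treats "$" inside a class the same way PCRE does. The subject strings are illustrative.

```python
import re

# Inside [...] the dollar is a literal, so this means
# "a at the start, then one character that is not a dollar".
a_then_not_dollar = re.compile(r"^a[^$]")

checks = {
    "a": bool(a_then_not_dollar.search("a")),      # nothing after "a" to consume
    "ab": bool(a_then_not_dollar.search("ab")),    # "b" is not a dollar
    "a$": bool(a_then_not_dollar.search("a$")),    # literal dollar is excluded
    "a\\n": bool(a_then_not_dollar.search("a\n")), # newline is not a dollar
}
print(checks)
```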
With respect, you really do need to read that book I recommended above.
> a) It's still not clear why a plain $ doesn't match... I would expect it to
> _always_ match... as I would for ^
>
> b) The case:
> $ hd $file
> 00000000 41 0d 0a |A..|
> 00000003
> $ pcregrep '\n' $file ; echo $?
> 1
> Is that, because in UNIX (or rather when the end-of-line is set to \n... \n
> will never match, because again, the \n is not considered a char but rather the
> condition "end-of-line".
The reason this does not match is because of the way pcregrep works.
This is the same as the way GNU grep works. Basically, it is a
line-based matching process, and in effect, terminal newline characters
are stripped from each line before matching happens. So searching for a
newline character can never work. However, pcregrep does have a -M
(multiline) option, which then makes this work.
If you use pcretest instead of pcregrep, where there is better control,
this pattern also matches.
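The effect can be modelled with a toy line-based matcher. This is only a sketch of the idea in Python, not how pcregrep is actually implemented; grep_lines is a hypothetical helper, and it assumes LF is the line terminator.

```python
import re

def grep_lines(pattern, data):
    """Toy line-based grep: split on LF, dropping it, then search each line."""
    regex = re.compile(pattern)
    lines = data.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a trailing LF does not start another line
    return [line for line in lines if regex.search(line)]

data = "A\r\n"  # the three bytes 41 0d 0a from the hex dump

lf_hits = grep_lines(r"\n", data)   # LF was removed before matching
cr_hits = grep_lines(r"\r", data)   # CR is still part of the line
whole_buffer = re.search(r"\n", data) is not None  # matching the raw data

print(lf_hits, cr_hits, whole_buffer)
```

In this model the LF is gone before the pattern ever sees the line, while the same pattern applied to the whole buffer does match.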
> This goes now rather towards support and less towards (invalid) bug
> reporting: Is there a way in PCRE to do what I wanted... e.g. matching
> a CR, that is not followed by an LF?
This pattern does that: \r(?!\n)
It does precisely what you ask: first find \r, then look ahead and
assert that it is not followed by \n. But if you want to use this in
pcregrep, you'll need the -M option to make it search more than one line
at once.
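A negative lookahead for "CR not followed by LF" can be tried out on a whole buffer like this. The sketch uses Python's re module, which I am assuming handles (?!...) the same way as PCRE here; the sample string is made up.

```python
import re

# CR, then assert (without consuming anything) that the next character is not LF.
bare_cr = re.compile(r"\r(?!\n)")

# Offsets: a=0 CR=1 LF=2 b=3 CR=4 c=5 CR=6
sample = "a\r\nb\rc\r"
positions = [m.start() for m in bare_cr.finditer(sample)]
print(positions)
```

Unlike \r[^\n], the assertion also succeeds for a CR at the very end of the data, because it does not require a following character.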
> Or can I check for e.g. LFCRs, who are not actually just CRLFCR i.e. checking
> for LFCR which are not prefixed by another CR.
Yes, assertions should be able to do that.
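One possible reading of that request, sketched in Python's re module (assumed to treat lookbehind like PCRE here), is a negative lookbehind placed before the LFCR pair. The sample strings are illustrative.

```python
import re

# LF CR, but only where the LF is not itself preceded by a CR.
lfcr_not_after_cr = re.compile(r"(?<!\r)\n\r")

plain_lfcr = lfcr_not_after_cr.search("x\n\ry")    # LFCR on its own
crlfcr = lfcr_not_after_cr.search("x\r\n\ry")      # CRLFCR: excluded

print(plain_lfcr.start() if plain_lfcr else None, crlfcr)
```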
>       What|Removed   |Added
> ----------------------------------------------------------------------------
>     Status|RESOLVED  |REOPENED
> Resolution|INVALID   |
I see no point in reopening this, because it is not a bug.
> 1) May I suggest (which is why I reopened this "bug") that you add a small
> example at the places you've mentioned.
> Perhaps one similar to mine e.g.:
> This means that "^a[^$]" would not match the single line "a", but only e.g.
> "ab"
>
> 2) Further, it would perhaps make sense to tell what this particular condition
> is... I guess "the end of the current input line has been met" (in contrast to
> "I found an end-of-line character").
It very clearly says that [...] always matches a character. It is also
very clear, in the section "Characters and metacharacters" that $ is a
metacharacter only outside square brackets. I really don't think it is
worth saying any more.
> So basically people are looking for kinda binarygrep (just google around, I'm
> not the only one missing this).
Perhaps pcregrep with the -M option might do what you want.
Philip