[pcre-dev] [Bug 1315] \r, \n and $ matching seems to be ill…

Author: Philip Hazel
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1315] \r, \n and $ matching seems to be illogical or not fully documented.

http://bugs.exim.org/show_bug.cgi?id=1315




--- Comment #6 from Philip Hazel <ph10@???> 2012-11-09 12:16:22 ---
Christoph,

Perl-style regular expressions are quite complicated and some of the
ways in which they work are not always intuitive. If you have not
already read it, I highly recommend Jeffrey Friedl's book "Mastering
Regular Expressions" (3rd edition), published by O'Reilly. It discusses
many variants, but in particular includes Perl and PCRE.

On Thu, 8 Nov 2012, Christoph Anton Mitterer wrote:

> > e.g: /a$[^x]b/m matches to a\nb, since $ itself just checks a condition, but
> > does not change the character position.
> Uhm... not yet sure whether I understand...


"a" matches "a". Then "$" is true, because we are now just before a
newline in multiline mode (the /m modifier says "multiline"). Then [^x]
matches \n because \n is a character that is not x.
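
A minimal sketch of the same steps, using Python's re module as a
stand-in for PCRE (an assumption for illustration; re.MULTILINE plays the
role of the /m modifier, and the two engines agree on this construct):

```python
import re

pattern = r"a$[^x]b"
subject = "a\nb"

# With multiline mode, $ is true just before the newline and consumes nothing,
# so [^x] is still free to consume the "\n" itself.
with_m = re.search(pattern, subject, re.MULTILINE)

# Without multiline mode, $ is only true at the end of the subject (or just
# before a final newline), so it fails after "a" here.
without_m = re.search(pattern, subject)

print(with_m is not None)
print(without_m is not None)
```

The point is that $ is an assertion about the current position, not
something that consumes the newline.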

> So ok... that means basically a pattern like \r[^\n] needs at least two
> characters to match.


Yes. A class [...] always matches exactly one character.

> Am I right then, that \r\n is _not_ matched here, because \n doesn’t appear
> at all as a character (because it’s considered not a character in that sense
> but the "mark up" for line separating)?


No, you are not right; \n is a character. The reason that \r\n does not
match \r[^\n] is because [^\n] matches any character that is not \n.

Using \r and \n is perhaps confusing; this is no different in the way it
works to the pattern "a[^b]", which matches "a", followed by any
character that is not "b".
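
To make that concrete, here is a small sketch using Python's re module
in place of PCRE (an assumption for illustration; character classes
behave the same way in both for this case):

```python
import re

pattern = re.compile(r"\r[^\n]")

# The second character is \n, which the negated class rejects.
crlf = pattern.search("\r\n")

# Only one character is available, but the class must consume one of its own.
lone_cr = pattern.search("\r")

# \r followed by any character other than \n is accepted.
cr_x = pattern.search("\rx")

print(crlf, lone_cr, cr_x)
```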

> Ah.... ok so [...] always means there needs to be a character...


Yes!

> and if I put in ^$ it just says "at that position, there must be a
> character, but the end-of-line condition must NOT be met.


No! Because $ is _not_ a metacharacter when it is part of a character
class [...]. It is just an ordinary character there, so [^$] matches any
character that is not a dollar.
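
A short sketch of this, again using Python's re module as a stand-in for
PCRE (an assumption for illustration; both treat $ inside a class as a
literal dollar):

```python
import re

pattern = re.compile(r"^a[^$]")

only_a = pattern.search("a")      # no second character for the class to consume
a_b = pattern.search("ab")        # "b" is not a dollar
a_dollar = pattern.search("a$")   # the class rejects a literal dollar
a_newline = pattern.search("a\n") # "\n" is just another non-dollar character

print(only_a, a_b, a_dollar, a_newline)
```

The last case shows that no end-of-line test is involved: the class
happily consumes the newline.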

With respect, you really do need to read that book I recommended above.

> a) It's still not clear why a plain $ doesn't match... I would expect it to
> _always_ match... as I would for ^
>
> b) The case:
> $ hd $file
> 00000000  41 0d 0a                                          |A..|
> 00000003
> $ pcregrep '\n' $file ; echo $?
> 1
> Is that, because in UNIX (or rather when the end-of-line is set to \n... \n
> will never match, because again, the \n is not considered a char but rather the
> condition "end-of-line".


The reason this does not match is because of the way pcregrep works.
This is the same as the way GNU grep works. Basically, it is a
line-based matching process, and in effect, terminal newline characters
are stripped from each line before matching happens. So searching for a
newline character can never succeed in that mode. However, pcregrep does
have a -M (multiline) option, which then makes this work.

If you use pcretest instead of pcregrep, where there is better control,
this pattern also matches.
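
The difference can be modelled roughly as follows. This is a simplified
sketch in Python, not pcregrep's actual code; the helper
line_based_search is hypothetical and only imitates the "strip the
terminating newline, then match each line" behaviour described above:

```python
import re

data = "A\r\n"  # the three bytes 41 0d 0a from the hex dump


def line_based_search(pattern, text):
    """Hypothetical model of a line-oriented grep: each line is matched
    on its own, after its terminating newline has been removed."""
    for line in text.split("\n"):
        if re.search(pattern, line):
            return True
    return False


# Line by line, no piece of the input still contains a newline.
line_hit = line_based_search(r"\n", data)

# Matching against the whole buffer sees the newline as an ordinary character.
buffer_hit = re.search(r"\n", data) is not None

print(line_hit, buffer_hit)
```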

> This goes now rather towards support and less towards (invalid) bug
> reporting: Is there a way in PCRE to do what I wanted... e.g. matching
> a CR, that is not followed by an LF?


This pattern does that: \r(?!\n)

It does precisely what you ask: first find \r, then look ahead and
assert that it is not followed by \n. But if you want to use this in
pcregrep, you'll need the -M option to make it search more than one line
at once.
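
A sketch of a negative lookahead for this case, written (?!\n), using
Python's re module as a stand-in for PCRE (an assumption for
illustration; the lookahead syntax is the same in both). The subject
string is made up for the example:

```python
import re

lone_cr = re.compile(r"\r(?!\n)")

# One CRLF pair, then two CRs that are not followed by LF.
subject = "A\r\nB\rC\r"

positions = [m.start() for m in lone_cr.finditer(subject)]
print(positions)  # [4, 6]
```

The CR at offset 1 is skipped because the lookahead sees \n after it;
the CR at the very end matches because there is nothing after it at all.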

> Or can I check for e.g. LFCRs, who are not actually just CRLFCR i.e. checking
> for LFCR which are not prefixed by another CR.


Yes, assertions should be able to do that.
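
One possible pattern for that, offered as a sketch rather than the only
way to do it, is a negative lookbehind in front of the LFCR pair:
(?<!\r)\n\r. It is shown here with Python's re module as a stand-in for
PCRE (an assumption for illustration; the lookbehind syntax is the same
in both):

```python
import re

lfcr = re.compile(r"(?<!\r)\n\r")

# LFCR with no CR in front of it: should match.
plain = lfcr.search("a\n\rb")

# CRLFCR: the LF is preceded by CR, so the lookbehind rejects it.
crlfcr = lfcr.search("a\r\n\rb")

print(plain, crlfcr)
```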

>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|RESOLVED                    |REOPENED
>          Resolution|INVALID                     |


I see no point in reopening this, because it is not a bug.

> 1) May I suggest (which is why I reopened this "bug") that you add a small
> example at the places you've mentioned.
> Perhaps one similar to mine e.g.:
> This means that "^a[^$]" would not match the single line "a", but only e.g.
> "ab"
>
> 2) Further, it would perhaps make sense to tell what this particular condition
> is... I guess "the end of the current input line has been met" (in contrast to
> "I found an end-of-line character").


It very clearly says that [...] always matches a character. It is also
very clear, in the section "Characters and metacharacters" that $ is a
metacharacter only outside square brackets. I really don't think it is
worth saying any more.

> So basically people are looking for kinda binarygrep (just google around, I'm
> not the only one missing this).


Perhaps pcregrep with the -M option might do what you want.

Philip

