Author: Philip Hazel
Date:
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing

On Tue, 21 Aug 2007, I wrote:
> On Mon, 20 Aug 2007, Sheri wrote:
>
> > That sounds promising. Multiline ^ and $ need to continue to behave
> > properly (e.g., change 15 for version 7.1, which prevents "(?m)^$" from
> > matching between the \r and \n in "A\r\nB" for ANY and ANYCRLF,
> > hopefully doesn't depend in part on the skip routine).
>
> Indeed it doesn't.

The promise, alas, is unfulfilled. I was forgetting about alternatives.
For a simple pattern such as .+A you can tell that it was the dot that
failed, but if there are alternatives, it all goes wrong. I have also
thought about weird patterns like (.|X)A, and then it struck me that
someone might also write (.|\n)A, and at that point I realised that this
really isn't solvable by this kind of fudge. (It also can't work with
pcre_dfa_exec(), I realised.)

I am now wondering if perhaps I should back off the infamous change 46.
However, that would mean that .+A would match \nA if the subject is
\r\nA and the newline setting is CRLF (but not if it's ANYCRLF). This is
unhappy, because .A is a standard way of matching "A not at the start of
a line". Sigh. Mumble. Gnash teeth. Tear hair.
Which is the lesser of the two evils? I don't know!
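
To make the trade-off concrete, here is a toy model in Python. It is not PCRE's code, only a sketch under stated assumptions: dot refuses to match at a position where a newline sequence starts, and the change under discussion is assumed to amount to stepping past a whole \r\n after a failed attempt at that position (the thread does not spell this out). The names newline_len, match_dot_plus_a, search and skip_crlf are all hypothetical.

```python
def newline_len(subject, pos, convention):
    """Length of the newline sequence starting at pos, or 0 if none."""
    if convention == "CRLF":
        return 2 if subject.startswith("\r\n", pos) else 0
    if convention == "ANYCRLF":
        if subject.startswith("\r\n", pos):
            return 2
        return 1 if subject[pos:pos + 1] in ("\r", "\n") else 0
    raise ValueError("unknown convention: " + convention)


def match_dot_plus_a(subject, start, convention):
    """Try the pattern .+A at `start`; return the matched text or None."""
    pos = start
    # Greedy .+ : consume characters until a newline sequence begins.
    while pos < len(subject) and newline_len(subject, pos, convention) == 0:
        pos += 1
    # Backtrack, keeping at least one character for the dot.
    while pos > start:
        if subject.startswith("A", pos):
            return subject[start:pos + 1]
        pos -= 1
    return None


def search(subject, convention, skip_crlf):
    """Unanchored search; skip_crlf models the assumed two-character advance."""
    start = 0
    while start < len(subject):
        found = match_dot_plus_a(subject, start, convention)
        if found is not None:
            return found
        if skip_crlf and subject.startswith("\r\n", start):
            start += 2  # step over the whole CRLF
        else:
            start += 1  # plain one-character advance
    return None


print(repr(search("\r\nA", "CRLF", skip_crlf=False)))
print(repr(search("\r\nA", "CRLF", skip_crlf=True)))
print(repr(search("\r\nA", "ANYCRLF", skip_crlf=False)))
```

In this model the one-character advance lets .+A match \nA under CRLF, because a lone \n is not a newline there, while ANYCRLF fails either way, because the lone \n still counts as a newline.
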
Providing a means for the user to vary the newline setting is perhaps
a sensible thing to do for now, I guess. It might help.

The other thing that I could easily do (I think; I've checked the code,
but not done it) is to determine whether \r or \n is explicitly used in
the pattern (either as an escape or literally), and in that case only,
back off change 46. I think that makes some kind of sense too.

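
The check described above could look roughly like the following Python sketch. It is an illustration only, not how PCRE's compiler does it: the function name mentions_cr_or_lf is hypothetical, and the scan is deliberately simplistic. It assumes that only a literal CR or LF character or the escapes \r and \n count, so it ignores forms such as \x0d, octal escapes, \cM and \Q...\E quoting.

```python
def mentions_cr_or_lf(pattern):
    """Return True if the pattern text refers to CR or LF explicitly.

    Simplified scan: literal CR/LF characters and the escapes \\r and \\n
    are recognised; an escaped backslash is skipped as a pair so that the
    text '\\\\n' (backslash, backslash, n) is not mistaken for a newline.
    """
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\r" or ch == "\n":
            return True  # literal CR or LF in the pattern text
        if ch == "\\" and i + 1 < len(pattern):
            if pattern[i + 1] in ("r", "n", "\r", "\n"):
                return True  # \r or \n escape, or an escaped literal CR/LF
            i += 2  # skip the escape pair, e.g. \\ or \d
            continue
        i += 1
    return False


for pat in (r".+A", r"(.|X)A", r"(.|\n)A", "a\nb", r"a\\nb"):
    print(repr(pat), mentions_cr_or_lf(pat))
```

A caller would then apply the special start-of-match handling only when this returns False, so a pattern such as (.|\n)A keeps the plain one-character advance.
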
Philip
--
Philip Hazel, University of Cambridge Computing Service.