Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Philip Hazel
Date:  
To: Sheri
CC: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing
On Thu, 16 Aug 2007, Sheri wrote:

> In giving this more thought, would say:
>
> Change 46 need not apply if ((dot matches everything) or (not multiline
> and pattern has no dots in it)). Otherwise, I think there would be the
> potential for empty matches between the \r and \n in a multiline pattern.


I suspect that dot is not the only thing that can cause trouble, but I
can't think of a good other example just at the moment.

In any case, there is a Problem: (?s) and (?m) can be used within
patterns, many times, meaning that some parts are (dot matches
everything) and some are not. Ditto for multiline. Back up at the
external level, how do you decide whether to bumpalong by one character
or two after a match fails?
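
To make the question concrete, here is a minimal sketch of a bumpalong decision driven by one external newline setting. The enum and the function name are hypothetical and are not PCRE's own code. The sketch only works because the setting is fixed for the whole match, which is exactly what a setting changeable inside the pattern would take away.

```c
#include <stddef.h>

/* Hypothetical newline conventions for this sketch; these are not
   PCRE's own constants. */
enum newline_mode { NL_MODE_LF, NL_MODE_CR, NL_MODE_CRLF, NL_MODE_ANY };

/* Return how far to advance the start offset after a match attempt
   at `pos` has failed. Returns 0 when `pos` is already at the end of
   the subject. */
size_t bumpalong(const char *subject, size_t len, size_t pos,
                 enum newline_mode nl)
{
    if (pos >= len)
        return 0;

    /* When CRLF is a valid line ending, step over "\r\n" as a unit so
       that no match attempt starts between the \r and the \n. */
    if ((nl == NL_MODE_CRLF || nl == NL_MODE_ANY) &&
        pos + 1 < len &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return 2;

    return 1;
}
```

With only one value of `nl` available at this level, there is no obvious way to answer for a pattern in which the relevant settings differ from one part to another.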

> For patterns such as "(?m)s*" and data "\r\n", change 46 is desirable
> when CRLF is a valid line ending.


Precisely.

I take your point about internalizing the newline options. However, I
judge it to be too late for 7.3 now, as I do really want to get that
release out the door. Indeed, thinking about it, it's going to be quite
a bit of work - and note that there is also a compile-time implication
because newlines are relevant when skipping comments and when skipping
ignorable white space. So I'm not sure how an internal setting should or
indeed could interact with that.

It would all be a lot easier if the requirement were to provide the
equivalent of a /option -- that is, some way of setting the option at
the start, without the complication of dealing with it inside the regex,
which includes preserving and restoring when it's inside parens.

Of course, Perl and pcretest (and other applications like, I think, PHP)
do this by having a delimited regex with option settings outside. I
chose not to do this inside PCRE because I thought that it was something
the application should do if it wanted to, and many applications might
not be able easily to supply suitable delimiters.

I suppose one could invent some new syntax and recognize, *at the start
of the pattern only*, things like (+NEWLINE=ANY) and hope that Perl will
never invent something that makes that valid. However, I don't really
like this idea....

...though of course you could do something like that outside in your
application and set the options for PCRE appropriately. In other words,
write your own wrapper that picked apart a string into options and
regex.
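
Such a wrapper might look roughly like the sketch below. It recognizes a leading (+NEWLINE=...) prefix, which is only the syntax floated above and not anything PCRE accepts, and hands back the rest of the string as the regex. The function name and the enum are hypothetical; an application would presumably translate the result into the corresponding PCRE_NEWLINE_* compile option before calling pcre_compile().

```c
#include <string.h>

/* Hypothetical result values; an application would map these onto
   the PCRE newline options of its choice. */
enum nl_choice { NLC_DEFAULT, NLC_CR, NLC_LF, NLC_CRLF, NLC_ANY };

/* If `input` begins with "(+NEWLINE=name)", store the named choice in
   *nl and return a pointer to the regex that follows the prefix.
   Otherwise set *nl to NLC_DEFAULT and return `input` unchanged, so
   that an unrecognized prefix is simply treated as part of the regex. */
const char *split_newline_prefix(const char *input, enum nl_choice *nl)
{
    static const struct { const char *name; enum nl_choice value; } table[] = {
        { "CR", NLC_CR }, { "LF", NLC_LF },
        { "CRLF", NLC_CRLF }, { "ANY", NLC_ANY }
    };
    static const char prefix[] = "(+NEWLINE=";
    const size_t plen = sizeof prefix - 1;

    *nl = NLC_DEFAULT;
    if (strncmp(input, prefix, plen) != 0)
        return input;

    const char *name = input + plen;
    const char *close = strchr(name, ')');
    if (close == NULL)
        return input;

    size_t nlen = (size_t)(close - name);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        /* Compare lengths first so that "CR" does not match "CRLF". */
        if (strlen(table[i].name) == nlen &&
            strncmp(name, table[i].name, nlen) == 0) {
            *nl = table[i].value;
            return close + 1;
        }
    }
    return input;
}
```

Because the prefix is only looked for at the very start of the string, nothing has to be preserved or restored inside parentheses.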

A cleaner change to PCRE would be to add a new interface pcre_compile3()
that takes a delimited regex with options settings. But then you have to
use delimiters... and invent rules for merging these textual options
with those specified in the variable.
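
For illustration, the front end of such an interface could be sketched as follows. Everything here is an assumption: the function name, the option bits, the flag letters (borrowed from Perl's i, m, s and x), the choice of taking the last occurrence of the delimiter as the closing one, and above all the merging rule, which here is a plain OR of the textual options into those already set by the caller.

```c
#include <string.h>

/* Hypothetical option bits for this sketch; not PCRE's own values. */
enum {
    OPT_CASELESS  = 1,
    OPT_MULTILINE = 2,
    OPT_DOTALL    = 4,
    OPT_EXTENDED  = 8
};

/* Split "<delim>pattern<delim>flags" into its parts. The pattern is
   copied into `pattern` (capacity `size` bytes) and the flags are
   ORed into *options, which holds the caller's options on entry.
   Returns 0 on success, or -1 on malformed input, in which case
   *options is left unchanged. */
int split_delimited(const char *input, char *pattern, size_t size,
                    int *options)
{
    if (strlen(input) < 2)
        return -1;

    char delim = input[0];
    /* Assumption: the last occurrence of the delimiter closes the
       pattern, so the delimiter may appear unescaped inside it. */
    const char *end = strrchr(input + 1, delim);
    if (end == NULL)
        return -1;

    size_t plen = (size_t)(end - (input + 1));
    if (plen >= size)
        return -1;

    int opts = *options;
    for (const char *p = end + 1; *p != '\0'; p++) {
        switch (*p) {
        case 'i': opts |= OPT_CASELESS;  break;
        case 'm': opts |= OPT_MULTILINE; break;
        case 's': opts |= OPT_DOTALL;    break;
        case 'x': opts |= OPT_EXTENDED;  break;
        default:  return -1;             /* unknown flag letter */
        }
    }

    memcpy(pattern, input + 1, plen);
    pattern[plen] = '\0';
    *options = opts;
    return 0;
}
```

An OR is the simplest merging rule, but it gives the text no way to switch off an option that the variable has set, which is one of the rules that would have to be invented.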

Philip

--
Philip Hazel, University of Cambridge Computing Service.