Re: [pcre-dev] PCRE 7.3 release candidate for testing

Author: Sheri
Date:
To: pcre-dev
Subject: Re: [pcre-dev] PCRE 7.3 release candidate for testing

Philip Hazel wrote:
> On Mon, 20 Aug 2007, Sheri wrote:
>
>
>> That sounds promising. Multiline ^ and $ need to continue to behave
>> properly (e.g., change 15 for version 7.1, which prevents "(?m)^$" from
>> matching between the \r and \n in "A\r\nB" for ANY and ANYCRLF,
>> hopefully doesn't depend in part on the skip routine).
>>
>
> Indeed it doesn't.
>
>
>> FWIW, for my own usage ANYCRLF has been just fine, wonderful even :D.
>>
>
> Oh good, a happy customer! :-)
>
>
>> That's the reason it became our default. Perhaps there will be fewer
>> issues for less informed users when there's a metacharacter for ANYCRLF.
>>
>
> I asked my contact in the Perl developers about that, and he's asking
> Larry. I don't really want to introduce one on my own that might be
> taken later for something else by Perl. (We coordinated on the
> introduction of \R.) Trouble is, there are very few \-letter escapes
> left. I know (?>\r\n|[\r\n]) is a lot to type, but it *is* currently
> available.
>
>
>> I haven't seen any apps other than ours extending the newline options to
>> users. I previously suggested to Nuno that newline options be made
>> available to PHP users (don't know the status). With your fix, the
>> option would already be there. Without it, users are stuck with
>> whichever newline option was made the default. Which newline would be
>> wanted has more to do with the user's data than anything else, and is
>> especially useful for processing multiline data with ^ and $ matching
>> where they should.
>>
>> About the issue of it needing to be at the start of the pattern, could
>> it possibly also be supported after a leading (?#comment) ?
>>
>
> Not (I currently think) without extra special code, which I'm loath to
> introduce just for this case. But I'll think about it; as I write I'm
> being niggled by an idea that might make it not so bad.
>
> Philip
>
>
Hi Philip,

I composed the following prior to your recent posting --

Re skipping over a leading comment, I am aware of a situation where the
user has embedded some script type code in a leading comment (to trigger
single vs global matching), but it's not something for which there are no
alternative ways of handling. Still, it would be nice to preserve the
PCRE-neutral quality of a leading comment.

Just a thought -- whatever's done, might want to also consider making it
flexible enough to potentially support other external options in the
future, especially if it has a positional requirement. E.g., Notempty is
another option which could enhance end-user power. DupNames is another,
though that one may generally need backend support. Even the newline
options can be enhanced with backend support. For example, with our
app's new (global) functions (where newline options are externally set),
when multiline, after an empty match at a \r, if \n is the next
character we skip it (for ANY, ANYCRLF and CRLF).
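
As a rough sketch of the skip described above (not the app's actual code), a
global matching loop could look something like the following in Python. The
names find_all_spans and skip_crlf are hypothetical, and because Python's re
module has no newline options, the lookahead (?=[\r\n]) is only a stand-in
for an ANYCRLF-style multiline $.

```python
import re

def find_all_spans(pattern, subject, skip_crlf=True):
    """Hypothetical helper: return (start, end) spans of successive matches.

    After an empty match the loop has to advance by hand. With skip_crlf
    set, an empty match sitting at a \r that is followed by \n advances
    past both characters, so no match is attempted between them.
    """
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.end() > m.start():
            pos = m.end()              # non-empty match: resume at its end
        elif skip_crlf and subject.startswith("\r\n", m.end()):
            pos = m.end() + 2          # empty match at \r before \n: skip both
        else:
            pos = m.end() + 1          # ordinary empty match: advance by one
    return spans

# Stand-in for an end-of-line assertion that recognises \r, \n and \r\n.
eol = r"(?=[\r\n])"

with_skip = find_all_spans(eol, "A\r\nB")
without_skip = find_all_spans(eol, "A\r\nB", skip_crlf=False)
print(with_skip)     # [(1, 1)]
print(without_skip)  # [(1, 1), (2, 2)]
```

Without the skip, the second empty match lands between the \r and the \n,
which is the position a CRLF-aware newline setting is meant to avoid.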

About the metacharacter, you will never find (?>\r\n|[\r\n]) in a novice
pattern. Availability of a metacharacter would help to standardize how
linebreaks are searched. It's regrettable that non-Unicode code pages for
Windows have unrelated uses for \x85 et al. In English \x85 is a
horizontal ellipsis (perhaps a rarity, but still a possibility in
non-UTF8 data), but in Asian languages, the conflict is more objectionable.
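
To make the \x85 conflict concrete, here is a small Python check of what the
byte means under a few encodings. The choice of cp1252, latin-1 and cp932 as
examples is an assumption made for illustration; the point is only that the
same byte is a line break in one reading and ordinary text in the others.

```python
raw = b"\x85"

# Windows-1252 (Western European): 0x85 is the horizontal ellipsis, U+2026.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps each byte to the same code point, so 0x85 becomes U+0085
# (NEL, "next line"), which Unicode-aware tools treat as a line break.
as_latin1 = raw.decode("latin-1")

# In a double-byte code page such as cp932 (Japanese), 0x85 can be the
# second byte of a two-byte character, so it is not a character by itself.
as_cp932 = b"\x83\x85".decode("cp932")

print(as_cp1252 == "\u2026")          # True
print(as_latin1 == "\u0085")          # True
print("a\u0085b".splitlines())        # ['a', 'b']
print(len(as_cp932))                  # 1
```

A matcher that treats a raw 0x85 byte as a newline would therefore split
cp1252 text at every ellipsis, and could split a cp932 character in half.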

Currently I use \r?\n quite a bit for matching linebreaks (I don't have
any Mac-style data with \r line ends). Sometimes I use alternation
(\r\n|\n|\r). Often I add \r\n to a negated character class to avoid
matching line breaks, e.g., [^a\r\n]. Obviously that would exempt \r
and \n whether they are together or not.
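
A quick way to compare the three idioms is to run them over a string that
mixes all three line-end styles. The sketch below uses Python's re module
rather than PCRE, and the sample strings are made up for illustration.

```python
import re

text = "one\r\ntwo\nthree\rfour"

# \r?\n handles CRLF and LF, but leaves a lone \r (old Mac style) alone.
by_optional_cr = re.split(r"\r?\n", text)

# The alternation handles all three; \r\n must come first so that a CRLF
# pair is consumed as one line break rather than two.
by_alternation = re.split(r"\r\n|\n|\r", text)

# A negated class excludes \r and \n individually, whether or not they
# appear together as a pair.
runs = re.findall(r"[^a\r\n]+", "ba\r\ncd\re")

print(by_optional_cr)   # ['one', 'two', 'three\rfour']
print(by_alternation)   # ['one', 'two', 'three', 'four']
print(runs)             # ['b', 'cd', 'e']
```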

Would a metacharacter for (?>\r\n|[\r\n]) work in a negated character
class? Probably not, I see that \R is documented to mean "R" in a
character class.

Regards,
Sheri
