[pcre-dev] [Bug 1190] lookaround doesn't work with startoffs…

Author: Philip Hazel
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 1190] lookaround doesn't work with startoffset
------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1190




--- Comment #3 from Philip Hazel <ph10@???> 2011-12-28 11:08:37 ---
On Tue, 27 Dec 2011, Alan Lehotsky wrote:

> Trying to match the pattern (from page 66 of Mastering Regular Expressions, 3rd
> Edition)
>
> (?<=\d)(?=(\d\d\d)+(?!\d)
>
> with the source string "1234567"
>
> This fails if my search loop uses the startoffset argument to pcre_exec() to
> advance thru the search string (leaving the subject and length unchanged as I
> find successive match points).
>
> But it does work if I advance the subject ptr and decrement the length, and use
> a zero for the startoffset on each call.


The pcretest program has facilities for trying both of these methods,
and for me it gives the same result both times:

PCRE version 8.12 2011-01-15

/(?<=\d)(?=(\d\d\d)+(?!\d))/g+
    1234567
 0: 
 0+ 234567
 1: 567
 0: 
 0+ 567
 1: 567


/(?<=\d)(?=(\d\d\d)+(?!\d))/G+
    1234567
 0: 
 0+ 234567
 1: 567
 0: 
 0+ 567
 1: 567


The /g option keeps the subject unchanged and passes an increasing
startoffset to pcre_exec(), whereas the /G option advances the subject
pointer (and reduces the length) after each match. The /+ option makes
pcretest output the rest of the string that follows a match, so you can
see exactly where it matches an empty string. Note that Perl with /g
gives exactly the same results.
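
For anyone who wants to try the two loop styles without pcretest, here
is a minimal sketch in Python. It assumes Python's re module as a
stand-in for PCRE: the pos argument of search() plays roughly the role
of startoffset, and slicing the subject plays the role of advancing the
pointer. The function names are made up for illustration, and stepping
one character past an empty match is a simplification of what a real
/g loop does.

```python
import re

# Python's re is used here as a stand-in for PCRE (an assumption of this sketch).
PATTERN = re.compile(r"(?<=\d)(?=(\d\d\d)+(?!\d))")


def matches_by_offset(subject):
    """Keep the subject fixed and advance a start offset (the /g style)."""
    found, pos = [], 0
    while pos <= len(subject):
        m = PATTERN.search(subject, pos)
        if m is None:
            break
        found.append(m.start())
        # Step one character past an empty match so the loop makes progress.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return found


def matches_by_slicing(subject):
    """Pass a shortened subject on each call (the /G style)."""
    found, base = [], 0
    while base <= len(subject):
        m = PATTERN.search(subject[base:])
        if m is None:
            break
        found.append(base + m.start())
        end = base + m.end()
        # Same empty-match bump as above, in absolute offsets.
        base = end + 1 if m.end() == m.start() else end
    return found


print(matches_by_offset("1234567"))   # [1, 4]
print(matches_by_slicing("1234567"))  # [1, 4]
```

Both loops report matches at offsets 1 and 4, which correspond to the
"0+ 234567" and "0+ 567" lines in the pcretest output above.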

If you are worried that the /G option is looking behind in order to give
the first match (as I momentarily was), you are mistaken. That match
happens when the string is passed as "1234567". Remember that when an
unanchored pattern is matched, the matcher itself advances through the
string, so the first match is found at offset 1, where the "1" is still
available to the lookbehind. A match against "234567" finds only the
second match:

/(?<=\d)(?=(\d\d\d)+(?!\d))/+
    234567
 0: 
 0+ 567
 1: 567
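
The same point can be checked with a short Python sketch, again assuming
Python's re module as a stand-in for PCRE; the helper name is
hypothetical. The first two calls mirror the pcretest runs above. The
third passes the whole subject with a start position of 1, which in
Python leaves the leading "1" visible to the lookbehind, unlike passing
the shortened string.

```python
import re

# Python's re is used here as a stand-in for PCRE (an assumption of this sketch).
PATTERN = re.compile(r"(?<=\d)(?=(\d\d\d)+(?!\d))")


def rest_after_first_match(subject, pos=0):
    """Return the text that follows the first match, or None if nothing matches."""
    m = PATTERN.search(subject, pos)
    return None if m is None else subject[m.end():]


# Whole subject: the search advances internally from offset 0 to offset 1.
print(rest_after_first_match("1234567"))     # 234567

# Shortened subject: no digit precedes the "2", so only the later match is found.
print(rest_after_first_match("234567"))      # 567

# Whole subject with a start position: the lookbehind can still see the "1".
print(rest_after_first_match("1234567", 1))  # 234567
```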


