Author: ph10
Date:  
To: pcunite
CC: pcre-dev
Subject: Re: [pcre-dev] Matching file contents as one string using PCRE_DOTALL
On Wed, 8 May 2013, pcunite@??? wrote:

> Thank you for helping.
>
> However, when a file only contains a single term: "hello", the regex
> (?=.*\p{Any}hello.*) does NOT match it.


It wouldn't. That pattern looks for "zero or more of any character
(except newline)", followed by "any one character", followed by "hello",
followed by "zero or more of any character (except newline)". So it has
to match at least 6 characters; "hello" is only 5 characters.
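
The length argument above can be checked with a small sketch. It uses
Python's re module rather than PCRE itself, so two things are assumptions:
Python has no \p{Any}, so [\s\S] stands in for "any one character", and the
helper name matches() is made up for this illustration.

```python
import re

# [\s\S] is a stand-in for \p{Any} (Python's re does not support \p{...}).
# By default "." matches any character except "\n", as in PCRE's default mode.
PATTERN = re.compile(r"(?=.*[\s\S]hello.*)")

def matches(subject: str) -> bool:
    """Return True if the lookahead succeeds anywhere in the subject."""
    return PATTERN.search(subject) is not None

print(matches("hello"))    # False: nothing is left for the "any one character"
print(matches("xhello"))   # True: "x" fills the extra position
print(matches("\nhello"))  # True: the stand-in class also accepts a newline
```

The five-character subject fails because the pattern needs one character
before "hello" no matter how the two ".*" parts are matched.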

> Also, \p is slow. Since (?=.*hello.*) is working for Unicode (which I
> need) and seems fast my question ... I guess ... is related to why NOT
> use PCRE_DOTALL and why does PCRE not treat a subject string as a
> single line with no regard for line endings as stated in the
> documentation?


ALL that PCRE_DOTALL does is make "." match newline characters. By
default it does not, for Perl compatibility. If that's what you want, then
use it. Note, however, the caveat about CRLF newlines in the spec:

  PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches a
character of any value, including one that indicates a newline.
However, it only ever matches one character, even if newlines are
coded as CRLF. Without this option, a dot does not match when the
current position is at a newline. This option is equivalent to Perl's
/s option, and it can be changed within a pattern by a (?s) option
setting. A negative class such as [^a] always matches newline
characters, independent of the setting of this option.
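
A rough illustration of the behaviour described in that excerpt, again
using Python's re module as a stand-in for PCRE (an assumption: re.DOTALL
plays the role of PCRE_DOTALL, and Python always treats "\n" alone as the
newline, whereas PCRE's newline convention is configurable). The subject
string and variable names are invented for the example.

```python
import re

subject = "first line\r\nhello"

# Without DOTALL, "." stops at the newline, so the match cannot cross lines.
default_match = re.search(r"first.*hello", subject)

# With DOTALL, or the inline (?s) setting, "." matches the newline too.
dotall_match = re.search(r"first.*hello", subject, re.DOTALL)
inline_match = re.search(r"(?s)first.*hello", subject)

# Even with DOTALL, one dot consumes exactly one character, so a CRLF
# pair needs two dots.
one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)

# A negative class matches a newline regardless of the DOTALL setting.
negated_class = re.fullmatch(r"[^a]", "\n")

print(default_match is None)      # True
print(dotall_match is not None)   # True
print(inline_match is not None)   # True
print(one_dot is None)            # True
print(two_dots is not None)       # True
print(negated_class is not None)  # True
```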

> Perhaps it's an issue with this bug: http://bugs.exim.org/show_bug.cgi?id=1351


No; that is a pcregrep bug.

Philip

--
Philip Hazel