[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

Author: Graycode
Date:
To: pcre-dev
Subject: [pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

http://bugs.exim.org/show_bug.cgi?id=1416

--- Comment #5 from Graycode <xgcode@???> 2013-11-21 19:20:36 ---
Oh, I see now with your description that it was doing a global search
and replace operation. That's an important factor.

So now it may also matter for PCRE itself, because I think Philip has
proposed that a future PCRE have its own data-replacement method.
Otherwise I would not have looked into this any further.

In a global operation, after something has been found, the next round
of search begins after the end of what was found. But what to do on
the occasion of a zero-length match? Just incrementing the subject
position by zero is bad unless we enjoy infinite loops.

PHP and PCRETEST try to re-run the expression at the same position with
additional flags (NOTEMPTY | ANCHORED) in an attempt to push for a
non-zero-length match. If that fails, there is still the zero-length
match to deal with.

The choices may include:

1: Stop the process after a zero-length is matched.
Subsequent iterations that may lead to further matches are not
done. That has been my choice because it avoids complications
and I'm personally not very concerned about Perl compatibility.

2: Act as if the match length was greater than 0, skip the next
subject character, and then continue trying to match at that
++incremented position.

PHP is doing choice #2, but when it needs to increment the position
PHP currently advances by a single byte, whereas your UTF-8 character
is 3 bytes long.

> If the bug was present I would expect to see four matches, not two.

It's actually the PCRETEST program that honors the PCRE_UTF8 flag and
steps over your whole (3-byte) UTF-8 character after a zero-length
match. PHP uses the same underlying PCRE engine but does not take UTF-8
into account at that point. Because PHP skips a single byte, the next
start position can land in the middle of a character. Your 2nd
iteration for subject "m" would be "\xB6\xB8", which is invalid UTF-8.
I wonder if that contributed to the "EF BF BD" seen in your original
"Actual output".
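
To make the four-versus-two count concrete, here is a minimal Python
sketch of choice #2. It is only an illustration under several
assumptions: Python's re module on bytes stands in for PCRE, the
NOTEMPTY | ANCHORED retry is left out, the names utf8_char_len and
global_match_starts are made up for this example, and the subject is
assumed to be the single 3-byte character E0 B6 B8.

```python
import re

def utf8_char_len(lead: int) -> int:
    """Byte length of the UTF-8 character that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # continuation or invalid byte: fall back to one byte

def global_match_starts(pattern: bytes, subject: bytes, utf8: bool) -> list:
    """Collect match offsets of a global search (hypothetical helper).

    After a zero-length match the position is advanced by one byte, or by
    one whole UTF-8 character when utf8 is True.
    """
    regex = re.compile(pattern)
    starts = []
    pos = 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        starts.append(m.start())
        if m.end() > m.start():
            pos = m.end()            # normal case: continue after the match
        elif m.end() == len(subject):
            break                    # empty match at the very end: done
        else:
            step = utf8_char_len(subject[m.end()]) if utf8 else 1
            pos = m.end() + step     # skip one byte or one character

    return starts

subject = b"\xE0\xB6\xB8"            # assumed subject: one 3-byte character
pattern = rb"(?<!\xE0\xB6\x9A)"
print(global_match_starts(pattern, subject, utf8=False))  # [0, 1, 2, 3]
print(global_match_starts(pattern, subject, utf8=True))   # [0, 3]
```

Stepping by bytes tries offsets 1 and 2, which sit inside the
character, and so reports four matches. Stepping by characters reports
two.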

/-- Using Invalid UTF8 character coding --/

/(?<!\xE0\xB6\x9A)/8
\xB6\x9A
\x9A
\xB6\xB8
\xB8

/-- End of test --/

You might need a newer PCRETEST to see the errors; I don't know whether
the old version 7 did any UTF validation. The current one yields:
Error -10 (bad UTF-8 string) offset=0 reason=20
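
As a cross-check outside PCRE, the same four subject strings can be run
through a strict UTF-8 decoder. This is a small Python sketch, not what
PCRETEST does internally, and first_bad_offset is a made-up helper name.
It only shows that each string is already invalid at offset 0, and that
a lenient decoder substitutes U+FFFD, whose UTF-8 encoding is EF BF BD.

```python
# The four subject lines from the test above, as raw bytes.
subjects = [b"\xB6\x9A", b"\x9A", b"\xB6\xB8", b"\xB8"]

def first_bad_offset(data: bytes):
    """Return the offset of the first invalid UTF-8 byte, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

for s in subjects:
    # Every subject starts with a continuation byte, so offset 0 is bad.
    print(s.hex(" "), "-> first bad offset:", first_bad_offset(s))

# A lenient decoder replaces each stray byte with U+FFFD (EF BF BD).
repaired = b"\xB6\xB8".decode("utf-8", errors="replace").encode("utf-8")
print(repaired.hex(" "))  # ef bf bd ef bf bd
```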

In the PHP_PCRE.C that I linked to earlier, this seems to happen in
function php_pcre_replace_impl() starting at about line# 940. I am
somewhat surprised to see great similarity between that code and PCRE's
own PCRETEST, including many variable names and mentions of Perl.
However, it seems PHP's code has not been updated to advance by a whole
UTF-8 character after a zero-length match (at about line# 1117).

Yet we find PHP does acknowledge UTF-8 in preg_split(), which is
implemented in C function php_pcre_split_impl() starting at line# 1385.
Maybe they just forgot about the need for that in PHP's preg_replace().
I think it's interesting that PHP calls on PCRE with the expression
"/./u" as its method of counting the number of bytes in a UTF-8
character for the increment (line# 1502).
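
The "/./u" trick can be imitated in other regex engines. Below is a
rough Python analog, not PHP's actual code: match one character with
"." and take the UTF-8 byte length of what matched. The function name
and the use of re.DOTALL (so that a newline also counts as a character)
are assumptions made for the illustration.

```python
import re

# One arbitrary character; DOTALL so that "\n" is matched as well.
ONE_CHAR = re.compile(".", re.DOTALL)

def char_byte_len(text: str, index: int) -> int:
    """UTF-8 byte length of the character at `index`, or 0 at the end."""
    m = ONE_CHAR.match(text, index)
    return len(m.group().encode("utf-8")) if m else 0

text = "a\u0DB8"                 # "a" followed by U+0DB8 (E0 B6 B8)
print(char_byte_len(text, 0))    # 1
print(char_byte_len(text, 1))    # 3
```

Reading the length from the lead byte would avoid a regex call per
step, but reusing the engine keeps the definition of "one character" in
a single place.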

Note the PHP_PCRE.C that I linked to was just what was found when
searching for PHP version 5. There may be newer versions that have
resolved UTF8 issues but I didn't know where to look for them.

So my point is that global searching, as often done for
search-and-replace operations, can get complicated. As for PCRE, a
possible future search-and-replace method might need to be "Perl
Compatible" (the "PC" in PCRE). I'm uncertain whether that would also
mean that UTF validation of the resulting data is needed.

Note: If you're reading this and you happen to maintain PHP, please
consider adding a PHP preg_version() function which in turn calls
upon pcre_version(). That way it would be easy for users to find out
and tell us which underlying version of PCRE they are using, e.g.:
<?php
print(preg_version() . "\n");
?>

Regards,
Graycode
