[pcre-dev] [Bug 1074] New: Incorrect length check in match

Author: Zoltan Herczeg
Date:
To: pcre-dev
New topics: [pcre-dev] [Bug 1074] Incorrect length check in match_ref(...)
Subject: [pcre-dev] [Bug 1074] New: Incorrect length check in match_ref(...)

------- You are receiving this mail because: -------
You are on the CC list for the bug.

http://bugs.exim.org/show_bug.cgi?id=1074
           Summary: Incorrect length check in match_ref(...)
           Product: PCRE
           Version: 8.11
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
        AssignedTo: ph10@???
        ReportedBy: hzmester@???
                CC: pcre-dev@???

I had a discussion with Philip, and it turned out that some Unicode
uppercase/lowercase pairs have UTF-8 encodings of different lengths, for
example the pair 570 (U+023A) and 11365 (U+2C65). (I have no idea about the
glyphs, but it doesn't matter.)

Their UTF-8 representations (as C hex escapes):
\xc8\xba     = 570   (2 bytes)
\xe2\xb1\xa5 = 11365 (3 bytes)
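
The byte sequences above can be checked with a small encoder. This is only a
minimal sketch of standard UTF-8 encoding; utf8_encode is a hypothetical helper
written for this illustration and is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): encode one Unicode code point
   as UTF-8 into out, which must have room for 4 bytes.
   Returns the number of bytes written. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);           /* outside the Unicode range */
    if (cp < 0x80) {                    /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(570, buf) writes c8 ba and returns 2, while
utf8_encode(11365, buf) writes e2 b1 a5 and returns 3.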

The following regular expression incorrectly reports a match:

const char* pattern = "(\xc8\xba\xc8\xba\xc8\xba)?\\1";

on string:

const char* input = "\xc8\xba\xc8\xba\xc8\xba\xe2\xb1\xa5\xe2\xb1\xa5";

The input is basically char 570 repeated three times, followed by char 11365
repeated twice. The pattern also contains char 570 repeated three times.

Output: 0, 12, 0, 6

match_ref does an early byte-length check, which is not valid in this case:
three copies of 570 occupy the same number of bytes (3 x 2 = 6) as two copies
of 11365 (2 x 3 = 6), even though the character counts differ.
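
To illustrate why a byte-length check is not enough, here is a minimal sketch
of a caseless comparison that walks both strings character by character. It is
not PCRE's code and not necessarily how the bug should be fixed: decode_utf8,
toy_fold and caseless_ref_match are hypothetical names, the folding table knows
only the 570/11365 pair, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 character at p and store its byte length in *len.
   Assumes valid UTF-8 (no validation, for brevity). */
static unsigned long decode_utf8(const unsigned char *p, size_t *len)
{
    if (p[0] < 0x80) { *len = 1; return p[0]; }
    if ((p[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((p[0] & 0x1FUL) << 6) | (p[1] & 0x3FUL);
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((p[0] & 0x0FUL) << 12) | ((p[1] & 0x3FUL) << 6)
             | (p[2] & 0x3FUL);
    }
    *len = 4;
    return ((p[0] & 0x07UL) << 18) | ((p[1] & 0x3FUL) << 12)
         | ((p[2] & 0x3FUL) << 6) | (p[3] & 0x3FUL);
}

/* Toy case folding: only the pair from this report is known. */
static unsigned long toy_fold(unsigned long cp)
{
    return (cp == 570) ? 11365 : cp;
}

/* Compare the captured text (ref) with the subject one character at a time.
   Returns the number of subject bytes consumed on a match, or -1 if the
   subject runs out or a character differs. */
long caseless_ref_match(const unsigned char *ref, size_t ref_len,
                        const unsigned char *subj, size_t subj_len)
{
    size_t r = 0, s = 0;
    assert(ref != NULL && subj != NULL);
    while (r < ref_len) {
        size_t rl, sl;
        if (s >= subj_len) return -1;   /* fewer characters than the capture */
        if (toy_fold(decode_utf8(ref + r, &rl)) !=
            toy_fold(decode_utf8(subj + s, &sl)))
            return -1;
        r += rl;                        /* each side advances by the length */
        s += sl;                        /* of its own character             */
    }
    return (long)s;
}
```

With the 6-byte capture from this report, comparing against the 6 bytes of two
11365 characters returns -1 because the subject runs out after two characters,
while comparing against three 11365 characters (9 bytes) succeeds and consumes
9 bytes.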

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
