[pcre-dev] [Bug 2642] New: Searching with PCRE2_MATCH_INVALI…

Author: admin
Date:  
To: pcre-dev
Subject: [pcre-dev] [Bug 2642] New: Searching with PCRE2_MATCH_INVALID_UTF and PCRE2_CASELESS not working in binary files
https://bugs.exim.org/show_bug.cgi?id=2642

            Bug ID: 2642
           Summary: Searching with PCRE2_MATCH_INVALID_UTF and
                    PCRE2_CASELESS not working in binary files
           Product: PCRE
           Version: 10.35 (PCRE2)
          Hardware: x86-64
                OS: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
          Assignee: Philip.Hazel@???
          Reporter: tempelmann@???
                CC: pcre-dev@???


Created attachment 1334
--> https://bugs.exim.org/attachment.cgi?id=1334&action=edit
the binary file with the subject data

(See also my post on the developers mailing list titled "Getting crash when
searching binary data with case-insensitive option")

PCRE2 currently seems unable to find plain ASCII text in binary files when the
case-insensitive option is set.

I have attached a sample binary file that shows this. The file contains the
string "AWAVAUATSH", but searching for it, or for any case variation of it,
fails when I use the PCRE2_CASELESS option. Without PCRE2_CASELESS, the search
works.

I see no logical reason why this shouldn't work. Adding the caseless option
simply means that the search tree gets bigger, with more decision cases.
And since it works when searching in plain text files, it should work as well
in files that contain invalid Unicode sequences (i.e. files that are considered
binary). The search pattern is still inside that file and should be found.

Here's the test code.


#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PCRE2_CODE_UNIT_WIDTH 8
#include "pcre2.h"

int main(int argc, const char * argv[])
{
    {
        const char *find = "AWAVAUATSH";


        uint32_t regexOptions = PCRE2_MATCH_INVALID_UTF | PCRE2_UTF | PCRE2_CASELESS;
        uint32_t matchOptions = PCRE2_NOTBOL | PCRE2_NOTEOL | PCRE2_NOTEMPTY;


        int errNum = 0; PCRE2_SIZE errOfs = 0;
        pcre2_code *regEx2 = pcre2_compile_8 ((PCRE2_SPTR)find, PCRE2_ZERO_TERMINATED, regexOptions, &errNum, &errOfs, NULL);
        pcre2_match_data *regEx2Match = pcre2_match_data_create_from_pattern (regEx2, NULL);


        size_t bufLen = 32 * 1024 * 1024; // 32 MB, in case we test larger files
        void *bufPtr = malloc (bufLen);
        int fd = open ("pcre2_subject_sample", O_RDONLY);
        if (fd < 0) {
            printf("File not found! Please fix the path in the code.\n");
            return 1;
        }
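        // read() returns ssize_t and reports errors as -1
        ssize_t actualLen = read (fd, bufPtr, bufLen);
        if (actualLen < 0) {
            printf("Reading the file failed.\n");
            return 1;
        }


        // Search the buffer that was just read (bufPtr)
        int ok = pcre2_match_8 (regEx2, (PCRE2_SPTR8)bufPtr, (PCRE2_SIZE)actualLen, 0, matchOptions, regEx2Match, NULL);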
        if (ok > 0) {
            printf("Pattern found\n");
        } else {
            printf("Pattern NOT found\n");
        }
    }


    return 0;
}
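
To cross-check the expected result independently of PCRE2, a naive byte-wise
search that folds only ASCII letters could be run over the same buffer. This is
just a sketch and not part of PCRE2; the helper names ascii_lower and
find_ascii_caseless are made up for illustration. It treats the subject as raw
bytes, so invalid UTF-8 sequences around the text are skipped over without any
special handling.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fold ASCII 'A'..'Z' to lowercase, leave all other
   byte values (including bytes >= 0x80) unchanged. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: return the byte offset of the first ASCII
   case-insensitive occurrence of needle in hay, or -1 if there is none
   (an empty needle is reported as not found). */
static long find_ascii_caseless(const unsigned char *hay, size_t hay_len,
                                const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || n > hay_len) {
        return -1;
    }
    for (size_t i = 0; i + n <= hay_len; i++) {
        size_t j = 0;
        while (j < n &&
               ascii_lower(hay[i + j]) == ascii_lower((unsigned char)needle[j])) {
            j++;
        }
        if (j == n) {
            return (long)i;
        }
    }
    return -1;
}
```

In the test program above, this could be called as
find_ascii_caseless(bufPtr, (size_t)actualLen, "awavauatsh") after the read. A
non-negative result would confirm that the text is present in the subject
regardless of what pcre2_match reports.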

