Re: [pcre-dev] Getting crash when searching binary data with…

Author: Petr Pisar
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Getting crash when searching binary data with case-insensitive option
On Mon, Sep 14, 2020 at 11:05:21PM +0200, Thomas Tempelmann via Pcre-dev wrote:
> I have uploaded a rather short binary file here:
> http://files.tempel.org/tmp/pcre2_subject_sample.
>
> If you use my sample code to search for "AWAVAUATSH", it won't find it with
> PCRE2_MATCH_INVALID_UTF but will find it with the PCRE2_UTF option (no
> crash in this small file, though).
> So, whenever I use PCRE2_MATCH_INVALID_UTF, text won't be found at all in
> binary files, it seems. That contradicts the docs and Philip's suggestion,
> though.
>
> What am I doing wrong?
>

You need PCRE2_MATCH_INVALID_UTF. That is the way to match in binary data,
which the Mach-O binary is, when you also need Unicode handling.

The pcre2grep invocation closest to your code looks like this:

$ pcre2grep -l --binary-files=binary -i -U AWAVAUATSH pcre2_subject_sample
Binary file pcre2_subject_sample matches

I looked at your code, compared it to pcre2grep, made your code similar to
pcre2grep, and I found these two differences:

(1) pcre2grep does not invoke pcre2_match_data_create_from_pattern_8().

(2) pcre2grep searches the file in smaller lumps: 7901, 236, 3, 1917, and
139 bytes. That adds up to 156 bytes less than the size of the
pcre2_subject_sample file. It looks like a bug in pcre2grep, unless it's
some kind of smart optimization.
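
As a quick check of that arithmetic, here is a small C sketch that sums the
five lump sizes and subtracts the total from the 10352-byte sample size (the
value the modified code below checks for). The names sum_lumps,
observed_lumps, sample_file_size and unsearched_bytes are made up for
illustration; they are not part of pcre2grep.

```c
#include <stddef.h>

/* Hypothetical helper: adds up the sizes of the lumps that were searched. */
size_t sum_lumps(const size_t *lumps, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += lumps[i];
    }
    return total;
}

/* Lump sizes observed with pcre2grep, and the sample file size in bytes. */
const size_t observed_lumps[] = { 7901, 236, 3, 1917, 139 };
const size_t sample_file_size = 10352;

/* Number of bytes of the file that are not covered by the observed lumps. */
size_t unsearched_bytes(void)
{
    size_t count = sizeof observed_lumps / sizeof observed_lumps[0];
    return sample_file_size - sum_lumps(observed_lumps, count);
}
```

The lumps total 10196 bytes, which leaves the 156-byte difference mentioned
above.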

I also corrected your code to search only the file length, not the whole
heap-allocated 32 MB, whose trailing part could be uninitialized binary
garbage.
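
To illustrate the point about the subject length, here is a minimal sketch of
reading a file into a larger buffer while keeping track of how many bytes were
really stored. The helper name read_up_to is hypothetical, and the loop over
short reads is my own addition for robustness; it is not something the code
further down does.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: reads from fd into buf until end of file or until
 * cap bytes are stored. Returns the number of bytes actually read, or -1
 * on a read error.
 */
ssize_t read_up_to(int fd, void *buf, size_t cap)
{
    size_t used = 0;
    while (used < cap) {
        ssize_t n = read(fd, (char *)buf + used, cap - used);
        if (n < 0) {
            if (errno == EINTR) {
                continue; /* interrupted by a signal, try again */
            }
            return -1;
        }
        if (n == 0) {
            break; /* end of file */
        }
        used += (size_t)n;
    }
    return (ssize_t)used;
}
```

The returned count, not the buffer capacity, is what should be passed as the
subject length to pcre2_match_8(), so that the uninitialized tail of the
buffer is never searched.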

But I found what triggers the misbehaviour of your code. It's the
PCRE2_CASELESS option. Without it, it works like a charm.

For your information, this is how I changed your code:

//
// main.c
// PCRE2_Binary_Search
//
// Created by Thomas Tempelmann on 14.09.20.
//

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

int main(int argc, const char * argv[])
{
    const char *find = "AWAVAUATSH";


    uint32_t regexOptions = PCRE2_UTF | PCRE2_CASELESS | PCRE2_MATCH_INVALID_UTF;
    uint32_t matchOptions = 0;


    int errNum = 0; PCRE2_SIZE errOfs = 0;
    pcre2_code *regEx2 = pcre2_compile_8 ((PCRE2_SPTR)find, 10/*PCRE2_ZERO_TERMINATED*/, regexOptions, &errNum, &errOfs, NULL);
    if (!regEx2) {
        printf("pcre2_compile_8() failed.\n");
        return 1;
    }
    pcre2_match_data *regEx2Match = pcre2_match_data_create_from_pattern_8 (regEx2, NULL);
    if (!regEx2Match) {
        printf("pcre2_match_data_create_from_pattern() failed.\n");
        return 1;
    }


    size_t dataLen = 32 * 1024 * 1024; // 32 MB
    void *dataPtr = malloc (dataLen);
    if (!dataPtr) {
        printf("malloc() failed.\n");
        return 1;
    }
    int fd = open ("/tmp/pcre2_subject_sample", O_RDONLY);
    if (fd == -1) {
        printf("open() failed.\n");
        return 1;
    }
    ssize_t dataRead = read (fd, dataPtr, dataLen);
    if (dataRead != 10352) {
        printf("read() did not return 10352 bytes.\n");
        return 1;
    }


    errNum = pcre2_match_8 (regEx2, (PCRE2_SPTR8)dataPtr, (PCRE2_SIZE)dataRead, 0, matchOptions, regEx2Match, NULL);
    if (errNum >= 1) {
        printf("A match found.\n");
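    } else if (errNum == 0) {
        printf("The match data block is too small.\n");
    } else if (errNum == PCRE2_ERROR_NOMATCH) {
        printf("No match found.\n");
    } else {
        // Any other negative value is a matching error, not a plain "no match".
        printf("pcre2_match_8() failed with error %d.\n", errNum);
    }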


    return 0;
}


-- Petr