On Mon, Sep 14, 2020 at 11:05:21PM +0200, Thomas Tempelmann via Pcre-dev wrote:
> I have uploaded a rather short binary file here:
> http://files.tempel.org/tmp/pcre2_subject_sample.
>
> If you use my sample code to search for "AWAVAUATSH", it won't find it with
> PCRE2_MATCH_INVALID_UTF but will find it with the PCRE2_UTF option (no
> crash in this small file, though).
> So, whenever I use PCRE2_MATCH_INVALID_UTF, text won't be found at all in
> binary files, it seems. That contradicts the docs and Philip's suggestion,
> though.
>
> What am I doing wrong?
>
You do need PCRE2_MATCH_INVALID_UTF. It is the way to match in binary data,
which a Mach-O binary is, when you also need Unicode handling.
The pcre2grep invocation closest to your code looks like this:
$ pcre2grep -l --binary-files=binary -i -U AWAVAUATSH pcre2_subject_sample
Binary file pcre2_subject_sample matches
I looked at your code, compared it to pcre2grep, made your code more similar
to pcre2grep, and found these two differences:
(1) pcre2grep does not invoke pcre2_match_data_create_from_pattern_8().
(2) pcre2grep searches the file in several smaller lumps: 7901, 236, 3, 1917,
and 139 bytes. That adds up to 10196 bytes, which is 156 bytes less than the
10352-byte size of the pcre2_subject_sample file.
It looks like a bug in pcre2grep, unless it's some kind of smart
optimization.
I also corrected your code to search only the actual file length, not the
whole heap-allocated 32 MB, whose trailing part is uninitialized and could
contain binary garbage.
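A minimal sketch of one way to avoid the fixed 32 MB allocation altogether:
size the buffer from the file itself and report how many bytes were really
read. The helper name read_whole_file() is hypothetical and is not part of
the program below; it assumes POSIX open(), fstat() and read().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not in the original program): reads the whole file
 * into a freshly allocated buffer sized from fstat(). On success returns
 * the buffer and stores the number of bytes read in *len_out; on failure
 * returns NULL. The caller frees the buffer. */
static unsigned char *read_whole_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size < 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    /* malloc(0) may return NULL, so always request at least one byte. */
    unsigned char *buf = malloc(len ? len : 1);
    if (!buf) {
        close(fd);
        return NULL;
    }

    /* read() may return fewer bytes than requested, so loop. */
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0) {            /* read error: give up */
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)             /* file is shorter than fstat() reported */
            break;
        got += (size_t)n;
    }

    close(fd);
    *len_out = got;
    return buf;
}
```

The length stored in *len_out is what would then be passed as the subject
length to pcre2_match_8(), so no uninitialized bytes are ever searched.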
But I found what triggers the misbehaviour of your code: the
PCRE2_CASELESS option. Without it, it works like a charm.
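To confirm independently of PCRE2 that the text really is in the subject, a
plain byte-wise search can serve as a reference. The following is only a
sketch: find_ascii_caseless() is a hypothetical helper, not something
pcre2grep or PCRE2 provides, and it folds case only for ASCII letters
(it relies on tolower() in the default "C" locale).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference search (independent of PCRE2): returns the offset
 * of the first ASCII-case-insensitive occurrence of needle in the byte
 * buffer hay, or -1 if there is none. Zero bytes and bytes >= 0x80 in hay
 * are compared as-is. */
static long find_ascii_caseless(const unsigned char *hay, size_t hay_len,
                                const char *needle)
{
    assert(hay != NULL && needle != NULL);

    size_t n = strlen(needle);
    if (n == 0)
        return 0;               /* empty needle matches at the start */
    if (n > hay_len)
        return -1;

    for (size_t i = 0; i + n <= hay_len; i++) {
        size_t j = 0;
        while (j < n &&
               tolower(hay[i + j]) == tolower((unsigned char)needle[j]))
            j++;
        if (j == n)
            return (long)i;
    }
    return -1;
}
```

If this helper finds "AWAVAUATSH" in the buffer while pcre2_match_8() with
PCRE2_CASELESS does not, the problem is on the matching side and not in how
the file was read.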
For your information, this is how I changed your code:
//
// main.c
// PCRE2_Binary_Search
//
// Created by Thomas Tempelmann on 14.09.20.
//
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
int main(int argc, const char * argv[])
{
    const char *find = "AWAVAUATSH";
    uint32_t regexOptions = PCRE2_UTF | PCRE2_CASELESS | PCRE2_MATCH_INVALID_UTF;
    uint32_t matchOptions = 0;
    int errNum = 0; PCRE2_SIZE errOfs = 0;

    pcre2_code *regEx2 = pcre2_compile_8 ((PCRE2_SPTR)find, 10/*PCRE2_ZERO_TERMINATED*/, regexOptions, &errNum, &errOfs, NULL);
    if (!regEx2) {
        printf("pcre2_compile_8() failed.\n");
        return 1;
    }

    pcre2_match_data *regEx2Match = pcre2_match_data_create_from_pattern_8 (regEx2, NULL);
    if (!regEx2Match) {
        printf("pcre2_match_data_create_from_pattern() failed.\n");
        return 1;
    }

    size_t dataLen = 32 * 1024 * 1024; // 32 MB
    void *dataPtr = malloc (dataLen);
    if (!dataPtr) {
        printf("malloc() failed.\n");
        return 1;
    }

    int fd = open ("/tmp/pcre2_subject_sample", O_RDONLY);
    if (fd == -1) {
        printf("open() failed.\n");
        return 1;
    }

    ssize_t dataRead = read (fd, dataPtr, dataLen);
    if (dataRead != 10352) {
        printf("read() did not return 10352 bytes.\n");
        return 1;
    }

    errNum = pcre2_match_8 (regEx2, (PCRE2_SPTR8)dataPtr, (PCRE2_SIZE)dataRead, 0, matchOptions, regEx2Match, NULL);
    if (errNum >= 1) {
        printf("A match found.\n");
    } else if (errNum == 0) {
        printf("A matchblock is too small.\n");
    } else {
        printf("No match found.\n");
    }

    return 0;
}
-- Petr