Re: [pcre-dev] Getting crash when searching binary data with…

Author: Thomas Tempelmann
Date:  
To: Philip Hazel
CC: pcre-dev
Subject: Re: [pcre-dev] Getting crash when searching binary data with case-insensitive option
Hi Philip,

> This is just a quick reply (sorry, out of time to reply in detail). You are indeed not doing this quite right. Search the PCRE2 documents for
>   PCRE2_MATCH_INVALID_UTF
>
> This option forces PCRE2_UTF and also enables support for matching by pcre2_match() in subject strings that contain invalid UTF sequences.


This doesn't seem to work for me, though:

When I add PCRE2_MATCH_INVALID_UTF to the options for pcre2_compile(), the crash goes away, but the pattern is no longer found in my sample binary file; it is still found in a plain text file, though.

OTOH, if I remove both PCRE2_MATCH_INVALID_UTF and PCRE2_UTF from the options, then I get no crash and successfully find the pattern in my sample binary file (not the locate db).
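To make the "no UTF options" case concrete, here is a minimal sketch of a byte-wise, ASCII-only caseless search over a buffer with an explicit length. This is not PCRE2 code and makes no claim about how PCRE2 works internally; the function names are hypothetical. It only illustrates that, when the data is treated as plain bytes, 00 bytes and bytes above 7F are ordinary bytes that do not stop the search.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fold ASCII upper-case letters to lower case.
   Bytes outside 'A'..'Z' are returned unchanged (no locale involved). */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical helper: return the offset of the first ASCII-caseless
   occurrence of needle in hay, or (size_t)-1 if there is none.
   Lengths are passed explicitly, so embedded 00 bytes are fine. */
size_t find_ascii_caseless(const unsigned char *hay, size_t hay_len,
                           const char *needle, size_t needle_len)
{
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0 || needle_len > hay_len)
        return (size_t)-1;

    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        size_t k = 0;
        while (k < needle_len &&
               ascii_fold(hay[i + k]) == ascii_fold((unsigned char)needle[k]))
            k++;
        if (k == needle_len)
            return i;   /* full match starting at byte offset i */
    }
    return (size_t)-1;
}
```

The buffer length is always passed in rather than derived with strlen(), since a binary file would otherwise appear to end at its first 00 byte.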

The binary file is a macOS library file (CoreFoundation), and the pattern is a plain ASCII name (all letters) that appears as a symbol in it. It's part of the symbol table, where lots of plain ASCII names are separated by single 00 bytes. None of that is invalid UTF, and even if PCRE2 considered a 00 byte invalid, it should restart at the next byte, which would eventually be the first byte of the pattern to find. So I don't see why this wouldn't work with the PCRE2_MATCH_INVALID_UTF option.
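As a sanity check of the claim that ASCII names separated by 00 bytes are valid UTF-8, here is a minimal validator sketch following the UTF-8 rules in RFC 3629. The function name is hypothetical and this is not the check PCRE2 itself performs; it only shows that every byte in the range 00..7F is a complete, valid one-byte sequence. It says nothing about the rest of the file, which may well contain invalid sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): true if buf[0..len) is
   well-formed UTF-8 according to RFC 3629. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */

        if (b < 0x80) { i++; continue; }   /* ASCII, including 00 */
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1Fu; }
        else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0Fu; }
        else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07u; }
        else return false;  /* stray continuation byte, C0, C1, F5..FF */

        if (len - i - 1 < need)
            return false;   /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3Fu);
        }

        if (need == 2 && cp < 0x800)   return false;  /* overlong */
        if (need == 3 && cp < 0x10000) return false;  /* overlong */
        if (cp > 0x10FFFF)             return false;  /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogates are not allowed */

        i += need + 1;
    }
    return true;
}
```

Calling it on a made-up stand-in for the symbol table, such as the bytes of "_CFRetain", a 00 byte, then "NSURLVolumeNameKey", with the length passed explicitly, reports the buffer as valid, while a lone FF byte is rejected.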

The explanation for PCRE2_MATCH_INVALID_UTF in the docs (pcre2unicode.html) makes sense to me in how it deals with invalid UTF sequences in the binary data. So, ideally, I'd like to follow your suggestion, but it would mean that the search for plain ASCII text in my sample binary file would fail, and that's not great.

Also, when I keep doing this without PCRE2_MATCH_INVALID_UTF and PCRE2_UTF, will I still run the risk of getting crashes and related issues? So far, I seem to get the crashes only if I use PCRE2_UTF. I understand that I won't be able to find non-ASCII UTF text, then.

When I run this command, it finds the pattern inside the binary file:

pcre2grep -al 'NSURLVolumeNameKey' CoreFoundation

So, if that works, what am I doing wrong, then? Or is pcre2grep not using the PCRE2_UTF option?

Thomas