Re: [pcre-dev] Serialization format versioning

Author: Daniel Richard G.
Date:  
To: Zoltán Herczeg, pcre-dev
Subject: Re: [pcre-dev] Serialization format versioning
Hi Zoltán, it's been a while!

On Fri, 2018 Jun 22 05:55+0200, Zoltán Herczeg wrote:
> Hi,
>
> to tell the truth, when the serialization was created the use case we
> were discussing was different from the use case below.
>
> I consider serialized forms inherently insecure. I would never
> recommend accepting regexes in binary form in any application.
> Instead, I would recommend distributing patterns in text form; the
> application then pre-compiles them and stores them in a secure way. The
> application can also store both the text and binary forms and, after
> any regex engine change, pre-compile the patterns again.


I can understand the security implications of loading serialized
regexes, but beyond validation of the input, and recommendations on how
to use this feature, there's not much more we (PCRE) can do about that.
For my part, all I can say is... I'm a big boy, I can handle it :-)

I see the approach you are suggesting here; e.g. an application
compiles a regex on the first run, and caches the serialized form in
/var/cache/foo/ for later use. Any time the format changes, it
re-compiles the regex and re-caches the result.
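
A rough sketch of the check such a cache would need before trusting a
stored blob. Nothing here is PCRE2 API: the 12-byte header layout, the
"FOOC" magic, and every name below are assumptions made up for
illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cache-file header (not defined by PCRE2):
 *   bytes 0-3   magic "FOOC"
 *   bytes 4-7   format version, little-endian
 *   bytes 8-11  payload size in bytes, little-endian
 */
#define CACHE_HEADER_SIZE 12u

typedef enum { CACHE_USABLE, CACHE_STALE, CACHE_INVALID } cache_status;

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decide whether a cached blob can be used as-is.
 * CACHE_STALE means "well-formed, but written for another format
 * version": the caller should re-compile from text and re-cache.
 * On CACHE_USABLE, *payload points at the serialized data, which the
 * caller would then hand to pcre2_serialize_decode(). */
cache_status check_cache(const uint8_t *buf, size_t len,
                         uint32_t engine_version,
                         const uint8_t **payload, size_t *payload_len)
{
    if (len < CACHE_HEADER_SIZE || memcmp(buf, "FOOC", 4) != 0)
        return CACHE_INVALID;

    uint32_t size = read_le32(buf + 8);
    if (size != len - CACHE_HEADER_SIZE)
        return CACHE_INVALID;

    if (read_le32(buf + 4) != engine_version)
        return CACHE_STALE;

    *payload = buf + CACHE_HEADER_SIZE;
    *payload_len = size;
    return CACHE_USABLE;
}
```

The stale case is the one that requires a writable location: without
somewhere to store the re-compiled result, every run pays the
compilation cost again.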

In my use case, however, the application has binary data files
[containing serialized regexes] under /usr/share/foo/, and no provision
is available to cache under /var/, nor any other writable disk location.
PCRE2 can be updated at any time due to security vulnerabilities, but
the application's data files are tied to release cycles that take the
better part of a year to complete.

> While this requires more disk space, that is usually less of an
> issue than the security implications of distributing regexes in
> binary form.


Disk space is not the concern here; the concern is the non-trivial
amount of time it can take to (re-)compile a large regex.

> One option could be versioning serialized regexes. In another project
> (JerryScript) we use versioning for snapshots (the serialized form of
> JavaScript code), and the version number is incremented after any
> change that affects snapshots. It is not a heavy burden, but in my
> experience it is easy to forget, especially for people new to the
> project. We have never gone beyond that; supporting two snapshot
> formats in one engine sounds like too much of a burden, as does
> writing conversion tools.


When you say "two snapshot formats," do you mean two formats that are
completely different, or two formats that are identical but for one or
two newly-added features? Straight versioning doesn't exactly
distinguish between these two scenarios, which is why I'd imagine you'd
want more of a modular PNG-like chunked format for this.
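
To make the chunked idea concrete, here is a minimal sketch of a reader
that borrows PNG's convention: each chunk is a 4-byte big-endian length,
a 4-byte ASCII type, then the data, and a lowercase first letter in the
type marks the chunk as ancillary (safe to skip if unrecognized). The
chunk types "CODE" and "name" and all function names are hypothetical,
and PNG's per-chunk CRC is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { SCAN_OK, SCAN_TRUNCATED, SCAN_UNKNOWN_CRITICAL } scan_status;

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Chunk types this hypothetical reader understands. */
static int is_known(const uint8_t *type)
{
    return memcmp(type, "CODE", 4) == 0 || memcmp(type, "name", 4) == 0;
}

/* Walk every chunk in buf. Unknown ancillary chunks are counted in
 * *skipped and ignored; an unknown critical chunk aborts the scan. */
scan_status scan_chunks(const uint8_t *buf, size_t len, unsigned *skipped)
{
    size_t pos = 0;
    *skipped = 0;

    while (pos < len) {
        if (len - pos < 8)
            return SCAN_TRUNCATED;          /* no room for length + type */

        uint32_t size = read_be32(buf + pos);
        const uint8_t *type = buf + pos + 4;
        pos += 8;

        if (size > len - pos)
            return SCAN_TRUNCATED;          /* data runs past the buffer */

        if (!is_known(type)) {
            if (!(type[0] & 0x20))          /* uppercase: critical chunk */
                return SCAN_UNKNOWN_CRITICAL;
            ++*skipped;                     /* lowercase: safe to ignore */
        }
        /* A real reader would pass known chunks to a decoder here. */
        pos += size;
    }
    return SCAN_OK;
}
```

With this scheme, a file that merely adds an optional feature still
loads in an older reader, while a file that depends on something the
reader cannot honor is rejected outright. A single version number cannot
express that distinction.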

In any event, as I wrote to Philip, the format used for serialization
should be independent of the in-memory representation, so that it is
minimally affected by the vagaries of ongoing engine development. That
way, it is less likely to need to change over time, which eases the
maintenance burden and improves the prospects for future compatibility.


--Daniel


P.S.: Please Cc: me on any replies, as I am not subscribed to this list.


--
Daniel Richard G. || skunk@???
My ASCII-art .sig got a bad case of Times New Roman.