Re: [pcre-dev] Serialization format versioning

Author: Daniel Richard G.
Date:
To: pcre-dev
Subject: Re: [pcre-dev] Serialization format versioning

Hi Philip,

On Thu, 2018 Jun 21 09:27+0100, ph10@??? wrote:
> On Wed, 20 Jun 2018, Daniel Richard G. wrote:
>
> > Is it not feasible for the serialized form to be forward-compatible
> > with later versions of PCRE2?
>
> Zoltán may correct me on this, but basically we felt that this would
> be too much of a constraint on future development. Changes to what is
> compiled are not ruled out - in fact there's an item buried somewhere
> in my "potential work" list to redesign how classes work because over
> the years the code has got contorted and hard to understand. It may
> not happen, but we wanted to be able to change the internal form of
> compiled patterns without constraint.

I think that what you are saying is applicable to what lies behind a
pcre2_code pointer, but the serialized form of that serves a different
purpose, and so should not be judged by the same criteria.

Specifically, the in-memory representation (pcre2_code) needs to
correspond to the matcher implementation and allow for changes that may
improve performance, maintainability, etc. But the serialized form needs
to be more stable and (at least) forward-compatible. The immediate
implication of this is that the two representations are not the same,
possibly not even substantially the same. You need to be able to convert
between the two representations, of course. But questions of improving
matcher performance/etc. should only apply to one of the two, because of
the different purposes served.

My presumption of how this ought to work is that the serialized form is
a sort of binary tokenization / intermediate representation of the
original regex source that can quickly be converted into the optimized
pcre2_code form---much more quickly than re-compiling the original regex
in the first place. This representation would, at a minimum, be
equivalent to having the original regex text (comments stripped out but
position offsets unaffected) and compilation options. From there, it
could also contain intermediate structures needed by the compilation
process that are time-consuming to re-create. Perhaps later versions of
PCRE2 have some crazy new optimization that requires different
intermediate structures, but the de-serializer would be smart enough to
re-create whatever is needed that the serialized form does not provide.
(For this reason, it would be understandable if a bleeding-edge version
of PCRE2 takes a bit longer to load an old serialization, as it sort of
has to partially re-compile it. As long as it's still faster than re-
compiling the original regex, it's still a win.)
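
As a rough illustration of the idea above (a sketch only, not anything
PCRE2 implements), the serialized form could be a sequence of tagged
sections. The section tags, the loaded_regex struct and load_sections()
are all hypothetical names invented for this example. The pattern text
and compile options are mandatory; any other section is an optional
cache that a reader may skip if it does not recognise it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical section tags; none of these names exist in PCRE2. */
enum { SEC_PATTERN = 1, SEC_OPTIONS = 2 /* other tags: optional caches */ };

typedef struct {
    const uint8_t *pattern;  /* regex source bytes, not NUL-terminated */
    uint32_t pattern_len;
    uint32_t options;        /* compile options as a bit mask */
    int has_pattern;
    int has_options;
    unsigned skipped;        /* sections this reader did not understand */
} loaded_regex;

/* Read a little-endian 32-bit value. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walk [tag:u32][len:u32][payload] records.
 * Returns 0 on success, -1 if the data is truncated or a mandatory
 * section (pattern text, options) is missing. */
static int load_sections(const uint8_t *buf, size_t size, loaded_regex *out)
{
    size_t pos = 0;

    memset(out, 0, sizeof *out);
    while (pos < size) {
        uint32_t tag, len;

        if (size - pos < 8) return -1;       /* truncated record header */
        tag = rd32(buf + pos);
        len = rd32(buf + pos + 4);
        pos += 8;
        if (len > size - pos) return -1;     /* truncated payload */

        switch (tag) {
        case SEC_PATTERN:
            out->pattern = buf + pos;
            out->pattern_len = len;
            out->has_pattern = 1;
            break;
        case SEC_OPTIONS:
            if (len != 4) return -1;
            out->options = rd32(buf + pos);
            out->has_options = 1;
            break;
        default:
            /* Unknown section: skip it. The caller would rebuild
             * whatever it needs from the pattern text and options. */
            out->skipped++;
            break;
        }
        pos += len;
    }
    return (out->has_pattern && out->has_options) ? 0 : -1;
}
```

With a layout along these lines, a reader that skips a section can
still fall back to the pattern text and options, at the cost of redoing
part of the compilation.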

> Of course, in theory one could support old and new versions of
> something, but this would involve tests and alternate paths and the
> maintenance of old code, all of which I felt was too much of a burden
> for the maintainer, even if performance wasn't hit.

That work is part and parcel of a serialization feature, although if the
evolution of the format is carefully managed, the burden should not be
great. It would not be unreasonable to draw a boundary for the forward
compatibility at major version changes (so e.g. PCRE2 11.0 would not be
able to read a regex serialized by PCRE2 10.x), although I think even
that might not be too hard to make work.
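
The boundary rule described above can be sketched as a small check.
This is an illustration under assumed names (lib_version and can_load()
are hypothetical, not PCRE2 API): a reader accepts data written by the
same major version at the same or an earlier minor version, and rejects
everything else.

```c
/* Hypothetical version pair; not a PCRE2 type. */
typedef struct {
    unsigned major;
    unsigned minor;
} lib_version;

/* Returns 1 if a library at version `reader` may load data that was
 * serialized by a library at version `writer`, 0 otherwise. */
static int can_load(lib_version writer, lib_version reader)
{
    if (writer.major != reader.major)
        return 0;                        /* boundary at major versions */
    return writer.minor <= reader.minor; /* older data, newer reader */
}
```

Under this rule, data written by 10.31 loads in 10.32, but data written
by 10.32 does not load in 10.31 or in 11.0.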

Importantly, however, the serialized form is largely decoupled from what
the matching engine consumes. Otherwise, yes, it would become
unmanageable.

> I have no idea how widely the serializing feature is used.

It's certainly a feature that is needed in some quarters. To the point
that I've seen older applications half-arse it with PCRE1 by memory-
dumping pcre_code objects to disk.

> I suppose we could invent the idea of a "format version" for the
> compiled code. The serialization functions could check this instead of
> the PCRE2 version number. Zoltán: (are you reading this?) What do you
> think? That seems an easy solution.

That would be an improvement over the current behavior, but as long as
a point release (e.g. a security update) can potentially break the
format, we'll be right back where we started.

(Indeed, anyone who subscribes to the "fail-fast" philosophy---as I do---
would see this as *worse* than what's implemented now: a load that fails
on every version change is at least predictable, whereas one that fails
only when the format version happens to change is not.)

--Daniel

P.S.: Please Cc: me on any replies, as I am not subscribed to this list.

--
Daniel Richard G. || skunk@???
My ASCII-art .sig got a bad case of Times New Roman.
