Re: [pcre-dev] [PATCH] PCRE2 on Windows

Top Page
Delete this message
Author: Daniel Richard G.
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] [PATCH] PCRE2 on Windows
Hi Philip,

On Wed, 2018 May 9 17:38+0100, ph10@??? wrote:
> Hi Daniel,
>
> Thanks for your messages and your efforts. I haven't yet looked
> seriously at your new patches. However, I should say that for many
> years I stuck religiously to C90 in all PCRE code, until PCRE2 came
> along. By that stage I felt it was probably time to allow some newer C
> things (e.g. snprintf()) to be used, so nowadays I work from C11. One
> can keep on supporting historic facilities for only so long... so I'm
> not sure exactly how many of your patches I will want to include when
> I do get round to looking at them.


I understand where you're coming from. My view is that if compatibility
with older systems can be maintained with small, unobtrusive changes,
then it is worthwhile; if lots of bending over backwards is needed, then
perhaps not. In particular, I want to steer clear of easily-avoidable
incompatibilities, such as unconditionally #including stdint.h.

> The C++ comment comment amused me because // comments were a
> feature of BCPL way back in the late 1960s and it annoyed me when
> they were not carried forward into C. However, I don't tend to use
> them very much.


They're a fine example of convenient syntax! I'm down on them only due
to fussy compilers.

> > * Possibly some way of defining the integer types needed for systems
> > that lack both inttypes.h and stdint.h, although I'm not sure how
> > to safely define e.g. uint32_t for the benefit of prototypes in
> > pcre2.h since the consuming application will necessarily see those
> > definitions too.
>
> Do you think this is a big issue? How many such systems are there
> still around that are likely to want to install PCRE2?


Windows systems have gone without these type definitions until
relatively recently. I think something is needed at least on that side.

The pstdint.h header is one option, but because only a handful of
definitions are needed, I believe a more self-contained solution may
be feasible. See below for a proposal.

> > * Currently, the generic pcre2.h header (under src/) in the source
> > tree is used instead of the generated pcre2.h in the build tree.
> > This is due to the use of #include"pcre2.h" instead of
> > #include<pcre2.h>. Would it be all right to change all instances
> > of the former to the latter in the PCRE2 source?
>
> ... but would this not have the effect of including an *installed*
> version of pcre2.h from /usr/include instead of from the build or
> source trees?


That won't be a problem, because an -I$(builddir)/src would be part of
the #includes. (Note that #include<> typically searches the list of -I
directories before looking in the system directories.)

> #include "file"
>
>     This variant is used for header files of your own program. It
>     searches for a file named file first in the directory containing
>     the current file, then in the quote directories and then the same
>     directories used for <file>. You can prepend directories to the
>     list of quote directories with the -iquote option.


Side note: "-iquote" is, of course, GCC-specific.

> It seems to me that the "" form is right for "header files of your own
> program". Perhaps the way out of this is to ensure that there isn't a
> pcre2.h file in the source tree? Hmm. I've just looked at a distribution
> tarball, and there is no pcre2.h therein, only pcre2.h.generic and
> pcre2.h.in. How did one get into your source tree? Perhaps you did a
> previous build within that tree?


I've been building within the SVN tree, which has the file:

    $ svn info src/pcre2.h
    Path: src/pcre2.h
    Name: pcre2.h
    Working Copy Root Path: /path/to/pcre2
    URL: svn://vcs.exim.org/pcre2/code/trunk/src/pcre2.h
    Relative URL: ^/code/trunk/src/pcre2.h
    Repository Root: svn://vcs.exim.org/pcre2
    Repository UUID: 6239d852-aaf2-0410-a92c-79f79f948069
    Revision: 934
    Node Kind: file
    Schedule: normal
    Last Changed Author: ph10
    Last Changed Rev: 916
    Last Changed Date: 2018-02-19 11:55:47 -0500 (Mon, 19 Feb 2018)
    Text Last Updated: 2018-04-04 15:38:14 -0400 (Wed, 04 Apr 2018)
    Checksum: 1b465330d9b453505e58b04ff052dec1b5bfabf2


Removing it may solve the immediate issue, but the setup will remain
fragile to any errant file that is left under $(srcdir)/src/. I think
using <pcre.h> consistently, in addition to removing the file from SVN,
is the best solution.

> The French test: you can try posting me the output, but there do seem
> to be variations in locale handling - which of course should be being
> phased out in favour of Unicode. :-)


Oh, for sure! It is attached, in gzipped form. I'll be happy to re-run
this test as needed.

> > It appears to be test 8 which is failing in this way.
>
> Test 8 contains a test for deeply nested parentheses, preceded by this
> comment:
>
> # Check the absolute limit on nesting (?| etc. This varies with code unit
> # width because the workspace is a different number of bytes. It will fail
> # with link size 2 in 8-bit and 16-bit but not in 32-bit.
>
> The test itself sets parens_nest_limit=1000 and does have around 1000
> nested parens. Please can you try cutting out this test from the input
> and see if all is well? Cutting it down to a smaller number might not
> be straightforward. I wonder why nobody has complained about this
> issue before?


The test with the bazillion parentheses actually passes fine. It is the
following three patterns, with "expand", that are leading to stack
overflows.

I think the MSVC environment just uses a small stack by default.
One thing I noticed is that the stack is smaller when you test in
debug mode, so some of the failures were not reproducible in an
optimized build.

> > If changing the input read to binary mode is deemed unacceptable, then
> > this should address the problem as well.
>
> I decided that going back to binary mode was unacceptable on the grounds
> that, if Ctrl/Z is normally treated as EOF in Windows text files, it
> should be treated that way by pcre2test.


That's a reasonable point.

> > One thing I missed earlier is that the #include<stdint.h> is in the
> > pcre2.h header file, which means that stdint.h is needed not just to
> > compile PCRE2, but also any consuming application. This is a problem on
> > platforms that don't already provide stdint.h.
>
> How many don't? It's been around in the standard for over a decade.
> I must say I'm a bit reticent about pandering too much to non-standard
> environments.


Microsoft Windows again leads the way :]

After thinking about this, I have a solution to propose:

On the Autoconf side, there are various useful macros of the form

    AC_TYPE_UINT8_T
    AC_TYPE_INT32_T
    et al.


that will check to see if the namesake type is available in stdint.h or
inttypes.h, and if not, they define the type. The definitions would be
hidden away in config.h, so this is transparent to the code.

On the CMake side, I don't believe a similar facility is available, but
something along the following lines can be added to config-cmake.h.in:

    #if defined(_MSC_VER) && !defined(HAVE_STDINT_H)
    typedef int int32_t;
    /* ... */
    #endif


Either way, these definitions will only serve for the benefit of
building PCRE2, allowing the build to be self-contained. Consuming
applications will need to provide these types themselves (which, in all
likelihood, they already are).

Does this sound like a reasonable approach?

> > One more point of interest: A comment in the source states that
> > PCRE2 cannot be built with MSVC6, but in fact I am finding that it
> > builds and tests out with that ancient toolchain no worse than newer
> > versions of Visual Studio. It is certainly helpful to me that this
> > compatibility remains, and I'll be happy to formally verify it once
> > the other work is done.
>
> A quick google suggests me me that MSVC6 is now 20 years old. I'm
> amazed you still have a need for it when compiling new software, but
> hey, what do I know? And if it works, it works...


I think it's a good thing to keep, so long as the cost is low.
Maintaining this kind of portability tends to have unexpected
benefits, such as facilitating ports to embedded environments,
hobbyist systems, etc.


--Daniel


--
Daniel Richard G. || skunk@???
My ASCII-art .sig got a bad case of Times New Roman.