Multiple-reply time again!
On Wed, 2007 Feb 21 23:52:56 -0800, Craig Silverstein wrote:
> I'll take a look at this tomorrow, but wanted to respond to some email
> questions/comments:
>
> } * Forgot to mention previously: I've changed the date format from
> } 18-Dec-2006 to 2006-Dec-18---any objections? (I believe Philip said
> } he was OK with it, as long as there was no date ambiguity.)
>
> I object. 2006-12-18 would be ok, but 2006-Dec-18 is just weird. I
> quite like 18-Dec-2006 myself, but if you really want to go ISO 8601,
> you should just do that, and have numbers everywhere.
Would something like "2006 December 18" be any better?
> } What do y'all say we ditch the ac_pcre_* variables, and use
> } enable_foo/with_foo in their stead? We could then eliminate the third
> } argument from most of the AC_ARG_ENABLE/WITH calls, as it would be
> } redundant.
>
> This seems like a nice simplification. Any cost to making this
> change?
Just the effort in renaming them.
> But again, we have something that works now, so any objection to going
> with what we have now, and then making what you suggest part of a
> future change? It wouldn't be user-visible, would it?
No, this is all internal to configure.ac. (Unlike the newline-option
change.)
On Thu, 2007 Feb 22 09:48:26 -0500, Bob Rossi wrote:
>
> At this point, I get,
> $ patch -p1 < ~/pcre-7.0-drg2.patch
> patching file autogen.sh
> ...
> patching file RunGrepTest.in
> Hunk #5 FAILED at 201.
> 1 out of 7 hunks FAILED -- saving rejects to file RunGrepTest.in.rej
> patching file RunTest.in
> ...
>
> RunGrepTest.in.rej is attached.
As Philip noted, this is due to embedded CR characters in the patch getting
mangled. I'll uuencode the next patch to get around this.
> > What do y'all say we ditch the ac_pcre_* variables, and use
> > enable_foo/with_foo in their stead? We could then eliminate the third
> > argument from most of the AC_ARG_ENABLE/WITH calls, as it would be
> > redundant.
>
> I like the way it is. That's because it's easy to provide default
> values. Also, it's nice and easy to read. Using the built in variables
> is slightly more confusing to someone that doesn't know autoconf that
> well.
The default variable assignments would stay as they are; this is just about
using variables that Automake already defines for us, instead of creating a
parallel set.
> > ==== Makefile.am ====
> >
> > * perltest --> perltest.pl
>
> Why is this necessary at this point in time? Does it help solve a
> build system problem?
You could say it fixed the "Hey, why didn't 'make clean' remove perltest?"
problem :-) The idea was to eliminate confusion due to the name closely
paralleling that of binaries like "pcretest".
> > diff -ruN pcre-7.0/autogen.sh pcre-7.1-RC0/autogen.sh
> > --- pcre-7.0/autogen.sh 1969-12-31 19:00:00.000000000 -0500
> > +++ pcre-7.1-RC0/autogen.sh 2007-02-20 22:44:21.000000000 -0500
> > @@ -0,0 +1,21 @@
> > +#!/bin/sh
>
> Do you think the rewrite of the autoconf script is necessary at this
> point? I'm personally looking for minimal change at this point, only
> until the patch is acceptted upstream. Then we can reak havok again!
Well, one of the tweaks I made there was to increase the warnings reported
by the autotools programs, which helped eliminate a few minor issues from
the in-flux configure.ac/Makefile.am.
Besides, we're addressing the build system, and autogen.sh bootstraps that,
so it's reasonably within the scope of the current work. Where I *have*
held back is on significant changes to the actual PCRE source code.
> > diff -ruN pcre-7.0/CMakeLists.txt pcre-7.1-RC0/CMakeLists.txt
> > --- pcre-7.0/CMakeLists.txt 1969-12-31 19:00:00.000000000 -0500
> > +++ pcre-7.1-RC0/CMakeLists.txt 2007-02-21 01:55:06.000000000 -0500
> > @@ -0,0 +1 @@
> > +# This file is a placeholder.
>
> What's this?
That's what will become the CMake listfile, for building with that system.
I've added the file so that we can already have the appropriate references
in Makefile.am. I'll have the real CMakeLists.txt ready before long.
> > -dnl A safety precaution
> > +m4_define(pcre_major, [7])
> > +m4_define(pcre_minor, [1])
> > +m4_define(pcre_prerelease, [-RC0])
> > +m4_define(pcre_date, [2006-Dec-18])
>
> I like the way you quoted the rhs, I almost did that. What do you think
> about quoting the rhs too?
You meant to say, "quoting the lhs too?"
Quoting the lhs is overcautious, IMHO, and the extra syntax doesn't look
good. Heck, the rhs quoting is there more for cosmetic purposes than any
real need.
> > +# Libtool shared library interface versions (current:revision:age)
> > +m4_define(libpcre_version, [0:1:0])
> > +m4_define(libpcreposix_version, [0:0:0])
> > +m4_define(libpcrecpp_version, [0:0:0])
>
> What's the motivation of changing this from a shell variable to an m4
> define? Did it solve a problem?
Yes, somewhat. Here was my order of operations:
1. Lets move the library versions up near the package versions, 'cause
those two bits are closely related.
2a. Hmm, would be nice if these were m4 assignments, so to be consistent
with the package-version bits.
2b. Hey, why are shell-variable assignments before AC_INIT() being ignored?
> > +AC_INIT(PCRE, pcre_major.pcre_minor[]pcre_prerelease, , pcre)
>
> We previously had
> AC_INIT(pcre, pcre_major.pcre_minor[pcre_prerelease])
A "[]" is more clearly a zero-width macro separator (which is ultimately
the effect we want here), and it's better to have pcre_prerelease at the
same level of quoting as pcre_minor et al.
> Does everyone want to switch from "pcre" to "PCRE"?
Ah ah, this is something more subtle. From the Autoconf manual:
AC_INIT (package, version, [bug-report], [tarname])
Process any command-line arguments and perform various
initializations and verifications.
Set the name of the package and its version. These are typically
used in `--version' support, including that of configure. The
optional argument bug-report should be the email to which users
should send bug reports. The package tarname differs from package:
the latter designates the full package name (e.g., `GNU Autoconf'),
while the former is meant for distribution tar ball names (e.g.,
`autoconf'). It defaults to package with `GNU ' stripped,
lower-cased, and all characters other than alphanumerics and
underscores are changed to `-'.
The package's proper name is PCRE (uppercase), being an acronym and all,
but we want the tarball to be pcre-<version>.tar.gz. Depending on which
form is desired, PACKAGE_NAME or PACKAGE_TARNAME should be used.
> > -dnl Provide versioning information for libtool shared libraries that
> > -dnl are built by default on Unix systems.
> > +# Define all variables that represents the AC_ARG_ENABLE options.
> > +ac_pcre_cpp=yes
> > +ac_pcre_utf8=maybe # default is "no"
>
> The comment above does not match the value.
What's happening there is that we need to distinguish (later on) whether
the user gave --enable-utf8, --disable-utf8, or neither; if we assign "no"
there, then we can't tell apart --disable-utf8 from "neither". So we assign
some neutral value, but then we also note that if the value doesn't change,
it'll go to "no". Because the default, after all, is "no"---we just can't
express it in the assignment at that point, so we do so in a comment.
> Also, reimplementing the same feature in a different way is fine with me
> if it has a benefit. However, it seems like a waste of my time to have
> written it originally if it is being reimplented for personal
> preferences. Please don't read this as an attack on your work, I actually
> like the way it's done. I just dislike wasting my time.
You're referring to the ac_pcre_unicode_properties_set mechanism? I thought
we'd discussed that it wouldn't catch the --enable-unicode-properties
--disable-utf8 error case, and that you'd agreed that this could be
addressed later on....
> > +ac_pcre_match_limit_recursion=MATCH_LIMIT
>
> The above line stands out as odd to me. This is the first time that
> MATCH_LIMIT is used in the file. Is this what you intended?
Yes! If the user doesn't give a value for match-limit-recursion, then it
would be much nicer for config.h to have
#define MATCH_LIMIT 10000000
#define MATCH_LIMIT_RECURSION MATCH_LIMIT
instead of
#define MATCH_LIMIT 10000000
#define MATCH_LIMIT_RECURSION 10000000
:)
> > +# Handle --disable-cpp
> > +AC_ARG_ENABLE(cpp,
> > + AS_HELP_STRING([--disable-cpp],
> > + [disable C++ support]),
> > + ac_pcre_cpp="$enableval")
>
> Now, I previously had,
> AC_ARG_ENABLE(cpp, AC_HELP_STRING([--disable-cpp],[disable C++ support]),
> ac_pcre_cpp="no")
> Now, the major difference here, is that you set ac_pcre_cpp to
> "$enableval" and I was setting it to "no". I'm not sure what's best
> here. I prefer the "no" because I know exactly what the value is going
> to be set to. Is there any way the user can manipulate the value of
> $enableval to be something other than "no"?
>
> This question goes for the other values as well.
Yes indeedy... if the user specifies any of the following:
--enable-cpp => enableval="yes"
--enable-cpp=blah => enableval="blah"
--disable-cpp=xyzzy => enableval="xyzzy"
There's nothing about an AC_ARG_ENABLE option that makes it inherently a
positive (--enable-*) or negative (--disable-*) option. (Ditto for
AC_ARG_WITH.) There is the help string that you show the user, but that has
no effect on the logic.
Even though we don't advertise --enable-cpp, and it's not needed (being the
default anyway), we don't want it to have an effect opposite from what a
user would expect.
> > + AC_MSG_ERROR([support for Unicode properties requires UTF-8 support])
> > + fi
> > + ac_pcre_utf8=yes
> > +fi
> > +if test "x$ac_pcre_utf8" = "xmaybe"
> > +then
> > + ac_pcre_utf8=no
> > +fi
>
> The above looks great. Good work!
Thanks :) This was what I had in mind for the utf8-vs.-ucp checking.
> > +# This facilitates -ansi builds under Linux
> > +dnl AC_DEFINE([_GNU_SOURCE], [], [Enable GNU extensions in glibc])
>
> Whoa, why would we want to add this? I don't think this is a good idea
> until after the build system is integrated. What effects could this have
> on the translation units that I'm diffing?
I've commented it out for now, but the motivation is to allow strict-ANSI
builds under Linux. Right now, if you try to build with -ansi, some
stat()-related macros fail in pcregrep.c.
(Being able to build in strict-ANSI mode is useful for ensuring
compatibility with older Unix environments, so it's something we'll want to
have in the new PCRE.)
> > diff -ruN pcre-7.0/pcre_compile.c pcre-7.1-RC0/pcre_compile.c
> > --- pcre-7.0/pcre_compile.c 2006-12-19 04:31:35.000000000 -0500
> > +++ pcre-7.1-RC0/pcre_compile.c 2007-02-21 22:21:29.000000000 -0500
> > @@ -312,7 +312,7 @@
> > 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* - 71 40 */
> > 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 72- | */
> > 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* & - 87 50 */
> > - 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 88- ? */
> > + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 88- */
>
> Do you really want to change the above?
Argh! Looks like my editor mangled a Latin-1 character. That'll definitely
have to go... we could convert it to UTF-8, but the safest thing would be
to change it to a \x## or \### character reference, so that the source file
is 7-bit clean.
Philip, what do you think? We surely don't want Latin-1 chars in the source
files (and preferably no UTF-8 ones either) as that makes for encoding
headaches all round when people work on them---case in point....
> Honestly, I would really prefer to leave the EBCDIC stuff the way it was
> until we get the patch committed. However, if it needs to be done now, I
> can update my expected translation units. I'm just afraid that the more
> code we change, the more will be broken.
The "#if !EBCDIC" --> "#ifndef EBCDIC" changes are there since the cpp
semantic is directly tied to what's in the configuration header. We're
generally refraining from source changes right now, but this is one of the
exceptions, due to the close connection.
> > diff -ruN pcre-7.0/pcre_ucp_searchfuncs.c pcre-7.1-RC0/pcre_ucp_searchfuncs.c
> > --- pcre-7.0/pcre_ucp_searchfuncs.c 2006-12-19 04:31:35.000000000 -0500
> > +++ pcre-7.1-RC0/pcre_ucp_searchfuncs.c 2007-02-20 22:37:47.000000000 -0500
> > @@ -45,7 +45,7 @@
> >
> > #include "ucp.h" /* Category definitions */
> > #include "ucpinternal.h" /* Internal table details */
> > -#include "ucptable.c" /* The table itself */
> > +#include "ucptable.h" /* The table itself */
>
> This is another change that could wait I believe.
A similar issue closely tied to the build system. If the file is named
ucptable.c, then Automake will want to compile it separately, which fails.
> > /* Table to translate from particular type value to the general value. */
> > diff -ruN pcre-7.0/pcrecpp.cc pcre-7.1-RC0/pcrecpp.cc
> > --- pcre-7.0/pcrecpp.cc 2006-12-19 04:31:35.000000000 -0500
> > +++ pcre-7.1-RC0/pcrecpp.cc 2007-02-20 22:57:43.000000000 -0500
> > @@ -29,6 +29,10 @@
> > //
> > // Author: Sanjay Ghemawat
> >
> > +#ifdef HAVE_CONFIG_H
> > +# include <config.h>
> > +#endif
>
> Why do you need the ifdef? If config.h isn't #included, I think we'll
> get a compile error. I don't see any problem with the way this file was
> before.
The HAVE_CONFIG_H bit is a standard idiom. The idea is, either you or
Autotools #defines PACKAGE_VERSION et al. in config.h, and passes
-DHAVE_CONFIG_H to the compiler---or it gives -DPACKAGE_VERSION=7.1 et al.
and no -DHAVE_CONFIG_H.
(Autotools will #define all the config.h symbols on the compiler command
line if you don't do the AC_CONFIG_HEADERS bit)
> > diff -ruN pcre-7.0/pcredemo.c pcre-7.1-RC0/pcredemo.c
> > --- pcre-7.0/pcredemo.c 2006-12-19 04:31:35.000000000 -0500
> > +++ pcre-7.1-RC0/pcredemo.c 2007-02-21 01:06:35.000000000 -0500
> > @@ -16,6 +16,10 @@
> > */
> >
> >
> > +#ifdef HAVE_CONFIG_H
> > +# include <config.h>
> > +#endif
>
> Same comment here. In fact, why are we adding the #include here?
IIRC, I got a compile error without it. Eventually, all the *.c files will
need to have this bit added---that's the convention for the config header.
> > -#define VERSION "4.4 29-Nov-2006"
>
> Yeah, good catch here!
Wasn't me :-)
> > diff -ruN pcre-7.0/perltest.pl pcre-7.1-RC0/perltest.pl
> > --- pcre-7.0/perltest.pl 2006-12-19 04:31:35.000000000 -0500
> > +++ pcre-7.1-RC0/perltest.pl 2007-02-21 01:31:20.000000000 -0500
> > @@ -1,4 +1,4 @@
> > -#! /usr/bin/perl
> > +#! /usr/bin/env perl
>
> I also don't understand this. It seems non standard.
It's a common convention to avoid hard-coding the path of the interpreter.
Otherwise, what happens on a system where Perl is installed as
/usr/local/bin/perl?
> > diff -ruN pcre-7.0/testdata/grepoutput pcre-7.1-RC0/testdata/grepoutput
> > --- pcre-7.0/testdata/grepoutput 2006-12-19 04:31:35.000000000 -0500
> > +++ pcre-7.1-RC0/testdata/grepoutput 2007-02-21 00:39:11.000000000 -0500
> > @@ -342,8 +342,8 @@
> > ./testdata/grepinputx
> > RC=0
> > ---------------------------- Test 36 -----------------------------
> > -./testdata/grepinputx
> > ./testdata/grepinput8
> > +./testdata/grepinputx
>
> Why is this necessary?
This output is from a pcregrep -L -r command, which gives a list of files
as output. The list, alas, is not sorted, and the test was failing for me
as I was getting a different file order than the reference. So I re-sorted
the list to proper lexical order ('8' < 'x') and added a sort(1) command to
RunGrepTest.in so that the output is always sorted.
> Thanks! Your changes look great.
Likewise for your critiques :-) Thanks for going through my patch so
meticulously!
On Thu, 2007 Feb 22 11:28:46 +0000, Philip Hazel wrote:
>
> * Dates: I can see Craig's objection to 2006-Dec-22. I suppose people
> will not be confused by 2007-02-04 but, for myself, I would have to
> spend just an ounce of brainpower doublechecking that I hadn't
> misunderstood it. Whereas with 2007-Feb-04 or my original 04-Feb-2007
> I would not. Or even Feb-04-2007. But standards are standards, so I
> won't object to 2007-02-04.
If e.g. "2007 February 04" isn't any better, straight YYYY-MM-DD will
certainly garner no complaints. For my part, I was just hoping to find an
intermediate form that retained the month name.
> * Daniel wrote this:
>
> * Changed --disable-stack-for-recursion to
> --enable-stack-for-recursion, as this is "no" by default.
>
> NO! The default must be to enable the use of the stack for recursion.
> Otherwise the performace degrades by 30% or so. Please reverse this
> change.
Ah, I must have goofed when I removed the double negative :] (And then I
was under the impression that using the stack was the slower approach.)
I'll change that back.
> * I don't think we should change the --enable-newline settings, for
> compatibility reasons. It may be tidier, but it seems gratuitous.
In the long run, I think one option would be preferable. We already have
the precedent of a multi-value option for link-size (i.e.
--with-link-size=N such that 2<=N=<4, instead of --with-link-size-is-2,
--with-link-size-is-3, etc.), and this avoids the semantic issues of
combinations of the explicit newline options ("I configured with
--enable-newline-is-cr and --enable-newline-is-lf, so my PCRE treats both
CR and LF characters as newlines!").
For now, however, I'll add the options back in.
On Thu, 2007 Feb 22 14:42:23 +0000, Philip Hazel wrote:
>
> * --enable-stack-for-recursion MUST be "on" by default. Out of interest,
> I ran "time make check" for a default configuration with this option
> both on and off. The real/user/sys times for "off" were 8.01/6.36/1.73
> whereas for "on" they were 5.62/3.97/1.70. The "off" setting is for
> environments where there is very limited stack - not typically a
> problem for Unix.
Understood.
> * I want to keep the old --enable-newline-is-xxx settings for backwards
> compatibility.
Roger that.
> Daniel, do you want to fix those issues and make another patch, or shall
> I just take it from here and learn how to change them myself?
I'll make another patch. There's a few other bits that need addressing as
well.
Thank you Philip, Bob, and Craig for your input; it's much appreciated!
--Daniel
--
NAME = Daniel Richard G. ## Remember, skunks _\|/_ meef?
EMAIL1 = skunk@??? ## don't smell bad--- (/o|o\) /
EMAIL2 = skunk@??? ## it's the people who < (^),>
WWW = http://www.******.org/ ## annoy them that do! / \
--
(****** = site not yet online)