Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with

Author: Philip Hazel
To: Craig Silverstein, Daniel Richard G., Ralf Junker
CC: pcre-dev, ho1459, stan
Subject: Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with
On Wed, 14 Mar 2007, Craig Silverstein wrote:

> Yes. I should point out that even though it printed out 'failed', the
> test still passed, and 'make check' returned success (0). So I'm not
> sure there's anything really broken here.


OK, I'll bear that in mind. I'm doing something else for a few days
while waiting for all the comments to come in - maybe I'll get back to
PCRE next week.

> I'd still like to do the Great Reorganizing one day of moving all the
> source files into a src subdir, but I agree it doesn't make sense to
> do it in this release.


I agree. When I first wrote PCRE, it had only 2 .c files and 2 .h files
and I thought "Why bother with a src directory for such a small thing?"
The distribution directory had no subdirectories - everything was at top
level, including the man pages and the test files.

Big Mistake! Even by the time of pcre-1.00 (which is the earliest source
I still have, but it was about 3 months after the first 0.9x release)
there were 6 .c files and 3 .h files. I should have spotted the warning
signs at that stage and re-arranged things then, instead of storing up
trouble for later. Oh well...

> I'm perfectly fine with leaving all current names as they are.


I think I'm going to do that.

On Thu, 15 Mar 2007, Daniel Richard G. wrote:

> In which case... how about this: We have pcre_chartables.c cover no more
> and no less than ASCII. (So we don't have to make any decisions about
> locale at build time.) We then say that if a user wants to do anything
> special with characters beyond 0x7F, s/he has to generate locale-specific
> tables at run time.


That is fine except that it makes life hard for some existing European
users who probably currently build PCRE in an environment that uses
8859-1. Ralf's arguments (below) are relevant here. Such users don't
really want to get all their applications changed (and maybe can't).

> Also, we can distribute pcre_chartables.ascii.c and
> pcre_chartables.ebcdic.c, and have the configure script symlink one of
> them to pcre_chartables.c for the build.


Another thought: We could distribute an ASCII pcre_chartables.c and
invent a new configure option, say --rebuild-chartables. This would be
disabled by default so that cross-compiling would work. If enabled, it
would cause the current practice to be observed, that is, it would build
the tables at compile time. This would preserve backwards compatibility.

I suppose we might even make it build the chartables at ./configure
time.

> > In non-UTF-8 mode, they work for the full 256 characters.
>
> Okay, but as defined by the locale in which dftables is run, then...
> (correct me if I'm wrong).


No, that's right.
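
To illustrate what "defined by the locale" means here, the following is a minimal sketch of building 256-entry character tables from the C library's ctype functions under a chosen locale. The names demo_tables and demo_build_tables are hypothetical and the layout is simplified for illustration; it is not the real dftables code or the real PCRE table format.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical, simplified table layout (NOT the real PCRE layout). */
typedef struct {
    unsigned char lower[256];   /* lower-case mapping for each byte value */
    unsigned char flip[256];    /* case-flipped value for each byte value */
    unsigned char is_word[256]; /* 1 if the byte is alphanumeric or '_' */
} demo_tables;

/* Fill the tables using the ctype functions of the named locale.
   Returns 0 on success, -1 if the locale is not available.
   Note: this changes the process-wide LC_CTYPE setting. */
static int demo_build_tables(demo_tables *t, const char *locale_name)
{
    assert(t != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    for (int c = 0; c < 256; c++) {
        t->lower[c] = (unsigned char)tolower(c);
        t->flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
        t->is_word[c] = (unsigned char)((isalnum(c) || c == '_') ? 1 : 0);
    }
    return 0;
}
```

Calling demo_build_tables(&t, "C") classifies only the ASCII letters and digits as word characters, whereas an 8859-1 locale (where one is installed) would also classify accented letters above 0x7F. That is why tables built at compile time depend on the build environment.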

> > Thanks for looking at that. What I think I'll do before the next release
> > is to remove everything except those that you have mentioned.
>
> Sounds good, though what about just throwing those items into a tar file so
> they can be archived without causing clutter? It would be nice if the
> material were still accessible in some way, just in case.


Indeed. I wasn't going to throw it away.

> Well, there's a difference between scripts in a PATH directory, versus a
> source tarball. Aside from serving the purpose of language identification,
> it's useful so that text editors can load the appropriate
> syntax-highlighting rules and such.
>
> > I must say that if I see a file called xxx.pl I assume that it has to be
> > run by "perl xxx.pl" rather than just "xxx.pl", but maybe I'm out of
> > line here.
>
> That's what the +x bit, and colorized ls(1) is for :-)


You have to remember that I'm a dinosaur. Far too old to use fancy
things like syntax-highlighting and colourized output. :-)

> I see. It would probably be better to have this file under doc/, as a sort
> of documentation-related "source file," and named without an .html
> extension (so that it's not recognized as documentation in itself).
> Something like index.html.src, or similar....


OK, perhaps yes, that might be better. I'll put a comment in it too.

> I'd suggest bringing the file into the top-level directory, and renaming it
> to HACKING (or TECHNOTES)....


OK, that's not unreasonable.

>    http://en.wikipedia.org/wiki/French_spacing
>
> I guess it goes to show, the English still hate the French ^_^


It used to be traditional in English English to call anything odd or
nonstandard "French". Long-term memory of 1066, I suppose, though of
course the Normans weren't really French.

On Thu, 15 Mar 2007, Ralf Junker wrote:

> Unfortunately, ISO-8859-2 does not help when PCRE is used in UTF-8
> mode. Only ISO-8859-1 is an exact mapping of the first 256 Unicode
> code points.


That is true. That does make it a special case.
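
As a small illustration of why 8859-1 is the special case: each 8859-1 byte value is numerically equal to its Unicode code point, so converting to UTF-8 needs no lookup table. The sketch below assumes a hypothetical helper name, latin1_to_utf8, which is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one ISO-8859-1 byte as UTF-8. The byte value is used directly
   as the Unicode code point (U+0000..U+00FF).
   Writes 1 or 2 bytes to out and returns the number written. */
static size_t latin1_to_utf8(unsigned char byte, unsigned char out[2])
{
    assert(out != NULL);
    if (byte < 0x80) {
        out[0] = byte;                                  /* ASCII: unchanged */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (byte >> 6));       /* leading byte */
    out[1] = (unsigned char)(0x80 | (byte & 0x3F));     /* continuation byte */
    return 2;
}
```

For example, the byte 0xE9 (e-acute in 8859-1) becomes the two bytes 0xC3 0xA9, which is the UTF-8 encoding of U+00E9. For any other 8859-x character set, a per-character mapping table would be needed instead.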

> I think the chartables.c question is a political one: It depends on if
> you put the PCRE focus on UTF-8 mode or not: Choose ASCII to point out
> PCRE's non-UTF-8 mode, but go for ISO-8859-1 to stress PCRE's Unicode
> capabilities and if you think that UTF-8 will become the de facto
> Unicode Encoding of the future.


I wouldn't presume to guess what will be the de facto encoding of the
future! In any case, I am bad at predicting the future (see above re the
src directory). Unicode itself does seem to be becoming fairly de facto
(and that is an argument for making chartables 8859-1), but whether we
will end up with UTF-8, UTF-16, or something else, or a mixture, who
knows?

Proposal:

1. Distribute pcre_chartables.c with ISO-8859-1 values. This means
that:

   . Anyone whose source contains only ASCII characters will be happy.

   . Anyone operating in non-UTF-8 mode, but with Unicode or 8859-1 
     character values, will be happy.


   . Anybody operating in UTF-8 mode (assuming Unicode character values) 
     will be happy.


2. Add --rebuild-chartables to ./configure, and if possible, make it
compile and run dftables at configure time in the current locale.
This means that:

   . Anybody who relied on the previous behaviour in a non-8859-1 locale 
     will be able to re-create it.


   . The few people (if any) who build PCRE in an EBCDIC environment can 
     use this (we might even default it if EBCDIC is selected), so we 
     don't have to distribute EBCDIC tables.  


If we can make table re-building operate at ./configure time, it may
solve the cross-compiling problem, because part of the ./configure thing
is to run little test programs while configuring, isn't it? That means
it must have access to a non-cross-compiler. I need to investigate this
further.

Philip

--
Philip Hazel, University of Cambridge Computing Service.