Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with

Author: Philip Hazel
Date:
To: Craig Silverstein, Ralf Junker, Daniel Richard G.
CC: pcre-dev, ho1459, stan
Subject: Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with
On Mon, 12 Mar 2007, Craig Silverstein wrote:

> The only 'issue' I got with the test was this:
> < Failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0


Did that happen on a straight ./configure, make, make check run? I don't
think it happened to me. We should be able to automate such things out.

> You're absolutely right there's a problem: it compiled dftables for
> the powerpc, and was then unable to run it.


Thanks for confirming my suspicion.

On Tue, 13 Mar 2007, Ralf Junker wrote:

> Works fine for me, especially the new *.generic files.


Aha! I'm glad we managed to get that right.

> May I suggest defaulting to the ISO-8859-1 character set instead?


I don't like that idea because it is just one very specific case. What
about the people who would prefer ISO-8859-2, etc?

It has occurred to me that we could distribute several tables in
pcre_chartables.c and make them configurable, so you could say

./configure --default-tables=ISO-8859-1

for example. The default default would remain ASCII.

> For non-European languages, it would also be nice if a future version
> of PCRE would have a callback function to convert case for codepoints
> greater than 255.


Isn't that what Unicode is all about? With Unicode property support,
PCRE does this already in UTF-8 mode.

> I remember I once asked for a pcre_chartables.c for exactly this
> reason. On the other hand, it would not hurt to still have dftables.c
> available for all who like to generate the character tables at compile
> time instead of run time.


Indeed, it could still be distributed.

On Tue, 13 Mar 2007, Craig Silverstein wrote:

> Anyway, I think the "C" locale is the ISO-8859-1 characters. The
> only difference is how they sort and stuff.


No, the "C" locale doesn't go beyond 127, I don't think.

On Tue, 13 Mar 2007, Ralf Junker wrote:

> The "C" locale differs among compilers (I know because I once reported
> a false alarm to Philip because of this).


... which is another good reason for distributing the default tables
rather than creating them at build time.
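
A quick way to see the point about the "C" locale is to ask the C library
which byte values it treats as letters. The sketch below is only an
illustration; the function name is made up for this message and is not part
of PCRE. The count for 0..127 should be 52 on any ASCII-based system, while
the count for 128..255 is whatever the particular C library decides, which
is exactly why tables generated at build time can differ between systems.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper (not part of PCRE): count the byte values in
   [lo, hi] that isalpha() accepts when the "C" locale is in effect.
   Note that it switches the process-wide LC_CTYPE locale to "C". */
int count_alpha_in_c_locale(int lo, int hi)
{
    assert(0 <= lo && lo <= hi && hi <= 255);
    setlocale(LC_CTYPE, "C");

    int n = 0;
    for (int c = lo; c <= hi; c++) {
        if (isalpha(c))
            n++;
    }
    return n;
}
```

Calling count_alpha_in_c_locale(128, 255) on different compilers and
libraries is a cheap way to check whether they agree about the upper half.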

On Wed, 14 Mar 2007, Daniel Richard G. wrote:

> Hi everyone. Sorry for dropping out there; was sick these past couple days.


Sorry to hear that. Don't overdo things while you are recovering.

> Haven't finished the CMake bits yet, but now that the rest of the package
> is nearing completion, the time is right for it. I'll work on this.


Excellent!

> Why not have a fixed set of character tables, defined in Unicode, and have
> PCRE be sensitive to the locale at runtime---converting from Latin-1 or
> UTF-8 or whatever as necessary? (This is basically what Ralf was
> suggesting, I believe, though he didn't go as far.) I mean, that's the much
> more common way of working; this whole issue of locale-sensitivity at
> *compile time* is really quite unusual.


It's a performance issue. You don't really want to be doing character
conversions every time you call pcre_compile() or pcre_exec(). (And I
know there are plenty of PCRE users who do not use the UTF-8/Unicode
features.)

Remember, an application can, at runtime, generate a set of tables of
its own and pass them to PCRE. Apart from the initial generation,
there's no performance penalty in doing that.
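
As a rough illustration of what "generate a set of tables at runtime" means,
here is a minimal sketch that builds two case tables from whatever locale is
current. The struct and function names are hypothetical, and the layout is
deliberately simplified; PCRE's real tables are larger and are laid out
differently.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical, simplified table set. PCRE's own tables also hold
   character-class bitmaps and type flags, which are omitted here. */
typedef struct {
    unsigned char lower[256];  /* lower-case equivalent of each byte */
    unsigned char flip[256];   /* opposite-case equivalent of each byte */
} case_tables;

/* Fill the tables once, using the locale in effect at the time of the
   call. After this, lookups are plain array indexing. */
void build_case_tables(case_tables *t)
{
    assert(t != NULL);
    for (int c = 0; c < 256; c++) {
        t->lower[c] = (unsigned char)tolower(c);
        t->flip[c]  = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
    }
}
```

With the library itself, the equivalent steps are to call setlocale(), then
pcre_maketables(), and then to pass the returned pointer as the final
argument of pcre_compile().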

> Currently, the tables deal only with ASCII characters (0x7F and below),
> yes? In which case, shouldn't a single pcre_chartables.c work for pretty
> much everyone except EBCDIC users?


In non-UTF-8 mode, they work for the full 256 characters. My idea
(above) about --default-tables would allow for an EBCDIC setting (which
could be the default if EBCDIC is selected).

> > 3. The Contrib directory on the PCRE ftp site


> Pretty much all the build-project stuff will be subsumed by the CMake
> support. Remaining interesting stuff: Symbian support (we'd need someone to
> do testing), the Delphi wrapper (ditto)... pcre_subst and worddefine look
> like nice extras.


Thanks for looking at that. What I think I'll do before the next release
is to remove everything except those that you have mentioned.

> * I believe INSTALL need not be listed in EXTRA_DIST, as it is included
> implicitly by Automake.


Will check.

> * I noticed the new scripts: PrepareRelease, CleanTxt, etc. (1) Could
> you tack on a .pl extension, to differentiate these from plain shell
> scripts? (2) When we move the source files into src/, it would
> probably be good to move these scripts into their own directory as well.


Sigh. That now involves editing several documentation files and the
scripts themselves (since they call each other). And the relevant
messing about in svn. Not exactly an instant's work.

I suppose this could be done (he said, grudgingly), but I don't really
see the need - what do other people think? In the Unix world we don't
normally differentiate different types of executable by their names.
Shell scripts, Perl scripts, Python scripts, binaries, whatever, are
plentiful on my box, without special extensions.

I must say that if I see a file called xxx.pl I assume that it has to be
run by "perl xxx.pl" rather than just "xxx.pl", but maybe I'm out of
line here.

<notserious>
And what if I make PrepareRelease into a shell script that just calls
Perl? <grin/>
</notserious>

Not sure about (2). In fact, I wasn't originally going to distribute
them at all, having put them in the "maint" directory, but I was
persuaded by Craig's argument about distribution packagers making tweaks.

> * The files !compile.txt and !linklib.txt... could we rename those to
> something that doesn't use shell-unfriendly bang-marks?


I don't know. They were contributed with those names. I didn't want to
change them in case that was somehow significant in the environment for
which they are intended (not Unix).

> * What's this Index.html file in the toplevel? (what of
> doc/html/index.html?)


Index.html is the base, hand-made HTML file. When PrepareRelease is run,
it removes the entire doc/html directory. It then copies Index.html into
doc/html/index.html and regenerates the remaining files in doc/html from
the man pages.

> * Is Tech.Notes no longer to be distributed?


Since it is only of interest to people who want to develop the code, I
thought it should remain in the "maint" directory which such developers
would get from svn. The vast majority of people who use the distribution
would surely have no interest in it.

> * I noticed some minor inconsistencies w.r.t. inter-sentence spacing in
> the config.h comments in configure.ac (one space or two?). I don't
> know how much you care about this, but it appeared to be using two
> spaces [more] consistently before.


I *never* use two spaces. I find reading text with extra spaces after
full stops to be "bumpy". It disrupts the even flow of words. Using
additional space has never been the norm in traditional book printing.
Certainly not in Europe, and I don't think in North America either.
Somehow it got introduced into typescripts - I don't know the history of
this - and it is sometimes taught as "standard" in typing classes. I'll
clean up the comments.

> > Another choice would be to rewrite dftables, in say, perl. :-) Not
> > that I'm recommending that course of action...
>
> It neatly avoids the cross-compilation issue, but some folks would balk at
> [gratuitously] taking on Perl as a build dependency :]


Balk.

Philip

--
Philip Hazel, University of Cambridge Computing Service.