Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with

Author: Daniel Richard G.
Date:
To: pcre-dev
Subject: Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with
On Thu, 2007 Mar 15 10:11:45 +0000, Philip Hazel wrote:
>
> On Thu, 15 Mar 2007, Daniel Richard G. wrote:
>
> > In which case... how about this: We have pcre_chartables.c cover no more
> > and no less than ASCII. (So we don't have to make any decisions about
> > locale at build time.) We then say that if a user wants to do anything
> > special with characters beyond 0x7F, s/he has to generate locale-specific
> > tables at run time.
>
> That is fine except that it makes life hard for some existing European
> users who probably currently build PCRE in an environment that uses
> 8859-1. Ralf's arguments (below) are relevant here. Such users don't
> really want to get all their applications changed (and maybe can't).


Okay, so we allow the user to generate a locale-specific default table, as
a special case. But the one we distribute covers only ASCII, and we warn
users that a locale-agnostic PCRE must use an ASCII-only default table.
(That is, major distributions would package PCRE compiled in this manner;
locale-specific tables should appear only in one-off builds, etc.)
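
To make "covers only ASCII" concrete, here is a minimal sketch of a lower-casing table that knows nothing beyond 0x7F. The function name and the single 256-byte layout are made up for illustration; the real pcre_chartables.c holds several tables, and this assumes an ASCII execution character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PCRE's actual table layout): fill a 256-entry
   lower-casing table that is aware of ASCII letters only. Assumes the
   compiler's execution character set is ASCII. */
void build_ascii_lower_table(unsigned char table[256])
{
    assert(table != NULL);
    for (int i = 0; i < 256; i++) {
        if (i >= 'A' && i <= 'Z')
            table[i] = (unsigned char)(i + ('a' - 'A'));  /* A-Z -> a-z */
        else
            table[i] = (unsigned char)i;  /* everything else, including
                                             0x80-0xFF, maps to itself */
    }
}
```

With a table like this, bytes above 0x7F are never treated as letters, so the library makes no claim about any particular 8-bit locale.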

> > Also, we can distribute pcre_chartables.ascii.c and
> > pcre_chartables.ebcdic.c, and have the configure script symlink one of
> > them to pcre_chartables.c for the build.
>
> Another thought: We could distribute an ASCII pcre_chartables.c and
> invent a new configure option, say --rebuild-chartables. This would be
> disabled by default so that cross-compiling would work. If enabled, it
> would cause the current practice to be observed, that is, it would build
> the tables at compile time. This would preserve backwards compatibility.


This sounds reasonable. The option would have to be expressed as
--enable-something, however... --enable-rebuild-chartables, or perhaps
--enable-locale-specific-chartables. (The long name should discourage its
use }:)

> I suppose we might even make it build the chartables at ./configure
> time.


That's beyond the scope of what the configure script is supposed to do.

> > I see. It would probably be better to have this file under doc/, as a sort
> > of documentation-related "source file," and named without an .html
> > extension (so that it's not recognized as documentation in itself).
> > Something like index.html.src, or similar....
>
> OK, perhaps yes, that might be better. I'll put a comment in it too.


One other possibility, by the way---move the file to its final resting
place (doc/html/index.html), and instead of deleting the html/ directory
when regenerating the HTML, just delete html/pcre*.html.

> On Thu, 15 Mar 2007, Ralf Junker wrote:
>
> > Unfortunately, ISO-8859-2 does not help when PCRE is used in UTF-8
> > mode. Only ISO-8859-1 is an exact mapping of the first 255 Unicode
> > code points.
>
> That is true. That does make it a special case.


Yes, but not special in a useful way. If you're working in ISO-8859-2, then
0xC0 is "CAPITAL LETTER R WITH ACUTE ACCENT", not "CAPITAL LETTER A WITH
GRAVE". The Latin-1 table may map to Unicode exactly, but without character
conversion, it's still only applicable to Latin-1 locales.
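
A small sketch of the point, for the one byte in question. The enum and function names are invented for this example, and it deliberately covers only ASCII plus 0xC0 rather than full conversion tables.

```c
enum charset { CS_LATIN1, CS_LATIN2 };

/* Illustrative only: return the Unicode code point that `byte` stands for
   under the given 8-bit charset. Handles ASCII and the single byte 0xC0;
   anything else yields U+FFFD to mean "not covered by this sketch". */
unsigned int byte_to_unicode(enum charset cs, unsigned char byte)
{
    if (byte < 0x80)
        return byte;  /* ASCII is identical in both charsets */
    if (byte == 0xC0) {
        switch (cs) {
        case CS_LATIN1: return 0x00C0u;  /* LATIN CAPITAL LETTER A WITH GRAVE */
        case CS_LATIN2: return 0x0154u;  /* LATIN CAPITAL LETTER R WITH ACUTE */
        }
    }
    return 0xFFFDu;
}
```

Same byte, two different characters: a table that calls 0xC0 "A with grave" is only correct for data that really is Latin-1 (or has been converted to it).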

> > I think the chartables.c question is a political one: It depends on if
> > you put the PCRE focus on UTF-8 mode or not: Choose ASCII to point out
> > PCRE's non-UTF-8 mode, but go for ISO-8859-1 to stress PCRE's Unicode
> > capabilities and if you think that UTF-8 will become the de facto
> > Unicode Encoding of the future.
>
> I wouldn't presume to guess what will be the de facto encoding of the
> future! In any case, I am bad at predicting the future (see above re the
> src directory). Unicode itself does seem to be becoming fairly de facto
> (and that is an argument for making chartables 8859-1), but whether we
> will end up with UTF-8, UTF-16, or something else, or a mixture, who
> knows?


The issue isn't about what encoding(s) will be mainstream in the future.
It's about the mapping of code points to meaningful characters: what should
chr(0xC0) give you? With Unicode (whether UTF-8, UCS-2, whatever), you have
a definite answer. With legacy systems supporting a mishmash of
incompatible 8-bit encodings, all bets are off beyond 0x7F.

> Proposal:
>
> 1. Distribute pcre_chartables.c with ISO-8859-1 values. This means
>    that:
>
>    . Anyone whose source contains only ASCII characters will be happy.
>
>    . Anyone operating in non-UTF-8 mode, but with Unicode or 8859-1
>      character values, will be happy.
>
>    . Anybody operating in UTF-8 mode (assuming Unicode character values)
>      will be happy.


Aaaaand... anyone operating in non-UTF-8, non-Latin-1 mode will get
incorrect results :(

Again I say, have the distributed table cover ASCII only, and rely on
proposal #2 below to support those who want something locale-specific.

> 2. Add --rebuild-chartables to ./configure, and if possible, make it
>    compile and run dftables at configure time in the current locale.
>    This means that:
>
>    . Anybody who relied on the previous behaviour in a non-8859-1 locale
>      will be able to re-create it.
>
>    . The few people (if any) who build PCRE in an EBCDIC environment can
>      use this (we might even default it if EBCDIC is selected), so we
>      don't have to distribute EBCDIC tables.


This sounds good, especially the EBCDIC part. I'd just generalize the first
point to apply to Latin-1 locales as well.
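
As a rough illustration of what "rebuild in the current locale" amounts to, here is a simplified sketch. It is not dftables itself: the function name is hypothetical, and it fills only a lower-casing table, using nothing but the standard setlocale() and tolower() calls.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: build a 256-entry lower-casing table from whatever
   the named locale says. Pass "" to pick up the locale from the
   environment. Returns 0 on success, -1 if the locale is unavailable.
   Note that this changes the process-wide LC_CTYPE setting. */
int build_locale_lower_table(const char *locale_name, unsigned char table[256])
{
    if (table == NULL || setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)tolower(i);  /* i is a valid unsigned char value */
    return 0;
}
```

Run under the "C" locale this yields an ASCII-only table; run under a Latin-1 or Latin-2 locale it picks up that locale's case mappings for 0x80-0xFF, which is why the result depends on the build environment.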

> If we can make table re-building operate at ./configure time, it may
> solve the cross-compiling problem, because part of the ./configure thing
> is to run little test programs while configuring, isn't it? That means
> it must have access to a non-cross-compiler. I need to investigate this
> further.


Compiler invocations in the configuration script are strictly for running
tests, however. The rules and dependencies relating to dftables and
pcre_chartables.c belong in a/the makefile.

If anything, we can have the configure script export a CC_FOR_BUILD-ish
variable with a host compiler. I'm not sure what the relevant conventions
are here w.r.t. Autotools, but we're certainly not the first ones in this
position.


--Daniel


-- 
NAME   = Daniel Richard G.       ##  Remember, skunks       _\|/_  meef?
EMAIL1 = skunk@???        ##  don't smell bad---    (/o|o\) /
EMAIL2 = skunk@???      ##  it's the people who   < (^),>
WWW    = http://www.******.org/  ##  annoy them that do!    /   \
--
(****** = site not yet online)