On Thu, 2007 Mar 15 10:11:45 +0000, Philip Hazel wrote:
>
> On Thu, 15 Mar 2007, Daniel Richard G. wrote:
>
> > In which case... how about this: We have pcre_chartables.c cover no more
> > and no less than ASCII. (So we don't have to make any decisions about
> > locale at build time.) We then say that if a user wants to do anything
> > special with characters beyond 0x7F, s/he has to generate locale-specific
> > tables at run time.
>
> That is fine except that it makes life hard for some existing European
> users who probably currently build PCRE in an environment that uses
> 8859-1. Ralf's arguments (below) are relevant here. Such users don't
> really want to get all their applications changed (and maybe can't).

Okay, so we allow the user to generate a locale-specific default table, as
a special case. But the one we distribute covers only ASCII, and we warn
users that a locale-agnostic PCRE must use an ASCII-only default table.
(That is, major distributions would package PCRE compiled in this manner;
locale-specific tables should appear only in one-off builds, etc.)

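
To be concrete about what "covers only ASCII" would mean, here is a rough
sketch. The names (struct ascii_tables, build_ascii_tables, the CT_* bits)
are made up for illustration and are not the layout dftables actually
generates; the point is only that nothing at or above 0x80 gets a class or
a case mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class bits, for illustration only. */
enum { CT_LETTER = 0x01, CT_DIGIT = 0x02, CT_SPACE = 0x04 };

/* Hypothetical table layout; not PCRE's real one. */
struct ascii_tables {
    unsigned char lower[256];   /* case-folding map */
    unsigned char ctype[256];   /* class bits per byte */
};

/* Fill the tables using ASCII rules only: bytes 0x80..0xFF map to
   themselves and belong to no class, whatever the build locale is.
   Assumes an ASCII execution character set (so not EBCDIC). */
static void build_ascii_tables(struct ascii_tables *t)
{
    int c;

    assert(t != NULL);
    for (c = 0; c < 256; c++) {
        t->lower[c] = (unsigned char)c;
        t->ctype[c] = 0;
        if (c >= 'A' && c <= 'Z') {
            t->lower[c] = (unsigned char)(c - 'A' + 'a');
            t->ctype[c] |= CT_LETTER;
        } else if (c >= 'a' && c <= 'z') {
            t->ctype[c] |= CT_LETTER;
        } else if (c >= '0' && c <= '9') {
            t->ctype[c] |= CT_DIGIT;
        } else if (c == ' ' || (c >= '\t' && c <= '\r')) {
            t->ctype[c] |= CT_SPACE;   /* space, \t, \n, \v, \f, \r */
        }
    }
}
```

An application that wants more than this can already build tables for its
own locale at run time with pcre_maketables() and pass them as the last
argument to pcre_compile().
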
> > Also, we can distribute pcre_chartables.ascii.c and
> > pcre_chartables.ebcdic.c, and have the configure script symlink one of
> > them to pcre_chartables.c for the build.
>
> Another thought: We could distribute an ASCII pcre_chartables.c and
> invent a new configure option, say --rebuild-chartables. This would be
> disabled by default so that cross-compiling would work. If enabled, it
> would cause the current practice to be observed, that is, it would build
> the tables at compile time. This would preserve backwards compatibility.

This sounds reasonable. The option would have to be expressed as
--enable-something, however, since that is the form Autoconf gives to
optional features: --enable-rebuild-chartables, or perhaps
--enable-locale-specific-chartables. (The long name should discourage its
use }:)

> I suppose we might even make it build the chartables at ./configure
> time.

That's beyond the scope of what the configure script is supposed to do.

> > I see. It would probably be better to have this file under doc/, as a sort
> > of documentation-related "source file," and named without an .html
> > extension (so that it's not recognized as documentation in itself).
> > Something like index.html.src, or similar....
>
> OK, perhaps yes, that might be better. I'll put a comment in it too.

One other possibility, by the way: move the file to its final resting
place (doc/html/index.html), and instead of deleting the html/ directory
when regenerating the HTML, just delete html/pcre*.html.

> On Thu, 15 Mar 2007, Ralf Junker wrote:
>
> > Unfortunately, ISO-8859-2 does not help when PCRE is used in UTF-8
> > mode. Only ISO-8859-1 is an exact mapping of the first 256 Unicode
> > code points.
>
> That is true. That does make it a special case.

Yes, but not special in a useful way. If you're working in ISO-8859-2, then
0xC0 is "LATIN CAPITAL LETTER R WITH ACUTE" (U+0154), not "LATIN CAPITAL
LETTER A WITH GRAVE" (U+00C0). The Latin-1 table may map to Unicode exactly,
but without character conversion, it's still only applicable to Latin-1
locales.

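
A tiny illustration of the point, in case it helps. The function names are
hypothetical, and the Latin-2 side is only a four-entry excerpt of the
published ISO-8859-2 mapping, not a usable converter:

```c
#include <assert.h>

/* ISO-8859-1 bytes are numerically equal to their Unicode code points. */
static unsigned int latin1_to_unicode(unsigned char b)
{
    return b;
}

/* Excerpt of ISO-8859-2. Returns 0 for any byte this sketch
   does not list (it is NOT a complete mapping). */
static unsigned int latin2_excerpt_to_unicode(unsigned char b)
{
    switch (b) {
    case 0xA1: return 0x0104;  /* LATIN CAPITAL LETTER A WITH OGONEK */
    case 0xC0: return 0x0154;  /* LATIN CAPITAL LETTER R WITH ACUTE */
    case 0xC1: return 0x00C1;  /* LATIN CAPITAL LETTER A WITH ACUTE */
    case 0xC8: return 0x010C;  /* LATIN CAPITAL LETTER C WITH CARON */
    default:   return 0;       /* not covered by this excerpt */
    }
}

/* Returns 1 if the byte names the same character in both charsets.
   Only meaningful for bytes listed in the excerpt above. */
static int same_character(unsigned char b)
{
    unsigned int u2 = latin2_excerpt_to_unicode(b);

    assert(u2 != 0);  /* caller must stick to bytes in the excerpt */
    return latin1_to_unicode(b) == u2;
}
```

Some bytes do agree between the two (0xC1 is A-acute in both), which is
exactly why a Latin-1 table can look right in a Latin-2 locale until it
suddenly isn't.
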
> > I think the chartables.c question is a political one: It depends on if
> > you put the PCRE focus on UTF-8 mode or not: Choose ASCII to point out
> > PCRE's non-UTF-8 mode, but go for ISO-8859-1 to stress PCRE's Unicode
> > capabilities and if you think that UTF-8 will become the de facto
> > Unicode Encoding of the future.
>
> I wouldn't presume to guess what will be the de facto encoding of the
> future! In any case, I am bad at predicting the future (see above re the
> src directory). Unicode itself does seem to be becoming fairly de facto
> (and that is an argument for making chartables 8859-1), but whether we
> will end up with UTF-8, UTF-16, or something else, or a mixture, who
> knows?

The issue isn't about what encoding(s) will be mainstream in the future.
It's about the mapping of code points to meaningful characters: what should
chr(0xC0) give you? With Unicode (whether UTF-8, UCS-2, whatever), you have
a definite answer. With legacy systems supporting a mishmash of
incompatible 8-bit encodings, all bets are off beyond 0x7F.

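
For what it's worth, this is what "a definite answer" looks like in
practice: U+00C0 is always LATIN CAPITAL LETTER A WITH GRAVE, and only its
byte representation depends on the encoding. A minimal sketch of standard
UTF-8 encoding follows (utf8_encode is a name I made up, not a PCRE
function):

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns the number written,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+00C0 comes out as the two bytes 0xC3 0x80, so a lone 0xC0 byte never
appears in valid UTF-8 text at all.
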
> Proposal:
>
> 1. Distribute pcre_chartables.c with ISO-8859-1 values. This means
> that:
>
> . Anyone whose source contains only ASCII characters will be happy.
>
> . Anyone operating in non-UTF-8 mode, but with Unicode or 8859-1
> character values, will be happy.
>
> . Anybody operating in UTF-8 mode (assuming Unicode character values)
> will be happy.

Aaaaand... anyone operating in non-UTF-8, non-Latin-1 mode will get
incorrect results for bytes above 0x7F :(

Again I say, have the distributed table cover ASCII only, and rely on
proposal #2 below to support those who want something locale-specific.

> 2. Add --rebuild-chartables to ./configure, and if possible, make it
> compile and run dftables at configure time in the current locale.
> This means that:
>
> . Anybody who relied on the previous behaviour in a non-8859-1 locale
> will be able to re-create it.
>
> . The few people (if any) who build PCRE in an EBCDIC environment can
> use this (we might even default it if EBCDIC is selected), so we
> don't have to distribute EBCDIC tables.

This sounds good, especially the EBCDIC part. I'd just generalize the first
point to apply to Latin-1 locales as well.

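
Roughly what I have in mind for the locale-specific case, as a sketch
only. This is not dftables' actual code or output format; the three
function names and the emitted array name fold_table are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Fill table[] with the lower-case mapping of every byte value, as
   reported by the C library for the locale currently in effect. */
static void build_locale_fold_table(unsigned char table[256])
{
    int c;

    assert(table != NULL);
    for (c = 0; c < 256; c++)
        table[c] = (unsigned char)tolower(c);
}

/* Select the named locale's character classes, then build the table.
   Returns 0 on success, -1 if the C library does not know the locale. */
static int build_fold_table_for(const char *locale_name,
                                unsigned char table[256])
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    build_locale_fold_table(table);
    return 0;
}

/* Emit the table as C source, eight entries per line. */
static void write_fold_table(FILE *out, const unsigned char table[256])
{
    int c;

    assert(out != NULL && table != NULL);
    fprintf(out, "static const unsigned char fold_table[256] = {\n");
    for (c = 0; c < 256; c++) {
        fprintf(out, "%s0x%02X%s",
                (c % 8 == 0) ? "  " : "",
                (unsigned int)table[c],
                (c == 255) ? "\n" : (c % 8 == 7) ? ",\n" : ", ");
    }
    fprintf(out, "};\n");
}
```

With the "C" locale this degenerates to an ASCII-only mapping; passing ""
as the locale name makes setlocale() take the locale from the environment
(LC_ALL, LC_CTYPE, LANG).
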
> If we can make table re-building operate at ./configure time, it may
> solve the cross-compiling problem, because part of the ./configure thing
> is to run little test programs while configuring, isn't it? That means
> it must have access to a non-cross-compiler. I need to investigate this
> further.

Compiler invocations in the configure script are strictly for running
tests, however. The rules and dependencies relating to dftables and
pcre_chartables.c belong in the makefile.

If anything, we can have the configure script export a CC_FOR_BUILD-ish
variable naming a native compiler (one that produces programs runnable on
the build machine). I'm not sure what the relevant conventions are here
w.r.t. Autotools, but we're certainly not the first ones in this position.

--Daniel