Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with

Author: Daniel Richard G.
Date:  
To: pcre-dev
Subject: Re: [pcre-dev] Here is pcre-7.1-RC1 for you to play with
On Wed, 2007 Mar 14 10:13:44 +0000, Philip Hazel wrote:
>
> > Anyway, I think the "C" locale are the ISO-8859-1 characters. The
> > only difference is how they sort and stuff.
>
> No, the "C" locale doesn't go beyond 127, I don't think.


Aye, or at least we can't make any assumptions about 0x80-0xFF.

> > Hi everyone. Sorry for dropping out there; was sick these past couple days.
>
> Sorry to hear that. Don't overdo things while you are recovering.


It wasn't that bad, but thanks :)

> > Why not have a fixed set of character tables, defined in Unicode, and have
> > PCRE be sensitive to the locale at runtime---converting from Latin-1 or
> > UTF-8 or whatever as necessary? (This is basically what Ralf was
>
> It's a performance issue. You don't really want to be doing character
> conversions every time you call pcre_compile() or pcre_exec(). (And I
> know there are plenty of PCRE users who do not use the UTF-8/Unicode
> features.)
>
> Remember, an application can, at runtime, generate a set of tables of
> its own and pass them to PCRE. Apart from the initial generation,
> there's no performance penalty in doing that.


Ah, so you're alluding to pcre_maketables()? I see what you're getting at.

In which case... how about this: We have pcre_chartables.c cover no more
and no less than ASCII. (So we don't have to make any decisions about
locale at build time.) We then say that if a user wants to do anything
special with characters beyond 0x7F, s/he has to generate locale-specific
tables at run time.
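
To make the ASCII-only proposal concrete, here is a minimal sketch of the
two kinds of table. It is not PCRE's code: build_lower_table() is a
hypothetical helper, and the single 256-entry lowercasing table is only a
stand-in for the real table layout. A real application would call
pcre_maketables() for the run-time case.

```c
#include <ctype.h>

/* Hypothetical helper, not part of PCRE: fill a 256-entry lowercasing table.
 *
 * ascii_only != 0: only 'A'..'Z' are mapped; every other byte, including
 *                  all of 0x80-0xFF, maps to itself regardless of locale.
 *                  (Assumes an ASCII execution character set, where the
 *                  letters are contiguous; this would not hold for EBCDIC.)
 * ascii_only == 0: tolower() is consulted, so the result depends on the
 *                  LC_CTYPE locale in effect when this function runs. */
void build_lower_table(unsigned char table[256], int ascii_only)
{
    for (int c = 0; c < 256; c++) {
        if (ascii_only)
            table[c] = (unsigned char)((c >= 'A' && c <= 'Z')
                                           ? c + ('a' - 'A')
                                           : c);
        else
            table[c] = (unsigned char)tolower(c);
    }
}
```

The ascii_only form is what a build-time table could contain without any
locale decision being made; the other form corresponds to tables that an
application generates for itself after calling setlocale().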

Also, we can distribute pcre_chartables.ascii.c and
pcre_chartables.ebcdic.c, and have the configure script symlink one of them
to pcre_chartables.c for the build.

> > Currently, the tables deal only with ASCII characters (0x7F and below),
> > yes? In which case, shouldn't a single pcre_chartables.c work for pretty
> > much everyone except EBCDIC users?
>
> In non-UTF-8 mode, they work for the full 256 characters.


Okay, but as defined by the locale in which dftables is run, then...
(correct me if I'm wrong).
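
A rough way to see that dependence, as a sketch only: the hypothetical
helper below (not part of PCRE or dftables) counts how many bytes in
0x80-0xFF isalpha() accepts under a named locale. Which locale names are
installed, and what they report, is system-specific.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: count the bytes in 0x80-0xFF that isalpha() accepts
 * under the named LC_CTYPE locale. The previous locale is restored before
 * returning. Returns -1 if the locale cannot be selected. */
int count_high_alpha_in(const char *locale_name)
{
    char saved[128];
    const char *cur = setlocale(LC_CTYPE, NULL);

    /* Copy the current locale name; the returned string may be
     * overwritten by the next setlocale() call. */
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    int n = 0;
    for (int c = 0x80; c <= 0xFF; c++)
        if (isalpha(c))
            n++;

    setlocale(LC_CTYPE, saved);
    return n;
}
```

In the "C" locale the count is zero, since isalpha() there accepts only the
basic upper- and lowercase letters. Under a single-byte locale such as an
ISO-8859-1 one, if installed, it would typically be nonzero, which is why
tables generated at build time inherit whatever locale the build ran in.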

> > > 3. The Contrib directory on the PCRE ftp site
>
> > Pretty much all the build-project stuff will be subsumed by the CMake
> > support. Remaining interesting stuff: Symbian support (we'd need someone to
> > do testing), the Delphi wrapper (ditto)... pcre_subst and worddefine look
> > like nice extras.
>
> Thanks for looking at that. What I think I'll do before the next release
> is to remove everything except those that you have mentioned.


Sounds good, though what about just throwing those items into a tar file so
they can be archived without causing clutter? It would be nice if the
material were still accessible in some way, just in case.

> > * I noticed the new scripts: PrepareRelease, CleanTxt, etc. (1) Could
> > you tack on a .pl extension, to differentiate these from plain shell
> > scripts? (2) When we move the source files into src/, it would
> > probably be good to move these scripts into their own directory as well.
>
> Sigh. That now involves editing several documentation files and the
> scripts themselves (since they call each other). And the relevant
> messing about in svn. Not exactly an instant's work.


It's not that big a deal. This is more in the way of unspoken convention
for Autotoolized packages---some things you expect to see, and some things
you don't. (I've seen many packages, and deviations from the norm, so I
have a well-developed "spider-sense" about this :-)

> I suppose this could be done (he said, grudgingly), but I don't really
> see the need - what do other people think? In the Unix world we don't
> normally differentiate different types of executable by their names.
> Shell scripts, Perl scripts, Python scripts, binaries, whatever, are
> plentiful on my box, without special extensions.


Well, there's a difference between scripts in a PATH directory, versus a
source tarball. Aside from serving the purpose of language identification,
it's useful so that text editors can load the appropriate
syntax-highlighting rules and such.

> I must say that if I see a file called xxx.pl I assume that it has to be
> run by "perl xxx.pl" rather than just "xxx.pl", but maybe I'm out of
> line here.


That's what the +x bit and colorized ls(1) are for :-)

> <notserious>
> And what if I make PrepareRelease into a shell script that just calls
> Perl? <grin/>
> </notserious>


You should have brought up the example of scripts that happen to be both
valid Perl and Bourne shell ^_^

> Not sure about (2). In fact, I wasn't originally going to distribute
> them at all, having put them in the "maint" directory, but I was
> persuaded by Craig's argument about distribution packagers making tweaks.


You needn't worry about this; there's good precedent for having these
things in the tarball. (GCC being one example.) But having them in a
separate directory would help keep down clutter.

> > * The files !compile.txt and !linklib.txt... could we rename those to
> > something that doesn't use shell-unfriendly bang-marks?
>
> I don't know. They were contributed with those names. I didn't want to
> change them in case that was somehow significant in the environment for
> which they are intended (not Unix).


Hmm... I don't suppose Mr. Tokarev is reading this list?

> > * What's this Index.html file in the toplevel? (what of
> > doc/html/index.html?)
>
> Index.html is the base, hand-made HTML file. When PrepareRelease is run,
> it removes the entire doc/html directory. It then copies Index.html into
> doc/index.html and regenerates the remaining files in doc/html from the
> man pages.


I see. It would probably be better to have this file under doc/, as a sort
of documentation-related "source file," and named without an .html
extension (so that it's not recognized as documentation in itself).
Something like index.html.src, or similar....

> > * Is Tech.Notes no longer to be distributed?
>
> Since it is only of interest to people who want to develop the code, I
> thought it should remain in the "maint" directory which such developers
> would get from svn. The vast majority of people who use the distribution
> would surely have no interest in it.


True, but some folks would do development from the tarball, too (and might
not have access to SVN). Many source tarballs include a HACKING file.

I'd suggest bringing the file into the top-level directory, and renaming it
to HACKING (or TECHNOTES)....

> > * I noticed some minor inconsistencies w.r.t. inter-sentence spacing in
> > the config.h comments in configure.ac (one space or two?). I don't
> > know how much you care about this, but it appeared to be using two
> > spaces [more] consistently before.
>
> I *never* use two spaces. I find reading text with extra spaces after
> full stops to be "bumpy". It disrupts the even flow of words. Using
> additional space has never been the norm in traditional book printing.
> Certainly not in Europe, and I don't think in North America either.
> Somehow it got introduced into typescripts - I don't know the history of
> this - and it is sometimes taught as "standard" in typing classes. I'll
> clean up the comments.


    http://en.wikipedia.org/wiki/French_spacing


I guess it goes to show, the English still hate the French ^_^


--Daniel


-- 
NAME   = Daniel Richard G.       ##  Remember, skunks       _\|/_  meef?
EMAIL1 = skunk@???        ##  don't smell bad---    (/o|o\) /
EMAIL2 = skunk@???      ##  it's the people who   < (^),>
WWW    = http://www.******.org/  ##  annoy them that do!    /   \
--
(****** = site not yet online)