Hello,
This is my first mail to this list. I am doing some hacking on GLib (part
of GNOME), specifically their GRegex module which is built on top of this
(incredible) library.
I was digging into pcre_compile.c
(here<
http://vcs.pcre.org/viewvc/code/trunk/pcre_compile.c?revision=1233&view=markup>)
to understand ownership of 'errorptr'. I know now that ownership is not
transferred to the caller.
I found this line in pcre_compile2 ():
>> *errorptr = find_error_text(errorcode);
Following the trail, I found this comment:
/* The texts of compile-time error messages. These are "char *" because they
are passed to the outside world. Do not ever re-use any error number,
because
they are documented. Always add a new error instead. Messages marked DEAD
below
are no longer used. This used to be a table of strings, but in order to
reduce
the number of relocations needed when a shared library is loaded
dynamically,
it is now one long string. We cannot use a table of offsets, because the
lengths of inserts such as XSTRING(MAX_NAME_SIZE) are not known. Instead, we
simply count through to the one we want - this isn't a performance issue
because these strings are used only when there is a compilation error.
Each substring ends with \0 to insert a null character. This includes the
final
substring, so that the whole string ends with \0\0, which can be detected
when
counting through. */
I have no doubt that this design exists for very good reasons, as explained
in the comment. (I had never considered relocation impact from a huge list
of char pointers.) Regarding the note about "performance issue", I have an
idea to improve.
What if we create a (static) list of offsets that is only initialised at
first compile error?
Something like:
>> static int error_text_offsets[78] = { -1 };
I am willing to write the code and submit a patch. However, since I am new
to this project, I don't know enough about its style (example: cathedral
vs. bazaar?). Also, I wanted to have my idea considered before writing
code.
If you think this idea is worthy, there are two areas of concern I can
think of:
Sizing error_text_offsets
=================
I am unsure if error_text_offsets should be sized (1) static/precisely, (2)
static/liberally, or (3) dynamically ((3a) precisely or (3b) liberally).
(1) static/precisely: we need to keep error_text_offsets precisely sized to
error_texts. Today, it looks to be 78.
This uses the least amount of memory, but incurs higher
risk/maintenance costs.
I'm all about low-maintenance ("future proofing") where
possible/reasonable.
This would probably require a test to be written to ensure the precise
size is correct.
Imagine a scenario where error_texts has an addition, but
error_text_offsets does not grow.
I have not yet looked if PCRE includes testing as part of its release
procedure. I assume yes.
(2) static/liberally: something big, but not crazy, like 255. ex: static
int error_text_offsets[255]
If error_texts grows over time, we have space. Low maintenance, but
takes a bit more memory.
(3) dynamically: static int *error_text_offsets = NULL
allocate on first use via malloc() or PRCE's built-in/override
method
Again: do we allocate precisely or liberally?
(a) Precisely can be done via static constant or at runtime
(counting the number of embedded strings).
(b) Liberally: Allocate something big enough, like 255, to
prevent double scan/two-pass over error_texts.
First pass: Calculate size of offsets array.
Second pass: Calculate and store offsets.
Multithreading
==========
Finally, what about multithreading (MT)? Imagine the scenario where two
parallel threads call pcre_compile2 () independently. Both their compiles
fail. The init routine for error_text_offsets needs to be MT-aware.
Generally, (in C) I don't know how to properly handle that issue when
initialising this list the first time. (I know how to do it in Java/C#,
but there you get a huge framework library with every install. C is bit
trickier due to cross platform issues -- pthreads, etc.) Plus, I know
nothing about how PRCE handles/supports multithreading.
Thanks,
Arpe