Author: Philip Hazel
Date:
To: David Dennerline
CC: pcre-dev
Subject: Re: [pcre-dev] Add #ifdef SUPPORT_UCP to pcre_ucd.c

On Thu, 30 Apr 2009, David Dennerline wrote:
> Would there be any problem with adding a #ifdef SUPPORT_UCP to prevent
> including the Unicode character table for pcre_ucd.c? I tried doing this
> because I don't need UTF-8 support and it decreases binary size by 50KB. I
> saw that GET_UCD() is called in pcre_dfa_exec() only if SUPPORT_UCP is
> defined.
I don't see any problem, but then again, I don't see the need. Surely if
there are no references to the module, it won't get included in the
binary? I thought that was how libraries worked?
> The program compiles and links correctly, but I wanted to double-check to
> see if there would be any potential instability.
You did not say which operating system you are using. Is it Windows? I
know nothing about Windows, never having used it. I have just done an
experiment on Linux, and when I compile with UCP support disabled,
adding #ifdef SUPPORT_UCP makes no difference at all to the size of the
binaries for pcretest and pcregrep (though it does reduce the size of
the pcre_ucd.o compiled module). The binaries are, however, noticeably
smaller than when UCP support is enabled (by more than 50K because a lot
of other code is cut out as well as the tables).
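
For illustration only, the kind of guard being discussed might look roughly like the sketch below. Apart from SUPPORT_UCP, every name here (demo_ucd_table, demo_table_bytes, demo_lookup, demo_self_check) is hypothetical, and the four-entry table merely stands in for the real Unicode data; this is not the actual content of pcre_ucd.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a large property table. The real tables in
 * pcre_ucd.c are far bigger; only the guarding pattern is of interest. */
#ifdef SUPPORT_UCP
static const unsigned char demo_ucd_table[4] = { 1, 2, 3, 4 };
#define DEMO_TABLE_BYTES (sizeof demo_ucd_table)
#else
#define DEMO_TABLE_BYTES ((size_t)0)   /* table compiled out entirely */
#endif

/* Number of bytes of table data compiled into this module. */
size_t demo_table_bytes(void)
{
    return DEMO_TABLE_BYTES;
}

/* Look up a value; without SUPPORT_UCP there is no table, so return 0. */
int demo_lookup(unsigned int c)
{
#ifdef SUPPORT_UCP
    return c < 4 ? demo_ucd_table[c] : 0;
#else
    (void)c;                           /* unused when the table is absent */
    return 0;
#endif
}

/* Consistency check that holds in either build configuration. */
void demo_self_check(void)
{
#ifdef SUPPORT_UCP
    assert(demo_table_bytes() == sizeof demo_ucd_table);
#else
    assert(demo_table_bytes() == 0);
#endif
}
```

Compiling this once with -DSUPPORT_UCP and once without, then comparing the object files, shows the effect on the module itself. Whether the final binary changes depends on whether the linker pulls the module in at all, which is the point of the experiment described above.
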
> Second, has there ever been any discussion or any plans on trying to
> implement a hybrid NFA/DFA engine that would improve performance for
> applications that do not require back-references (i.e., substitution) or
> other non-DFA friendly constructs. Something like Henry Spencer's Tcl
> regular expression parser.
There has been no discussion or planning that I am aware of. A while
before I retired (18 months ago) I did start thinking about the
possibility of turning the compiled regex into a proper state table for
a traditional finite state machine that would probably execute faster
than pcre_dfa_exec(). However, I did not get very far (it was very
tricky, as I recall) and I have not picked this up again since.
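
To make the "state table" idea concrete, here is a minimal sketch of a table-driven finite state machine, hand-built for the single anchored pattern ab*c. It is an assumption-laden illustration, not PCRE code and not a design that was ever worked out: the names (transition, classify, demo_match) are hypothetical, and a real implementation would have to generate the table from the compiled regex, which is the tricky part.

```c
#include <assert.h>
#include <stddef.h>

/* States of a hand-built machine for the anchored pattern ab*c. */
enum { S_START, S_LOOP, S_ACCEPT, S_DEAD, N_STATES };

/* Input characters are reduced to a few classes to keep the table small. */
enum { C_A, C_B, C_C, C_OTHER, N_CLASSES };

/* transition[state][class] gives the next state. */
static const unsigned char transition[N_STATES][N_CLASSES] = {
    /* S_START  */ { S_LOOP, S_DEAD, S_DEAD,   S_DEAD },
    /* S_LOOP   */ { S_DEAD, S_LOOP, S_ACCEPT, S_DEAD },
    /* S_ACCEPT */ { S_DEAD, S_DEAD, S_DEAD,   S_DEAD },
    /* S_DEAD   */ { S_DEAD, S_DEAD, S_DEAD,   S_DEAD },
};

/* Map a character to its class. */
static int classify(unsigned char ch)
{
    switch (ch) {
    case 'a': return C_A;
    case 'b': return C_B;
    case 'c': return C_C;
    default:  return C_OTHER;
    }
}

/* Return 1 if the whole subject matches ab*c, otherwise 0.
 * Each input character costs one table lookup and there is no
 * backtracking, which is where the hoped-for speed would come from. */
int demo_match(const char *subject)
{
    assert(subject != NULL);
    int state = S_START;
    for (const unsigned char *p = (const unsigned char *)subject;
         *p != '\0'; p++) {
        state = transition[state][classify(*p)];
        if (state == S_DEAD)
            return 0;                  /* no way back from the dead state */
    }
    return state == S_ACCEPT;
}
```

For example, demo_match("abbbc") returns 1, while demo_match("abb") and demo_match("abca") return 0. Constructs such as back-references cannot be expressed in a table of this kind, which is why the question restricts itself to patterns that do not need them.
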