Re: [pcre-dev] How to iterate over the compiled regular expr…

Author: Philip Hazel
Date:  
To: Jacob Rief
CC: pcre-dev
Subject: Re: [pcre-dev] How to iterate over the compiled regular expression tree?
On Thu, 14 Oct 2010, Jacob Rief wrote:

> pcre* re = pcre_compile("abc.+xyz", 0, &error, &erroffset, NULL);
>
> but the result is somehow disappointing:
>
> (gdb) p *re
> $1 = {magic_number = 1346589253, size = 69, options = 0, flags = 6,
> dummy1 = 0, top_bracket = 0, top_backref = 0, first_byte = 97,
> req_byte = 634, name_table_offset = 48, name_entry_size = 0, name_count = 0,
> ref_count = 0, tables = 0x0, nullpad = 0x0}
>
> Note that gdb automatically uses 'struct real_pcre' to display the
> content of 're'.
>
> But there is no tree and no anchor to anything looking like a tree.
> How can 'struct real_pcre' store anything like a compiled version of
> my regex? Is there a way to access that content?


The data is not compiled into a tree. It is compiled into a byte code,
in a block that is tacked onto the end of the real_pcre struct (after
the name table, if there is one). The file called HACKING in the PCRE
distribution gives some description of this code. To do what you want,
you could scan along the code, looking for items that match literal
characters (and if there were several in succession, join them into
strings).
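
As a rough illustration of where that block sits, the sketch below works out
the start and length of the byte code from the fields in the gdb output quoted
above. The struct header_view is a hypothetical stand-in that copies only the
four fields needed; it is not the real real_pcre declaration. The sketch also
assumes that size counts the whole block, header included.

```c
#include <assert.h>

/* Hypothetical view of four fields from the gdb dump; NOT the real
   real_pcre layout from pcre_internal.h. */
struct header_view {
    unsigned int size;              /* assumed: total bytes, header included */
    unsigned int name_table_offset; /* where the name table starts */
    unsigned int name_entry_size;   /* bytes per name table entry */
    unsigned int name_count;        /* number of name table entries */
};

/* The byte code follows the name table (which may be empty). */
static unsigned int code_offset(const struct header_view *h)
{
    return h->name_table_offset + h->name_count * h->name_entry_size;
}

/* Whatever remains after the header and name table is byte code. */
static unsigned int code_length(const struct header_view *h)
{
    unsigned int off = code_offset(h);
    assert(h->size >= off);
    return h->size - off;
}
```

With the values from the dump (size = 69, name_table_offset = 48, no names),
this gives an offset of 48 and a length of 21 bytes.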

There is a function called pcre_printint() in the source file
pcre_printint.src (because it is #included in more than one place). This
function does such a scan, printing out the encoded items. You could
adapt this to ignore everything except literal character matches.

pcre_printint() is used by the pcretest program to show the encoded
regex:

$ ./pcretest
PCRE version 8.11-RC1 2010-10-09

re> /abc.+xyz/B

------------------------------------------------------------------
  0  17 Bra
  3     abc
  9     Any+
 11     xyz
 17  17 Ket
 20     End
------------------------------------------------------------------

data>
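
A minimal sketch of such a literal-only scan follows. It does not use PCRE's
real opcodes: the T_* values, item_length() and collect_literals() are all
made up for illustration, and the real opcode values and item lengths would
have to come from pcre_internal.h. The sample array only mimics the layout
suggested by the offsets in the listing above (3 bytes for Bra and Ket, 2
bytes per literal character, 2 bytes for Any+, 1 byte for End).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opcode values, for illustration only. */
enum { T_END = 0, T_CHAR = 1, T_ANYPLUS = 2, T_BRA = 3, T_KET = 4 };

/* Toy code for /abc.+xyz/, laid out to match the offsets 0,3,9,11,17,20. */
static const unsigned char sample[21] = {
    T_BRA, 0, 17,
    T_CHAR, 'a', T_CHAR, 'b', T_CHAR, 'c',
    T_ANYPLUS, 0,
    T_CHAR, 'x', T_CHAR, 'y', T_CHAR, 'z',
    T_KET, 0, 17,
    T_END
};

/* Bytes occupied by one item, or 0 if the opcode is unknown. */
static size_t item_length(unsigned char op)
{
    switch (op) {
    case T_END:     return 1;
    case T_CHAR:    return 2;   /* opcode + character */
    case T_ANYPLUS: return 2;   /* repeat opcode + type byte */
    case T_BRA:
    case T_KET:     return 3;   /* opcode + 2-byte link */
    default:        return 0;
    }
}

/* Walk the code and join consecutive literal characters into strings.
   Returns the number of strings stored in out[], or -1 on bad input. */
static int collect_literals(const unsigned char *code, size_t len,
                            char out[][16], int max_runs)
{
    int runs = 0, in_run = 0;
    size_t i = 0, n = 0;

    assert(code != NULL);
    while (i < len) {
        unsigned char op = code[i];
        size_t step = item_length(op);
        if (step == 0 || i + step > len) return -1;
        if (op == T_CHAR) {
            if (!in_run) {
                if (runs == max_runs) return -1;
                in_run = 1;
                n = 0;
            }
            if (n + 1 >= 16) return -1;      /* run too long for the buffer */
            out[runs][n++] = (char)code[i + 1];
            out[runs][n] = '\0';
        } else if (in_run) {
            runs++;                          /* a non-literal ends the run */
            in_run = 0;
        }
        if (op == T_END) break;
        i += step;
    }
    if (in_run) runs++;
    return runs;
}
```

Run over the sample array, collect_literals() finds two strings, "abc" and
"xyz". A real adaptation would start from pcre_printint() instead, since it
already knows the length of every item.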


Although there is no guarantee that the actual operator values in the
byte code will remain the same from release to release, in practice they
do not change very often.

I hope this is helpful.

Philip

--
Philip Hazel