[pcre-dev] JIT and callouts

Author: Zoltán Herczeg
Date:  
To: pcre-dev
Subject: [pcre-dev] JIT and callouts
Hi,

I am thinking again about JIT support for callouts, because it seems there are people who are interested. In theory, calling a user function is easy. The problem is that the internal representation used by the JIT-compiled code is different from the one used by the interpreter.

Here are the members of the pcre_callout_block structure:

typedef struct pcre_callout_block {
  int          version;           /* Identifies version of block */
    /* OK - Only a constant. */
  /* ------------------------ Version 0 ------------------------------- */
  int          callout_number;    /* Number compiled into pattern */
    /* OK - Only a constant. */
  int         *offset_vector;     /* The offset vector */
    /* See later. */
  PCRE_SPTR    subject;           /* The subject being matched */
    /* OK - easy to retrieve. */
  int          subject_length;    /* The length of the subject */
    /* OK - easy to retrieve. */
  int          start_match;       /* Offset to start of this match attempt */
    /* OK - easy to retrieve. */
  int          current_position;  /* Where we currently are in the subject */
    /* OK - easy to retrieve. */
  int          capture_top;       /* Max current capture */
    /* See later. */
  int          capture_last;      /* Most recently closed capture */
    /* See later. */
  void        *callout_data;      /* Data passed in with the call */
    /* OK - easy to retrieve. */
  /* ------------------- Added for Version 1 -------------------------- */
  int          pattern_position;  /* Offset to next item in the pattern */
    /* OK - Only a constant. */
  int          next_item_length;  /* Length of next item in the pattern */
    /* OK - Only a constant. */
  /* ------------------- Added for Version 2 -------------------------- */
  const unsigned char *mark;      /* Pointer to current mark or NULL    */
    /* OK - easy to retrieve. */
  /* ------------------------------------------------------------------ */
} pcre_callout_block;


Return value: 0 means continue the match, greater than 0 means backtrack, and less than 0 means abandon the match and return this value. These options are easy to support.
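
To make the three cases concrete, here is a minimal sketch of how a matcher might classify the value returned by the user's callout function. The enum and the function name are hypothetical and only for illustration; they are not taken from the PCRE sources.

```c
/* Hypothetical names, for illustration only (not PCRE code). */
typedef enum {
  CALLOUT_CONTINUE,   /* rc == 0: carry on matching as normal */
  CALLOUT_BACKTRACK,  /* rc > 0: fail at this point, try other alternatives */
  CALLOUT_ABANDON     /* rc < 0: stop matching, the matcher returns rc */
} callout_action;

/* Map the callout's return value to the action the matcher takes. */
static callout_action classify_callout_result(int rc)
{
  if (rc == 0)
    return CALLOUT_CONTINUE;
  if (rc > 0)
    return CALLOUT_BACKTRACK;
  return CALLOUT_ABANDON;
}
```

In JIT-generated code this would presumably be two conditional jumps right after the call, which is why this part is the easy one.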

So most of the members are easy to support, except:

offset_vector - the current offsets are not stored in the offset vector. They are stored on the stack, and they are character pointers (a -1 offset is represented by subject_start - 1). Converting them back takes a lot of time. Furthermore, in the "optimized" case the start offset is updated as soon as we enter a capturing block, so the value pair may be inconsistent. This optimization is disabled if a particular offset pair is referenced by a backreference or a conditional block. In this "unoptimized" case we use an extra temporary value to store the offset after we enter a capturing block. We could disable this optimization entirely when callouts are present, sacrificing some performance to make callouts more consistent.
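
The conversion described above could look roughly like the following sketch. It assumes 8-bit characters and a flat array of stored pointers; the helper name and its parameters are hypothetical, and the real JIT stack layout is different.

```c
#include <stddef.h>

/* Hypothetical helper (not PCRE code): convert JIT-style character
   pointers into interpreter-style integer offsets. An unset capture is
   stored as (subject - 1), so the subtraction yields -1 for it. */
static void pointers_to_offsets(const char *subject,
                                const char *const *ptrs,
                                int *ovector, size_t count)
{
  size_t i;
  for (i = 0; i < count; i++)
    ovector[i] = (int)(ptrs[i] - subject);
}
```

Each conversion is a single subtraction, but it has to be repeated for every offset on every callout, which is where the cost comes from.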

capture_top, capture_last - these are not stored by the JIT compiler. The capture_top value is calculated only when a successful match is found, by searching backwards from the last offset for the last one that is not -1. Once again: JIT stores the offsets on the stack, and it always has enough space to store all offsets (unlike the interpreter when a limited ovector is passed). After the match is finished, the necessary values are copied back to the ovector and converted to offsets. The capture_last value is not maintained at all.
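
The backward search for capture_top can be sketched as follows. This operates on already converted integer offsets and assumes that capture_top is the index of the highest set offset pair plus one; the function name is hypothetical and this is not the actual JIT code.

```c
/* Hypothetical helper (not PCRE code): scan the offsets from the end and
   return the index of the last pair containing a value other than -1,
   plus one. Returns 0 if every offset is -1. */
static int compute_capture_top(const int *offsets, int pair_count)
{
  int i;
  for (i = 2 * pair_count - 1; i >= 0; i--)
    if (offsets[i] != -1)
      return (i >> 1) + 1;
  return 0;
}
```

Doing this scan at every callout, instead of once at the end of a successful match, would be an extra cost on top of the offset conversion.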

The question is what to do. Is it worth implementing a restricted callout mechanism (where some members are set to an invalid value)? What should we do with the ovector? And a theoretical question: is JIT worth it when we call expensive C functions?

Any feedback is welcome.

Thanks,
Zoltan