[Pcre-svn] [920] code/trunk: Add support to pcre2grep for bi…

Author: Subversion repository
To: pcre-svn
Subject: [Pcre-svn] [920] code/trunk: Add support to pcre2grep for binary zeros in -f files.
Revision: 920
          http://www.exim.org/viewvc/pcre2?view=rev&revision=920
Author:   ph10
Date:     2018-02-24 17:09:19 +0000 (Sat, 24 Feb 2018)
Log Message:
-----------
Add support to pcre2grep for binary zeros in -f files. 


Modified Paths:
--------------
    code/trunk/ChangeLog
    code/trunk/RunGrepTest
    code/trunk/doc/pcre2grep.1
    code/trunk/src/pcre2grep.c
    code/trunk/testdata/grepoutput


Modified: code/trunk/ChangeLog
===================================================================
--- code/trunk/ChangeLog    2018-02-20 15:37:49 UTC (rev 919)
+++ code/trunk/ChangeLog    2018-02-24 17:09:19 UTC (rev 920)
@@ -25,7 +25,10 @@
 issue was fixed for other kinds of repeat in release 10.20 by change 19, but
 repeating character classes were overlooked.


+5. pcre2grep now supports the inclusion of binary zeros in patterns that are
+read from files via the -f option.

+
 Version 10.31 12-February-2018
 ------------------------------


Modified: code/trunk/RunGrepTest
===================================================================
--- code/trunk/RunGrepTest    2018-02-20 15:37:49 UTC (rev 919)
+++ code/trunk/RunGrepTest    2018-02-24 17:09:19 UTC (rev 920)
@@ -641,7 +641,13 @@
 $valgrind $vjs $pcre2grep --colour=always '(?=[ac]\K)' testNinputgrep >>testtrygrep
 echo "RC=$?" >>testtrygrep


+echo "---------------------------- Test 126 -----------------------------" >>testtrygrep
+printf "Next line pattern has binary zero\nABC\x00XYZ\n" >testtemp1grep
+printf "ABC\x00XYZ\nABCDEF\nDEFABC\n" >testtemp2grep
+$valgrind $vjs $pcre2grep -a -f testtemp1grep testtemp2grep >>testtrygrep
+echo "RC=$?" >>testtrygrep

+
 # Now compare the results.

 $cf $srcdir/testdata/grepoutput testtrygrep

Modified: code/trunk/doc/pcre2grep.1
===================================================================
--- code/trunk/doc/pcre2grep.1    2018-02-20 15:37:49 UTC (rev 919)
+++ code/trunk/doc/pcre2grep.1    2018-02-24 17:09:19 UTC (rev 920)
@@ -1,4 +1,4 @@
-.TH PCRE2GREP 1 "13 November 2017" "PCRE2 10.31"
+.TH PCRE2GREP 1 "24 February 2018" "PCRE2 10.32"
 .SH NAME
 pcre2grep - a grep with Perl-compatible regular expressions.
 .SH SYNOPSIS
@@ -121,6 +121,14 @@
 of changing the way binary files are handled.
 .
 .
+.SH "BINARY ZEROS IN PATTERNS"
+.rs
+.sp
+Patterns passed from the command line are strings that are terminated by a
+binary zero, so cannot contain internal zeros. However, patterns that are read 
+from a file via the \fB-f\fP option may contain binary zeros.
+.
+.
 .SH OPTIONS
 .rs
 .sp
@@ -304,12 +312,15 @@
 .TP
 \fB-f\fP \fIfilename\fP, \fB--file=\fP\fIfilename\fP
 Read patterns from the file, one per line, and match them against each line of
-input. What constitutes a newline when reading the file is the operating
-system's default. The \fB--newline\fP option has no effect on this option.
-Trailing white space is removed from each line, and blank lines are ignored. An
-empty file contains no patterns and therefore matches nothing. See also the
-comments about multiple patterns versus a single pattern with alternatives in
-the description of \fB-e\fP above.
+input. As is the case with patterns on the command line, no delimiters should
+be used. What constitutes a newline when reading the file is the operating
+system's default interpretation of \en. The \fB--newline\fP option has no
+effect on this option. Trailing white space is removed from each line, and
+blank lines are ignored. An empty file contains no patterns and therefore
+matches nothing. Patterns read from a file in this way may contain binary
+zeros, which are treated as ordinary data characters. See also the comments
+about multiple patterns versus a single pattern with alternatives in the
+description of \fB-e\fP above.
 .sp
 If this option is given more than once, all the specified files are read. A
 data line is output if any of the patterns match it. A file name can be given
@@ -320,14 +331,15 @@
 .TP
 \fB--file-list\fP=\fIfilename\fP
 Read a list of files and/or directories that are to be scanned from the given
-file, one per line. Trailing white space is removed from each line, and blank
-lines are ignored. These paths are processed before any that are listed on the
-command line. The file name can be given as "-" to refer to the standard input.
-If \fB--file\fP and \fB--file-list\fP are both specified as "-", patterns are
-read first. This is useful only when the standard input is a terminal, from
-which further lines (the list of files) can be read after an end-of-file
-indication. If this option is given more than once, all the specified files are
-read.
+file, one per line. What constitutes a newline when reading the file is the
+operating system's default. Trailing white space is removed from each line, and
+blank lines are ignored. These paths are processed before any that are listed
+on the command line. The file name can be given as "-" to refer to the standard
+input. If \fB--file\fP and \fB--file-list\fP are both specified as "-",
+patterns are read first. This is useful only when the standard input is a
+terminal, from which further lines (the list of files) can be read after an
+end-of-file indication. If this option is given more than once, all the
+specified files are read.
 .TP
 \fB--file-offsets\fP
 Instead of showing lines or parts of lines that match, show each match as an
@@ -679,12 +691,13 @@
 different newline conventions from the default. Any parts of the input files
 that are written to the standard output are copied identically, with whatever
 newline sequences they have in the input. However, the setting of this option
-does not affect the interpretation of files specified by the \fB-f\fP,
-\fB--exclude-from\fP, or \fB--include-from\fP options, which are assumed to use
-the operating system's standard newline sequence, nor does it affect the way in
-which \fBpcre2grep\fP writes informational messages to the standard error and
-output streams. For these it uses the string "\en" to indicate newlines,
-relying on the C I/O library to convert this to an appropriate sequence.
+affects only the way scanned files are processed. It does not affect the
+interpretation of files specified by the \fB-f\fP, \fB--file-list\fP,
+\fB--exclude-from\fP, or \fB--include-from\fP options, nor does it affect the
+way in which \fBpcre2grep\fP writes informational messages to the standard
+error and output streams. For these it uses the string "\en" to indicate
+newlines, relying on the C I/O library to convert this to an appropriate
+sequence.
 .
 .
 .SH "OPTIONS COMPATIBILITY"
@@ -862,6 +875,6 @@
 .rs
 .sp
 .nf
-Last updated: 13 November 2017
-Copyright (c) 1997-2017 University of Cambridge.
+Last updated: 24 February 2018
+Copyright (c) 1997-2018 University of Cambridge.
 .fi


Modified: code/trunk/src/pcre2grep.c
===================================================================
--- code/trunk/src/pcre2grep.c    2018-02-20 15:37:49 UTC (rev 919)
+++ code/trunk/src/pcre2grep.c    2018-02-24 17:09:19 UTC (rev 920)
@@ -13,7 +13,7 @@
 The header can be found in the special z/OS distribution, which is available
 from www.zaconsultants.net or from www.cbttape.org.


-           Copyright (c) 1997-2017 University of Cambridge
+           Copyright (c) 1997-2018 University of Cambridge


 -----------------------------------------------------------------------------
 Redistribution and use in source and binary forms, with or without
@@ -303,6 +303,7 @@
 typedef struct patstr {
   struct patstr *next;
   char *string;
+  PCRE2_SIZE length;
   pcre2_code *compiled;
 } patstr;

@@ -557,6 +558,7 @@

 Arguments:
   s          pattern string to add
+  patlen     length of pattern 
   after      if not NULL points to item to insert after


 Returns:     new pattern block or NULL on error
@@ -563,7 +565,7 @@
 */


 static patstr *
-add_pattern(char *s, patstr *after)
+add_pattern(char *s, PCRE2_SIZE patlen, patstr *after)
 {
 patstr *p = (patstr *)malloc(sizeof(patstr));
 if (p == NULL)
@@ -571,7 +573,7 @@
   fprintf(stderr, "pcre2grep: malloc failed\n");
   pcre2grep_exit(2);
   }
-if (strlen(s) > MAXPATLEN)
+if (patlen > MAXPATLEN)
   {
   fprintf(stderr, "pcre2grep: pattern is too long (limit is %d bytes)\n",
     MAXPATLEN);
@@ -580,6 +582,7 @@
   }
 p->next = NULL;
 p->string = s;
+p->length = patlen;
 p->compiled = NULL;


 if (after != NULL)
@@ -1276,12 +1279,14 @@
 *            Read one line of input              *
 *************************************************/


-/* Normally, input is read using fread() (or gzread, or BZ2_read) into a large
-buffer, so many lines may be read at once. However, doing this for tty input
-means that no output appears until a lot of input has been typed. Instead, tty
-input is handled line by line. We cannot use fgets() for this, because it does
-not stop at a binary zero, and therefore there is no way of telling how many
-characters it has read, because there may be binary zeros embedded in the data.
+/* Normally, input that is to be scanned is read using fread() (or gzread, or
+BZ2_read) into a large buffer, so many lines may be read at once. However,
+doing this for tty input means that no output appears until a lot of input has
+been typed. Instead, tty input is handled line by line. We cannot use fgets()
+for this, because it does not stop at a binary zero, and therefore there is no
+way of telling how many characters it has read, because there may be binary
+zeros embedded in the data. This function is also used for reading patterns
+from files (the -f option).

 Arguments:
   buffer     the buffer to read into
@@ -1291,7 +1296,7 @@
 Returns:     the number of characters read, zero at end of file
 */


-static unsigned int
+static PCRE2_SIZE
 read_one_line(char *buffer, int length, FILE *f)
 {
 int c;
@@ -1651,11 +1656,11 @@
 */

 static BOOL
-match_patterns(char *matchptr, size_t length, unsigned int options,
-  size_t startoffset, int *mrc)
+match_patterns(char *matchptr, PCRE2_SIZE length, unsigned int options,
+  PCRE2_SIZE startoffset, int *mrc)
 {
 int i;
-size_t slen = length;
+PCRE2_SIZE slen = length;
 patstr *p = patterns;
 const char *msg = "this text:\n\n";

@@ -2317,7 +2322,7 @@
 char *lastmatchrestart = NULL;
 char *ptr = main_buffer;
 char *endptr;
-size_t bufflength;
+PCRE2_SIZE bufflength;
 BOOL binary = FALSE;
 BOOL endhyphenpending = FALSE;
 BOOL input_line_buffered = line_buffered;
@@ -2339,7 +2344,7 @@
   input_line_buffered);

 #ifdef SUPPORT_LIBBZ2
-if (frtype == FR_LIBBZ2 && (int)bufflength < 0) return 2; /* Gotcha: bufflength is size_t; */
+if (frtype == FR_LIBBZ2 && (int)bufflength < 0) return 2; /* Gotcha: bufflength is PCRE2_SIZE; */
 #endif

 endptr = main_buffer + bufflength;
@@ -2368,8 +2373,8 @@
   unsigned int options = 0;
   BOOL match;
   char *t = ptr;
-  size_t length, linelength;
-  size_t startoffset = 0;
+  PCRE2_SIZE length, linelength;
+  PCRE2_SIZE startoffset = 0;

   /* At this point, ptr is at the start of a line. We need to find the length
   of the subject string to pass to pcre2_match(). In multiline mode, it is the
@@ -2381,7 +2386,7 @@

   t = end_of_line(t, endptr, &endlinelength);
   linelength = t - ptr - endlinelength;
-  length = multiline? (size_t)(endptr - ptr) : linelength;
+  length = multiline? (PCRE2_SIZE)(endptr - ptr) : linelength;

   /* Check to see if the line we are looking at extends right to the very end
   of the buffer without a line terminator. This means the line is too long to
@@ -2560,7 +2565,7 @@
       {
       if (!invert)
         {
-        size_t oldstartoffset;
+        PCRE2_SIZE oldstartoffset;


         if (printname != NULL) fprintf(stdout, "%s:", printname);
         if (number) fprintf(stdout, "%lu:", linenumber);
@@ -2647,7 +2652,7 @@
           startoffset -= (int)(linelength + endlinelength);
           t = end_of_line(ptr, endptr, &endlinelength);
           linelength = t - ptr - endlinelength;
-          length = (size_t)(endptr - ptr);
+          length = (PCRE2_SIZE)(endptr - ptr);
           }


         goto ONLY_MATCHING_RESTART;
@@ -2812,7 +2817,7 @@
             endprevious -= (int)(linelength + endlinelength);
             t = end_of_line(ptr, endptr, &endlinelength);
             linelength = t - ptr - endlinelength;
-            length = (size_t)(endptr - ptr);
+            length = (PCRE2_SIZE)(endptr - ptr);
             }


           /* If startoffset is at the exact end of the line it means this
@@ -2895,7 +2900,7 @@
   /* If input is line buffered, and the buffer is not yet full, read another
   line and add it into the buffer. */


-  if (input_line_buffered && bufflength < (size_t)bufsize)
+  if (input_line_buffered && bufflength < (PCRE2_SIZE)bufsize)
     {
     int add = read_one_line(ptr, bufsize - (int)(ptr - main_buffer), in);
     bufflength += add;
@@ -2907,7 +2912,7 @@
   1/3 and refill it. Before we do this, if some unprinted "after" lines are
   about to be lost, print them. */


-  if (bufflength >= (size_t)bufsize && ptr > main_buffer + 2*bufthird)
+  if (bufflength >= (PCRE2_SIZE)bufsize && ptr > main_buffer + 2*bufthird)
     {
     if (after_context > 0 &&
         lastmatchnumber > 0 &&
@@ -3395,9 +3400,8 @@
 PCRE2_UCHAR errmessbuffer[ERRBUFSIZ];


 if (p->compiled != NULL) return TRUE;
-
 ps = p->string;
-patlen = strlen(ps);
+patlen = p->length;

 if ((options & PCRE2_LITERAL) != 0)
   {
@@ -3407,8 +3411,8 @@

   if (ellength != 0)
     {
-    if (add_pattern(pe, p) == NULL) return FALSE;
-    patlen = (int)(pe - ps - ellength);
+    patlen = pe - ps - ellength;
+    if (add_pattern(pe, p->length-patlen-ellength, p) == NULL) return FALSE;
     }
   }


@@ -3470,6 +3474,7 @@
 read_pattern_file(char *name, patstr **patptr, patstr **patlastptr)
 {
 int linenumber = 0;
+PCRE2_SIZE patlen;
 FILE *f;
 const char *filename;
 char buffer[MAXPATLEN+20];
@@ -3490,13 +3495,11 @@
   filename = name;
   }

-while (fgets(buffer, sizeof(buffer), f) != NULL)
+while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
   {
-  char *s = buffer + (int)strlen(buffer);
-  while (s > buffer && isspace((unsigned char)(s[-1]))) s--;
-  *s = 0;
+  while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
   linenumber++;
-  if (buffer[0] == 0) continue; /* Skip blank lines */
+  if (patlen == 0) continue; /* Skip blank lines */

   /* Note: this call to add_pattern() puts a pointer to the local variable
   "buffer" into the pattern chain. However, that pointer is used only when
@@ -3503,7 +3506,7 @@
   compiling the pattern, which happens immediately below, so we flatten it
   afterwards, as a precaution against any later code trying to use it. */

-  *patlastptr = add_pattern(buffer, *patlastptr);
+  *patlastptr = add_pattern(buffer, patlen, *patlastptr);
   if (*patlastptr == NULL)
     {
     if (f != stdin) fclose(f);
@@ -3513,8 +3516,9 @@


   /* This loop is needed because compiling a "pattern" when -F is set may add
   on additional literal patterns if the original contains a newline. In the
-  common case, it never will, because fgets() stops at a newline. However,
-  the -N option can be used to give pcre2grep a different newline setting. */
+  common case, it never will, because read_one_line() stops at a newline.
+  However, the -N option can be used to give pcre2grep a different newline
+  setting. */

   for(;;)
     {
@@ -3833,7 +3837,8 @@
   else if (op->type == OP_PATLIST)
     {
     patdatastr *pd = (patdatastr *)op->dataptr;
-    *(pd->lastptr) = add_pattern(option_data, *(pd->lastptr));
+    *(pd->lastptr) = add_pattern(option_data, (PCRE2_SIZE)strlen(option_data), 
+      *(pd->lastptr));
     if (*(pd->lastptr) == NULL) goto EXIT2;
     if (*(pd->anchor) == NULL) *(pd->anchor) = *(pd->lastptr);
     }
@@ -4095,7 +4100,9 @@
 if (patterns == NULL && pattern_files == NULL)
   {
   if (i >= argc) return usage(2);
-  patterns = patterns_last = add_pattern(argv[i++], NULL);
+  patterns = patterns_last = add_pattern(argv[i], (PCRE2_SIZE)strlen(argv[i]),
+    NULL);
+  i++;   
   if (patterns == NULL) goto EXIT2;
   }



Modified: code/trunk/testdata/grepoutput
===================================================================
(Binary files differ)
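
Illustration (not part of the commit): the pcre2grep.c changes above make a pattern's length travel with the pattern instead of being recomputed with strlen(), which stops at the first binary zero. The following is a minimal C sketch of that idea under stated assumptions. The names read_line_counted() and trim_trailing_space() are hypothetical and do not appear in the PCRE2 source; the sketch assumes plain stdio input and is not the committed implementation.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the pcre2grep source): read one line from f into
   buffer, storing every byte including binary zeros. The newline, if one is
   seen, is stored as well. Returns the number of bytes stored; zero means end
   of file (or a zero-sized buffer). */
static size_t read_line_counted(char *buffer, size_t size, FILE *f)
{
  size_t n = 0;
  int c;
  while (n < size && (c = fgetc(f)) != EOF) {
    buffer[n++] = (char)c;
    if (c == '\n') break;
  }
  return n;
}

/* Hypothetical helper: strip trailing white space by shrinking the length,
   rather than by writing a terminating zero and calling strlen() later. */
static size_t trim_trailing_space(const char *buffer, size_t len)
{
  while (len > 0 && isspace((unsigned char)buffer[len - 1])) len--;
  return len;
}
```

For a line holding ABC, a zero byte, XYZ, a space, and a newline, read_line_counted() returns 9 and trim_trailing_space() reduces that to 7, whereas strlen() on the same buffer would report only 3. Keeping the explicit length is what allows the zero byte to survive as an ordinary data character.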