My goal is to remove all instances of the word "the" (and any surrounding
whitespace) from a string. The code I started with is:
pcrecpp::RE_Options options;
options.set_utf8(true).set_caseless(true);
pcrecpp::RE regex("(^|\\s+)The($|\\s+)",options);
regex.GlobalReplace("",&some_string);
This works for most strings, but not for "The the". It took me a bit to
figure out why, but I think I at least understand that much. The first
match consumes "The " (the start anchor plus the trailing space), which
leaves "the". That remainder no longer matches the regex, because the
scan resumes after the consumed space: "the" is not at the start of the
original string, and the whitespace that preceded it is already gone.
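The behaviour described above can be reproduced outside pcrecpp. The sketch below is an assumption-laden stand-in: it swaps in std::regex (ECMAScript syntax, case-insensitive) for pcrecpp::RE, and the function name strip_the_single_pass is made up for illustration. It runs one global replace with the same pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original code): a single global replace
// using the pattern from the post, with std::regex standing in for pcrecpp.
std::string strip_the_single_pass(const std::string& input) {
    // Same shape as the original pattern: "the" with whitespace or a string
    // boundary on each side, matched case-insensitively.
    static const std::regex pattern(R"((^|\s+)the($|\s+))", std::regex::icase);
    return std::regex_replace(input, pattern, "");
}
```

With this helper, "The foo the" comes back as "foo", while "The the" comes back as "the": the space that the second word needs on its left was already consumed by the first match.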
I could drop the "($|\\s+)" from the regex, but that makes other things fail
(e.g. "The foo the" becomes " foo" instead of "foo"). Other modifications
I've come up with cause other failures too.
My tests pass if I call GlobalReplace in a loop, like this:
int num_replacements = 0;
do {
    num_replacements = regex.GlobalReplace("",&std_normalized);
} while (num_replacements > 0);
but I'm curious whether this is a normal/good/optimal thing to do, or whether
there's a smarter regex that does everything in one call (with GlobalReplace
or, I suppose, some other function).
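One single-call candidate is to let each match swallow a whole run of consecutive "the" words, so no later match depends on whitespace that an earlier one consumed. The sketch below shows the idea with std::regex rather than pcrecpp, so treat it as an assumption: the function name strip_the_runs is hypothetical, and although the pattern uses only constructs that PCRE also supports, it has not been tried with pcrecpp::RE here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original code): removes every run of
// "the" words, plus the whitespace around the run, in one replace call.
std::string strip_the_runs(const std::string& input) {
    // (^|\s+)     start of string or leading whitespace
    // the         the first "the" of the run
    // (\s+the)*   any further "the" words separated only by whitespace
    // ($|\s+)     end of string or trailing whitespace
    static const std::regex pattern(R"((^|\s+)the(\s+the)*($|\s+))",
                                    std::regex::icase);
    return std::regex_replace(input, pattern, "");
}
```

With this helper, "The the" becomes "" and "The foo the" becomes "foo". Like the original pattern, it strips the whitespace on both sides of a match, so "foo the bar" becomes "foobar".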
Thanks much for your help.
-DB