Treacherous Regular Expressions in PHP
In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one may not always be available, and the third is deprecated (its ereg functions were removed entirely in PHP 7.0), so you should exclusively use the more capable and faster PCRE library. Unfortunately, its implementation suffers from quite unpleasant flaws across all PHP versions.
The operation of the preg_* functions can be divided into two steps:
- compilation of the regular expression
- execution (searching, replacing, filtering, …)
Conveniently, PHP keeps a cache of compiled regular expressions, so each distinct pattern is compiled only once. It therefore pays to use static regular expressions, i.e., not to generate them from parameters.
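As an illustration of keeping patterns static, here is a minimal sketch. The helpers splitCode and hasPrefix are hypothetical names invented for this example. The varying value is compared in plain PHP instead of being interpolated into the pattern, so the same pattern string is used on every call.

```php
<?php
// Hypothetical helper: a single static pattern, so PHP can reuse the cached compiled form.
function splitCode($code)
{
    if (preg_match('/^([a-z]+)(\d+)$/', $code, $m) !== 1) {
        return null; // no match (or an error)
    }
    return ['prefix' => $m[1], 'number' => (int) $m[2]];
}

// The varying part (the prefix) is checked in PHP rather than being built
// into a new pattern such as "/^$prefix\d+$/" for every distinct prefix.
function hasPrefix($code, $prefix)
{
    $parts = splitCode($code);
    return $parts !== null && $parts['prefix'] === $prefix;
}
```

When a value really must become part of the pattern, it should at least be passed through preg_quote first.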
Now for the unpleasant issues. If an error is discovered during compilation, PHP will issue an E_WARNING error, but the return value of the function is inconsistent:
- preg_filter, preg_replace_callback, preg_replace return null
- preg_grep, preg_match_all, preg_match, preg_split return false
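The difference is easy to observe with a pattern that cannot compile. The following sketch uses '/(/' as an arbitrary broken pattern and the @ operator only to keep the warning out of the output.

```php
<?php
$broken = '/(/'; // unbalanced parenthesis, so compilation fails

// @ hides the E_WARNING; only the return values are of interest here.
$replace = @preg_replace($broken, '', 'abc');
$filter  = @preg_filter($broken, '', 'abc');
$match   = @preg_match($broken, 'abc');
$split   = @preg_split($broken, 'abc');
$grep    = @preg_grep($broken, ['abc']);

var_dump($replace, $filter); // null for the replace family
var_dump($match, $split, $grep); // false for the others
```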
It is good to know that functions returning an array $matches by reference (i.e., preg_match_all and preg_match) do not reset the argument upon a compilation error, so it may still hold stale data. This is one more reason to always test the return value.
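A short sketch of this pitfall, again using '/(/' as an arbitrary pattern that fails to compile:

```php
<?php
// Pretend $matches still holds data from some earlier, successful call.
$matches = ['left over from an earlier call'];

// Compilation fails, the function returns false and never touches $matches.
$result = @preg_match('/(/', 'abc', $matches);

if ($result === false) {
    // Only the return value reveals the failure; $matches is unchanged.
    echo "compilation failed, \$matches is stale\n";
}
```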
Since version 5.2.0, PHP has the function preg_last_error returning the code of the last error. However, beware, this only applies to errors that occur during execution! If an error occurs during compilation, the value of preg_last_error is not reset and returns the previous value. So if the return value of a preg_* function is null or false (see above), a compilation error may have occurred, and you should definitely not rely on what preg_last_error returns.
What kind of errors can occur during execution? The most common case is exceeding pcre.backtrack_limit or invalid UTF-8 input when using the u modifier. (Note: invalid UTF-8 in the regular expression itself is detected during compilation.) However, the way PHP handles such an error is utterly inadequate:
- it generates no message (silent error)
- the return value of the function may indicate that everything is fine
- the error can only be detected by calling preg_last_error later
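The silent failure can be reproduced with invalid UTF-8 input. This is a minimal sketch in which the byte "\xFF" stands in for any malformed subject. No warning is raised, and only preg_last_error reveals what happened.

```php
<?php
$invalid = "\xFF"; // a lone 0xFF byte is never valid UTF-8

// Execution error: nothing is printed, no warning is triggered.
$matched  = preg_match('/./u', $invalid);
$matchErr = preg_last_error();

$replaced   = preg_replace('/x/u', 'y', $invalid);
$replaceErr = preg_last_error();

var_dump($matched, $matchErr === PREG_BAD_UTF8_ERROR);
var_dump($replaced, $replaceErr === PREG_BAD_UTF8_ERROR);
```

Here both calls happen to signal the failure through their return value. With pcre.backtrack_limit the outcome depends on the function and on the PCRE configuration, which is why the check with preg_last_error is still needed.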
Let's talk about the return value, which is probably the biggest betrayal. The process is executed until an error occurs, then it returns a partially processed result. And this is done completely silently. However, even this is not always the case: for example, the trio of functions preg_filter, preg_replace_callback, preg_replace can return null even during execution errors.
Whether an error occurred during execution can only be determined by calling preg_last_error. But as you know, this function returns a nonsensical result if, on the contrary, a compilation error occurred, so we must distinguish both situations by considering the return value of the function, whether it is null or false. And since functions that return null during a compilation error can also return null during an execution error, it can only be stated that PHP is undoubtedly a messed-up language.
What would safe use of PCRE functions look like? For example, like this:
function safeReplaceCallback($pattern, $callback, $subject)
{
    // we must verify the callback ourselves
    if (!is_callable($callback)) {
        throw new Exception('Invalid callback.');
    }
    // test the expression on an empty string
    if (preg_match($pattern, '') === false) { // compilation error?
        // error_get_last() may return null, e.g. when a custom error handler took the warning
        $error = error_get_last();
        throw new Exception($error ? $error['message'] : 'Invalid regular expression.');
    }
    // call PCRE
    $result = preg_replace_callback($pattern, $callback, $subject);
    // execution error?
    if ($result === null && preg_last_error()) {
        throw new Exception('Error processing regular expression.', preg_last_error());
    }
    return $result;
}
The provided code transforms errors into exceptions but does not attempt to suppress warning outputs.
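If the warning itself should also be intercepted, one possible approach is to install a temporary error handler around the call. The following is only a sketch under stated assumptions: guardPcre is a hypothetical helper invented for this example, it assumes PHP 5.5 or later (for finally), and it turns any warning raised inside the closure into an exception, not just PCRE ones.

```php
<?php
// Hypothetical helper: runs $operation (a closure performing one preg_* call)
// and converts both compilation and execution errors into exceptions.
function guardPcre($operation)
{
    // A compilation error raises E_WARNING; throw it instead of printing it.
    set_error_handler(function ($severity, $message) {
        throw new RuntimeException($message);
    });
    try {
        $result = $operation();
    } finally {
        restore_error_handler();
    }
    // No warning was raised, so the pattern compiled and preg_last_error()
    // now describes the execution of this call.
    if (preg_last_error() !== PREG_NO_ERROR) {
        throw new RuntimeException('Error processing regular expression.', preg_last_error());
    }
    return $result;
}

// Usage: wrap the actual call in a closure.
echo guardPcre(function () {
    return preg_replace('/b/', 'x', 'abc');
}), "\n";
```

Passing a closure rather than a function name keeps by-reference arguments such as $matches usable, because the closure can capture them with use (&$matches).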
Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.