Na navigaci | Klávesové zkratky

Treacherous Regular Expressions in PHP

In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one may not always be available, and the third is deprecated, so you should exclusively use the more adept and faster PCRE library. Unfortunately, its implementation suffers from quite unpleasant flaws across all PHP versions.

The operation of the preg_* functions can be divided into two steps:

  1. compilation of the regular expression
  2. execution (searching, replacing, filtering, …)

It is advantageous that PHP maintains a cached version of compiled regular expressions, meaning they are only compiled once. Therefore, it is appropriate to use static regular expressions, i.e., not to generate them parametrically.

Now for the unpleasant issues. If an error is discovered during compilation, PHP will issue an E_WARNING error, but the return value of the function is inconsistent:

  • preg_filter, preg_replace_callback, preg_replace return null
  • preg_grep, preg_match_all, preg_match, preg_split return false

It is good to know that functions returning an array $matches by reference (i.e., preg_match_all and preg_match) do not nullify the argument upon a compilation error, thus validating the test of the return value.

Since version 5.2.0, PHP has the function preg_last_error returning the code of the last error. However, beware, this only applies to errors that occur during execution! If an error occurs during compilation, the value of preg_last_error is not reset and returns the previous value. If the return value of a preg_* function is not null or false (see above), definitely do not rely on what preg_last_error returns.

What kind of errors can occur during execution? The most common case is exceeding pcre.backtrack_limit or invalid UTF-8 input when using the u modifier. (Note: invalid UTF-8 in the regular expression itself is detected during compilation.) However, the way PHP handles such an error is utterly inadequate:

  • it generates no message (silent error)
  • the return value of the function may indicate that everything is fine
  • the error can only be detected by calling preg_last_error later

Let's talk about the return value, which is probably the biggest betrayal. The process is executed until an error occurs, then it returns a partially processed result. And this is done completely silently. However, even this is not always the case, for example, the trio of functions preg_filter, preg_replace_callback, preg_replace can return null even during execution errors.

Whether an error occurred during execution can only be determined by calling preg_last_error. But as you know, this function returns a nonsensical result if, on the contrary, a compilation error occurred, so we must distinguish both situations by considering the return value of the function, whether it is null or false. And since functions that return null during a compilation error can also return null during an execution error, it can be stated only that PHP is undoubtedly a messed-up language.

What would safe use of PCRE functions look like? For example, like this:

function safeReplaceCallback($pattern, $callback, $subject)
{
	// we must verify the callback ourselves
	if (!is_callable($callback)) {
		throw new Exception('Invalid callback.');
	}

	// test the expression on an empty string
	if (preg_match($pattern, '') === false) { // compilation error?
		$error = error_get_last();
		throw new Exception($error['message']);
	}

	// call PCRE
	$result = preg_replace_callback($pattern, $callback, $subject);

	// execution error?
	if ($result === null && preg_last_error()) {
		throw new Exception('Error processing regular expression.', preg_last_error());
	}

	return $result;
}

The provided code transforms errors into exceptions but does not attempt to suppress warning outputs.

Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.


phpFashion © 2004, 2024 David Grudl | o blogu

Ukázky zdrojových kódů smíte používat s uvedením autora a URL tohoto webu bez dalších omezení.