Rubrika PHP » Strana 3

Rubrika PHP

Treacherous Regular Expressions in PHP

In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one may not always be available, and the third is deprecated, so you should exclusively use the more adept and faster PCRE library. Unfortunately, its implementation suffers from quite unpleasant flaws across all PHP versions.

The operation of the preg_* functions can be divided into two steps:

compilation of the regular expression
execution (searching, replacing, filtering, …)

It is advantageous that PHP maintains a cached version of compiled regular expressions, meaning they are only compiled once. Therefore, it is appropriate to use static regular expressions, i.e., not to generate them parametrically.

Now for the unpleasant issues. If an error is discovered during compilation, PHP will issue an E_WARNING error, but the return value of the function is inconsistent:

preg_filter, preg_replace_callback, preg_replace return null
preg_grep, preg_match_all, preg_match, preg_split return false

It is good to know that functions returning an array $matches by reference (i.e., preg_match_all and preg_match) do not nullify the argument upon a compilation error, thus validating the test of the return value.

Since version 5.2.0, PHP has the function preg_last_error returning the code of the last error. However, beware, this only applies to errors that occur during execution! If an error occurs during compilation, the value of preg_last_error is not reset and returns the previous value. If the return value of a preg_* function is not null or false (see above), definitely do not rely on what preg_last_error returns.

What kind of errors can occur during execution? The most common case is exceeding pcre.backtrack_limit or invalid UTF-8 input when using the u modifier. (Note: invalid UTF-8 in the regular expression itself is detected during compilation.) However, the way PHP handles such an error is utterly inadequate:

it generates no message (silent error)
the return value of the function may indicate that everything is fine
the error can only be detected by calling preg_last_error later

Let's talk about the return value, which is probably the biggest betrayal. The process is executed until an error occurs, then it returns a partially processed result. And this is done completely silently. However, even this is not always the case, for example, the trio of functions preg_filter, preg_replace_callback, preg_replace can return null even during execution errors.

Whether an error occurred during execution can only be determined by calling preg_last_error. But as you know, this function returns a nonsensical result if, on the contrary, a compilation error occurred, so we must distinguish both situations by considering the return value of the function, whether it is null or false. And since functions that return null during a compilation error can also return null during an execution error, it can be stated only that PHP is undoubtedly a messed-up language.

What would safe use of PCRE functions look like? For example, like this:

function safeReplaceCallback($pattern, $callback, $subject)
{
	// we must verify the callback ourselves
	if (!is_callable($callback)) {
		throw new Exception('Invalid callback.');
	}

	// test the expression on an empty string
	if (preg_match($pattern, '') === false) { // compilation error?
		$error = error_get_last();
		throw new Exception($error['message']);
	}

	// call PCRE
	$result = preg_replace_callback($pattern, $callback, $subject);

	// execution error?
	if ($result === null && preg_last_error()) {
		throw new Exception('Error processing regular expression.', preg_last_error());
	}

	return $result;
}

The provided code transforms errors into exceptions but does not attempt to suppress warning outputs.

Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.

Escaping – The Definitive Guide

One of the evergreen topics in programming is the confusion and misunderstandings around escaping. Ignorance causes the simplest methods of compromising websites, such as Cross Site Scripting (XSS) or SQL injection, to remain unfortunately widespread.

Escaping is the substitution of characters that have a special meaning in a given context with other corresponding sequences.

Example: To write quotes within a string enclosed by quotes, you need to replace them because quotes have a special meaning in the context of a string, and writing them plainly would be interpreted as ending the string. The specific substitution rules are determined by the context.

Prerequisites

Each escaping function assumes that the input is always a “raw string” (unmodified) in a certain encoding (character set).

Storing strings already escaped for HTML output in the database and similar is entirely counterproductive.

What contexts do we encounter?

As mentioned, escaping converts characters that have a special meaning in a certain context. Different escaping functions are used for each context. This table is only indicative, and it is necessary to read the notes below.

Context	Escaping Function	Reverse Function
HTML	htmlspecialchars	html_entity_decode
XML	htmlspecialchars	—
regular expression	preg_quote	—
PHP strings	var_export	—
MySQL database	mysql_real_escape_string	—
MySQL improved	mysqli_real_escape_string	—
SQLite database	sqlite_escape_string	—
PostgreSQL database	pg_escape_string	—
PostgreSQL, bytea type	pg_escape_bytea	pg_unescape_bytea
JavaScript, JSON	json_encode	json_decode
CSS	addcslashes	—
URL	rawurlencode	urldecode

Explanation of the following notes:

many contexts have their subcontexts where escaping differs. Unless otherwise stated, the specified escaping function is applicable universally without further differentiation of subcontexts.
the term usual character set refers to a character set with 1-byte or UTF-8 encoding.

HTML

In HTML contexts, the characters < & " ' collectively have a special meaning, and the corresponding sequences are < & " '. However, the exception is an HTML comment, where only the pair -- has special meaning.

For escaping, use:

$s = htmlspecialchars($s, ENT_QUOTES);

It works with any usual character set. However, it does not consider the subcontext of HTML comments (i.e., it cannot replace the pair -- with something else).

Reverse function:

$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');

XML / XHTML

XML 1.0 differs from HTML in that it prohibits the use of C0 control characters (including writing in the form of an entity) except for the tabulator, line feed, and space. XML 1.1 allows these banned characters, except NUL, in the form of entities, and further mandates C1 control characters, except NEL, also to be written as entities. Additionally, in XML, the sequence ]]> has a special meaning, so one of these characters must also be escaped.

For XML 1.0 and any usual character set, use:

$s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
$s = htmlspecialchars($s, ENT_QUOTES);

Regular Expression

In Perl regular expressions, characters . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - and the so-called delimiter, which is a character delimiting the regular expression (e.g., for the expression '#[a-z]+#i' it is #), collectively have special meaning. They are escaped with the character \.

$s = preg_quote($s, $delimiter);

In the string replacing the searched expression (e.g., the 2nd parameter of the preg_replace function), the backslash and dollar sign have special meaning:

$s = addcslashes($replacement, '$\\');

The encoding must be either 1-byte or UTF-8, depending on the modifier in the regular expression.

PHP Strings

PHP distinguishes these types of strings:

in single quotes, where special meaning can have characters \ '
in double quotes, where special meaning can have characters \ " $
NOWDOC, where no character has special meaning
HEREDOC, where special meaning can have characters \ $

Escape is done with the character \. This is usually done by the programmer when writing code, for PHP code generators, you can use the var_export function.

Note: because the mentioned regular expressions are usually written within PHP strings, both types of escaping need to be combined. E.g., the character \ for a regular expression is written as \\ and in a quoted string it needs to be written as \\\\.

SQL and Databases

Each database has its own escaping function, see the table above. Almost always, however, only a function for escaping strings is available, and it cannot be used for anything else, especially there are no functions for escaping wildcard characters used in LIKE constructions (in MySQL these are % _) or identifiers, such as table or column names. Databases do not require removing escaping on output! (Except, for example, for bytea type.)

For character sets with unusual multi-byte encoding, it is necessary to set the function mysql_set_charset or mysqli_set_charset in MySQL.

I recommend using a database layer (e.g., dibi, Nette Database, PDO) or parameterized queries, which take care of escaping for you.

JavaScript, JSON

As a programming language, JavaScript has a number of very different subcontexts. For escaping strings, you can use the side effect of the function

$s = json_encode((string) $s);

which also encloses the string in quotes. Strictly requires UTF-8.

JavaScript written inside HTML attributes (e.g., onclick) must still be escaped according to HTML rules, but this does not apply to JavaScript inside <script> tags, where only the potential occurrence of the ending tag </script> inside the string needs to be treated. However, json_encode ensures this, as JSON escapes the slash /. However, it does not handle the end of an HTML comment --> (which does not matter in HTML) or an XML CDATA block ]]>, which the script is wrapped in. For XML/XHTML, the solution is

$s = json_encode((string) $s);
$s = str_replace(']]>', ']]\x3E', $s);

Since JSON uses a subset of JavaScript syntax, the reverse function json_decode is fully usable only for JSON, limitedly for JavaScript.

CSS

In CSS contexts, the range of valid characters is precisely defined, for escaping identifiers, for example, you can use this function:

$s = addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");

For CSS within HTML code, the same applies as stated about JavaScript and its escaping within HTML attributes and tags (here it is about the style attributes and <style> tags).

URL

In the context of a URL, everything except the letters of the English alphabet, digits, and characters - _ . is escaped by replacing them with % + the hexadecimally expressed byte.

$s = rawurlencode($s);

According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters in UTF-8 encoding is preferred.

The reverse function in this case is urldecode, which also recognizes the + character as meaning space.

If you find the whole topic too complicated, don't despair. Soon you will realize that it is actually about simple transformations, and the whole trick is in realizing which context I am in and which function I need to choose for it. Or even better, try using an intelligent templating system that can recognize contexts itself and apply proper escaping: Latte.

Is Singleton Evil?

Singleton is one of the most popular design patterns. Its purpose is to ensure the existence of only one instance of a certain class while also providing global access to it. Here is a brief example for completeness:

class Database
{
    private static $instance;

    private function __construct()
    {}

    public static function getInstance()
    {
        if (self::$instance === null) {
            self::$instance = new self;
        }
        return self::$instance;
    }

    ...
}

// singleton is globally accessible
$result = Database::getInstance()->query('...');

Typical features include:

A private constructor, preventing the creation of an instance outside the class
A static property $instance where the unique instance is stored
A static method getInstance(), which provides access to the instance and creates it on the first call (lazy loading)

Simple and easy to understand code that solves two problems of object-oriented programming. Yet, in dibi or Nette Framework, you won’t find any singletons. Why?

Apparent Uniqueness

Let's look closely at the code – does it really ensure only one instance exists? I’m afraid not:

$dolly = clone Database::getInstance();

// or
$dolly = unserialize(serialize(Database::getInstance()));

// or
class Dolly extends Database {}
$dolly = Dolly::getInstance();

There is a defense against this:

    final public static function getInstance()
    {
        // final getInstance
    }

    final public function __clone()
    {
        throw new Exception('Clone is not allowed');
    }

    final public function __wakeup()
    {
        throw new Exception('Unserialization is not allowed');
    }

The simplicity of implementing a singleton is gone. Worse – with every additional singleton, we repeat the same piece of code. Moreover, the class suddenly fulfills two completely different tasks: besides its original purpose, it takes care of being quite single. Both are warning signals that something is not right and the code deserves refactoring. Bear with me, I’ll get back to this soon.

Global = Ugly?

Singletons provide a global access point to objects. There is no need to constantly pass the reference around. However, critics argue that such a technique is no different from using global variables, and those are pure evil.

(If a method works with an object that was explicitly passed to it, either as a parameter or as an object variable, I call it “wired connection”. If it works with an object obtained through a global point (e.g., through a singleton), I call it “wireless connection”. Quite a nice analogy, right?)

Critics are wrong in one respect – there is nothing inherently bad about “global”. It’s important to realize that the name of each class and method is nothing more than a global identifier. There is no fundamental difference between the trouble-free construction $obj = new MyClass and the criticized $obj = MyClass::getInstance(). This is even less significant in dynamic languages like PHP, where you can “write in PHP 5.3” $obj = $class::getInstance().

However, what can cause headaches are:

Hidden dependencies on global variables
Unexpected use of “wireless connections”, which are not apparent from the API of classes (see Singletons are Pathological Liars)

The first issue can be eliminated if singletons do not act like global variables, but rather as global functions or services. Consider google.com – a nice example of a singleton as a global service. There is one instance (a physical server farm somewhere in the USA) globally accessible through the identifier www.google.com. (Even clone www.google.com does not work, as Microsoft discovered, they have it figured out.) Importantly, this service does not have hidden dependencies typical for global variables – it returns responses without unexpected connections to what someone else searched for moments ago. On the other hand, the seemingly inconspicuous function strtok suffers from a serious dependency on a global variable, and its use can lead to very hard-to-detect errors. In other words – the problem is not “globality”, but design.

The second point is purely a matter of code design. It is not wrong to use a “wireless connection” and access a global service, the mistake is doing it unexpectedly. A programmer should know exactly which object uses which class. A relatively clean solution is to have a variable in the object referring to the service object, which initializes to the global service unless the programmer decides otherwise (the convention over configuration technique).

Uniqueness May Be Harmful

Singletons come with a problem that we encounter no later than when testing code. And that is the need to substitute a different, test object. Let's return to Google as an exemplary singleton. We want to test an application that uses it, but after a few hundred tests, Google starts protesting We're sorry… and where are we? We are somewhere. The solution is to substitute a fictitious (mock) service under the identifier www.google.com. We need to modify the hosts file – but (back from the analogy to the world of OOP) how to achieve this with singletons?

One option is to implement a static method setInstance($mockObj). But oops! What exactly do you want to pass to that method when no other instance, other than that one and only, exists?

Any attempt to answer this question inevitably leads to the breakdown of everything that makes a singleton a singleton.

If we remove the restrictions on the existence of only one instance, the singleton stops being single and we are only addressing the need for a global repository. Then the question arises, why repeat the same method getInstance() in the code and not move it to an extra class, into some global registry?

Or we maintain the restrictions, only replacing the class identifier with an interface (Database → IDatabase), which raises the problem of the impossibility to implement IDatabase::getInstance() and the solution again is a global registry.

A few paragraphs above, I promised to return to the issue of repetitive code in all singletons and possible refactoring. As you can see, the problem has resolved itself. The singleton has died.

Twitter for PHP

Twitter for PHP is a very small and easy-to-use library for sending messages to Twitter and receiving status updates with OAuth support.

Download Twitter for PHP 3.5

It requires PHP (version 5 or newer) with CURL extension and is licensed under the New BSD License. You can obtain the latest version from our GitHub repository or install it via Composer:

php composer.phar require dg/twitter-php

Twitter requires SSL/TLS as of January 14th, 2014. Update to the last version.

Getting started

Sign in to the http://twitter.com and register an application from the http://dev.twitter.com/apps page. Remember
to never reveal your consumer secrets. Click on My Access Token link from the sidebar and retrieve your own access
token. Now you have consumer key, consumer secret, access token and access token secret.

Create object using application and request/access keys:

$twitter = new Twitter($consumerKey, $consumerSecret,
	$accessToken, $accessTokenSecret);

Posting

The send() method posts your status. The message must be encoded in UTF-8:

$twitter->send('I am fine today.');

You can append picture:

$twitter->send('This is my photo', $imageFile);

Displaying

The load() method returns the 20 most recent status updates posted in the last 24 hours by you:

$statuses = $twitter->load(Twitter::ME);

or posted by you and your friends:

$statuses = $twitter->load(Twitter::ME_AND_FRIENDS);

or most recent mentions for you:

$statuses = $twitter->load(Twitter::REPLIES);

Extracting the information from the channel is easy:

<ul>
<?php foreach ($statuses as $status): ?>
	<li><a href="http://twitter.com/<?= $status->user->screen_name ?>">
		<?= htmlspecialchars($status->user->name) ?></a>:

		<?= Twitter::clickable($status) ?>

		<small>at <?= date("j.n.Y H:m", strtotime($status->created_at)) ?></small>
	</li>
<?php endforeach ?>
</ul>

The static method Twitter::clickable() makes links in status clickable. In addition to regular links, it links @username to the user’s Twitter profile page and links hashtags to a Twitter search on that hashtag.

Searching

The search() method provides searching in twitter statuses:

$results = $twitter->search('#nette');

The returned result is a again array of statuses.

Error handling

All methods throw a TwitterException on error:

try {
	$statuses = $twitter->load(Twitter::ME);
} catch (TwitterException $e) {
	echo "Error: ", $e->getMessage();
}

Additional features

The authenticate() method tests if user credentials are valid:

if (!$twitter->authenticate()) {
	die('Invalid name or password');
}

Other commands

You can use all commands defined by Twitter API 1.1. For example GET statuses/retweets_of_me returns the array of most recent tweets authored by the authenticating user:

$statuses = $twitter->request('statuses/retweets_of_me', 'GET', array('count' => 20));

DOMDocument::registerNodeClass is great

PHP 5.2.0 comes with a new DOM function named registerNodeClass(). What is it good for? The documentation says nothing as well as uncle Google. But this function is really great!

…pokračování

PHP: The Dark Magic of Optimization

I recently managed to speed up a PHP script to a hundredth of its original execution time by changing just a few characters in the source code. How is this possible? The drastic acceleration is due to the appropriate use of references and assignments. I'll let you in on how it works. Don't believe the sensational headline; it's not any kind of black magic. I repeat, you just need to understand how PHP works internally. But don’t worry, it's nothing too complicated.

In-depth Reference Counting

The PHP core stores variable names separately from their values in memory. An anonymous value is described by the structure zval. Besides raw data, it includes information about the type (boolean, string, etc.) and two additional items: refcount and is_ref. Yes, refcount is exactly the counter for the aforementioned reference counting.

$abc = 'La Trine';

What does this code actually do? It creates a new zval value in memory, whose data section holds the 8 characters La Trine and indicates the type as a string. At the same time, a new entry abc is added to the variable table, referring to this zval.

Additionally, in the zval structure, we initialize the refcount counter to one, because there is exactly one variable ($abc) pointing to it.

…pokračování

HTTP Redirection

For some, this is obvious, for me, it's mainly a cheat sheet. I just can't remember extremely long numbers, meaning those that have more than one digit.

In PHP, redirection is implemented with the following code:

$code = 301; // code in the range 300..307
$url = 'http://example.com';
header('Location: ' . $url, true, $code);
die('Please <a href="' . htmlSpecialChars($url) . '">click here</a> to continue.');

Note that after calling the header() command, it is necessary to explicitly terminate the script. It doesn't hurt to offer a text message and a link for agents that do not automatically redirect.

Types of Redirection

The meanings of individual codes are described in detail in the standard RFC 2616: HTTP/1.1 Redirection. Here they are:

300 Multiple Choices

There are several URLs to which redirection is possible (pages may differ, for example, in language). Offer users a list of these. The preferred destination can be indicated in the Location header; not every browser automatically redirects. Rarely used.

301 Moved Permanently

Use this when a resource that used to exist at the requested URL is now (permanently) located at a new address. Specify this in the Location header. However, if it has been discontinued, announce this with the code 410 Gone.

302 Found

A problematic code. It indicates that the resource has been temporarily moved elsewhere and the browser should access the new URL using the same method (GET, POST, HEAD, …) as used on the original. Additionally, with methods other than GET and HEAD, user confirmation should be required for the redirection. Most browsers, however, do not respect this and change the method to GET without requiring confirmation.

The code is often mistakenly used instead of 303.

303 See Other – for PRG

The Post/Redirect/Get technique prevents double submission of forms upon page reload or back button click. After submitting a form using the POST method, a redirection is made using the GET method to another page. This is exactly what the 303 code is for, converting POST to GET.

304 Not Modified

For caching purposes. It responds to the If-Modified-Since header that the resource has not changed since the previous visit. The response must not contain a body, only headers.

307 Temporary Redirect

As mentioned, code 302 has become problematic due to non-compliance with the standard by both web designers and browser creators. Code 307 is its reincarnation, which mostly works correctly. It can be used, for example, to perform a redirection using the POST method with transferred data.

Property setters and getters – final solution

PHP currently supports property overloading provided by magic methods __get() and __set(). Those methods are called when accessed property is not declared in the class. Actually, this overloading is a bit frustrating.

This is my solution how to simulate properties with accessor methods in a Delphi way.

So…

// define new "like-keyword"
// value is short and unusual string
define('property',   "\0\0");

Maybe _property is a better choice. It doesn't matter now.

…pokračování

PHP 5 → 4 converter

I have wrote tool for automatic convert PHP5 scripts to PHP4. It converts new PHP 5 object model features and language constructions into PHP 4 equivalents. This tool may be useful, if you need transfer PHP5 scripts to run under previous version. It is not designated to provides missing functionality (try PHP_Compat), just missing syntax.

Supported are these constructions:

abstract classes and methods (demo)
interface, implements (demo)
new constructor syntax __construct (demo)
final classes and method
visibility modifiers public, protected, private (demo)
static attributes and methods (demo)
class constants (demo)
object clonning (demo)
operator instanceof
self::
automatic passing objects as references (see How to emulate PHP5's object model in PHP4)
type hinting
support for Doc Comments
and for exceptions try, throw, catch (it is not full-value replacement, only partial emulation)

Try it here

How to emulate PHP5 object model in PHP4

I spent a lot of time thinking, how to emulate some PHP5's object model features in older PHP4. How get rid tons of ampersands in my source codes. How force objects to not copy itself every time.

So, there is solution.

…pokračování

novější články