Escaping – The Definitive Guide

One of the evergreen topics in programming is the confusion and misunderstandings around escaping. Ignorance causes the simplest methods of compromising websites, such as Cross Site Scripting (XSS) or SQL injection, to remain unfortunately widespread.

Escaping is the substitution of characters that have a special meaning in a given context with other corresponding sequences.

Example: To write quotes within a string enclosed by quotes, you need to replace them because quotes have a special meaning in the context of a string, and writing them plainly would be interpreted as ending the string. The specific substitution rules are determined by the context.

Prerequisites

Each escaping function assumes that the input is always a “raw string” (unmodified) in a certain encoding (character set).

Storing strings already escaped for HTML output in the database and similar is entirely counterproductive.

What contexts do we encounter?

As mentioned, escaping converts characters that have a special meaning in a certain context. Different escaping functions are used for each context. This table is only indicative, and it is necessary to read the notes below.

Context	Escaping Function	Reverse Function
HTML	htmlspecialchars	html_entity_decode
XML	htmlspecialchars	—
regular expression	preg_quote	—
PHP strings	var_export	—
MySQL database	mysql_real_escape_string	—
MySQL improved	mysqli_real_escape_string	—
SQLite database	sqlite_escape_string	—
PostgreSQL database	pg_escape_string	—
PostgreSQL, bytea type	pg_escape_bytea	pg_unescape_bytea
JavaScript, JSON	json_encode	json_decode
CSS	addcslashes	—
URL	rawurlencode	urldecode

Explanation of the following notes:

many contexts have their subcontexts where escaping differs. Unless otherwise stated, the specified escaping function is applicable universally without further differentiation of subcontexts.
the term usual character set refers to a character set with 1-byte or UTF-8 encoding.

HTML

In HTML contexts, the characters < & " ' collectively have a special meaning, and the corresponding sequences are < & " '. However, the exception is an HTML comment, where only the pair -- has special meaning.

For escaping, use:

$s = htmlspecialchars($s, ENT_QUOTES);

It works with any usual character set. However, it does not consider the subcontext of HTML comments (i.e., it cannot replace the pair -- with something else).

Reverse function:

$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');

XML / XHTML

XML 1.0 differs from HTML in that it prohibits the use of C0 control characters (including writing in the form of an entity) except for the tabulator, line feed, and space. XML 1.1 allows these banned characters, except NUL, in the form of entities, and further mandates C1 control characters, except NEL, also to be written as entities. Additionally, in XML, the sequence ]]> has a special meaning, so one of these characters must also be escaped.

For XML 1.0 and any usual character set, use:

$s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
$s = htmlspecialchars($s, ENT_QUOTES);

Regular Expression

In Perl regular expressions, characters . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - and the so-called delimiter, which is a character delimiting the regular expression (e.g., for the expression '#[a-z]+#i' it is #), collectively have special meaning. They are escaped with the character \.

$s = preg_quote($s, $delimiter);

In the string replacing the searched expression (e.g., the 2nd parameter of the preg_replace function), the backslash and dollar sign have special meaning:

$s = addcslashes($replacement, '$\\');

The encoding must be either 1-byte or UTF-8, depending on the modifier in the regular expression.

PHP Strings

PHP distinguishes these types of strings:

in single quotes, where special meaning can have characters \ '
in double quotes, where special meaning can have characters \ " $
NOWDOC, where no character has special meaning
HEREDOC, where special meaning can have characters \ $

Escape is done with the character \. This is usually done by the programmer when writing code, for PHP code generators, you can use the var_export function.

Note: because the mentioned regular expressions are usually written within PHP strings, both types of escaping need to be combined. E.g., the character \ for a regular expression is written as \\ and in a quoted string it needs to be written as \\\\.

SQL and Databases

Each database has its own escaping function, see the table above. Almost always, however, only a function for escaping strings is available, and it cannot be used for anything else, especially there are no functions for escaping wildcard characters used in LIKE constructions (in MySQL these are % _) or identifiers, such as table or column names. Databases do not require removing escaping on output! (Except, for example, for bytea type.)

For character sets with unusual multi-byte encoding, it is necessary to set the function mysql_set_charset or mysqli_set_charset in MySQL.

I recommend using a database layer (e.g., dibi, Nette Database, PDO) or parameterized queries, which take care of escaping for you.

JavaScript, JSON

As a programming language, JavaScript has a number of very different subcontexts. For escaping strings, you can use the side effect of the function

$s = json_encode((string) $s);

which also encloses the string in quotes. Strictly requires UTF-8.

JavaScript written inside HTML attributes (e.g., onclick) must still be escaped according to HTML rules, but this does not apply to JavaScript inside <script> tags, where only the potential occurrence of the ending tag </script> inside the string needs to be treated. However, json_encode ensures this, as JSON escapes the slash /. However, it does not handle the end of an HTML comment --> (which does not matter in HTML) or an XML CDATA block ]]>, which the script is wrapped in. For XML/XHTML, the solution is

$s = json_encode((string) $s);
$s = str_replace(']]>', ']]\x3E', $s);

Since JSON uses a subset of JavaScript syntax, the reverse function json_decode is fully usable only for JSON, limitedly for JavaScript.

CSS

In CSS contexts, the range of valid characters is precisely defined, for escaping identifiers, for example, you can use this function:

$s = addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");

For CSS within HTML code, the same applies as stated about JavaScript and its escaping within HTML attributes and tags (here it is about the style attributes and <style> tags).

URL

In the context of a URL, everything except the letters of the English alphabet, digits, and characters - _ . is escaped by replacing them with % + the hexadecimally expressed byte.

$s = rawurlencode($s);

According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters in UTF-8 encoding is preferred.

The reverse function in this case is urldecode, which also recognizes the + character as meaning space.

If you find the whole topic too complicated, don't despair. Soon you will realize that it is actually about simple transformations, and the whole trick is in realizing which context I am in and which function I need to choose for it. Or even better, try using an intelligent templating system that can recognize contexts itself and apply proper escaping: Latte.