Escaping – The Definitive Guide
One of the evergreen topics in programming is the confusion surrounding escaping. This ignorance keeps the simplest ways of compromising websites, such as Cross-Site Scripting (XSS) and SQL injection, unfortunately widespread.
Escaping is the substitution of characters that have a special meaning in a given context with other corresponding sequences.
Example: To write quotes within a string enclosed by quotes, you need to replace them because quotes have a special meaning in the context of a string, and writing them plainly would be interpreted as ending the string. The specific substitution rules are determined by the context.
Prerequisites
Each escaping function assumes that the input is always a “raw string” (unmodified) in a certain encoding (character set).
Storing strings in the database already escaped for HTML output, and similar practices, is entirely counterproductive.
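A minimal sketch of why pre-escaped storage backfires, assuming the stored value is escaped again when it is printed (the variable names are illustrative, not part of the original text):

```php
<?php
// Raw input, exactly as the user typed it.
$raw = 'Tom & "Jerry" <3';

// Correct: keep the raw string and escape once, at output time.
$once = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Wrong: the stored value was already escaped, and the template escapes again.
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

echo $once, "\n";  // Tom &amp; &quot;Jerry&quot; &lt;3
echo $twice, "\n"; // Tom &amp;amp; &amp;quot;Jerry&amp;quot; &amp;lt;3
```

The second result would show literal entity text such as &amp; to the visitor, which is the typical symptom of escaping the same string twice.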
What contexts do we encounter?
As mentioned, escaping converts characters that have a special meaning in a certain context. Different escaping functions are used for each context. This table is only indicative, and it is necessary to read the notes below.
Context | Escaping Function | Reverse Function |
---|---|---|
HTML | htmlspecialchars | html_entity_decode |
XML | htmlspecialchars | — |
regular expression | preg_quote | — |
PHP strings | var_export | — |
MySQL database | mysql_real_escape_string | — |
MySQL improved | mysqli_real_escape_string | — |
SQLite database | sqlite_escape_string | — |
PostgreSQL database | pg_escape_string | — |
PostgreSQL, bytea type | pg_escape_bytea | pg_unescape_bytea |
JavaScript, JSON | json_encode | json_decode |
CSS | addcslashes | — |
URL | rawurlencode | urldecode |
Explanation of the following notes:
- many contexts have their subcontexts where escaping differs. Unless otherwise stated, the specified escaping function is applicable universally without further differentiation of subcontexts.
- the term usual character set refers to a character set with 1-byte or UTF-8 encoding.
HTML
In HTML contexts, the characters < & " ' have a special meaning, and the corresponding sequences are &lt; &amp; &quot; &#039;. The exception is an HTML comment, where only the pair -- has special meaning.
For escaping, use:
$s = htmlspecialchars($s, ENT_QUOTES);
It works with any usual character set. However, it does not consider the subcontext of HTML comments (i.e., it cannot replace the pair -- with something else).
Reverse function:
$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
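The HTML rules above could be wrapped roughly like this. Both function names are hypothetical, and the comment helper is only one possible way to break up the -- pair, not something PHP provides:

```php
<?php
// Escape a raw string for HTML text or a quoted attribute value.
function escapeHtml(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper for the comment subcontext: break up every "--" pair
// by inserting a space after each hyphen that is followed by another hyphen.
function escapeHtmlComment(string $s): string
{
    return preg_replace('#-(?=-)#', '- ', $s);
}

$title = 'Rock\'n"Roll" <b> & more';

echo '<p title="', escapeHtml($title), '">', escapeHtml($title), "</p>\n";
echo '<!-- ', escapeHtmlComment('a --> b'), " -->\n";
```

The same escapeHtml() call serves both the element text and the quoted attribute, which is why ENT_QUOTES is passed: it covers single and double quotes alike.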
XML / XHTML
XML 1.0 differs from HTML in that it prohibits C0 control characters (even when written as entities), except for the tab, line feed, and carriage return. XML 1.1 allows these banned characters, except NUL, in the form of entities, and further requires C1 control characters, except NEL, to be written as entities as well.
Additionally, in XML the sequence ]]> has a special meaning, so one of its characters must also be escaped.
For XML 1.0 and any usual character set, use:
$s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
$s = htmlspecialchars($s, ENT_QUOTES);
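As a sketch, the two steps can be combined into one helper. The name escapeXml is an assumption, and the input is assumed to be valid UTF-8:

```php
<?php
// Escape a raw string for XML 1.0 text or attribute values.
function escapeXml(string $s): string
{
    // Drop the C0 control characters that XML 1.0 forbids
    // (everything below 0x20 except tab, line feed, carriage return).
    $s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
    // Escaping ">" also neutralises the "]]>" sequence.
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

echo escapeXml("bell\x07 & ]]> end"), "\n"; // bell &amp; ]]&gt; end
```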
Regular Expression
In Perl regular expressions, the characters . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - and the so-called delimiter, which is the character delimiting the regular expression (e.g., # for the expression '#[a-z]+#i'), have special meaning. They are escaped with the character \.
$s = preg_quote($s, $delimiter);
In the replacement string (e.g., the 2nd parameter of the preg_replace function), the backslash and dollar sign have special meaning:
$s = addcslashes($replacement, '$\\');
The encoding must be either 1-byte or UTF-8, depending on the modifier in the regular expression.
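A short sketch of both regular-expression subcontexts together, the pattern and the replacement string. The helper names literalPattern and replaceLiteral are made up for this illustration:

```php
<?php
// Build a pattern that matches user-supplied text literally.
function literalPattern(string $text, string $delimiter = '#'): string
{
    return $delimiter . preg_quote($text, $delimiter) . $delimiter;
}

// Replace every literal occurrence of $search with $replacement,
// treating "$" and "\" in the replacement as plain characters.
function replaceLiteral(string $search, string $replacement, string $subject): string
{
    return preg_replace(
        literalPattern($search),
        addcslashes($replacement, '$\\'),
        $subject
    );
}

echo literalPattern('1.5'), "\n"; // #1\.5#
echo replaceLiteral('PRICE', '$1 (C:\\temp)', 'Cost: PRICE'), "\n"; // Cost: $1 (C:\temp)
```

Without the addcslashes() call, $1 in the replacement would be read as a reference to the first capturing group rather than as literal text.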
PHP Strings
PHP distinguishes these types of strings:
- in single quotes, where the characters \ ' can have special meaning
- in double quotes, where the characters \ " $ can have special meaning
- NOWDOC, where no character has special meaning
- HEREDOC, where the characters \ $ can have special meaning
Escaping is done with the character \. The programmer usually does this by hand when writing code; for PHP code generators, you can use the var_export function.
Note: because the regular expressions mentioned above are usually written within PHP strings, both types of escaping need to be combined. E.g., the character \ is written as \\ in a regular expression, and in a quoted string that needs to be written as \\\\.
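To illustrate both points, here is a small sketch with an arbitrary sample string: var_export produces a PHP literal, and a regular expression matching one backslash needs four backslashes in the source code.

```php
<?php
// The raw string contains one apostrophe and one backslash.
$raw = "It's a \\ backslash";

// var_export() with true as the second argument returns the PHP literal.
$literal = var_export($raw, true);
echo $literal, "\n"; // 'It\'s a \\ backslash'

// Combined escaping: the regex \\ (one literal backslash)
// has to be written as \\\\ inside a quoted PHP string.
var_dump(preg_match('#\\\\#', $raw)); // int(1)
```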
SQL and Databases
Each database has its own escaping function, see the table above. Almost always, however, only a function for escaping strings is available, and it cannot be used for anything else. In particular, there are usually no functions for escaping the wildcard characters used in LIKE constructions (in MySQL these are % _) or identifiers, such as table or column names. Databases do not require removing escaping on output! (Except, for example, for the bytea type.)
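For character sets with unusual multi-byte encoding, the connection character set in MySQL must be set with the function mysql_set_charset or mysqli_set_charset. Note that the original mysql extension (the mysql_* functions) was removed in PHP 7.0, so current code has to use mysqli or PDO.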
I recommend using a database layer (e.g., dibi, Nette Database, PDO) or parameterized queries, which take care of escaping for you.
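A hedged sketch of a parameterized query with PDO, combined with manual escaping of LIKE wildcards. The escapeLike helper, the choice of ! as the escape character, and the users table are assumptions made for the example, not part of any library:

```php
<?php
// Hypothetical helper: escape LIKE wildcards with "!" as the escape
// character, which must then be declared with ESCAPE '!' in the query.
function escapeLike(string $s): string
{
    return strtr($s, ['!' => '!!', '%' => '!%', '_' => '!_']);
}

// Sketch of a parameterized search; the "users" table is an assumption.
function findUsersByPrefix(PDO $pdo, string $prefix): array
{
    $stmt = $pdo->prepare("SELECT name FROM users WHERE name LIKE ? ESCAPE '!'");
    // The driver handles the string; only the wildcards needed manual care.
    $stmt->execute([escapeLike($prefix) . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

echo escapeLike('100%_sure!'), "\n"; // 100!%!_sure!!
```

Declaring an explicit escape character sidesteps the differences between databases in how a backslash inside a string literal is treated.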
JavaScript, JSON
As a programming language, JavaScript has a number of very different subcontexts. For escaping strings, you can use the side effect of the function
$s = json_encode((string) $s);
which also encloses the string in quotes. It strictly requires UTF-8.
JavaScript written inside HTML attributes (e.g., onclick) must additionally be escaped according to HTML rules. This does not apply to JavaScript inside <script> tags, where only a potential occurrence of the ending tag </script> inside the string needs to be handled. json_encode takes care of this, because JSON escapes the slash /. However, it does not handle the end of an HTML comment --> (which does not matter in HTML) or the end of an XML CDATA block ]]>, in which the script is wrapped. For XML/XHTML, the solution is:
$s = json_encode((string) $s);
$s = str_replace(']]>', ']]\x3E', $s);
Since JSON uses a subset of JavaScript syntax, the reverse function json_decode is fully usable only for JSON and only to a limited extent for JavaScript.
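The three JavaScript situations described above can be sketched as follows. The function names are hypothetical, and the input is assumed to be valid UTF-8, since json_encode fails otherwise:

```php
<?php
// JavaScript string literal for use inside <script> in HTML.
function escapeJs(string $s): string
{
    return json_encode($s);
}

// Same, for a script wrapped in an XML CDATA block.
function escapeJsForXml(string $s): string
{
    return str_replace(']]>', ']]\x3E', json_encode($s));
}

// JavaScript inside an HTML attribute: JS escaping first, then HTML escaping.
function escapeJsForHtmlAttr(string $s): string
{
    return htmlspecialchars(json_encode($s), ENT_QUOTES, 'UTF-8');
}

echo '<script>alert(', escapeJs('</script>'), ")</script>\n";
echo '<a onclick="alert(', escapeJsForHtmlAttr('it\'s "x"'), ')">x</a>', "\n";
```

The order in the attribute case matters: the string is first made safe for JavaScript, and the resulting JavaScript code is then made safe for HTML.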
CSS
In CSS contexts, the range of valid characters is precisely defined. For escaping identifiers, for example, you can use this function:
$s = addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");
For CSS within HTML code, the same applies as stated about JavaScript and its escaping within HTML attributes and tags (here these are the style attribute and <style> tags).
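A small sketch of the identifier escaping in use. The wrapper name escapeCssIdent and the sample class name are assumptions, and the sketch does not cover every rule for identifiers (for instance, one starting with a digit):

```php
<?php
// Escape a raw string for use as a CSS identifier (class name, ID, ...).
function escapeCssIdent(string $s): string
{
    return addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");
}

$class = 'col:1/2';
echo '.', escapeCssIdent($class), " { width: 50%; }\n"; // .col\:1\/2 { width: 50%; }
```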
URL
In the context of a URL, everything except the letters of the English alphabet, digits, and the characters - _ . ~ is escaped by replacing it with % followed by the byte expressed in hexadecimal.
$s = rawurlencode($s);
According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters in UTF-8 encoding is preferred.
The reverse function in this case is urldecode, which also recognizes the + character as meaning a space.
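A brief sketch of URL escaping in both directions. The example.com address and the q parameter are placeholders chosen for the illustration:

```php
<?php
// Each parameter value is escaped separately, so the raw "&" and "/" in the
// value cannot be confused with URL syntax.
$query = "caf\xC3\xA9 & bar/1"; // "café & bar/1" encoded in UTF-8
$url = 'https://example.com/search?q=' . rawurlencode($query);

echo $url, "\n"; // https://example.com/search?q=caf%C3%A9%20%26%20bar%2F1

// urldecode() also turns "+" into a space, rawurldecode() does not.
var_dump(urldecode('a+b%20c'));    // string(5) "a b c"
var_dump(rawurldecode('a+b%20c')); // string(5) "a+b c"
```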
If you find the whole topic too complicated, don't despair. Soon you will realize that it is actually about simple transformations, and the whole trick lies in knowing which context you are in and which function to choose for it. Or even better, try using an intelligent templating system that can recognize contexts itself and apply proper escaping: Latte.