phpFashion

Do you know what the $ symbol means in a regular expression?

First, a question: match or no match?

$str = "123\n";

echo preg_match('~^\d+$~', $str);

If you think there is no match because the regular expression operates in single-line mode and allows nothing but digits in the string, you are mistaken.

I'll digress slightly. Regular expressions in Ruby have a flaw (an inconsistency with the de facto Perl standard): the ^ and $ characters do not denote the start and end of the string, but only the start and end of a line within it. Not knowing this can cause security vulnerabilities, as noted in the Rails documentation. PHP behaves according to the standard, but few know what exactly that standard behavior means. The documentation for the meta-character $ was imprecise (it has since been corrected).

Correctly, the $ character means the end of the string or the position just before a newline that terminates the string; in multiline mode (modifier m), it means the end of a line.

The actual end of the string is captured with the sequence \z. Alternatively, you can use the dollar sign together with the modifier D.

$str = "123\n";
echo preg_match('~^[0-9]+$~', $str); // 1 (match)
echo preg_match('~^[0-9]+$~D', $str); // 0 (no match)
echo preg_match('~^[0-9]+\z~', $str); // 0 (no match)
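For completeness, here is a short sketch of the multiline mode mentioned above. The sample string is made up for illustration:

```php
$text = "abc\n123\nxyz";

// Without the m modifier, ^ and $ anchor only to the string as a whole.
$single = preg_match('~^\d+$~', $text);    // 0

// With m, ^ and $ also match at every line boundary, so the middle line matches.
$multi = preg_match('~^\d+$~m', $text);    // 1

// \A and \z ignore the m modifier and always mean the real start and end.
$strict = preg_match('~\A\d+\z~m', $text); // 0
```

So for validating a whole value, \A and \z (or $ with the D modifier) are the safer choice.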

FUCK. THAT. SHIT.

Large frameworks aren't always and universally suitable for everyone and everything!

I borrowed the title from the Manifesto of Miniature PHP, which I would happily sign electronically, if I had a digital signature. Although the argument about counting lines is unfair and debatable, I understand what the author was trying to say. On Zdroják, I wrote a comment that I eventually decided to immortalize here on the blog:

I often make simple websites, which I write entirely in “notepad”, and I want the code to have no more lines than is absolutely necessary. Uploading a several-megabyte framework for a 20kB website, including styles, to a hosting service is out of the question.

Yet, even in these simple websites, I want to use solutions that are available in Nette, and I don't want to give up the comfort I'm used to. I am a lazy programmer. For this reason, the Nette Framework can be used as a micro-framework.

An example would be appropriate. Just yesterday, I redesigned https://davidgrudl.com and made the source code available (check the top left corner), purely for inspiration to others on how I handle such a microsite. The entire PHP code of the website is contained in a single file, index.php, which is, I believe, understandable, although perhaps less so for the uninitiated. The rest are templates. And the framework is uploaded in the minified form of a single file, which, along with the fact that it's about twice the size of jQuery, overcomes the psychological block of “not wanting to upload a whole framework.”

Or take the example of a blog found directly in the distribution. Its source code is also just index.php, with even fewer lines than the previous example. Everything else is templates, see https://github.com/…ta/templates.

Perhaps I should explain why I actually use a framework on tiny websites. Mainly, today I cannot imagine programming without Tracy, which then logs errors on the production server (although they are unlikely with a static website). But I primarily use the Latte templating system because, starting from just 2 pages, I want to separate layout and content, I like Latte’s concise syntax, and I rely on its automatic escaping. I also use routing, because simply wanting URLs without .php extensions can only be set up correctly by God himself.

The first website mentioned also uses caching for Twitter feeds, and the blog uses a database layer. And there’s also a Nette SEO trick, which automatically prevents the well-known problem where you move forwards and backwards through pagination and land back on the first page, only to have page=1 haunting the URL.

Nette also ensures that if there is an error, no PHP programming error messages are displayed, but rather a user-understandable page. And also autoloading – I've come to take it for granted so much that I would have completely forgotten to mention it.

Of course, I sometimes add a contact form and have it send emails. Now I realize that I actually use 90% of the framework.

That's how I create quick'n'dirty websites and that's how I enjoy it 😉

See also: How to write micro-websites

Fat and Sausages in String Replacement

How not to get burned when replacing occurrences of one string with another. Search & Replace tricks.

The basic function for replacing strings in PHP is str_replace:

$s = "Lorem ipsum";
echo str_replace('ore', 'ide', $s); // returns "Lidem ipsum"

Thanks to cleverly designed UTF-8 encoding, it can be reliably used even for strings encoded this way. Additionally, the first two arguments can be arrays, and the function will then perform multiple replacements. Here we encounter the first trick to be aware of. Each replacement goes through the string again, so if we wanted to swap dá ⇔ pá in the phrase pánské dárky to get dánské párky (a Swedish delicacy!), no order of arguments will achieve this:

// returns "dánské dárky"
echo str_replace(array('dá', 'pá'), array('pá', 'dá'), "pánské dárky");

// returns "pánské párky"
echo str_replace(array('pá', 'dá'), array('dá', 'pá'), "pánské dárky");

The sought-after function that goes through the string just once and prevents collisions is strtr:

// returns "dánské párky", hooray
echo strtr("pánské dárky", array('pá' => 'dá', 'dá' => 'pá'));

If we need to find occurrences according to more complex rules, we use regular expressions and the function preg_replace. It also allows for multiple replacements and behaves similarly to str_replace. Now, however, I am heading elsewhere. I need to replace all numbers in the string with the word hafo, which is easy:

$s = "Radek says he has an IQ of 151. Quite the collector's item!";
echo preg_replace('#\d+#', 'hafo', $s);

Let's generalize the code so it can replace numbers with anything we pass in the variable $replacement. Many programmers will use:

return preg_replace('#\d+#', $replacement, $s); // wrong!

Unfortunately, that’s not right. It's important to realize that certain characters have special meanings in the replacement string (specifically the backslash and the dollar sign), so we must “escape” them; see Escaping – The Definitive Guide. The correct general solution is:

return preg_replace('#\d+#', addcslashes($replacement, '$\\'), $s); // ok
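To see the difference in practice, here is a minimal sketch. The helper name replaceNumbers and the sample strings are made up for illustration:

```php
// Hypothetical helper wrapping the escaped call shown above.
function replaceNumbers(string $s, string $replacement): string
{
    return preg_replace('#\d+#', addcslashes($replacement, '$\\'), $s);
}

$s = 'Price: 100';

// Unescaped: "$0" is read as a reference to the whole match.
$naive = preg_replace('#\d+#', '$0.50', $s); // "Price: 100.50"

// Escaped: the replacement is inserted literally.
$safe = replaceNumbers($s, '$0.50');         // "Price: $0.50"
```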

Do any other replacement tricks come to mind?

Zen of Dependency Injection

Dependency Injection is a technique that solves certain problems but also introduces new challenges. These challenges are then addressed by a DI (Dependency Injection) container, which requires you to adopt a new perspective on object-oriented design.

If the problems that DI solves do not bother you, then you might perceive its implementation as an unnecessary hassle, particularly because it necessitates learning a new approach to object-oriented design.

However, it seems that if you are not bothered by the issues DI addresses, you have a serious problem. Which you will realize once you discover it.

“I Have the Worst Client, They Keep Changing the Requirements”

Do you know the complaints developers have about their clients not having a clear vision and constantly changing the project requirements? That's them crying over their own inability. Whenever I hear this, I wish the poor client had a better provider.

The client doesn't have a clear brief because they are not experts in web design. I wonder how many web designers understand their client's business well enough that they could create a precise brief if the roles were reversed.

If the client continuously changes the requirements, it means they are interested and engaged in the project, constantly thinking about it. There's a higher chance that something truly useful will emerge. And most importantly: they will keep asking for more and more work.

If the developer realizes this, they will understand that it is they who must adapt their working style. Perhaps simplify the addition of a ZIP code column on the website, even though it wasn't in the original brief.

Best Practices for Namespaces in PHP

Here are some well-intentioned tips on how to design the structure of namespaces and class names.

Namespaces are probably the best-known new feature of PHP version 5.3. Their main purpose is to prevent name conflicts and to allow shortening (aliasing) of class names for use within a single file. In practice, it has been shown that conflicts can also be avoided by using a 1–2 letter prefix, just as I have never used class names like Zend_Service_DeveloperGarden_Response_ConferenceCall_AddConferenceTemplateParticipantResponseType (97 characters, I wonder how they adhere to their maximum line length rule of 80 characters 🙂 ). However, PHP follows in the footsteps of Java, and so we have namespaces. How should we handle them?

Benefits of Namespaces

Perhaps the most complex question you need to answer is: what is the benefit of renaming a class:

sfForm → Symfony\Component\Form\Form

This question is a proven starter for endless flame wars. From the perspective of programmer comfort, intuitiveness, and memorability, the original concise and descriptive sfForm is more appropriate. It corresponds to how programmers colloquially refer to it, i.e., “form in Symfony”. The new and longer name is correct in other respects, which I am not sure the average user will appreciate.

How to Layout Namespaces?

The syntactic aspect of using namespaces is described in the documentation, but finding the right patterns requires practice, and there has not been enough time for that yet. Namespaces in PHP have their specifics due to a number of factors, so it is not ideal to copy conventions used in Java or .NET exactly. However, they can be a good starting point.

More will be discussed in the individual naming rules.

1) A class should have a descriptive name even without mentioning the NS

The name of each class, even without the namespace, must capture its essence. It would be inappropriate to rename the class ArrayIterator → Spl\Iterators\Array, as one would not expect an iterator under the name Array (ignoring the fact that a class cannot be named a keyword). And beware, even from the name Spl\Iterators\Array, it is not clear that it is an iterator, because you cannot assume that the namespace Spl\Iterators only contains iterators. Here are a few examples:

unsuitable: Nette\Application\Responses\Download – it is not obvious that Download is a response
unsuitable: Zend\Validator\Date – you would expect Date to be a date, not a validator
unsuitable: Zend\Controller\Request\Http – it is not obvious that Http is a request

Therefore, in addition to specializing classes, it is appropriate to keep a level of generality in the name:

better: Nette\Application\Responses\DownloadResponse
better: Zend\Validator\DateValidator
better: Zend\Controller\Request\HttpRequest

The ideal is a one-word yet descriptive name. This works particularly well for classes that represent something from the real world:

best: Nette\Forms\Controls\Button – two-word ButtonControl not necessary (however, HiddenControl cannot be shortened to Hidden)

2) The namespace should have a descriptive name

Naturally, the name of the namespace itself must be descriptive, and it is advantageous to have a shorter name without redundancies. To me, Component in Symfony\Component\Routing is such a redundancy, because the name would not suffer without it.

In some situations, you need to decide between singular and plural (e.g., Zend\Validator vs Zend\Validators), which is just as unsettled a question as choosing between singular and plural for database table names.

3) Distinguish between namespaces and classes

Naming a class the same as a namespace (i.e., having classes Nette\Application and Nette\Application\Request) is technically possible, but it might confuse programmers and it is better to avoid it. Also, consider how well the resulting code will read or how you would explain the API to someone.

4) Limit unnecessary duplications (+ partial namespace)

Ideally, the name of the class and the name of the space should not contain the same information redundantly.

instead of Nette\Http\HttpRequest prefer Nette\Http\Request
instead of Symfony\Component\Security\Authentication\AuthenticationTrustResolver prefer the class TrustResolver

The class Nette\Http\Request does not violate rule No. 1 about the descriptive name of the class even without mentioning the namespace, on the contrary, it allows us to elegantly use the partial namespace:

use Nette\Http; // alias for namespace

// all classes via Http are available:
$request = new Http\Request;
$response = new Http\Response;
// and additionally, Http\Response is more understandable than just Response

If we understand namespaces as packages, which is common, it leads to unfortunate duplication of the last word:

Zend\Form\Form
Symfony\Component\Finder\Finder
Nette\Application\Application

Namespaces also literally encourage grouping classes (e.g., various implementations of the same interface, etc.) into their own spaces, which again creates duplications:

Nette\Caching\Storages\FileStorage – i.e., all storages in a separate space Storages
Zend\Form\Exception\BadMethodCallException – all exceptions in Exception
Symfony\Component\Validator\Exception\BadMethodCallException – again all exceptions in Exception

Grouping namespaces lengthen the name and create duplication in it, because it is often impossible to remove the generality from the class name (rule 1). Their advantage may be better orientation in the generated API documentation (although this could be achieved differently) and easier access when using full-fledged IDEs with code completion. However, I recommend using them cautiously. For exceptions, for example, they are not very suitable.

5) Unmistakable classes from multiple spaces

According to point 1), a class should have a descriptive name, but that does not mean it has to be unique within the entire application. Usually, it is enough that it is unique within the namespace. However, if two classes from different namespaces are often used next to each other in the code, or if they have some other significant connection, they should not have the same name. In other words, it should not be necessary to use the as keyword in the use clause.
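A small sketch of the situation this rule tries to avoid. The namespaces and classes here are hypothetical, and all of them are declared in one file only to keep the example self-contained:

```php
namespace App\Http {
    class Request {}
}

namespace App\Console {
    class Request {} // same short name, different namespace
}

namespace App {
    use App\Http\Request;
    use App\Console\Request as ConsoleRequest; // alias forced by the name clash

    $web = new Request;
    $cli = new ConsoleRequest;

    echo get_class($web), "\n"; // App\Http\Request
    echo get_class($cli), "\n"; // App\Console\Request
}
```

If the two classes regularly meet in the same file, every such file needs the alias, and readers must remember which Request is which.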

6) One-way dependencies

Consider what dependencies should exist between classes from different namespaces. I try to maintain:

if a class from the namespace A\B has a dependency on a class from the namespace A\C, no class from A\C should have a dependency on A\B
classes from the namespace A\B should not have dependencies on a class from the space A\B\C (take this with a grain of salt)

p.s.: Please do not take this article as dogma, it is just a capture of current thoughts

Are these URLs the same?

A question that many webmasters ask: do search engines perceive these URLs as the same? How should they be treated?

http://example.com/article
http://example.com/article/
http://example.com/Article
https://example.com/article
http://www.example.com/article
http://example.com/article?a=1&b=2
http://example.com/article?b=2&a=1

The short answer would be: “URLs are different.” However, a more detailed analysis is needed.

From a user's perspective, these addresses differ only in minor details which they generally disregard. Thus, they perceive them as the same, although technically, they are different addresses. Let's call them similar addresses. For the sake of “user experience”, two principles should be adhered to:

Do not allow different content on similar addresses. As I will show soon, this would not only confuse users but also search engines.
Allow users access through similar addresses.

If the addresses differ in protocol http / https or with www domain or without, search engines consider them different. Not so for users. It would be a fatal mistake to place different content on such similar addresses. However, it would also be a mistake to prevent access through a similar address. The address with www and without www must both function, with SEO recommending sticking to one variant and redirecting the others to it using a 301 HTTP code. This can be managed for the www subdomain with a .htaccess file:

# redirection to the non-www variant
RewriteCond %{HTTP_HOST} ^www\.
RewriteRule ^.*$   http://example.com/$0  [R=301,NE,L]

# redirection to the www variant
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^.*$   http://www.example.com/$0  [R=301,NE,L]

Immediately test whether your servers redirect, including the full address and correct parameter passing. Don't forget variants like www.subdomain.example.cz. Because some browsers can bypass missing redirections, try a low-level service like Web-Sniffer.

URLs are case-sensitive except for the scheme and domain. However, users do not differentiate and therefore, it is unfortunate to offer different content on addresses differing only by letter case. A poor example can be seen in Wikipedia:

http://en.wikipedia.org/wiki/Acid about acids
http://en.wikipedia.org/wiki/ACID about database transactions

Bing amusingly suffers from an error, returning the same URL whether you search for acid or a database (although the textual description is correct). Google and Yahoo do not have this issue.

(Screenshot: Bing does not differentiate between acid and database.)

Some services (webmails, ICQ) convert uppercase letters in URLs to lowercase. These are all reasons to avoid distinguishing letter case, even in parameters. It is better to adhere to the convention that all letters in URLs should be lowercase.
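The two conventions discussed so far (one host variant, lowercase paths) can be sketched in plain PHP. The function canonicalUrl is a hypothetical helper, not part of any framework, and it assumes that the non-www host is the canonical one:

```php
// Hypothetical helper: returns the URL to redirect to,
// or null when the given URL is already in canonical form.
// Assumptions: the non-www host is canonical and paths are lowercase.
// Ports, credentials and fragments are ignored in this sketch.
function canonicalUrl(string $url): ?string
{
    $parts = parse_url($url);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL, nothing to decide here
    }

    $host = strtolower($parts['host']);
    if (strncmp($host, 'www.', 4) === 0) {
        $host = substr($host, 4);
    }

    $canonical = strtolower($parts['scheme']) . '://' . $host
        . strtolower($parts['path'] ?? '/')
        . (isset($parts['query']) ? '?' . $parts['query'] : '');

    return $canonical === $url ? null : $canonical;
}

$target = canonicalUrl('http://www.Example.com/Article?a=1');
if ($target !== null) {
    // in a real request handler: header('Location: ' . $target, true, 301); exit;
    echo $target, "\n"; // http://example.com/article?a=1
}
```

The query string is deliberately left untouched here, since lowercasing parameter values could change their meaning.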

Distinguishing some similar addresses is also a challenge for search engines. I conducted an experiment by placing different content on URLs differing in details like the presence of a trailing slash or parameter order. Only Google was able to index them as different. Other search engines could always handle only one of the variants.

(Screenshot: only Google can index these pages as different.)

As for trailing slashes, the web server usually redirects to the canonical form for you; if you access a directory without a trailing slash, it adds one and redirects. Of course, this does not apply when you manage URIs on your own (Cool URIs, etc.)

Finally: does the order of parameters really matter? There should be no difference between article?a=1&b=2 and article?b=2&a=1. However, there are situations where this is not the case, especially when passing complex structures such as arrays. For instance, ?sort[]=name&sort[]=city might be different from ?sort[]=city&sort[]=name. Nevertheless, redirecting if parameters are not in the specified order would be considered unnecessary overcorrection.
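How PHP itself sees the two cases can be checked with parse_str. This is only an illustration of the point above; the variable names are arbitrary:

```php
parse_str('a=1&b=2', $first);
parse_str('b=2&a=1', $second);

// Same key/value pairs; the == operator on arrays ignores key order.
$scalarsEqual = ($first == $second); // true

parse_str('sort[]=name&sort[]=city', $byName);
parse_str('sort[]=city&sort[]=name', $byCity);

// Here the order determines the numeric indexes, so the arrays differ.
$arraysEqual = ($byName == $byCity); // false
```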

p.s. Nette Framework automatically handles redirection to canonical URLs on its own.

Treacherous Regular Expressions in PHP

In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one may not always be available, and the third is deprecated, so you should exclusively use the more adept and faster PCRE library. Unfortunately, its implementation suffers from quite unpleasant flaws across all PHP versions.

The operation of the preg_* functions can be divided into two steps:

compilation of the regular expression
execution (searching, replacing, filtering, …)

It is advantageous that PHP maintains a cached version of compiled regular expressions, meaning they are only compiled once. Therefore, it is appropriate to use static regular expressions, i.e., not to generate them parametrically.

Now for the unpleasant issues. If an error is discovered during compilation, PHP will issue an E_WARNING error, but the return value of the function is inconsistent:

preg_filter, preg_replace_callback, preg_replace return null
preg_grep, preg_match_all, preg_match, preg_split return false
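The inconsistency is easy to reproduce. A minimal sketch with a deliberately broken pattern follows; the results in the comments reflect the behavior described above and may vary between PHP versions:

```php
$broken = '~(~'; // unbalanced parenthesis, fails to compile

// The @ operator silences the E_WARNING here only to keep the output readable.
$match   = @preg_match($broken, 'abc');       // false
$replace = @preg_replace($broken, '', 'abc'); // null
$split   = @preg_split($broken, 'abc');       // false
```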

It is good to know that functions returning an array $matches by reference (i.e., preg_match_all and preg_match) do not reset the argument upon a compilation error, which is one more reason to always test the return value.

Since version 5.2.0, PHP has the function preg_last_error returning the code of the last error. However, beware, this only applies to errors that occur during execution! If an error occurs during compilation, the value of preg_last_error is not reset and returns the previous value. If the return value of a preg_* function is not null or false (see above), definitely do not rely on what preg_last_error returns.

What kind of errors can occur during execution? The most common case is exceeding pcre.backtrack_limit or invalid UTF-8 input when using the u modifier. (Note: invalid UTF-8 in the regular expression itself is detected during compilation.) However, the way PHP handles such an error is utterly inadequate:

it generates no message (silent error)
the return value of the function may indicate that everything is fine
the error can only be detected by calling preg_last_error later

Let's talk about the return value, which is probably the biggest betrayal. The process is executed until an error occurs, then it returns a partially processed result. And this is done completely silently. However, even this is not always the case, for example, the trio of functions preg_filter, preg_replace_callback, preg_replace can return null even during execution errors.

Whether an error occurred during execution can only be determined by calling preg_last_error. But as you know, this function returns a nonsensical result if, on the contrary, a compilation error occurred, so we must distinguish both situations by considering the return value of the function, whether it is null or false. And since functions that return null during a compilation error can also return null during an execution error, it can be stated only that PHP is undoubtedly a messed-up language.
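A silent execution error can be demonstrated with invalid UTF-8 input, one of the cases named above. This is a sketch only; the byte \xFF is simply a convenient example of a sequence that is not valid UTF-8:

```php
$invalid = "\xFF"; // not valid UTF-8

// No warning is emitted; only the return value and preg_last_error() tell.
$match = preg_match('~.~u', $invalid);           // false
$errorAfterMatch = preg_last_error();            // PREG_BAD_UTF8_ERROR

$replaced = preg_replace('~x~u', 'y', $invalid); // null
$errorAfterReplace = preg_last_error();          // PREG_BAD_UTF8_ERROR
```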

What would safe use of PCRE functions look like? For example, like this:

function safeReplaceCallback($pattern, $callback, $subject)
{
	// we must verify the callback ourselves
	if (!is_callable($callback)) {
		throw new Exception('Invalid callback.');
	}

	// test the expression on an empty string
	if (preg_match($pattern, '') === false) { // compilation error?
		$error = error_get_last();
		throw new Exception($error['message']);
	}

	// call PCRE
	$result = preg_replace_callback($pattern, $callback, $subject);

	// execution error?
	if ($result === null && preg_last_error()) {
		throw new Exception('Error processing regular expression.', preg_last_error());
	}

	return $result;
}

The provided code transforms errors into exceptions but does not attempt to suppress warning outputs.

Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.

Hackers Will Attack Your Website

Every now and then, a security vulnerability is reported on another significant website (Alza, Mapy.cz, BontonLand) or is exploited. Try searching for XSS vulnerability to understand why Cross Site Scripting (XSS) is currently one of the most widespread and dangerous vulnerabilities.

This is a distressing issue for website operators and perhaps even more so for suppliers. It can damage reputations, lead to fines, lawsuits, or simply spoil relationships with clients. How to defend against XSS? By so-called string escaping. Unfortunately, most experts are not well-versed in this area. (I don’t mean to be tactless or offend anyone, but of the “Czechoslovak IT celebrities,” I only know one person who deeply understands this issue.) Thus, even articles on this topic on well-known websites are, let’s say, inaccurate.

Moreover, this escaping is usually done in the template, falling on the coder’s shoulders. Thus, the most critical area requiring high expertise is handled by someone unqualified. How can this end? We know all too well – see the first paragraph.

Nette Framework Will Save You

I would like to introduce you to a killer feature of the Latte templating system in the Nette Framework. It's such a fundamental feature that it alone is a reason to choose this framework. Or at least to use its templates.

the bigger your company, the more crucial this feature is
no competing framework has it to date ¹⁾

The Nette Framework automatically escapes in templates. Its Context-aware escaping feature recognizes which part of the document you are in and chooses the appropriate escaping method accordingly.

Let's dive into more technical details. You can see how it works best with an example. Consider a variable $var and this template:

<p onclick="alert({$var})">{$var}</p>

<script>
document.title = {$var};
</script>

The notation {$var} means printing the variable. However, each print must be explicitly secured, even differently at each location. A coder must (for example, in Smarty) add the appropriate modifiers, must not make a mistake, and especially not omit anything.

In the Nette Framework, nothing needs to be manually secured. Everything is done automatically, correctly, and consistently!

If we assign $var = 'Width 1/2"' to the variable, the framework generates the HTML code:

<p onclick="alert(&quot;Width 1\/2\&quot;&quot;)">Width 1/2&quot;</p>

<script>
document.title = "Width 1\/2\"";
</script>
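For comparison, the same output can be produced by hand in plain PHP. This is not how Latte is implemented, only a sketch of what a coder would otherwise have to remember at each location:

```php
$var = 'Width 1/2"';

// HTML text context: escape for HTML only.
$text = htmlspecialchars($var, ENT_QUOTES);              // Width 1/2&quot;

// JavaScript inside <script>: escape for JavaScript only.
$script = json_encode($var);                             // "Width 1\/2\""

// JavaScript inside an HTML attribute: JavaScript first, then HTML.
$attr = htmlspecialchars(json_encode($var), ENT_QUOTES); // &quot;Width 1\/2\&quot;&quot;

echo '<p onclick="alert(' . $attr . ')">' . $text . "</p>\n";
echo "<script>\ndocument.title = " . $script . ";\n</script>\n";
```

Three locations, three different combinations of functions, and the order of the calls matters.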

Of course, situations where you need to print a variable without escaping it are also considered, for example, because it contains article text including HTML tags. In such cases, you use the notation {$var|noescape}.

End of the technical digression. Thanks to Latte, it suddenly means that

the template remains simple
you don’t have to worry that a coder will overlook something
and at the same time, you don’t need to have a top expert on escaping ;)
the work is much easier

You can find more information about Latte’s smart templates in the documentation.

¹⁾ About half a year after Nette, Google introduced a similar feature for its library in C++, and as far as I know, no framework in PHP, Ruby, or Python has anything similar yet.

Escaping – The Definitive Guide

One of the evergreen topics in programming is the confusion and misunderstandings around escaping. Ignorance causes the simplest methods of compromising websites, such as Cross Site Scripting (XSS) or SQL injection, to remain unfortunately widespread.

Escaping is the substitution of characters that have a special meaning in a given context with other corresponding sequences.

Example: To write quotes within a string enclosed by quotes, you need to replace them because quotes have a special meaning in the context of a string, and writing them plainly would be interpreted as ending the string. The specific substitution rules are determined by the context.

Prerequisites

Each escaping function assumes that the input is always a “raw string” (unmodified) in a certain encoding (character set).

Storing strings already escaped for HTML output in the database and similar is entirely counterproductive.

What contexts do we encounter?

As mentioned, escaping converts characters that have a special meaning in a certain context. Different escaping functions are used for each context. This table is only indicative, and it is necessary to read the notes below.

Context	Escaping Function	Reverse Function
HTML	htmlspecialchars	html_entity_decode
XML	htmlspecialchars	—
regular expression	preg_quote	—
PHP strings	var_export	—
MySQL database	mysql_real_escape_string	—
MySQL improved	mysqli_real_escape_string	—
SQLite database	sqlite_escape_string	—
PostgreSQL database	pg_escape_string	—
PostgreSQL, bytea type	pg_escape_bytea	pg_unescape_bytea
JavaScript, JSON	json_encode	json_decode
CSS	addcslashes	—
URL	rawurlencode	urldecode

Explanation of the following notes:

many contexts have their subcontexts where escaping differs. Unless otherwise stated, the specified escaping function is applicable universally without further differentiation of subcontexts.
the term usual character set refers to a character set with 1-byte or UTF-8 encoding.

HTML

In HTML contexts, the characters < & " ' collectively have a special meaning, and the corresponding sequences are &lt; &amp; &quot; &#039;. However, the exception is an HTML comment, where only the pair -- has special meaning.

For escaping, use:

$s = htmlspecialchars($s, ENT_QUOTES);

It works with any usual character set. However, it does not consider the subcontext of HTML comments (i.e., it cannot replace the pair -- with something else).

Reverse function:

$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
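A short round trip with a made-up sample string shows both directions:

```php
$raw = '<a href="x">Tom & \'Jerry\'</a>';

$escaped = htmlspecialchars($raw, ENT_QUOTES);
// &lt;a href=&quot;x&quot;&gt;Tom &amp; &#039;Jerry&#039;&lt;/a&gt;

$back = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
// identical to $raw again
```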

XML / XHTML

XML 1.0 differs from HTML in that it prohibits the use of C0 control characters (including writing them in the form of an entity) except for the tab, line feed, and carriage return. XML 1.1 allows these banned characters, except NUL, in the form of entities, and further mandates that C1 control characters, except NEL, also be written as entities. Additionally, in XML, the sequence ]]> has a special meaning, so one of these characters must also be escaped.

For XML 1.0 and any usual character set, use:

$s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
$s = htmlspecialchars($s, ENT_QUOTES);

Regular Expression

In Perl regular expressions, characters . \ + * ? [ ^ ] $ ( ) { } = ! < > | : - and the so-called delimiter, which is a character delimiting the regular expression (e.g., for the expression '#[a-z]+#i' it is #), collectively have special meaning. They are escaped with the character \.

$s = preg_quote($s, $delimiter);

In the string replacing the searched expression (e.g., the 2nd parameter of the preg_replace function), the backslash and dollar sign have special meaning:

$s = addcslashes($replacement, '$\\');

The encoding must be either 1-byte or UTF-8, depending on the modifier in the regular expression.
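As an illustration, here is a sketch that builds a pattern from an arbitrary string. The variable names and the sample value are made up:

```php
$needle = 'a.b#c'; // contains a metacharacter and the delimiter

// preg_quote escapes the dot and, thanks to the second argument, the # delimiter.
$pattern = '#^' . preg_quote($needle, '#') . '$#D';

$exact = preg_match($pattern, 'a.b#c'); // 1
$other = preg_match($pattern, 'aXb#c'); // 0, the dot is no longer a wildcard
```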

PHP Strings

PHP distinguishes these types of strings:

in single quotes, where the characters \ ' can have special meaning
in double quotes, where the characters \ " $ can have special meaning
NOWDOC, where no character has special meaning
HEREDOC, where the characters \ $ can have special meaning

Escaping is done with the character \. This is usually done by the programmer when writing code. For PHP code generators, you can use the var_export function.

Note: because the mentioned regular expressions are usually written within PHP strings, both types of escaping need to be combined. E.g., the character \ for a regular expression is written as \\ and in a quoted string it needs to be written as \\\\.
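Both points can be tried out in a few lines. This is a sketch; eval is used here only to show that the generated literal really evaluates back to the original string:

```php
$s = 'It\'s a \\ backslash';        // the actual value: It's a \ backslash

// var_export produces a PHP string literal, quotes included.
$code = var_export($s, true);       // 'It\'s a \\ backslash'
$roundTrip = eval("return $code;"); // the original value again

// One backslash for the regex engine = two in the pattern = four in the PHP literal.
$hasBackslash = preg_match('~\\\\~', $s); // 1
```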

SQL and Databases

Each database has its own escaping function, see the table above. Almost always, however, only a function for escaping strings is available, and it cannot be used for anything else, especially there are no functions for escaping wildcard characters used in LIKE constructions (in MySQL these are % _) or identifiers, such as table or column names. Databases do not require removing escaping on output! (Except, for example, for bytea type.)

For character sets with unusual multi-byte encoding, it is necessary to set the connection encoding in MySQL using the function mysql_set_charset or mysqli_set_charset.

I recommend using a database layer (e.g., dibi, Nette Database, PDO) or parameterized queries, which take care of escaping for you.
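A minimal sketch of a parameterized query with PDO follows. It assumes the pdo_sqlite extension is available and uses an in-memory database; the table and column names are made up:

```php
// Assumption: the pdo_sqlite extension is installed.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

$name = "O'Reilly"; // contains a character that is special in SQL strings

// The value travels separately from the SQL text, so no manual escaping is needed.
$insert = $pdo->prepare('INSERT INTO users (name) VALUES (?)');
$insert->execute([$name]);

$select = $pdo->prepare('SELECT COUNT(*) FROM users WHERE name = ?');
$select->execute([$name]);
$count = (int) $select->fetchColumn(); // 1
```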

JavaScript, JSON

As a programming language, JavaScript has a number of very different subcontexts. For escaping strings, you can use the side effect of the function

$s = json_encode((string) $s);

which also encloses the string in quotes. Strictly requires UTF-8.

JavaScript written inside HTML attributes (e.g., onclick) must still be escaped according to HTML rules, but this does not apply to JavaScript inside <script> tags, where only the potential occurrence of the ending tag </script> inside the string needs to be treated. However, json_encode ensures this, as JSON escapes the slash /. However, it does not handle the end of an HTML comment --> (which does not matter in HTML) or an XML CDATA block ]]>, which the script is wrapped in. For XML/XHTML, the solution is

$s = json_encode((string) $s);
$s = str_replace(']]>', ']]\x3E', $s);

Since JSON uses a subset of JavaScript syntax, the reverse function json_decode is fully usable only for JSON, limitedly for JavaScript.
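The following sketch, with a made-up sample string, shows both the escaping and the limitation of the reverse function:

```php
$s = 'x</script>y]]>z';

// json_encode escapes the slash, so the closing tag cannot appear in the output.
$js = json_encode((string) $s);             // "x<\/script>y]]>z"

// The extra step for XML/XHTML: break up the CDATA terminator.
$xhtml = str_replace(']]>', ']]\x3E', $js); // "x<\/script>y]]\x3Ez"

$decoded = json_decode($js);                // the original string again
$notJson = json_decode($xhtml);             // null, \x3E is JavaScript but not JSON
```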

CSS

In CSS contexts, the range of valid characters is precisely defined. For escaping identifiers, for example, you can use this function:

$s = addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");

For CSS within HTML code, the same applies as stated about JavaScript and its escaping within HTML attributes and tags (here it is about the style attributes and <style> tags).
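Wrapped in a helper, the call above might be used like this. The function name escapeCssIdentifier and the sample values are hypothetical:

```php
// Hypothetical wrapper around the addcslashes call shown above.
function escapeCssIdentifier(string $s): string
{
    return addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");
}

$class = escapeCssIdentifier('col.1:2'); // col\.1\:2
$plain = escapeCssIdentifier('menu-item_3'); // unchanged, no special characters

echo '.' . $class . " { color: red }\n";
```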

URL

In the context of a URL, everything except the letters of the English alphabet, digits, and characters - _ . is escaped by replacing them with % + the hexadecimally expressed byte.

$s = rawurlencode($s);

According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters in UTF-8 encoding is preferred.

The reverse function in this case is urldecode, which also recognizes the + character as meaning space.
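The difference between the function families is easiest to see side by side. The sample value is made up:

```php
$value = 'a b&c=d/e';

$raw  = rawurlencode($value); // a%20b%26c%3Dd%2Fe
$form = urlencode($value);    // a+b%26c%3Dd%2Fe (space becomes +)

$plusAsSpace = urldecode('a+b');    // "a b"
$plusKept    = rawurldecode('a+b'); // "a+b"

// Typical use: escaping a single value before placing it into a URL.
$url = 'http://example.com/search?q=' . rawurlencode($value);
```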

If you find the whole topic too complicated, don't despair. Soon you will realize that it is actually about simple transformations, and the whole trick is in realizing which context I am in and which function I need to choose for it. Or even better, try using an intelligent templating system that can recognize contexts itself and apply proper escaping: Latte.
