First, a question: match or no match?
$str = "123\n";
echo preg_match('~^\d+$~', $str);
If you think the function returns 0 (no match) because the regular
expression operates in single-line mode and does not allow any characters other
than digits in the string, you are mistaken.
I'll digress slightly. Regular expressions in Ruby have a flaw (an inconsistency
with the de facto Perl standard): the ^ and $ characters do not denote the start
and end of the string, but only the start and end of a line within it. Not
knowing this can cause security vulnerabilities, as noted in the Rails
documentation. PHP follows the standard, but few know what exactly that
standard behavior means. The documentation for the meta-character $ was
imprecise (now corrected).
Correctly, the $ character means the end of the string or the position just
before a terminating newline; in multiline mode (modifier m), it means the end
of a line. The actual end of the string is matched with the sequence \z.
Alternatively, you can use the dollar sign together with the modifier D.
$str = "123\n";
echo preg_match('~^[0-9]+$~', $str); // 1: matches despite the trailing newline
echo preg_match('~^[0-9]+$~D', $str); // 0: no match
echo preg_match('~^[0-9]+\z~', $str); // 0: no match
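If the goal is to validate input, the anchors can be wrapped in a small helper. The following is only a sketch: the function name isDigitsOnly is made up for illustration, and it uses \A and \z so that the result does not depend on any modifier.

```php
<?php
// Hypothetical helper: accepts only strings made up entirely of digits.
// \A and \z anchor to the real start and end of the subject.
function isDigitsOnly(string $s): bool
{
    return preg_match('~\A\d+\z~', $s) === 1;
}

var_dump(isDigitsOnly("123"));    // bool(true)
var_dump(isDigitsOnly("123\n"));  // bool(false): the trailing newline is rejected

// For comparison, a plain $ tolerates exactly one trailing newline:
var_dump(preg_match('~^\d+$~', "123\n"));   // int(1)
var_dump(preg_match('~^\d+$~', "123\n\n")); // int(0)
```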
Large frameworks aren't always and universally suitable for
everyone and everything!
I borrowed the title from the Manifesto of
Miniature PHP, which I would happily sign electronically, if I had a
digital signature. Although the argument about counting lines is unfair and
debatable, I understand what the author was trying to say. On Zdroják,
I wrote a comment that I eventually decided to immortalize here on
the blog:
I often make simple websites, which I write entirely in “notepad”, and
I want the code to have no more lines than is absolutely necessary. Uploading a
several-megabyte framework for a 20kB website, including styles, to a hosting
service is out of the question.
Yet, even in these simple websites, I want to use solutions that are
available in Nette, and I don't want to give up the comfort I'm used to. I am
a lazy programmer. For this reason, the Nette
Framework can be used as a micro-framework.
An example would be appropriate. Just yesterday, I redesigned https://davidgrudl.com and made the source
code available (check the top left corner), purely for inspiration to others on
how I handle such a microsite. The entire PHP code of the website is contained
in a single file, index.php, which is, I believe, understandable, although
perhaps less so for the uninitiated. The rest are templates. And the framework
is uploaded in the minified form of a single file, which, along with the fact
that it's about twice the size of jQuery, overcomes the psychological block of
“not wanting to upload a whole framework.”
Or take the example of a blog found directly in the distribution. Its source
code is also just index.php, with even fewer lines than the previous example.
Everything else is templates, see https://github.com/…ta/templates.
Perhaps I should explain why I actually use a framework on tiny websites.
Mainly, today I cannot imagine programming without Tracy, which then logs errors on the
production server (although they are unlikely with a static website). But
I primarily use the Latte templating
system because, starting from just 2 pages, I want to separate layout and
content, I like Latte’s concise syntax, and I rely on its automatic
escaping. I also use routing, because merely getting URLs without the .php
extension set up correctly is something only God himself can manage.
In the first mentioned website, caching is also used for Twitter
feeds, and on the blog, a database layer is utilized. And there's also a Nette
SEO trick, which automatically prevents the well-known problem where, after
moving forwards and backwards through pagination, you land on the same page
again, only with page=1 haunting the URL.
Nette also ensures that if there is an error, no PHP programming error
messages are displayed, but rather a user-understandable page. And also
autoloading – I've come
to take it for granted so much that I would have completely forgotten to
mention it.
Of course, I sometimes add a contact form and have it send emails. Now I realize that
I actually use 90% of the framework.
That's how I create quick'n'dirty websites and that's how I enjoy
it 😉
See also: How to write
micro-websites
How not to get burned when replacing occurrences of one string
with another. Search & Replace tricks.
The basic function for replacing strings in PHP is str_replace:
$s = "Lorem ipsum";
echo str_replace('ore', 'ide', $s); // returns "Lidem ipsum"
Thanks to cleverly designed UTF-8 encoding, it can be reliably used even for
strings encoded this way. Additionally, the first two arguments can be arrays,
and the function will then perform multiple replacements. Here we encounter the
first trick to be aware of. Each replacement goes through the string
again, so if we wanted to swap dá ⇔ pá in the phrase pánské dárky to get
dánské párky (a Swedish delicacy!), no order of arguments will achieve this:
// returns "dánské dárky"
echo str_replace(array('dá', 'pá'), array('pá', 'dá'), "pánské dárky");
// returns "pánské párky"
echo str_replace(array('pá', 'dá'), array('dá', 'pá'), "pánské dárky");
The sought-after function that goes through the string just once and prevents
collisions is strtr:
// returns "dánské párky", hooray
echo strtr("pánské dárky", array('pá' => 'dá', 'dá' => 'pá'));
If we need to find occurrences according to more complex rules, we use
regular expressions and the function preg_replace. It also allows for
multiple replacements and behaves similarly to str_replace. Now, however,
I am heading elsewhere. I need to replace all numbers in the string with the
word hafo, which is easy:
$s = "Radek says he has an IQ of 151. Quite the collector's item!";
echo preg_replace('#\d+#', 'hafo', $s);
Let's generalize the code so it can replace numbers with anything we pass in
the variable $replacement. Many programmers will use:
return preg_replace('#\d+#', $replacement, $s); // wrong!
Unfortunately, that's not right. It's important to realize that certain
characters have special meanings in the replacement string (specifically the
backslash and the dollar sign), so we must “escape” them; see the definitive
guide to escaping. The correct general solution is:
return preg_replace('#\d+#', addcslashes($replacement, '$\\'), $s); // ok
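To see the difference, here is a small sketch; the function name replaceNumbers and the sample strings are made up for illustration. A replacement such as $0 is read by preg_replace as a reference to the whole match unless it is escaped:

```php
<?php
// Hypothetical helper: replaces every run of digits with $replacement, taken literally.
function replaceNumbers(string $s, string $replacement): string
{
    // Escape backslash and dollar so preg_replace does not read them as references.
    return preg_replace('#\d+#', addcslashes($replacement, '$\\'), $s);
}

$s = 'Price: 100';

// Unescaped: "$0" refers to the whole match, so the text stays unchanged.
echo preg_replace('#\d+#', '$0', $s), "\n";   // Price: 100

// Escaped: the replacement is inserted exactly as written.
echo replaceNumbers($s, '$0'), "\n";          // Price: $0
```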
Do any other replacement tricks come to mind?
Dependency
Injection is a technique that solves certain problems but also introduces
new challenges. These challenges are then addressed by a DI (Dependency
Injection) container, which requires you to adopt a new perspective on
object-oriented design.
If the problems that DI solves do not bother you, then you might perceive its
implementation as an unnecessary hassle, particularly because it necessitates
learning a new approach to object-oriented design.
However, it seems that if you are not bothered by the issues DI addresses,
you have a serious problem. Which you will realize once you discover it.
Do you know the complaints developers have about their clients
not having a clear vision and constantly changing the project requirements?
That's them crying over their own inability. Whenever I hear this, I wish the
poor client had a better provider.
The client doesn't have a clear brief because they are not experts in web
design. I wonder how many web designers understand their client's business
well enough that they could create a precise brief if the roles were
reversed.
If the client continuously changes the requirements, it means they are
interested and engaged in the project, constantly thinking about it. There's a
higher chance that something truly useful will emerge. And most importantly:
they will keep asking for more and more work.
If the developer realizes this, they will understand that it is they who must
adapt their working style. Perhaps simplify the addition of a ZIP code column on
the website, even though it wasn't in the original brief.
Here are some well-intentioned tips on how to design the
structure of namespaces and class names.
Namespaces are probably the best-known new feature of PHP version 5.3. Their
main purpose is to prevent name conflicts and to allow shortening (aliasing) of
class names for use within a single file. In practice, it has been shown that
conflicts can also be avoided by using a 1–2 letter prefix, just as I have
never used class names like
Zend_Service_DeveloperGarden_Response_ConferenceCall_AddConferenceTemplateParticipantResponseType
(97 characters, I wonder how they adhere to their maximum line length rule of
80 characters 🙂 ). However, PHP follows in the footsteps of Java, and so we
have namespaces. How should we handle them?
Benefits of Namespaces
Perhaps the most complex question you need to answer is: what is the benefit
of renaming a class:
sfForm → Symfony\Component\Form\Form
This question is a proven starter for endless flame wars. From the
programmer's comfort, intuitiveness, and memorability perspective, the original
concise and descriptive sfForm is more appropriate. It corresponds
to how programmers colloquially refer to it, i.e., “form in Symfony”. The
new and longer name is correct from other aspects, which I am not sure the
average user will appreciate.
How to Lay Out Namespaces?
The syntactic aspect of using namespaces is described
in the documentation, but finding the right patterns requires practice,
which there hasn’t been enough time for yet. Namespaces in PHP have their
specifics due to a number of factors, so it is not ideal to copy conventions
used in Java or .NET exactly. However, they can be a good starting point.
More will be discussed in the individual naming rules.
1) A class should have a descriptive name even without mentioning the NS
The name of each class, even without the namespace, must capture its essence.
It would be inappropriate to rename the class ArrayIterator →
Spl\Iterators\Array, as one would not expect an iterator under the
name Array (ignoring the fact that a class cannot be named after a
keyword). And beware, even from the name Spl\Iterators\Array, it is
not clear that it is an iterator, because you cannot assume that the namespace
Spl\Iterators only contains iterators. Here are a few examples:
- unsuitable: Nette\Application\Responses\Download – it is not
obvious that Download is a response
- unsuitable: Zend\Validator\Date – you would expect Date to be a date,
not a validator
- unsuitable: Zend\Controller\Request\Http – you would not expect Http
to be a request
Therefore, in addition to the specialization, it is appropriate to keep a
level of generality in the class name:
- better: Nette\Application\Responses\DownloadResponse
- better: Zend\Validator\DateValidator
- better: Zend\Controller\Request\HttpRequest
The ideal is a one-word yet descriptive name. This works particularly well
for classes that represent something from the real world:
- best: Nette\Forms\Controls\Button – the two-word ButtonControl is
not necessary (however, HiddenControl cannot be shortened to Hidden)
2) The namespace should have a descriptive name
Naturally, the name of the namespace itself must be descriptive, and it is
advantageous to have a shorter name without redundancies. To me, Component
in Symfony\Component\Routing seems like such a redundancy, because the name
would not suffer without it.
In some situations, you need to decide between singular and plural (e.g.,
Zend\Validator vs Zend\Validators), which is as undecided an issue as
choosing between singular and plural names for database tables.
3) Distinguish between namespaces and classes
Naming a class the same as a namespace (i.e., having classes
Nette\Application and Nette\Application\Request) is
technically possible, but it might confuse programmers and it is better to avoid
it. Also, consider how well the resulting code will read or how you would
explain the API to someone.
4) Limit unnecessary duplications (+ partial namespace)
Ideally, the name of the class and the name of the namespace should not contain
the same information redundantly.
- instead of Nette\Http\HttpRequest prefer Nette\Http\Request
- instead of
Symfony\Component\Security\Authentication\AuthenticationTrustResolver
prefer the class TrustResolver
The class Nette\Http\Request does not violate rule No. 1 about
the descriptive name of the class even without mentioning the namespace; on the
contrary, it allows us to elegantly use the partial namespace:
use Nette\Http; // alias for namespace
// all classes via Http are available:
$request = new Http\Request;
$response = new Http\Response;
// and additionally, Http\Response is more understandable than just Response
If we understand namespaces as packages, which is common, it leads to
unfortunate duplication of the last word:
Zend\Form\Form
Symfony\Component\Finder\Finder
Nette\Application\Application
Namespaces also literally encourage grouping classes (e.g., various
implementations of the same interface, etc.) into their own namespaces, which
again creates duplications:
Nette\Caching\Storages\FileStorage – i.e., all storages in a separate namespace Storages
Zend\Form\Exception\BadMethodCallException – all exceptions in Exception
Symfony\Component\Validator\Exception\BadMethodCallException – again all exceptions in Exception
Grouping namespaces lengthen the name and create duplication in it because it
is often impossible to remove the generality from the class name (rule 1). Their
advantage may be better orientation in the generated API documentation (although
this could be achieved differently) and easier access when using full-fledged
IDEs with code completion. However, I recommend using them cautiously. For
example, they are not very suitable for exceptions.
5) Unmistakable classes from multiple namespaces
According to point 1), a class should have a descriptive name, but that does
not mean it has to be unique within the entire application. Usually, it is
enough that it is unique within the namespace. However, if two classes from
different namespaces are often used next to each other in the code, or if they
have some other significant connection, they should not have the same name. In
other words, it should not be necessary to use an "as" alias in the "use" clause.
6) One-way dependencies
Consider what dependencies should exist between classes from different
namespaces. I try to maintain:
- if a class from the namespace A\B has a dependency on a class from the
namespace A\C, no class from A\C should have a dependency on A\B
- classes from the namespace A\B should not have dependencies on a class from
the space A\B\C (take this with a grain of salt)
p.s.: Please do not take this article as dogma, it is just a capture of
current thoughts
A question that many webmasters ask: do search engines perceive these URLs
as the same? How should they be treated?
http://example.com/article
http://example.com/article/
http://example.com/Article
https://example.com/article
http://www.example.com/article
http://example.com/article?a=1&b=2
http://example.com/article?b=2&a=1
The short answer would be: “URLs are different.” However, a more detailed
analysis is needed.
From a user's perspective, these addresses differ only in minor details
which they generally disregard. Thus, they perceive them as the same, although
technically, they are different addresses. Let's call them similar
addresses. For the sake of “user experience”, two principles should be
adhered to:
- Do not allow different content on similar addresses. As I will show
soon, this would not only confuse users but also search engines.
- Allow users access through similar addresses.
If the addresses differ in protocol (http / https) or in having the www
domain or not, search engines consider them different.
Not so for users. It would be a fatal mistake to place different content on such
similar addresses. However, it would also be a mistake to prevent access through
a similar address. The address with www and without www must both function,
with SEO recommending sticking to one variant and redirecting the others to it
using a 301 HTTP code. This can be managed for the www subdomain with a
.htaccess file:
# redirection to the non-www variant
RewriteCond %{HTTP_HOST} ^www\.
RewriteRule ^.*$ http://example.com/$0 [R=301,NE,L]
# redirection to the www variant
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^.*$ http://www.example.com/$0 [R=301,NE,L]
Immediately test whether your servers redirect, including the full address
and correct parameter passing.
Don't forget variants like www.subdomain.example.cz. Because some
browsers can bypass missing redirections, try a low-level service like Web-Sniffer.
URLs are case-sensitive except for the scheme and domain. However, users do
not differentiate and therefore, it is unfortunate to offer different content on
addresses differing only by letter case. A poor example can be seen in
Wikipedia:
Bing amusingly suffers from an error, returning the same URL whether you
search for acid or a database (although the textual description is correct).
Google and Yahoo do not have this issue.
Some services (webmail clients, ICQ) convert uppercase letters in URLs to
lowercase. These are all reasons to avoid distinguishing letter case, even in
parameters. It is better to adhere to the convention that all letters in URLs
are lowercase.
Distinguishing some similar addresses is also a challenge for search
engines. I conducted an experiment by placing different content on URLs
differing in details like the presence of a trailing slash or parameter order.
Only Google was able to index them as different. Other search engines could
always handle only one of the variants.
As for trailing slashes, the web server usually redirects to the canonical
form for you; if you access a directory without a trailing slash, it adds one
and redirects. Of course, this does not apply when you manage URIs on your own
(Cool URIs, etc.)
Finally: does the order of parameters really matter? There should be no
difference between article?a=1&b=2 and article?b=2&a=1. However, there are
situations where this is not the case, especially when passing complex
structures such as arrays. For instance, ?sort[]=name&sort[]=city
might be different from ?sort[]=city&sort[]=name. Nevertheless, redirecting
whenever parameters are not in a specified order would be unnecessary
overcorrection.
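How PHP itself sees the two cases can be checked with parse_str. This is just an illustrative sketch with made-up variable names; other server-side languages may parse query strings differently.

```php
<?php
// Scalar parameters: the order in the query string does not change the values.
parse_str('a=1&b=2', $first);
parse_str('b=2&a=1', $second);
var_dump($first == $second);   // bool(true): the same key/value pairs

// Array parameters: the order determines the position of each element.
parse_str('sort[]=name&sort[]=city', $byName);
parse_str('sort[]=city&sort[]=name', $byCity);
var_dump($byName == $byCity);  // bool(false)
var_dump($byName['sort'][0]);  // string(4) "name"
var_dump($byCity['sort'][0]);  // string(4) "city"
```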
p.s. Nette Framework automatically
handles redirection to canonical URLs on its own.
In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one
may not always be available, and the third is deprecated, so you should
exclusively use the more adept and faster PCRE library. Unfortunately, its
implementation suffers from quite unpleasant flaws across all PHP versions.
The operation of the preg_* functions can be divided into two steps:
- compilation of the regular expression
- execution (searching, replacing, filtering, …)
It is advantageous that PHP maintains a cached version of compiled regular
expressions, meaning they are only compiled once. Therefore, it is appropriate
to use static regular expressions, i.e., not to generate them
parametrically.
Now for the unpleasant issues. If an error is discovered during compilation,
PHP will issue an E_WARNING error, but the return value of the
function is inconsistent:
- preg_filter, preg_replace_callback, preg_replace return null
- preg_grep, preg_match_all, preg_match, preg_split return false
It is good to know that functions returning an array $matches by
reference (i.e., preg_match_all and preg_match) do not reset that argument
upon a compilation error, so testing the return value is justified.
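The inconsistency is easy to observe. A small sketch, assuming a deliberately broken pattern of my own choosing and using @ only to keep the warnings out of the output:

```php
<?php
$invalid = '~(~';  // unbalanced parenthesis: fails to compile

// @ suppresses the E_WARNING so that only the return values are shown.
var_dump(@preg_match($invalid, 'abc'));        // bool(false)
var_dump(@preg_split($invalid, 'abc'));        // bool(false)
var_dump(@preg_replace($invalid, '', 'abc'));  // NULL

// The $matches argument keeps whatever it contained before the failed call.
$matches = ['stale'];
$result = @preg_match($invalid, 'abc', $matches);
var_dump($result === false && $matches === ['stale']); // bool(true)
```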
Since version 5.2.0, PHP has had the function preg_last_error, which returns
the code of the last error. However, beware, this only applies to errors that
occur during execution! If an error occurs during compilation, the value of
preg_last_error is not reset and returns the previous value. If the
return value of a preg_* function is not null or false (see above),
definitely do not rely on what preg_last_error returns.
What kind of errors can occur during execution? The most common case is
exceeding pcre.backtrack_limit or invalid UTF-8 input when using
the u modifier. (Note: invalid UTF-8 in the regular expression
itself is detected during compilation.) However, the way PHP handles such an
error is utterly inadequate:
- it generates no message (silent error)
- the return value of the function may indicate that everything is fine
- the error can only be detected by calling preg_last_error later
Let's talk about the return value, which is probably the biggest betrayal.
The process is executed until an error occurs, then it returns a partially
processed result. And this is done completely silently. However, even this is
not always the case, for example, the trio of functions
preg_filter, preg_replace_callback, preg_replace can return null
even during execution errors.
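A silent execution error can be reproduced with invalid UTF-8 input. The sample string below is my own choice; the sketch only shows what current PHP versions report for this particular kind of error.

```php
<?php
$invalidUtf8 = "caf\xE9";  // ISO-8859-1 bytes, not valid UTF-8

// No warning is raised; only the return value and preg_last_error() reveal the failure.
$result = preg_match('~\w+~u', $invalidUtf8);
$matchError = preg_last_error();
var_dump($result);                              // bool(false)
var_dump($matchError === PREG_BAD_UTF8_ERROR);  // bool(true)

$replaced = preg_replace('~\w+~u', 'x', $invalidUtf8);
$replaceError = preg_last_error();
var_dump($replaced);                             // NULL
var_dump($replaceError === PREG_BAD_UTF8_ERROR); // bool(true)
```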
Whether an error occurred during execution can only be determined by calling
preg_last_error. But as you know, this function returns a
nonsensical result if, on the contrary, a compilation error occurred, so we must
distinguish both situations by considering the return value of the function,
whether it is null or false. And since functions that return null during
a compilation error can also return null during an execution error, it can
be stated only that PHP is undoubtedly a messed-up language.
What would safe use of PCRE functions look like? For example, like this:
function safeReplaceCallback($pattern, $callback, $subject)
{
    // we must verify the callback ourselves
    if (!is_callable($callback)) {
        throw new Exception('Invalid callback.');
    }
    // test the expression on an empty string
    if (preg_match($pattern, '') === false) { // compilation error?
        $error = error_get_last();
        throw new Exception($error['message']);
    }
    // call PCRE
    $result = preg_replace_callback($pattern, $callback, $subject);
    // execution error?
    if ($result === null && preg_last_error()) {
        throw new Exception('Error processing regular expression.', preg_last_error());
    }
    return $result;
}
The provided code transforms errors into exceptions but does not attempt to
suppress warning outputs.
Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.
Every now and then, a security vulnerability is reported on
another significant website (Alza, Mapy.cz, BontonLand) or is exploited.
Try searching for XSS
vulnerability to understand why Cross Site Scripting (XSS) is currently one
of the most widespread and dangerous vulnerabilities.
This is a distressing issue for website operators and perhaps even more so
for suppliers. It can damage reputations, lead to fines, lawsuits, or
simply spoil relationships with clients. How to defend against XSS? By so-called
string escaping. Unfortunately,
most experts are not well-versed in this area. (I don’t mean to be tactless
or offend anyone, but of the “Czechoslovak IT celebrities,” I only know one
person who deeply understands this issue.) Thus, even articles on this topic on
well-known websites are, let’s say, inaccurate.
Moreover, this escaping is usually done in the template, falling on the
coder’s shoulders. Thus, the most critical area requiring high expertise is
handled by someone unqualified. How can this end? We know all too well – see
the first paragraph.
Nette Framework Will Save You
I would like to introduce you to a killer feature of the Latte templating
system in the Nette Framework. It's such a
fundamental feature that it alone is a reason to choose this framework. Or at
least to use its templates.
- the bigger your company, the more crucial this feature is
- no competing framework has it to date 1)
The Nette Framework automatically escapes in templates. Its
Context-aware escaping feature recognizes which part of the document
you are in and chooses the appropriate escaping method accordingly.
Let's dive into more technical details. You can see how it works best with
an example. Consider a variable $var and this template:
<p onclick="alert({$var})">{$var}</p>
<script>
document.title = {$var};
</script>
The notation {$var} means printing the variable. However, each
print must be explicitly secured, even differently at each location. A coder
must (for example, in Smarty) add the appropriate modifiers, must not make a
mistake, and especially not omit anything.
In the Nette Framework, nothing needs to be manually secured. Everything
is done automatically, correctly, and consistently!
If we assign $var = 'Width 1/2"' to the variable, the framework
generates the HTML code:
<p onclick="alert(&quot;Width 1\/2\&quot;&quot;)">Width 1/2&quot;</p>
<script>
document.title = "Width 1\/2\"";
</script>
Of course, situations where you need to print a variable without escaping it
are also considered, for example, because it contains article text including
HTML tags. In such cases, you use the notation {$var|noescape}.
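For comparison, here is roughly what securing the same three prints by hand would look like in plain PHP. It is only a sketch of the manual approach with built-in functions and made-up variable names, not a description of how Latte works internally:

```php
<?php
$var = 'Width 1/2"';

// JavaScript context inside <script>: encode as a JS string literal.
$js = json_encode($var);
// HTML text context: escape HTML special characters.
$html = htmlspecialchars($var, ENT_QUOTES, 'UTF-8');
// JavaScript inside an HTML attribute: JS encoding first, then HTML escaping.
$attr = htmlspecialchars(json_encode($var), ENT_QUOTES, 'UTF-8');

echo '<p onclick="alert(', $attr, ')">', $html, "</p>\n";
echo "<script>\ndocument.title = ", $js, ";\n</script>\n";
```

Three different calls for one variable, and forgetting or mixing up any of them opens a hole.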
End of the technical digression. Thanks to Latte, it suddenly means that
- the template remains simple
- you don’t have to worry that a coder will overlook something
- and at the same time, you don’t need to have a top expert on
escaping ;)
- the work is much easier
You can find more information about Latte’s smart templates in the documentation.
1) About half a year after Nette, Google introduced a similar
feature for its library in C++, and as far as I know, no framework in PHP,
Ruby, or Python has anything similar yet.
One of the evergreen topics in programming is the confusion and
misunderstandings around escaping. Ignorance causes the simplest methods of
compromising websites, such as Cross Site Scripting (XSS) or SQL injection, to
remain unfortunately widespread.
Escaping is the substitution of characters that have a special meaning in
a given context with other corresponding sequences.
Example: To write quotes within a string enclosed by quotes, you need to
replace them because quotes have a special meaning in the context of a string,
and writing them plainly would be interpreted as ending the string. The specific
substitution rules are determined by the context.
Prerequisites
Each escaping function assumes that the input is always a “raw
string” (unmodified) in a certain encoding (character set).
Storing strings already escaped for HTML output in the database and similar
is entirely counterproductive.
What contexts do we encounter?
As mentioned, escaping converts characters that have a special meaning in a
certain context. Different escaping functions are used for each context. This
table is only indicative, and it is necessary to read the
notes below.
Explanation of the following notes:
- many contexts have their subcontexts where escaping differs. Unless
otherwise stated, the specified escaping function is applicable universally
without further differentiation of subcontexts.
- the term usual character set refers to a character set with 1-byte or UTF-8 encoding.
HTML
In HTML contexts, the characters < & " ' collectively
have a special meaning, and the corresponding sequences are
&lt; &amp; &quot; &#039;. However, the exception is
an HTML comment, where only the pair -- has special meaning.
For escaping, use:
$s = htmlspecialchars($s, ENT_QUOTES);
It works with any usual character set. However, it does not consider the
subcontext of HTML comments (i.e., it cannot replace the pair -- with
something else).
Reverse function:
$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
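A quick illustration with a made-up sample string, passing the character set explicitly:

```php
<?php
$raw = 'Tom & Jerry\'s <b>"best"</b>';

$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";
// Tom &amp; Jerry&#039;s &lt;b&gt;&quot;best&quot;&lt;/b&gt;

// The reverse function restores the raw string.
var_dump(html_entity_decode($escaped, ENT_QUOTES, 'UTF-8') === $raw); // bool(true)
```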
XML / XHTML
XML 1.0 differs from HTML in that it prohibits the use of C0 control
characters (including when written as entities) except for the
tab, line feed, and carriage return. XML 1.1 allows these banned characters,
except NUL, in the form of entities, and further mandates that C1 control
characters, except NEL, also be written as entities.
Additionally, in XML, the sequence ]]> has a special meaning, so
one of these characters must also be escaped.
For XML 1.0 and any usual character set, use:
$s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
$s = htmlspecialchars($s, ENT_QUOTES);
Regular Expression
In Perl regular expressions, characters
. \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
and the so-called delimiter, which is a character delimiting the regular
expression (e.g., for the expression '#[a-z]+#i' it is #), collectively have
special meaning. They are escaped with the character \.
$s = preg_quote($s, $delimiter);
In the string replacing the searched expression (e.g., the 2nd parameter of
the preg_replace function), the backslash and dollar sign have
special meaning:
$s = addcslashes($replacement, '$\\');
The encoding must be either 1-byte or UTF-8, depending on the modifier in the
regular expression.
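As an illustration of preg_quote with a delimiter, consider searching for a user-supplied text literally. The helper name countOccurrences and the sample inputs are made up for this sketch:

```php
<?php
// Hypothetical helper: counts literal occurrences of $needle, ignoring case.
function countOccurrences(string $needle, string $haystack): int
{
    $delimiter = '#';
    // preg_quote() escapes the regex metacharacters and the delimiter itself.
    $pattern = $delimiter . preg_quote($needle, $delimiter) . $delimiter . 'i';
    return preg_match_all($pattern, $haystack);
}

echo countOccurrences('c++', 'C++ and c++ but not c'), "\n"; // 2
echo countOccurrences('#1', 'We are #1, not #2'), "\n";      // 1
```

Without preg_quote, the needle c++ would not even compile as a pattern.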
PHP Strings
PHP distinguishes these types of strings:
- in single quotes, where the characters \ ' can have special meaning
- in double quotes, where the characters \ " $ can have special meaning
- NOWDOC, where no character has special meaning
- HEREDOC, where the characters \ $ can have special meaning
Escaping is done with the character \. This is usually done by the
programmer when writing code; for PHP code generators, you can use the
var_export function.
Note: because the mentioned regular expressions are usually written within
PHP strings, both types of escaping need to be combined. E.g., the character
\ is written as \\ in a regular expression, and in a quoted string that
needs to be written as \\\\.
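Both cases in a short sketch, with a made-up sample value:

```php
<?php
$value = "It's a \\ backslash";   // the raw string is: It's a \ backslash

// var_export() returns the value as a valid PHP literal (a single-quoted string).
$literal = var_export($value, true);
echo $literal, "\n";              // 'It\'s a \\ backslash'

// Regular expression escaping combined with PHP string escaping:
// the pattern \\ (one literal backslash) is written as '\\\\' in PHP source.
var_dump(preg_match('~\\\\~', $value)); // int(1)
```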
SQL and Databases
Each database has its own escaping function, see the table above. Almost
always, however, only a function for escaping strings is available, and it
cannot be used for anything else. In particular, there are no functions for
escaping wildcard characters used in LIKE constructions (in MySQL these are
% _) or identifiers, such as table or column names. Databases
do not require removing escaping on output! (Except, for example, for the
bytea type.)
For character sets with unusual multi-byte encoding, it is necessary to call
the function mysql_set_charset or mysqli_set_charset in MySQL.
I recommend using a database layer (e.g., dibi, Nette Database, PDO) or
parameterized queries, which take
care of escaping for you.
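A minimal sketch of parameterized queries with PDO, assuming the pdo_sqlite extension is available; the table, the data, and the choice of backslash as the LIKE escape character are all made up for illustration:

```php
<?php
// In-memory SQLite database, used only so that the example is self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

// Values are passed separately from the SQL text, so quotes need no manual escaping.
$insert = $pdo->prepare('INSERT INTO users (name) VALUES (?)');
$insert->execute(["O'Reilly"]);
$insert->execute(['100% legit']);
$insert->execute(['100 percent']);

// LIKE wildcards are not covered by string escaping and must be handled explicitly.
$search = '100%';
$like = addcslashes($search, '%_\\') . '%';  // escape wildcards, then append a real one
$select = $pdo->prepare("SELECT name FROM users WHERE name LIKE ? ESCAPE '\\'");
$select->execute([$like]);
$names = $select->fetchAll(PDO::FETCH_COLUMN);
print_r($names);  // only "100% legit" is found
```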
JavaScript, JSON
As a programming language, JavaScript has a number of very different
subcontexts. For escaping strings, you can use the side
effect of the function
$s = json_encode((string) $s);
which also encloses the string in quotes. Strictly requires UTF-8.
JavaScript written inside HTML attributes (e.g., onclick) must
still be escaped according to HTML rules, but this does
not apply to JavaScript inside <script> tags, where only the
potential occurrence of the ending tag </script> inside the
string needs to be treated. However, json_encode ensures this, as JSON escapes
the slash /. However, it does not handle the end of an HTML comment -->
(which does not matter in HTML) or the end of an XML CDATA block ]]>, which
the script is wrapped in. For XML/XHTML, the solution is
$s = json_encode((string) $s);
$s = str_replace(']]>', ']]\x3E', $s);
Since JSON uses a subset of JavaScript syntax, the reverse function json_decode is fully usable only for
JSON, limitedly for JavaScript.
CSS
In CSS contexts, the range of valid characters is precisely
defined, for escaping identifiers, for example, you can use this
function:
$s = addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");
For CSS within HTML code, the same applies as stated about JavaScript and its
escaping within HTML attributes and tags (here it is about the
style attributes and <style> tags).
URL
In the context of a URL, everything except the letters of the English
alphabet, digits, and characters - _ . is escaped by replacing them
with % + the hexadecimally expressed byte.
$s = rawurlencode($s);
According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters
in UTF-8 encoding is preferred.
The reverse function in this case is urldecode, which also recognizes the
+ character as meaning space.
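A short sketch of both directions; the sample title and the example.com address are made up, and the source file is assumed to be saved in UTF-8:

```php
<?php
$title = 'žluťoučký kůň & co.';

$encoded = rawurlencode($title);
echo 'https://example.com/search?q=', $encoded, "\n";

// rawurlencode() writes a space as %20, urlencode() writes it as +
var_dump(rawurlencode('a b'));  // string(5) "a%20b"
var_dump(urlencode('a b'));     // string(3) "a+b"

// Decoding: urldecode() turns + into a space, rawurldecode() leaves it alone.
var_dump(urldecode($encoded) === $title); // bool(true)
var_dump(urldecode('a+b'));               // string(3) "a b"
var_dump(rawurldecode('a+b'));            // string(3) "a+b"
```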
If you find the whole topic too complicated, don't despair. Soon you will
realize that it is actually about simple transformations, and the whole trick is
in realizing which context I am in and which function I need to choose for it.
Or even better, try using an intelligent templating system that can recognize
contexts itself and apply proper escaping:
Latte.