In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one
may not always be available, and the third is deprecated, so you should
exclusively use the more adept and faster PCRE library. Unfortunately, its
implementation suffers from quite unpleasant flaws across all PHP versions.
The operation of the preg_* functions can be divided into two steps:
- compilation of the regular expression
- execution (searching, replacing, filtering, …)
It is advantageous that PHP maintains a cached version of compiled regular
expressions, meaning they are only compiled once. Therefore, it is appropriate
to use static regular expressions, i.e., not to generate them
parametrically.
Now for the unpleasant issues. If an error is discovered during compilation,
PHP will issue an E_WARNING error, but the return value of the function is
inconsistent:
- preg_filter, preg_replace_callback, preg_replace return null
- preg_grep, preg_match_all, preg_match, preg_split return false
It is good to know that functions returning an array $matches by reference
(i.e., preg_match_all and preg_match) do not nullify the argument upon a
compilation error, which is what makes testing the return value necessary.
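The inconsistency is easy to see for yourself. The following is only a minimal sketch: the broken pattern is my own example, and the @ operator is used merely to keep the E_WARNING out of the output.

```php
<?php
// Invalid pattern: the group is never closed, so compilation fails.
$broken = '/(unclosed/';

// The @ operator only hides the E_WARNING raised by the compilation error.
$replace = @preg_replace($broken, '', 'subject'); // null
$match   = @preg_match($broken, 'subject');       // false
$split   = @preg_split($broken, 'subject');       // false
$grep    = @preg_grep($broken, ['subject']);      // false

var_dump($replace, $match, $split, $grep);
```

Note the strict comparison you need in practice: preg_match can legitimately return 0, so only === false signals an error.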
Since version 5.2.0, PHP has the function preg_last_error returning the code of
the last error. However, beware, this only applies to errors that occur during
execution! If an error occurs during compilation, the value of preg_last_error
is not reset and returns the previous value. If a preg_* function returns null
or false because of a compilation error (see above), definitely do not rely on
what preg_last_error returns.
What kind of errors can occur during execution? The most common case is
exceeding pcre.backtrack_limit or invalid UTF-8 input when using the u
modifier. (Note: invalid UTF-8 in the regular expression itself is detected
during compilation.) However, the way PHP handles such an error is utterly
inadequate:
- it generates no message (silent error)
- the return value of the function may indicate that everything is fine
- the error can only be detected by calling preg_last_error later
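A small sketch of such a silent execution error, assuming the subject is a single byte that can never appear in valid UTF-8 (the input is my own example):

```php
<?php
// "\xFF" is not valid UTF-8, so matching with the u modifier fails
// during execution, not during compilation.
$result = preg_match('/./u', "\xFF");

var_dump($result);                                   // false, yet no warning is raised
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // true
```

Nothing is printed to the error log here; without the explicit call to preg_last_error the failure would go unnoticed.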
Let's talk about the return value, which is probably the biggest betrayal.
The process is executed until an error occurs, then it returns a partially
processed result. And this is done completely silently. However, even this is
not always the case, for example, the trio of functions preg_filter,
preg_replace_callback, preg_replace can return null even during execution
errors.
Whether an error occurred during execution can only be determined by calling
preg_last_error. But as you know, this function returns a nonsensical result
if, on the contrary, a compilation error occurred, so we must distinguish both
situations by considering the return value of the function, whether it is null
or false. And since functions that return null during a compilation error can
also return null during an execution error, it can be stated only that PHP is
undoubtedly a messed-up language.
What would safe use of PCRE functions look like? For example, like this:
function safeReplaceCallback($pattern, $callback, $subject)
{
    // we must verify the callback ourselves
    if (!is_callable($callback)) {
        throw new Exception('Invalid callback.');
    }

    // test the expression on an empty string
    if (preg_match($pattern, '') === false) { // compilation error?
        $error = error_get_last();
        throw new Exception($error['message']);
    }

    // call PCRE
    $result = preg_replace_callback($pattern, $callback, $subject);

    // execution error?
    if ($result === null && preg_last_error()) {
        throw new Exception('Error processing regular expression.', preg_last_error());
    }

    return $result;
}
The provided code transforms errors into exceptions but does not attempt to
suppress warning outputs.
Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.
One of the evergreen topics in programming is the confusion and
misunderstandings around escaping. Ignorance causes the simplest methods of
compromising websites, such as Cross Site Scripting (XSS) or SQL injection, to
remain unfortunately widespread.
Escaping is the substitution of characters that have a special meaning in
a given context with other corresponding sequences.
Example: To write quotes within a string enclosed by quotes, you need to
replace them because quotes have a special meaning in the context of a string,
and writing them plainly would be interpreted as ending the string. The specific
substitution rules are determined by the context.
Prerequisites
Each escaping function assumes that the input is always a “raw
string” (unmodified) in a certain encoding (character set).
Storing strings already escaped for HTML output in the database and similar
is entirely counterproductive.
What contexts do we encounter?
As mentioned, escaping converts characters that have a special meaning in a
certain context. Different escaping functions are used for each context. This
table is only indicative, and it is necessary to read the
notes below.
Explanation of the following notes:
- many contexts have their subcontexts where escaping differs. Unless
otherwise stated, the specified escaping function is applicable universally
without further differentiation of subcontexts.
- the term usual character set refers to a character set with 1-byte or UTF-8 encoding.
HTML
In HTML contexts, the characters < & " ' collectively have a special meaning,
and the corresponding sequences are &lt; &amp; &quot; &#039;. However, the
exception is an HTML comment, where only the pair -- has special meaning.
For escaping, use:
$s = htmlspecialchars($s, ENT_QUOTES);
It works with any usual character set. However, it does not consider the
subcontext of HTML comments (i.e., it cannot replace the pair -- with
something else).
Reverse function:
$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
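For illustration, here is a short round trip through both functions. The sample string is made up, and I pass the charset explicitly, which is an assumption about your data being UTF-8.

```php
<?php
// A raw string containing every character that is special in HTML.
$raw = '<a href="x">Tom\'s & Jerry</a>';

// Escape for output into HTML text or a quoted attribute.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";

// The reverse function restores the raw string.
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
var_dump($decoded === $raw); // true
```

Note that htmlspecialchars also converts >, even though it is not strictly necessary in most HTML contexts.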
XML / XHTML
XML 1.0 differs from HTML in that it prohibits the use of C0 control
characters (including writing in the form of an entity) except for the tab,
line feed, and carriage return. XML 1.1 allows these banned characters, except
NUL, in the form of entities, and further mandates C1 control characters,
except NEL, also to be written as entities. Additionally, in XML, the sequence
]]> has a special meaning, so one of these characters must also be escaped.
For XML 1.0 and any usual character set, use:
$s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);
$s = htmlspecialchars($s, ENT_QUOTES);
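Wrapped into a helper, the two steps might look like the sketch below. The function name escapeXml is hypothetical, and the explicit 'UTF-8' argument is my assumption about the input encoding.

```php
<?php
// Hypothetical helper: escape a raw string for XML 1.0 content.
function escapeXml(string $s): string
{
    // Drop C0 control characters that XML 1.0 forbids
    // (tab \x09, line feed \x0A and carriage return \x0D are kept).
    $s = preg_replace('#[\x00-\x08\x0B\x0C\x0E-\x1F]+#', '', $s);

    // Escape < > & " ' which also neutralizes the sequence ]]>.
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

echo escapeXml("a\x01<b>]]>\tc"), "\n";
```

The forbidden \x01 disappears, the tab survives, and ]]> is defused because > becomes an entity.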
Regular Expression
In Perl regular expressions, characters
. \ + * ? [ ^ ] $ ( ) { } = ! < > | : -
and the so-called delimiter, which is a character delimiting the regular
expression (e.g., for the expression '#[a-z]+#i' it is #), collectively have
special meaning. They are escaped with the character \.
$s = preg_quote($s, $delimiter);
In the string replacing the searched expression (e.g., the 2nd parameter of
the preg_replace function), the backslash and dollar sign have special meaning:
$s = addcslashes($replacement, '$\\');
The encoding must be either 1-byte or UTF-8, depending on the modifier in the
regular expression.
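Both kinds of escaping can be tried out on a small example. This is only a sketch; the needle and the replacement text are invented for the demonstration.

```php
<?php
// 1) Escaping a literal string for use inside a pattern.
$needle = 'a.b#c';
$pattern = '#' . preg_quote($needle, '#') . '#';

var_dump(preg_match($pattern, 'xa.b#cx')); // int(1): the literal text is found
var_dump(preg_match($pattern, 'xaXb#cx')); // int(0): the dot no longer means "any character"

// 2) Escaping a literal string for use as a replacement.
$literal = '$1 and \\0'; // contains what would otherwise be two backreferences
$replacement = addcslashes($literal, '$\\');

$output = preg_replace('#price#', $replacement, 'price');
echo $output, "\n"; // the literal text, with no backreference expanded
```

Without the addcslashes call, $1 and \0 in the replacement would be interpreted as references to captured groups.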
PHP Strings
PHP distinguishes these types of strings:
- in single quotes, where the characters \ ' can have special meaning
- in double quotes, where the characters \ " $ can have special meaning
- NOWDOC, where no character has special meaning
- HEREDOC, where the characters \ $ can have special meaning
Escaping is done with the character \. This is usually done by the programmer
when writing code; for PHP code generators, you can use the var_export function.
Note: because the mentioned regular expressions are usually written within
PHP strings, both types of escaping need to be combined. E.g., the character
\ for a regular expression is written as \\ and in a quoted string it needs to
be written as \\\\.
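A brief sketch of both points, using a made-up value: var_export produces a single-quoted PHP literal, and a pattern matching one backslash needs four of them in PHP source.

```php
<?php
// A raw value containing an apostrophe and a single backslash.
$value = "It's a \\ backslash";

// var_export() returns a PHP string literal with both characters escaped.
$code = var_export($value, true);
echo $code, "\n";

// One backslash in the subject, two in the pattern, four in the PHP source.
$found = preg_match('#\\\\#', 'a\\b');
var_dump($found); // int(1)
```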
SQL and Databases
Each database has its own escaping function, see the table above. Almost
always, however, only a function for escaping strings is available, and it
cannot be used for anything else. In particular, there are no functions for
escaping wildcard characters used in LIKE constructions (in MySQL these are
% _) or identifiers, such as table or column names. Databases do not require
removing escaping on output! (Except, for example, for the bytea type.)
For character sets with unusual multi-byte encoding, it is necessary to call
the function mysql_set_charset or mysqli_set_charset in MySQL.
I recommend using a database layer (e.g., dibi, Nette Database, PDO) or
parameterized queries, which take care of escaping for you.
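Since the wildcard characters are not covered by parameterized queries either, they have to be handled separately. A possible sketch follows; the helper names, the table articles and the column title are all hypothetical, and it assumes the backslash is the LIKE escape character, as it is by default in MySQL.

```php
<?php
// Hypothetical helper: escape LIKE wildcards % and _ (and the backslash itself).
function escapeLike(string $s): string
{
    return addcslashes($s, '%_\\');
}

// Hypothetical usage with PDO: the string itself is passed as a bound
// parameter, so only the wildcards need manual treatment.
function findByPrefix(PDO $pdo, string $prefix): array
{
    $stmt = $pdo->prepare('SELECT * FROM articles WHERE title LIKE ?');
    $stmt->execute([escapeLike($prefix) . '%']);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

echo escapeLike('100%_sure'), "\n";
```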
JavaScript, JSON
As a programming language, JavaScript has a number of very different
subcontexts. For escaping strings, you can use the side effect of the function
$s = json_encode((string) $s);
which also encloses the string in quotes. It strictly requires UTF-8.
JavaScript written inside HTML attributes (e.g., onclick) must still be
escaped according to HTML rules, but this does not apply to JavaScript inside
<script> tags, where only the potential occurrence of the ending tag </script>
inside the string needs to be treated. json_encode takes care of this, as it
escapes the slash /. It does not, however, handle the end of an HTML comment
--> (which does not matter in HTML) or the end of an XML CDATA block ]]>, which
the script is wrapped in. For XML/XHTML, the solution is
$s = json_encode((string) $s);
$s = str_replace(']]>', ']]\x3E', $s);
Since JSON uses a subset of JavaScript syntax, the reverse function json_decode is fully usable only for
JSON, and only to a limited extent for JavaScript.
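To see what json_encode actually does with the dangerous sequences, here is a small sketch with an invented input string:

```php
<?php
// Raw text containing a quote, a closing script tag and a CDATA terminator.
$raw = 'He said "</script>" and ]]>';

// Safe inside <script> in HTML: the slash is escaped as \/.
$js = json_encode($raw);
echo $js, "\n";

// Additionally safe inside a CDATA block in XHTML.
$xhtml = str_replace(']]>', ']]\x3E', $js);
echo $xhtml, "\n";
```

The \x3E escape is understood by JavaScript but is not valid JSON, so the second variant is meant for embedding in scripts only, not for json_decode.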
CSS
In CSS contexts, the range of valid characters is precisely defined. For
escaping identifiers, for example, you can use this function:
$s = addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");
For CSS within HTML code, the same applies as stated about JavaScript and its
escaping within HTML attributes and tags (here it is about the style
attributes and <style> tags).
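As a quick illustration of the identifier escaping, here is the same addcslashes call wrapped in a helper. The function name and the sample identifiers are my own, so take it as a sketch rather than a complete CSS escaper.

```php
<?php
// Hypothetical helper wrapping the addcslashes() call shown above.
function escapeCssIdentifier(string $s): string
{
    return addcslashes($s, "\x00..\x2C./:;<=>?@[\\]^`{|}~");
}

// Dots and colons would otherwise start a class selector or pseudo-class.
echo '#', escapeCssIdentifier('col.main:first'), " { color: red }\n";
```

Letters, digits, the hyphen and the underscore pass through unchanged.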
URL
In the context of a URL, everything except the letters of the English
alphabet, digits, and characters - _ . is escaped by replacing them with
% + the hexadecimally expressed byte.
$s = rawurlencode($s);
According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters
in UTF-8 encoding is preferred.
The reverse function in this case is urldecode, which also recognizes the
+ character as meaning space.
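A short sketch of the encoding and of the difference in how + is decoded. The sample value is invented, and the letter é is written as its two UTF-8 bytes to keep the example independent of the file encoding.

```php
<?php
// Space, ampersand, equals sign, slash and a UTF-8 encoded "é".
$raw = "a b&c=d/\xC3\xA9";

$encoded = rawurlencode($raw);
echo $encoded, "\n";

// urldecode() treats + as a space, rawurldecode() leaves it alone.
var_dump(urldecode('a+b%20c'));    // string(5) "a b c"
var_dump(rawurldecode('a+b%20c')); // string(5) "a+b c"
```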
If you find the whole topic too complicated, don't despair. Soon you will
realize that it is actually about simple transformations, and the whole trick is
in realizing which context I am in and which function I need to choose for it.
Or even better, try using an intelligent templating system that can recognize
contexts itself and apply proper escaping:
Latte.
Singleton is one of the most popular design patterns. Its purpose is to
ensure the existence of only one instance of a certain class while also
providing global access to it. Here is a brief example for completeness:
class Database
{
    private static $instance;

    private function __construct()
    {}

    public static function getInstance()
    {
        if (self::$instance === null) {
            self::$instance = new self;
        }
        return self::$instance;
    }

    ...
}

// singleton is globally accessible
$result = Database::getInstance()->query('...');
Typical features include:
- A private constructor, preventing the creation of an instance outside
the class
- A static property $instance where the unique instance is stored
- A static method getInstance(), which provides access to the instance and
creates it on the first call (lazy loading)
Simple and easy to understand code that solves two problems of
object-oriented programming. Yet, in dibi or
Nette Framework, you won’t find any
singletons. Why?
Apparent Uniqueness
Let's look closely at the code – does it really ensure only one instance
exists? I’m afraid not:
$dolly = clone Database::getInstance();
// or
$dolly = unserialize(serialize(Database::getInstance()));
// or
class Dolly extends Database {}
$dolly = Dolly::getInstance();
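The first two tricks can be verified with a runnable sketch. It assumes a minimal Database class like the one above, reduced to the singleton plumbing.

```php
<?php
// Minimal singleton, reduced to the parts relevant for the demonstration.
class Database
{
    private static $instance;

    private function __construct()
    {}

    public static function getInstance()
    {
        if (self::$instance === null) {
            self::$instance = new self;
        }
        return self::$instance;
    }
}

$original = Database::getInstance();
$cloned = clone $original;
$unserialized = unserialize(serialize($original));

var_dump($original === Database::getInstance()); // true: same instance
var_dump($cloned === $original);                 // false: a second instance
var_dump($unserialized === $original);           // false: a third instance
```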
There is a defense against this:
final public static function getInstance()
{
    // final getInstance
}

final public function __clone()
{
    throw new Exception('Clone is not allowed');
}

final public function __wakeup()
{
    throw new Exception('Unserialization is not allowed');
}
The simplicity of implementing a singleton is gone. Worse – with every
additional singleton, we repeat the same piece of code. Moreover, the class
suddenly fulfills two completely different tasks: besides its original purpose,
it also takes care of its own uniqueness. Both are warning signals that
something is not right and the code deserves refactoring. Bear with me, I’ll
get back to this soon.
Global = Ugly?
Singletons provide a global access point to objects. There is no need to
constantly pass the reference around. However, critics argue that such a
technique is no different from using global variables, and those are
pure evil.
(If a method works with an object that was explicitly passed to it,
either as a parameter or as an object variable, I call it “wired
connection”. If it works with an object obtained through a global point (e.g.,
through a singleton), I call it “wireless connection”. Quite a nice
analogy, right?)
Critics are wrong in one respect – there is nothing inherently bad about
“global”. It’s important to realize that the name of each class and
method is nothing more than a global identifier. There is no fundamental
difference between the trouble-free construction $obj = new MyClass
and the criticized $obj = MyClass::getInstance(). This is even less
significant in dynamic languages like PHP, where, as of PHP 5.3, you can write
$obj = $class::getInstance().
However, what can cause headaches are:
- Hidden dependencies on global variables
- Unexpected use of “wireless connections”, which are not apparent from
the API of classes (see Singletons are Pathological Liars)
The first issue can be eliminated if singletons do not act like global
variables, but rather as global functions or services. Consider google.com –
a nice example of a singleton as a global service. There is one instance (a
physical server farm somewhere in the USA) globally accessible through the
identifier www.google.com. (Even clone www.google.com does not work, as
Microsoft discovered; they have it figured out.) Importantly,
this service does not have hidden dependencies typical for global variables –
it returns responses without unexpected connections to what someone else
searched for moments ago. On the other hand, the seemingly inconspicuous
function strtok suffers from a serious
dependency on a global variable, and its use can lead to very hard-to-detect
errors. In other words – the problem is not “globality”, but design.
The second point is purely a matter of code design. It is not wrong to use a
“wireless connection” and access a global service, the mistake is doing it
unexpectedly. A programmer should know exactly which object uses which class.
A relatively clean solution is to have a variable in the object referring to
the service object, which initializes to the global service unless the
programmer decides otherwise (the convention over configuration technique).
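What such a convention over configuration might look like is sketched below. All class names (Mailer, Registration, FakeMailer) and the method getDefault() are hypothetical, chosen only to illustrate the idea.

```php
<?php
// Hypothetical global service with a default instance.
class Mailer
{
    private static $default;

    public static function getDefault(): Mailer
    {
        return self::$default ?? (self::$default = new self);
    }

    public function send(string $to): string
    {
        return "sent to $to";
    }
}

// The dependency is visible in the API, but the default is the global service.
class Registration
{
    private $mailer;

    public function __construct(?Mailer $mailer = null)
    {
        $this->mailer = $mailer ?? Mailer::getDefault();
    }

    public function register(string $email): string
    {
        return $this->mailer->send($email);
    }
}

// A substitute, e.g. for tests.
class FakeMailer extends Mailer
{
    public function send(string $to): string
    {
        return "fake delivery to $to";
    }
}

echo (new Registration)->register('john@example.com'), "\n";
echo (new Registration(new FakeMailer))->register('john@example.com'), "\n";
```

The "wireless connection" is still there, but it is no longer a surprise: anyone reading the constructor sees which service the class uses and how to replace it.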
Uniqueness May Be Harmful
Singletons come with a problem that we encounter no later than when testing
code: the need to substitute a different, test object. Let's return to Google
as an exemplary singleton. We want to test an application that uses it, but
after a few hundred tests, Google starts protesting with “We're sorry…” and we
are stuck. The solution is to substitute a fictitious (mock) service under the
identifier www.google.com. We need to modify the hosts file – but (back from
the analogy to the world of OOP) how to achieve this with singletons?
One option is to implement a static method setInstance($mockObj). But oops!
What exactly do you want to pass to that method when no other instance, other
than that one and only, exists?
Any attempt to answer this question inevitably leads to the breakdown of
everything that makes a singleton a singleton.
If we remove the restrictions on the existence of only one instance, the
singleton stops being single and we are only addressing the need for a global
repository. Then the question arises, why repeat the same method getInstance()
in the code and not move it to an extra class, into some global registry?
Or we maintain the restrictions, only replacing the class identifier with an
interface (Database → IDatabase), which raises the problem of the
impossibility to implement IDatabase::getInstance(),
and the solution again is a global registry.
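A global registry of this kind can be very small. The following is only a sketch of the idea; the Registry class, its methods and both IDatabase implementations are hypothetical names of my own.

```php
<?php
interface IDatabase
{
    public function query(string $sql): string;
}

// Hypothetical global registry: the single place that knows the instances.
final class Registry
{
    private static $services = [];

    public static function set(string $name, object $service): void
    {
        self::$services[$name] = $service;
    }

    public static function get(string $name): object
    {
        if (!isset(self::$services[$name])) {
            throw new RuntimeException("Service '$name' is not registered.");
        }
        return self::$services[$name];
    }
}

class MySqlDatabase implements IDatabase
{
    public function query(string $sql): string
    {
        return "real result of: $sql";
    }
}

class MockDatabase implements IDatabase
{
    public function query(string $sql): string
    {
        return "mock result of: $sql";
    }
}

// Production setup.
Registry::set('database', new MySqlDatabase);
echo Registry::get('database')->query('SELECT 1'), "\n";

// In tests, the same identifier simply points to a mock.
Registry::set('database', new MockDatabase);
echo Registry::get('database')->query('SELECT 1'), "\n";
```

None of the service classes needs getInstance(), __clone() or __wakeup() guards any more; the repeated singleton code has moved into one place.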
A few paragraphs above, I promised to return to the issue of repetitive
code in all singletons and possible refactoring. As you can see, the problem has
resolved itself. The singleton has died.
Twitter for PHP is a very small and easy-to-use library for
sending messages to Twitter and receiving status updates with OAuth support.
Download Twitter for PHP 3.5
It requires PHP (version 5 or newer) with the cURL extension and is licensed
under the New BSD License. You can obtain the latest version from our GitHub repository or install it via
Composer:
php composer.phar require dg/twitter-php
Twitter requires SSL/TLS as of January 14th, 2014. Update to
the latest version.
Getting started
Sign in to http://twitter.com and register an application from the
http://dev.twitter.com/apps page. Remember to never reveal your consumer
secrets. Click on the My Access Token link in the sidebar and retrieve your own
access token. Now you have a consumer key, consumer secret, access token and
access token secret.
Create an object using the application and request/access keys:
$twitter = new Twitter($consumerKey, $consumerSecret,
$accessToken, $accessTokenSecret);
Posting
The send() method posts your status. The message must be encoded
in UTF-8:
$twitter->send('I am fine today.');
You can attach a picture:
$twitter->send('This is my photo', $imageFile);
Displaying
The load() method returns the 20 most recent status updates posted in the
last 24 hours by you:
$statuses = $twitter->load(Twitter::ME);
or posted by you and your friends:
$statuses = $twitter->load(Twitter::ME_AND_FRIENDS);
or most recent mentions for you:
$statuses = $twitter->load(Twitter::REPLIES);
Extracting the information from the channel is easy:
<ul>
<?php foreach ($statuses as $status): ?>
<li><a href="http://twitter.com/<?= $status->user->screen_name ?>">
<?= htmlspecialchars($status->user->name) ?></a>:
<?= Twitter::clickable($status) ?>
<small>at <?= date("j.n.Y H:i", strtotime($status->created_at)) ?></small>
</li>
<?php endforeach ?>
</ul>
The static method Twitter::clickable() makes links in status clickable. In
addition to regular links, it links @username to the user’s Twitter profile
page and links hashtags to a Twitter search on that hashtag.
Searching
The search() method provides searching in Twitter statuses:
$results = $twitter->search('#nette');
The returned result is again an array of statuses.
Error handling
All methods throw a TwitterException on error:
try {
$statuses = $twitter->load(Twitter::ME);
} catch (TwitterException $e) {
echo "Error: ", $e->getMessage();
}
Additional features
The authenticate() method tests if user credentials are valid:
if (!$twitter->authenticate()) {
die('Invalid name or password');
}
Other commands
You can use all commands defined by Twitter API 1.1. For example GET
statuses/retweets_of_me returns the array of most recent tweets authored by
the authenticating user:
$statuses = $twitter->request('statuses/retweets_of_me', 'GET', array('count' => 20));
PHP 5.2.0 comes with a new DOM function named registerNodeClass(). What is it
good for? The documentation says nothing, and neither does uncle Google. But
this function is really great!
…continued
I recently managed to speed up a PHP script to a hundredth of
its original execution time by changing just a few characters in the source
code. How is this possible? The drastic acceleration is due to the appropriate
use of references and assignments. I'll let you in on how it works. Don't
believe the sensational headline; it's not any kind of black magic. I repeat,
you just need to understand how PHP works internally. But don’t worry,
it's nothing too complicated.
In-depth Reference Counting
The PHP core stores variable names separately from their values in memory. An
anonymous value is described by the structure zval.
Besides raw data, it includes information about the type (boolean, string, etc.)
and two additional items: refcount and is_ref. Yes, refcount is exactly the
counter for the aforementioned reference counting.
$abc = 'La Trine';
What does this code actually do? It creates a new zval value in memory, whose
data section holds the 8 characters La Trine and indicates the type as a
string. At the same time, a new entry abc is added to the variable table,
referring to this zval.
Additionally, in the zval structure, we initialize the refcount counter to
one, because there is exactly one variable ($abc) pointing to it.
…continued
For some, this is obvious, for me, it's mainly a cheat sheet.
I just can't remember extremely long numbers, meaning those that have more than
one digit.
In PHP, redirection is implemented with the following code:
$code = 301; // code in the range 300..307
$url = 'http://example.com';
header('Location: ' . $url, true, $code);
die('Please <a href="' . htmlspecialchars($url) . '">click here</a> to continue.');
Note that after calling the header() function, it is necessary to
explicitly terminate the script. It doesn't hurt to offer a text message and a
link for agents that do not automatically redirect.
Types of Redirection
The meanings of individual codes are described in detail in the standard RFC 2616:
HTTP/1.1 Redirection. Here they are:
300 Multiple Choices
There are several URLs to which redirection is possible (pages may differ,
for example, in language). Offer users a list of these. The preferred
destination can be indicated in the Location header; not every
browser automatically redirects. Rarely used.
301 Moved Permanently
Use this when a resource that used to exist at the requested URL is now
(permanently) located at a new address. Specify this in the Location header.
However, if it has been discontinued, announce this with the code 410 Gone.
302 Found
A problematic code. It indicates that the resource has been temporarily
moved elsewhere and the browser should access the new URL using the same method
(GET, POST, HEAD, …) as used on the original. Additionally, with methods other
than GET and HEAD, user confirmation should be required for the redirection.
Most browsers, however, do not respect this and change the method to GET without
requiring confirmation.
The code is often mistakenly used instead of 303.
303 See Other – for PRG
The Post/Redirect/Get
technique prevents double submission of forms upon page reload or back button
click. After submitting a form using the POST method, a redirection is made
using the GET method to another page. This is exactly what the 303 code is for,
converting POST to GET.
304 Not Modified
For caching purposes. It answers a request carrying the If-Modified-Since
header by saying that the resource has not changed since the previous visit.
The response must not contain a body, only headers.
307 Temporary Redirect
As mentioned, code 302 has become problematic due to non-compliance with the
standard by both web designers and browser creators. Code 307 is its
reincarnation, which mostly works correctly. It can be used, for example, to
perform a redirection using the POST method with transferred data.
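Putting the pieces together, the Post/Redirect/Get case described under code 303 might be handled with a helper like the one sketched here. The function names redirect() and redirectBody() are hypothetical, as is the choice of 303 as the default code.

```php
<?php
// Hypothetical helper: fallback text for agents that do not follow redirects.
function redirectBody(string $url): string
{
    return 'Please <a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8')
        . '">click here</a> to continue.';
}

// Hypothetical helper: send the redirect and terminate the script.
function redirect(string $url, int $code = 303): void
{
    header('Location: ' . $url, true, $code);
    die(redirectBody($url));
}

// The URL goes into the header raw, but must be escaped in the HTML body.
echo redirectBody('http://example.com/?a=1&b=2'), "\n";
```

After processing a submitted form you would then call, for example, redirect('http://example.com/thanks'), while a permanently moved page would pass 301 as the second argument.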
PHP currently supports property overloading provided by the magic methods
__get() and __set(). These methods are called when the accessed property is not
declared in the class. Actually, this overloading is a bit frustrating.
This is my solution for simulating properties with accessor methods the Delphi way.
So…
// define new "like-keyword"
// value is short and unusual string
define('property', "\0\0");
Maybe _property is a better choice. It doesn't matter now.
…continued
I have written a tool for automatically converting
PHP5 scripts to PHP4. It converts new PHP 5 object model features and
language constructions into PHP 4 equivalents. This tool may be useful if you
need to transfer PHP5 scripts to run under the previous version. It is not
designed to provide missing functionality (try PHP_Compat), just missing
syntax.
Supported are these constructions:
Try it here
I spent a lot of time thinking about how to emulate some of PHP5's object model
features in the older PHP4. How to get rid of tons of ampersands in my source
code. How to force objects not to copy themselves every time.
So, here is a solution.
…continued