Here are some well-intentioned tips on how to design the
structure of namespaces and class names.
Namespaces are probably the best-known new feature of PHP version 5.3. Their
main purpose is to prevent name conflicts and to allow shortening (aliasing) of
class names for use within a single file. In practice, it has been shown that
conflicts can also be avoided by using a 1–2 letter prefix, and likewise I have
never needed class names like
Zend_Service_DeveloperGarden_Response_ConferenceCall_AddConferenceTemplateParticipantResponseType
(97 characters, I wonder how they adhere to their maximum line length rule of
80 characters 🙂 ). However, PHP follows in the footsteps of Java, and so we
have namespaces. How should we handle them?
Benefits of Namespaces
Perhaps the most complex question you need to answer is: what is the benefit
of renaming a class:
sfForm → Symfony\Component\Form\Form
This question is a proven starter for endless flame wars. From the
programmer's comfort, intuitiveness, and memorability perspective, the original
concise and descriptive sfForm is more appropriate. It corresponds
to how programmers colloquially refer to it, i.e., “form in Symfony”. The
new, longer name is correct in other respects, though I am not sure the
average user will appreciate them.
How to Lay Out Namespaces?
The syntactic aspect of using namespaces is described
in the documentation, but finding the right patterns requires practice,
which there hasn’t been enough time for yet. Namespaces in PHP have their own
specifics due to a number of factors, so it is not ideal to copy the conventions
used in Java or .NET exactly. However, they can be a good starting point.
More will be discussed in the individual naming rules.
1) A class should have a descriptive name even without the namespace
The name of each class, even without the namespace, must capture its essence.
It would be inappropriate to rename the class ArrayIterator →
Spl\Iterators\Array, as one would not expect an iterator under the
name Array (ignoring the fact that a class cannot be named a
keyword). And beware, even from the name Spl\Iterators\Array, it is
not clear that it is an iterator, because you cannot assume that the namespace
Spl\Iterators only contains iterators. Here are a few examples:
unsuitable: Nette\Application\Responses\Download – it is not
obvious that Download is a response
unsuitable: Zend\Validator\Date – you would expect
Date to be a date, not a validator
unsuitable: Zend\Controller\Request\Http – you would not expect
Http to be a request
Therefore, in addition to specializing classes, it is appropriate to keep a
level of generality in the name:
The ideal is a one-word yet descriptive name. This is especially achievable
for classes that represent something from the real world:
best: Nette\Forms\Controls\Button – the two-word
ButtonControl is not necessary (however, HiddenControl
cannot be shortened to Hidden)
2) The namespace should have a descriptive name
Naturally, the name of the namespace itself must be descriptive, and it is
advantageous to have a shorter name without redundancies. Such a redundancy to
me seems like Component in Symfony\Component\Routing,
because the name would not suffer without it.
In some situations, you need to decide between singular and plural (e.g.,
Zend\Validator vs Zend\Validators), which is a
similarly undecided issue as choosing between singular and plural for
database table names.
3) Distinguish between namespaces and classes
Naming a class the same as a namespace (i.e., having classes
Nette\Application and Nette\Application\Request) is
technically possible, but it might confuse programmers and it is better to avoid
it. Also, consider how well the resulting code will read or how you would
explain the API to someone.
Ideally, the class name and the namespace name should not contain the same
information redundantly.
instead of Nette\Http\HttpRequest prefer
Nette\Http\Request
instead of
Symfony\Component\Security\Authentication\AuthenticationTrustResolver
prefer the class TrustResolver
The class Nette\Http\Request does not violate rule No. 1 about
the descriptive name of the class even without mentioning the namespace, on the
contrary, it allows us to elegantly use the partial namespace:
use Nette\Http; // alias for namespace
// all classes via Http are available:
$request = new Http\Request;
$response = new Http\Response;
// and additionally, Http\Response is more understandable than just Response
4) Beware of duplicating the last word
If we understand namespaces as packages, which is common, it leads to an
unfortunate duplication of the last word:
Zend\Form\Form
Symfony\Component\Finder\Finder
Nette\Application\Application
Namespaces also literally encourage grouping classes (e.g., various
implementations of the same interface, etc.) into their own spaces, which again
creates duplications:
Nette\Caching\Storages\FileStorage – i.e., all storages in a
separate namespace Storages
Zend\Form\Exception\BadMethodCallException – all exceptions
in Exception
Symfony\Component\Validator\Exception\BadMethodCallException –
again all exceptions in Exception
Grouping namespaces lengthens the name and creates duplication in it, because
it is often impossible to remove the generality from the class name (rule 1).
Their advantage may be better orientation in the generated API documentation
(although this could be achieved differently) and easier access when using
full-fledged IDEs with code completion. However, I recommend using them
cautiously; for exceptions, for example, they are not very suitable.
5) Unmistakable classes from multiple namespaces
According to point 1), a class should have a descriptive name, but that does
not mean it has to be unique within the entire application. Usually, it is
enough that it is unique within its namespace. However, if two classes from
different namespaces are often used side by side in the code, or if they have
some other significant connection, they should not share the same name. In other
words, it should not be necessary to use an alias (as) in the use statement.
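For illustration, here is the kind of aliasing the rule tries to prevent. The App\Mail and App\Forum names are made up, and class_alias stands in for real class definitions so the sketch is self-contained:

```php
use App\Mail\Message;
use App\Forum\Message as ForumMessage; // rule 5: design so this alias is not needed

// stand-in classes registered under the colliding names (purely illustrative)
class_alias(ArrayObject::class, 'App\Mail\Message');
class_alias(SplStack::class, 'App\Forum\Message');

$mail = new Message;       // App\Mail\Message
$post = new ForumMessage;  // App\Forum\Message
```

If the two Message classes rarely meet in the same file, the collision costs nothing; it only hurts when they are used side by side.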
6) One-way dependencies
Consider what dependencies should exist between classes from different
namespaces. I try to maintain:
if a class from the namespace A\B has a dependency on a class from the
namespace A\C, no class from A\C should have a dependency on A\B
classes from the namespace A\B should not have dependencies on classes from
the namespace A\B\C (take this with a grain of salt)
p.s.: Please do not take this article as dogma, it is just a capture of
current thoughts
A question that many webmasters ask: do search engines perceive these URLs
as the same? How should they be treated?
http://example.com/article
http://example.com/article/
http://example.com/Article
https://example.com/article
http://www.example.com/article
http://example.com/article?a=1&b=2
http://example.com/article?b=2&a=1
The short answer would be: “URLs are different.” However, a more detailed
analysis is needed.
From a user's perspective, these addresses differ only in minor details
which they generally disregard. Thus, they perceive them as the same, although
technically, they are different addresses. Let's call them similar
addresses. For the sake of “user experience”, two principles should be
adhered to:
Do not allow different content on similar addresses. As I will show
soon, this would not only confuse users but also search engines.
Allow users access through similar addresses.
If the addresses differ in protocol http / https or
with www domain or without, search engines consider them different.
Not so for users. It would be a fatal mistake to place different content on such
similar addresses. However, it would also be a mistake to prevent access through
a similar address. Both the address with www and the one
without it must work; SEO practice recommends sticking to one variant and
redirecting the other to it with an HTTP 301 code. For the
www subdomain, this can be managed with a
.htaccess file:
# EITHER: redirect to the non-www variant
RewriteCond %{HTTP_HOST} ^www\.
RewriteRule ^.*$ http://example.com/$0 [R=301,NE,L]

# OR: redirect to the www variant
RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^.*$ http://www.example.com/$0 [R=301,NE,L]
Immediately test whether your servers redirect, including the full address
and correct parameter passing.
Don't forget variants like www.subdomain.example.cz. Because some
browsers silently work around missing redirects, test with a low-level tool like Web-Sniffer.
URLs are case-sensitive except for the scheme and domain. However, users do
not differentiate and therefore, it is unfortunate to offer different content on
addresses differing only by letter case. A poor example of this can be seen on
Wikipedia.
Bing amusingly suffers from an error, returning the same URL whether you
search for acid or a database (although the textual description is correct).
Google and Yahoo do not have this issue.
Bing does not differentiate between acid and database
Some services (webmails, ICQ) convert uppercase letters in URLs to lowercase,
which is another reason to avoid case distinctions, even in parameters.
Better to adhere to the convention that all letters in URLs are
lowercase.
Distinguishing some similar addresses is also a challenge for search
engines. I conducted an experiment by placing different content on URLs
differing in details like the presence of a trailing slash or parameter order.
Only Google was able to index them as different. Other search engines could
always handle only one of the variants.
Only Google can index these pages as different
As for trailing slashes, the web server usually redirects to the canonical
form for you; if you access a directory without a trailing slash, it adds one
and redirects. Of course, this does not apply when you manage URIs on your own
(Cool URIs, etc.)
Finally: does the order of parameters really matter? There should be no
difference between article?a=1&b=2 and
article?b=2&a=1. However, there are situations where this is
not the case, especially when passing complex structures such as arrays. For
instance, ?sort[]=name&sort[]=city might be different from
?sort[]=city&sort[]=name. Nevertheless, redirecting if
parameters are not in the specified order would be considered unnecessary
overcorrection.
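Whether two query strings are “similar” in this sense can be checked by normalizing them; a minimal sketch (the function name canonicalQuery is made up):

```php
// normalize a query string: parse it, sort the top-level keys, rebuild it
function canonicalQuery(string $query): string
{
    parse_str($query, $params);
    ksort($params);
    return http_build_query($params);
}

// a=1&b=2 and b=2&a=1 normalize to the same form
var_dump(canonicalQuery('b=2&a=1') === canonicalQuery('a=1&b=2')); // bool(true)

// but array parameters keep their internal order, as described above
var_dump(canonicalQuery('sort[]=name&sort[]=city')
    === canonicalQuery('sort[]=city&sort[]=name')); // bool(false)
```

Such a normalized form is useful for comparing URLs, even if redirecting to it would be the overcorrection mentioned above.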
p.s. Nette Framework automatically
handles redirection to canonical URLs on its own.
In PHP, there are three libraries available for regular expressions: PCRE, Oniguruma, and POSIX Regex. The second one
may not always be available, and the third is deprecated, so you should
exclusively use the more adept and faster PCRE library. Unfortunately, its
implementation suffers from quite unpleasant flaws across all PHP versions.
The operation of the preg_* functions can be divided into
two steps:
compilation of the regular expression
execution (searching, replacing, filtering, …)
It is advantageous that PHP maintains a cached version of compiled regular
expressions, meaning they are only compiled once. Therefore, it is appropriate
to use static regular expressions, i.e., not to generate them
parametrically.
Now for the unpleasant issues. If an error is discovered during compilation,
PHP will issue an E_WARNING error, but the return value of the
function is inconsistent:
It is good to know that the functions returning an array $matches by
reference (i.e., preg_match and preg_match_all) do not
nullify the argument upon a compilation error, so you cannot rely on inspecting
$matches; testing the return value is the valid approach.
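This behaviour can be verified directly; a minimal sketch (the unbalanced parenthesis is a deliberate compilation error, and @ only suppresses the E_WARNING for the demonstration):

```php
$matches = ['previous' => 'value'];

// invalid pattern: unbalanced parenthesis → compilation error
$result = @preg_match('~(~', 'subject', $matches);

var_dump($result);  // bool(false) – the only reliable indicator
// as the article notes, $matches is not nullified on a compilation error
```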
Since version 5.2.0, PHP has the function preg_last_error, which returns the
code of the last error. Beware, however: this only applies to errors that occur
during execution! If an error occurs during compilation, the value of
preg_last_error is not reset and it returns the previous value. So if the
return value of a preg_* function is null or
false (see above), definitely do not rely on what preg_last_error returns.
What kind of errors can occur during execution? The most common case is
exceeding pcre.backtrack_limit or invalid UTF-8 input when using
the u modifier. (Note: invalid UTF-8 in the regular expression
itself is detected during compilation.) However, the way PHP handles such an
error is utterly inadequate:
it generates no message (silent error)
the return value of the function may indicate that everything is fine
the error can only be detected by calling
preg_last_error later
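A minimal sketch of such a silent execution error, using invalid UTF-8 input with the u modifier:

```php
// "\xC3\x28" is not a valid UTF-8 sequence
$result = preg_match('~.~u', "\xC3\x28");

var_dump($result);                                    // bool(false), yet no message is emitted
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)
```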
Let's talk about the return value, which is probably the biggest betrayal.
The process is executed until an error occurs, then it returns a partially
processed result. And this is done completely silently. However, even this is
not always the case, for example, the trio of functions
preg_filter, preg_replace_callback,
preg_replace can return null even during execution
errors.
Whether an error occurred during execution can only be determined by calling
preg_last_error. But as you know, this function returns a
nonsensical result if, on the contrary, a compilation error occurred, so we must
distinguish both situations by considering the return value of the function,
whether it is null or false. And since functions that
return null during a compilation error can also return
null during an execution error, it can be stated only that PHP is
undoubtedly a messed-up language.
What would safe use of PCRE functions look like? For example, like this:
function safeReplaceCallback($pattern, $callback, $subject)
{
    // we must verify the callback ourselves
    if (!is_callable($callback)) {
        throw new Exception('Invalid callback.');
    }

    // test the expression on an empty string
    if (preg_match($pattern, '') === false) { // compilation error?
        $error = error_get_last();
        throw new Exception($error['message']);
    }

    // call PCRE
    $result = preg_replace_callback($pattern, $callback, $subject);

    // execution error?
    if ($result === null && preg_last_error()) {
        throw new Exception('Error processing regular expression.', preg_last_error());
    }

    return $result;
}
The provided code transforms errors into exceptions but does not attempt to
suppress warning outputs.
Safe processing of regular expressions is implemented in the class Nette\Utils\Strings.
Every now and then, a security vulnerability is reported on
another significant website (Alza, Mapy.cz, BontonLand) or is exploited.
Try searching for XSS
vulnerability to understand why Cross Site Scripting (XSS) is currently one
of the most widespread and dangerous vulnerabilities.
This is a distressing issue for website operators and perhaps even more so
for suppliers. It can damage reputations, lead to fines, lawsuits, or
simply spoil relationships with clients. How to defend against XSS? By so-called
string escaping. Unfortunately,
most experts are not well-versed in this area. (I don’t mean to be tactless
or offend anyone, but of the “Czechoslovak IT celebrities,” I only know one
person who deeply understands this issue.) Thus, even articles on this topic on
well-known websites are, let’s say, inaccurate.
Moreover, this escaping is usually done in the template, falling on the
coder’s shoulders. Thus, the most critical area requiring high expertise is
handled by someone unqualified. How can this end? We know all too well – see
the first paragraph.
Nette Framework Will Save You
I would like to introduce you to a killer feature of the Latte templating
system in the Nette Framework. It's such a
fundamental feature that it alone is a reason to choose this framework. Or at
least to use its templates.
the bigger your company, the more crucial this feature is
no competing framework has it to date 1)
The Nette Framework automatically escapes in templates. Its
Context-aware escaping feature recognizes which part of the document
you are in and chooses the appropriate escaping method accordingly.
Let's dive into more technical details. You can see how it works best with
an example. Consider a variable $var and this template:
The notation {$var} means printing the variable. However, each
print must be explicitly secured, even differently at each location. A coder
must (for example, in Smarty) add the appropriate modifiers, must not make a
mistake, and especially not omit anything.
In the Nette Framework, nothing needs to be manually secured. Everything
is done automatically, correctly, and consistently!
If we assign $var = 'Width 1/2"' to the variable, the framework
generates the HTML code:
Of course, situations where you need to print a variable without escaping it
are also considered, for example, because it contains article text including
HTML tags. In such cases, you use the notation {$var|noescape}.
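Outside of a templating system, the difference between contexts can be illustrated with plain PHP functions; this is only a sketch of what context-aware escaping must do for each spot, not Latte's actual implementation:

```php
$var = 'Width 1/2"';

// HTML context: the quote must become an entity
echo htmlspecialchars($var, ENT_QUOTES);  // Width 1/2&quot;

// JavaScript string context: entirely different rules apply
echo json_encode($var);                   // "Width 1\/2\""
```

Latte's contribution is that it picks the right transformation per location automatically, so none of these calls appear in the template.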
End of the technical digression. Thanks to Latte, it suddenly means that
the template remains simple
you don’t have to worry that a coder will overlook something
and at the same time, you don’t need to have a top expert on
escaping ;)
the work is much easier
You can find more information about Latte’s smart templates in the documentation.
1) About half a year after Nette, Google introduced a similar
feature for its library in C++, and as far as I know, no framework in PHP,
Ruby, or Python has anything similar yet.
One of the evergreen topics in programming is the confusion and
misunderstandings around escaping. Ignorance causes the simplest methods of
compromising websites, such as Cross Site Scripting (XSS) or SQL injection, to
remain unfortunately widespread.
Escaping is the substitution of characters that have a special meaning in
a given context with other corresponding sequences.
Example: To write quotes within a string enclosed by quotes, you need to
replace them because quotes have a special meaning in the context of a string,
and writing them plainly would be interpreted as ending the string. The specific
substitution rules are determined by the context.
Prerequisites
Each escaping function assumes that the input is always a “raw
string” (unmodified) in a certain encoding (character set).
Storing strings already escaped for HTML output in the database and similar
is entirely counterproductive.
What contexts do we encounter?
As mentioned, escaping converts characters that have a special meaning in a
certain context. Different escaping functions are used for each context. This
table is only indicative, and it is necessary to read the
notes below.
many contexts have their subcontexts where escaping differs. Unless
otherwise stated, the specified escaping function is applicable universally
without further differentiation of subcontexts.
the term usual character set refers to a character set with 1-byte or UTF-8 encoding.
HTML
In the HTML context, the characters < & " ' have a special
meaning, and the corresponding escape sequences are
&lt; &amp; &quot; &#039;. However, the exception is
an HTML comment, where only the pair -- has special meaning.
For escaping, use:
$s = htmlspecialchars($s, ENT_QUOTES);
It works with any usual character set. However, it does not consider the
subcontext of HTML comments (i.e., it cannot replace the pair --
with something else).
Reverse function:
$s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
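The two functions are exact inverses for a raw string, which can be checked with a quick round trip:

```php
$s = '<a href="#">R&D</a>';

$escaped = htmlspecialchars($s, ENT_QUOTES);
// &lt;a href=&quot;#&quot;&gt;R&amp;D&lt;/a&gt;

// decoding restores the original raw string
var_dump(html_entity_decode($escaped, ENT_QUOTES, 'UTF-8') === $s); // bool(true)
```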
XML / XHTML
XML 1.0 differs from HTML in that it prohibits the use of C0 control
characters (including writing in the form of an entity) except for the
tabulator, line feed, and space. XML 1.1 allows these banned characters, except
NUL, in the form of entities, and further mandates C1 control
characters, except NEL, also to be written as entities.
Additionally, in XML, the sequence ]]> has a special meaning, so
one of these characters must also be escaped.
Regular Expressions
In Perl-compatible regular expressions, the characters
. \ + * ? [ ^ ] $ ( ) { } = ! < > | : - and the so-called
delimiter, i.e., the character delimiting the regular expression (e.g., for the
expression '#[a-z]+#i' it is #), have special
meaning. They are escaped with the character \.
$s = preg_quote($s, $delimiter);
In the string replacing the searched expression (e.g., the 2nd parameter of
the preg_replace function), the backslash and dollar sign have
special meaning:
$s = addcslashes($replacement, '$\\');
The encoding must be either 1-byte or UTF-8, depending on the modifier in the
regular expression.
PHP Strings
PHP distinguishes these types of strings:
in single quotes, where the characters \ ' can have
special meaning
in double quotes, where the characters \ " $ can have
special meaning
NOWDOC, where no character has special meaning
HEREDOC, where the characters \ $ can have special meaning
Escaping is done with the character \. This is usually done by the
programmer when writing code; for PHP code generators, you can use the var_export function.
Note: because the mentioned regular expressions are usually written within
PHP strings, both types of escaping need to be combined. E.g., the character
\ for a regular expression is written as \\ and in a
quoted string it needs to be written as \\\\.
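A minimal sketch of both escaping layers used programmatically; the sample strings are made up:

```php
// escape user input for use inside a pattern with delimiter ~
$input = '1+1';
$pattern = '~' . preg_quote($input, '~') . '~';   // ~1\+1~

// escape a replacement so that $ and \ are taken literally
$replacement = addcslashes('price: $100', '$\\');

echo preg_replace($pattern, $replacement, 'total: 1+1');
// total: price: $100
```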
SQL and Databases
Each database has its own escaping function, see the table above. Almost
always, however, only a function for escaping strings is available, and it
cannot be used for anything else, especially there are no functions for escaping
wildcard characters used in LIKE constructions (in MySQL these are
% _) or identifiers, such as table or column names. Databases
do not require removing escaping on output! (Except, for example, for
bytea type.)
JavaScript
As a programming language, JavaScript has a number of very different
subcontexts. For escaping strings, you can use the side
effect of the function
$s = json_encode((string) $s);
which also encloses the string in quotes. It strictly requires UTF-8.
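json_encode also escapes the slash by default, which can be checked directly:

```php
// the escaped slash means "</script>" cannot terminate an enclosing script block
echo json_encode('</script>');  // "<\/script>"
```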
JavaScript written inside HTML attributes (e.g., onclick) must
still be escaped according to HTML rules, but this does
not apply to JavaScript inside <script> tags, where only the
potential occurrence of the ending tag </script> inside the
string needs to be treated. However, json_encode ensures this, as JSON escapes
the slash /. However, it does not handle the end of an HTML comment
--> (which does not matter in HTML) or an XML CDATA block
]]>, which the script is wrapped in. For XML/XHTML, the
solution is
CSS
For CSS within HTML code, the same applies as stated about JavaScript and its
escaping within HTML attributes and tags (here it concerns the
style attributes and <style> tags).
URL
In the context of a URL, everything except the letters of the English
alphabet, digits, and characters - _ . is escaped by replacing them
with % + the hexadecimally expressed byte.
$s = rawurlencode($s);
According to RFC 2718 (from 1999) or RFC 3986 (from 2005), writing characters
in UTF-8 encoding is preferred.
The reverse function in this case is urldecode, which also recognizes the
+ character as meaning space.
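A minimal sketch of the encoding and of the + handling mentioned above:

```php
// rawurlencode percent-encodes everything outside the unreserved set,
// multi-byte UTF-8 characters byte by byte
echo rawurlencode('Škoda 1/2');  // %C5%A0koda%201%2F2

// urldecode additionally treats + as a space; rawurldecode does not
var_dump(urldecode('a+b'));     // string(3) "a b"
var_dump(rawurldecode('a+b'));  // string(3) "a+b"
```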
If you find the whole topic too complicated, don't despair. Soon you will
realize that it is actually about simple transformations, and the whole trick is
in realizing which context I am in and which function I need to choose for it.
Or even better, try using an intelligent templating system that can recognize
contexts itself and apply proper escaping:
Latte.
Singleton is one of the most popular design patterns. Its purpose is to
ensure the existence of only one instance of a certain class while also
providing global access to it. Here is a brief example for completeness:
class Database
{
    private static $instance;

    private function __construct()
    {}

    public static function getInstance()
    {
        if (self::$instance === null) {
            self::$instance = new self;
        }
        return self::$instance;
    }

    ...
}
// singleton is globally accessible
$result = Database::getInstance()->query('...');
Typical features include:
A private constructor, preventing the creation of an instance outside
the class
A static property $instance where the unique instance
is stored
A static method getInstance(), which provides access to the
instance and creates it on the first call (lazy loading)
Simple and easy to understand code that solves two problems of
object-oriented programming. Yet, in dibi or
Nette Framework, you won’t find any
singletons. Why?
Apparent Uniqueness
Let's look closely at the code – does it really ensure only one instance
exists? I’m afraid not:
$dolly = clone Database::getInstance();
// or
$dolly = unserialize(serialize(Database::getInstance()));
// or
class Dolly extends Database {}
$dolly = Dolly::getInstance();
There is a defense against this:
final public static function getInstance()
{
    // final getInstance
}

final public function __clone()
{
    throw new Exception('Clone is not allowed');
}

final public function __wakeup()
{
    throw new Exception('Unserialization is not allowed');
}
The simplicity of implementing a singleton is gone. Worse – with every
additional singleton, we repeat the same piece of code. Moreover, the class
suddenly fulfills two completely different tasks: besides its original purpose,
it takes care of being quite single. Both are warning signals that something is
not right and the code deserves refactoring. Bear with me, I’ll get back to
this soon.
Global = Ugly?
Singletons provide a global access point to objects. There is no need to
constantly pass the reference around. However, critics argue that such a
technique is no different from using global variables, and those are
pure evil.
(If a method works with an object that was explicitly passed to it,
either as a parameter or as an object variable, I call it “wired
connection”. If it works with an object obtained through a global point (e.g.,
through a singleton), I call it “wireless connection”. Quite a nice
analogy, right?)
Critics are wrong in one respect – there is nothing inherently bad about
“global”. It’s important to realize that the name of each class and
method is nothing more than a global identifier. There is no fundamental
difference between the trouble-free construction $obj = new MyClass
and the criticized $obj = MyClass::getInstance(). This is even less
significant in dynamic languages like PHP, where you can “write in PHP 5.3”
$obj = $class::getInstance().
The first issue can be eliminated if singletons do not act like global
variables, but rather as global functions or services. Consider google.com –
a nice example of a singleton as a global service. There is one instance (a
physical server farm somewhere in the USA) globally accessible through the
identifier www.google.com. (Even clone www.google.com
does not work, as Microsoft discovered, they have it figured out.) Importantly,
this service does not have hidden dependencies typical for global variables –
it returns responses without unexpected connections to what someone else
searched for moments ago. On the other hand, the seemingly inconspicuous
function strtok suffers from a serious
dependency on a global variable, and its use can lead to very hard-to-detect
errors. In other words – the problem is not “globality”, but design.
The second point is purely a matter of code design. It is not wrong to use a
“wireless connection” and access a global service, the mistake is doing it
unexpectedly. A programmer should know exactly which object uses which class.
A relatively clean solution is to have a variable in the object referring to
the service object, which initializes to the global service unless the
programmer decides otherwise (the convention over configuration technique).
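A minimal sketch of that convention (all class names here are made up, not taken from any framework): the object uses an explicitly injected service if one was given, otherwise it falls back to a global default:

```php
class EventLog
{
    public array $messages = [];

    public function write(string $message): void
    {
        $this->messages[] = $message;
    }
}

class Article
{
    public static ?EventLog $defaultLog = null;  // the global service
    private ?EventLog $log = null;

    // "configuration": the programmer may inject a specific service
    public function setLog(EventLog $log): void
    {
        $this->log = $log;
    }

    // "convention": fall back to the global default when nothing was injected
    public function getLog(): EventLog
    {
        return $this->log ??= self::$defaultLog;
    }
}

Article::$defaultLog = new EventLog;

$a = new Article;
$a->getLog()->write('uses the global service');

$b = new Article;
$b->setLog(new EventLog);  // e.g. a mock in tests
$b->getLog()->write('uses the injected one');
```

The "wireless connection" stays available by default, yet a test can rewire any single object explicitly.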
Uniqueness May Be Harmful
Singletons come with a problem that we encounter no later than when testing
code. And that is the need to substitute a different, test object. Let's return
to Google as an exemplary singleton. We want to test an application that uses
it, but after a few hundred tests, Google starts protesting with We're
sorry… and we are stuck. The solution is to substitute
a fictitious (mock) service under the identifier www.google.com; we
would need to modify the hosts file. But (returning from the
analogy to the world of OOP) how do we achieve this with singletons?
One option is to implement a static method
setInstance($mockObj). But oops! What exactly do you want to pass
to that method when no other instance, other than that one and only,
exists?
Any attempt to answer this question inevitably leads to the breakdown of
everything that makes a singleton a singleton.
If we remove the restrictions on the existence of only one instance, the
singleton stops being single and we are only addressing the need for a global
repository. Then the question arises, why repeat the same method
getInstance() in the code and not move it to an extra class, into
some global registry?
Or we maintain the restrictions, only replacing the class identifier with an
interface (Database → IDatabase), which raises the
problem of the impossibility to implement IDatabase::getInstance()
and the solution again is a global registry.
A few paragraphs above, I promised to return to the issue of repetitive
code in all singletons and possible refactoring. As you can see, the problem has
resolved itself. The singleton has died.
It requires PHP (version 5 or newer) with the cURL extension and is licensed
under the New BSD License. You can obtain the latest version from our GitHub repository or install it via
Composer:
php composer.phar require dg/twitter-php
Twitter has required SSL/TLS since January 14th, 2014. Update to
the latest version.
Getting started
Sign in to http://twitter.com and
register an application at the http://dev.twitter.com/apps page.
Remember never to reveal your consumer secrets. Click on the My Access Token
link in the sidebar and retrieve your own access
token. Now you have a consumer key, consumer secret, access token, and access
token secret.
Create an object using the application and access keys:
$twitter = new Twitter($consumerKey, $consumerSecret,
$accessToken, $accessTokenSecret);
Posting
The send() method posts your status. The message must be encoded
in UTF-8:
$twitter->send('I am fine today.');
You can attach a picture:
$twitter->send('This is my photo', $imageFile);
Displaying
The load() method returns the 20 most recent status updates posted by you in
the last 24 hours:
The static method Twitter::clickable() makes links in status
clickable. In addition to regular links, it links @username to the
user’s Twitter profile page and links hashtags to a Twitter search on that
hashtag.
Searching
The search() method provides searching in Twitter statuses:
The authenticate() method tests if user credentials
are valid:
if (!$twitter->authenticate()) {
die('Invalid name or password');
}
Other commands
You can use all commands defined by the Twitter API 1.1. For example, GET
statuses/retweets_of_me returns an array of the most recent tweets authored by
the authenticating user:
I recently participated in a discussion that reminded me
(again) of the deeply entrenched myths regarding the differences between HTML
and XHTML. The campaign for the formats with the letter “X” was accompanied
by great emotions, which usually do not go hand in hand with a clear head.
Although the enthusiasm has long since faded, a significant part of the
professional community and authors still believe a number of misconceptions.
In this article, I will attempt to bury the biggest of these myths in the
following way. This article will contain only facts. I will save my
opinions and your comments for a second article.
In the text below, by HTML I mean the version HTML 4.01, and by XHTML I mean the
version XHTML 1.0 Second Edition.
For completeness, I add that HTML is an application of the SGML
language, while XHTML is an application of the XML language.
Myth: HTML allows tag crossing
Not at all. Tag crossing is directly prohibited in SGML, and consequently in
HTML. This fact is mentioned, for example, in the W3C recommendation:
“…overlapping is illegal in SGML…”. All these markup languages
perceive the document as a tree structure, and therefore it is not possible to
cross tags.
I am also responding to a reformulation of the myth: “The advantage of
XHTML is the prohibition of crossing tags.” This is not the case; tags cannot
be crossed in any existing version of HTML or XHTML.
Myth: XHTML banned presentation elements and introduced CSS
Not at all. XHTML contains the same sort of elements as HTML 4.01. This is
mentioned right in the first
paragraph of the XHTML specification: “The meaning of elements and
their attributes is defined in the W3C recommendation for HTML 4.” From
this perspective, there is no difference between XHTML and HTML.
Some elements and attributes were already deprecated in HTML 4.01,
presentation elements among them, in favor of CSS. This also answers the
second part of the myth: cascading styles arrived before XHTML, not with it.
Myth: HTML parser must
guess tag endings
Not at all. In HTML, for a defined group of
elements, the ending or starting tag can optionally be omitted. This is for
elements where omitting the tag cannot cause ambiguity. As an example,
take the ending tag for the p element. Since the standard states
that a paragraph cannot be inside another paragraph, it is clear by
writing…
<p>....
<p>....
…that by opening the second paragraph, the first must close. Therefore,
stating the ending tag is redundant. However, for example, the div
element can be nested within itself, so both the starting and ending tags are
required.
Myth: HTML attribute
notation is ambiguous
Not at all. XHTML always requires enclosing attribute values in quotes or
apostrophes. HTML also requires
this, except if the value consists of an alphanumeric string. For
completeness, I add that even in these cases, the specification recommends
using quotes.
Thus, in HTML it is permissible to write
<textarea cols=20 rows=30>, which is formally as unambiguous
as <textarea cols="20" rows="30">. If the value contained
multiple words, HTML insists on using quotes.
Myth: HTML document is
ambiguous
Not at all. The reasons given for ambiguity are the possibility of crossing
tags and the ambiguity of unquoted attributes, both myths already debunked, or
else the possibility of omitting some tags. Here I repeat that the group of
elements whose tags can be omitted is chosen so that only redundant
information is left out.
Thus, an HTML document is always unambiguously determined.
Myth: Only in XHTML is the ‘&’ character written as ‘&amp;’
Not at all – it must be written that way in HTML as well. In both languages,
the characters < and & have a special meaning: the first opens a tag, the
second an entity. To prevent them from being understood in their meta-meaning,
they must be written as the entities &lt; and &amp;. This applies in HTML too,
as the specification states.
Myth: HTML
allows ‘messes’ that would not pass in XHTML
Not at all. This view is rooted in a series of myths that I have already
refuted above. I haven't yet mentioned that XHTML, unlike HTML, is case
sensitive for element and attribute names. However, this is a completely
legitimate feature of the language. In this way, Visual Basic differs from C#,
and it cannot objectively be said that one or the other approach is worse. HTML
code can be made confusing by inappropriately mixing upper and lower case
(<tAbLe>), XML code can also be confusing by using strings
like id, ID, Id for different
attributes.
The clarity of the notation in no way relates to the choice of one language
over the other.
Myth: Parsing XHTML is much
easier
Not at all. Comparing them would be subjective and therefore has no place in
this article, but objectively, there is no reason why one parser should have a
significantly easier time. Each has its own set of challenges.
Parsing HTML requires the parser to know the document type definition. The
first reason is the existence of optional tags.
Although their addition is unambiguous (see above) and algorithmically easy to
handle, the parser must know the respective definition. The second reason
concerns empty elements. That an element is empty is known to the parser only
from the definition.
Parsing XHTML is complicated by the fact that the document can (unlike HTML)
contain an internal DTD subset defining its own entities (see example). I add that
an “entity” does not have to represent a single character, but any lengthy
segment of XHTML code (possibly containing further entities). Without processing
the DTD and verifying its correctness, we cannot talk about parsing XHTML.
Furthermore, syntactically, DTD is essentially the opposite of XML language.
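For illustration, an internal DTD subset defining a custom entity might look like this (the entity name and its content are invented for the example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [
  <!-- internal subset: an entity may expand to a whole segment of markup -->
  <!ENTITY signature "Jan <em>Novak</em>, webmaster">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Internal subset example</title></head>
  <body><p>Written by &signature;.</p></body>
</html>
```

A conforming XHTML parser must read this DTD fragment and expand &signature; before it can build the document tree.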
In summary: both HTML and XHTML parsers must know the document type
definition. The XHTML parser additionally must be able to read it in DTD
language.
Myth: Parsing XHTML is much
faster
Given the syntactic similarity of the two languages, parsing speed is
determined only by the skill of the programmers of the individual parsers.
The time required for machine processing of a typical web page (whether HTML or
XHTML) on a regular computer is imperceptible to human perception.
Myth: HTML parser must always
cope
Not at all. The HTML specification does
not dictate how an application should behave in case of processing an
erroneous document. Due to competitive pressures in the real world, browsers
have become completely tolerant of faulty HTML documents.
It is different in the case of XHTML. The specification, by referring to XML
dictates
that the parser must not continue processing the logical structure of the
document in case of an error. Again, due to competitive pressures in the real
world, RSS readers have become tolerant of faulty XML documents (RSS is an
application of XML, just like XHTML).
If we were to deduce something negative about HTML from the tolerance of web
browsers, then we must necessarily deduce something negative about XML from
the tolerance of RSS readers. Objectively, the draconian approach
of XML to errors in documents is utopian.
Conclusion?
If your mind is no longer burdened by any of the myths mentioned above, you
can better perceive the difference between HTML and XHTML. Or rather, you can
better perceive that there is no difference. The real difference occurs a level
higher: it is the departure from SGML and the transition to the new XML.
Unfortunately, it cannot be said that XML only solves the problems of SGML
and adds no new ones. I have encountered two in this article alone. One of them
is the draconian processing of errors in XML, which is not in line with
practice, and the other is the existence of a different DTD language inside XML,
which complicates parsing and the understandability of XML documents. Moreover,
the expressive capability of this language is so small that it cannot formally
cover even XHTML itself, so some features must be defined separately. For a
language not bound by historical shackles, this is a sad and striking finding.
However, criticism of XML is a topic for a separate article.
(If I encounter more myths, I will gradually update the article. If you
want to refer to them, you can take advantage of the fact that each headline has
its own ID)
As you might know, web forms have to be styled with care, since
their native look is often the best you can achieve.
That said, sometimes even the default look has its bugs. A truly flagrant
mistake concerns buttons in Internet Explorer (including version 7) on Windows
XP. If the button's caption is too long, the browser produces this
nasty thing:
Texy2 is a huge leap forward. More polished, cleverer, highly
customizable. And above all – even sexier! Web application developers can
chuckle in contentment.
Initially, Texy2 wasn’t even supposed to be released. But let's not get
ahead of ourselves…
How Software Is Designed
The best analysis of a program is done by programming it. Only then do you
realize what you really need from it. And only then can you write it
perfectly.
I was aware of this while writing Texy 1. I didn’t want to write API
documentation, I didn’t translate the website into other languages. I knew
that was just a rehearsal for the real Texy.
The first version was hard work because I had to crack a ton of tough
problems and figure out how to even do it. It's no joke. For instance, you might say:
“Texy will insert non-breaking spaces between a preposition and a word.” And
one might think a regular expression that finds v lese and replaces
it with v lese (this time with a non-breaking space) would suffice.
But, can it handle this too:
v <strong>lese</strong>? Yes, a non-breaking space
belongs there too. Why wouldn’t it? Should we filter strings in angle
brackets? Okay, but what about this input:
v <span title="3 > 2">lese</span>
You'd suggest more cunning HTML tag filtering? Wait, but if there’s a
<br> tag, then the non-breaking space shouldn’t be there.
So no filtering, but analysis instead.
However, the precision of conversion is not the main attraction of Texy2.
Nope, that’s just a manifestation of maturing older ideas. The real bombshell
is the maximum customizability.
Texy is Flexible and Pliable
Now you can easily change the behavior of any document element. Need to build
a wiki over Texy2? I.e., control all the links on the page? It took me just a
few lines of code.
Need to generate content based on headings? Want to insert flash animations
using [* movie.swf *]? Want to automatically add a CSS class to all
phrases "hello .(description)"? You can! And extremely easily.
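For illustration, controlling links might look something like this. This is a sketch based on the Texy2 handler mechanism; the handler name follows the distribution, but the exact callback signature may differ, and the wiki URL is invented:

```php
$texy = new Texy();

// handle references like [MainPage] that are not defined in the document
$texy->addHandler('newReference', function ($invocation, $name) {
    // build a link into our hypothetical wiki
    $el = TexyHtml::el('a');
    $el->attrs['href'] = 'wiki.php?page=' . urlencode($name);
    $el->setText($name);
    return $el;
});

echo $texy->process('See [MainPage] for details.');
```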
Some solutions are found directly in the distribution, but mostly in the
documentation, which is not yet available 🙂 At least there’s a brief changelog. I’m sorry, I have such
important tasks now that there’s no time to write the manual. However, the
Texy website is now designed so that creating documentation does not have to
depend only on me.
Texy2 is Here
Texy2 wasn’t meant to be released. I realized that I had no motivation to
release my software as open source. It brings many limitations: at home
(i.e., in the Czech Republic) you won’t find much appreciation, everyone
pesters you with support requests, and you run into idiots. If it weren’t
for the Giraffe & co. at the last HBWBH, I would’ve
probably kept it to myself.
The revision released today with the beautiful number 111 is the
first official beta version of Texy2. Download it, play around, test it.