Finally, the Truth About XHTML and HTML

I recently participated in a discussion that reminded me (again) of the deeply entrenched myths regarding the differences between HTML and XHTML. The campaign for the formats with the letter “X” was accompanied by great emotions, which usually do not go hand in hand with a clear head. Although the enthusiasm has long since faded, a significant part of the professional community and authors still believe a number of misconceptions.

In this article, I will attempt to bury the biggest of these myths. It contains only facts; I will save my opinions and your comments for a second article.

In the text below, HTML means HTML 4.01 and XHTML means XHTML 1.0 Second Edition. For completeness, I add that HTML is an application of SGML, while XHTML is an application of XML.

Myth: HTML allows tag crossing

Not at all. Tag crossing is directly prohibited in SGML, and consequently in HTML. This fact is mentioned, for example, in the W3C recommendation: “…overlapping is illegal in SGML…”. All these markup languages perceive the document as a tree structure, and therefore it is not possible to cross tags.

I am also responding to a reformulation of the myth: “The advantage of XHTML is the prohibition of crossing tags.” This is not the case; tags cannot be crossed in any existing version of HTML or XHTML.

Myth: XHTML banned presentation elements and introduced CSS

Not at all. XHTML contains the same set of elements as HTML 4.01. This is stated right at the beginning of the XHTML specification: “The semantics of the elements and their attributes are defined in the W3C Recommendation for HTML 4.” From this perspective, there is no difference between XHTML and HTML.

Some elements and attributes were already deprecated in HTML 4.01. Presentation elements are deprecated in favor of CSS, which also answers the second part of the myth: the arrival of cascading style sheets is unrelated to XHTML, because it happened earlier.

Myth: HTML parser must guess tag endings

Not at all. In HTML, for a defined group of elements, the ending or starting tag can optionally be omitted. This is for elements where omitting the tag cannot cause ambiguity. As an example, take the ending tag for the p element. Since the standard states that a paragraph cannot be inside another paragraph, it is clear by writing…

<p>....
<p>....

…that by opening the second paragraph, the first must close. Therefore, stating the ending tag is redundant. However, for example, the div element can be nested within itself, so both the starting and ending tags are required.
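
The rule can be sketched in a few lines of Python. This is only an illustration built on the standard library's html.parser, not a conforming HTML parser: the class name ImpliedParagraphEnd and the function normalize are made up for this example, attributes are dropped for brevity, and the only rule it knows is "a new p closes an open p".

```python
from html.parser import HTMLParser


class ImpliedParagraphEnd(HTMLParser):
    """Rebuilds the markup and inserts the </p> tags the author omitted."""

    def __init__(self):
        super().__init__()
        self.open_p = False  # is a <p> currently open?
        self.out = []        # pieces of the normalized markup

    def handle_starttag(self, tag, attrs):
        # Simplified rule: a paragraph cannot contain another paragraph,
        # so a new <p> implies the end of the one that is still open.
        if tag == "p":
            if self.open_p:
                self.out.append("</p>")
            self.open_p = True
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_p = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()  # flush any buffered text first
        if self.open_p:
            self.out.append("</p>")
            self.open_p = False


def normalize(markup):
    parser = ImpliedParagraphEnd()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling normalize("<p>one<p>two") yields the markup with both end tags written out, while nested div elements pass through unchanged because nothing about them is implied.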

Myth: HTML attribute notation is ambiguous

Not at all. XHTML always requires enclosing attribute values in quotes or apostrophes. HTML also requires this, except when the value consists only of letters, digits, hyphens, periods, underscores, and colons. For completeness, I add that even in these cases, the specification recommends using quotes.

Thus, in HTML it is permissible to write <textarea cols=20 rows=30>, which is formally as unambiguous as <textarea cols="20" rows="30">. If the value contained multiple words, HTML insists on using quotes.
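
As a quick check, Python's html.parser (used here only as a convenient stand-in for an HTML tokenizer) reports the same attributes for both notations. The names AttributeCollector and parse_attrs are made up for this sketch.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def parse_attrs(markup):
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


unquoted = parse_attrs("<textarea cols=20 rows=30>")
quoted = parse_attrs('<textarea cols="20" rows="30">')
print(unquoted == quoted)
```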

Myth: HTML document is ambiguous

Not at all. The reasons given for ambiguity are either the possibility of crossing tags, ambiguity in writing attributes without quotes, which are already debunked myths, or also the possibility of omitting some tags. Here I repeat that the group of elements where tags can be omitted is chosen so as to omit only redundant information.

Thus, an HTML document is always unambiguously determined.

Myth: Only in XHTML is the ‘&’ character written as ‘&amp;’

Not at all. It must also be written that way in HTML. In both languages, the characters < and & have a special meaning: the first opens a tag and the second opens an entity. To prevent them from being interpreted in this meta-meaning, they must be written as the entities &lt; and &amp;. The HTML specification says so as well.
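
For illustration, this is how text is typically escaped before being placed into a document of either language. The sketch uses escape and unescape from Python's standard html module; the sample string is made up.

```python
from html import escape, unescape

text = "Tom & Jerry <live>"

# quote=False leaves quotation marks alone; only &, < and > are replaced.
encoded = escape(text, quote=False)
print(encoded)

# Decoding the entities gives back the original text.
decoded = unescape(encoded)
```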

Myth: HTML allows ‘messes’ that would not pass in XHTML

Not at all. This view is rooted in a series of myths that I have already refuted above. I haven't yet mentioned that XHTML, unlike HTML, is case sensitive for element and attribute names. However, this is a completely legitimate feature of the language. In this way, Visual Basic differs from C#, and it cannot objectively be said that one or the other approach is worse. HTML code can be made confusing by inappropriately mixing upper and lower case (<tAbLe>), XML code can also be confusing by using strings like id, ID, Id for different attributes.
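
The difference in case handling can be observed with the two parsers in Python's standard library. They are used here only as examples of an HTML and an XML parser, and the helper names are made up for this sketch.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagNames(HTMLParser):
    """Collects element names as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def html_tag_names(markup):
    parser = TagNames()
    parser.feed(markup)
    parser.close()
    return parser.names


def xml_attribute_names(markup):
    # In XML, names that differ only in case are different names.
    return sorted(ET.fromstring(markup).attrib)


html_names = html_tag_names("<tAbLe><TR><td>")
xml_names = xml_attribute_names('<p id="a" ID="b" Id="c"/>')
```

The HTML parser folds every element name to lower case, whereas the XML parser keeps id, ID and Id as three distinct attributes on the same element.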

The clarity of the notation in no way relates to the choice of one language over the other.

Myth: Parsing XHTML is much easier

Not at all. Comparing them would be subjective and therefore has no place in this article, but objectively, there is no reason why one parser should have a significantly easier time. Each has its own set of challenges.

Parsing HTML requires the parser to know the document type definition. The first reason is the existence of optional tags. Although filling them in is unambiguous (see above) and algorithmically easy, the parser must know the respective definition. The second reason concerns empty elements: the parser can learn that an element is empty only from the definition.
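
The point about empty elements can be shown with a small sketch. The set below lists the elements declared EMPTY in the HTML 4.01 DTDs; the class OpenElements and the function unclosed are made up for this example, and the stack handling is deliberately naive (it ignores optional tags entirely).

```python
from html.parser import HTMLParser

# Elements declared EMPTY in the HTML 4.01 DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class OpenElements(HTMLParser):
    """Tracks which elements are still open, given a table of empty ones."""

    def __init__(self, empty_elements):
        super().__init__()
        self.empty = empty_elements
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # An empty element has no content and no end tag, so it is never pushed.
        if tag not in self.empty:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def unclosed(markup, empty_elements):
    parser = OpenElements(empty_elements)
    parser.feed(markup)
    parser.close()
    return parser.stack


SAMPLE = '<div><img src="a.png"><br>text</div>'
with_dtd_knowledge = unclosed(SAMPLE, EMPTY_ELEMENTS)
without_dtd_knowledge = unclosed(SAMPLE, set())
```

With the table, the sample is fully closed. Without it, the same code treats img and br as containers that never end, so the closing div no longer matches anything.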

Parsing XHTML is complicated by the fact that the document can (unlike HTML) contain an internal DTD subset that defines its own entities. I add that an “entity” does not have to represent a single character; it can stand for an arbitrarily long segment of XHTML code (possibly containing further entities). Without processing the DTD and verifying its correctness, we cannot talk about parsing XHTML. Furthermore, the syntax of the DTD language is entirely unlike the syntax of XML itself.
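
A minimal sketch of such a document, parsed with Python's xml.etree.ElementTree. The document is simplified (no namespace, no public identifier) and the entity names author and signature are made up; the second entity contains markup and refers to the first.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<!DOCTYPE html [
  <!ENTITY author "David">
  <!ENTITY signature "<em>&author;</em>">
]>
<html>
  <body><p>Written by &signature;.</p></body>
</html>"""

root = ET.fromstring(DOCUMENT)
paragraph = root.find("body/p")
emphasis = paragraph.find("em")

# The parser had to read the internal subset to build this tree:
# &signature; expanded into an <em> element, and &author; into its text.
print(paragraph.text, emphasis.text, emphasis.tail)
```

A parser that skipped the internal subset could not know what &signature; stands for, and therefore could not build the tree at all.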

In summary: both HTML and XHTML parsers must know the document type definition. The XHTML parser additionally must be able to read it in DTD language.

Myth: Parsing XHTML is much faster

Not at all. Given the syntactic similarity of the two languages, the speed of parsing is determined only by the skill of the programmers who wrote the individual parsers. The time required to process a typical web page (whether HTML or XHTML) on a regular computer is imperceptible to a human.

Myth: HTML parser must always cope

Not at all. The HTML specification does not dictate how an application should behave in case of processing an erroneous document. Due to competitive pressures in the real world, browsers have become completely tolerant of faulty HTML documents.

It is different in the case of XHTML. The specification, by referring to XML, dictates that the parser must not continue processing the logical structure of the document once it encounters an error. Again, due to competitive pressures in the real world, RSS readers have become tolerant of faulty XML documents (RSS is an application of XML, just like XHTML).
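
The two attitudes to errors can be seen side by side in Python's standard library. Note that html.parser is just one tolerant tokenizer, not a browser, and that the helper names below are made up for this sketch.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

BROKEN = "<p><b>unclosed bold</p>"


class EventLog(HTMLParser):
    """Records what the tolerant HTML parser reports for the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def html_events(markup):
    log = EventLog()
    log.feed(markup)
    log.close()
    return log.events


def xml_accepts(markup):
    # An XML parser must stop at the first well-formedness error.
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


tolerant_result = html_events(BROKEN)
strict_result = xml_accepts(BROKEN)
```

The HTML parser reports everything it saw and leaves recovery to the application; the XML parser refuses the same input outright.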

If we were to deduce something negative about HTML from the tolerance of web browsers, then we must necessarily deduce something negative about XML from the tolerance of RSS readers. Objectively, the draconian approach of XML to errors in documents is utopian.

Conclusion?

If your mind is no longer burdened by any of the myths mentioned above, you can better perceive the difference between HTML and XHTML. Or rather, you can better perceive that there is no difference. The real difference occurs a level higher: it is the departure from SGML and the transition to the new XML.

Unfortunately, it cannot be said that XML only solves the problems of SGML and adds no new ones. I have touched on two in this article alone. One of them is the draconian processing of errors in XML, which is not in line with practice, and the other is the existence of a different language, DTD, inside XML, which complicates parsing and makes XML documents harder to understand. Moreover, the expressive power of this language is so small that it cannot formally cover even XHTML itself, so some features must be defined separately. For a language not bound by historical shackles, this is a sad and striking finding. However, criticism of XML is a topic for a separate article.

(If I encounter more myths, I will gradually update the article. If you want to refer to them, you can take advantage of the fact that each headline has its own ID)