Let's once and for all crack this eternal question that divides the programming community. I decided to dive into the dark waters of regular expressions to bring an answer (spoiler: yes, it's possible).
So, what exactly does an HTML document contain? It's a mix of text, entities, tags, comments, and the special doctype tag. Let's first explore each ingredient separately.
Entities
The foundation of an HTML page is text, which consists of ordinary characters
and special sequences called HTML entities. These can be either named, like
for a non-breaking space, or numerical, either in
decimal  
or hexadecimal  
format.
A regular expression capturing an HTML entity would look like this:
(?<entity>
&
(
[a-z][a-z0-9]+ # named entity
|
\#\d+ # decimal number
|
\#x[0-9a-f]+ # hexadecimal number
)
;
)
All regular expressions are written in extended mode, ignore case, and a
dot represents any character. That is, the modifier six
.
Tags
These iconic elements make HTML what it is. A tag starts with
<
, followed by the tag name, possibly a set of attributes, and
closes with >
or />
. Attributes can optionally
have a value, which can be enclosed in double, single, or no quotes. A regular
expression capturing an attribute would look like this:
(?<attribute>
\s+ # at least one white space before the attribute
[^\s"'<>=`/]+ # attribute name
(
\s* = \s* # equals sign before the value
(
" # value enclosed in double quotes
(
[^"] # any character except double quote
|
(?&entity) # or HTML entity
)*
"
|
' # value enclosed in single quotes
(
[^'] # any character except single quote
|
(?&entity) # or HTML entity
)*
'
|
[^\s"'<>=`]+ # value without quotes
)
)? # value is optional
)
Notice that I am referring to the previously defined
entity
group.
Elements
An element can represent either a standalone tag (so-called void element) or paired tags. There is a fixed list of void element names by which they are recognized. A regular expression for capturing them would look like this:
(?<void_element>
< # start of the tag
( # element name
img|hr|br|input|meta|area|embed|keygen|source|base|col
|link|param|basefont|frame|isindex|wbr|command|track
)
(?&attribute)* # optional attributes
\s*
/? # optional /
> # end of the tag
)
Other tags are thus paired and captured by this regular expression (I use a
reference to the content
group, which we will define later):
(?<element>
< # starting tag
(?<element_name>
[a-z][^\s/>]* # element name
)
(?&attribute)* # optional attributes
\s*
> # end of the starting tag
(?&content)*
</ # ending tag
(?P=element_name) # repeat element name
\s*
> # end of the ending tag
)
A special case is elements like <script>
, whose content
must be processed differently from other elements:
(?<special_element>
< # starting tag
(?<special_element_name>
script|style|textarea|title # element name
)
(?&attribute)* # optional attributes
\s*
> # end of the starting tag
(?> # atomic group
.*? # smallest possible number of any characters
</ # ending tag
(?P=special_element_name)
)
\s*
> # end of the ending tag
)
The lazy quantifier .*?
ensures that the expression stops at the
first ending sequence, and the atomic group ensures that this stop is
definitive.
Comments
A typical HTML comment starts with the sequence <!--
and
ends with -->
. A regular expression for HTML comments might
look like this:
(?<comment>
<!--
(?> # atomic group
.*? # smallest possible number of any characters
-->
)
)
The lazy quantifier .*?
again ensures that the expression stops
at the first ending sequence, and the atomic group ensures that this stop is
definitive.
Doctype
This is a historical relic that exists today only to switch the browser to
so-called standard mode. It usually looks like
<!doctype html>
, but can contain other characters as well.
Here is the regular expression that captures it:
(?<doctype>
<!doctype
\s
[^>]* # any character except '>'
>
)
Putting It All Together
With the regular expressions ready for each part of HTML, it's time to create an expression for the entire HTML 5 document:
\s*
(?&doctype)? # optional doctype
(?<content>
(?&void_element) # void element
|
(?&special_element) # special element
|
(?&element) # paired element
|
(?&comment) # comment
|
(?&entity) # entity
|
[^<] # character
)*
We can combine all the parts into one complex regular expression. This is it, a superhero among regular expressions with the ability to parse HTML 5.
Final Notes
Even though we have shown that HTML 5 can be parsed using regular expressions, the provided example is not useful for processing an HTML document. It will fail on invalid documents. It will be slow. And so on. In practice, regular expressions like the following are more commonly used (for finding URLs of images):
<img.+?src=["'](.+?)["'].*?>
But this is a very unreliable solution that can lead to errors. This regexp
incorrectly matches custom tags
such as <imgs-tag src="image.jpg">
, custom attributes like
<img data-src="custom info">
, or fails when the attribute
contains a quote <img src="mcdonald's.jpg">
. Therefore, it is
recommended to use specialized libraries. In the world of PHP, we're unlucky
because the DOM extension supports only the ancient, decaying HTML
4. Fortunately, PHP 8.4 promises an HTML 5 parser.
Leave a comment