Can Regular Expressions Be Used to Parse HTML?

Let's settle, once and for all, the eternal question that divides the programming community. I dove into the dark waters of regular expressions to bring back an answer (spoiler: yes, it's possible).

So, what exactly does an HTML document contain? It's a mix of text, entities, tags, comments, and the special doctype tag. Let's first explore each ingredient separately.

Entities

The foundation of an HTML page is text, which consists of ordinary characters and special sequences called HTML entities. These can be either named, like &nbsp; for a non-breaking space, or numeric, in either decimal (&#160;) or hexadecimal (&#xA0;) format. A regular expression capturing an HTML entity would look like this:

(?<entity>
	&
	(
		[a-z][a-z0-9]+     # named entity
		|
		\#\d+              # decimal number
		|
		\#x[0-9a-f]+       # hexadecimal number
	)
	;
)

All regular expressions here are written in extended mode, ignore case, and let the dot match any character, including a newline. That is, they use the modifiers s, i, and x (six).
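As a quick check of the entity pattern and the three modifiers, here is a minimal sketch using Python's standard `re` module. It assumes the Python spelling `(?P<entity>...)` for the named group; the helper name `find_entities` and the sample string are made up for illustration.

```python
import re

# The entity pattern from above, compiled with the s, i and x modifiers.
ENTITY = re.compile(r"""
    (?P<entity>
        &
        (?:
            [a-z][a-z0-9]+     # named entity
            |
            \#\d+              # decimal number
            |
            \#x[0-9a-f]+       # hexadecimal number
        )
        ;
    )
""", re.S | re.I | re.X)


def find_entities(text):
    """Return every HTML entity found in text, in order of appearance."""
    return [m.group("entity") for m in ENTITY.finditer(text)]


if __name__ == "__main__":
    # The bare "&" in "AT&T" is not followed by a complete entity, so it is skipped.
    print(find_entities("Tom&nbsp;&amp;&nbsp;Jerry &#169; &#xA9; AT&T"))
    # → ['&nbsp;', '&amp;', '&nbsp;', '&#169;', '&#xA9;']
```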

Elements

An element is either a single standalone tag (a so-called void element) or a pair of tags. Void elements are recognized by name, from a fixed list. A regular expression for capturing them would look like this:

(?<void_element>
	<                  # start of the tag
	(                  # element name
		img|hr|br|input|meta|area|embed|keygen|source|base|col
		|link|param|basefont|frame|isindex|wbr|command|track
	)
	(?&attribute)*     # optional attributes
	\s*
	/?                 # optional /
	>                  # end of the tag
)

All other tags are paired and are captured by this regular expression (it references the content group, which we will define later):

(?<element>
	<                  # starting tag
	(?<element_name>
		[a-z][^\s/>]*  # element name
	)
	(?&attribute)*     # optional attributes
	\s*
	>                  # end of the starting tag
	(?&content)*
	</                 # ending tag
	(?P=element_name)  # repeat element name
	\s*
	>                  # end of the ending tag
)

A special case is elements like <script>, whose content must be processed differently from other elements:

(?<special_element>
	<                  # starting tag
	(?<special_element_name>
		script|style|textarea|title  # element name
	)
	(?&attribute)*     # optional attributes
	\s*
	>                  # end of the starting tag
	(?>                # atomic group
		.*?            # smallest possible number of any characters
		</             # ending tag
		(?P=special_element_name)
	)
	\s*
	>                  # end of the ending tag
)

The lazy quantifier .*? ensures that the expression stops at the first ending sequence, and the atomic group ensures that this stop is definitive.
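To see the lazy quantifier and the backreference at work, here is a rough sketch in Python's standard `re` module. It is a simplification, not the pattern above: `[^>]*` is a crude stand-in for the attribute group, and the atomic group is left out because Python's `re` supports `(?>...)` only from version 3.11. The names `RAW_TEXT_ELEMENT` and `raw_text_elements` are invented for this example.

```python
import re

RAW_TEXT_ELEMENT = re.compile(r"""
    <                                        # starting tag
    (?P<name> script|style|textarea|title )  # element name
    (?= [\s>] )                              # the name must end here
    [^>]*                                    # assumption: crude stand-in for attributes
    >                                        # end of the starting tag
    (?P<body> .*? )                          # smallest possible number of any characters
    </ (?P=name) \s* >                       # first ending tag with the same name
""", re.S | re.I | re.X)


def raw_text_elements(html):
    """Return (name, body) pairs for script-like elements found in html."""
    return [(m.group("name"), m.group("body"))
            for m in RAW_TEXT_ELEMENT.finditer(html)]


if __name__ == "__main__":
    html = ('<title>A & B</title><p>x</p>'
            '<script type="text/javascript">if (a < b) { s = "</p>"; }</script>')
    for name, body in raw_text_elements(html):
        print(name, "->", body)
```

Note that the `<` and `</p>` inside the script body are returned as plain text: the body ends only at the first `</script>`, which is the different processing the special-element pattern is meant to capture.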

Comments

A typical HTML comment starts with the sequence <!-- and ends with the sequence -->. A regular expression for HTML comments might look like this:

(?<comment>
	<!--
	(?>           # atomic group
		.*?       # smallest possible number of any characters
		-->
	)
)

The lazy quantifier .*? again ensures that the expression stops at the first ending sequence, and the atomic group ensures that this stop is definitive.
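The effect of the lazy quantifier is easy to observe by comparing it with a greedy one. The following is a small sketch in Python's standard `re` module; the atomic group is omitted (Python's `re` supports it only from 3.11), and the function name `find_comments` and its `lazy` parameter are assumptions made for this example.

```python
import re

# re.S lets the dot match newlines, so multi-line comments are covered.
LAZY_COMMENT = re.compile(r"<!--.*?-->", re.S)
GREEDY_COMMENT = re.compile(r"<!--.*-->", re.S)


def find_comments(html, lazy=True):
    """Return all comment matches, using the lazy or the greedy pattern."""
    pattern = LAZY_COMMENT if lazy else GREEDY_COMMENT
    return pattern.findall(html)


if __name__ == "__main__":
    sample = "a<!-- one -->b<!-- two\nlines -->c"
    print(find_comments(sample))              # two separate comments
    print(find_comments(sample, lazy=False))  # one match swallowing the "b" in between
```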

Doctype

This is a historical relic that exists today only to switch the browser into so-called standards mode. It usually looks like <!doctype html>, but can contain other characters as well. Here is the regular expression that captures it:

(?<doctype>
	<!doctype
	\s
	[^>]*         # any character except '>'
	>
)

Putting It All Together

With the regular expressions ready for each part of HTML, it's time to create an expression for the entire HTML 5 document:

\s*
(?&doctype)?              # optional doctype
(?<content>
	(?&void_element)      # void element
	|
	(?&special_element)   # special element
	|
	(?&element)           # paired element
	|
	(?&comment)           # comment
	|
	(?&entity)            # entity
	|
	[^<]                  # character
)*

We can combine all the parts into one complex regular expression. This is it, a superhero among regular expressions with the ability to parse HTML 5.

Final Notes

Even though we have shown that HTML 5 can be parsed using regular expressions, the provided example is not useful for processing an HTML document. It will fail on invalid documents. It will be slow. And so on. In practice, regular expressions like the following are more commonly used (for finding URLs of images):

<img.+?src=["'](.+?)["'].*?>

But this is a very unreliable solution that can lead to errors. This regexp incorrectly matches custom tags such as <imgs-tag src="image.jpg">, picks up custom attributes like <img data-src="custom info">, and fails when the attribute value contains a quote, as in <img src="mcdonald's.jpg">. Therefore, it is recommended to use specialized libraries. In the world of PHP, we're unlucky because the DOM extension supports only the ancient, decaying HTML 4. Fortunately, PHP 8.4 promises an HTML 5 parser.
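The three failure cases can be reproduced directly. The sketch below uses Python rather than PHP, with the standard library's `html.parser` as a stand-in for a specialized library; that choice, and the names `regex_img_src`, `ImgSrcCollector` and `parser_img_srcs`, are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# The commonly used (and unreliable) pattern from the text.
IMG_SRC = re.compile(r"""<img.+?src=["'](.+?)["'].*?>""")


def regex_img_src(html):
    """Return the first src the regex finds, or None."""
    m = IMG_SRC.search(html)
    return m.group(1) if m else None


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of real <img> tags only."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def parser_img_srcs(html):
    """Return the src of every <img> tag, using an HTML parser."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    cases = [
        '<img src="image.jpg">',         # the intended case
        '<imgs-tag src="image.jpg">',    # custom tag, wrongly matched by the regex
        '<img data-src="custom info">',  # custom attribute, wrongly matched
        '<img src="mcdonald\'s.jpg">',   # the quote cuts the regex result short
    ]
    for html in cases:
        print(html, "| regex:", regex_img_src(html), "| parser:", parser_img_srcs(html))
```

For the last case the regex returns only `mcdonald`, while the parser returns the full `mcdonald's.jpg` and ignores the custom tag and the custom attribute.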