On Writing a Streaming HTML5 Parser in PHP

One of my biggest complaints about the Swytch Framework is that it uses a validating HTML5 parser. That is great if you want to sanitize or validate HTML, but for Swytch it isn’t ideal.

Now, if you haven’t heard of the Swytch Framework, it’s essentially JSX for PHP. You write reusable components that render to HTML and, with the power of htmx, you can build very interactive websites. Because I was using a validating parser, the following HTML would result in some weird HTML being sent to the browser:

<input type='hidden' name='json' value='{"obj": "value"}' />
<!-- would become -->
<input type="hidden" name="json" value="{"obj": "value"}" />

That was fun.

After digging through several libraries, I found that most of them were dedicated to querying HTML. After all, who creates a template language out of HTML?

So, it was off to the HTML spec and writing an HTML parser.

As the spec mentions, there is a bunch of logic you can skip if you aren’t doing any validation or rendering. So I kick a bunch of error detection and handling to the browser; I just have to recover properly according to the spec. I decided to go with a streaming parser approach instead of creating a DOM. This removes a bunch of complexity, but it introduces some interesting challenges:

  1. I need to be able to capture fragments of HTML, based on an id attribute.
  2. I need to be able to scope DataProviders to their closing tag.
  3. Components will output HTML which needs to be able to update the document and continue parsing.

At first blush, these three requirements are all very different, but they can be distilled down to two simple requirements:

  1. I need to be able to replace portions of the document.
  2. I need to be able to execute some code when a certain portion of code has completed parsing.

With these two technical requirements, we can handle our three business requirements and provide enough flexibility that we can support virtually any future requirements as well, including things we can’t even imagine right now.
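To make those two requirements concrete, here is a minimal sketch. It is not the code that ships in Swytch; the StreamingDocument class and its method names are made up for illustration, and I’m assuming the parser calls open() and close() as it enters and leaves elements.

```php
<?php
// Hypothetical sketch, not the Swytch Framework's actual API.
final class StreamingDocument
{
    /** @var array<int, list<callable>> callbacks keyed by nesting depth */
    private array $onClose = [];
    private int $depth = 0;

    public function __construct(public string $html) {}

    // Requirement 1: replace a portion of the document.
    public function replace(int $start, int $length, string $replacement): void
    {
        $this->html = substr_replace($this->html, $replacement, $start, $length);
    }

    // Assumed to be called by the parser when an element opens.
    public function open(): void
    {
        $this->depth++;
    }

    // Requirement 2: run some code when the current element finishes parsing.
    public function onClose(callable $callback): void
    {
        $this->onClose[$this->depth][] = $callback;
    }

    // Assumed to be called by the parser when the matching closing tag is seen.
    public function close(): void
    {
        foreach ($this->onClose[$this->depth] ?? [] as $callback) {
            $callback();
        }
        unset($this->onClose[$this->depth]);
        $this->depth--;
    }
}

$doc = new StreamingDocument('<p>Hello <b>world</b></p>');
$log = [];

$doc->open();
$doc->onClose(function () use (&$log) {
    $log[] = 'scope closed';
});
$doc->replace(9, 12, '<i>there</i>'); // swap out "<b>world</b>"
$doc->close();
```

Fragment capture, DataProvider scoping, and component output can all be layered on top of something shaped like this: replace() handles the rewriting, and onClose() handles anything that has to happen at the end of a tag.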

Parsing

Parsing an HTML5 document is relatively straightforward. It’s essentially a state machine: you read a character or two (or seven), and then decide what the next state is going to be. However, we are dealing with a document that likely includes user-submitted data, and that data could contain malicious markup that causes problems for users. So we use an “Escaper”, which scans the entire document for data marked as untrustworthy and replaces it with __BLOB_ID__, where ID is a unique ID for that blob.
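Here is a rough sketch of the placeholder idea. How data gets “marked as untrustworthy” is an assumption on my part for this example: I wrap it in two control bytes so a later scan can find it. The class shape, the mark() helper, and the numeric IDs are all hypothetical.

```php
<?php
// Hypothetical sketch: the marker bytes and method names are assumptions,
// not the Swytch Framework's actual implementation.
final class Escaper
{
    private const START = "\x02"; // assumed "untrusted data begins" marker
    private const END = "\x03";   // assumed "untrusted data ends" marker

    /** @var array<string, string> placeholder => original untrusted data */
    public array $blobs = [];

    // Wrap untrusted data so that a later scan can find it.
    public function mark(string $untrusted): string
    {
        // Strip our own markers so the data cannot end its span early.
        $clean = str_replace([self::START, self::END], '', $untrusted);
        return self::START . $clean . self::END;
    }

    // Scan the whole document and swap every marked span for a placeholder.
    public function escape(string $document): string
    {
        $pattern = '/' . self::START . '(.*?)' . self::END . '/s';
        return preg_replace_callback($pattern, function (array $m): string {
            $placeholder = '__BLOB_' . count($this->blobs) . '__';
            $this->blobs[$placeholder] = $m[1];
            return $placeholder;
        }, $document);
    }
}

$escaper = new Escaper();
$html = '<p>Hi ' . $escaper->mark('<script>alert(1)</script>') . '</p>';
$safe = $escaper->escape($html);
```

After this pass, the HTML state machine only ever sees an inert placeholder where the untrusted data used to be, so nothing a user submits can open or close a tag behind the parser’s back.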

Once our document is escaped, we begin the HTML state machine. At each step along the way, we contextually escape each __BLOB_ID__ placeholder, as needed. When we encounter a tag name that is registered as a component, we stop processing Swytch-specific code.
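“Contextually escape” is easier to show than to describe. The sketch below is a deliberately tiny state machine (nowhere near the full HTML5 tokenizer) that tracks whether the cursor is in text, inside a tag, or inside a quoted attribute value, and escapes each placeholder accordingly. The restoreBlobs() function and the shape of the $blobs map (placeholder => untrusted data) are assumptions for illustration.

```php
<?php
// Hypothetical sketch: a four-state machine, not the real HTML5 tokenizer.
function restoreBlobs(string $html, array $blobs): string
{
    $out = '';
    $state = 'text'; // text | tag | dq (double-quoted value) | sq (single-quoted value)
    $len = strlen($html);

    for ($i = 0; $i < $len; $i++) {
        // Placeholder at the cursor: escape the blob for the current context.
        if ($html[$i] === '_'
            && preg_match('/\G__BLOB_\d+__/', $html, $m, 0, $i)
            && isset($blobs[$m[0]])
        ) {
            // Text nodes can keep their quotes; attribute values cannot.
            $flags = $state === 'text' ? ENT_NOQUOTES : ENT_QUOTES;
            $out .= htmlspecialchars($blobs[$m[0]], $flags | ENT_HTML5, 'UTF-8');
            $i += strlen($m[0]) - 1;
            continue;
        }

        // Otherwise, read one character and decide what the next state is.
        $c = $html[$i];
        $state = match (true) {
            $state === 'text' && $c === '<' => 'tag',
            $state === 'tag' && $c === '>' => 'text',
            $state === 'tag' && $c === '"' => 'dq',
            $state === 'tag' && $c === "'" => 'sq',
            $state === 'dq' && $c === '"' => 'tag',
            $state === 'sq' && $c === "'" => 'tag',
            default => $state,
        };
        $out .= $c;
    }

    return $out;
}

$blobs = [
    '__BLOB_0__' => '{"obj": "value"}',
    '__BLOB_1__' => '<b>hi</b>',
];
$restored = restoreBlobs('<input value="__BLOB_0__"><p>__BLOB_1__</p>', $blobs);
```

In this example the JSON lands inside a double-quoted attribute, so its quotes come out as &quot;, while the second blob lands in a text node, where only the angle brackets need escaping. A real implementation would also have to deal with unquoted attribute values, comments, and script or style content, which this sketch ignores.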

It’s important to note that there are several processes running on the document at any given time.

  1. The HTML parser.
  2. The escaper.
  3. Swytch components.

So, once we find our first component, we mark the start of the tag and stop escaping and processing any further components (though we do track nesting) until we reach the matching closing tag. Once we find the closing tag, we can capture the tag’s children, process the tag’s attributes, create the matching component, and call its render() function.

If the component is a DataProvider, we set up a callback that deregisters the provider’s scope at the end of the tag. If it is a RewritingTag, we set up something similar.

At this point, we capture the output from the render() function, replace any <children/> tags with the previously captured children, and replace the entire tag (unless it is a container component, like a form component). After this, we simply move the parsing cursor back to the start of the component and re-enable the escape and Swytch component parsing.
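The capture, render, replace, and rewind loop can be sketched roughly like this. It is a simplification under several assumptions: components are plain callables keyed by tag name, attributes are ignored, container components are left out, and tags are found with regular expressions rather than the real state machine. The expandComponents() name is hypothetical.

```php
<?php
// Hypothetical sketch: $components maps a tag name to a render callable.
// Attribute handling and container components are left out for brevity.
function expandComponents(string $html, array $components): string
{
    $cursor = 0;
    while (preg_match('/<([a-zA-Z][\w-]*)[^>]*>/', $html, $m, PREG_OFFSET_CAPTURE, $cursor)) {
        [$openTag, $start] = $m[0];
        $name = $m[1][0];
        if (!isset($components[$name])) {
            $cursor = $start + strlen($openTag); // not a component: keep going
            continue;
        }

        // Track nesting of the same tag until we reach the matching closing tag.
        $depth = str_ends_with($openTag, '/>') ? 0 : 1;
        $pos = $start + strlen($openTag);
        $childrenStart = $childrenEnd = $pos;
        $tagPattern = '/<(\/?)' . preg_quote($name, '/') . '(?=[\s\/>])[^>]*>/';
        while ($depth > 0 && preg_match($tagPattern, $html, $t, PREG_OFFSET_CAPTURE, $pos)) {
            [$tag, $tagStart] = $t[0];
            $pos = $tagStart + strlen($tag);
            if (str_ends_with($tag, '/>')) {
                continue; // nested self-closing tag: depth is unchanged
            }
            $depth += $t[1][0] === '/' ? -1 : 1;
            $childrenEnd = $tagStart;
        }
        if ($depth > 0) {
            break; // unclosed component: leave it for the browser to sort out
        }

        // Capture the children, render, and splice the output into the document.
        $children = substr($html, $childrenStart, $childrenEnd - $childrenStart);
        $rendered = str_replace('<children/>', $children, $components[$name]());
        $html = substr_replace($html, $rendered, $start, $pos - $start);

        // Move the cursor back to the start so the new output is parsed too.
        $cursor = $start;
    }

    return $html;
}

$components = [
    'Card' => fn () => '<div class="card"><children/></div>',
    'Badge' => fn () => '<span>new</span>',
];
$out = expandComponents('<p>x</p><Card><Badge/> hello</Card>', $components);
```

Because the cursor rewinds to the start of the replaced tag, the Badge that arrived as a child of Card gets expanded on the next pass without any special handling. The flip side is that a component which renders itself would loop forever in this sketch, so real code needs a guard for that.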

It’s actually pretty simple. And fast.

With JIT enabled and warmed up, it can parse around 6mbps of HTML. That’s not going to be winning any awards or anything, but for PHP, which requires copying a string just to modify it, this isn’t too terrible.

There was some talk on the internals list a few months ago, about potentially allowing in-place string modifications. I don’t remember the outcome of that conversation, but that would greatly speed up parsing; especially on larger documents.

Optimizations

I’m sure there are a ton of things that can be improved. I can think of some low-hanging fruit, especially in how some of the string handling is done; it could be better. One of the biggest reasons I haven’t started optimizing yet is that I need more real-world projects using it to show me where to focus, and I haven’t upgraded all the existing projects.

Speaking of ‘real projects’…

I’m planning on starting a small tutorial series on YouTube demonstrating the power of this framework (including some very interesting capabilities with WASM), so if you are interested, please subscribe to my channel. I also occasionally stream some silly games there as well, so stay tuned until I create a proper channel for this content.

Until next time,

Rob
