In the last post I mentioned there were a few topics we still needed to close out. The two we’ve left undone are popping the attribute information off the stack when we hit a closing tag and dealing with the gap that normally appears between paragraph elements.
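To make the push-and-pop idea concrete, here is a minimal sketch. It is written in Python rather than .NET purely for illustration, and the `TAG_STYLES` table and `collect_runs` function are hypothetical names, not the code from this series. It pushes an attribute when a tag opens and pops it when the same tag closes, so each run of text is paired with whatever is on the stack at that moment.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from tag name to the attribute that tag contributes.
TAG_STYLES = {"b": "bold", "strong": "bold", "i": "italic", "em": "italic"}

def collect_runs(elem, stack, runs):
    """Walk the element tree, recording (text, active attributes) pairs."""
    style = TAG_STYLES.get(elem.tag)
    if style:
        stack.append(style)          # opening tag: push
    if elem.text:
        runs.append((elem.text, tuple(stack)))
    for child in elem:
        collect_runs(child, stack, runs)
        if child.tail:               # text that follows the child's closing tag
            runs.append((child.tail, tuple(stack)))
    if style:
        stack.pop()                  # closing tag: pop
    return runs

root = ET.fromstring("<p>plain <b>bold <i>both</i> bold again</b> plain again</p>")
runs = collect_runs(root, [], [])
for text, styles in runs:
    print(repr(text), styles)
```

The important detail is that the text after `</i>` is recorded only after the pop, so it goes back to being bold rather than bold and italic.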
Last week we parsed the HTML and created code that keeps track of the various attributes we are going to need when we create the PDF. Today we will finish the code and create the Elements that we can include in our PDF document.
One consideration we will need to keep in mind as we write out the PDF is that we have pushed various font characteristics onto our stack, and some of them may overlap.
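One way to deal with overlapping entries is to fold the whole stack into a single effective font before writing each piece of text. The sketch below is an assumption about how that could work, not the code from this series: it is in Python for illustration, the flag values and the `effective_font` function are hypothetical, and it assumes style flags accumulate while the innermost size wins.

```python
# Hypothetical bit flags for font styles, so that styles can be combined.
BOLD, ITALIC, UNDERLINE = 1, 2, 4

def effective_font(stack, default_size=12.0):
    """Fold the stack from outermost to innermost entry into one font description."""
    style = 0
    size = default_size
    for entry in stack:
        style |= entry.get("style", 0)   # styles accumulate: bold inside italic is both
        if "size" in entry:
            size = entry["size"]         # sizes override: the innermost one wins
    return {"style": style, "size": size}

# A bold element containing a larger-size element containing an italic element.
stack = [{"style": BOLD}, {"size": 18.0}, {"style": ITALIC}]
font = effective_font(stack)
print(font)
```

Because nothing on the stack is modified, popping an entry and folding again gives the correct font for the text that follows a closing tag.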
Now that we have the HTML cleaned up, the next step is to parse it.
In my actual code for this, I parse the HTML and create the PDF at the same time, but for the purposes of these posts, I’m going to deal primarily with parsing the HTML here and then deal with the PDF creation code later.
The last prerequisite step prior to actually converting our HTML into PDF code is to clean up the HTML.
The method I use takes advantage of the XML parser in .NET, but in order to use it we need well-formed, XHTML-compliant markup.
For this exercise, what I am most concerned about is that the HTML tags all have matching closing tags, that the tags are properly nested, and that the tags are all lowercase.
Some of this we will have to rely on the user to provide, like properly nesting the tags. But some of this we can attempt to clean up in our code. If you know you will have complete control over your HTML, you might be able to skip this step. But I think the code is simple enough that you’ll want to add it anyhow.
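As a rough illustration of the kind of cleanup that can be automated, the sketch below lowercases tag names and self-closes a few tags that normally have no closing tag. It is in Python for illustration only, `clean_html` and `VOID_TAGS` are hypothetical names, and it assumes attribute values never contain `<` or `>`. It does not try to repair bad nesting, which is still up to the user.

```python
import re
import xml.etree.ElementTree as ET

# Tags that normally appear without a closing tag (an illustrative subset).
VOID_TAGS = ("br", "hr", "img")

def clean_html(html):
    """Lowercase tag names and self-close void tags; attributes are left alone."""
    html = re.sub(
        r"<(/?)([A-Za-z][A-Za-z0-9]*)",
        lambda m: "<" + m.group(1) + m.group(2).lower(),
        html,
    )
    # Rewrite <br>, <br/> and <br /> alike to the self-closed form <br />.
    pattern = r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(VOID_TAGS)
    return re.sub(pattern, r"<\1\2 />", html)

cleaned = clean_html('<P>Line one<BR>line two<IMG src="a/b.png"></P>')
print(cleaned)

# An XML parser should now accept the result; this raises ParseError if not.
root = ET.fromstring(cleaned)
```

Regular expressions are enough here only because the job is limited to tag names and a short list of void tags. Anything more ambitious is better left to a real HTML parser.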
Before we get into the nitty-gritty of parsing the HTML so that we can create PDF code from it, it is important that we understand how text layout works in iTextSharp. So today we will cover those basics.