One secret to generating clean HTML from text

Markdown mixed with SmartyPants.

If you want to cut to the chase: use the Markdown Dingus to convert your text.

This particular secret to generating clean HTML from text is to start with clean text. Clean text is ASCII or Unicode text in a text file (.txt) with a blank line between paragraphs.

If you cannot convince your authors of the joys of distraction-free text editors like Writemonkey, Q10 or WriteRoom, which all produce .txt files, you are probably starting with a Word file.

I’m assuming you know your way around Word.

Use Find & Replace to make sure there is an empty paragraph between each pair of paragraphs, then save the document as plain text (.txt). Markdown uses the blank line between paragraphs to work out where the final HTML <p> tags are to go.
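To make the blank-line rule concrete, here is a minimal Python sketch of the idea. It is not Markdown's actual implementation, and the function name paragraphs_to_html is made up for this illustration; it only shows how blank lines decide where the <p> tags go.

```python
import re

def paragraphs_to_html(text):
    """Wrap each blank-line-separated block of text in <p> tags.

    Hypothetical helper for illustration only; real Markdown does far more.
    """
    # Normalise Windows line endings so blank lines are easy to detect.
    text = text.replace("\r\n", "\n")
    # A paragraph break is a newline, optional whitespace, then another newline.
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(
        "<p>" + block.strip() + "</p>" for block in blocks if block.strip()
    )

sample = "First paragraph.\n\nSecond paragraph,\nstill the second.\n"
print(paragraphs_to_html(sample))
```

A single line break stays inside its paragraph; only a blank line starts a new one.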

The next step is to use Markdown and SmartyPants to convert your text to HTML complete with typographically correct quotes, dashes and ellipses.
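As a rough picture of what the dash and ellipsis part of that conversion does, here is a simplified Python stand-in. It is not SmartyPants itself: the function name is invented for this example, it assumes the "old school" convention where two hyphens become an en dash and three become an em dash, and it ignores quotes, HTML tags and spaced-out ellipses, all of which the real tool handles.

```python
def educate_dashes_and_ellipses(text):
    """Simplified, hypothetical stand-in for SmartyPants dash/ellipsis handling."""
    # Replace the longer run first so "---" is not consumed as "--" plus "-".
    text = text.replace("---", "&#8212;")  # em dash
    text = text.replace("--", "&#8211;")   # en dash
    text = text.replace("...", "&#8230;")  # ellipsis
    return text

print(educate_dashes_and_ellipses("pages 10--12 --- wait..."))
```

The output uses numeric HTML entities, so it survives whatever encoding the final file ends up in.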

Markdown and SmartyPants were originally created by John Gruber of daringfireball.net as tools for formatting blog posts. I am completely ignoring the fact that Markdown provides a huge range of markup options to specify all kinds of things like headings, bullet points, block quotes and more. Read about that here.

John Gruber generously provides the Markdown Dingus, a web page where you can convert text to HTML using Markdown and SmartyPants to get clean HTML with pretty quotes and proper dashes. You can find the Markdown Dingus here.

But Jimmy, I can’t go pasting my clients’ text into a web page! But Jimmy, pasting text into web pages doesn’t fit into my workflow!

Let’s get technical

Oh, man. I can’t hold your hand through this, but I can show you how Byrnes Woder uses Markdown locally.

First, install Python. Download and install the latest 2.X version from here.

Then install Smartypants for Python from here.

And, of course, install Markdown for Python from here.

Below is some very basic and very unpretty Python code for performing the conversion.

import markdown
import smartypants
import sys
import re
import codecs

html_header = u"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" >
<body>
"""

html_footer = u"""</body>
</html>
"""

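# input and output file names come from the command line
infile_name = sys.argv[1]
outfile_name = sys.argv[2]
# read the raw bytes; they are decoded as UTF-8 further down
infile = open(infile_name, "rb")
original_txt = infile.read()
infile.close()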

# catch any mis-typed en dashes
converted_txt = original_txt.replace(" - ", " -- ")
converted_txt = smartypants.educateQuotes(converted_txt)
converted_txt = smartypants.educateEllipses(converted_txt)
converted_txt = smartypants.educateDashesOldSchool(converted_txt)
# normalise line endings and insert blank line between paragraphs for Markdown
converted_txt = re.sub("\r\n", "\n", converted_txt)
converted_txt = re.sub("\n\n+", "\n", converted_txt)
converted_txt = re.sub("\n", "\n\n", converted_txt)
converted_txt = unicode( converted_txt, "utf8" )

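# convert the cleaned text to HTML and wrap it in the XHTML header and footer
html = markdown.markdown(converted_txt)
html_out = html_header + html + html_footer

# write the result out as UTF-8
outfile = codecs.open(outfile_name, "w", encoding="utf8")
outfile.write(html_out)
outfile.close()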

We call it txtclnr.py and use it like so:

python txtclnr.py their.txt our.xhtml

It does pretty much what the Markdown Dingus does, but adds the header and footer so we end up with a well-formed XHTML document.
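The three re.sub calls in the script are easier to follow in isolation. The sketch below repeats them unchanged inside a small function (the name normalise_paragraph_breaks is just for this example) and runs them on a sample string.

```python
import re

def normalise_paragraph_breaks(text):
    """Repeat the script's three substitutions on a single string."""
    # Windows line endings become plain newlines.
    text = re.sub("\r\n", "\n", text)
    # Any run of two or more newlines collapses to a single newline.
    text = re.sub("\n\n+", "\n", text)
    # Every remaining newline is doubled, giving one blank line between lines.
    text = re.sub("\n", "\n\n", text)
    return text

sample = "one\r\ntwo\r\n\r\n\r\nthree"
print(repr(normalise_paragraph_breaks(sample)))
```

Note the side effect: every line ends up as its own paragraph. That suits text saved from Word, where each paragraph is one long line, but hard-wrapped text would be split at every line break.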

Other people did the hard work: John Gruber with the original Markdown and SmartyPants, and the programmers who created the Python ports. Thanks to them, by starting with a straight text file we can get 90% of the way to the XHTML required for an ePUB ebook in a few seconds.