HTML Purifier: an open source filter library

Safely dealing with user input is a tricky business, especially when you want to allow some HTML. This is not a new problem and there are many existing solutions out there, most notably BBCode and Markdown. HTML Purifier is an open source HTML filtering library written in PHP which, on paper, ticks all the right boxes.

HTML Purifier: The verdict

A 30-second analysis of HTML Purifier

Good points

  • Open source
  • Whitelist approach
  • Standards compliant output
  • The end user does not need to learn new syntax or pseudo-code
  • XSS safe

Bad points

  • Large library
  • Relatively slow

If you need a more in-depth analysis, the HTML Purifier website has a thorough comparison.

Implementing HTML Purifier

Implementing HTML Purifier is quite straightforward. First, download the library and copy its files into the application.

When receiving the user input from the website, any HTML code that is not on the whitelist will be removed. Keeping code that is wrapped in <code> tags can be done by escaping the necessary HTML characters with a clever piece of regex.

$codeEscaped = preg_replace_callback('/(<code>)(.*?)(<\/code>)/is',
  function ($m) { return $m[1] . htmlspecialchars($m[2]) . $m[3]; }, $dirtyHtml);
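
To see what the escaping step does before the purifier gets involved, here is a minimal sketch that can be run on its own. The helper name escapeCodeBlocks and the sample input are assumptions made for illustration. It uses preg_replace_callback because the /e modifier was removed in PHP 7, and a non-greedy match so that two separate <code> blocks in one comment are escaped individually.

```php
<?php
// Hypothetical helper: escapes HTML special characters inside <code>...</code>
// so the purifier later treats them as text rather than markup.
function escapeCodeBlocks($dirtyHtml)
{
    return preg_replace_callback(
        '/(<code>)(.*?)(<\/code>)/is',   // i: case-insensitive, s: dot matches newlines
        function ($m) {
            // $m[1] = opening tag, $m[2] = contents, $m[3] = closing tag
            return $m[1] . htmlspecialchars($m[2]) . $m[3];
        },
        $dirtyHtml
    );
}

$dirtyHtml = '<p>Try <code><b>bold</b></code> here</p>';
echo escapeCodeBlocks($dirtyHtml), "\n";
// prints: <p>Try <code>&lt;b&gt;bold&lt;/b&gt;</code> here</p>
```

Markup outside the <code> tags is left untouched, so it is still up to HTML Purifier to decide what stays and what goes.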

Now HTML Purifier can work its magic. In this example I am using CodeIgniter, so the library is loaded like this:


  $this->load->library('HTMLPurifier');

The rest of the process is taken care of with this function. This changes some of the config settings and runs the code through the HTML Purifier library:


function cleanComments($codeEscaped)
{
  // load the config and override defaults as necessary
  // (HTML Purifier 4.x expects 'Namespace.Directive' as a single key)
  $config = HTMLPurifier_Config::createDefault();
  $config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
  $config->set('HTML.AllowedElements', 'a,em,blockquote,p,strong,pre,code');
  $config->set('HTML.AllowedAttributes', 'a.href,a.title');
  $config->set('HTML.TidyLevel', 'light');

  // run the escaped html code through the purifier
  $cleanHtml = $this->htmlpurifier->purify($codeEscaped, $config);
  return $cleanHtml;
}

The cleanComments function now returns a nice clean string of user input. All the good bits are kept, whilst HTML Purifier takes care of making the code well formed and dealing with XSS security worries.
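
If you are not using CodeIgniter, the same steps can be sketched with the library loaded directly. This is only an outline under a few assumptions: the include path is a guess at where the downloaded files were copied, purifyComment is a made-up name, and the config keys use the single 'Namespace.Directive' form that HTML Purifier 4.x expects.

```php
<?php
// Sketch only: adjust the include path to wherever the library was copied.
// purifyComment() is a hypothetical name, not part of HTML Purifier.
function purifyComment($codeEscaped)
{
    // HTMLPurifier.auto.php sets up the library's autoloader.
    require_once __DIR__ . '/htmlpurifier/library/HTMLPurifier.auto.php';

    // Same whitelist as in the CodeIgniter example.
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
    $config->set('HTML.AllowedElements', 'a,em,blockquote,p,strong,pre,code');
    $config->set('HTML.AllowedAttributes', 'a.href,a.title');
    $config->set('HTML.TidyLevel', 'light');

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($codeEscaped);
}
```

Calling purifyComment() on a comment containing something like a script element or an onclick attribute should return the comment with those parts stripped, since neither is on the whitelist.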
