Safely dealing with user input is a tricky business, especially when you want to allow some HTML. This is not a new problem and there are many existing solutions out there, most notably BBCode and Markdown. HTML Purifier is an open source HTML filtering library written in PHP which, on paper, ticks all the right boxes.
HTML Purifier: The verdict
A 30-second analysis of HTML Purifier
- Open source
- Whitelist approach
- Standards compliant output
- The end user does not need to learn new syntax or pseudo-code
- XSS safe
- Large library
- Relatively slow
If you need a more in-depth analysis, the HTML Purifier website has a thorough comparison.
Implementing HTML Purifier
Implementing HTML Purifier is quite straightforward. First, download the library and copy all the necessary files into the application.
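When user input arrives from the website, any HTML that is not on the whitelist will be removed. To keep code that is wrapped in <code> tags, its contents have to be escaped before purification:
// the /e modifier was removed in PHP 7, so a callback does the escaping; .*? keeps each <code> block separate
$codeEscaped = preg_replace_callback('/(<code>)(.*?)(<\/code>)/is', function ($m) { return $m[1] . htmlspecialchars($m[2]) . $m[3]; }, $dirtyHtml);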
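To make the effect of that escaping step concrete, here is a small standalone sketch. The helper name escapeCodeBlocks and the sample input are made up for illustration and are not part of HTML Purifier; the sketch only assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): escape the contents of
// every <code>...</code> block so the markup inside is displayed as text.
function escapeCodeBlocks(string $dirtyHtml): string
{
    $result = preg_replace_callback(
        '/(<code>)(.*?)(<\/code>)/is',
        function (array $m): string {
            // $m[1] and $m[3] are the tags themselves; only $m[2] is escaped.
            return $m[1] . htmlspecialchars($m[2], ENT_QUOTES, 'UTF-8') . $m[3];
        },
        $dirtyHtml
    );

    // preg_replace_callback returns null on a regex error; in that case the
    // input is passed on unchanged and left for the purifier to deal with.
    return $result ?? $dirtyHtml;
}

// Made-up sample input with two separate code blocks.
$dirtyHtml = '<p>Try <code><b>bold</b></code> and <code><i>italic</i></code></p>';
echo escapeCodeBlocks($dirtyHtml), "\n";
```

The lazy `.*?` matters here: with a greedy `.*`, everything between the first <code> and the last </code> would be treated as one block, so the text between two code samples would be escaped as well.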
Now HTML Purifier can work its magic. In this example I am using CodeIgniter, so the library is loaded through CodeIgniter's loader.
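A minimal sketch of that loading step, assuming HTML Purifier has been wrapped as a CodeIgniter library file named Htmlpurifier.php under application/libraries (the file name and the wrapper are assumptions, not something this post specifies). It is a fragment that belongs inside a controller or model method rather than a standalone script:

```php
<?php
// Inside a CodeIgniter controller or model method.
// Assumption: application/libraries/Htmlpurifier.php wraps or extends HTMLPurifier.
$this->load->library('htmlpurifier');
// CodeIgniter then exposes the instance as $this->htmlpurifier.
```

Once loaded this way, the instance is reachable as $this->htmlpurifier, which matches how purify() is called further down.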
The rest of the process is taken care of with this function. This changes some of the config settings and runs the code through the HTML Purifier library:
// load the config and override the defaults as necessary
// (HTML Purifier 4.x expects 'Namespace.Directive' keys; the three-argument form is deprecated)
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.AllowedElements', 'a,em,blockquote,p,strong,pre,code');
$config->set('HTML.AllowedAttributes', 'a.href,a.title');
$config->set('HTML.TidyLevel', 'light');
// run the escaped HTML through the purifier
$cleanHtml = $this->htmlpurifier->purify($codeEscaped, $config);
The cleanComments function now returns a nice clean string of user input. All the good bits are kept, whilst HTML Purifier takes care of making the markup well-formed and dealing with XSS security worries.