Sunday, May 24, 2009

How to remove attributes from HTML tags using PHP and regular expressions

Here's a problem my colleague had on Friday: he needed to strip the attributes from h2 and td tags in an HTML document.

There were several different kinds of attributes and they might change in the future, so we figured using regular expressions would be nicer than using the str_replace function.

Here's a little function to check that the 'strip' function we are writing does the job correctly:


function checkStripping()
{
    // Map each input tag to the output we expect from strip().
    $a = array();
    $a['<h2 style="blabla">'] = '<h2>';
    $a['<h2 style="blabla, bla bla">'] = '<h2>';
    $a['<td style="blablabli, bla bla">'] = '<td>';
    foreach ($a as $before => $after) {
        $actual = strip($before);
        if ($actual === $after) {
            echo '.'; // one dot per passing case
        } else {
            echo '<br />F (strip(' . htmlspecialchars($before)
                . ') was expected to be '
                . htmlspecialchars($after) . ' and is '
                . htmlspecialchars($actual) . ')';
        }
    }
}


It turns out the testing function will be much longer than the one that does the job...
Had we used a unit testing framework like PHPUnit it would have been shorter.

Now here's the function we first came up with to do the stripping:


function strip($s)
{
    $pattern = array('<h2.*>', '<td.*>');
    $replacement = array('<h2>', '<td>');
    return preg_replace($pattern, $replacement, $s);
}


It uses the PHP function preg_replace, which replaces every match of a regular expression in a text with a given string. When it is given two arrays, each pattern is paired with the replacement at the same position.
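To make the array form of preg_replace concrete, here is a minimal sketch. The patterns and the sample sentence are made up for illustration and have nothing to do with our h2/td problem.

```php
<?php
// Hypothetical example values: each pattern is paired with the
// replacement at the same index, and the pairs are applied in order.
$patterns = array('/cat/', '/dog/');
$replacements = array('feline', 'canine');

$result = preg_replace($patterns, $replacements, 'cat chases dog');

echo $result, "\n"; // feline chases canine
```

Note that each pattern here is wrapped in '/' characters, which turns out to matter for the rest of this post.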

The 'tricky' part for us was defining the regular expression. We started by writing it this way:

'<h2.*>' (a)

We wanted to say our pattern was made of
  • '<h2'
  • any character zero or more times (.*)
  • the closing '>'.
That didn't work: it gave us '<<h2>' instead of '<h2>', and we couldn't understand why.

Then we wrote it this way:

'/<h2[^>]*>/' (b)

Instead of the dot (.) to signify 'any character', we chose '[^>]' to mean 'any character except for the closing >'.
And I remembered that I'd always seen regular expressions used with some character at the beginning and at the end, so I added those '/' at the beginning and at the end.

And pattern (b) worked! So we looked no further.

Last night I decided to finally have a go at writing a blog on programming. I doubt it will be of interest to anybody, but you never know.

At least it's been useful to me: now that I've described our problem, I've figured out why pattern (a) didn't work, and how pattern (b) could be simplified.

I think pattern (a) did not work because we omitted those delimiter characters (here '/', but we could have chosen something else) around our regular expression. PHP accepts '<' and '>' as a matching pair of delimiters, so it took them as the delimiters, looked for 'h2.*' and replaced what it found with '<h2>'. The leading '<' of the tag was never part of the match, which explains why we ended up with '<<h2>' as a result. (You can read more about those delimiters on php.net.)
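Here is a small sketch to check that explanation. It runs the pattern without slashes, then the pattern I believe PHP actually sees, then the one with explicit '/' delimiters. The variable names are just mine for this example.

```php
<?php
$input = '<h2 style="blabla">';

// No slashes: '<' and '>' are read as the delimiter pair,
// so the regular expression that actually runs is 'h2.*'.
$withoutSlashes = preg_replace('<h2.*>', '<h2>', $input);

// The same expression written with explicit '/' delimiters.
$equivalent = preg_replace('/h2.*/', '<h2>', $input);

// What we meant all along: the '<' and '>' are part of the pattern.
$withSlashes = preg_replace('/<h2.*>/', '<h2>', $input);

echo $withoutSlashes, "\n"; // <<h2>
echo $equivalent, "\n";     // <<h2>
echo $withSlashes, "\n";    // <h2>
```

The first two calls give the same result, which is consistent with '<' and '>' having been consumed as delimiters.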

So replacing the dot with '[^>]' is not necessary to pass our tests, and the final version of our function would be:


function strip($s)
{
    // The closing '>' belongs inside the '/' delimiters.
    $pattern = array('/<h2.*>/', '/<td.*>/');
    $replacement = array('<h2>', '<td>');
    return preg_replace($pattern, $replacement, $s);
}


Of course I'll be glad to have feedback if anyone reads this and knows a better way to do it.



Monday, May 25th edit: my unit tests were not good enough. It turns out the '.*' is greedy, so my function as it is would replace '<h2>title</h2>' with '<h2>'.
Adding a question mark after the '.*' makes it lazy, so that it stops at the first '>'. That does the trick, so the function should be:


function strip($s)
{
    // '.*?' is lazy: it stops at the first '>' after the tag name.
    $pattern = array('/<h2.*?>/', '/<td.*?>/');
    $replacement = array('<h2>', '<td>');
    return preg_replace($pattern, $replacement, $s);
}
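
To see the difference side by side, here is a small sketch comparing the greedy pattern, the lazy one and the '[^>]' version from pattern (b). The sample strings are made up, and the multi-line case is an extra I added: by default the dot does not match a newline, so the lazy pattern and '[^>]' may behave differently when a tag is split across lines.

```php
<?php
$html = '<h2 style="blabla">title</h2>';

// Greedy: '.*' runs to the last '>' on the line, swallowing the title.
$greedy = preg_replace('/<h2.*>/', '<h2>', $html);

// Lazy: '.*?' stops at the first '>'.
$lazy = preg_replace('/<h2.*?>/', '<h2>', $html);

// Negated class: '[^>]*' cannot run past a '>' in the first place.
$negated = preg_replace('/<h2[^>]*>/', '<h2>', $html);

echo $greedy, "\n";  // <h2>
echo $lazy, "\n";    // <h2>title</h2>
echo $negated, "\n"; // <h2>title</h2>

// Hypothetical tag whose attributes start on a new line.
$multiLine = "<td\n class=\"a\">x</td>";
$lazyMulti = preg_replace('/<td.*?>/', '<td>', $multiLine);
$negatedMulti = preg_replace('/<td[^>]*>/', '<td>', $multiLine);

var_dump($lazyMulti === $multiLine);    // bool(true): nothing was replaced
var_dump($negatedMulti === '<td>x</td>'); // bool(true)
```

So for tags that stay on one line the lazy version and pattern (b) agree, and pattern (b) looks like the safer choice if the HTML might contain line breaks inside a tag.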