There were several different kinds of attributes and they might change in the future, so we figured using regular expressions would be nicer than using the str_replace function.
Here's a little function to check that the 'strip' function we are writing does its job correctly:
function checkStripping()
{
    $a = array();
    $a['<h2 style="blabla">'] = '<h2>';
    $a['<h2 style="blabla, bla bla">'] = '<h2>';
    $a['<td style="blablabli, bla bla">'] = '<td>';
    foreach ($a as $before => $after) {
        if (strip($before) == $after) {
            echo '.';
        } else {
            echo '<br />F (strip(' . htmlspecialchars($before)
                . ') was expected to be '
                . htmlspecialchars($after) . ' and is '
                . htmlspecialchars(strip($before)) . ')';
        }
    }
}
It turns out the testing function is much longer than the one that does the job...
Had we used a unit testing framework like PHPUnit, it would have been shorter.
Now here's the function we first came up with to do the stripping:
function strip($s)
{
    $pattern = array('<h2.*>', '<td.*>');
    $replacement = array('<h2>', '<td>');
    return preg_replace($pattern, $replacement, $s);
}
It uses the PHP function preg_replace, which lets us define which strings should replace which regular expressions in a text.
The 'tricky' part for us was defining the regular expression. We started by writing it this way:
'<h2.*>' (a)
We wanted to say our pattern was made of:
- '<h2'
- any character zero or more times (.*)
- the closing '>'.
Then we wrote it this way:
'/<h2[^>]*>/' (b)
Instead of the dot (.) to signify 'any character', we chose '[^>]' to mean 'any character except for the closing >'.
And I remembered that I'd always seen regular expressions used with some character at the beginning and at the end, so I added those '/' at the beginning and at the end.
And pattern (b) worked! So we looked no further.
Last night I decided to finally have a go at writing a blog on programming. I doubt it will be of interest to anybody, but you never know.
Now I've found that at least it's been useful to me, because once I'd described our problem, I figured out why pattern (a) didn't work, and how pattern (b) could be simplified.
I think pattern (a) did not work because we omitted those delimiter characters (here '/', but we could have chosen something else) around our regular expression. PHP took '<' and '>' as the delimiters, since brackets like these are accepted as a matching pair, so it looked for 'h2.*' and replaced what it found with '<h2>', leaving the original '<' in front. That explains why we ended up with '<<h2>' as a result. (You can read more about those delimiters on php.net.)
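To see the delimiter behaviour in isolation, here is a small sketch comparing pattern (a) with the same pattern wrapped in '/'. The variable names are only for illustration, and the sample input is one of the strings from the checks above.

```php
<?php
// Sample input taken from the checks above.
$input = '<h2 style="blabla">';

// Pattern (a): '<' and '>' are taken as the delimiters, so the actual
// regular expression is just 'h2.*' and the leading '<' is left in place.
$withoutDelimiters = preg_replace('<h2.*>', '<h2>', $input);

// Same pattern wrapped in '/' delimiters: the whole tag is matched.
$withDelimiters = preg_replace('/<h2.*>/', '<h2>', $input);

echo $withoutDelimiters, "\n"; // <<h2>
echo $withDelimiters, "\n";    // <h2>
```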
If the delimiters were the only problem, then replacing the dot with '[^>]' is not necessary, so the final version of our function would be:
function strip($s)
{
    $pattern = array('/<h2.*>/', '/<td.*>/');
    $replacement = array('<h2>', '<td>');
    return preg_replace($pattern, $replacement, $s);
}
Of course I'll be glad to have feedback if anyone reads this and knows a better way to do it.
Monday, May 25th edit: my unit tests were not good enough. It turns out the '.*' is greedy, and my function as it was would replace '<h2>title</h2>' with '<h2>'.
Adding a question mark after the '.*' does the trick, so the function should be:
function strip($s)
{
    $pattern = array('/<h2.*?>/', '/<td.*?>/');
    $replacement = array('<h2>', '<td>');
    return preg_replace($pattern, $replacement, $s);
}
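A quick way to see the difference between the greedy and the lazy version is to run both on a heading that has content after the opening tag. This is only a sketch: the helper name stripWith and the sample strings are made up for the comparison. It also includes the '[^>]*' from pattern (b), which, as far as I can tell, avoids the problem too, because it can never run past the first '>'.

```php
<?php
// Hypothetical helper: apply one pattern and always replace with '<h2>'.
function stripWith($pattern, $s)
{
    return preg_replace($pattern, '<h2>', $s);
}

$html = '<h2 style="blabla">title</h2>';

echo stripWith('/<h2.*>/', $html), "\n";    // greedy: <h2>
echo stripWith('/<h2.*?>/', $html), "\n";   // lazy: <h2>title</h2>
echo stripWith('/<h2[^>]*>/', $html), "\n"; // negated class: <h2>title</h2>

// The dot does not match a newline by default, so the two working
// versions differ when a tag is split over several lines.
$multiline = "<h2\n style=\"blabla\">title</h2>";

echo stripWith('/<h2.*?>/', $multiline), "\n";   // left unchanged
echo stripWith('/<h2[^>]*>/', $multiline), "\n"; // <h2>title</h2>
```

With the multi-line input, only the '[^>]*' version strips the attribute, which might be a reason to prefer pattern (b) after all.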