How To: Bad Word Identifier for PHP

When writing an application in which user-submitted data can find its way without review into the public view, it’s awfully nice to be able to automatically staunch the flow of profanities at least a little bit. We needed such a thing for a recent social media widget we built but oddly couldn’t find anything in our brief searches. So we wrote our own.


Solving this problem perfectly is impossible, mainly because one person’s profanity is another person’s literature. But we’ll assume we have some moral authority to judge these things and work from there.

Even then, the human brain is quite good at finding a bad word even when there isn’t one, as the clothing store FCUK can attest. That might lead a computer scientist to try a neural network to identify when a submission is not so clean, and I’d love to do that. But as an engineer working under time constraints, I just needed to build something quick and dirty.

So here’s what I came up with:

<?php

// Put all your bad words here! If you don't want to check if the bad word is
// embedded in the input then set it to false. For example, we set "hell" to
// false because it's in common words like "Shelly" and "hello."
//
$badWords = array (
		'fudge' => true,
		'shoot' => true,
		'hell' => false
	);

function hasBadWord ( $value )
{
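	// Accented Latin-1 letters and look-alike digits, with their a-z stand-ins.
	//
	$str1 = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ01234567";
	$str2 = "aaaaaaaceeeeiiiidnoooooouuuuybsaaaaaaaceeeeiiiidnoooooouuuuybyoiseasbt";

	// Map everything to a-z for bad word testing. Note that utf8_decode() is
	// deprecated as of PHP 8.2; mb_convert_encoding ( $value, 'ISO-8859-1',
	// 'UTF-8' ) is the documented replacement.
	//
	$value = strtr ( utf8_decode ( $value ), utf8_decode ( $str1 ), $str2 );
	$value = preg_replace ( '/[^a-z]/', '', strtolower ( $value ) );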

	global $badWords;

	// Quick test.
	//
	if ( array_key_exists ( $value, $badWords ) )
	{
		return true;
	}

	// Longer test. Check for embedded bad words. Arbitrarily only look for
	// bad words of 4 or more letters that are flagged true.
	//
	foreach ( array_keys ( $badWords ) as $badWord )
	{
		if ( $badWords [ $badWord ] && strlen ( $badWord ) >= 4 && strpos ( $value, $badWord ) !== false )
		{
			return true;
		}
	}

	return false;
}

// And the rest is just for testing.
//
function testIt ( $value )
{
	echo $value . " is " . ( hasBadWord ( $value ) ? "naughty" : "nice" ) . ".<br />";
}

testIt ( "fudge" );
testIt ( "5.h.ò.0.t" );
testIt ( "Shelly" );
testIt ( "Mike" );

?>

If you copy this into a PHP script and run it, this is what you get:

fudge is naughty.
5.h.ò.0.t is naughty.
Shelly is nice.
Mike is nice.

Well, it worked! As you can see, this is a super simple test, and you’ve probably already come up with a few ways to break it (like using an “l” (el) instead of an “i”). Given such a naive approach, you probably shouldn’t use it for anything but short, simple inputs like first name and last name.
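
One of those breaks, swapping an “l” for an “i”, can be blunted by folding look-alike letters into a single class on both the input and the word list before comparing. This is only a minimal sketch under that assumption: foldLookalikes, hasBadWordFolded and the stand-in word “ninny” are made up for illustration and are not part of the script above.

```php
<?php

// Hypothetical helper: treat "l" and "i" as the same letter so that swapped
// spellings collapse to one form. Assumes $value is already lowercase a-z.
function foldLookalikes ( $value )
{
	return strtr ( $value, 'l', 'i' );
}

// Hypothetical variant of the check: fold both the input and each bad word,
// then run the same exact and embedded comparisons.
function hasBadWordFolded ( $value, array $badWords )
{
	$value = foldLookalikes ( preg_replace ( '/[^a-z]/', '', strtolower ( $value ) ) );

	foreach ( $badWords as $badWord => $checkEmbedded )
	{
		$folded = foldLookalikes ( $badWord );

		if ( $value === $folded )
		{
			return true;
		}

		if ( $checkEmbedded && strlen ( $folded ) >= 4 && strpos ( $value, $folded ) !== false )
		{
			return true;
		}
	}

	return false;
}

$words = array ( 'ninny' => true, 'hell' => false );

var_dump ( hasBadWordFolded ( 'nlnny', $words ) );  // "l" swapped in for "i"
var_dump ( hasBadWordFolded ( 'Shelly', $words ) );
var_dump ( hasBadWordFolded ( 'heii', $words ) );   // now collides with "hell"
```

The last call shows the price: once “l” and “i” are the same letter, “heii” matches “hell”, so folding catches more swaps but also flags more innocent input.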

But if it works just well enough to make the person who submitted the profanity reconsider, then maybe it will help staunch the flow of nasty words a little bit. Make your error message funny and maybe you’ll even make a friend of that person.
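
To show where that error message might live, here is a rough sketch of wiring a checker into name-field validation. The checkNameField function, the message text and the stand-in checker are all assumptions for illustration; in practice you would pass 'hasBadWord' as the callable.

```php
<?php

// Hypothetical wrapper: returns an error message for a rejected field, or
// null when the value passes. $isNaughty is any callable that takes a string
// and returns a bool.
function checkNameField ( $label, $value, $isNaughty )
{
	if ( call_user_func ( $isNaughty, $value ) )
	{
		return "Hmm, our robot blushed at your " . $label . ". Care to try another?";
	}

	return null;
}

// Stand-in checker so this sketch runs on its own.
$standInChecker = function ( $value )
{
	return strtolower ( $value ) === 'fudge';
};

$errors = array ();

foreach ( array ( 'first name' => 'Fudge', 'last name' => 'Smith' ) as $label => $value )
{
	$message = checkNameField ( $label, $value, $standInChecker );

	if ( $message !== null )
	{
		$errors [] = $message;
	}
}

echo implode ( "\n", $errors ) . "\n";
```

Passing the checker in as a callable keeps the message logic separate from the word list, so either one can change without touching the other.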
