Regular Expressions in PHP

June 13th, 2006

Regular Expressions (regex for short) appear to a lot of people as the 'black art' of coding. Most languages, be it PHP, Java, C, .NET, VB etc., have a way of using regular expressions - and they can certainly make your job easier. So let's start on our journey into regular expressions. I am by no means an expert, but hopefully I'll be able to clear the fog that surrounds regular expressions!

The Functions

Before you can concoct a regular expression, you need to know how to use them and what all those 'weird characters' do. In PHP, the regex functions used in this article belong to the Perl-compatible (PCRE) preg_* family, rather than the older POSIX-type ereg functions.

Rather than go on about how they all work, I think the best way is to show them in use with an example. Ignore the regular expression patterns for now; they will be explained in depth later!

Phone Number Example

Matching with Regular Expressions

Matching with regex can also be thought of as 'searching' or 'finding' with regex. You supply the pattern, and the preg_match* functions will deliver the goods. The beauty of preg_match() is that it returns 1 (which PHP treats as true) when a match is made and 0 when it is not, and you can also optionally retrieve the 'found' contents by supplying a variable name as the third parameter, $matches in this case.

To demonstrate, let's take a look at the following. Our user has input their phone number, but we want to store the area code separately from the rest of the number. For the sake of argument, let's assume the phone number is input as xxx xxx xxxx with the separator being either a space, a hyphen or nothing at all:

PHP:
  $var = '123-456-7890';
  preg_match('/^(\d{3})[\s-]?/', $var, $matches); // returns 1 (a match was found)
  print_r($matches);

$matches gives:

Array
(
    [0] => 123-
    [1] => 123
)

As you can see, preg_match returns 1 (which PHP treats as true) and our $matches variable contains the information matched by our expression, /^(\d{3})[\s-]?/. Regular expressions capture and return whatever is matched by the parts of the pattern surrounded in round brackets, as with (\d{3}) in the above example. The first element in the $matches array is always the entire text matched by the regex. Subsequent captures are stored in the array in the order of the opening brackets in the regex.
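To make the ordering of captured groups concrete, here is a minimal sketch that extends the pattern to capture all three parts of the number. The extended pattern and the variable names $areaCode, $exchange and $line are my own illustration, not part of the original example.

```php
<?php
// Hypothetical extension of the phone number example: three capturing groups.
$var = '123-456-7890';
$pattern = '/^(\d{3})[\s-]?(\d{3})[\s-]?(\d{4})$/';

if (preg_match($pattern, $var, $matches) === 1) {
    // $matches[0] is the whole match; elements 1 to 3 follow the order of the brackets.
    list(, $areaCode, $exchange, $line) = $matches;
    echo "Area code: $areaCode, rest of number: $exchange$line\n";
}
```

The leading comma in list() simply skips $matches[0], since only the bracketed parts are wanted here.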

The trick to dealing with regex patterns is to take it one character at a time. Looking at the above regex, let's dismantle it and take it piece by piece:

Dismantling "/^(\d{3})[\s-]?/"

/ Forward slashes sit at the beginning and end of the actual expression and delimit it, similar to how spaces define where a word starts and ends. You can use many other characters instead of /, such as | or @. If you need to use the delimiter character within your expression, then it has to be escaped with a backslash
^ The caret, ^, signifies the start of the string. This forces our regular expression to start matching from the beginning of our string. As the first character inside a character class, [...], the caret instead means 'not'. For example, [^a] means "any single character except 'a'"
( ) The round brackets here signify a capturing group. That is, whatever is matched by the expression within the brackets is returned to the user in the $matches array
\d The \d is a shorthand character class meaning 'match a digit'. It is identical to [0-9] but easier to write. It will match any single digit but will not match letters, punctuation etc.
{3} The number 3 surrounded by curly brackets signifies the number of times the previous expression must match. In this case, it means that \d should be matched three times - so it will match 123 but not just 12.
[\s-]? This character class will match one of the characters in the square brackets. In this case, it's either \s, which represents a whitespace character (such as a space), or a hyphen. The question mark at the end means 'match 0 or 1 times', meaning that there can be a single space, a single hyphen, or neither.
/ The final slash represents the end of our expression
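The point about delimiters is easiest to see with a pattern that itself contains slashes. The following is a small sketch, assuming a made-up path string, that writes the same expression once with / and once with @ as the delimiter.

```php
<?php
// Illustrative input, not taken from the article.
$path = 'blog/2006/06';

// With / as the delimiter, every literal slash must be escaped.
$withSlashes = preg_match('/^blog\/(\d{4})\/(\d{2})$/', $path, $a);

// With @ as the delimiter, the slashes need no escaping.
$withAts = preg_match('@^blog/(\d{4})/(\d{2})$@', $path, $b);

// Both patterns describe the same expression, so the captures are identical.
var_dump($a === $b);
```

Picking a delimiter that does not occur in the pattern keeps the expression readable.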

As you can see, when you break up the regular expression, things do appear simpler. We've covered a lot in the above expression, and so I will try to elaborate a bit more below.

Anchors

There are only two anchors I want to bring into the equation for now:

^ - will match the beginning of a line/string when it is not present in a [...] character class.
$ - will match the end of a line/string
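A quick sketch of what the two anchors change in practice. The sample strings are my own, chosen only to show where each pattern is allowed to match.

```php
<?php
// Illustrative inputs, not taken from the article.
$startOnly  = preg_match('/^\d{3}/', '123abc');  // 1: the string starts with three digits
$notAtStart = preg_match('/^\d{3}/', 'abc123');  // 0: the digits are not at the start
$endOnly    = preg_match('/\d{3}$/', 'abc123');  // 1: the string ends with three digits
$wholeOk    = preg_match('/^\d{3}$/', '123');    // 1: the whole string is three digits
$wholeBad   = preg_match('/^\d{3}$/', '1234');   // 0: one digit too many

var_dump($startOnly, $notAtStart, $endOnly, $wholeOk, $wholeBad);
```

Using both anchors together is the usual way to validate an entire input rather than just find a match somewhere inside it.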

Character Classes

There are two sorts of character classes: those written within square brackets, and shorthand ones, like \d above. Within square brackets, you can list characters that you want to be accepted or rejected:

[abc] - will match a or b or c
[^abc] - will match anything BUT a or b or c - i.e. any character that is not a or b or c

You can include ranges of characters:

[a-zA-Z] - will match a-z letters in both upper and lower case
[0-9] - will match digits 0-9

If you want to match a hyphen, then you need to place it where it cannot be read as part of a range: as the first or last character in the class (or escaped with a backslash):

[a-z-] - will match a-z in lowercase only, and hyphens

There are other character classes that are shorthand:

\s - matches whitespace characters: spaces, \n, \r (both new lines), \t (tabs) and \f (form feeds - not used often). Roughly equivalent to [ \n\r\t\f]
\S - matches all non-whitespace characters. Roughly equivalent to [^ \n\r\t\f]
\d - matches 0-9. Equivalent to [0-9]
\D - matches all non-digits. Equivalent to [^0-9]
\w - matches a word character: a-z, A-Z, 0-9 and the underscore. Equivalent to [a-zA-Z0-9_]
\W - matches all non-word characters. Equivalent to [^a-zA-Z0-9_]
. - the period matches any character apart from new lines. If you wish to match an actual period, you need to escape it using a backslash: \.
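The character classes above can be tried out with a handful of one-line checks. This is only a sketch with sample strings of my own. It uses + (meaning 'one or more') so that each class is applied to the whole string, and it relies on the fact that in PCRE \w also accepts the underscore.

```php
<?php
// Each check returns 1 on a match and 0 otherwise. Sample strings are illustrative.
$checks = array(
    'lowercase or hyphen' => preg_match('/^[a-z-]+$/', 'jelly-and-custard'),
    'no vowels'           => preg_match('/^[^aeiou]+$/', 'rhythm'),
    'word characters'     => preg_match('/^\w+$/', 'user_42'),
    'escaped period'      => preg_match('/^\d+\.\d+$/', '3x14'),
    'unescaped period'    => preg_match('/^\d+.\d+$/', '3x14'),
);

print_r($checks);
```

The last two entries differ only in the backslash: the escaped period insists on a literal '.', so it rejects '3x14', while the unescaped one accepts any character in that position.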

Ranges

As we saw above with {3}, you can specify the number of times you wish to match the previous expression. Above we used \d{3}, but we could just as easily have used \d\d\d - using a range is simply a lot neater and more readable.

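However, a range can be as you'd expect: a range from x through to y. For example, {3,5} means that the previous expression must be matched at least 3 times and at most 5 times. So for:

preg_match('/^\d{3,5}$/', $var);

we will get a return of 1 from preg_match() if $var consists of three to five digits, for example anything from 100 through to 99999. The ^ and $ anchors matter here: without them, the pattern would also succeed on a longer number such as 123456, because it only needs to find three digits somewhere in the string.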

Now, what if you want $var to range from 100 until infinity? This can be done by not providing a second digit in the range: {3,} - will match at least three digits.

Ranges are simple, and can be a great help.
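Here is a short sketch of how ranges behave on a few sample values (the values are my own). Note that a range only limits the length of the input when the pattern is anchored with ^ and $, so the example shows both the anchored and the unanchored form.

```php
<?php
// Anchored: the whole string must be three to five digits.
$pattern = '/^\d{3,5}$/';
$results = array();
foreach (array('12', '100', '99999', '123456') as $sample) {
    $results[$sample] = preg_match($pattern, $sample);
}
print_r($results);

// Unanchored: succeeds as soon as three digits are found anywhere in the string.
$unanchored = preg_match('/\d{3,5}/', '123456');

// Open-ended range: three or more digits, with no upper limit.
$atLeastThree = preg_match('/^\d{3,}$/', '123456');
```

With the anchors in place, '12' and '123456' are rejected while '100' and '99999' are accepted.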

Wildcards

The last thing I want to discuss is expression wildcards (more formally known as quantifiers). There are three that are used regularly:

? - The question mark will match the preceding expression 0 or 1 times. This is very handy for when you want the input to have an optional character. We used this above to account for a hyphen, space or nothing as our separator in the phone number
+ - A plus sign will match the preceding expression 1 or more times. This is useful for when you require something to be present.
* - An asterisk wildcard will match the preceding expression 0 or more times. There is no upper limit on the number of times it will match. For example, .* will match every character in a line.
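The three wildcards can be compared side by side. The following sketch uses sample strings of my own to show where each one succeeds or fails.

```php
<?php
// ? makes the preceding character optional: both spellings match.
$american = preg_match('/^colou?r$/', 'color');
$british  = preg_match('/^colou?r$/', 'colour');

// + requires at least one digit, so an empty string fails.
$plusEmpty  = preg_match('/^\d+$/', '');
$plusDigits = preg_match('/^\d+$/', '2006');

// * allows zero digits, so an empty string passes.
$starEmpty = preg_match('/^\d*$/', '');

// .* takes every character up to the end of the line, but not the newline itself.
preg_match('/.*/', "first line\nsecond line", $m);
echo $m[0], "\n";
```

The difference between + and * only shows up when the thing being repeated is absent, which is why the empty string is the telling test case.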

A final example

Using the information we have above, let's create a regular expression that will match a hex colour that will end up in a CSS stylesheet. Hex colours have 6 characters that range from 0-9 or A-F. To check this using a regular expression, we need to start at the beginning of our string, so we start with a ^. We want to check for a # sign so we add that to our expression so far:

/^#/

We now need to add the juicy bit, which is checking for A-F, a-f and 0-9, with exactly 6 characters allowed from that list:

/^#[a-fA-F0-9]{6}/

We finish the expression by ensuring we are checking the entire string. We can do this by using the end of line character, $. This, combined with ^ means we check from the beginning of the input through to the end of the input, and the only characters allowed are those in our expression. This gives our final expression:

/^#[a-fA-F0-9]{6}$/
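As a sketch of how the finished expression might be used, it can be wrapped in a small helper. The function name isHexColour() is my own invention for illustration and is not a built-in PHP function.

```php
<?php
// Hypothetical helper: returns true only for a # followed by exactly six hex digits.
function isHexColour($value)
{
    return preg_match('/^#[a-fA-F0-9]{6}$/', $value) === 1;
}

// A few sample inputs, valid and invalid.
foreach (array('#FF00aa', '#FFF', 'FF00AA', '#GG0000') as $candidate) {
    echo $candidate, ' => ', isHexColour($candidate) ? 'valid' : 'invalid', "\n";
}
```

Comparing the result against 1 turns preg_match()'s integer return value into a proper boolean, which is convenient in validation code.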

This has been a whistle-stop tour of basic regular expressions. There is a lot more that can be done, and a lot that I have missed out for the sake of simplicity. Keep an eye out for a follow-up explaining more of the world of regular expressions.


Entry Filed under: Input Validation

9 Comments

  • 1. Jim O’Halloran…  |  June 14th, 2006 at 10:21 pm

    […] Regular Expressions in PHP · Jelly and Custard An excellent introduction to using regex's in PHP, including a basic explanation of regex syntax. (tags: php regex tutorial webdev) […]

  • 2. Nate K  |  June 22nd, 2006 at 2:43 pm

    Good beginner post for Regular Expressions. I would challenge anyone reading this to dive in deeper on each section. This is a GREAT surface article, but could be built upon for different needs.

    Understanding regex takes time and practice. Even the above example would allow a phone number like 000-000-0000 - which is valid according to your regex, but obviously is not a real phone number. So, dive a little deeper, find the constraints and patterns that need to be checked - and in what order - to compile a valid, working, phone number.

    Also, as a side note - you have references to [\s-] and [a-z-], in which case I would ALWAYS put the hyphen in first. This is because the hyphen can also define a range (a-z), so having it first ensures that it is not trying to match a range of letters/numbers because nothing will come before it.

    Nice read!

  • 3. Abhijeet  |  August 9th, 2007 at 7:38 am

    Great!!!

    it's really helpfull. I got cleared lot of things

  • 4. Rob  |  August 24th, 2007 at 1:20 am

    Very good, very helpful. Waaaayyyy better than all of the other reg exp tutorials I have seen in the past twenty minutes!

  • 5. Ved  |  October 23rd, 2007 at 7:57 pm

    Hi,

    Till now I was shit scared of regular expressions. Very nicely written article.

    Thanks a lot.

  • 6. Velmurugan  |  October 31st, 2007 at 4:26 am

    Thank you for this article.It helps me to find the pattern matching for hexadecimal color code

  • 7. raghu  |  January 9th, 2008 at 9:22 am

    It is a nicely written tutorial about regular expression. Lots of doubts about regex are cleared after reading this.
    Thanks a lot.

  • 8. israeljernigan.com »…  |  January 11th, 2008 at 4:03 pm

    […] I wanted to learn about REGEXP in PHP, mostly for security verification. Here are two links that helped pave the way. The basics. More advanced methods. […]

  • 9. Jerry  |  March 15th, 2008 at 4:35 am

    Great post on regex. Thanks!
