
WEAV Server SIG

Regular Expressions

Pattern matching involves searching a string for a set of characters based on a specific pattern. Regular expressions are basically a pattern definition language used to make complex and flexible searches possible.

Many tools use regular expressions (or "regex" for short), including PHP, Perl, and Javascript in CGI. MySQL has a regex function. The Unix tools grep and sed use regex. Even my Macintosh text editor BBEdit supports them in its search tool. So the basics here will be applicable in many instances, not just with PHP, though that's what we're going to look at for the examples.

In a web application, regex works well for:

  • input validation
  • verifying input format, eg an email address
  • parsing data from pre-defined variables
  • searching and replacing data in a file or database

While the basics of regex are pretty much the same across the board, there are different flavours. We'll start off with the POSIX Extended Regular Expressions, which PHP uses.

When you search for a word or phrase in your word processor, you usually type in just what you want to find -- literal characters. A regular expression consists of both literal and metacharacters. Metacharacters have special meanings, depending on context.

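In most tools that use regex, a pattern made only of literal characters matches any string that contains those characters in that order. For example, the pattern cat gives: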

cat match
catastrophe match
concatenation match

The metacharacter ^ means that the pattern has to appear at the beginning of a string. The corresponding character $ matches the end of a string. So:

^cat matches any string that begins with cat.

cat match
catastrophe match
the lazy cat slept no match

cat$ matches any string that ends with cat.

cat match
catastrophe no match
kick the lazy cat match

So ^cat$ will only match the string "cat".
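
The anchor examples above can be tried out with a short sketch like the following. It assumes the Perl-style preg_match() function described later in this document (ereg() was removed in PHP 7), so each pattern is wrapped in / delimiters. The variable names are illustrative only.

```php
<?php
// [pattern, subject] pairs taken from the lists above.
$anchor_cases = [
    ['/cat/',   'concatenation'],
    ['/^cat/',  'catastrophe'],
    ['/^cat/',  'the lazy cat slept'],
    ['/cat$/',  'kick the lazy cat'],
    ['/cat$/',  'catastrophe'],
    ['/^cat$/', 'cat'],
];

$anchor_results = [];
foreach ($anchor_cases as $case) {
    // preg_match() returns 1 when the pattern is found, 0 when it is not.
    $hit = preg_match($case[0], $case[1]);
    $anchor_results[] = $hit;
    echo $case[0], ' on "', $case[1], '": ', ($hit ? 'match' : 'no match'), "\n";
}
```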

If you want to search for a literal instance of a metacharacter, you have to escape it with a backslash:

\$10

Suppose you were searching for the word "grey" but you knew it might also be spelled "gray". To allow for the variant spelling, you could use a bracket expression. The pattern:

gr[ae]y

will match either spelling. The [] delimiters mean that any of the characters inside can match, but only once. The string "graey" wouldn't be a match.

You could use this to allow for variations in capitalization too, since the match is case-sensitive. (We'll look at making case-insensitive searches later on.) For example:

[Ff]red would match "Alfred", "Frederick", and "fred".
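
As a rough check of the bracket expressions above, here is a minimal sketch. It assumes preg_match() from the Perl-style functions covered later, since ereg() is not available in PHP 7 and later. The variable names are made up for the example.

```php
<?php
// A bracket expression matches exactly one character from the set.
$grey  = preg_match('/gr[ae]y/', 'grey');   // 1
$gray  = preg_match('/gr[ae]y/', 'gray');   // 1
$graey = preg_match('/gr[ae]y/', 'graey');  // 0: two vowels, only one allowed

// Matching is case-sensitive, so list both cases where needed.
$alfred = preg_match('/[Ff]red/', 'Alfred');  // 1: "fred" appears inside
$upper  = preg_match('/[Ff]red/', 'FRED');    // 0: only the first letter may vary

var_export([$grey, $gray, $graey, $alfred, $upper]);
```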

If you wanted to search through an HTML file and pull out each instance of a header tag, regardless of the number, you could use:

<H[123456]>

but to save typing, you can also use a range:

<H[1-6]>

Within a bracket expression, the hyphen denotes a range. You can use more than one range inside one expression:

[1-5a-cH-M]

NB: an invalid range, like [H-B], will generate an error.

NB: to match a hyphen within a bracket expression,
you have to put it last or first: [,3-]

Sometimes it's easier to specify what shouldn't match. To negate a bracket expression, use the ^ again. Inside a bracket, it doesn't match the beginning of a string, but rather has the meaning "not". ($ is just a literal inside a bracket expression.)

[^01] will match any character except a 0 or 1.

You can also use POSIX character classes inside a bracket expression. This is a partial list:

[:alnum:] alphanumeric character
[:alpha:] alphabetic character, any case
[:blank:] space and tab
[:digit:] digits
[:lower:] lowercase alphabetics
[:punct:] punctuation characters
[:space:] all whitespace characters, including newline and carriage return
[:upper:] uppercase alphabetics

Since these can only be used inside bracket expressions, they end up looking very ugly. But they can be useful.

[0-9] is the same as [[:digit:]]

[[:alpha:]], though, isn't quite the same as [a-zA-Z] as it can include characters such as é if that's part of the locale set.
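
Ranges, negation, and POSIX classes can be sketched the same way. This example assumes preg_match(), whose pattern syntax also accepts POSIX classes inside bracket expressions. The test strings are invented for illustration.

```php
<?php
// Range: any header level from 1 to 6.
$h3 = preg_match('/<H[1-6]>/', '<H3>');  // 1
$h7 = preg_match('/<H[1-6]>/', '<H7>');  // 0

// Negation: [^01] needs at least one character that is not 0 or 1.
$binary_only = preg_match('/[^01]/', '0110');  // 0
$has_other   = preg_match('/[^01]/', '0120');  // 1: the 2 matches

// POSIX class inside a bracket expression, anchored to the whole string.
$all_digits = preg_match('/^[[:digit:]]+$/', '2019');  // 1

// A hyphen placed last in the brackets is a literal hyphen.
$hyphen = preg_match('/[,3-]/', 'a-b');  // 1

var_export([$h3, $h7, $binary_only, $has_other, $all_digits, $hyphen]);
```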

Another useful metacharacter is the period. Outside a bracket expression, it matches any character.

Pattern: tre.

tree matches
trea3 matches
centre no match

Pattern: tr.e

tree matches
true matches
treat no match
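
A small sketch of the period metacharacter, again assuming preg_match() rather than ereg():

```php
<?php
// tre. needs one more character after "tre".
$tree   = preg_match('/tre./', 'tree');    // 1
$centre = preg_match('/tre./', 'centre');  // 0: nothing follows "tre"

// tr.e needs exactly one character between "tr" and "e".
$true  = preg_match('/tr.e/', 'true');   // 1
$treat = preg_match('/tr.e/', 'treat');  // 0: "tre" is followed by "a", not "e"

var_export([$tree, $centre, $true, $treat]);
```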

To match one of two or more expressions, use | to denote alternation.

Pattern: Fred|Muffin

Fred the poodle matches
My cat is Muffin matches

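Note that ^Fred|Muffin isn't the same as ^(Fred|Muffin). Without parentheses, the alternation splits the whole pattern, so ^Fred|Muffin means "Fred at the start of the string, or Muffin anywhere". The parentheses limit the alternation to what's inside them, so in ^(Fred|Muffin) the ^ applies to both names.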

The earlier example of gray and grey could also be done like this:

gr(a|e)y
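
The difference between the grouped and ungrouped alternation can be seen in a short sketch. It assumes preg_match() from the Perl-style functions described later. The variable names are illustrative.

```php
<?php
$subject = 'My cat is Muffin';

// Ungrouped: "Fred at the start" OR "Muffin anywhere".
$ungrouped = preg_match('/^Fred|Muffin/', $subject);    // 1

// Grouped: the string must start with Fred or with Muffin.
$grouped = preg_match('/^(Fred|Muffin)/', $subject);    // 0
$grouped_fred = preg_match('/^(Fred|Muffin)/', 'Fred the poodle');  // 1

// Alternation inside a word, equivalent to gr[ae]y.
$gray = preg_match('/gr(a|e)y/', 'gray');  // 1

var_export([$ungrouped, $grouped, $grouped_fred, $gray]);
```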

Sometimes you want to optionally match an item. The metacharacter ? matches if the item appears zero or one time. For example, if you wanted to match the word "colour" but knew that sometimes it would appear as "color" you could do it like this:

colou?r

? only applies to the item it follows. To include more than one character in an item, use parentheses:

Pattern: do(ugh)?nut

doughnut matches
donut matches

These metacharacters are used in a similar way:

? matches 0 or 1 time
+ matches 1 or more times
* matches any number of times, including zero
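
Here is a minimal sketch of the three quantifiers, assuming preg_match(). The lo+ng and lo*ng patterns are hypothetical examples, not from the original text.

```php
<?php
// ? : zero or one "u"
$color   = preg_match('/colou?r/', 'color');    // 1
$colour  = preg_match('/colou?r/', 'colour');   // 1
$colouur = preg_match('/colou?r/', 'colouur');  // 0: two "u" is one too many

// ? applied to a parenthesized group
$donut = preg_match('/do(ugh)?nut/', 'donut');  // 1

// + : one or more "o" (hypothetical pattern)
$plus_none = preg_match('/lo+ng/', 'lng');      // 0
$plus_many = preg_match('/lo+ng/', 'loooong');  // 1

// * : any number of "o", including none (hypothetical pattern)
$star_none = preg_match('/lo*ng/', 'lng');      // 1

var_export([$color, $colour, $colouur, $donut, $plus_none, $plus_many, $star_none]);
```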

As an example, suppose you are trying to find instances of the <HR> tag in HTML files. You have to take into account the variations in how it could be entered. We'll build a pattern to find any valid instance. (We won't, though, take case into account. Instead, we'll use a search function that isn't case sensitive. We'll cover that a little later.)

The first characters will always be <HR since no spaces are allowed after the <. But there could be any number of spaces after that, including none at all. As well, there might be a size attribute: <HR   SIZE=20 >. So next we'll allow for optional spaces.

Since the size attribute is also optional, we'll build that piece first and enclose it in parentheses. There will be one or more spaces, the word, more optional space, an equals sign, optional quotes, and a number.

( +SIZE *= *(")?[0-9]+(")?)

The whole pattern looks like this:

<HR( +SIZE *= *(")?[0-9]+(")?)? *>

NB: While the regular expression syntax doesn't require that a quotation mark be escaped, you'll need to escape it within PHP code as usual. Or use a single quoted string around the pattern.
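
To see how the finished <HR> pattern behaves, it can be run against a few variations. This sketch assumes preg_match() with the i option (covered in the Perl-style section) for the case-insensitive search. The sample tags are invented for the test. The pattern sits in a single-quoted string so the quotation marks need no escaping.

```php
<?php
$hr_pattern = '/<HR( +SIZE *= *(")?[0-9]+(")?)? *>/i';

$hr_samples = [
    '<HR>',
    '<hr size="20">',
    '<HR   SIZE=20 >',
    '<HR SIZE = 3>',
    '< HR>',          // space after < is not allowed
    '<HR WIDTH=50>',  // the pattern only knows the SIZE attribute
];

$hr_results = [];
foreach ($hr_samples as $tag) {
    $hr_results[] = preg_match($hr_pattern, $tag);
}
var_export($hr_results);
```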

If you want to specify an exact number of times something should match, or a min and max, you can use bounds {} following an item.

{3} match exactly 3 times
{3,} match at least 3 times
{3,5} match at least 3 times with a maximum of 5

To match a regular 7 digit phone number:

[0-9]{3}-[0-9]{4}

NB: this is only a simple match for the basic format of a phone number, not a real test for a valid number. 000-0000 would match here just fine.
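
A short sketch of the bounds in action, assuming preg_match(). It also shows that without ^ and $ the pattern will happily match inside a longer run of digits. The longer test string is made up for the example.

```php
<?php
$phone = '/[0-9]{3}-[0-9]{4}/';

$zeros  = preg_match($phone, '000-0000');    // 1: right format, not a real number
$short  = preg_match($phone, '389-208');     // 0: only three digits after the hyphen
$inside = preg_match($phone, '1234-56789');  // 1: "234-5678" is found inside

// Anchoring both ends forces the whole string to fit the format.
$anchored = preg_match('/^[0-9]{3}-[0-9]{4}$/', '1234-56789');  // 0

var_export([$zeros, $short, $inside, $anchored]);
```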

 

PHP functions for regex

ereg() and eregi() both check for a match and return true or false. eregi() performs a case-insensitive match. Here's the syntax:

ereg(pattern, target, optional_array);

For example:

if (ereg("do(ugh)?nut", "Have a donut, Homer.")) {

    [some stuff]

}

As always, you can use variables:

$pat = "[0-9]{3}-[0-9]{4}";
$target = "389-2080";

ereg($pat, $target, $matches);

Here, $matches is an array. The matched text is placed into $matches[0]. Then, if there are any matches for parenthesized substrings, those are placed into $matches[1] and so on, beginning at the leftmost parenthesis. For example:

ereg("do(ugh)?nut.*(H(o)?mer)", "Have a doughnut, Homer.", $matches);

$matches[0] doughnut, Homer
$matches[1] ugh
$matches[2] Homer
$matches[3] o

This is very handy if you need to pull the data out and use it.

NB: $matches will have exactly 10 elements, regardless of the number of matches. ereg() can match more than 10 substrings, but they cannot be stored.

eregi() works exactly the same, but ignores case distinctions when matching alphabetic characters.
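
On PHP 7 and later, where ereg() is no longer available, the same substring capture can be sketched with preg_match() from the Perl-style functions described below. Note one difference: preg_match() stores only as many elements as there are matched groups, not a fixed ten.

```php
<?php
// Same pattern as the ereg() example, wrapped in / delimiters.
$found = preg_match(
    '/do(ugh)?nut.*(H(o)?mer)/',
    'Have a doughnut, Homer.',
    $matches
);

// $matches[0] is the whole match; groups are numbered by their opening parenthesis.
foreach ($matches as $index => $text) {
    echo '$matches[', $index, '] ', $text, "\n";
}
```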

More on ereg():
http://www.php.net/manual/html/function.ereg.html

Do some tests:
http://www.crazygrrl.com/weav/regex/ereg.php3

ereg_replace() and eregi_replace() do a pattern match and then replace the matched text with a specified string. As before, eregi_replace() does a case-insensitive match. Here's the syntax:

ereg_replace(pattern, replacement, target);

For example:

$result = ereg_replace("apples", "Guinness", "I love apples and oranges.");

$result is: "I love Guinness and oranges."

If the pattern contains parenthesized substrings, you can refer to them in the replacement string using the notation \\n, where n is the number of the substring. The backslash is doubled because it sits inside a double-quoted PHP string. You can use up to nine. This is easier to see in an example. Here, we search a phone number entry and replace the area code since it has changed. As well, we change the separator symbol.

// Target is a phone number entry

$target = "Karen, 604-358-5478";

// Build patterns for parts of the number - this is just to show the different ways
// you can do it. (As before, this doesn't ensure the phone number is valid.)

$pat_area = "([0-9][0-9][0-9])";
$pat_exchange = "([0-9]{3})";
$pat_number = "([[:digit:]]{4})";

// Combine them into one

$pattern = "$pat_area-$pat_exchange-$pat_number";

$new = ereg_replace($pattern, "250.\\2.\\3", $target);

print("<P>Old: $target</P>");
print("<P>New: $new</P>");        

This code displays:

Old: Karen, 604-358-5478

New: Karen, 250.358.5478
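
For PHP versions without ereg_replace() (it was removed in PHP 7), the same replacement can be sketched with preg_replace(), which also accepts \\n references in the replacement. The pattern is the combined one from above, wrapped in / delimiters and written in a single-quoted string.

```php
<?php
$target = 'Karen, 604-358-5478';

// Three parenthesized parts: area code, exchange, number.
$pattern = '/([0-9][0-9][0-9])-([0-9]{3})-([[:digit:]]{4})/';

// '\\2' in a single-quoted string is a backslash followed by 2,
// which preg_replace() reads as "the second parenthesized substring".
$new = preg_replace($pattern, '250.\\2.\\3', $target);

echo "Old: $target\n";
echo "New: $new\n";
```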

More on ereg_replace():
http://www.php.net/manual/html/function.ereg-replace.html

Do some tests:
http://www.crazygrrl.com/weav/regex/ereg_replace.php3

Other POSIX-style regex functions, including split():
http://www.php.net/manual/html/ref.regex.html

 

Perl-style regex functions

PHP, from version 3.0.9 on, also supports Perl-like regex functions. They use most of what we've seen already, but handle character classes in a different way and allow many more options. We'll cover a few of those options here.

When using Perl-style matching, the pattern also has to be enclosed by special delimiters. The default is the forward slash, though you can use others. For example:

/colou?r/

Usually you'll want to stick with the default, but if you need to use the forward slash a lot in the actual pattern (especially if you're dealing with pathnames) you might want to use something else:

!/root/home/random!

To make a match case-insensitive, all you need to do is append the option i to the pattern:

/colou?r/i
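
A brief sketch of delimiters and the i option, using preg_match() (introduced just below). The file path used as the subject is a hypothetical example.

```php
<?php
$path = '/root/home/random/notes.txt';

// With ! as the delimiter, slashes in the pattern need no escaping.
$bang = preg_match('!/root/home/random!', $path);       // 1

// With the default / delimiter, every slash must be escaped.
$slash = preg_match('/\/root\/home\/random/', $path);   // 1

// The i option after the closing delimiter ignores case.
$sensitive   = preg_match('/colou?r/', 'COLOUR');   // 0
$insensitive = preg_match('/colou?r/i', 'COLOUR');  // 1

var_export([$bang, $slash, $sensitive, $insensitive]);
```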

Perl-style functions support these extra metacharacters (this is not a full list):

\b A word boundary, the spot between word (\w) and non-word (\W) characters.
\B A non-word boundary.
\d A single digit character.
\D A single non-digit character.
\n The newline character. (ASCII 10)
\r The carriage return character. (ASCII 13)
\s A single whitespace character.
\S A single non-whitespace character.
\t The tab character. (ASCII 9)
\w A single word character - alphanumeric and underscore.
\W A single non-word character.

Example:

/\bhomer\b/

Have a donut, Homer no match
A tale of homeric proportions! no match
Do you think he can hit a homer? match
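
The word-boundary example can be checked with a sketch like this one, using preg_match() (described next). The pattern is written in a single-quoted string so that PHP passes the backslashes through untouched.

```php
<?php
$word = '/\bhomer\b/';

$capital = preg_match($word, 'Have a donut, Homer');             // 0: capital H
$homeric = preg_match($word, 'A tale of homeric proportions!');  // 0: no boundary after "homer"
$hit     = preg_match($word, 'Do you think he can hit a homer?');  // 1

// Adding the i option lets the first sentence match as well.
$capital_i = preg_match('/\bhomer\b/i', 'Have a donut, Homer');  // 1

var_export([$capital, $homeric, $hit, $capital_i]);
```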

Corresponding to ereg() is preg_match(). Syntax:

preg_match(pattern (string), target (string), optional_array);

Example:

$pattern = '/\b(do(ugh)?nut)\b.*\b(Homer|Fred)\b/i';

$target = "Have a donut, Homer.";

if (preg_match($pattern, $target, $matches)) {

        // $matches[0] is the whole match; [1] to [3] are the parenthesized parts
        print("<P>Match: $matches[0]</P>");
        print("<P>Pastry: $matches[1]</P>");
        print("<P>Variant: $matches[2]</P>");
        print("<P>Name: $matches[3]</P>");
}

else {
        print("No match.");
}

Results:

Match: donut, Homer

Pastry: donut

Variant:   [blank because there was no "ugh"]

Name: Homer

If you use the $target "Doughnut, Frederick?" there will be no match, since there has to be a word boundary after Fred. "Doughnut, fred?" will match, however, since the i option makes the pattern case-insensitive.

There is much, much more that you can do with these functions, but we don't have time to go into it. Read the docs and some reference material.

Syntax overview:
http://www.php.net/manual/html/pcre.pattern.syntax.html

Perl-style functions:
http://www.php.net/manual/html/ref.pcre.html

Test it:
http://www5.islandnet.com/~kfriesen/weav/regex/preg.php3

 

Examples:

Sending email:
http://www.canowhoopass.com/weav/wssig/email.php

Regex Reference Sheet:
http://www.crazygrrl.com/weav/reference.php3

 



   Last modified: 26 September 2014, 13:59

Copyright © 2005-2014 Eurodata SIA                  Powered by FleksCMS