Regex
Imal Perera  

Get to know REGEX


A regular expression (regex, or regexp as some say) is a defined search pattern. Regular expressions are not specific to one language, although the syntax differs slightly from language to language. A regex can be used to search a text for strings that follow a specific pattern.

A few symbols that are used often:

^  : Matches at the beginning of the string; as the first character inside a character class, as in [^a-z], it means negation
$   : Matches at the end of the string
*  : 0 or more times
+  : 1 or more times
a|b  : Matches a or b
\s  : white space character
[a-z]  : Indicates a range, in this case a to z
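
To see these symbols in action, here is a minimal Java sketch. The class name SymbolDemo, the helper found() and the sample words are made up for illustration; only the java.util.regex calls come from the standard library.

```java
import java.util.regex.Pattern;

public class SymbolDemo {

    // Returns true when the pattern is found anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^cat", "catalog"));    // true: "cat" is at the beginning
        System.out.println(found("^cat", "bobcat"));     // false: "cat" is not at the beginning
        System.out.println(found("cat$", "bobcat"));     // true: "cat" is at the end
        System.out.println(found("ab*c", "ac"));         // true: * allows zero "b"
        System.out.println(found("ab+c", "ac"));         // false: + needs at least one "b"
        System.out.println(found("cat|dog", "hotdog"));  // true: the "dog" alternative matches
        System.out.println(found("a\\sb", "a b"));       // true: \s matches the space
        System.out.println(found("^[a-z]+$", "regex"));  // true: only letters a to z
        System.out.println(found("^[^a-z]+$", "1234"));  // true: ^ inside brackets negates the range
    }
}
```

Note that in Java source code every regex backslash has to be written twice, which is why \s appears as \\s.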

Special Characters in REGEX

$   .   ^   *   ?   (   )   \   <   >   {   }   [   ]   +   -   |

So whenever we need to match a string that contains one of these reserved characters, we need to escape it with a backslash.

\$   \.   \^   \*   \?   \(   \)   \\   \<   \>   \{   \}   \[   \]   \+   \-   \|

Here is an example of matching a string that contains reserved characters:

A\(Nh4\)2

The regex above matches exactly this string: A(Nh4)2
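
A small Java sketch of the same idea, assuming a hypothetical class EscapeDemo with a helper matchesFully(). It also shows what happens when the parentheses are left unescaped, and uses Pattern.quote from the standard library as an alternative to escaping by hand.

```java
import java.util.regex.Pattern;

public class EscapeDemo {

    // Returns true when the whole input matches the regex.
    public static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // In Java source, each regex backslash is written as two backslashes.
        String escaped = "A\\(Nh4\\)2";
        System.out.println(matchesFully(escaped, "A(Nh4)2"));    // true

        // Without escaping, the parentheses form a group, so the pattern means "ANh42".
        String unescaped = "A(Nh4)2";
        System.out.println(matchesFully(unescaped, "A(Nh4)2"));  // false
        System.out.println(matchesFully(unescaped, "ANh42"));    // true

        // Pattern.quote turns a literal string into a regex that matches it exactly.
        System.out.println(matchesFully(Pattern.quote("A(Nh4)2"), "A(Nh4)2")); // true
    }
}
```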

Let’s Write a REGEX

I know the information above is not that useful when it comes to writing a regex in practice, so let's write a regex that matches an email address to understand a little more about how regex works.

First of all, according to the RFC 2822 standard, email addresses are allowed to use only a limited range of ASCII characters.

Local part of email (part before @ sign)

The local part of the email address may use any of these ASCII characters (RFC 5322, Section 3.2.3); RFC 6531 permits Unicode characters beyond the ASCII range:

  • Uppercase and lowercase English letters (a–z, A–Z) (ASCII: 65–90, 97–122)
  • Digits 0 to 9 (ASCII: 48–57)
  • These special characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~ (Support for these is limited)
  • Character . (dot, period, full stop) (ASCII: 46) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively (e.g. [email protected] is not allowed).

Let's describe a valid email format in words

A valid email may contain a . (dot), but it can't start with a dot and it can't contain two dots in a row. We can write the valid email format like this:

{(valid string without dot [1 or more times])  concat  (valid string with a dot, [0 or more times])} @
(valid end part)

A valid first part may contain any combination of a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~ without a dot. By using the range symbol (in other words, a character class or character set) we can write a regex that matches any single character from the set above, like this:

[a-z0-9!#$%&'*+/=?^_`{|}~-]

But the matching string should contain one or more of these symbols, so the regex is

[a-z0-9!#$%&'*+/=?^_`{|}~-]+
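
As a quick check of this step, here is a hedged Java sketch. The class LocalPartDemo, the constant ATOM, the helper isAtom() and the sample strings are illustrative names, not part of any library. Because the character class as written lists only lowercase letters, the sketch also shows the standard Pattern.CASE_INSENSITIVE flag as one possible way to accept uppercase letters.

```java
import java.util.regex.Pattern;

public class LocalPartDemo {

    // One or more characters from the allowed set; the dot is deliberately not included.
    public static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";

    // Returns true when the whole string consists of allowed characters only.
    public static boolean isAtom(String s) {
        return s.matches(ATOM);
    }

    public static void main(String[] args) {
        System.out.println(isAtom("imal"));        // true
        System.out.println(isAtom("abc#1"));       // true
        System.out.println(isAtom(""));            // false: + requires at least one character
        System.out.println(isAtom("first.last"));  // false: the dot is not in the set
        System.out.println(isAtom("Imal"));        // false: the set only lists a-z

        // With the case-insensitive flag the same class also accepts A-Z.
        Pattern anyCase = Pattern.compile(ATOM, Pattern.CASE_INSENSITIVE);
        System.out.println(anyCase.matcher("Imal").matches()); // true
    }
}
```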

A valid second part may also contain any combination of a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~ and it may have non-consecutive dots. We can write that using the range symbol as well:

\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+

If you read the regex above carefully, you will see that it is equivalent to an expression like

dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x)     [x>0]

The output of the regex above never contains consecutive dots, but it always starts with exactly one dot, so it matches strings like these:

.abc#1

.zzzzxy0123

But the spec says that dots are optional and may appear anywhere except in the first or last position, as long as no two of them are consecutive, so we wrap the regex in a group that may repeat:

(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*

(* indicates 0 or more times), so the regex above is equivalent to an expression like this

(dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x))*y         [x>0] [y>=0]

Now let's take a situation like this, where
x = 1
y = 2
Assume anyof() always gives "abc12#x" as the result (this is not true in a real scenario)

then

(dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x))*y

= ((. + abc12#x) * 1)+((. + abc12#x) * 1)

= ((.abc12#x) * 1)+((.abc12#x) * 1)

= (.abc12#x * 1)+(.abc12#x * 1)

= .abc12#x + .abc12#x

= .abc12#x.abc12#x

By studying the expression we can clearly see that although the output contains multiple dots, no two of them are consecutive.
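
The same reasoning can be checked with a short Java sketch that joins the first part and the repeated dotted group. The class DottedLocalPartDemo, the constants ATOM and LOCAL_PART, and the helper isLocalPart() are names assumed here for illustration.

```java
public class DottedLocalPartDemo {

    // One or more allowed characters, without a dot.
    static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";

    // First part, followed by zero or more groups of "a dot plus allowed characters".
    public static final String LOCAL_PART = ATOM + "(\\." + ATOM + ")*";

    // Returns true when the whole string is a valid local part under this pattern.
    public static boolean isLocalPart(String s) {
        return s.matches(LOCAL_PART);
    }

    public static void main(String[] args) {
        System.out.println(isLocalPart("abc12#x"));                  // true: the group repeats 0 times
        System.out.println(isLocalPart("abc12#x.abc12#x.abc12#x"));  // true: the group repeats 2 times
        System.out.println(isLocalPart(".abc"));                     // false: starts with a dot
        System.out.println(isLocalPart("abc."));                     // false: ends with a dot
        System.out.println(isLocalPart("abc..def"));                 // false: consecutive dots
    }
}
```

Every dot has to be followed by at least one allowed character, which is what rules out a trailing dot and two dots in a row.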

Valid final part of the pattern (the part after the @ sign)

([a-zA-Z0-9\.])+([a-zA-Z0-9])+
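
To get a feel for what this final part accepts, here is a small Java sketch. The class DomainPartDemo, the constant DOMAIN, the helper isDomain() and the sample domains are assumptions made for the example. It is only an observation about the pattern as written, not a stricter replacement: unlike the local part, this pattern does not rule out a leading dot or consecutive dots.

```java
public class DomainPartDemo {

    // Letters, digits or dots (one or more), then letters or digits (one or more).
    public static final String DOMAIN = "([a-zA-Z0-9\\.])+([a-zA-Z0-9])+";

    // Returns true when the whole string matches the final part of the pattern.
    public static boolean isDomain(String s) {
        return s.matches(DOMAIN);
    }

    public static void main(String[] args) {
        System.out.println(isDomain("gmail.com"));      // true
        System.out.println(isDomain("example.co.uk"));  // true
        System.out.println(isDomain("gmail.com."));     // false: must end with a letter or digit
        System.out.println(isDomain("a"));              // false: needs at least two characters
        System.out.println(isDomain("a..b"));           // true: consecutive dots are accepted here
        System.out.println(isDomain(".com"));           // true: a leading dot is accepted here
    }
}
```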

Putting everything together

Regex is :

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\.])+([a-zA-Z0-9])+

If we want to check whether a whole string is an email address, the regex must say that the match runs from the beginning of the string to the end, like this:

^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\.])+([a-zA-Z0-9])+$
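
The difference between the two forms can be sketched in Java as follows. The class AnchorDemo, the helpers containsEmail() and isEmail(), and the address user@example.com are made up for illustration.

```java
import java.util.regex.Pattern;

public class AnchorDemo {

    // The unanchored pattern built in this article.
    static final String EMAIL =
        "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\\.])+([a-zA-Z0-9])+";

    // True when an email-like substring appears anywhere in the text.
    public static boolean containsEmail(String text) {
        return Pattern.compile(EMAIL).matcher(text).find();
    }

    // True when the text is an email from the beginning (^) to the end ($).
    public static boolean isEmail(String text) {
        return Pattern.compile("^" + EMAIL + "$").matcher(text).find();
    }

    public static void main(String[] args) {
        String sentence = "contact: user@example.com today";
        System.out.println(containsEmail(sentence));      // true: an address is inside the text
        System.out.println(isEmail(sentence));            // false: the text is more than an address
        System.out.println(isEmail("user@example.com"));  // true
    }
}
```

Use the unanchored form to search inside a longer text and the anchored form to validate a single input value.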

You can follow the tutorials at http://www.regexbuddy.com, which explain regex very well.
You can also test your regex pattern before using it in a program at http://regexhero.net/tester/

Simple Java regex example

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * @author imal
 */
public class RegexTuto {

    public static void main(String[] args) {
        String teststr = "[email protected] imal2 abcdeimal [email protected] testimal1243";
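        // Unanchored version of the pattern, so find() can locate addresses inside a longer text.
        // Each regex backslash is doubled inside a Java string literal.
        Pattern pattern = Pattern.compile("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\\.])+([a-zA-Z0-9])+");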
        Matcher matcher = pattern.matcher(teststr);
        // find() moves to the next match; group() returns the text that matched.
        while (matcher.find()) {
            System.out.println("Found : " + matcher.group());
        }
    }

}

Output :
