Regex

Imal Perera, 10 years ago, March 16, 2020

Get to know REGEX


A regular expression (regex, or regexp as some say) is a pattern that defines a search. Regular expressions are not specific to any one language, although the syntax varies slightly from language to language. A regex can be used to search a text for strings that follow a specific pattern.

A few commonly used symbols

^	:	Matches at the beginning of the string, or negates a character class when it is the first character inside [ ]
$	:	Matches at the end of the string
*	:	Preceding element 0 or more times
+	:	Preceding element 1 or more times
a|b	:	Matches a or b
\s	:	Whitespace character
[a-z]	:	A range of characters, here a to z
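The symbols above can be tried out with a few lines of Java. This is only a minimal sketch: the class name SymbolDemo, the helper found() and the sample strings are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

class SymbolDemo {

    // Hypothetical helper: true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^cat", "catalog"));   // true: "cat" is at the beginning
        System.out.println(found("^cat", "bobcat"));    // false: "cat" is not at the beginning
        System.out.println(found("cat$", "bobcat"));    // true: "cat" is at the end
        System.out.println(found("[^a-z]", "abc"));     // false: ^ inside [ ] negates the class
        System.out.println(found("^ab*c$", "ac"));      // true: * allows zero b's
        System.out.println(found("^ab+c$", "ac"));      // false: + needs at least one b
        System.out.println(found("^(a|b)$", "b"));      // true: a or b
        System.out.println(found("a\\sb", "a b"));      // true: \s matches the space
        System.out.println(found("^[a-z]+$", "hello")); // true: only characters from a to z
    }
}
```

Note that in Java source the backslash of \s has to be written twice, because the regex sits inside a string literal.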

Special Characters in REGEX

$ . ^ * ? ( ) \ < > { } [ ] + -

So whenever we need to match a string that contains a reserved character, we need to escape it with a backslash.

\$ \. \^ \* \? \( \) \\ \< \> \{ \} \[ \] \+ \-

Here is an example of matching a string that contains reserved characters:

A\(Nh4\)2

The above regex matches a string exactly equal to A(Nh4)2.
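To see why the escaping matters, the same example can be run in Java. This is a small sketch: the class name EscapeDemo and the helper fullMatch() are invented for illustration, while Pattern.matches and Pattern.quote are standard java.util.regex methods.

```java
import java.util.regex.Pattern;

class EscapeDemo {

    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String formula = "A(Nh4)2";

        // Escaped parentheses are literal characters (backslashes doubled for the Java string).
        System.out.println(fullMatch("A\\(Nh4\\)2", formula));           // true

        // Unescaped parentheses only form a group, so the pattern really means "ANh42".
        System.out.println(fullMatch("A(Nh4)2", formula));               // false
        System.out.println(fullMatch("A(Nh4)2", "ANh42"));               // true

        // Pattern.quote escapes a whole literal string for us.
        System.out.println(fullMatch(Pattern.quote(formula), formula));  // true
    }
}
```

When the text to match comes from user input, Pattern.quote is usually easier than escaping each reserved character by hand.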

Let’s Write a REGEX

I know the information above is not that useful when it comes to writing a regex in practice, so let’s write a regex that matches an email address to understand a little more about how regex works.

First of all, according to the RFC 2822 standard, email addresses are allowed to use a range of ASCII characters.

Local part of email (part before @ sign)

The local-part of the email address may use any of these ASCII characters (RFC 5322 Section 3.2.3; RFC 6531 permits Unicode beyond the ASCII range):

Uppercase and lowercase English letters (a–z, A–Z) (ASCII: 65–90, 97–122)
Digits 0 to 9 (ASCII: 48–57)
These special characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~ (support for these is limited)
Character . (dot, period, full stop) (ASCII: 46) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively (e.g. [email protected] is not allowed).

Let’s write a valid email format verbally

A valid email address may contain a . (dot), but it can’t start with a dot and it can’t have two dots next to each other. We can describe the valid format like this:

{(valid string without a dot [1 or more characters]) concat (a dot followed by a valid string without a dot [0 or more times])} @
(valid end part)

The valid first part can contain any combination of a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~ but no dot. Using the range symbol, in other words a character class (or character set), we can write a regex that matches any single character from the set above. Note that the class below lists only lowercase letters; add A-Z to it to accept uppercase letters as well.

[a-z0-9!#$%&'*+/=?^_`{|}~-]

But the matching string should contain one or more of these symbols, so the regex is

[a-z0-9!#$%&'*+/=?^_`{|}~-]+
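The character class followed by + can be checked on its own before the rest of the pattern is built. The sketch below assumes a made-up class SegmentDemo with a helper isSegment(); the sample strings are illustrative only.

```java
import java.util.regex.Pattern;

class SegmentDemo {

    // One dot-free run of allowed characters (lowercase letters only, as in the regex above).
    static final Pattern SEGMENT = Pattern.compile("[a-z0-9!#$%&'*+/=?^_`{|}~-]+");

    // Hypothetical helper: true if the whole string consists of allowed characters.
    static boolean isSegment(String s) {
        return SEGMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSegment("imal"));   // true
        System.out.println(isSegment("abc#1"));  // true: # and digits are in the class
        System.out.println(isSegment("a.b"));    // false: the dot is not in the class
        System.out.println(isSegment(""));       // false: + needs at least one character
        System.out.println(isSegment("Imal"));   // false: the class has no uppercase letters
    }
}
```

Inside a character class most reserved characters lose their special meaning, which is why $ * + ? and | need no backslash here. The hyphen is placed last so that it is read as a literal character and not as a range.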

The valid second part can also contain any combination of a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~ and it may contain non-consecutive dots. We can write that using the range symbol as well:

\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+

If you read the above regex carefully, you will see that it produces an expression similar to

dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x) [x>0]

The above regex never produces consecutive dots, but it always requires exactly one dot, so it matches strings like these:

.abc#1

.zzzzxy0123

But the spec clearly says that a dot can appear anywhere, as long as there are no consecutive dots, so we can change the regex like this:

(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*

(* indicates 0 or more times), so the above regex is equivalent to an expression like this:

(dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x))*y [x>0] [y>=0]

Now let’s take a situation where
x = 1
y = 2
and assume anyof() always gives “abc12#x” as the result (this is not true in a real scenario).

Then

(dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x))*y

= ((. + abc12#x) * 1)+((. + abc12#x) * 1)

= ((.abc12#x) * 1)+((.abc12#x) * 1)

= (.abc12#x * 1)+(.abc12#x * 1)

= .abc12#x + .abc12#x

= .abc12#x.abc12#x

By studying the expression we can clearly see that although the output contains multiple dots, no two of them are consecutive.
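The same reasoning can be confirmed in Java by combining the dot-free part with the repeated dot group. This is a sketch under the article’s assumptions (lowercase letters only); the class LocalPartDemo, the helper isLocalPart() and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class LocalPartDemo {

    // A dot-free run of allowed characters.
    static final String SEGMENT = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";

    // A run, followed by zero or more groups of "dot + run".
    static final Pattern LOCAL = Pattern.compile(SEGMENT + "(\\." + SEGMENT + ")*");

    // Hypothetical helper: true if the whole string is a valid local part.
    static boolean isLocalPart(String s) {
        return LOCAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLocalPart("imal"));         // true: the dot group repeats 0 times
        System.out.println(isLocalPart("imal.perera"));  // true: the dot group repeats once
        System.out.println(isLocalPart("a.b.c"));        // true: the dot group repeats twice
        System.out.println(isLocalPart("a..b"));         // false: consecutive dots
        System.out.println(isLocalPart(".abc"));         // false: starts with a dot
        System.out.println(isLocalPart("abc."));         // false: ends with a dot
    }
}
```

Every dot must be followed by at least one allowed character, so two dots can never sit next to each other and a dot can never be the last character.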

Valid final part of the pattern

([a-zA-Z0-9\.])+([a-zA-Z0-9])+

Putting everything together

The regex is:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\.])+([a-zA-Z0-9])+

If we are matching a string to check whether it is an email address, then the regex should say that the string must be an email address from beginning to end, like this:

^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\.])+([a-zA-Z0-9])+$
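The difference between the unanchored and the anchored pattern can be seen with a short Java sketch. The class EmailCheckDemo, the helpers containsEmail() and isEmail(), and the address user@example.com are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

class EmailCheckDemo {

    // The article's pattern without anchors (backslashes doubled for the Java string).
    static final String EMAIL =
            "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
            + "@([a-zA-Z0-9\\.])+([a-zA-Z0-9])+";

    static final Pattern ANYWHERE = Pattern.compile(EMAIL);
    static final Pattern WHOLE = Pattern.compile("^" + EMAIL + "$");

    // Hypothetical helper: true if the text contains an email somewhere.
    static boolean containsEmail(String text) {
        return ANYWHERE.matcher(text).find();
    }

    // Hypothetical helper: true if the text is an email from beginning to end.
    static boolean isEmail(String text) {
        return WHOLE.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "write to user@example.com today";
        System.out.println(containsEmail(text));               // true
        System.out.println(isEmail(text));                     // false: extra words around the email
        System.out.println(isEmail("user@example.com"));       // true
        System.out.println(isEmail("user..name@example.com")); // false: consecutive dots
    }
}
```

One thing to keep in mind: the final part as written puts the dot inside the character class, so it would still accept consecutive dots after the @ sign (for example user@a..b). Applying the "dot followed by a valid string" idea to the final part as well would be one way to tighten it.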

You can follow the tutorials at http://www.regexbuddy.com, which explain regex very well.
You can also test your regex pattern before using it in a program at http://regexhero.net/tester/

Simple Java regex example

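import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Finds every email-like substring in a test string.
 *
 * @author imal
 */
public class RegexTuto {

    public static void main(String[] args) {
        String teststr = "[email protected] imal2 abcdeimal [email protected] testimal1243";
        // Unanchored pattern (no ^ or $), so find() can locate emails anywhere in the text.
        // Backslashes are doubled because the regex is written as a Java string literal.
        Pattern pattern = Pattern.compile("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\\.])+([a-zA-Z0-9])+");
        Matcher matcher = pattern.matcher(teststr);
        // Each call to find() advances to the next match in the string.
        while (matcher.find()) {
            System.out.println("Found : " + matcher.group());
        }
    }

}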

Output :

Found : [email protected]
Found : [email protected]
