Get to know REGEX
A regular expression (regex, or regexp) is a pattern that describes a set of strings to search for. Regular expressions are not specific to any one programming language, although the syntax varies slightly from language to language. A regex can be used to find strings that follow a particular pattern within a text.
A few symbols that are used often:
^ | : | Anchors the match at the beginning of the string; placed first inside a character class, as in [^a-z], it negates the class |
$ | : | Anchors the match at the end of the string |
* | : | The preceding element 0 or more times |
+ | : | The preceding element 1 or more times |
a|b | : | Matches a or b |
\s | : | Whitespace character |
[a-z] | : | Indicates a range, in this case a to z |
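To make the table concrete, here is a small Java sketch that tries each symbol against a short input. The class name SymbolDemo and the helper found() are made up for this illustration; only java.util.regex.Pattern comes from the standard library.

```java
import java.util.regex.Pattern;

public class SymbolDemo {
    // Hypothetical helper: true when the regex is found anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^abc", "abcdef"));      // ^ anchors at the beginning
        System.out.println(found("^abc", "xabc"));        // "abc" is not at the beginning here
        System.out.println(found("abc$", "xabc"));        // $ anchors at the end
        System.out.println(found("^ab*c$", "ac"));        // * accepts zero b's
        System.out.println(found("^ab+c$", "ac"));        // + needs at least one b
        System.out.println(found("^(cat|dog)$", "dog"));  // a|b alternation
        System.out.println(found("a\\sb", "a b"));        // \s is a whitespace character
        System.out.println(found("^[a-z]+$", "hello"));   // [a-z] range
        System.out.println(found("^[^a-z]+$", "HELLO"));  // ^ inside [] negates the class
    }
}
```

Note that in Java source code every regex backslash has to be written twice, which is why \s appears as "\\s".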
Special Characters in REGEX
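Some characters, such as ( ) [ ] { } . * + ? ^ $ | and \, are reserved because they have a special meaning in a regex. Whenever we need to match a string that contains a reserved character literally, we need to escape it with a backslash.
Here is an example of matching a string that contains reserved characters: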
A\(Nh4\)2
The regex above matches exactly this string: A(Nh4)2
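The effect of escaping can be checked with a short Java sketch. The class name EscapeDemo and the helper matchesWhole() are invented for this example; Pattern.matches and Pattern.quote are standard library methods.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    public static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The regex A\(Nh4\)2 written as a Java string (each backslash doubled).
        String escaped = "A\\(Nh4\\)2";
        System.out.println(matchesWhole(escaped, "A(Nh4)2"));   // parentheses are literal

        // Without escaping, the parentheses form a group instead of matching "(" and ")".
        String unescaped = "A(Nh4)2";
        System.out.println(matchesWhole(unescaped, "A(Nh4)2")); // does not match
        System.out.println(matchesWhole(unescaped, "ANh42"));   // matches this instead

        // Pattern.quote escapes a whole literal string for us.
        System.out.println(matchesWhole(Pattern.quote("A(Nh4)2"), "A(Nh4)2"));
    }
}
```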
Let’s Write a REGEX
I know the information above is not that useful when it comes to writing a regex in practice, so let’s try to write a regex that matches an email address to understand a little more about how regex works.
First of all, according to the RFC 2822 standard, email addresses may use a limited range of ASCII characters.
Local part of email (part before @ sign)
The local part of an email address may use any of the following ASCII characters (RFC 5322, Section 3.2.3; RFC 6531 permits Unicode characters beyond the ASCII range):
- Uppercase and lowercase English letters (a–z, A–Z) (ASCII: 65–90, 97–122)
- Digits 0 to 9 (ASCII: 48–57)
- These special characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~ (support for these is limited)
- Character . (dot, period, full stop) (ASCII: 46) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively (e.g. [email protected] is not allowed).
Let’s write a valid email format verbally
A valid email address may contain a . (dot), but it cannot start with a dot and it cannot contain two consecutive dots. We can write the valid email format like this:
(valid first part)(valid second part)@(valid end part)
The valid first part should contain any combination of a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~ without a dot. Using the range symbol, in other words a character class (or character set), we can write a regex that matches any single character from the set above, like this:
[a-z0-9!#$%&'*+/=?^_`{|}~-]
But the matching string should contain one or more of these symbols, so the regex is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
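As a rough check of the first part, the sketch below compiles a character class followed by + and tests a few strings. The class name FirstPartDemo, the constant ATOM and the method isFirstPart() are names invented for this illustration. The character class follows the list given in the text (including the backtick and the hyphen) and is an assumption to that extent.

```java
import java.util.regex.Pattern;

public class FirstPartDemo {
    // One allowed character: lowercase letters, digits and the listed specials (no dot).
    public static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]";

    // One or more allowed characters.
    public static final Pattern FIRST_PART = Pattern.compile(ATOM + "+");

    // Hypothetical helper: true when the whole string is a valid first part.
    public static boolean isFirstPart(String s) {
        return FIRST_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFirstPart("abc#1")); // only allowed characters
        System.out.println(isFirstPart(""));      // + requires at least one character
        System.out.println(isFirstPart("a.b"));   // a dot is not in the class
    }
}
```

Because the class lists only a-z, uppercase letters are rejected unless the pattern is compiled with Pattern.CASE_INSENSITIVE or A-Z is added to the class.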
The valid second part should also contain any combination of a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~ and it may also contain non-consecutive dots. We can write that with a character class as well:
\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+
If you study the regex above carefully, you will see that it corresponds to an expression similar to
dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x) [x>0]
The output of this regex never contains consecutive dots, but it always starts with exactly one dot, so it matches strings like these:
.abc#1
.zzzzxy0123
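The behaviour of this intermediate pattern (one mandatory dot followed by one or more allowed characters) can be confirmed with a short Java sketch. DotPartDemo, ATOM and isDotPart() are hypothetical names used only here, and the character class is taken from the list in the text.

```java
import java.util.regex.Pattern;

public class DotPartDemo {
    // Allowed characters of the local part, without the dot.
    public static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]";

    // Exactly one dot, then one or more allowed characters.
    public static final Pattern DOT_PART = Pattern.compile("\\." + ATOM + "+");

    // Hypothetical helper: true when the whole string is "dot + characters".
    public static boolean isDotPart(String s) {
        return DOT_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDotPart(".abc#1"));      // leading dot is required
        System.out.println(isDotPart(".zzzzxy0123"));
        System.out.println(isDotPart("abc#1"));       // rejected: no leading dot
        System.out.println(isDotPart("..abc"));       // rejected: second dot is not in the class
    }
}
```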
But the spec says that a dot is optional and may appear any number of times after the first character, as long as dots are not consecutive, so we can change the regex like this:
(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
(* indicates 0 or more times), so the regex above is equivalent to an expression like this:
(dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x))*y [x>0] [y>=0]
Now let’s take a situation where
x = 1
y = 2
and assume anyof() always gives “abc12#x” as the result (this is not true in a real scenario).
Then
(dot + (anyof( a-z0-9!#$%&'*+/=?^_`{|}~- ) * x))*y
= (. + (abc12#x * 1)) + (. + (abc12#x * 1))
= (. + abc12#x) + (. + abc12#x)
= .abc12#x + .abc12#x
= .abc12#x.abc12#x
By studying the expression we can clearly see that although the output contains multiple dots, no two of them are consecutive.
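The same conclusion can be checked in Java by combining the first part with the repeated "dot + characters" group. This is only a sketch: LocalPartDemo, ATOM, LOCAL_PART and isLocalPart() are made-up names, and the character class is assumed from the list in the text.

```java
import java.util.regex.Pattern;

public class LocalPartDemo {
    // Allowed characters of the local part, without the dot.
    public static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]";

    // First part, then zero or more groups of "dot + one or more characters".
    public static final Pattern LOCAL_PART =
            Pattern.compile(ATOM + "+(\\." + ATOM + "+)*");

    // Hypothetical helper: true when the whole string is a valid local part.
    public static boolean isLocalPart(String s) {
        return LOCAL_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLocalPart("abc"));             // y = 0, no dot at all
        System.out.println(isLocalPart("abc12#x.abc12#x")); // dots separated by characters
        System.out.println(isLocalPart("a..b"));            // rejected: consecutive dots
        System.out.println(isLocalPart(".abc"));            // rejected: starts with a dot
        System.out.println(isLocalPart("abc."));            // rejected: ends with a dot
    }
}
```

Every dot has to be followed by at least one allowed character, which is what rules out both consecutive dots and a trailing dot.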
The valid final part of the pattern (the @ sign followed by the domain) is:
@([a-zA-Z0-9\.])+([a-zA-Z0-9])+
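Putting everything together
The regex is:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\.])+([a-zA-Z0-9])+
If we are matching a string to check whether it is an email address, the regex should say that the string must be an email address from the beginning to the end, like this:
^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\.])+([a-zA-Z0-9])+$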
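The difference between searching and validating can be sketched in Java as follows. The class AnchorDemo, its constants and the methods containsEmail() and isEmail() are hypothetical names, and the addresses at example.com are placeholders chosen for this illustration.

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    public static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]";

    // Unanchored email pattern assembled from the parts discussed in the text.
    public static final String EMAIL =
            ATOM + "+(\\." + ATOM + "+)*@([a-zA-Z0-9\\.])+([a-zA-Z0-9])+";

    private static final Pattern SEARCH = Pattern.compile(EMAIL);
    private static final Pattern WHOLE = Pattern.compile("^" + EMAIL + "$");

    // True when an email-like substring occurs somewhere in the text.
    public static boolean containsEmail(String text) {
        return SEARCH.matcher(text).find();
    }

    // True only when the string is an email from the beginning to the end.
    public static boolean isEmail(String text) {
        return WHOLE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("first.last@example.com"));  // whole string is an email
        System.out.println(isEmail("a..b@example.com"));        // rejected by the anchors
        System.out.println(containsEmail("a..b@example.com"));  // finds "b@example.com" inside
        System.out.println(isEmail("mail user@example.com"));   // extra text before the email
    }
}
```

Without the anchors, the search simply skips the invalid prefix and reports the valid tail, which is why validation needs ^ and $. One limitation of the domain part as written: it accepts consecutive dots, so a string such as user@example..com still passes.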
You can follow the tutorials at http://www.regexbuddy.com, which explain regex very well.
You can also test your regex pattern before using it in a program at http://regexhero.net/tester/
Simple Java regex example
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 *
 * @author imal
 */
public class RegexTuto {
    public static void main(String[] args) {
        String teststr = "[email protected] imal2 abcdeimal [email protected] testimal1243";
        // The pattern is not anchored with ^ and $, so find() can pick out
        // every email address that appears inside the longer text.
        Pattern pattern = Pattern.compile("[a-z0-9!#$%&'*+/=?^_`{|}~-]+(\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@([a-zA-Z0-9\\.])+([a-zA-Z0-9])+");
        Matcher matcher = pattern.matcher(teststr);
        // Print each match in the order it is found.
        while (matcher.find()) {
            System.out.println("Found : " + matcher.group());
        }
    }
}
Output :
Found : [email protected]
Found : [email protected]