Java Regular Expression: April 2008

Saturday, April 5, 2008

Java regular expression email validations

JavaRegxEmailValidations.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/**
 * Created by IntelliJ IDEA.
 * User: Ishara Samantha
 * Date: Apr 6, 2008
 * Time: 10:34:06 AM
 * To change this template use File | Settings | File Templates.
 */
public class JavaRegxEmailValidations
{

    // \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
    //
    // Options: case insensitive
    //
    // Assert position at a word boundary «\b»
    // Match a single character present in the list below «[A-Z0-9._%+-]+»
    //    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    //    A character in the range between “A” and “Z” «A-Z»
    //    A character in the range between “0” and “9” «0-9»
    //    One of the characters “._%” «._%»
    //    The character “+” «+»
    //    The character “-” «-»
    // Match the character “@” literally «@»
    // Match a single character present in the list below «[A-Z0-9.-]+»
    //    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    //    A character in the range between “A” and “Z” «A-Z»
    //    A character in the range between “0” and “9” «0-9»
    //    The character “.” «.»
    //    The character “-” «-»
    // Match the character “.” literally «\.»
    // Match a single character in the range between “A” and “Z” «[A-Z]{2,4}»
    //    Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
    // Assert position at a word boundary «\b»

    public static void main(String[] args)
    {
        System.out.println("emailValidation(\"ishara@gmail.com\") = " + emailValidation("ishara@gmail.com"));
        System.out.println("emailValidation(\"ip@1.2.3.123\") = " + emailValidation("ip@1.2.3.123"));
        System.out.println("emailValidation(\"pharaoh@egyptian.museum\") = " + emailValidation("pharaoh@egyptian.museum"));
        System.out.println("emailValidation(\"john.doe+regexbuddy@gmail.com\") = " + emailValidation("john.doe+regexbuddy@gmail.com"));
        System.out.println("emailValidation(\"Mike.O'Dell@ireland.com\") = " + emailValidation("Mike.O'Dell@ireland.com"));
        System.out.println("emailValidation(\"\\\"Mike\\\\ O'Dell\\\"@ireland.com\") = " + emailValidation("\"Mike\\ O'Dell\"@ireland.com"));
        System.out.println("emailValidation(\"IPguy@[1.2.3.4]\") = " + emailValidation("IPguy@[1.2.3.4]"));
        System.out.println("emailValidation(\"ishara.samantha@gmail.com\") = " + emailValidation("ishara.samantha@gmail.com"));
        System.out.println("emailValidation(\"ishara@ac.lk\") = " + emailValidation("ishara@ac.lk"));
        System.out.println("emailValidation(\"1024x768@60Hz\") = " + emailValidation("1024x768@60Hz"));
        System.out.println("emailValidation(\"not.a.valid.email\") = " + emailValidation("not.a.valid.email"));
        System.out.println("emailValidation(\"not@valid.email\") = " + emailValidation("not@valid.email"));
        System.out.println("emailValidation(\"john@aol...com\") = " + emailValidation("john@aol...com"));
        System.out.println("emailValidation(\"Mike\\\\ O'Dell@ireland.com\") = " + emailValidation("Mike\\ O'Dell@ireland.com"));
    }

    private static boolean emailValidation(String email)
    {
        boolean foundMatch = false;
        try {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(email);
            // find() succeeds if any part of the string matches, not only the whole string
            foundMatch = regexMatcher.find();
        } catch (PatternSyntaxException ex) {
            // Syntax error in the regular expression
        }
        return foundMatch;
    }
}

Email address
Use this version to seek out email addresses in random documents and texts.
Does not match email addresses using an IP address instead of a domain name.
Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum. Including these increases the risk of false positives when applying the regex to random documents.
Requires the "case insensitive" option to be ON.

emailValidation("ishara@gmail.com") = true
emailValidation("ip@1.2.3.123") = false
emailValidation("pharaoh@egyptian.museum") = false
emailValidation("john.doe+regexbuddy@gmail.com") = true
emailValidation("Mike.O'Dell@ireland.com") = true
emailValidation("\"Mike\\ O'Dell\"@ireland.com") = false
emailValidation("IPguy@[1.2.3.4]") = false
emailValidation("ishara.samantha@gmail.com") = true
emailValidation("ishara@ac.lk") = true
emailValidation("1024x768@60Hz") = false
emailValidation("not.a.valid.email") = false
emailValidation("not@valid.email") = false
emailValidation("john@aol...com") = true
emailValidation("Mike\\ O'Dell@ireland.com") = true

More email validation patterns follow.

Email address (anchored)
Use this anchored version to check if a valid email address was entered.
Does not match email addresses using an IP address instead of a domain name.
Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
Requires the "case insensitive" option to be ON.

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

Email address (anchored; no consecutive dots)
Use this anchored version to check if a valid email address was entered.
Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
Does not match email addresses using an IP address instead of a domain name.
Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum. Including these increases the risk of false positives when applying the regex to random documents.
Requires the "case insensitive" option to be ON.

^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$
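
As a rough illustration of how the anchored pattern above might be used for input validation, here is a minimal sketch. The class name AnchoredEmailCheck and the helper isValidEmail are made up for this example; only the regex comes from the post.

```java
import java.util.regex.Pattern;

public class AnchoredEmailCheck
{
    // Anchored pattern from the post; the domain part excludes consecutive dots.
    private static final Pattern EMAIL = Pattern.compile(
            "^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\\.)+[A-Z]{2,4}$",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true only if the whole string is an address.
    static boolean isValidEmail(String candidate)
    {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args)
    {
        System.out.println(isValidEmail("ishara@gmail.com"));          // true
        System.out.println(isValidEmail("john@aol...com"));            // false
        System.out.println(isValidEmail("Mike\\ O'Dell@ireland.com")); // false
    }
}
```

The last two addresses are reported as true by the unanchored \b version shown earlier, because find() is satisfied by a matching substring.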

Email address (no consecutive dots)
Use this version to seek out email addresses in random documents and texts.
Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
Does not match email addresses using an IP address instead of a domain name.
Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum. Including these increases the risk of false positives when applying the regex to random documents.
Requires the "case insensitive" option to be ON.

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b

Email address (specific TLDs)
Does not match email addresses using an IP address instead of a domain name.
Matches all country code top level domains, and specific common top level domains.
Requires the "case insensitive" option to be ON.

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:com|org|net|gov|mil|biz|info|name|aero|mobi|jobs|museum|[A-Z]{2})$
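
To see what the TLD list changes in practice, the following sketch checks a few of the sample addresses from the first program against this pattern. The class and method names are invented for the example, and the alternation lists the same set of TLDs once each.

```java
import java.util.regex.Pattern;

public class SpecificTldCheck
{
    // Same set of TLDs as the pattern above, plus any two-letter country code.
    private static final Pattern EMAIL = Pattern.compile(
            "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\."
            + "(?:com|org|net|gov|mil|biz|info|name|aero|mobi|jobs|museum|[A-Z]{2})$",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper for this sketch.
    static boolean isValidEmail(String candidate)
    {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args)
    {
        System.out.println(isValidEmail("pharaoh@egyptian.museum")); // true
        System.out.println(isValidEmail("ishara@ac.lk"));            // true
        System.out.println(isValidEmail("not@valid.email"));         // false
    }
}
```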

Email address: RFC 2822
This regular expression implements the official RFC 2822 standard for email addresses. Using this regular expression in actual applications is NOT recommended. It is shown to illustrate that with regular expressions there's always a trade-off between what's exact and what's practical.

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Email address: RFC 2822 (simplified)
Matches a normal email address. Does not check the top-level domain.
Requires the "case insensitive" option to be ON.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

Email address: RFC 2822 (specific TLDs)
Matches all country code top level domains, and specific common top level domains.
Requires the "case insensitive" option to be ON.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|name|aero|mobi|jobs|museum)\b

Java Regular Expression Samples

Check if the regex matches a string entirely
If/else branch on whether the regex matches a string entirely
Create an object to use the same regex for many operations
Create an object to apply a regex repeatedly to a given string
Use regex object to test if (part of) a string can be matched
Use regex object to test if a string can be matched entirely
Use regex object to get the part of a string matched by the regex
Use regex object to get the part of a string matched by a numbered group
Use regex object to get a list of all text matched by a numbered group
Iterate over all matches in a string
Iterate over all matches and capturing groups in a string

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/**
 * Created by IntelliJ IDEA.
 * User: Ishara Samantha
 * Date: Apr 5, 2008
 * Time: 8:46:45 PM
 * To change this template use File | Settings | File Templates.
 */
public class JavaRegX
{
    private static String subjectString;
    private static String subjectString1;
    private static String anotherSubjectString;

    public static void main(String[] args)
    {
        subjectString = "ishara@hoofoo.net";
        anotherSubjectString = "test@hoofoo.net";
        subjectString1 = subjectString + "," + anotherSubjectString;

        // Check if the regex matches a string entirely
        try
        {
            boolean foundMatch = subjectString.matches("(?i)\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b");
            System.out.println("foundMatch = " + foundMatch);
        } catch (PatternSyntaxException ex)
        {
            ex.printStackTrace();
        }

        // If/else branch on whether the regex matches a string entirely
        try
        {
            if (subjectString.matches("(?i)\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b"))
            {
                System.out.println("Match");
            } else
            {
                System.out.println("Match Failed");
            }
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        // Compile the regex once and create a Matcher for one subject string
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString);

        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        try
        {
            // Create an object to use the same regex for many operations
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            // Create an object to apply a regex repeatedly to a given string
            Matcher regexMatcher = regex.matcher(subjectString);
            // Apply the same regex to more than one string
            regexMatcher.reset(anotherSubjectString);

        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        // Use regex object to test if (part of) a string can be matched
        boolean foundMatch = false;
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString1);
            foundMatch = regexMatcher.find();
            System.out.println("regexMatcher = " + regexMatcher);
            System.out.println("foundMatch = " + foundMatch);
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        // Use regex object to test if a string can be matched entirely
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString);
            foundMatch = regexMatcher.matches();
            System.out.println("Use regex object to test if a string can be matched entirely");
            System.out.println("regexMatcher = " + regexMatcher);
            System.out.println("foundMatch = " + foundMatch);
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        // Use regex object to get the part of a string matched by the regex
        // Use regex object to get the part of a string matched by a numbered group
        // (group 0 is the entire match)
        String resultString = null;
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString1);
            if (regexMatcher.find())
            {
                resultString = regexMatcher.group(0);
                System.out.println("resultString = " + resultString);
            }
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        // Use regex object to get a list of all text matched by a numbered group
        List<String> matchList = new ArrayList<String>();
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString1);
            while (regexMatcher.find())
            {
                matchList.add(regexMatcher.group(0));
            }
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }
        System.out.println("matchList.size() = " + matchList.size());

        // Iterate over all matches in a string
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString);
            while (regexMatcher.find())
            {
                // matched text: regexMatcher.group()
                // match start: regexMatcher.start()
                // match end: regexMatcher.end()
            }
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

        // Iterate over all matches and capturing groups in a string
        // (this regex has no capturing groups, so the inner loop body never runs)
        try
        {
            Pattern regex = Pattern.compile("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher regexMatcher = regex.matcher(subjectString);
            while (regexMatcher.find())
            {
                for (int i = 1; i <= regexMatcher.groupCount(); i++)
                {
                    // matched text: regexMatcher.group(i)
                    // match start: regexMatcher.start(i)
                    // match end: regexMatcher.end(i)
                }
            }
        } catch (PatternSyntaxException ex)
        {
            // Syntax error in the regular expression
        }

    }
}

Tuesday, April 1, 2008

Using Regular Expressions in Java

Java 4 (JDK 1.4) and later have comprehensive support for regular expressions through the standard java.util.regex package. Because Java lacked a regex package for so long, there are also many 3rd party regex packages available for Java. I will only discuss Sun's regex library that is now part of the JDK. Its quality is excellent, better than most of the 3rd party packages. Unless you need to support older versions of the JDK, the java.util.regex package is the way to go.

Java 5 and 6 use the same regular expression flavor (with a few minor fixes), and provide the same regular expression classes. They add a few advanced functions not discussed on this page.

Quick Regex Methods of The String Class

The Java String class has several methods that allow you to perform an operation using a regular expression on that string in a minimal amount of code. The downside is that you cannot specify options such as "case insensitive" or "dot matches newline". For performance reasons, you should also not use these methods if you will be using the same regular expression often.

myString.matches("regex") returns true or false depending whether the string can be matched entirely by the regular expression. It is important to remember that String.matches() only returns true if the entire string can be matched. In other words: "regex" is applied as if you had written "^regex$" with start and end of string anchors. This is different from most other regex libraries, where the "quick match test" method returns true if the regex can be matched anywhere in the string. If myString is abc then myString.matches("bc") returns false. bc matches abc, but ^bc$ (which is really being used here) does not.
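
A small sketch of the abc example may make the difference concrete. The class name and the two helper methods are invented for this illustration; the second helper uses Matcher.find() to show the "match anywhere" behaviour that other libraries offer as their quick test.

```java
import java.util.regex.Pattern;

public class StringMatchesDemo
{
    // String.matches() succeeds only if the regex covers the entire string.
    static boolean matchesEntirely(String subject, String regex)
    {
        return subject.matches(regex);
    }

    // Matcher.find() succeeds if the regex matches any part of the string.
    static boolean matchesAnywhere(String subject, String regex)
    {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args)
    {
        System.out.println(matchesEntirely("abc", "bc"));   // false
        System.out.println(matchesEntirely("abc", ".*bc")); // true
        System.out.println(matchesAnywhere("abc", "bc"));   // true
    }
}
```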

myString.replaceAll("regex", "replacement") replaces all regex matches inside the string with the replacement string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal "2" if there are fewer than 12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately followed by the literal "2" in the replacement text.

In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there are fewer than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember that the backslash itself must be escaped too: "\\$".
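
The sketch below shows a backreference in the replacement text and two ways of inserting a literal dollar sign. The class and method names are made up for the example. Matcher.quoteReplacement() is a standard library method (Java 5 and later) that escapes dollar signs and backslashes, which is one way to handle a user-specified replacement string safely.

```java
import java.util.regex.Matcher;

public class ReplaceAllDemo
{
    // Swap two words using the backreferences $1 and $2.
    static String swapWords(String subject)
    {
        return subject.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    // Insert arbitrary text literally, even if it contains $ or \.
    static String fillPlaceholder(String subject, String literalText)
    {
        return subject.replaceAll("PRICE", Matcher.quoteReplacement(literalText));
    }

    public static void main(String[] args)
    {
        System.out.println(swapWords("John Smith"));                       // Smith John
        System.out.println("cost: PRICE".replaceAll("PRICE", "\\$5"));     // cost: $5
        System.out.println(fillPlaceholder("cost: PRICE", "$5"));          // cost: $5
    }
}
```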

myString.split("regex") splits the string at each regex match. The method returns an array of strings where each element is a part of the original string between two regex matches. The matches themselves are not included in the array. Use myString.split("regex", n) to get an array containing at most n items. The result is that the string is split at most n-1 times. The last item in the array is the unsplit remainder of the original string.
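
For instance, splitting a comma-separated string with and without a limit could look like this (the class and method names are only for illustration):

```java
import java.util.Arrays;

public class SplitDemo
{
    // Split at every comma.
    static String[] splitAll(String subject)
    {
        return subject.split(",");
    }

    // Split into at most n items; the last item keeps the unsplit remainder.
    static String[] splitLimited(String subject, int n)
    {
        return subject.split(",", n);
    }

    public static void main(String[] args)
    {
        System.out.println(Arrays.toString(splitAll("a,b,c,d")));        // [a, b, c, d]
        System.out.println(Arrays.toString(splitLimited("a,b,c,d", 2))); // [a, b,c,d]
    }
}
```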

Using The Pattern Class

In Java, you compile a regular expression by using the Pattern.compile() class factory. This factory returns an object of type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an optional second parameter. Pattern.compile("regex", Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE) makes the regex case insensitive for US ASCII characters, causes the dot to match line breaks and causes the start and end of string anchors to match at embedded line breaks as well. When working with Unicode strings, specify Pattern.UNICODE_CASE if you want to make the regex case insensitive for all characters in all languages. You should always specify Pattern.CANON_EQ to ignore differences in Unicode encodings, unless you are sure your strings contain only US ASCII characters and you want to increase performance.

If you will be using the same regular expression often in your source code, you should create a Pattern object to increase performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the Pattern.compile() class factory. If you use one of the String methods above, the only way to specify options is to embed a mode modifier into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately, Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
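
The two ways of turning on an option can be compared side by side. This is only a sketch; the class and helper names are invented for it.

```java
import java.util.regex.Pattern;

public class RegexOptionsDemo
{
    // Option passed as a flag to Pattern.compile().
    static boolean matchesWithFlag(String subject)
    {
        return Pattern.compile("hello", Pattern.CASE_INSENSITIVE).matcher(subject).matches();
    }

    // Same option embedded in the regex, usable with the String methods.
    static boolean matchesWithEmbeddedModifier(String subject)
    {
        return subject.matches("(?i)hello");
    }

    public static void main(String[] args)
    {
        System.out.println(matchesWithFlag("HELLO"));             // true
        System.out.println(matchesWithEmbeddedModifier("HELLO")); // true
        System.out.println("HELLO".matches("hello"));             // false
        // (?s) lets the dot match a line break, like Pattern.DOTALL.
        System.out.println("a\nb".matches("a.b"));                // false
        System.out.println("a\nb".matches("(?s)a.b"));            // true
    }
}
```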

Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly the same results as myString.split("regex"). The difference is that the former is faster since the regex was already compiled.
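
A minimal sketch of reusing one compiled Pattern to split several strings follows. The class name, the field COMMA and the helper are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternSplitDemo
{
    // Compiled once, reused for every call: a comma with optional surrounding whitespace.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] splitOnCommas(String subject)
    {
        return COMMA.split(subject);
    }

    public static void main(String[] args)
    {
        System.out.println(Arrays.toString(splitOnCommas("a , b,c"))); // [a, b, c]
        System.out.println(Arrays.toString(splitOnCommas("x,y")));     // [x, y]
    }
}
```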

Using The Matcher Class

Except for splitting a string (see previous paragraph), you need to create a Matcher object from the Pattern object. The Matcher will do the actual work. The advantage of having two separate classes is that you can create many Matcher objects from a single Pattern object, and thus apply the regular expression to many subject strings simultaneously.

To create a Matcher object, simply call the matcher() method on your Pattern object like this: myMatcher = myPattern.matcher("subject"). If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for duty.
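
As an illustration of reusing one Matcher through reset(), the sketch below pulls the first number out of several strings. The class name, the helper firstNumbers and the sample strings are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherResetDemo
{
    // Returns the first run of digits in each subject, or null if there is none.
    static List<String> firstNumbers(String... subjects)
    {
        Pattern pattern = Pattern.compile("\\d+");
        Matcher matcher = pattern.matcher(""); // created once, re-targeted below
        List<String> results = new ArrayList<String>();
        for (String subject : subjects)
        {
            matcher.reset(subject);
            results.add(matcher.find() ? matcher.group() : null);
        }
        return results;
    }

    public static void main(String[] args)
    {
        System.out.println(firstNumbers("order 17", "invoice 2008", "none")); // [17, 2008, null]
    }
}
```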

To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call myMatcher.find() again. When myMatcher.find() returns false, indicating there are no further matches, the next call to myMatcher.find() will find the first match again. The Matcher is automatically reset to the start of the string when find() fails.

The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int parameter indicating the number of the backreference. Omit the parameter to get information about the entire regex match. start() is the index of the first character in the match. end() is the index of the first character after the match. Both are relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched by the regular expression or pair of capturing parentheses.
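
The following sketch reads the match details back from a Matcher. The simplified pattern with two capturing groups, the class name and the describe helper are assumptions for this example and are not one of the email regexes above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchDetailsDemo
{
    // Illustrative pattern: group 1 is the user name, group 2 the host name.
    private static final Pattern ADDRESS = Pattern.compile("([a-z]+)@([a-z]+)\\.net");

    static String describe(String subject)
    {
        Matcher m = ADDRESS.matcher(subject);
        if (!m.find())
        {
            return "no match";
        }
        // No argument: the entire match. An int argument: that capturing group.
        return m.group() + " [" + m.start() + "," + m.end() + ")"
                + " user=" + m.group(1) + " host=" + m.group(2);
    }

    public static void main(String[] args)
    {
        // prints: ishara@hoofoo.net [9,26) user=ishara host=hoofoo
        System.out.println(describe("contact: ishara@hoofoo.net"));
    }
}
```

Here end() - start() is 17, the length of the matched address.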

myMatcher.replaceAll("replacement") has exactly the same results as myString.replaceAll("regex", "replacement"). Again, the difference is speed.

The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:

StringBuffer myStringBuffer = new StringBuffer();
Matcher myMatcher = myPattern.matcher("subject");
while (myMatcher.find()) {
    if (checkIfThisMatchShouldBeReplaced()) {
        myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
    }
}
myMatcher.appendTail(myStringBuffer);

Obviously, checkIfThisMatchShouldBeReplaced() and computeReplacementString() are placeholders for methods that you supply. The first returns true or false indicating if a replacement should be made at all. Note that skipping replacements is way faster than replacing a match with exactly the same text as was matched. computeReplacementString() returns the actual replacement string.
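
To make the skeleton above concrete, here is one possible runnable version. The task (doubling every number of 10 or more and leaving smaller numbers alone) and all names are invented for the sketch. Matcher.quoteReplacement() is not needed for digits, but it keeps the computed text from being interpreted as $ or \ syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendReplacementDemo
{
    static String doubleLargeNumbers(String subject)
    {
        Pattern myPattern = Pattern.compile("\\d+");
        Matcher myMatcher = myPattern.matcher(subject);
        StringBuffer myStringBuffer = new StringBuffer();
        while (myMatcher.find())
        {
            int value = Integer.parseInt(myMatcher.group());
            // Plays the role of checkIfThisMatchShouldBeReplaced()
            if (value >= 10)
            {
                // Plays the role of computeReplacementString()
                String replacement = String.valueOf(value * 2);
                myMatcher.appendReplacement(myStringBuffer, Matcher.quoteReplacement(replacement));
            }
            // Skipped matches are copied unchanged by a later append call.
        }
        myMatcher.appendTail(myStringBuffer);
        return myStringBuffer.toString();
    }

    public static void main(String[] args)
    {
        // prints: 3 apples, 24 pears, 80 plums
        System.out.println(doubleLargeNumbers("3 apples, 12 pears, 40 plums"));
    }
}
```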

Regular Expressions, Literal Strings and Backslashes

In literal Java strings the backslash is an escape character. The literal string "\\" is a single backslash. In regular expressions, the backslash is also an escape character. The regular expression \\ matches a single backslash. Written as a Java string, this regular expression becomes "\\\\". That's right: 4 backslashes to match a single one.

The regex \w matches a word character. As a Java string, this is written as "\\w".

The same backslash-mess occurs when providing replacement strings for methods like String.replaceAll() as literal Java strings in your Java code. In the replacement text, a dollar sign must be encoded as \$ and a backslash as \\ when you want to replace the regex match with an actual dollar sign or backslash. However, backslashes must also be escaped in literal Java strings. So a single dollar sign in the replacement text becomes "\\$" when written as a literal Java string. The single backslash becomes "\\\\". Right again: 4 backslashes to insert a single one.
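
The escaping rules can be checked with a few one-liners. The class and helper names are invented for this sketch, and the expected results are noted in the comments.

```java
public class BackslashDemo
{
    // Replace every forward slash with a single backslash.
    static String slashesToBackslashes(String subject)
    {
        // "\\\\" is the replacement text \\ which inserts one backslash.
        return subject.replaceAll("/", "\\\\");
    }

    public static void main(String[] args)
    {
        // The subject "a\\b" is the three characters a, backslash, b.
        System.out.println("a\\b".matches("a\\\\b"));            // true
        // The regex \w+ written as a Java string.
        System.out.println("word".matches("\\w+"));              // true
        // The replacement text \$ inserts a literal dollar sign.
        System.out.println("price".replaceAll("price", "\\$"));  // $
        System.out.println(slashesToBackslashes("C:/tmp"));      // C:\tmp
    }
}
```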
