Alessandro Lacava’s Blog

December 3, 2008

Java split() of String | Multiple whitespace characters

Filed under: Computer, Java, RegEx — alessandrolacava @ 12:47 pm

The split method of the String class is very useful when you want to tokenize a string. Its power lies in the fact that its string parameter is interpreted as a regular expression. However, you must be careful when you split a string using whitespace as the delimiter. Consider the following snippet of code:

JAVA:
  String str = "Testing split using two  whitespace characters";
  String[] tokens = str.split("\\s");
  for (String token : tokens)
  {
      System.out.println("-" + token + "-");
  }

What output does this code produce? If you think it is the following, you're wrong:
-Testing-
-split-
-using-
-two-
-whitespace-
-characters-

The actual output is:

-Testing-
-split-
-using-
-two-
--
-whitespace-
-characters-

Where the hell did that empty string come from? It comes from the two consecutive whitespace characters between the words two and whitespace in the str string. The \\s regex matches exactly one whitespace character, so each of the two spaces counts as a separate delimiter, and the empty string between them becomes a token. If this is what you want, fine. However, most of the time you will want to discard that empty string from the resulting array. You can do so by using the \\s+ regex (one or more whitespace characters) in place of \\s. The previous code becomes:

JAVA:
  String str = "Testing split using two  whitespace characters";
  String[] tokens = str.split("\\s+");
  for (String token : tokens)
  {
      System.out.println("-" + token + "-");
  }
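
To compare the two patterns side by side, here is a small self-contained sketch. The class and method names (SplitDemo, splitOnSingleWhitespace, splitOnWhitespaceRuns) are made up for illustration. The trim() call is an addition that is not part of the snippets above: it assumes you also want to ignore leading whitespace, which would otherwise still produce an empty first token even with \\s+.

```java
import java.util.Arrays;

final class SplitDemo {

    // Splits on every single whitespace character:
    // two consecutive spaces produce an empty token between them.
    static String[] splitOnSingleWhitespace(String s) {
        return s.split("\\s");
    }

    // Splits on runs of one or more whitespace characters.
    // trim() is an assumption of this sketch: without it, leading
    // whitespace would still yield an empty first token.
    static String[] splitOnWhitespaceRuns(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String str = "Testing split using two  whitespace characters";

        // 7 tokens, the fifth one is empty
        System.out.println(Arrays.toString(splitOnSingleWhitespace(str)));

        // 6 tokens, no empty string
        System.out.println(Arrays.toString(splitOnWhitespaceRuns(str)));

        // Leading whitespace: plain split("\\s+") keeps an empty first token
        System.out.println(Arrays.toString("  leading spaces".split("\\s+")));

        // With trim() first, the empty first token is gone
        System.out.println(Arrays.toString(splitOnWhitespaceRuns("  leading spaces")));
    }
}
```

Trailing empty strings are not an issue here, because split with a single argument already drops them from the end of the array.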


October 10, 2006

How to select any character across multiple lines in Java

Filed under: Computer, Java, RegEx — alessandrolacava @ 10:52 am

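To match any character, line breaks included, pass the following pattern to the static compile method of the java.util.regex.Pattern class. The pattern is (.|\n|\r)*? which means: any character except a line terminator (the .) or (the |) \n or \r, with the whole group repeated zero or more times (the *). The trailing ? makes the quantifier reluctant, so it matches as few characters as possible.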

Example: The following method strips the multiline comments (those between /* and */) from a string passed in and returns the resulting string:

JAVA:
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  [...]

  // Strip multiline comments
  public String deleteMultilineComments(String subject)
  {
      // Reluctant match from "/*" to the nearest "*/", across line breaks
      Pattern pattern = Pattern.compile("(/\\*(.|\n|\r)*?\\*/)");
      Matcher matcher = pattern.matcher(subject);
      return matcher.replaceAll("");
  }

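Note: Windows uses \r\n as its line terminator, Unix-like systems (including Mac OS X) use \n, and classic Mac OS (before OS X) used \r. Because the pattern accepts both \n and \r, it covers all three.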
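
An alternative to listing \n and \r by hand is the Pattern.DOTALL flag, which makes the . match line terminators as well. The sketch below is a variation on the comment-stripping method, not a replacement the original post prescribes; the class name CommentStripper and the sample input are made up for illustration. Avoiding the alternation inside the repeated group can also help on very long inputs, where patterns like (.|\n|\r)* can run into a StackOverflowError.

```java
import java.util.regex.Pattern;

final class CommentStripper {

    // DOTALL lets '.' match line terminators too, so the
    // (.|\n|\r) alternation is no longer needed.
    // The reluctant *? stops at the first closing "*/".
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    // Returns the input with every /* ... */ comment removed.
    static String deleteMultilineComments(String subject) {
        return BLOCK_COMMENT.matcher(subject).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "int a; /* first\r\ncomment */ int b; /* second */ int c;";
        // Both comments are removed; the surrounding spaces stay in place
        System.out.println(deleteMultilineComments(source));
    }
}
```

Compiling the pattern once into a static field, rather than on every call, is a design choice of this sketch: Pattern instances are immutable and safe to reuse.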

