Tuesday, August 26, 2008

Trim Tokens

When working with delimited strings, like the lines of a comma-separated value (CSV) file, one problem comes up often: what do you do when someone edits the file by hand and inserts extra space characters to make it more human-readable? Add a space to the delimiter? That's no good, because then every one of your files has to match the "new and improved" format.

For example, take the following line from a CSV file.

username, password,home folder,default editor, favorite color,web address

Notice that the delimiter is inconsistent in this example. Sometimes I have tokens separated with a single comma (",") and sometimes with a comma followed by a space (", "). If the delimiter were consistent it would be a simple matter to get all tokens with the following code:

String[] tokens = line.split(",");

First of all, let's note how much simpler this is than the old method of using a StringTokenizer to loop through the text scanning for more tokens. String's split method, added in the Java 1.4 release, is a great improvement. The second thing you should note is that the split method accepts one argument, and that argument is a regular expression. This should be a clue to solving our CSV formatting problem. How do we specify that the delimiter in our file is a comma that might sometimes be followed by a space? By harnessing the power of regular expressions. (OK, that's overstating the case by quite a bit. We're really only harnessing a tiny fraction of the power of regular expressions.)

String[] tokens = line.split(",\\s*");

That says split the line on any comma followed by zero or more whitespace characters; \s matches tabs and other whitespace as well as plain spaces (check out Mastering Regular Expressions for an in-depth guide to regular expression syntax). This single line of code should have the desired effect of splitting the sample line into the following array.

username
password
home folder
default editor
favorite color
web address
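
To see the difference for yourself, here is a minimal sketch that runs both patterns over the sample line. The class name TrimTokensDemo and the helper splitTokens are made up for this illustration; the only real API involved is String's split method and Arrays.toString for printing.

```java
import java.util.Arrays;

class TrimTokensDemo {
    // Hypothetical helper: split on a comma followed by zero or more
    // whitespace characters, as described in the post.
    static String[] splitTokens(String line) {
        return line.split(",\\s*");
    }

    public static void main(String[] args) {
        String line = "username, password,home folder,default editor, favorite color,web address";

        // Plain comma: tokens that followed ", " keep their leading space.
        System.out.println(Arrays.toString(line.split(",")));

        // Comma plus optional whitespace: the leading spaces are consumed
        // as part of the delimiter.
        System.out.println(Arrays.toString(splitTokens(line)));
    }
}
```

Note that this pattern only absorbs whitespace after a comma. If a hand-edited file might also have spaces before a comma, a pattern such as "\\s*,\\s*" would cover that case, and calling trim() on the line first would take care of spaces at the very start or end.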

Try it out by creating a simple class that reads a single line of comma-delimited text and prints out the tokens. If you really want to see what an improvement String's split method brings, try writing the same function using a StringTokenizer instead.
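
For comparison, here is one way the StringTokenizer version might look. This is only a sketch; the class name TokenizerDemo and the method tokenize are invented for the example, and it leans on trim() to clean up each token because StringTokenizer takes a set of delimiter characters rather than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: collect comma-separated tokens, trimming
    // surrounding whitespace from each one by hand.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "username, password,home folder,default editor, favorite color,web address";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

The two approaches are not exact equivalents. StringTokenizer silently skips empty tokens, so "a,,b" yields two tokens here, while split(",\\s*") returns three with an empty string in the middle. That matters if your CSV file can contain blank fields.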
