Sunday, March 18, 2007

String Manipulation

String manipulation forms the basis of many algorithms and utilities. Input validation, text analysis and file conversions are several direct applications of string manipulation. This tutorial explores some of the basics needed. Unless otherwise noted, the following classes are contained in the java.lang library.

NOTE: For the following method parameters the prefix g indicates string, i indicates integer and c indicates character types.

The String Class

String class objects work with complete strings instead of treating them as character arrays as some languages do.

Accessor methods: length(), charAt(i), getBytes(), getChars(istart,iend,gtarget[],itargstart), toCharArray(), valueOf(g,iradix), substring(iStart [,iEndIndex)]) [returns up to but not including iEndIndex]

Modifier methods: toLowerCase(), toUpperCase(), trim(), concat(g), replace(cWhich, cReplacement). The method format(gSyn,g) uses c-like printf syntax for fixed fields if required in reports.

Boolean test methods: contentEquals(g), endsWith(g), equals(g), equalsIgnoreCase(g), matches(g), regionMatches(i1,g2,i3,i4), regionMatches(bIgnoreCase,i1,g2,i3,i4), startsWith(g)

Integer test methods: compareTo(g) [returns 0 if object equals parameter, -1 if object is before parameter in sort order, +1 if otherwise], indexOf(g) [returns position of first occurrence of substring g in the string, -1 if not found], lastIndexOf(g) [returns position of last occurrence of substring g in the string, -1 if not found], length().

String class objects are immutable (ie. read only). When a change is made to a string, a new object is created and the old one is disused. This causes extraneous garbage collection if string modifier methods are used too often. The StringBuffer class should be used instead of String objects in these cases.

Warning: Since strings are stored differently than other data (as a memory address), you can't use the == operator for comparison. However the string methods equals() and equalsIgnoreCase() do the required comparison. A simple example is:

String aName = "Roger"; String bName = "Roger";
if (aName == bName) {System.out.println('== worked')};
if (aName.equals(bName)) {System.out.println('equals worked')};

Here is a program fragment to validate characters in a string, possibly from data entry or from a file. Regular Expression techniques can also be used for validation.

String testString="agjSDRoir";
String validChars = "atgc";
testString=testString.toLowerCase(); // make sure case is correct
for (int i=0;i

The String Buffer Class

StringBuffer class objects allow manipulation of strings without creating a new object each time a manipulation occurs. Examples of setting up a string buffer variable are:

StringBuffer aString = new StringBuffer("the start value"); // sets value
StringBuffer nulString = new StringBuffer(6); // explicitly sets size
StringBuffer defString = new StringBuffer(); // sets size to default of 16

Accessor methods: capacity(), length(), charAt(i)

Modifier methods: ensureCapacity(), setLength(), append(g), delete(i1, i2), deleteCharAt(i), insert(iPosn, g), setCharAt(iposn, c), replace(i1,i2,gvalue), reverse(), toString(g)

The String Buider Class

StringBuilder class methods are similar to StringBuffer ones but they are unsychronized (ie not for multithreaded apps). They are also much faster. Examples of setting up a string buffer variable are:

StringBuilder aString = new StringBuilder("the start value"); // sets value
StringBuilder nulString = new StringBuilder(6); // explicitly sets size
StringBuilder defString = new StringBuilder(); // sets size to default of 16

Accessor methods: capacity(), length(), charAt(i), indexOf(g), lastIndexOf(g)

Modifier methods: append(g), delete(i1, i2), insert(iPosn, g), getChars(i), setCharAt(iposn, c), substring(), replace(i1,i2,gvalue), reverse(), trimToSize(g ), toString(g)

The String Tokenizer Class [java.util library]

Many text manipulation utilities require a tokenizer function which breaks up lines of text into subunits called tokens based on specific delimiters or break characters. The most common delimiter is whitespace which yields words as the tokens. Java has a very useful StringTokenizer class to perform this type of task.

StringTokenizer class objects may be created by one of three constructor methods depending on the parameters used. If a string is used as a single parameter, it is the source text to be broken at the default set of delimiters (space, tab, newline, cr, formfeed aka whitespace). If a second string is also passed, it is assumed to be a set of delimiting characters. Use the escaper \ character to represent the string quote character " or non-typeable delimiters such as tab (\t). If a true flag is added as a third parameter, any delimiters found are also returned as string tokens.

Note: If the information to be tokenized is coming from a file, the StreamTokenizer object may be a more efficient choice.

The StringTokenizer methods are int countTokens(), boolean hasMoreTokens() and String nextToken().

import java.util.*;
public class Test
{
public static void main(String args[])
{
int idx = 0; int tokenCount;
String words[] = new String [500];
String message="The text of the message to be broken up for analysis";
StringTokenizer st = new StringTokenizer(message);
tokenCount = st.countTokens();
System.out.println("Number of tokens = " + tokenCount);
while (st.hasMoreTokens()) // make sure there is stuff to get
{ words[idx] = st.nextToken(); idx++; }
for (idx=0;idx

Regular Expressions [java.util.regex library]

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set (ie. by pattern matching). They can be used as a tool to search, edit or manipulate text or data. One common use is validation of data entry strings. All classes related to regular expressions are found in the java.util.regex package which must be imported.

Java regular expression patterns use a syntax similar to the one used by perl. The best references are found at sun.com and RegExp.info. Simple examples of the use of regular expressions are:

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

As a convenience for a one-time use situation the matches method simplifies the syntax (but does not precompile the pattern).

boolean b = Pattern.matches("a*b", "aaaaab");

As an aid to understanding the syntax of regular expressions and as a development tool, I strongly recommend downloading the test harness.

String Applications

Analyzing, searching and reformatting text is often required in task automation. Simple analysis could be used for word counts, concordance, spell checking, reading difficulty analysis, input validity, etc. Searching is done to find relevant terms. Reformatting of text for new file formats is a common task.

Many tasks involve parsing a line using either field positions or delimiters. A useful task would be to create reuseable methods for txtAt(start,end), wordAt(start,length), txtBefore(delim) and txtAfter(delim). These methods would return null if linereturn was reached before the delimiter or if position parameters caused a length violation. Each method would use some of the string class methods mentioned above.

Simple processing such as removing unwanted characters from strings or translating them is an easy first exercise in string manipulation. cleanchr1.java and cleanchr2.java show two approaches to simple string manipulation.

Note: Each of the following projects in this tutorial will be reused several times as other important topics are introduced. For now the workspace for each project will require hard coding of several lines of data into a string array.

Project: Word Counting

WordCount uses text parsing or tokenization to analyze text into counts of each distinct word unit. This has many practical applications in spell checking, document difficulty analysis, cryptology and file compression. A first approach is to create an array of words, add to the end and search from the front. The first refinement for larger lists is to use an insertion sort and binary search. A further refinement. once file IO has been covered, is to keep the parsed words in files so that array size is not an issue -- ie. large documents can be handled.

Start with the string tokenizer example. Add a second parameter to the tokenizer that specifies explicitly all delimiters needed (punctuation is missing in the default constructor). Lowercase (desense) all parsed words. Now rework the example to accept an array of text rather than a single line. Add a linear search for the word in the word array prior to adding it. If found, increment its count else add it to the list with a count of 1.

The output should display the word array with its associated word count. Make sure to use a fixed field format that allows sorting of the report by another utility. A suggestion is ### wordstring. You may want to review the DecimalFormat Class for specifics.

Note: WordCount1.java is in the source code package located below.

The second phase of the word count project will add a text file io class. It will then be extended with a GUI to produce a useful utility - a word count reporter.

Project: Cipher Text Preparation

The Prep class modifies plain text into the common cipher format of uppercased five letter groups for transmission. This class will be reused in a later case study that enciphers the prepared text using a variety of coding techniques.

Note: Prep1.java is in the source code package located below.

The second phase of the cipher text preparation project will add a text file io class. It will then be extended with two GUIs. The first GUI will be screen oriented for applets. The second will be batch file oriented.

Project: XHTML Analysis

This project uses pattern recognition to identify elements, tags and attributes within XHTML documents . This will be used in a later project to spot deprecated components and attributes that are better represented with CSS properties. For now you will simply count the number of each type of tag. Check for matched tags (and nesting if you can). Output each attribute and its value.

Because tokenization is awkward for XHTML, it is easier to build an ad-hoc pattern recognizer to spot both tag and attribute formats. This takes care of the

or


forms of tags. Regular expression classes can be used to make the pattern recognizer very compact. Take special care to watch for tags wrapped on multiple lines.

Note: TagScan1 is in the source code package located below.

The analysis project will be extended with file io and a GUI will be added to produce a two useful utility - a search word extractor for a site search engine and the obsolete component finder. A utility can also be written to validate internal links.

No comments: