About Unicode in Java
Java uses the 16-bit UTF-16 representation of Unicode internally.
Unicode distinguishes between two things: the association of characters as abstract concepts (e.g., "Greek capital letter omega Ω") with a subset of the natural numbers, called code points, and the representation of code points by values stored in units of a computer's memory. The Unicode standard defines seven such character encoding schemes.
In Java, the 65,536 numeric values of a char are UTF-16 code units, the values used in the UTF-16 encoding of Unicode text. Any representation of Unicode must be able to represent the full range of code points, whose upper bound is 0x10FFFF. Code points beyond 0xFFFF are therefore represented by pairs of UTF-16 code units, and the values used in these so-called surrogate pairs are excluded from being used as code points themselves.
The full range of Unicode code points can only be stored in a variable of type int. A char variable cannot represent every Unicode character.
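As a minimal sketch of the code point versus code unit distinction, the following uses only methods of java.lang.Character. The class name CodePointDemo and the helper codeUnitsFor are made up for illustration.

```java
public class CodePointDemo {
    // Hypothetical helper: number of UTF-16 code units (Java chars)
    // needed to store one code point.
    public static int codeUnitsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int omega = 0x03A9;   // GREEK CAPITAL LETTER OMEGA, fits in one char
        int gClef = 0x1D11E;  // MUSICAL SYMBOL G CLEF, beyond 0xFFFF

        System.out.println(codeUnitsFor(omega)); // 1
        System.out.println(codeUnitsFor(gClef)); // 2

        // A code point beyond 0xFFFF is stored as a high and a low surrogate.
        char[] pair = Character.toChars(gClef);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true

        // The pair decodes back to the original code point.
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == gClef); // true
    }
}
```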
It is possible that a String value contains surrogate pairs intermingled with individual code units.
In such cases one character can take up two indices in the string.
To verify if the string consists of only individual code units one can use:
s.length() == s.codePointCount(0, s.length())
This works because String.length() returns the number of code units (16-bit char values) in the string, while String.codePointCount() returns the number of code points, counting each supplementary character once.
If you have to process strings containing surrogate pairs, there's an implementation of a Unicode-aware charAt method in this article (using the offsetByCodePoints and codePointAt methods of String).
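The referenced article is not reproduced here, so the following is only a sketch of how such a lookup might be built from String.offsetByCodePoints and String.codePointAt. The class and method names are hypothetical.

```java
public class CodePointAccess {
    // True if no character in s needs a surrogate pair.
    public static boolean hasOnlySingleUnits(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    // Hypothetical helper: code point of the index-th character,
    // counting a surrogate pair as one character.
    public static int codePointAtIndex(String s, int index) {
        int offset = s.offsetByCodePoints(0, index); // char index of that character
        return s.codePointAt(offset);
    }

    public static void main(String[] args) {
        // "a", then U+1D11E as a surrogate pair, then "b"
        String s = "a\uD834\uDD1Eb";

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 characters
        System.out.println(hasOnlySingleUnits(s));           // false
        System.out.println((char) codePointAtIndex(s, 2));   // b
    }
}
```

Note that plain s.charAt(2) on the same string would return the low surrogate, not 'b'.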
For conversion to uppercase and lowercase, prefer the String.toUpperCase() and String.toLowerCase() methods: unlike the Character implementations, which map one char to one char, they also handle conversions that change the number of characters (for example, ß becomes SS).
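A small sketch of the difference, using the German sharp s (written as the escape \u00DF so the example does not depend on the source file encoding). The class name and the upper helper are illustrative, and Locale.ROOT is chosen here only to keep the result independent of the default locale.

```java
import java.util.Locale;

public class CaseConversionDemo {
    // Hypothetical helper: locale-independent uppercase conversion.
    public static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF'; // ß

        // Character.toUpperCase must return a single char, so ß stays ß.
        System.out.println(Character.toUpperCase(sharpS) == sharpS); // true

        // String.toUpperCase may change the length: "straße" becomes "STRASSE".
        System.out.println(upper("stra\u00DFe")); // STRASSE
    }
}
```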
Working with text in Java
If your editor and file system allow it, you can use Unicode characters directly in your source code.
Always use 'single quotes' for char literals and "double quotes" for String literals.
Escape sequences for char and String literals: \b (backspace), \t (tab), \n (line feed), \f (form feed), \r (carriage return), \" (double quote), \' (single quote), and \\ (backslash).
Primitive type: char
A char is a single 16-bit UTF-16 code unit (for characters up to '\uffff', a single Unicode character). It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
Default value: '\u0000'
Check for default value: ch == Character.MIN_VALUE or ch == 0
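A short sketch of the default value check. The class, the field unset, and the helper isDefault are hypothetical names. Keep in mind that only fields and array elements get the default value; local variables must be assigned before use.

```java
public class CharDefaultDemo {
    // Fields of type char start out as '\u0000'.
    public static char unset;

    // Hypothetical helper wrapping the check from the text.
    public static boolean isDefault(char ch) {
        return ch == Character.MIN_VALUE; // same as ch == 0
    }

    public static void main(String[] args) {
        System.out.println(isDefault(unset));          // true
        System.out.println(isDefault('A'));            // false
        System.out.println((int) Character.MAX_VALUE); // 65535
    }
}
```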
Non-primitive types: Character and String
Use Java's built-in Character class to classify characters:
char ch = '\u0041';
assertEquals('A', ch);
assertFalse(Character.isDigit(ch));
assertTrue(Character.isLetter(ch));
assertTrue(Character.isLetterOrDigit(ch));
assertFalse(Character.isLowerCase(ch));
assertTrue(Character.isUpperCase(ch));
assertFalse(Character.isSpaceChar(ch));
assertTrue(Character.isDefined(ch));
How to decide whether a letter is in the English alphabet?
char ch = 'A';
assertTrue(((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')));
ch = 'á';
assertFalse(((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')));
Or using regex:
Pattern p = Pattern.compile("[A-Za-z]");
assertTrue(p.matcher("a").find());
assertFalse(p.matcher("Á").find());
Sorting text
The default String comparison (natural ordering) is based on the Unicode values of the characters, which puts all uppercase letters before lowercase ones and accented letters after both.
To sort in (localized) natural language order one must use a Collator. An example usage is shown in this article, here's a lengthier demonstration, and here's some information on how to customize sorting rules.
List<String> list = Arrays.asList("alma", "Anna", "Ági", "ágy");
Collections.sort(list);
assertEquals(Arrays.asList("Anna", "alma", "Ági", "ágy"), list);
Collections.sort(list, String.CASE_INSENSITIVE_ORDER);
assertEquals(Arrays.asList("alma", "Anna", "Ági", "ágy"), list);
Collator huCollator = Collator.getInstance(Locale.forLanguageTag("hu-HU"));
Collections.sort(list, huCollator);
assertEquals(Arrays.asList("Ági", "ágy", "alma", "Anna"), list);
Splitting and joining text
Conversion between String and char
String str = "My fancy text";
char[] chars = str.toCharArray();
String joined = new String(chars);
assertEquals(str, joined);
Splitting and joining Strings
String str = "My fancy text";
String[] parts = str.split(" ");
String joined = String.join(" ", parts);
assertEquals(str, joined);
Parsing text
A Scanner breaks its input into tokens using a delimiter pattern.
Delimiters: whitespace by default; set a custom delimiter pattern with Scanner.useDelimiter().
Localization for reading numbers: via the Scanner.useLocale(locale) method.
Reset to defaults with the Scanner.reset() method.
Navigate with Scanner.next(), which returns the next token as a String: the text between the current and the next delimiter.
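The Scanner methods above might be combined as in this sketch. The class and helper names are hypothetical, and Locale.GERMANY is used only as an example of a locale with a decimal comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

public class ScannerDemo {
    // Hypothetical helper: collect all tokens separated by the given regex.
    public static List<String> tokens(String input, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    // Hypothetical helper: read one number using the locale's number format.
    public static double readLocalizedDouble(String input, Locale locale) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.useLocale(locale);
            return scanner.nextDouble();
        }
    }

    public static void main(String[] args) {
        // Comma as delimiter, with optional surrounding whitespace.
        System.out.println(tokens("alma, korte,szilva", "\\s*,\\s*"));
        // "3,14" uses a decimal comma, as in German number formatting.
        System.out.println(readLocalizedDouble("3,14", Locale.GERMANY));
    }
}
```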
The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur.
Boundaries:
- Character
- Word
- Sentence
- Line
Navigate with BreakIterator.next() and BreakIterator.previous(): each returns the int index of the next (or previous) boundary, or BreakIterator.DONE when there are no more boundaries.
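As a sketch, word boundaries can be walked with first() and next() like this. The class name and the words helper are hypothetical. Segments between boundaries also include whitespace and punctuation, so the example keeps only segments that start with a letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {
    // Hypothetical helper: extract the words of a text using word boundaries.
    public static List<String> words(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundary.setText(text);

        List<String> result = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            // Skip segments that are only whitespace or punctuation.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("My fancy text.")); // [My, fancy, text]
    }
}
```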
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
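String.split() splits a string around matches of the given regular expression and returns an array.
Because it matches with a regex it works like Scanner, but it parses the whole text at once.
Unlike Scanner and StringTokenizer, whose delimiters default to whitespace, String.split() has no default delimiter: the regular expression must always be given.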
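A brief sketch of String.split() with regular expressions. The class and helper names are made up for illustration.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: split on runs of any whitespace.
    public static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    // Hypothetical helper: split on commas with optional surrounding whitespace.
    public static String[] splitOnCommas(String line) {
        return line.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnWhitespace("My  fancy\ttext")));
        System.out.println(Arrays.toString(splitOnCommas("alma , korte,szilva")));

        // With the one-argument split, trailing empty strings are dropped.
        System.out.println("a,b,,".split(",").length); // 2
    }
}
```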
Pattern matching
java.util.regex contains classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.
The different matching methods of Matcher
Pattern pattern = Pattern.compile("foo");
// find: all occurrences one by one
assertTrue(pattern.matcher("afooo").find());
assertFalse(pattern.matcher("aooo").find());
// find starting at given index
assertTrue(pattern.matcher("afooo").find(0));
assertTrue(pattern.matcher("afooo").find(1));
assertFalse(pattern.matcher("afooo").find(2));
// lookingAt: like String.startsWith() but with regex
assertTrue(pattern.matcher("fooo").lookingAt());
assertFalse(pattern.matcher("afooo").lookingAt());
// matches: like String.equals() but with regex
assertTrue(pattern.matcher("foo").matches());
assertFalse(pattern.matcher("fooo").matches());
Retrieving matched subsequences
The explicit state of a matcher includes the start and end indices of the most recent successful match. It also includes the start and end indices of the input subsequence captured by each capturing group in the pattern as well as a total count of such subsequences. This can be used to retrieve what is matched:
Pattern pattern = Pattern.compile("f.o");
Matcher matcher = pattern.matcher("afaoofeoofo");
assertTrue(matcher.find()); // finds the first match
assertEquals("fao", matcher.group());
assertTrue(matcher.find()); // finds the second match
assertEquals("feo", matcher.group());
assertFalse(matcher.find()); // no more to find
matcher.reset(); // resets the matcher
assertTrue(matcher.find()); // finds the first match again
assertEquals("fao", matcher.group());
Iterating over the matches
while (matcher.find()) {
    String group = matcher.group();
}
Using capturing groups
Pattern pattern = Pattern.compile("(f(.)o)");
Matcher matcher = pattern.matcher("afaoofeoofuo");
assertEquals(2, matcher.groupCount()); // groups specified in pattern
assertTrue(matcher.find()); // finds the first match
assertEquals("fao", matcher.group(1)); // group 1: the outer group
assertEquals("a", matcher.group(2)); // group 2: the nested group
Making replacements
Replace the first substring of a string that matches the given regular expression with the given replacement:
str.replaceFirst(regex, repl)
yields exactly the same result as
Pattern.compile(regex).matcher(str).replaceFirst(repl)
Replace each substring of this string that matches the given regular expression with the given replacement:
str.replaceAll(regex, repl)
yields exactly the same result as
Pattern.compile(regex).matcher(str).replaceAll(repl)
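A sketch of both replacement styles, reusing the "f.o" pattern and sample text from this section. The class, constant, and method names are hypothetical. The last helper shows that the replacement string may reference capturing groups with $1, $2, and so on.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Compiling once and reusing the Pattern avoids recompiling on every call.
    private static final Pattern F_ANY_O = Pattern.compile("f.o");

    public static String maskFirst(String s) {
        return s.replaceFirst("f.o", "-");
    }

    public static String maskAll(String s) {
        return F_ANY_O.matcher(s).replaceAll("-");
    }

    // Reorders an ISO-style date (yyyy-MM-dd) into dd.MM.yyyy using group references.
    public static String dateToDots(String isoDate) {
        return isoDate.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3.$2.$1");
    }

    public static void main(String[] args) {
        System.out.println(maskFirst("afaoofeoofuo")); // a-ofeoofuo
        System.out.println(maskAll("afaoofeoofuo"));   // a-o-o-
        System.out.println(dateToDots("2024-01-15"));  // 15.01.2024
    }
}
```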
Making complex replacements
To have more control over the replacement, use Matcher.appendReplacement() with Matcher.appendTail().
The most basic case: replace with fixed string
Pattern p = Pattern.compile("f.o");
Matcher m = p.matcher("afaoofeoofuo");
StringBuffer sb = new StringBuffer(); // the buffer to write the result to
while (m.find()) {
    m.appendReplacement(sb, "-"); // replace the whole match with the given string
}
m.appendTail(sb); // write the rest of the string after the last match to the buffer.
assertEquals("a-o-o-", sb.toString());
A more complex case: replace with multiple capturing groups
Pattern p = Pattern.compile("(f)(.)(o)");
Matcher m = p.matcher("afaoofeoofuo");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, "$1-$3"); // keep groups 1 and 3, replace group 2 with "-"
}
m.appendTail(sb);
assertEquals("af-oof-oof-o", sb.toString());
A more complex case: replace with value from map
Map<String, String> map = new HashMap<>();
map.put("a", "1"); map.put("e", "2"); map.put("u", "3");
Pattern p = Pattern.compile("(f)(.)(o)");
Matcher m = p.matcher("afaoofeoofuo");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, "$1" + map.get(m.group(2)) + "$3"); // replace group 2 with its mapped value
}
m.appendTail(sb);
assertEquals("af1oof2oof3o", sb.toString());
Note: If you want the replacement to contain $ or \ literals, wrap it in Matcher.quoteReplacement().
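A minimal sketch of why quoting matters. The class name, the fillPrice helper, and the PRICE placeholder are invented for this example.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // Hypothetical helper: insert an amount that may contain '$' or '\'
    // in place of the literal text PRICE.
    public static String fillPrice(String template, String amount) {
        // Unquoted, "$5" would be read as a reference to capturing group 5.
        return template.replaceAll("PRICE", Matcher.quoteReplacement(amount));
    }

    public static void main(String[] args) {
        System.out.println(fillPrice("Total: PRICE", "$5")); // Total: $5
    }
}
```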
Escape special characters with double backslash
Regex patterns are specified within String literals. Java has some reserved escapes like \n for line break, so the regex escapes like \s need to be escaped with an extra \ resulting in \\s for matching a whitespace character.
The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
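The difference between "\b" and "\\b" can be sketched as follows. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    // "\\b" and "\\w" reach the regex engine as \b (word boundary) and \w (word character).
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    // "\b" is turned into a backspace character by the Java compiler
    // before the regex engine ever sees it.
    private static final Pattern BACKSPACE = Pattern.compile("\b");

    public static int countWords(String text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static boolean containsBackspace(String text) {
        return BACKSPACE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(countWords("My fancy text")); // 3
        System.out.println(containsBackspace("a\bb"));   // true
        System.out.println(containsBackspace("a b"));    // false
    }
}
```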
Sources: