/** Test Frink's parsing of graphemes.

    To quote the Unicode standard:

    "It is important to recognize that what the user thinks of as a
    'character'--a basic unit of a writing system for a language--may not be
    just a single Unicode code point. Instead, that basic unit may be made up
    of multiple Unicode code points. To avoid ambiguity with the computer use
    of the term character, this is called a user-perceived character. For
    example, 'G' + acute-accent is a user-perceived character: users think of
    it as a single character, yet is actually represented by two Unicode code
    points. These user-perceived characters are approximated by what is called
    a grapheme cluster, which can be determined programmatically."

    Samples are taken from:
    http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
*/

printAll[""]  // Test empty string.

// Grapheme clusters (both legacy and extended)
printAll["you \u0067\u0308o\u0308"]    // g and o, each with combining diaeresis
printAll["\uAC01\u1100\u1161\u11A8"]   // Hangul gag
printAll["\u0E01"]                     // Thai ko

/** Extended grapheme clusters

    An extended grapheme cluster is the same as a legacy grapheme cluster,
    with the addition of some other characters. The continuing characters are
    extended to include all spacing combining marks, such as the spacing (but
    dependent) vowel signs in Indic scripts. For example, this includes U+093F
    DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be used in
    implementations in preference to legacy grapheme clusters, because they
    provide better results for Indic scripts such as Tamil or Devanagari in
    which editing by orthographic syllable is typically preferred. For scripts
    such as Thai, Lao, and certain other Southeast Asian scripts, editing by
    visual unit is typically preferred, so for those scripts the behavior of
    extended grapheme clusters is similar to (but not identical to) the
    behavior of legacy grapheme clusters.
*/
printAll["\u0BA8"]         // Tamil na
printAll["\u0BA8\u0BBF"]   // Tamil ni (hmmm... don't combine?)
printAll["\u0E40"]         // Thai character sara e
printAll["\u0E01\u0E33"]   // Thai "ko kai" + "sara am" = "kam"
                           // hmmm... don't combine?
printAll["\u0937\u093F"]   // Devanagari SSA + Vowel sign I = ssi

/* Legacy grapheme clusters.

   A legacy grapheme cluster is defined as a base (such as A or カ) followed
   by zero or more continuing characters. One way to think of this is as a
   sequence of characters that form a “stack”.

   The base can be single characters, or be any sequence of Hangul Jamo
   characters that form a Hangul Syllable, as defined by D133 in The Unicode
   Standard, or be any sequence of Regional_Indicator (RI) characters. The RI
   characters are used in pairs to denote Emoji national flag symbols
   corresponding to ISO country codes. Sequences of more than two RI
   characters should be separated by other characters, such as U+200B ZERO
   WIDTH SPACE (ZWSP).

   The continuing characters include nonspacing marks, the Join_Controls
   (U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) used in Indic
   languages, and a few spacing combining marks to ensure canonical
   equivalence. Additional cases need to be added for completeness, so that
   any string of text can be divided up into a sequence of grapheme clusters.
   Some of these may be degenerate cases, such as a control code, or an
   isolated combining mark.
*/
printAll["\u0E33"]   // Thai character sara am
printAll["\u0937"]   // Devanagari letter ssa
printAll["\u093F"]   // Devanagari vowel sign i (combining but alone?)

// Tailored grapheme clusters
printAll["\u0063\u0068"]   // Slovak ch digraph
printAll["\u006B\u02B7"]   // k^w (sequence with letter modifier)
                           // hmmm.. not combining?
printAll["\u0915\u094D\u0937\u093F"]   // Devanagari letter ka + sign virama +
                                       // letter ssa + vowel sign i = kshi

// Something from StackOverflow:
// printAll["\u{1F468}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F48B}\u{200D}\u{1F468}"]

// Prints the graphemes of a string three ways: as given, after Unicode
// normalization, and reversed.
printAll[str] :=
{
   printGraphemes[str]
   printGraphemes[normalizeUnicode[str]]
   printGraphemes[reverse[str]]

   /*
   g = new graphics
   g.font["SansSerif", 10]
   g.text["$str\n" + reverse[str], 0, 0]
   g.show[]
   */
}

// Prints the string, the values of length[] and graphemeLength[] for it,
// and the list of its graphemes in input form.
printGraphemes[str] :=
{
   graphemes = array[graphemeList[str]]
   print["$str (" + length[str] + "," + graphemeLength[str] + "):\t"]
   println[inputForm[graphemes]]
}