Class UnicodeSetCloseOver

java.lang.Object
com.ibm.icu.dev.tool.translit.UnicodeSetCloseOver

class UnicodeSetCloseOver extends Object
This class produces the data tables used by the closeOver() method of UnicodeSet. Whenever the Unicode database changes, this tool must be re-run (AFTER the data files underlying ICU4J are updated). The output of this tool should then be pasted into the appropriate files:
ICU4J: com.ibm.icu.text.UnicodeSet.java
ICU4C: /icu/source/common/uniset.cpp
  • Field Details

  • Constructor Details

    • UnicodeSetCloseOver

      UnicodeSetCloseOver()
  • Method Details

    • main

      public static void main(String[] args) throws IOException
      Throws:
      IOException
    • createCaseFoldEquivalencyClasses

      static Map createCaseFoldEquivalencyClasses()
      Create a map of String => Set. The String in this case is a folded string for which UCharacter.foldCase(folded, DEFAULT_CASE_MAP).equals(folded). The Set contains all single-character strings x for which UCharacter.foldCase(x, DEFAULT_CASE_MAP).equals(folded), as well as folded itself.
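      The grouping described above can be sketched with the JDK alone. This is an illustration, not the tool's code: the class name CaseFoldClassesSketch and the helper approxFold are hypothetical, and approxFold only approximates UCharacter.foldCase by upper-casing and then lower-casing the string. Whether the real tool keeps single-member classes is not stated, so the sketch keeps them.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not part of ICU4J.
class CaseFoldClassesSketch {

    // ASSUMPTION: stand-in for UCharacter.foldCase(s, DEFAULT_CASE_MAP).
    // Upper-casing then lower-casing only approximates Unicode case folding.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Maps each folded string to the set holding the folded string itself
    // plus every single code point in [start, end] that folds to it.
    static Map<String, Set<String>> buildClasses(int start, int end) {
        Map<String, Set<String>> classes = new TreeMap<>();
        for (int cp = start; cp <= end; cp++) {
            boolean surrogate = cp >= Character.MIN_SURROGATE
                    && cp <= Character.MAX_SURROGATE;
            if (surrogate || !Character.isDefined(cp)) {
                continue; // skip lone surrogates and unassigned code points
            }
            String x = new String(Character.toChars(cp));
            String folded = approxFold(x);
            Set<String> members =
                    classes.computeIfAbsent(folded, k -> new TreeSet<>());
            members.add(folded);
            members.add(x);
        }
        return classes;
    }

    public static void main(String[] args) {
        // Print the class keyed by "a" for the range 'A'..'z'.
        System.out.println(buildClasses('A', 'z').get("a"));
    }
}
```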
    • analyzeCaseData

      static void analyzeCaseData(Map equivClasses, StringBuffer pairs, Vector nonpairs, Vector lengths)
      Analyze the case fold equivalency classes. Break them into two groups: 'pairs', and 'nonpairs'. Create a tally of the length configurations of the nonpairs.

      Length configurations of equivalency classes, as of Unicode 3.2. Most of the classes (83%) have two single codepoints. Here "112:28" means there are 28 equivalency classes with 2 single codepoints and one string of length 2.

      11:656
      111:16
      1111:3
      112:28
      113:2
      12:31
      13:12
      22:38

      Note: This method does not count the frequencies of the different length configurations (as shown above after ':'); it merely records which configurations occur.
      Parameters:
      pairs - Accumulate equivalency classes that consist of exactly two codepoints here. This is 83+% of the classes. E.g., {"a", "A"}.
      nonpairs - Accumulate other equivalency classes here, as lists of strings. E.g., {"st", "\uFB05", "\uFB06"}.
      lengths - Accumulate a list of unique length structures, not including pairs. Each length structure is represented by a string of digits. The digit string "12" means the equivalency class contains a single code point and a string of length 2. Typical contents of 'lengths': { "111", "1111", "112", "113", "12", "13", "22" }. Note the absence of "11".
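      As a rough illustration of the length structures described for 'lengths', the sketch below derives the digit string for one equivalency class and collects the unique non-pair structures. The class and method names are hypothetical. It assumes lengths are counted in code points, that digits are listed in ascending order (as in "112"), and that every length fits in one digit; none of this is confirmed by the tool's source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration; not part of ICU4J.
class LengthStructureSketch {

    // Builds the digit string for one class, e.g. "112" for two single
    // code points and one string of length 2.
    // ASSUMPTION: lengths are code point counts, sorted ascending.
    static String lengthStructure(Collection<String> equivClass) {
        List<Integer> lengths = new ArrayList<>();
        for (String s : equivClass) {
            lengths.add(s.codePointCount(0, s.length()));
        }
        Collections.sort(lengths);
        StringBuilder digits = new StringBuilder();
        for (int n : lengths) {
            digits.append(n);
        }
        return digits.toString();
    }

    // Records which structures occur, leaving out plain pairs ("11").
    // Frequencies are not counted, matching the note in the description.
    static SortedSet<String> nonPairStructures(
            Collection<? extends Collection<String>> classes) {
        SortedSet<String> result = new TreeSet<>();
        for (Collection<String> c : classes) {
            String structure = lengthStructure(c);
            if (!structure.equals("11")) {
                result.add(structure);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> classes = Arrays.asList(
                Arrays.asList("a", "A"),
                Arrays.asList("st", "\uFB05", "\uFB06"));
        System.out.println(nonPairStructures(classes));
    }
}
```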
    • generateCaseData

      static void generateCaseData() throws IOException
      Throws:
      IOException
    • getCaseSensitive

      static UnicodeSet getCaseSensitive()
      Create the set of case-sensitive characters. These are characters that participate in any case mapping operation as a source or as a member of a target string.
    • emitUCharRangesArray

      static void emitUCharRangesArray(PrintStream out, UnicodeSet set, String id)
      Given a UnicodeSet, emit it as an array of UChar pairs. Each pair will be the start/end of a range. Code points >= U+10000 will be represented as surrogate pairs.
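      A minimal sketch of the range emission, under stated assumptions: the set is passed as an int[][] of inclusive start/end code points rather than a UnicodeSet, and the C declaration text is invented for illustration (the tool's real output layout is not shown in this documentation). Character.toChars supplies the surrogate pair for code points at or above U+10000.

```java
import java.util.Locale;

// Hypothetical illustration; not part of ICU4J.
class UCharRangesSketch {

    // Formats one code point as one or two UTF-16 code units in hex.
    static String units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format(Locale.ROOT, "0x%04X, ", (int) unit));
        }
        return sb.toString();
    }

    // Emits each range as its start followed by its end, one range per line.
    // ASSUMPTION: the declaration syntax is illustrative only.
    static String emit(String id, int[][] ranges) {
        StringBuilder sb = new StringBuilder();
        sb.append("static const UChar ").append(id).append("[] = {\n");
        for (int[] range : ranges) {
            sb.append("    ")
              .append(units(range[0]))
              .append(units(range[1]))
              .append('\n');
        }
        sb.append("};\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] ranges = { {0x41, 0x5A}, {0x10400, 0x10427} };
        System.out.print(emit("kExampleRanges", ranges));
    }
}
```

      Note that a range touching the supplementary planes occupies three or four code units instead of two, so a consumer of such an array has to decode surrogates while reading it.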
    • emitRangesString

      static void emitRangesString(PrintStream out, UnicodeSet set, String id)
      Given a UnicodeSet, emit it as a Java string. The most economical format is not the pattern, but instead a pairs list, with each range pair represented as two adjacent characters.
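      The pairs-list idea can be illustrated as follows. The sketch assumes the set arrives as an int[][] of inclusive start/end code points instead of a UnicodeSet, and the escaping policy (printable ASCII kept, everything else written as a four-digit Unicode escape) is a guess at a reasonable format, not the tool's documented behavior. All names are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration; not part of ICU4J.
class RangesStringSketch {

    // Concatenates start and end of every range: two adjacent
    // characters (or surrogate pairs) per range.
    static String toPairsString(int[][] ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] range : ranges) {
            sb.appendCodePoint(range[0]);
            sb.appendCodePoint(range[1]);
        }
        return sb.toString();
    }

    // Renders a string as Java source text, escaping every UTF-16 code
    // unit that is not printable ASCII, plus the quote and the backslash.
    static String toJavaLiteral(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F && c != '"' && c != '\\') {
                sb.append(c);
            } else {
                sb.append(String.format(Locale.ROOT, "\\u%04X", (int) c));
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        int[][] ranges = { {0x41, 0x5A}, {0x61, 0x7A}, {0xC0, 0xD6} };
        System.out.println(toJavaLiteral(toPairsString(ranges)));
    }
}
```

      A reader of such a string walks it two code points at a time and treats each pair as an inclusive range, which is typically shorter than the equivalent pattern text.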