Class Collation

java.lang.Object
com.ibm.icu.impl.coll.Collation

public final class Collation extends Object
Collation v2 basic definitions and static helper functions. Data structures except for expansion tables store 32-bit CEs which are either specials (see tags below) or are compact forms of 64-bit CEs.
  • Field Details

    • SENTINEL_CP

      public static final int SENTINEL_CP
      UChar32 U_SENTINEL. TODO: Create a common, public constant?
    • LESS

      public static final int LESS
    • EQUAL

      public static final int EQUAL
    • GREATER

      public static final int GREATER
    • TERMINATOR_BYTE

      public static final int TERMINATOR_BYTE
    • LEVEL_SEPARATOR_BYTE

      public static final int LEVEL_SEPARATOR_BYTE
    • BEFORE_WEIGHT16

      static final int BEFORE_WEIGHT16
      The secondary/tertiary lower limit for tailoring before any root elements.
    • MERGE_SEPARATOR_BYTE

      public static final int MERGE_SEPARATOR_BYTE
      Merge-sort-key separator. Same as the unique primary and identical-level weights of U+FFFE. Must not be used as primary compression low terminator. Otherwise usable.
    • MERGE_SEPARATOR_PRIMARY

      public static final long MERGE_SEPARATOR_PRIMARY
    • MERGE_SEPARATOR_CE32

      static final int MERGE_SEPARATOR_CE32
    • PRIMARY_COMPRESSION_LOW_BYTE

      public static final int PRIMARY_COMPRESSION_LOW_BYTE
      Primary compression low terminator, must be greater than MERGE_SEPARATOR_BYTE. Reserved value in primary second byte if the lead byte is compressible. Otherwise usable in all CE weight bytes.
    • PRIMARY_COMPRESSION_HIGH_BYTE

      public static final int PRIMARY_COMPRESSION_HIGH_BYTE
      Primary compression high terminator. Reserved value in primary second byte if the lead byte is compressible. Otherwise usable in all CE weight bytes.
    • COMMON_BYTE

      static final int COMMON_BYTE
      Default secondary/tertiary weight lead byte.
    • COMMON_WEIGHT16

      public static final int COMMON_WEIGHT16
    • COMMON_SECONDARY_CE

      static final int COMMON_SECONDARY_CE
      Middle 16 bits of a CE with a common secondary weight.
    • COMMON_TERTIARY_CE

      static final int COMMON_TERTIARY_CE
      Lower 16 bits of a CE with a common tertiary weight.
    • COMMON_SEC_AND_TER_CE

      public static final int COMMON_SEC_AND_TER_CE
      Lower 32 bits of a CE with common secondary and tertiary weights.
    • SECONDARY_MASK

      static final int SECONDARY_MASK
    • CASE_MASK

      public static final int CASE_MASK
    • SECONDARY_AND_CASE_MASK

      static final int SECONDARY_AND_CASE_MASK
    • ONLY_TERTIARY_MASK

      public static final int ONLY_TERTIARY_MASK
      Only the 2*6 bits for the pure tertiary weight.
    • ONLY_SEC_TER_MASK

      static final int ONLY_SEC_TER_MASK
      Only the secondary and tertiary bits; no case, no quaternary.
    • CASE_AND_TERTIARY_MASK

      static final int CASE_AND_TERTIARY_MASK
      Case bits and tertiary bits.
    • QUATERNARY_MASK

      public static final int QUATERNARY_MASK
    • CASE_AND_QUATERNARY_MASK

      public static final int CASE_AND_QUATERNARY_MASK
      Case bits and quaternary bits.
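      The mask constants above partition the lower 32 bits of a 64-bit CE. The following is a minimal sketch of how they could compose, not the ICU source. The hex values are assumptions: "2*6 bits" is taken to mean the low six bits of each tertiary byte, with the case bits in bits 15..14 and the quaternary bits in bits 7..6. The class name CeMaskSketch and its helper methods are hypothetical.

```java
/** Sketch of how the CE weight masks could compose; all hex values are assumptions. */
final class CeMaskSketch {
    // Assumed layout of the lower 32 bits of a 64-bit CE:
    // bits 31..16 secondary, bits 15..14 case,
    // bits 13..8 and 5..0 pure tertiary, bits 7..6 quaternary.
    static final int SECONDARY_MASK = 0xffff0000;
    static final int CASE_MASK = 0xc000;
    static final int ONLY_TERTIARY_MASK = 0x3f3f; // 2*6 bits
    static final int QUATERNARY_MASK = 0xc0;

    // Combined masks are plain bitwise ORs of the basic ones.
    static final int SECONDARY_AND_CASE_MASK = SECONDARY_MASK | CASE_MASK;
    static final int ONLY_SEC_TER_MASK = SECONDARY_MASK | ONLY_TERTIARY_MASK;
    static final int CASE_AND_TERTIARY_MASK = CASE_MASK | ONLY_TERTIARY_MASK;
    static final int CASE_AND_QUATERNARY_MASK = CASE_MASK | QUATERNARY_MASK;

    /** Returns the tertiary weight without case or quaternary bits. */
    static int pureTertiary(long ce) {
        return (int) ce & ONLY_TERTIARY_MASK;
    }

    /** Returns the two case bits as a value 0..3. */
    static int caseBits(long ce) {
        return ((int) ce & CASE_MASK) >>> 14;
    }
}
```

      Under these assumptions, the case, pure-tertiary and quaternary masks together cover exactly the lower 16 bits of a CE.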
    • UNASSIGNED_IMPLICIT_BYTE

      static final int UNASSIGNED_IMPLICIT_BYTE
    • FIRST_UNASSIGNED_PRIMARY

      static final long FIRST_UNASSIGNED_PRIMARY
      First unassigned: AlphabeticIndex overflow boundary. We want a 3-byte primary so that it fits into the root elements table. This 3-byte primary will not collide with any unassigned-implicit 4-byte primaries because the first few hundred Unicode code points all have real mappings.
    • TRAIL_WEIGHT_BYTE

      static final int TRAIL_WEIGHT_BYTE
    • FIRST_TRAILING_PRIMARY

      static final long FIRST_TRAILING_PRIMARY
    • MAX_PRIMARY

      public static final long MAX_PRIMARY
    • MAX_REGULAR_CE32

      static final int MAX_REGULAR_CE32
    • FFFD_PRIMARY

      public static final long FFFD_PRIMARY
    • FFFD_CE32

      static final int FFFD_CE32
    • SPECIAL_CE32_LOW_BYTE

      static final int SPECIAL_CE32_LOW_BYTE
      A CE32 is special if its low byte is this or greater. Impossible case bits 11 mark special CE32s. This value itself is used to indicate a fallback to the base collator.
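      The low-byte test described for SPECIAL_CE32_LOW_BYTE can be sketched as follows. This is an illustration, not the ICU source: it assumes the constant is 0xc0 (case bits 11 in bits 7..6 of the low byte) and that the tag occupies the low four bits, which is consistent with FALLBACK_TAG being the tag value in SPECIAL_CE32_LOW_BYTE and with IMPLICIT_TAG having all bits set. The class name SpecialCe32Sketch is hypothetical.

```java
/** Sketch of the special-CE32 test; the constant value and tag width are assumptions. */
final class SpecialCe32Sketch {
    // Assumed value: both case bits set in the low byte, tag bits zero.
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0;
    // The fallback CE32 is the special low byte with tag 0 and no other bits.
    static final int FALLBACK_CE32 = SPECIAL_CE32_LOW_BYTE;

    /** A CE32 is special if its low byte is SPECIAL_CE32_LOW_BYTE or greater. */
    static boolean isSpecialCE32(int ce32) {
        return (ce32 & 0xff) >= SPECIAL_CE32_LOW_BYTE;
    }

    /** Assumes the tag is stored in the low 4 bits of a special CE32. */
    static int tagFromCE32(int ce32) {
        return ce32 & 0xf;
    }

    /** True only for a special CE32 that carries the given tag. */
    static boolean hasCE32Tag(int ce32, int tag) {
        return isSpecialCE32(ce32) && tagFromCE32(ce32) == tag;
    }
}
```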
    • FALLBACK_CE32

      static final int FALLBACK_CE32
    • LONG_PRIMARY_CE32_LOW_BYTE

      static final int LONG_PRIMARY_CE32_LOW_BYTE
      Low byte of a long-primary special CE32.
    • UNASSIGNED_CE32

      static final int UNASSIGNED_CE32
    • NO_CE32

      static final int NO_CE32
    • NO_CE_PRIMARY

      static final long NO_CE_PRIMARY
      No CE: End of input. Only used in runtime code, not stored in data.
    • NO_CE_WEIGHT16

      static final int NO_CE_WEIGHT16
    • NO_CE

      public static final long NO_CE
    • NO_LEVEL

      public static final int NO_LEVEL
      Unspecified level.
    • PRIMARY_LEVEL

      public static final int PRIMARY_LEVEL
    • SECONDARY_LEVEL

      public static final int SECONDARY_LEVEL
    • CASE_LEVEL

      public static final int CASE_LEVEL
    • TERTIARY_LEVEL

      public static final int TERTIARY_LEVEL
    • QUATERNARY_LEVEL

      public static final int QUATERNARY_LEVEL
    • IDENTICAL_LEVEL

      public static final int IDENTICAL_LEVEL
    • ZERO_LEVEL

      public static final int ZERO_LEVEL
      Beyond sort key bytes.
    • NO_LEVEL_FLAG

      static final int NO_LEVEL_FLAG
      Sort key level flags: xx_FLAG = 1 << xx_LEVEL. In Java, use enum Level with flag() getters, or use EnumSet rather than hand-made bit sets.
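      The relation xx_FLAG = 1 << xx_LEVEL and the enum alternative mentioned above might look like the sketch below. The enum LevelSketch, its constant names and the flagsOf helper are hypothetical; the sketch also assumes that the levels are numbered from 0 in the order in which this page lists them (NO_LEVEL through ZERO_LEVEL).

```java
import java.util.EnumSet;

/** Hypothetical enum mirroring the level constants; ordinal() stands in for xx_LEVEL. */
enum LevelSketch {
    NO_LEVEL, PRIMARY, SECONDARY, CASE, TERTIARY, QUATERNARY, IDENTICAL, ZERO;

    /** xx_FLAG = 1 << xx_LEVEL. */
    int flag() {
        return 1 << ordinal();
    }

    /** Folds a type-safe EnumSet into the equivalent hand-made bit set. */
    static int flagsOf(EnumSet<LevelSketch> levels) {
        int bits = 0;
        for (LevelSketch level : levels) {
            bits |= level.flag();
        }
        return bits;
    }
}
```

      With this numbering, EnumSet.of(PRIMARY, TERTIARY) corresponds to the bit set 0x12.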
    • PRIMARY_LEVEL_FLAG

      static final int PRIMARY_LEVEL_FLAG
    • SECONDARY_LEVEL_FLAG

      static final int SECONDARY_LEVEL_FLAG
    • CASE_LEVEL_FLAG

      static final int CASE_LEVEL_FLAG
    • TERTIARY_LEVEL_FLAG

      static final int TERTIARY_LEVEL_FLAG
    • QUATERNARY_LEVEL_FLAG

      static final int QUATERNARY_LEVEL_FLAG
    • IDENTICAL_LEVEL_FLAG

      static final int IDENTICAL_LEVEL_FLAG
    • ZERO_LEVEL_FLAG

      static final int ZERO_LEVEL_FLAG
    • FALLBACK_TAG

      static final int FALLBACK_TAG
      Fall back to the base collator. This is the tag value in SPECIAL_CE32_LOW_BYTE and FALLBACK_CE32.
      Bits 31..8: Unused, 0.
    • LONG_PRIMARY_TAG

      static final int LONG_PRIMARY_TAG
      Long-primary CE with COMMON_SEC_AND_TER_CE.
      Bits 31..8: Three-byte primary.
    • LONG_SECONDARY_TAG

      static final int LONG_SECONDARY_TAG
      Long-secondary CE with zero primary.
      Bits 31..16: Secondary weight.
      Bits 15.. 8: Tertiary weight.
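      Because a long-secondary CE32 already keeps its secondary weight in bits 31..16 and its tertiary weight in bits 15..8, converting it to a 64-bit CE with zero primary only requires clearing the low byte. A minimal sketch, assuming SPECIAL_CE32_LOW_BYTE is 0xc0 and LONG_SECONDARY_TAG is 2 (neither value is stated on this page); the class name LongSecondarySketch is hypothetical:

```java
/** Sketch of long-secondary CE32 packing; constant values are assumptions. */
final class LongSecondarySketch {
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0; // assumed value
    static final int LONG_SECONDARY_TAG = 2;       // assumed value

    /** lower32 holds the secondary in bits 31..16 and the tertiary in bits 15..8; its low byte must be 0. */
    static int makeLongSecondaryCE32(int lower32) {
        return lower32 | SPECIAL_CE32_LOW_BYTE | LONG_SECONDARY_TAG;
    }

    /** Clears the special low byte; the upper 32 bits (the primary) stay zero. */
    static long ceFromLongSecondaryCE32(int ce32) {
        return (long) ce32 & 0xffffff00L;
    }
}
```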
    • RESERVED_TAG_3

      static final int RESERVED_TAG_3
      Unused. May be used in the future for single-byte secondary CEs (SHORT_SECONDARY_TAG), storing the secondary in bits 31..24, the ccc in bits 23..16, and the tertiary in bits 15..8.
    • LATIN_EXPANSION_TAG

      static final int LATIN_EXPANSION_TAG
      Latin mini expansions of two simple CEs [pp, 05, tt] [00, ss, 05].
      Bits 31..24: Single-byte primary weight pp of the first CE.
      Bits 23..16: Tertiary weight tt of the first CE.
      Bits 15.. 8: Secondary weight ss of the second CE.
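      The two CEs of a Latin mini expansion can be unpacked from the bit fields listed above. The sketch below is an illustration under the assumption that a 64-bit CE holds the primary in bits 63..32, the secondary in bits 31..16 and the tertiary in bits 15..0, and that the common weight 05 corresponds to COMMON_SECONDARY_CE = 0x05000000 and COMMON_TERTIARY_CE = 0x0500. The class name LatinExpansionSketch is hypothetical.

```java
/** Sketch of decoding [pp, 05, tt] [00, ss, 05] from a Latin-expansion CE32. */
final class LatinExpansionSketch {
    static final int COMMON_SECONDARY_CE = 0x05000000; // assumed: common secondary 0500 in the middle 16 bits
    static final int COMMON_TERTIARY_CE = 0x0500;      // assumed: common tertiary 0500 in the lower 16 bits

    /** First CE: primary byte pp, common secondary, tertiary byte tt. */
    static long latinCE0FromCE32(int ce32) {
        return ((long) (ce32 & 0xff000000) << 32)   // pp moves to bits 63..56
                | COMMON_SECONDARY_CE
                | ((ce32 & 0xff0000) >> 8);         // tt moves to bits 15..8
    }

    /** Second CE: zero primary, secondary byte ss, common tertiary. */
    static long latinCE1FromCE32(int ce32) {
        return (((long) ce32 & 0xff00) << 16)       // ss moves to bits 31..24
                | COMMON_TERTIARY_CE;
    }
}
```

      For example, under these assumptions the CE32 0x2a2086c4 (pp=2a, tt=20, ss=86) yields the CEs 0x2a00000005002000 and 0x0000000086000500.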
    • EXPANSION32_TAG

      static final int EXPANSION32_TAG
      Points to one or more simple/long-primary/long-secondary 32-bit CE32s.
      Bits 31..13: Index into int table.
      Bits 12.. 8: Length=1..31.
    • EXPANSION_TAG

      static final int EXPANSION_TAG
      Points to one or more 64-bit CEs.
      Bits 31..13: Index into CE table.
      Bits 12.. 8: Length=1..31.
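      Both expansion tags share the same index/length layout, so a single pair of pack and unpack helpers covers them. The sketch below follows the documented bit positions; the value 0xc0 for SPECIAL_CE32_LOW_BYTE, the 4-bit tag width, and the range checks are assumptions added for illustration. The limits follow from the field widths: 19 index bits give a maximum index of 0x7ffff, and 5 length bits give a maximum length of 31. The class name ExpansionCe32Sketch is hypothetical.

```java
/** Sketch of packing tag, index and length into a special CE32. */
final class ExpansionCe32Sketch {
    static final int SPECIAL_CE32_LOW_BYTE = 0xc0; // assumed value
    static final int MAX_INDEX = 0x7ffff;          // 19 bits: 31..13
    static final int MAX_EXPANSION_LENGTH = 31;    // 5 bits: 12..8

    static int makeCE32FromTagIndexAndLength(int tag, int index, int length) {
        // Range checks are an addition for this sketch.
        if (index < 0 || index > MAX_INDEX) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        if (length < 1 || length > MAX_EXPANSION_LENGTH) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        return (index << 13) | (length << 8) | SPECIAL_CE32_LOW_BYTE | tag;
    }

    /** Unsigned shift, so an index with its top bit set stays positive. */
    static int indexFromCE32(int ce32) {
        return ce32 >>> 13;
    }

    static int lengthFromCE32(int ce32) {
        return (ce32 >> 8) & 31;
    }
}
```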
    • BUILDER_DATA_TAG

      static final int BUILDER_DATA_TAG
      Builder data, used only in the CollationDataBuilder, not in runtime data.
      If bit 8 is 0: Builder context, points to a list of context-sensitive mappings.
        Bits 31..13: Index to the builder's list of ConditionalCE32 for this character.
        Bits 12.. 9: Unused, 0.
      If bit 8 is 1 (IS_BUILDER_JAMO_CE32): Builder-only jamoCE32 value. The builder fetches the Jamo CE32 from the trie.
        Bits 31..13: Jamo code point.
        Bits 12.. 9: Unused, 0.
    • PREFIX_TAG

      static final int PREFIX_TAG
      Points to prefix trie.
      Bits 31..13: Index into prefix/contraction data.
      Bits 12.. 8: Unused, 0.
    • CONTRACTION_TAG

      static final int CONTRACTION_TAG
      Points to contraction data.
      Bits 31..13: Index into prefix/contraction data.
      Bits 12..11: Unused, 0.
      Bit      10: CONTRACT_TRAILING_CCC flag.
      Bit       9: CONTRACT_NEXT_CCC flag.
      Bit       8: CONTRACT_SINGLE_CP_NO_MATCH flag.
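      The three contraction flags sit directly above the low byte, so their values follow from the documented bit positions (bit 8 = 0x100, bit 9 = 0x200, bit 10 = 0x400). The sketch below decodes them together with the data index; the class name ContractionFlagsSketch and the describe helper are hypothetical.

```java
/** Sketch of reading the index and flags of a CONTRACTION_TAG CE32. */
final class ContractionFlagsSketch {
    static final int CONTRACT_SINGLE_CP_NO_MATCH = 0x100; // bit 8
    static final int CONTRACT_NEXT_CCC = 0x200;           // bit 9
    static final int CONTRACT_TRAILING_CCC = 0x400;       // bit 10

    /** Hypothetical helper: renders the decoded fields as text. */
    static String describe(int ce32) {
        int index = ce32 >>> 13; // bits 31..13
        boolean singleCpNoMatch = (ce32 & CONTRACT_SINGLE_CP_NO_MATCH) != 0;
        boolean nextCcc = (ce32 & CONTRACT_NEXT_CCC) != 0;
        boolean trailingCcc = (ce32 & CONTRACT_TRAILING_CCC) != 0;
        return "index=" + index
                + " singleCpNoMatch=" + singleCpNoMatch
                + " nextCcc=" + nextCcc
                + " trailingCcc=" + trailingCcc;
    }
}
```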
    • DIGIT_TAG

      static final int DIGIT_TAG
      Decimal digit.
      Bits 31..13: Index into int table for non-numeric-collation CE32.
      Bit      12: Unused, 0.
      Bits 11.. 8: Digit value 0..9.
    • U0000_TAG

      static final int U0000_TAG
      Tag for U+0000, for moving the NUL-termination handling from the regular fastpath into specials-handling code.
      Bits 31..8: Unused, 0.
    • HANGUL_TAG

      static final int HANGUL_TAG
      Tag for a Hangul syllable.
      Bits 31..9: Unused, 0.
      Bit      8: HANGUL_NO_SPECIAL_JAMO flag.
    • LEAD_SURROGATE_TAG

      static final int LEAD_SURROGATE_TAG
      Tag for a lead surrogate code unit. Optional optimization for UTF-16 string processing.
      Bits 31..10: Unused, 0.
      Bits  9.. 8:
        =0: All associated supplementary code points are unassigned-implicit.
        =1: All associated supplementary code points fall back to the base data.
        else: (Normally 2) Look up the data for the supplementary code point.
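      The two-bit type field of a lead-surrogate CE32 selects one of three behaviors. A minimal sketch of that dispatch, assuming the LEAD_* constants are the in-place values of bits 9..8 (0, 0x100, 0x200, with mask 0x300); the class LeadSurrogateSketch, the Action enum and actionFor are hypothetical names:

```java
/** Sketch of dispatching on bits 9..8 of a LEAD_SURROGATE_TAG CE32. */
final class LeadSurrogateSketch {
    // Assumed in-place values of the type field in bits 9..8.
    static final int LEAD_ALL_UNASSIGNED = 0;
    static final int LEAD_ALL_FALLBACK = 0x100;
    static final int LEAD_MIXED = 0x200;
    static final int LEAD_TYPE_MASK = 0x300;

    /** Hypothetical result type naming the three documented behaviors. */
    enum Action { UNASSIGNED_IMPLICIT, FALL_BACK_TO_BASE, LOOK_UP_SUPPLEMENTARY }

    static Action actionFor(int ce32) {
        switch (ce32 & LEAD_TYPE_MASK) {
            case LEAD_ALL_UNASSIGNED:
                return Action.UNASSIGNED_IMPLICIT;
            case LEAD_ALL_FALLBACK:
                return Action.FALL_BACK_TO_BASE;
            default:
                // Normally LEAD_MIXED: look up the supplementary code point itself.
                return Action.LOOK_UP_SUPPLEMENTARY;
        }
    }
}
```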
    • OFFSET_TAG

      static final int OFFSET_TAG
      Tag for CEs with primary weights in code point order.
      Bits 31..13: Index into CE table, for one data "CE".
      Bits 12.. 8: Unused, 0.
      This data "CE" has the following bit fields:
      Bits 63..32: Three-byte primary pppppp00.
      Bits 31.. 8: Start/base code point of the in-order range.
      Bit       7: Flag isCompressible primary.
      Bits  6.. 0: Per-code point primary-weight increment.
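      The bit fields of the OFFSET_TAG data "CE" can be read with a few shifts and masks. The sketch below only decodes the fields and computes the offset by which the base primary would be advanced for a code point c. It assumes that offset is (c - base) * increment and does not reproduce the byte-wise primary arithmetic of incThreeBytePrimaryByOffset. The class OffsetDataSketch and its method names are hypothetical.

```java
/** Sketch of decoding the data "CE" referenced by an OFFSET_TAG CE32. */
final class OffsetDataSketch {
    /** Bits 63..32: three-byte primary pppppp00. */
    static long basePrimary(long dataCE) {
        return dataCE >>> 32;
    }

    /** Bits 31..8: start/base code point of the in-order range. */
    static int baseCodePoint(long dataCE) {
        return (int) dataCE >>> 8;
    }

    /** Bit 7: whether the primary lead byte is compressible. */
    static boolean isCompressible(long dataCE) {
        return ((int) dataCE & 0x80) != 0;
    }

    /** Bits 6..0: per-code point primary-weight increment. */
    static int increment(long dataCE) {
        return (int) dataCE & 0x7f;
    }

    /** Assumed offset for code point c within the in-order range. */
    static int primaryOffset(int c, long dataCE) {
        return (c - baseCodePoint(dataCE)) * increment(dataCE);
    }
}
```

      For instance, with a made-up data value 0x7c04020000340083 (base primary 7c040200, base code point U+3400, compressible, increment 3), the code point U+3402 gets an offset of 6.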
    • IMPLICIT_TAG

      static final int IMPLICIT_TAG
      Implicit CE tag. Compute an unassigned-implicit CE. All bits are set (UNASSIGNED_CE32=0xffffffff).
    • MAX_EXPANSION_LENGTH

      static final int MAX_EXPANSION_LENGTH
      We limit the number of CEs in an expansion so that we can use a small number of length bits in the data structure, and so that an implementation can copy CEs at runtime without growing a destination buffer.
    • MAX_INDEX

      static final int MAX_INDEX
    • CONTRACT_SINGLE_CP_NO_MATCH

      static final int CONTRACT_SINGLE_CP_NO_MATCH
      Set if there is no match for the single (no-suffix) character itself. This is only possible if there is a prefix. In this case, discontiguous contraction matching cannot add combining marks starting from an empty suffix. The default CE32 is used anyway if there is no suffix match.
    • CONTRACT_NEXT_CCC

      static final int CONTRACT_NEXT_CCC
      Set if the first character of every contraction suffix has lccc!=0.
    • CONTRACT_TRAILING_CCC

      static final int CONTRACT_TRAILING_CCC
      Set if any contraction suffix ends with lccc!=0.
    • HANGUL_NO_SPECIAL_JAMO

      static final int HANGUL_NO_SPECIAL_JAMO
      For HANGUL_TAG: isSpecialCE32() is false for every Jamo CE32 of the syllable.
    • LEAD_ALL_UNASSIGNED

      static final int LEAD_ALL_UNASSIGNED
    • LEAD_ALL_FALLBACK

      static final int LEAD_ALL_FALLBACK
    • LEAD_MIXED

      static final int LEAD_MIXED
    • LEAD_TYPE_MASK

      static final int LEAD_TYPE_MASK
  • Constructor Details

    • Collation

      public Collation()
  • Method Details

    • isAssignedCE32

      static boolean isAssignedCE32(int ce32)
    • makeLongPrimaryCE32

      static int makeLongPrimaryCE32(long p)
    • primaryFromLongPrimaryCE32

      static long primaryFromLongPrimaryCE32(int ce32)
      Turns the long-primary CE32 into a primary weight pppppp00.
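      A long-primary CE32 stores a three-byte primary in bits 31..8, so the round trip between primary, CE32 and 64-bit CE is a matter of masking and shifting. A minimal sketch, not the ICU source, assuming LONG_PRIMARY_CE32_LOW_BYTE is 0xc1 and COMMON_SEC_AND_TER_CE is 0x05000500 (neither value is stated on this page); the class name LongPrimarySketch is hypothetical:

```java
/** Sketch of long-primary CE32 conversions; constant values are assumptions. */
final class LongPrimarySketch {
    static final int LONG_PRIMARY_CE32_LOW_BYTE = 0xc1; // assumed: special low byte plus long-primary tag
    static final int COMMON_SEC_AND_TER_CE = 0x05000500; // assumed: common secondary and tertiary weights

    /** p is a primary of the form pppppp00. */
    static int makeLongPrimaryCE32(long p) {
        return (int) (p | LONG_PRIMARY_CE32_LOW_BYTE);
    }

    /** Clears the low byte, leaving the primary weight pppppp00. */
    static long primaryFromLongPrimaryCE32(int ce32) {
        return (long) ce32 & 0xffffff00L;
    }

    /** Primary moves to the upper 32 bits; the lower 32 bits get the common weights. */
    static long ceFromLongPrimaryCE32(int ce32) {
        return ((long) (ce32 & 0xffffff00) << 32) | COMMON_SEC_AND_TER_CE;
    }
}
```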
    • ceFromLongPrimaryCE32

      static long ceFromLongPrimaryCE32(int ce32)
    • makeLongSecondaryCE32

      static int makeLongSecondaryCE32(int lower32)
    • ceFromLongSecondaryCE32

      static long ceFromLongSecondaryCE32(int ce32)
    • makeCE32FromTagIndexAndLength

      static int makeCE32FromTagIndexAndLength(int tag, int index, int length)
      Makes a special CE32 with tag, index and length.
    • makeCE32FromTagAndIndex

      static int makeCE32FromTagAndIndex(int tag, int index)
      Makes a special CE32 with only tag and index.
    • isSpecialCE32

      static boolean isSpecialCE32(int ce32)
    • tagFromCE32

      static int tagFromCE32(int ce32)
    • hasCE32Tag

      static boolean hasCE32Tag(int ce32, int tag)
    • isLongPrimaryCE32

      static boolean isLongPrimaryCE32(int ce32)
    • isSimpleOrLongCE32

      static boolean isSimpleOrLongCE32(int ce32)
    • isSelfContainedCE32

      static boolean isSelfContainedCE32(int ce32)
      Returns:
      true if the ce32 yields one or more CEs without further data lookups
    • isPrefixCE32

      static boolean isPrefixCE32(int ce32)
    • isContractionCE32

      static boolean isContractionCE32(int ce32)
    • ce32HasContext

      static boolean ce32HasContext(int ce32)
    • latinCE0FromCE32

      static long latinCE0FromCE32(int ce32)
      Get the first of the two Latin-expansion CEs encoded in ce32.
    • latinCE1FromCE32

      static long latinCE1FromCE32(int ce32)
      Get the second of the two Latin-expansion CEs encoded in ce32.
    • indexFromCE32

      static int indexFromCE32(int ce32)
      Returns the data index from a special CE32.
    • lengthFromCE32

      static int lengthFromCE32(int ce32)
      Returns the data length from a ce32.
    • digitFromCE32

      static char digitFromCE32(int ce32)
      Returns the digit value from a DIGIT_TAG ce32.
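      Note that the char returned here carries the numeric value 0..9 from bits 11..8, not the character '0'..'9'. A minimal sketch based on the DIGIT_TAG bit layout; the class name DigitCe32Sketch and the sample CE32 in the usage note are hypothetical:

```java
/** Sketch of reading the digit value from a DIGIT_TAG CE32. */
final class DigitCe32Sketch {
    /** Bits 11..8 hold the digit value; the result is a number, not a printable digit. */
    static char digitFromCE32(int ce32) {
        return (char) ((ce32 >> 8) & 0xf);
    }
}
```

      For a CE32 built as (42 << 13) | (7 << 8) | 0xca, the result compares equal to 7, not to '7'.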
    • ceFromSimpleCE32

      static long ceFromSimpleCE32(int ce32)
      Returns a 64-bit CE from a simple CE32 (not special).
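      A simple CE32 is the compact form mentioned in the class description. The sketch below assumes the compact layout ppppsstt (a two-byte primary, a one-byte secondary and a one-byte tertiary), which expands to the 64-bit form pppp0000 ss00 tt00; this layout is not spelled out on this page, so treat it as an assumption. The class name SimpleCe32Sketch is hypothetical.

```java
/** Sketch of expanding a simple (non-special) CE32 into a 64-bit CE. */
final class SimpleCe32Sketch {
    /** Assumed compact form ppppsstt -> pppp0000 ss00 tt00. */
    static long ceFromSimpleCE32(int ce32) {
        return ((long) (ce32 & 0xffff0000) << 32)  // primary to bits 63..48
                | ((long) (ce32 & 0xff00) << 16)   // secondary byte to bits 31..24
                | ((ce32 & 0xff) << 8);            // tertiary byte to bits 15..8
    }
}
```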
    • ceFromCE32

      static long ceFromCE32(int ce32)
      Returns a 64-bit CE from a simple/long-primary/long-secondary CE32.
    • makeCE

      public static long makeCE(long p)
      Creates a CE from a primary weight.
    • makeCE

      static long makeCE(long p, int s, int t, int q)
      Creates a CE from a primary weight, 16-bit secondary/tertiary weights, and a 2-bit quaternary.
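      Both makeCE overloads assemble a 64-bit CE from its weights. A minimal sketch, assuming the primary occupies bits 63..32, the secondary bits 31..16 and the tertiary bits 15..0, with the 2-bit quaternary placed in bits 7..6 (the quaternary position and the value 0x05000500 for COMMON_SEC_AND_TER_CE are assumptions). The class name MakeCeSketch is hypothetical.

```java
/** Sketch of assembling a 64-bit CE from its weights; layout details are assumptions. */
final class MakeCeSketch {
    static final int COMMON_SEC_AND_TER_CE = 0x05000500; // assumed value

    /** Primary in the upper 32 bits, common secondary and tertiary below. */
    static long makeCE(long p) {
        return (p << 32) | COMMON_SEC_AND_TER_CE;
    }

    /** s and t are 16-bit weights, q is a 2-bit quaternary (0..3). */
    static long makeCE(long p, int s, int t, int q) {
        return (p << 32) | ((long) s << 16) | t | (q << 6);
    }
}
```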
    • incTwoBytePrimaryByOffset

      public static long incTwoBytePrimaryByOffset(long basePrimary, boolean isCompressible, int offset)
      Increments a 2-byte primary by a code point offset.
    • incThreeBytePrimaryByOffset

      public static long incThreeBytePrimaryByOffset(long basePrimary, boolean isCompressible, int offset)
      Increments a 3-byte primary by a code point offset.
    • decTwoBytePrimaryByOneStep

      static long decTwoBytePrimaryByOneStep(long basePrimary, boolean isCompressible, int step)
      Decrements a 2-byte primary by one range step (1..0x7f).
    • decThreeBytePrimaryByOneStep

      static long decThreeBytePrimaryByOneStep(long basePrimary, boolean isCompressible, int step)
      Decrements a 3-byte primary by one range step (1..0x7f).
    • getThreeBytePrimaryForOffsetData

      static long getThreeBytePrimaryForOffsetData(int c, long dataCE)
      Computes a 3-byte primary for c's OFFSET_TAG data "CE".
    • unassignedPrimaryFromCodePoint

      static long unassignedPrimaryFromCodePoint(int c)
      Returns the unassigned-character implicit primary weight for any valid code point c.
    • unassignedCEFromCodePoint

      static long unassignedCEFromCodePoint(int c)