Bracket-Like Characters

Here is draft ICU4J code that collects the opening bracket-like characters and looks up the closing character for each. I threw the pairs into a Map, but you can adapt that to whatever format you want. If you need C++, ICU4C has corresponding functions that do the same thing. For help, please join the icu-support mailing list.

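    import java.util.Map;
    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UProperty;
    import com.ibm.icu.text.UnicodeSet;

    // Open punctuation (Ps) and initial quotes (Pi), minus the characters removed by hand.
    static final UnicodeSet INITIAL_PUNCTUATION = new UnicodeSet("[[:Ps:][:Pi:]-[༺༼᚛‘‚‛“„‟〝]]").freeze();

    // Fills output with (opening code point -> Bidi_Mirroring_Glyph code point) and returns it.
    private static <T extends Map<Integer, Integer>> T getMatchingBraces(T output) {
        for (String start : INITIAL_PUNCTUATION) {
            int startChar = start.codePointAt(0);
            String end = UCharacter.getStringPropertyValue(UProperty.BIDI_MIRRORING_GLYPH, startChar, UProperty.NameChoice.SHORT);
            if (end == null) {
                continue; // defensive: skip if no value is available
            }
            output.put(startChar, end.codePointAt(0));
        }
        return output;
    }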
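
As one way to adapt the resulting Map, the sketch below checks that bracket-like characters in a string are properly nested. It uses only the JDK. The class BracketChecker, the method isBalanced, and the small hand-written sampleBraces map are hypothetical stand-ins for illustration; in practice the map would come from getMatchingBraces.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class BracketChecker {
    // Hypothetical stand-in for the map filled by getMatchingBraces;
    // only a few pairs are listed here by hand.
    static Map<Integer, Integer> sampleBraces() {
        Map<Integer, Integer> m = new HashMap<>();
        m.put((int) '(', (int) ')');
        m.put((int) '[', (int) ']');
        m.put((int) '{', (int) '}');
        m.put(0x300C, 0x300D); // LEFT CORNER BRACKET -> RIGHT CORNER BRACKET
        return m;
    }

    // Returns true if every opener in text is closed in the correct order.
    static boolean isBalanced(String text, Map<Integer, Integer> openToClose) {
        Deque<Integer> expected = new ArrayDeque<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            Integer close = openToClose.get(cp);
            if (close != null) {
                expected.push(close);              // remember which closer we need
            } else if (openToClose.containsValue(cp)) {
                if (expected.isEmpty() || expected.pop() != cp) {
                    return false;                  // stray or mismatched closer
                }
            }
        }
        return expected.isEmpty();                 // no opener left unclosed
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("f(a[0], {「x」})", sampleBraces()));
    }
}
```

The loop walks code points rather than chars, so supplementary characters in the map would be handled the same way as BMP ones.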

Known issues are listed at the end of this page.


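Here is a listing of the opening characters by name. Where a pair is shown, it is the opening character followed by its closing counterpart: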

( ) LEFT PARENTHESIS
[ ] LEFT SQUARE BRACKET
{ } LEFT CURLY BRACKET
LEFT SQUARE BRACKET WITH QUILL
SUPERSCRIPT LEFT PARENTHESIS
SUBSCRIPT LEFT PARENTHESIS
LEFT-POINTING ANGLE BRACKET
MEDIUM LEFT PARENTHESIS ORNAMENT
MEDIUM FLATTENED LEFT PARENTHESIS ORNAMENT
MEDIUM LEFT-POINTING ANGLE BRACKET ORNAMENT
HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
HEAVY LEFT-POINTING ANGLE BRACKET ORNAMENT
LIGHT LEFT TORTOISE SHELL BRACKET ORNAMENT
MEDIUM LEFT CURLY BRACKET ORNAMENT
LEFT S-SHAPED BAG DELIMITER
MATHEMATICAL LEFT WHITE SQUARE BRACKET
MATHEMATICAL LEFT ANGLE BRACKET
MATHEMATICAL LEFT DOUBLE ANGLE BRACKET
MATHEMATICAL LEFT WHITE TORTOISE SHELL BRACKET
MATHEMATICAL LEFT FLATTENED PARENTHESIS
LEFT WHITE CURLY BRACKET
LEFT WHITE PARENTHESIS
Z NOTATION LEFT IMAGE BRACKET
Z NOTATION LEFT BINDING BRACKET
LEFT SQUARE BRACKET WITH UNDERBAR
LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
LEFT ANGLE BRACKET WITH DOT
LEFT ARC LESS-THAN BRACKET
DOUBLE LEFT ARC GREATER-THAN BRACKET
LEFT BLACK TORTOISE SHELL BRACKET
LEFT WIGGLY FENCE
LEFT DOUBLE WIGGLY FENCE
LEFT-POINTING CURVED ANGLE BRACKET
LEFT SUBSTITUTION BRACKET
LEFT DOTTED SUBSTITUTION BRACKET
LEFT TRANSPOSITION BRACKET
LEFT RAISED OMISSION BRACKET
LEFT LOW PARAPHRASE BRACKET
LEFT VERTICAL BAR WITH QUILL
TOP LEFT HALF BRACKET
BOTTOM LEFT HALF BRACKET
LEFT SIDEWAYS U BRACKET
LEFT DOUBLE PARENTHESIS
LEFT ANGLE BRACKET
LEFT DOUBLE ANGLE BRACKET
LEFT CORNER BRACKET
LEFT WHITE CORNER BRACKET
LEFT BLACK LENTICULAR BRACKET
LEFT TORTOISE SHELL BRACKET
LEFT WHITE LENTICULAR BRACKET
LEFT WHITE TORTOISE SHELL BRACKET
LEFT WHITE SQUARE BRACKET

SMALL LEFT PARENTHESIS
SMALL LEFT CURLY BRACKET
SMALL LEFT TORTOISE SHELL BRACKET
FULLWIDTH LEFT PARENTHESIS
FULLWIDTH LEFT SQUARE BRACKET
FULLWIDTH LEFT CURLY BRACKET
FULLWIDTH LEFT WHITE PARENTHESIS
HALFWIDTH LEFT CORNER BRACKET

(see below)

Known Issues

The characters [༺༼᚛‘‚‛“„‟〝] (see http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[༺༼᚛‘‚‛“„‟〝]) were removed by hand; that list would have to be reviewed and updated for each release of Unicode.
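
To review the hand-maintained exclusions against a newer Unicode version, it can help to regenerate the candidate set and compare. The sketch below is a rough JDK-only approximation of the UnicodeSet pattern, not the ICU code itself: it uses Character.getType, so the result reflects whatever Unicode version the running JDK supports. The class InitialPunctuation and its members are hypothetical names, and Set.of assumes Java 9 or later.

```java
import java.util.Set;
import java.util.TreeSet;

final class InitialPunctuation {
    // The characters removed by hand on this page: ༺ ༼ ᚛ ‘ ‚ ‛ “ „ ‟ 〝
    static final Set<Integer> EXCLUDED = Set.of(
            0x0F3A, 0x0F3C, 0x169B, 0x2018, 0x201A,
            0x201B, 0x201C, 0x201E, 0x201F, 0x301D);

    // Collects every code point whose general category is Ps or Pi,
    // minus the exclusions above, in code point order.
    static Set<Integer> candidates() {
        Set<Integer> result = new TreeSet<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            int type = Character.getType(cp);
            boolean opener = type == Character.START_PUNCTUATION
                    || type == Character.INITIAL_QUOTE_PUNCTUATION;
            if (opener && !EXCLUDED.contains(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : candidates()) {
            System.out.println(String.format("U+%04X %s", cp, Character.getName(cp)));
        }
    }
}
```

Diffing this output between two JDK versions is one way to spot newly added Ps or Pi characters that may need to join the exclusion list.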

The following may also behave like the curly quotes, whose opening and closing roles vary by language. They should probably be removed from the listing above (or, better, be given language-specific behavior in an environment that permits it).

‹ › SINGLE LEFT-POINTING ANGLE QUOTATION MARK
« » LEFT-POINTING DOUBLE ANGLE QUOTATION MARK

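The code doesn't generate the right results for the following characters, apparently because they have no mirrored counterpart for the Bidi_Mirroring_Glyph lookup to return (the vertical left angle bracket, for instance, comes back paired with itself). Their pairs would need to be hand-tuned.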

ORNATE LEFT PARENTHESIS
PRESENTATION FORM FOR VERTICAL LEFT WHITE LENTICULAR BRACKET
PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS
PRESENTATION FORM FOR VERTICAL LEFT CURLY BRACKET
PRESENTATION FORM FOR VERTICAL LEFT TORTOISE SHELL BRACKET
PRESENTATION FORM FOR VERTICAL LEFT BLACK LENTICULAR BRACKET
PRESENTATION FORM FOR VERTICAL LEFT DOUBLE ANGLE BRACKET
︿ ︿ PRESENTATION FORM FOR VERTICAL LEFT ANGLE BRACKET
PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET
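
One way to do that hand-tuning is to apply a small override table after the map has been generated. The sketch below is only an illustration: the class BraceOverrides, the table VERTICAL_PAIRS, and the method applyOverrides are hypothetical names, and the table covers just the vertical presentation forms listed above, each paired with the corresponding "RIGHT" character at the next code point. ORNATE LEFT PARENTHESIS is left out, since its correct handling is not something this sketch assumes.

```java
import java.util.HashMap;
import java.util.Map;

final class BraceOverrides {
    // Hypothetical hand-tuned pairs: PRESENTATION FORM FOR VERTICAL LEFT ... -> RIGHT ...
    static final int[][] VERTICAL_PAIRS = {
        {0xFE17, 0xFE18}, // white lenticular bracket
        {0xFE35, 0xFE36}, // parenthesis
        {0xFE37, 0xFE38}, // curly bracket
        {0xFE39, 0xFE3A}, // tortoise shell bracket
        {0xFE3B, 0xFE3C}, // black lenticular bracket
        {0xFE3D, 0xFE3E}, // double angle bracket
        {0xFE3F, 0xFE40}, // angle bracket
        {0xFE41, 0xFE42}, // corner bracket
        {0xFE43, 0xFE44}, // white corner bracket
        {0xFE47, 0xFE48}, // square bracket
    };

    // Overwrites (or adds) the entries for the hand-tuned openers and returns the same map.
    static <T extends Map<Integer, Integer>> T applyOverrides(T braces) {
        for (int[] pair : VERTICAL_PAIRS) {
            braces.put(pair[0], pair[1]);
        }
        return braces;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> braces = new HashMap<>();
        braces.put(0xFE3F, 0xFE3F); // the self-pairing shown in the listing above
        applyOverrides(braces);
        System.out.println(String.format("U+%04X -> U+%04X", 0xFE3F, braces.get(0xFE3F)));
    }
}
```

Keeping the overrides in a separate table means the generated data can be refreshed for a new Unicode release without losing the manual corrections.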

