Portability	GHC
Stability	experimental
Maintainer	[email protected]

Data.Text.ICU.Char

Contents

Working with character properties
Property identifier types
Property value types
- Text boundaries
Functions
- Conversion to numbers

Description

Access to the Unicode Character Database, implemented as bindings to the International Components for Unicode (ICU) libraries.

Unicode assigns values for many properties to every code point, not just to assigned characters. Most properties are simple boolean flags or constants from a small enumerated list; a few have more complex value types.

For more information, see "About the Unicode Character Database" http://www.unicode.org/ucd/ and the ICU User Guide chapter on Properties http://icu-project.org/userguide/properties.html.

Working with character properties

The property function provides the main view onto the Unicode Character Database. Because Unicode character properties have a variety of types, the property function is polymorphic. The type of its first argument dictates the type of its result, by use of the Property typeclass.

For instance, property Alphabetic returns a Bool, while property NFCQuickCheck returns a Maybe Bool.
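As a minimal sketch of this polymorphism, the two wrappers below fix the property identifier and let the result type follow from it. The names isAlphabetic and nfcQuickCheck are hypothetical helpers, not part of the module, and the sketch assumes the text-icu package is installed.

```haskell
import Data.Text.ICU.Char (Bool_(..), NFCQuickCheck_(..), property)

-- Alphabetic is a constructor of Bool_, so the result is a Bool.
isAlphabetic :: Char -> Bool
isAlphabetic = property Alphabetic

-- NFCQuickCheck is the constructor of NFCQuickCheck_, so the result
-- is a Maybe Bool.
nfcQuickCheck :: Char -> Maybe Bool
nfcQuickCheck = property NFCQuickCheck
```

Swapping the result types in these signatures should be rejected by the type checker, because the identifier type determines the value type.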

class Property p v | p -> v

Instances

Property TrailingCanonicalCombiningClass_ Int
Property LeadCanonicalCombiningClass_ Int
Property GeneralCategory_ GeneralCategory
Property EastAsianWidth_ EastAsianWidth
Property CanonicalCombiningClass_ Int
Property Block_ BlockCode
Property BidiClass_ Direction
Property Bool_ Bool
Property WordBreak_ (Maybe WordBreak)
Property SentenceBreak_ (Maybe SentenceBreak)
Property GraphemeClusterBreak_ (Maybe GraphemeClusterBreak)
Property NFKDQuickCheck_ (Maybe Bool)
Property NFKCQuickCheck_ (Maybe Bool)
Property NFDQuickCheck_ (Maybe Bool)
Property NFCQuickCheck_ (Maybe Bool)
Property HangulSyllableType_ (Maybe HangulSyllableType)
Property NumericType_ (Maybe NumericType)
Property LineBreak_ (Maybe LineBreak)
Property JoiningType_ (Maybe JoiningType)
Property JoiningGroup_ (Maybe JoiningGroup)
Property Decomposition_ (Maybe Decomposition)

Property identifier types

data BidiClass_

Constructors

BidiClass

Instances

Show BidiClass_
Typeable BidiClass_
Property BidiClass_ Direction

data Block_

Constructors

Block

Instances

Property Block_ BlockCode

data Bool_

Constructors

Alphabetic
ASCIIHexDigit	0-9, A-F, a-f
BidiControl	Format controls which have specific functions in the Bidi Algorithm.
BidiMirrored	Characters that may change display in RTL text.
Dash	Variations of dashes.
DefaultIgnorable	Ignorable in most processing.
Deprecated	The usage of deprecated characters is strongly discouraged.
Diacritic	Characters that linguistically modify the meaning of another character to which they apply.
Extender	Extend the value or shape of a preceding alphabetic character, e.g. length and iteration marks.
FullCompositionExclusion
GraphemeBase	For programmatic determination of grapheme cluster boundaries.
GraphemeExtend	For programmatic determination of grapheme cluster boundaries.
GraphemeLink	For programmatic determination of grapheme cluster boundaries.
HexDigit	Characters commonly used for hexadecimal numbers.
Hyphen	Dashes used to mark connections between pieces of words, plus the Katakana middle dot.
IDContinue	Characters that can continue an identifier.
IDStart	Characters that can start an identifier.
Ideographic	CJKV ideographs.
IDSBinaryOperator	For programmatic determination of Ideographic Description Sequences.
IDSTrinaryOperator
JoinControl	Format controls for cursive joining and ligation.
LogicalOrderException	Characters that do not use logical order and require special handling in most processing.
Lowercase
Math
NonCharacter	Code points that are explicitly defined as illegal for the encoding of characters.
QuotationMark
Radical	For programmatic determination of Ideographic Description Sequences.
SoftDotted	Characters with a soft dot, like i or j. An accent placed on these characters causes the dot to disappear.
TerminalPunctuation	Punctuation characters that generally mark the end of textual units.
UnifiedIdeograph	For programmatic determination of Ideographic Description Sequences.
Uppercase
WhiteSpace
XidContinue	`IDContinue` modified to allow closure under normalization forms NFKC and NFKD.
XidStart	`IDStart` modified to allow closure under normalization forms NFKC and NFKD.
CaseSensitive	Either the source or the target of a case mapping. Not the same as the general category `Cased_Letter`.
STerm	Sentence Terminal. Used in UAX #29: Text Boundaries http://www.unicode.org/reports/tr29/.
VariationSelector	Indicates all those characters that qualify as Variation Selectors. For details on the behavior of these characters, see http://unicode.org/Public/UNIDATA/StandardizedVariants.html and section 15.6, Variation Selectors, of the Unicode Standard.
NFDInert	ICU-specific property for characters that are inert under NFD, i.e. they do not interact with adjacent characters. Used for example in normalizing transforms in incremental mode to find the boundary of safely normalizable text despite possible text additions.
NFKDInert	ICU-specific property for characters that are inert under NFKD, i.e. they do not interact with adjacent characters.
NFCInert	ICU-specific property for characters that are inert under NFC, i.e. they do not interact with adjacent characters.
NFKCInert	ICU-specific property for characters that are inert under NFKC, i.e. they do not interact with adjacent characters.
SegmentStarter	ICU-specific property for characters that are starters in terms of Unicode normalization and combining character sequences.
PatternSyntax	See UAX #31 Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/.
PatternWhiteSpace	See UAX #31 Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/.
POSIXAlNum	Alphanumeric character class.
POSIXBlank	Blank character class.
POSIXGraph	Graph character class.
POSIXPrint	Printable character class.
POSIXXDigit	Hex digit character class.

Instances

Enum Bool_
Eq Bool_
Show Bool_
Typeable Bool_
Property Bool_ Bool
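The boolean properties compose naturally with ordinary list functions. The following sketch checks whether a string looks like an identifier, using XidStart for the first character and XidContinue for the rest. The function name isIdentifier is hypothetical, and real identifier rules (see UAX #31) may need extra cases that this sketch ignores.

```haskell
import Data.Text.ICU.Char (Bool_(..), property)

-- True if the string is non-empty, starts with an XidStart character,
-- and every following character is XidContinue.
isIdentifier :: String -> Bool
isIdentifier []     = False
isIdentifier (c:cs) = property XidStart c && all (property XidContinue) cs
```

For example, isIdentifier "foo1" should hold, while isIdentifier "1foo" should not, because a digit can continue an identifier but cannot start one.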

data Decomposition_

Constructors

Decomposition

Instances

Show Decomposition_
Typeable Decomposition_
Property Decomposition_ (Maybe Decomposition)

data EastAsianWidth_

Constructors

EastAsianWidth

Instances

Show EastAsianWidth_
Typeable EastAsianWidth_
Property EastAsianWidth_ EastAsianWidth

data GeneralCategory_

Constructors

GeneralCategory

Instances

Show GeneralCategory_
Typeable GeneralCategory_
Property GeneralCategory_ GeneralCategory

data HangulSyllableType_

Constructors

HangulSyllableType

Instances

Show HangulSyllableType_
Typeable HangulSyllableType_
Property HangulSyllableType_ (Maybe HangulSyllableType)

data JoiningGroup_

Constructors

JoiningGroup

Instances

Show JoiningGroup_
Typeable JoiningGroup_
Property JoiningGroup_ (Maybe JoiningGroup)

data JoiningType_

Constructors

JoiningType

Instances

Show JoiningType_
Typeable JoiningType_
Property JoiningType_ (Maybe JoiningType)

data NumericType_

Constructors

NumericType

Instances

Show NumericType_
Typeable NumericType_
Property NumericType_ (Maybe NumericType)

Combining class

data CanonicalCombiningClass_

Constructors

CanonicalCombiningClass

Instances

Show CanonicalCombiningClass_
Typeable CanonicalCombiningClass_
Property CanonicalCombiningClass_ Int

data LeadCanonicalCombiningClass_

Constructors

LeadCanonicalCombiningClass

Instances

Show LeadCanonicalCombiningClass_
Typeable LeadCanonicalCombiningClass_
Property LeadCanonicalCombiningClass_ Int

data TrailingCanonicalCombiningClass_

Constructors

TrailingCanonicalCombiningClass

Instances

Show TrailingCanonicalCombiningClass_
Typeable TrailingCanonicalCombiningClass_
Property TrailingCanonicalCombiningClass_ Int

Normalization checking

data NFCQuickCheck_

Constructors

NFCQuickCheck

Instances

Show NFCQuickCheck_
Typeable NFCQuickCheck_
Property NFCQuickCheck_ (Maybe Bool)

data NFDQuickCheck_

Constructors

NFDQuickCheck

Instances

Show NFDQuickCheck_
Typeable NFDQuickCheck_
Property NFDQuickCheck_ (Maybe Bool)

data NFKCQuickCheck_

Constructors

NFKCQuickCheck

Instances

Show NFKCQuickCheck_
Typeable NFKCQuickCheck_
Property NFKCQuickCheck_ (Maybe Bool)

data NFKDQuickCheck_

Constructors

NFKDQuickCheck

Instances

Show NFKDQuickCheck_
Typeable NFKDQuickCheck_
Property NFKDQuickCheck_ (Maybe Bool)

Text boundaries

data GraphemeClusterBreak_

Constructors

GraphemeClusterBreak

Instances

Show GraphemeClusterBreak_
Typeable GraphemeClusterBreak_
Property GraphemeClusterBreak_ (Maybe GraphemeClusterBreak)

data LineBreak_

Constructors

LineBreak

Instances

Show LineBreak_
Typeable LineBreak_
Property LineBreak_ (Maybe LineBreak)

data SentenceBreak_

Constructors

SentenceBreak

Instances

Show SentenceBreak_
Typeable SentenceBreak_
Property SentenceBreak_ (Maybe SentenceBreak)

data WordBreak_

Constructors

WordBreak

Instances

Show WordBreak_
Typeable WordBreak_
Property WordBreak_ (Maybe WordBreak)
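To illustrate how the identifier types above select different result types, here is a small sketch that collects several properties of one character into a record. The Summary type and the summarize function are hypothetical names introduced only for this example; the result types follow the Property instances listed earlier.

```haskell
import Data.Text.ICU.Char
  ( BidiClass_(..), Block_(..), BlockCode(..), CanonicalCombiningClass_(..)
  , Direction(..), GeneralCategory(..), GeneralCategory_(..), property )

-- A few properties of a single character, each with its own value type.
data Summary = Summary
  { category  :: GeneralCategory
  , block     :: BlockCode
  , bidiClass :: Direction
  , combining :: Int
  } deriving (Show)

-- Each call to property uses a different identifier type, so each
-- field gets the value type that the corresponding instance dictates.
summarize :: Char -> Summary
summarize c = Summary
  { category  = property GeneralCategory c
  , block     = property Block c
  , bidiClass = property BidiClass c
  , combining = property CanonicalCombiningClass c
  }
```

Note that GeneralCategory names both a value type and the constructor of GeneralCategory_; Haskell keeps types and constructors in separate namespaces, so the two do not clash.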

Property value types

data BlockCode

Descriptions of Unicode blocks.

Constructors

NoBlock
BasicLatin
Latin1Supplement
LatinExtendedA
LatinExtendedB
IPAExtensions
SpacingModifierLetters
CombiningDiacriticalMarks
GreekAndCoptic
Cyrillic
Armenian
Hebrew
Arabic
Syriac
Thaana
Devanagari
Bengali
Gurmukhi
Gujarati
Oriya
Tamil
Telugu
Kannada
Malayalam
Sinhala
Thai
Lao
Tibetan
Myanmar
Georgian
HangulJamo
Ethiopic
Cherokee
UnifiedCanadianAboriginalSyllabics
Ogham
Runic
Khmer
Mongolian
LatinExtendedAdditional
GreekExtended
GeneralPunctuation
SuperscriptsAndSubscripts
CurrencySymbols
CombiningDiacriticalMarksForSymbols
LetterlikeSymbols
NumberForms
Arrows
MathematicalOperators
MiscellaneousTechnical
ControlPictures
OpticalCharacterRecognition
EnclosedAlphanumerics
BoxDrawing
BlockElements
GeometricShapes
MiscellaneousSymbols
Dingbats
BraillePatterns
CJKRadicalsSupplement
KangxiRadicals
IdeographicDescriptionCharacters
CJKSymbolsAndPunctuation
Hiragana
Katakana
Bopomofo
HangulCompatibilityJamo
Kanbun
BopomofoExtended
EnclosedCJKLettersAndMonths
CJKCompatibility
CJKUnifiedIdeographsExtensionA
CJKUnifiedIdeographs
YiSyllables
YiRadicals
HangulSyllables
HighSurrogates
HighPrivateUseSurrogates
LowSurrogates
PrivateUseArea
CJKCompatibilityIdeographs
AlphabeticPresentationForms
ArabicPresentationFormsA
CombiningHalfMarks
CJKCompatibilityForms
SmallFormVariants
ArabicPresentationFormsB
Specials
HalfwidthAndFullwidthForms
OldItalic
Gothic
Deseret
ByzantineMusicalSymbols
MusicalSymbols
MathematicalAlphanumericSymbols
CJKUnifiedIdeographsExtensionB
CJKCompatibilityIdeographsSupplement
Tags
CyrillicSupplement
Tagalog
Hanunoo
Buhid
Tagbanwa
MiscellaneousMathematicalSymbolsA
SupplementalArrowsA
SupplementalArrowsB
MiscellaneousMathematicalSymbolsB
SupplementalMathematicalOperators
KatakanaPhoneticExtensions
VariationSelectors
SupplementaryPrivateUseAreaA
SupplementaryPrivateUseAreaB
Limbu
TaiLe
KhmerSymbols
PhoneticExtensions
MiscellaneousSymbolsAndArrows
YijingHexagramSymbols
LinearBSyllabary
LinearBIdeograms
AegeanNumbers
Ugaritic
Shavian
Osmanya
CypriotSyllabary
TaiXuanJingSymbols
VariationSelectorsSupplement
AncientGreekMusicalNotation
AncientGreekNumbers
ArabicSupplement
Buginese
CJKStrokes
CombiningDiacriticalMarksSupplement
Coptic
EthiopicExtended
EthiopicSupplement
GeorgianSupplement
Glagolitic
Kharoshthi
ModifierToneLetters
NewTaiLue
OldPersian
PhoneticExtensionsSupplement
SupplementalPunctuation
SylotiNagri
Tifinagh
VerticalForms
N'Ko
Balinese
LatinExtendedC
LatinExtendedD
PhagsPa
Phoenician
Cuneiform
CuneiformNumbersAndPunctuation
CountingRodNumerals
Sundanese
Lepcha
OlChiki
CyrillicExtendedA
Vai
CyrillicExtendedB
Saurashtra
KayahLi
Rejang
Cham
AncientSymbols
PhaistosDisc
Lycian
Carian
Lydian
MahjongTiles
DominoTiles

Instances

Enum BlockCode
Eq BlockCode
Show BlockCode
Typeable BlockCode
Property Block_ BlockCode

data Direction

The language directional property of a character set.

Constructors

LeftToRight
RightToLeft
EuropeanNumber
EuropeanNumberSeparator
EuropeanNumberTerminator
ArabicNumber
CommonNumberSeparator
BlockSeparator
SegmentSeparator
WhiteSpaceNeutral
OtherNeutral
LeftToRightEmbedding
LeftToRightOverride
RightToLeftArabic
RightToLeftEmbedding
RightToLeftOverride
PopDirectionalFormat
DirNonSpacingMark
BoundaryNeutral

Instances

Enum Direction
Eq Direction
Show Direction
Typeable Direction
Property BidiClass_ Direction

data Decomposition

Constructors

Canonical
Compat
Circle
Final
Font
Fraction
Initial
Isolated
Medial
Narrow
NoBreak
Small
Square
Sub
Super
Vertical
Wide
Count

Instances

Enum Decomposition
Eq Decomposition
Show Decomposition
Typeable Decomposition
Property Decomposition_ (Maybe Decomposition)

data EastAsianWidth

Constructors

EANeutral
EAAmbiguous
EAHalf
EAFull
EANarrow
EAWide
EACount

Instances

Enum EastAsianWidth
Eq EastAsianWidth
Show EastAsianWidth
Typeable EastAsianWidth
Property EastAsianWidth_ EastAsianWidth

data HangulSyllableType

Constructors

LeadingJamo
VowelJamo
TrailingJamo
LVSyllable
LVTSyllable

Instances

Enum HangulSyllableType
Eq HangulSyllableType
Show HangulSyllableType
Typeable HangulSyllableType
Property HangulSyllableType_ (Maybe HangulSyllableType)

data JoiningType

Constructors

JoinCausing
DualJoining
LeftJoining
RightJoining
Transparent

Instances

Enum JoiningType
Eq JoiningType
Show JoiningType
Typeable JoiningType
Property JoiningType_ (Maybe JoiningType)

data NumericType

Constructors

NTDecimal
NTDigit
NTNumeric

Instances

Enum NumericType
Eq NumericType
Show NumericType
Typeable NumericType
Property NumericType_ (Maybe NumericType)

Text boundaries

data SentenceBreak

Constructors

SBATerm
SBClose
SBFormat
SBLower
SBNumeric
SBOLetter
SBSep
SBSP
SBSTerm
SBUpper
SBCR
SBExtend
SBLF
SBSContinue

Instances

Enum SentenceBreak
Eq SentenceBreak
Show SentenceBreak
Typeable SentenceBreak
Property SentenceBreak_ (Maybe SentenceBreak)

data WordBreak

Constructors

WBALetter
WBFormat
WBKatakana
WBMidLetter
WBMidNum
WBNumeric
WBExtendNumLet
WBCR
WBExtend
WBLF
WBMidNumLet
WBNewline

Instances

Enum WordBreak
Eq WordBreak
Show WordBreak
Typeable WordBreak
Property WordBreak_ (Maybe WordBreak)

Functions

blockCode :: Char -> BlockCode

Return the Unicode allocation block that contains the given character.
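One possible use of blockCode is a coarse script check. The sketch below tests whether every character of a string falls in a given block; allInBlock is a hypothetical helper name. Blocks are allocation ranges rather than scripts, so this is only a rough filter.

```haskell
import Data.Text.ICU.Char (BlockCode(..), blockCode)

-- True if every character of the string lies in the given block.
allInBlock :: BlockCode -> String -> Bool
allInBlock b = all ((== b) . blockCode)
```

For instance, allInBlock BasicLatin "hello" should hold, while a string containing U+00E9 should fail the same check because that character sits in Latin1Supplement.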

charFullName :: Char -> String

Return the full name of a Unicode character.

Compared to charName, this function gives each Unicode code point a unique extended name. Extended names are lowercase followed by an uppercase hexadecimal number, within angle brackets.

charName :: Char -> String

Return the name of a Unicode character.

The names of all unassigned characters are empty.

The name contains only invariant characters like A-Z, 0-9, space, and '-'.

charFromFullName :: String -> Maybe Char

Find a Unicode character by its full or extended name, and return its code point value.

The name is matched exactly and completely.

A Unicode 1.0 name is matched only if it differs from the modern name.

Compared to charFromName, this function gives each Unicode code point a unique extended name. Extended names are lowercase followed by an uppercase hexadecimal number, within angle brackets.

charFromName :: String -> Maybe Char

Find a Unicode character by its full name, and return its code point value.

The name is matched exactly and completely.

A Unicode 1.0 name is matched only if it differs from the modern name. Unicode names are all uppercase.
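The name functions are inverses of each other for assigned characters, which the following sketch exercises. The helper roundTrip is a hypothetical name: it looks a character up by name and then asks for that character's name again.

```haskell
import Data.Text.ICU.Char (charFromName, charName)

-- Look a character up by its name, then return the name ICU reports
-- for the character that was found. Nothing means the lookup failed.
roundTrip :: String -> Maybe String
roundTrip name = fmap charName (charFromName name)
```

With a valid name such as "LATIN SMALL LETTER A" the result should be the same name wrapped in Just; a string that is not a character name should give Nothing.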

combiningClass :: Char -> Int
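A short sketch of combiningClass in use, assuming it returns the canonical combining class of the character (0 for starters, non-zero for most combining marks), as the name and the Int result suggest. The helper names isStarter and starters are hypothetical.

```haskell
import Data.Text.ICU.Char (combiningClass)

-- A starter is a character whose combining class is 0.
isStarter :: Char -> Bool
isStarter c = combiningClass c == 0

-- Keep only the starters of a string, dropping characters with a
-- non-zero combining class such as U+0301 COMBINING ACUTE ACCENT.
starters :: String -> String
starters = filter isStarter
```

This is not a general accent stripper: some marks have combining class 0 and are kept, and precomposed characters are not decomposed first.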

direction :: Char -> Direction

Return the bidirectional category value for the code point, which is used in the Unicode bidirectional algorithm (UAX #9 http://www.unicode.org/reports/tr9/).
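As an illustration, direction can be used to detect whether a string contains any strongly right-to-left character, which is a common first step before running the full bidirectional algorithm. The helper hasRtl is a hypothetical name, and the choice of which classes count as strongly right-to-left is this sketch's assumption.

```haskell
import Data.Text.ICU.Char (Direction(..), direction)

-- True if the string contains at least one strongly right-to-left
-- character (bidi class R or AL).
hasRtl :: String -> Bool
hasRtl = any (isRtl . direction)
  where
    isRtl RightToLeft       = True
    isRtl RightToLeftArabic = True
    isRtl _                 = False
```

Plain ASCII text should give False, while a string containing a Hebrew or Arabic letter should give True.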

property :: Property p v => p -> Char -> v

isoComment :: Char -> String

Return the ISO 10646 comment for a character.

If a character does not have an associated comment, the empty string is returned.

The ISO 10646 comment is an informative field in the Unicode Character Database (UnicodeData.txt field 11) and is from the ISO 10646 names list.

isMirrored :: Char -> Bool

Determine whether the code point has the BidiMirrored property. This property is set for characters that are commonly used in Right-To-Left contexts and need to be displayed with a mirrored glyph.

mirror :: Char -> Char
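A minimal sketch combining isMirrored and mirror, assuming mirror maps a character to its mirror-image counterpart (for example, an opening parenthesis to a closing one). The helper mirrorAll is a hypothetical name.

```haskell
import Data.Text.ICU.Char (isMirrored, mirror)

-- Replace every character that has the BidiMirrored property with
-- its mirror-image character; leave all other characters unchanged.
mirrorAll :: String -> String
mirrorAll = map swap
  where
    swap c
      | isMirrored c = mirror c
      | otherwise    = c
```

Applied to "(a<b]", this should produce ")a>b[": brackets and comparison signs are swapped, letters are untouched.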

Conversion to numbers

digitToInt :: Char -> Maybe Int

Return the decimal digit value of a decimal digit character. Such characters have the general category Nd (decimal digit numbers) and a NumericType of NTDecimal.

No digit values are returned for any Han characters, because Han number characters are often used with a special Chinese-style number format (with characters for powers of 10 in between) instead of in decimal-positional notation. Unicode 4 explicitly assigns Han number characters a NumericType of NTNumeric instead of NTDecimal.

numericValue :: Char -> Maybe Double

Return the numeric value for a Unicode code point as defined in the Unicode Character Database.

A Double return type is necessary because some numeric values are fractions, negative, or too large to fit in a fixed-width integral type.
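The difference between the two conversion functions can be sketched as follows. parseDecimal and valueOrZero are hypothetical helper names: the first accepts decimal digits from any script, the second falls back to 0 for characters without a numeric value.

```haskell
import Data.Maybe (fromMaybe)
import Data.Text.ICU.Char (digitToInt, numericValue)

-- Parse a non-empty string of decimal digits, from any script, as a
-- positional base-10 number. Any non-digit character gives Nothing.
parseDecimal :: String -> Maybe Integer
parseDecimal [] = Nothing
parseDecimal s  = fmap (foldl step 0) (mapM digitToInt s)
  where
    step acc d = acc * 10 + toInteger d

-- Numeric value of a character, or 0 when it has none.
valueOrZero :: Char -> Double
valueOrZero = fromMaybe 0 . numericValue
```

Arabic-Indic digits such as U+0661 to U+0663 should parse just like ASCII digits. The Han character U+4E09 (three) is rejected by parseDecimal, as described above, but still has a numeric value of 3 through numericValue.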

data GeneralCategory

Constructors

GeneralOtherType
UppercaseLetter
LowercaseLetter
TitlecaseLetter
ModifierLetter
OtherLetter
NonSpacingMark
EnclosingMark
CombiningSpacingMark
DecimalDigitNumber
LetterNumber
OtherNumber
SpaceSeparator
LineSeparator
ParagraphSeparator
ControlChar
FormatChar
PrivateUseChar
Surrogate
DashPunctuation
StartPunctuation
EndPunctuation
ConnectorPunctuation
OtherPunctuation
MathSymbol
CurrencySymbol
ModifierSymbol
OtherSymbol
InitialPunctuation
FinalPunctuation

Instances

Enum GeneralCategory
Eq GeneralCategory
Show GeneralCategory
Typeable GeneralCategory
Property GeneralCategory_ GeneralCategory

data JoiningGroup

Constructors

Ain
Alaph
Alef
Beh
Beth
Dal
DalathRish
E
Feh
FinalSemkath
Gaf
Gamal
Hah
HamzaOnHehGoal
He
Heh
HehGoal
Heth
Kaf
Kaph
KnottedHeh
Lam
Lamadh
Meem
Mim
Noon
Nun
Pe
Qaf
Qaph
Reh
ReversedPe
Sad
Sadhe
Seen
Semkath
Shin
SwashKaf
SyriacWaw
Tah
Taw
TehMarbuta
Teth
Waw
Yeh
YehBarree
YehWithTail
Yudh
YudhHe
Zain
Fe
Khaph
Zhain
BurushaskiYehBarree

Instances

Enum JoiningGroup
Eq JoiningGroup
Show JoiningGroup
Typeable JoiningGroup
Property JoiningGroup_ (Maybe JoiningGroup)

data GraphemeClusterBreak

Instances

Enum GraphemeClusterBreak
Eq GraphemeClusterBreak
Show GraphemeClusterBreak
Typeable GraphemeClusterBreak
Property GraphemeClusterBreak_ (Maybe GraphemeClusterBreak)

data LineBreak

Constructors

Ambiguous
LBAlphabetic
BreakBoth
BreakAfter
BreakBefore
MandatoryBreak
ContingentBreak
ClosePunctuation
CombiningMark
CarriageReturn
Exclamation
Glue
LBHyphen
LBIdeographic
Inseparable
InfixNumeric
LineFeed
Nonstarter
Numeric
OpenPunctuation
PostfixNumeric
PrefixNumeric
Quotation
ComplexContext
LBSurrogate
Space
BreakSymbols
Zwspace
NextLine
WordJoiner
H2
H3
JL
JT
JV

Instances

Enum LineBreak
Eq LineBreak
Show LineBreak
Typeable LineBreak
Property LineBreak_ (Maybe LineBreak)