Synopsis 15: Unicode [DRAFT]

VERSION

Created 2 December 2013
Last Modified 10 October 2014
Version 8

This document describes how Unicode and Perl 6 work together. Needless to say, your chosen reader should support a wide range of Unicode characters.

String Base Units

A Unicode string can be viewed in any of four ways: in terms of its graphemes, its codepoints, its encoding's code units, or the bytes that make up that encoding.

For example, consider a string containing the Devanagari syllable नि, which is composed of the codepoints

U+0928 DEVANAGARI LETTER NA
U+093F DEVANAGARI VOWEL SIGN I

There are a variety of ways to perceive the length of this string. For reference, here is how the syllable is encoded in each of UTF-8, UTF-16BE, and UTF-32BE.

UTF-8     E0 A4 A8 E0 A4 BF
UTF-16BE  0928 093F
UTF-32BE  00000928 0000093F

And depending on whether you want to count by graphemes, codepoints, code units, or bytes, the perceived length of the string differs:

|------------+-------+--------+--------|
| count by   | UTF-8 | UTF-16 | UTF-32 |
|============+=======+========+========|
| bytes      |     6 |      4 |      8 |
| code units |     6 |      2 |      2 |
| codepoints |     2 |      2 |      2 |
| graphemes  |     1 |      1 |      1 |
|------------+-------+--------+--------|

Perl 6 offers various mechanisms to count by each of these "base units" of a string.

Perl 6 by default operates on graphemes, so counting by graphemes involves:

"string".chars or "string".graphs

To count by codepoints, conversion of a string to one of NFC, NFD, NFKC, or NFKD is needed:

"string".NFC.chars
"string".NFKD.codes

To count by code units, you can convert to the appropriate buffer type:

"string".encode("UTF-32LE").elems
"string".encode("utf-8").elems

And finally, counting by bytes simply involves converting that buffer to a buf8 object:

"string".encode("UTF-16BE").buf8.elems

Note that utf8 already stores by bytes, so the count for bytes and code units is always the same.
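Putting these together for the नि example above, one might count each base unit like this (a sketch: .graphs and the buf8 reinterpretation are spec-level and may not exist under those names in a given implementation, so only the forms below are shown):

```raku
# The syllable नि from the example above, built by codepoint name.
my $str = "\c[DEVANAGARI LETTER NA]\c[DEVANAGARI VOWEL SIGN I]";

say $str.chars;                  # 1 grapheme
say $str.NFC.codes;              # 2 codepoints
say $str.encode("utf-8").elems;  # 6 UTF-8 code units (same as bytes for utf8)
```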

Normalization Forms

NFG

Perl 6, by default, stores all strings given to it in NFG form, Normalization Form Grapheme. It's a Perl 6-invented character representation, designed to deal with un-precomposed graphemes properly.

Formally, Perl 6 graphemes are defined exactly according to Unicode Grapheme Cluster Boundaries at level "extended" (in contrast to "tailored" or "legacy"); see Unicode Standard Annex #29, "Unicode Text Segmentation", section 3, Grapheme Cluster Boundaries. This is the same as the Perl 5 character class

\X   Match Unicode "eXtended grapheme cluster"

With NFG, strings start by being run through the normal NFC process, composing character sequences into precomposed characters wherever possible.

Any remaining graphemes without a precomposed form, such as ậ or नि, are assigned their own negative numbers, at least 32 bits in length, to refer to them. Negative numbers are used to avoid clashing with any potential future additions to Unicode.

The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.

The Perl 6 Str type, and more generally the Stringy role, deals exclusively in NFG form.

NFC and NFD

The NFC and NFD normalization forms are a defined part of the Unicode standard. NFD takes precomposed characters and separates them into their constituent parts, with a specific ordering of those pieces. NFC replaces character sequences with single precomposed characters wherever possible, after first running the string through the NFD process.

These two Normalization Forms are similar to NFG, except that graphemes without precomposed versions exist as multiple codepoints.

NFC is the form Perl 6 uses whenever NFG is not viable, such as printing the string to stdout or passing it to a { use v5; } section.

NFKC and NFKD

These forms are known as compatibility forms (denoted by a K to avoid confusion with C for Composition). They are similar to their canonical counterparts, but may transform various characters (such as ﬁ or ſ) into more compatible equivalents for the benefit of software that handles the originals poorly.

All four of NFC, NFD, NFKC, and NFKD can be considered valid "codepoint views", though each differ in their exact formulation of the contents of a string:

say "ẛ̣".chars;               # OUTPUT: 1 (NFG, ẛ̣)
say "ẛ̣".NFC.chars;           # OUTPUT: 2 (NFC, ẛ + ̣)
say "ẛ̣".NFD.chars;           # OUTPUT: 3 (NFD, ſ + ̣ + ̇)
say "ẛ̣".NFKC.chars;          # OUTPUT: 1 (NFKC, ṩ)
say "ẛ̣".NFKD.chars;          # OUTPUT: 3 (NFKD, s + ̣ + ̇)

Those who wish to operate on strings at the codepoint level may prefer NFC, as it differs least from NFG and is Perl 6's default form in NFG-less contexts.

All of Uni, NFC, NFD, NFKC, and NFKD, and more generally the Unicodey role, deal with the various codepoint-based compositions.

The Str Type

Presented here are the methods of Str that relate to Unicode. Str deals exclusively in the NFG form of Unicode strings.

String to Numeral Conversion

Str.ord
Str.ords
ord(Str $string)
ords(Str $string)

These give you the numeric values of the characters in a string. ord works only on the first character, while ords works on every character.

Some of the values returned may be negative numbers, which are meaningless outside that specific string. You must convert to one of the codepoint-based types for a list of numbers that conforms to the Unicode Standard.

Length Methods

Str.chars
Str.graphs

These methods are equivalent, and count the number of graphemes in your string.

[Should there be methods that implicitly convert to the other string types, or would .NFKD.chars be necessary?]

Buf conversion

Str.encode($enc = "utf-8")

Encodes the contents of the string by the specified encoding (by default UTF-8) and generates the appropriate blob.

Note that if you encode to one of the UTFs, you'll get a UTF-aware version of the blob. (Non-Unicode encodings produce the most appropriate blob type.)

UTF-16 and UTF-32 default to big endian if you don't specify endianness.

Str.encode             --> utf8
Str.encode("UTF-16")   --> utf16 (big endian)
Str.encode("UTF-32LE") --> utf32 (little endian)
Str.encode("ASCII")    --> blob8

The NF* Types

Perl 6 has four types corresponding to a specific Unicode Standard Normalization Form: NFC, NFD, NFKC, and NFKD.

Each of these types performs normalization on the strings stored in it.

The NF* types do the Unicodey role.

The Uni Type

The Uni type is like the various NF* types, but allows a mixed collection of normalization forms to make up the string.

The Uni type does the Unicodey role.

The Unicodey Role

The Unicodey role deals in various Unicode-aware functions.

Length Methods

Unicodey.chars
Unicodey.codes

These are synonymous; both count the number of codepoints in a Unicodey type.

[Maybe Unicodey does Stringy ?]

The Stringy Role

The Stringy role deals with a more general, not necessarily Unicode-based, view of strings. Str uses it because it doesn't always play by the Unicode Standard's rules (most notably in its use of NFG).

Buf methods

Decoding buffers

Buf.decode($dec = "utf-8");

Transforms the buffer into a Str. Defaults to assuming a "utf-8" encoding. Encoding-aware buffers have a different default decoding, for instance:

utf8.decode($dec = "utf-8");
utf16.decode($dec = "utf-16be");
utf32.decode($dec = "utf-32be");

[It would be best if utf16 and utf32 changed their default between BE and LE at creation, either because of what Str.encode said or, if the utf16/32 was created manually, by analyzing the BOM, if any. Just know that Unicode itself defaults to BE if nothing else.]
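As a round-trip sketch (exact encoding-name spellings such as "utf-8" vs "utf8" may vary by implementation):

```raku
# Encode a Str to a UTF-8 blob, then decode it back.
my $buf = "नि".encode("utf-8");   # utf8 blob of 6 bytes
say $buf.elems;                   # 6
say $buf.decode("utf-8") eq "नि"; # True: decoding restores the original Str
```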

String Type Conversions

If you desire to have a string in one of the other Normalization Forms, there are various conversion methods to do this.

Cool.Str
Cool.NFG  [ XXX this is purely a synonym to .Str. Necessary? ]
Cool.NFC
Cool.NFD
Cool.NFKC
Cool.NFKD
Cool.Uni

Notably, conversion to the Uni type assumes NFC when converting from NFG strings or from non-string types. Otherwise the string is carried over without any change in normalization.

Unicode Information

There's plenty of information each Unicode codepoint possesses, and Perl 6 provides various ways of accessing that information.

Unless a plural form of a function is provided, each function operates only on the first codepoint of the string. Various array-based operations would be needed to gain information on every character in the string.

[Note: The properties of graphemes are not defined by Unicode, but inheriting the properties of the first NFC codepoint in the grapheme should make sense in most cases, though not for e.g. charnames. Or are they not supported for NFG?]

[Note: If adding additional methods to access Unicode information, priority should be placed on info that can't be accessed as a Unicode property.]

Property Lookup

uniprop(Int $codepoint, Stringy $property)
Int.uniprop(Str $property)

uniprop(Unicodey $char, Stringy $property)
Unicodey.uniprop(Stringy $property)

uniprops(Unicodey $str, Stringy $property)
Unicodey.uniprops(Stringy $property)

This function returns the value of $property for the given $codepoint or $char, or an array of the property's values for each character in $str.

All official spellings of a property name are supported.

uniprops("a", "ASCII_Hex_Digit") # is this character an ASCII hex digit?
uniprops("a", "AHex")            # ditto

Values returned for properties are the narrowest possible type for numeric and boolean values (at widest, Rat), and Str objects for all other kinds of values. (To treat boolean values as boolean, see unibool.)
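A sketch of the kinds of values that come back, assuming an implementation that follows these rules:

```raku
say uniprop("a", "General_Category");  # Ll    (a Str value)
say uniprop(" ", "White_Space");       # True  (a boolean property)
say uniprop("½", "Numeric_Value");     # a numeric value (0.5)
```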

Note there is no version of uniprops for integers, while there is one for strings. To achieve the same thing, use normal array operations:

my @isws = (32,42,43)».uniprop("White_Space");

Note that the integer-based lookup is the fundamental version; the string-based versions are convenience functions. These two are nearly equivalent:

uniprop("0".ord, "Numeric_Value");  # integer lookup
uniprop("0", "Numeric_Value");      # stringy lookup

However, the string-based version will convert NFG strings to NFC before sending either the first or all characters through the lookup. This is because Unicode property lookup is considered an NFG-less environment (see NFC and NFD).

Integer-based lookup should die on negative integers, or integers greater than 0x10_FFFF.

[Conjecture: would versions of uniprop with a slurpy instead of a single string property be useful? Or is uniprop(0x20, $_) for @props good enough?]

Binary Property Lookup

unibool(Int $codepoint, Stringy $property)
Int.unibool(Str $property)

unibool(Unicodey $char, Stringy $property)
Unicodey.unibool(Stringy $property)

unibools(Unicodey $str, Stringy $property)
Unicodey.unibools(Stringy $property)

Looks up a boolean Unicode property (such as Case_Ignorable) and returns a boolean. Throws an error on non-boolean properties.

unibool(0x41, "Case_Ignorable");   # OK
unibool(0x41, "General_Category"); # dies

As with uniprop, the string version converts NFG strings to NFC, but otherwise is equivalent to feeding the result of .ord through the base integer version.

Binary Category Check

unimatch(Int $codepoint, Stringy $category)
Int.unimatch(Str $category)

unimatch(Unicodey $char, Stringy $category)
Unicodey.unimatch(Stringy $category)

unimatches(Unicodey $str, Stringy $category)
Unicodey.unimatches(Stringy $category)

Checks to see if the character(s) given are in the given $category. The string-based versions are conveniences that convert any NFG input to NFC, and then pass it along to the integer version.

unimatch("A", "Lu"); # True
unimatch("A", "L");  # True
unimatch("A", "Sc"); # False

An error may be issued if the given category name is not valid.

Numeric Codepoint

ord(Stringy $char) --> Int
ords(Stringy $string) --> Array[Int]

Stringy.ord() --> Int

Stringy.ords() --> Array[Int]

The &ord function (and the corresponding Stringy.ord method) returns the codepoint number of the first codepoint of the string. The &ords function and method return an Array of codepoint numbers, one for every codepoint in the string.

This works on any type that does the Stringy role. Note that using this on type Str may return invalid negative numbers as "codepoints".
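A quick sketch with plain ASCII, where graphemes and codepoints coincide and no negative synthetic values can appear:

```raku
say "A".ord;      # 65
say "Perl".ords;  # the codepoints 80, 101, 114, 108
```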

Character Representation

chr(Int $codepoint) --> Uni
chrs(Array[Int] @codepoints) --> Uni

Cool.chr() --> Uni
Cool.chrs() --> Uni

Converts one or more numbers into a series of characters, treating those numbers as Unicode codepoints. The chrs version generates a multi-character string from the given array.

Note that this operates on encoding-independent codepoints (use Buf types for encoded codepoints).

An error will occur if the Uni generated by these functions contains an invalid character or sequence of characters. This includes, but is not limited to, codepoint values greater than 0x10FFFF and surrogate codepoints.

To obtain a more definitive string type, the normal ways of type conversion may be used.
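A sketch of both forms (note that this spec has them return Uni; an implementation may hand back a Str directly):

```raku
say chr(0x928);          # न
say chrs(0x928, 0x93F);  # नि  (two codepoints forming one grapheme)
```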

Character Name

uniname(Str $char, :$one = False, :$either = False) --> Str
uninames(Str $string, :$one = False, :$either = False) --> Array[Str]

Str.uniname(:$one = False, :$either = False) --> Str
Str.uninames(:$one = False, :$either = False) --> Array[Str]

The &uniname function returns the Unicode name associated with the first codepoint of the string. &uninames returns an array of names, one per codepoint.

By default, uniname tries to find the Unicode name associated with the character and, failing that, returns a code point label (see UAX #44 and Section 4.8 of the Standard). This is nearly identical to accessing the Name property through uniprop, except that the property holds an empty string for undefined names.

uninames("A\x[00]¶\x[2028,80]")
# results in:
"LATIN CAPITAL LETTER A",
"<control-0000>",
"PILCROW SIGN",
"LINE SEPARATOR",
"<control-0080>"

The :one adverb instead tries to find the Unicode 1.0 name associated with the character (this would most often be useful for getting a proper name for control codes). If there is no Unicode 1.0 name associated with the character, a code point label is returned. This is similar to accessing the Unicode_1_Name property through uniprop, except that the property holds an empty string for undefined Unicode 1.0 names.

uninames("A\x[00]¶\x[2028,80]", :one)
# results in:
"<graphic-0041>",
"NULL",
"PARAGRAPH SIGN",
"<format-2028>",
"<control-0080>"

The :either adverb will try to first obtain a Unicode name for the character. Failing that, it will try to instead obtain the Unicode 1.0 name. If the character has neither name property defined, a code point label is returned.

uninames("A\x[00]¶\x[2028,80]", :either)
# results in:
"LATIN CAPITAL LETTER A",
"NULL",
"PILCROW SIGN",
"LINE SEPARATOR",
"<control-0080>"

The use of :either and :one together will prefer Unicode 1.0 names over newer Unicode names, but otherwise function identically to :either.

uninames("A\x[00]¶\x[2028,80]", :either :one)
# results in:
"LATIN CAPITAL LETTER A",
"NULL",
"PARAGRAPH SIGN",
"LINE SEPARATOR",
"<control-0080>"

In the case of graphical or formatting characters without a Unicode 1.0 name, the use of the :one adverb by itself will return a non-standard codepoint label of either of the following:

<graphic-XXXX>
<format-XXXX>

Note that the use of :either and :one together will not use these non-standard labels, as every graphic and format character has a current Unicode name.

The definition of "graphic" and "format" characters is covered in Section 2.4, Table 2-3 of the current Unicode Standard.

This function does not deal with name aliases; for those, use the Name_Alias property via uniprop.

If strict adherence to the values of those properties is desired (i.e. null strings instead of code point labels), the Name and Unicode_1_Name properties may be accessed via uniprop.

Numeric Value

unival(Int $codepoint)
Int.unival

unival(Unicodey $char)
Unicodey.unival

univals(Unicodey $str)
Unicodey.univals

Returns a Rat (or an Int if the denominator is 1) representing the given character's numeric value. Returns NaN if the character is not a number.

say unival("0");   # output: 0
say unival("½");   # output: 0.5
say unival(".");   # output: NaN
say univals("½¾"); # output: 0.5 0.75 (an array of Rats and/or Ints)

Note that this will not convert a multi-digit string into one numeral; use the normal string-to-numeral coercers for that.

[Conjecture: should val() use unival on one-character strings as part of its allomorphic type process? E.g. ./fractionmagic.p6 ¾ takes the one positional argument as a RatStr.]

Regexes

By default regexes operate on the grapheme (NFG) level, regardless of how the string itself is stored.

The following is a list of adverbs that change how regexes view strings:

:i    Ignore case  (a ~~ A)
:m    Ignore marks (ä ~~ a)

:nfg    String viewed as NFG (default)
:nfc    String viewed as NFC
:nfd    String viewed as NFD
:nfkc   String viewed as NFKC
:nfkd   String viewed as NFKD

There's of course the syntax for accessing Unicode properties inside a regex:

<:Letter>
<:East_Asian_Width<Narrow>>

For example, if you needed to collect combining mark usage (say, for language-guessing purposes), then

$string ~~ /:nfd [<:Letter> (<:Mark>*)]+/

would get that info for you.

/./ always matches one "character" in the current view, in other words one element of "string being matched".ords.

Grapheme Explosion

To match to one specific character under different rules, you may use one of the </ /> rules.

<D/ />    Work on next character in NFD mode
<C/ />    NFC mode
<KD/ />   NFKD mode
<KC/ />   NFKC mode
<G/ />    NFG mode
</ />     NFD mode

This construct was primarily invented to let you deal with combining characters (those matching <:Mark>) on single graphemes. This is why </ /> is a synonym for <D/ />.

The forms with letters may use any brackets, similar to how m/ / and / / relate to each other.

</ />    Explodes grapheme
<⦃ ⦄>    Doesn't explode grapheme
<D/ />   Explodes grapheme
<D⦃ ⦄>   Explodes grapheme

So to collect base characters and combining marks in one section of what you're parsing, you could define such a regex as:

$string ~~ / </ $<base>=<:Letter> $<marks>=[<:Mark>+] /> /

Note that each of these exploders becomes redundant when its counterpart adverb is already in effect.

And yes, some of these forms do the opposite of exploding; the imagery of radically changing things in a localized area still applies :).

Quoting Constructs

By default, all quoting forms create Str objects:

"interpolating $string"
'non-interpolating'
Q[Base form]

Various adverbs may be used to generate non-NFG literals:

Q:nfd/NFD literal/
qq:nfc:to/heredocIsNFC/
qx:nfkd/useful for commands on less capable terminals perhaps/

The typical :nf adverbs are in use here.

:nfg    Str literal (default)
:nfd    NFD literal
:nfc    NFC literal
:nfkd   NFKD literal
:nfkc   NFKC literal

Unicode Literals

Identifiers

Identifiers in Perl 6 can start with any alphabetic character (those characters in the L category, as well as underscore), followed by any number of alphanumeric characters (those characters in the L or N category, as well as underscore). Dashes (-) and apostrophes (') may also appear, provided that they are followed by an alphabetic character.

my $foo;     # OK
my $fóò;     # also OK
my $১০kinds; # not ok (১০ are digits)

Combining marks (characters in the M category) may not be the first character in an identifier, but they may appear at any time afterwards.

Perl 6 internally stores all identifiers in NFG form, so these two lines create the same variable (and throw a redeclaration warning if used like this in code):

my $ä; # precomposed
my $ä; # not precomposed

Numbers

Similar to its support for any kind of Unicode in identifiers, Perl 6 allows any kind of character within the category Nd (Decimal Number) for decimal numbers:

say 42 + ٤٢; # ascii digits + arabic indic
say ᱄2;     # lepcha & ascii
say ⑨;       # not ok (category No, not Nd)

For hexadecimal digits, any character with Hex_Digit = yes is allowed:

say 0xCAFE;     # OK
say 0xｃａｆｅ; # OK too (fullwidth)

The :radix[] form of specifying numbers can accept strings following this same rule, with the following sets of characters specifying digits 10..35, as they have characters with true Hex_Digit properties:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ
Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ

For radices greater than 36, you must use literal numbers (see doc:S02/General radices for details).
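With plain ASCII digits and letters, the general radix form works as one would expect; a small sketch:

```raku
say :16("cafe");  # 51966 (c=12, a=10, f=15, e=14 in base 16)
say :36("zz");    # 1295  (35*36 + 35)
```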

Pragmas

[ the NF* pragmas have been removed, as they no longer are attributes of a Str object, and there's no sane way to set a default string-like type in a clean fashion. ]

use encoding :utf8;
use encoding :utf16<be> or :utf16<le>;
use encoding :utf32<be> or :utf32<le>;

The encoding pragma changes the default encoding for situations where that's necessary, e.g. for the default Str.encode or Buf.decode. Strs themselves don't store encoding information.

use unicode :v(version);

Specifies what version of Unicode you want to use. The $*UNICODE variable will tell you what version of Unicode is currently in use. This is useful if you need to work on data created for much older Unicode versions, or if you're doing work with properties known to be highly volatile between versions.

Pragmas are of course localize-able:

my $first = "hello"; # NFG string
{
    use unicode :v(3.2);
    ⋮
}

my $buffer = "foobar".encode; # object of type utf8 in $buffer
{
    use encoding :utf32<le>;
    $buffer = "foobar".encode; # object of type utf32 in $buffer
}
say $buffer.WHAT; # output: (utf32)

Final Considerations

The Stringy and Unicodey roles need some expansion, definitely. Keep in mind that the Uni type is supposed to accept any of the NFC, NFD, NFKC, and NFKD contents without normalizing.

The inclusion of ropey types will most directly impact Uni.

Operators between various string types need defining. The general rule should be "most specialized type wins" for the return value.

NFD ~ NFD   -->  NFD
NFC ~ NFKD  -->  Uni
    (UAX#15 says concat of mismatched NFs results in a non-NF string, which
    is our Uni type.)

[Note: concatenation and similar operations forming one string from parts can lead to non-NF results, even for NFC ~ NFC or NFG ~ NFG, if the second string begins with one (or more) "orphan" combining characters.]

Regexes likely need more work, though I don't see anything immediate.

Some easy way to change how Perl 6 handles language-specific weirdness would be nice, possibly through another type (Rope? Twine? Yarn?). A very small selection of those weirdnesses:

How would a hypothetical EBCDIC string type be implemented by some module writer?

Other areas to consider, surely.

(This spec should not be moved to a status more official than DRAFT status until this Final Considerations section disappears.)

AUTHOR

Matthew N. "lue" <L<rnddim@gmail.com|mailto:rnddim@gmail.com>>
Helmut Wollmersdorfer <L<helmut.wollmersdorfer@gmail.com|mailto:helmut.wollmersdorfer@gmail.com>>

ACKNOWLEDGEMENT

Thanks to TimToady and the rest here: http://irclog.perlgeek.de/perl6/2013-12-02#i_7942599 for answering my questions and inadvertently steering this document in a far different direction than I would've taken it otherwise.