Unicode UTF-8 Functions

Sizzle strings are unencoded bytes. In most cases, the encoding of a string is irrelevant to what Sizzle and SynthEyes are doing, even if the data they handle represents UTF-8-encoded data such as filenames.

The following functions stand in for their ordinary versions. They interpret the input bytes as UTF-8-encoded strings, allowing direct access to extended code points if necessary.

Not all byte strings are valid UTF-8 encodings; when a string contains invalid bytes, Sizzle maps the erroneous bytes to special code points in the range of 0xdc80 – 0xdcff.

This allows a byte string to be converted to code points and back without change, though code point zero is the string terminator and may not occur in the middle of the string.
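The round trip described above can be illustrated outside Sizzle. The following is a minimal Python sketch, not Sizzle code: it relies on Python's `surrogateescape` error handler, which likewise maps undecodable bytes to code points in 0xdc80 – 0xdcff. The helper names `to_code_points` and `from_code_points` are hypothetical, and whether Sizzle's mapping matches Python's byte for byte is an assumption.

```python
# Python analogue (not Sizzle): invalid UTF-8 bytes become code points
# in 0xDC80..0xDCFF, so the byte string survives a round trip unchanged.

def to_code_points(raw):
    """Decode bytes to a list of code points, escaping invalid bytes."""
    return [ord(ch) for ch in raw.decode("utf-8", errors="surrogateescape")]

def from_code_points(points):
    """Encode code points back to bytes, restoring the escaped bytes."""
    text = "".join(chr(p) for p in points)
    return text.encode("utf-8", errors="surrogateescape")

raw = b"ok\xff\xc3\xa9"          # 0xFF is never valid in UTF-8; C3 A9 is U+00E9
points = to_code_points(raw)
# Count the code points that stand for erroneous bytes.
error_count = sum(1 for p in points if 0xDC80 <= p <= 0xDCFF)

print([hex(p) for p in points])
print(error_count)
print(from_code_points(points) == raw)
```

Here the single invalid byte 0xFF becomes code point 0xdcff, and counting code points in the special range gives the kind of figure that utf8errors reports.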

You can insert Unicode into strings using \u0041 (or \U for 8 hex digits).

utf8char(val) Convert the code point to the equivalent byte string (1-4 bytes).
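To show why the result is between one and four bytes, here is a small Python sketch (an analogue, not Sizzle code) that encodes one code point from each UTF-8 length class. The helper name `utf8_bytes` is hypothetical.

```python
# Python analogue (not Sizzle): a code point encodes to 1-4 UTF-8 bytes
# depending on its magnitude.

def utf8_bytes(code_point):
    """Return the UTF-8 byte string for a single code point."""
    return chr(code_point).encode("utf-8")

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_bytes(cp)
    print(hex(cp), encoded.hex(), len(encoded))
```

The four sample code points produce 1, 2, 3, and 4 bytes respectively.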

utf8errors(str) The number of erroneous bytes in str, i.e., bytes that don’t constitute a valid UTF-8 encoding.

utf8from1252(str) Convert the string from Windows-1252/ISO-Latin-1 encoding to UTF-8 encoding. All characters can be converted, with unassigned 1252 characters corresponding to Unicode control codes.
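The conversion can be sketched in Python as follows. This is an analogue, not Sizzle code: Python's `cp1252` codec rejects the few byte values that Windows-1252 leaves unassigned, so the sketch falls back to the control code with the same numeric value. That fallback is an assumption about what "corresponding to Unicode control codes" means, and the name `cp1252_to_utf8` is hypothetical.

```python
# Python analogue (not Sizzle): convert Windows-1252 bytes to UTF-8,
# mapping unassigned 1252 bytes to same-numbered control codes (assumed).

def cp1252_to_utf8(data):
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Byte is unassigned in Windows-1252; assume the control
            # code point with the same numeric value.
            out.append(chr(b))
    return "".join(out).encode("utf-8")

print(cp1252_to_utf8(b"caf\xe9").hex())   # e-acute becomes two UTF-8 bytes
print(cp1252_to_utf8(b"\x80").hex())      # 0x80 is the euro sign in 1252
print(cp1252_to_utf8(b"\x81").hex())      # unassigned in 1252
```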

utf8index(haystack, needle) Look for the first occurrence of needle in haystack, returning the position, in code points from the beginning of haystack, at which needle may be found. Positions start at one; zero means needle wasn’t found.
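The difference between a code-point position and a byte offset matters as soon as the haystack contains multi-byte characters. The following Python sketch (an analogue, not Sizzle code) mimics the one-based, zero-if-missing convention for both the first and the last occurrence. The helper names are hypothetical.

```python
# Python analogue (not Sizzle): one-based code point positions,
# with zero meaning "not found".

def code_point_index(haystack, needle):
    """First occurrence of needle, counted in code points from one."""
    h = haystack.decode("utf-8", errors="surrogateescape")
    n = needle.decode("utf-8", errors="surrogateescape")
    return h.find(n) + 1       # str.find gives -1 when missing, so 0 here

def code_point_rindex(haystack, needle):
    """Last occurrence of needle, counted in code points from one."""
    h = haystack.decode("utf-8", errors="surrogateescape")
    n = needle.decode("utf-8", errors="surrogateescape")
    return h.rfind(n) + 1

haystack = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

print(code_point_index(haystack, needle))   # position in code points
print(haystack.find(needle) + 1)            # position in bytes, one larger here
print(code_point_index(haystack, b"xyz"))   # zero: not found
```

Because the ï occupies two bytes, the byte-based position is one greater than the code-point position in this example.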

utf8length(str) The number of code points in the string.

utf8ord(str) Convert a UTF-8 character to a numeric code point, from zero to 0x10FFFF (1,114,111).

The string may contain bytes for more than one code point; only the first code point is returned, and excess bytes are ignored.

utf8rindex(haystack, needle) Look for the last occurrence of needle in haystack, returning the position, in code points from the beginning of haystack, at which needle may be found. Positions start at one; zero means needle wasn’t found.

utf8substr(str, start[, len]) Pull a substring from the UTF-8-encoded string. Both start and len are measured in code points, not bytes. The start value begins at one, per the Sizzle standard. If start is less than zero, it is measured from the end. If len is less than zero, code points are taken from the end.
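Measuring in code points rather than bytes avoids cutting a multi-byte character in half. The Python sketch below (an analogue, not Sizzle code) covers only a positive start and a non-negative len; negative values are left out because their exact behavior is not spelled out here. The name `code_point_substr` is hypothetical.

```python
# Python analogue (not Sizzle): substring measured in code points,
# with a one-based start. Negative start/len are deliberately not handled.
from typing import Optional

def code_point_substr(data: bytes, start: int, length: Optional[int] = None) -> bytes:
    if start < 1 or (length is not None and length < 0):
        raise ValueError("sketch covers positive start and non-negative length only")
    text = data.decode("utf-8", errors="surrogateescape")
    end = None if length is None else start - 1 + length
    return text[start - 1:end].encode("utf-8", errors="surrogateescape")

data = "größe".encode("utf-8")

# Two code points starting at the third one.
print(code_point_substr(data, 3, 2).decode("utf-8"))
# The same numbers applied to raw bytes select only two bytes.
print(data[2:4].decode("utf-8"))
```

With code-point units the call returns both "ö" and "ß"; the byte slice with the same numbers holds only the two bytes of "ö".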

utf8to1252(string) Convert a UTF-8 encoded string to a Windows-1252/ISO-Latin-1-encoded string. This function will strip accents and decompose characters in an effort to be as readable as possible. Any illegal characters, or remaining characters not expressible in 1252, are replaced with a bullet • (0x95, Unicode 0x2022).

utf8translate(string, from_chars, to_chars) Translate string, replacing each code point in from_chars with the corresponding code point of to_chars. Other characters are unaffected.
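A code-point-wise translation of this kind can be sketched in Python with `str.maketrans` and `str.translate`. This is an analogue, not Sizzle code: the name `translate_code_points` is hypothetical, and the sketch assumes from_chars and to_chars have the same number of code points (Python raises ValueError otherwise; what Sizzle does in that case is not stated here).

```python
# Python analogue (not Sizzle): replace each code point in from_chars
# with the code point at the same position in to_chars.

def translate_code_points(data, from_chars, to_chars):
    # maketrans pairs the characters up by position; lengths must match.
    table = str.maketrans(from_chars, to_chars)
    text = data.decode("utf-8", errors="surrogateescape")
    return text.translate(table).encode("utf-8", errors="surrogateescape")

result = translate_code_points("déjà vu".encode("utf-8"), "éà", "ea")
print(result.decode("utf-8"))
```

Working on code points rather than bytes is what lets a two-byte character such as "é" map to a one-byte character such as "e".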
