14 Characters, Strings, and Unicode

14.1 Character Encodings (Early History)

In the early days of computing there was a large variety of encodings, differing by manufacturer, country, programming language, operating system, and so on.

Still relevant today are:

14.1.1 ASCII

The American Standard Code for Information Interchange (ASCII) was published as a standard in the USA in 1963.

  • It defines \(2^7=128\) characters, namely:
    • 33 control characters, such as newline, escape, end of transmission/file, delete
    • 95 graphically printable characters:
      • 52 Latin letters a-z, A-Z
      • 10 digits 0-9
      • 7 punctuation marks .,:;?!"
      • 1 space
      • 6 parentheses [{()}]
      • 7 mathematical operations +-*/<>=
      • 12 special characters #$%&'\^_|~`@
  • ASCII is still the “lowest common denominator” in the encoding chaos.
  • The first 128 Unicode characters are identical to ASCII.
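
The counts above can be checked directly in Julia. The following is a minimal sketch that classifies the 128 ASCII codepoints with predicates from Base; the variable name ascii_chars is only illustrative.

```julia
# All 128 ASCII characters, i.e. the codepoints 0 to 127
ascii_chars = Char.(0:127)

@show count(iscntrl, ascii_chars)    # control characters
@show count(isprint, ascii_chars)    # printable characters, including the space
@show count(isletter, ascii_chars)   # Latin letters a-z, A-Z
@show count(isdigit, ascii_chars);   # digits 0-9
```

The four counts should come out as 33, 95, 52, and 10, matching the list.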

14.1.2 ISO 8859 Character Sets

  • ASCII uses only 7 bits.
  • In a byte, you can fit another 128 characters by setting the 8th bit.
  • In 1987/88, various 1-byte encodings were standardized in ISO 8859, all ASCII-compatible, including:

| Encoding | Region | Languages |
|---|---|---|
| ISO 8859-1 (Latin-1) | Western Europe | German, French, …, Icelandic |
| ISO 8859-2 (Latin-2) | Eastern Europe | Slavic languages with Latin script |
| ISO 8859-3 (Latin-3) | Southern Europe | Turkish, Maltese, … |
| ISO 8859-4 (Latin-4) | Northern Europe | Estonian, Latvian, Lithuanian, Greenlandic, Sami |
| ISO 8859-5 (Latin/Cyrillic) | Eastern Europe | Slavic languages with Cyrillic script |
| ISO 8859-6 (Latin/Arabic) | | |
| ISO 8859-7 (Latin/Greek) | | |
| ISO 8859-15 (Latin-9) | | 1999: revision of Latin-1, now including the Euro sign |

14.2 Unicode

The goal of the Unicode Consortium is a uniform encoding for all scripts worldwide.

  • Unicode version 1 was published in 1991
  • Unicode version 17 was published in 2025 with 159,801 characters, including:
    • 172 scripts
    • mathematical and technical symbols
    • Emojis and other symbols, control and formatting characters
  • Over 90,000 characters are assigned to the CJK scripts (Chinese/Japanese/Korean)

14.2.1 Technical Details

  • Each character is assigned a codepoint, which is simply a sequential number written hexadecimally
    • either with 4 digits as U+XXXX (zeroth plane)
    • or with 6 digits as U+XXXXXX (further planes)
  • Each plane ranges from U+XY0000 to U+XYFFFF, thus containing \(2^{16}=65\,536\) codepoints.
  • 17 planes XY=00 to XY=10 are provided, giving a value range from U+0000 to U+10FFFF.
  • Thus, a maximum of 21 bits per character are needed.
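  • The total number of usable codepoints is slightly less than 0x110000 = 1,114,112, because certain ranges (such as the 2,048 surrogate codepoints U+D800 to U+DFFF reserved for UTF-16) are not available for characters. That still leaves about 1.1 million, so there is plenty of room.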
  • So far, codepoints have been assigned only from these planes:
    • Plane 0 = BMP (Basic Multilingual Plane) U+0000 - U+FFFF,
    • Plane 1 = SMP (Supplementary Multilingual Plane) U+010000 - U+01FFFF,
    • Plane 2 = SIP (Supplementary Ideographic Plane) U+020000 - U+02FFFF,
    • Plane 3 = TIP (Tertiary Ideographic Plane) U+030000 - U+03FFFF, and
    • Plane 14 = SSP (Supplementary Special-purpose Plane) U+0E0000 - U+0EFFFF.
  • U+0000 to U+007F is identical to ASCII
  • U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1)
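
The plane arithmetic and the Latin-1 correspondence can be tried out as follows. This is a small sketch; the function name plane and the variable latin1_bytes are hypothetical helpers introduced only for this illustration.

```julia
# Plane number of a character: the codepoint shifted right by 16 bits
plane(c::Char) = Int(codepoint(c) >> 16)

@show plane('A')          # U+0041 lies in plane 0 (BMP)
@show plane('\U1F604')    # U+1F604 lies in plane 1 (SMP)
@show plane('\U20000');   # U+20000 lies in plane 2 (SIP)

# 17 planes of 2^16 codepoints each cover U+0000 to U+10FFFF
@show 17 * 2^16 == 0x10FFFF + 1

# Latin-1 bytes equal the codepoints U+0000 to U+00FF,
# so Char(byte) decodes a Latin-1 encoded text
latin1_bytes = UInt8[0x47, 0x72, 0xfc, 0xdf, 0x65]
@show join(Char.(latin1_bytes));
```

Note that the decoded string is stored as UTF-8 by Julia, so it occupies more bytes than the Latin-1 input.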

14.2.2 Properties of Unicode Characters

In the standard, each character is described by

  • its codepoint (number)
  • a name (which consists only of ASCII uppercase letters, digits, spaces, and hyphens) and
  • various attributes such as
    • script direction
    • category: uppercase letter, lowercase letter, modifier letter, digit, punctuation, symbol, separator,….

In the Unicode standard, this looks like (simplified, only codepoint and name):

...
U+0041 LATIN CAPITAL LETTER A
U+0042 LATIN CAPITAL LETTER B
U+0043 LATIN CAPITAL LETTER C
U+0044 LATIN CAPITAL LETTER D
...
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
...
U+0641 ARABIC LETTER FEH
U+0642 ARABIC LETTER QAF
...
U+21B4 RIGHTWARDS ARROW WITH CORNER DOWNWARDS
...

What does ‘RIGHTWARDS ARROW WITH CORNER DOWNWARDS’ look like?

Julia uses the escape \U followed by up to eight hexadecimal digits (or \u followed by up to four) to enter a character by its codepoint.

'\U21b4'
'↴': Unicode U+21B4 (category So: Symbol, other)
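
A short sketch of the different ways to get from a codepoint to a character and back. It uses only Base functions; the variable name c is illustrative.

```julia
c = '\U21b4'

# \u takes up to 4 hexadecimal digits, \U up to 8
@show c == '\u21b4'

# A Char can also be constructed from the number itself
@show c == Char(0x21b4)

# codepoint() returns the number as a UInt32
@show codepoint(c)
@show string(codepoint(c), base = 16);
```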

14.2.3 A Selection of Scripts

Note

If individual characters or scripts are not displayable in your browser, you must install appropriate fonts on your computer.

Alternatively, you can use the PDF version of this page. There, all fonts are embedded.

A small helper function:

# Print n consecutive characters starting at c, 70 per line
function printuc(c, n)
    for i in 1:n
        print(c + i - 1)
        if i % 70 == 0
            println()
        end
    end
end
printuc (generic function with 1 method)

Cyrillic

printuc('\U0400', 100)
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфх
цчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣ

Tamil

printuc('\U0be7',20)
௧௨௩௪௫௬௭௮௯௰௱௲௳௴௵௶௷௸௹௺

Chess

printuc('\U2654', 12)
♔♕♖♗♘♙♚♛♜♝♞♟

Mathematical Operators

printuc('\U2200', 255)
∀∁∂∃∄∅∆∇∈∉∊∋∌∍∎∏∐∑−∓∔∕∖∗∘∙√∛∜∝∞∟∠∡∢∣∤∥∦∧∨∩∪∫∬∭∮∯∰∱∲∳∴∵∶∷∸∹∺∻∼∽∾∿≀≁≂≃≄≅
≆≇≈≉≊≋≌≍≎≏≐≑≒≓≔≕≖≗≘≙≚≛≜≝≞≟≠≡≢≣≤≥≦≧≨≩≪≫≬≭≮≯≰≱≲≳≴≵≶≷≸≹≺≻≼≽≾≿⊀⊁⊂⊃⊄⊅⊆⊇⊈⊉⊊⊋
⊌⊍⊎⊏⊐⊑⊒⊓⊔⊕⊖⊗⊘⊙⊚⊛⊜⊝⊞⊟⊠⊡⊢⊣⊤⊥⊦⊧⊨⊩⊪⊫⊬⊭⊮⊯⊰⊱⊲⊳⊴⊵⊶⊷⊸⊹⊺⊻⊼⊽⊾⊿⋀⋁⋂⋃⋄⋅⋆⋇⋈⋉⋊⋋⋌⋍⋎⋏⋐⋑
⋒⋓⋔⋕⋖⋗⋘⋙⋚⋛⋜⋝⋞⋟⋠⋡⋢⋣⋤⋥⋦⋧⋨⋩⋪⋫⋬⋭⋮⋯⋰⋱⋲⋳⋴⋵⋶⋷⋸⋹⋺⋻⋼⋽⋾

Runes

printuc('\U16a0', 40)
ᚠᚡᚢᚣᚤᚥᚦᚧᚨᚩᚪᚫᚬᚭᚮᚯᚰᚱᚲᚳᚴᚵᚶᚷᚸᚹᚺᚻᚼᚽᚾᚿᛀᛁᛂᛃᛄᛅᛆᛇ

Phaistos Disc

  • This script is not deciphered.
  • It is unclear what language it represents.
  • There is only one single document in this script: the Phaistos Disc from the Bronze Age.

printuc('\U101D0', 46)
𐇐𐇑𐇒𐇓𐇔𐇕𐇖𐇗𐇘𐇙𐇚𐇛𐇜𐇝𐇞𐇟𐇠𐇡𐇢𐇣𐇤𐇥𐇦𐇧𐇨𐇩𐇪𐇫𐇬𐇭𐇮𐇯𐇰𐇱𐇲𐇳𐇴𐇵𐇶𐇷𐇸𐇹𐇺𐇻𐇼𐇽

14.2.4 Unicode Transformation Formats: UTF-8, UTF-16, UTF-32

Unicode transformation formats define how a sequence of codepoints is represented as a sequence of bytes.

Since codepoints are of different lengths, they cannot simply be written down one after the other. Where does one end and the next begin?

  • UTF-32: The simplest but also most memory-intensive approach is to make all codes the same length. Each codepoint is encoded in 4 bytes = 32 bits.
  • In UTF-16, a codepoint is represented either with 2 bytes or with 4 bytes.
  • In UTF-8, a codepoint is represented with 1, 2, 3, or 4 bytes.
  • UTF-8 is the format with the highest prevalence. Julia also uses it.
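
To compare the memory footprint of the three formats, the following sketch converts one sample string with transcode() from Base. The sample string and the variable names are chosen only for illustration.

```julia
s = "😄 Hellö 🎶"

utf16 = transcode(UInt16, s)   # vector of UTF-16 code units (2 bytes each)
utf32 = transcode(UInt32, s)   # vector of UTF-32 code units (4 bytes each)

@show length(s)                # number of characters
@show sizeof(s)                # bytes in UTF-8 (Julia's native encoding)
@show sizeof(utf16)            # bytes in UTF-16
@show sizeof(utf32);           # bytes in UTF-32
```

For this string of 9 characters the sizes should be 16, 22, and 36 bytes: the two emojis need 4 bytes in every format, while the ASCII characters need 1, 2, and 4 bytes, respectively.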

14.2.5 UTF-8

  • For each codepoint, 1, 2, 3, or 4 full bytes are used.

  • With variable-length encoding, you must be able to recognize which byte sequences belong together:

    • A byte of the form 0xxxxxxx represents an ASCII codepoint of length 1.
    • A byte of the form 110xxxxx starts a 2-byte code.
    • A byte of the form 1110xxxx starts a 3-byte code.
    • A byte of the form 11110xxx starts a 4-byte code.
    • All further bytes of a 2-, 3-, or 4-byte code have the form 10xxxxxx.
  • Thus, the space available for the codepoint (number of x) is:

    • One-byte code: 7 bits
    • Two-byte code: 5 + 6 = 11 bits
    • Three-byte code: 4 + 6 + 6 = 16 bits
    • Four-byte code: 3 + 6 + 6 + 6 = 21 bits
  • Thus, every ASCII text is automatically also a correctly encoded UTF-8 text.

  • If the 17 planes (equivalent to 21 bits, or approximately 1.1 million possible codepoints) currently defined in Unicode are ever exhausted, the same scheme could be extended to 5- and 6-byte sequences; the current UTF-8 standard allows at most 4 bytes.
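
The bit patterns listed above can be turned into a small encoder. This is a minimal sketch under simplifying assumptions: the function name utf8_bytes is hypothetical, and the function does not reject invalid input such as surrogate codepoints or values above U+10FFFF. Its result is compared with the bytes that Julia itself stores.

```julia
# Encode a single codepoint as UTF-8 bytes (no validation of the input)
function utf8_bytes(cp::Integer)
    cp = UInt32(cp)
    if cp < 0x80            # 7 bits:  0xxxxxxx
        return UInt8[cp]
    elseif cp < 0x800       # 11 bits: 110xxxxx 10xxxxxx
        return UInt8[0xc0 | (cp >> 6),
                     0x80 | (cp & 0x3f)]
    elseif cp < 0x10000     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return UInt8[0xe0 | (cp >> 12),
                     0x80 | ((cp >> 6) & 0x3f),
                     0x80 | (cp & 0x3f)]
    else                    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return UInt8[0xf0 | (cp >> 18),
                     0x80 | ((cp >> 12) & 0x3f),
                     0x80 | ((cp >> 6) & 0x3f),
                     0x80 | (cp & 0x3f)]
    end
end

# Compare with Julia's own UTF-8 representation
for c in ('A', 'ö', '♖', '😄')
    @assert utf8_bytes(codepoint(c)) == codeunits(string(c))
    println(c, " => ", utf8_bytes(codepoint(c)))
end
```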

14.3 Characters and Strings in Julia

14.3.1 Characters

The Char type encodes a single Unicode character.

  • Julia uses single quotes for characters: 'a'.
  • A Char occupies 4 bytes of memory and represents a Unicode codepoint.
  • Chars can be converted to and from unsigned integers; the integer value is equal to the Unicode codepoint.

For example:

UInt('a')
0x0000000000000061
b = Char(0x2656)
'♖': Unicode U+2656 (category So: Symbol, other)
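
Because a Char corresponds to a number, a limited form of arithmetic is defined on characters. A brief sketch using only Base operations:

```julia
@show 'a' + 1          # adding an integer gives a character further along
@show 'z' - 'a'        # the difference of two characters is an integer
@show 'a' < 'b'        # comparison follows the codepoint order
@show Int('ö') codepoint('ö');
```

This is what the helper function printuc() in section 14.2.3 relies on when it computes c + i - 1.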

14.3.2 Strings

  • In Julia, strings are denoted with double quotes: "a".
  • These strings are encoded in UTF-8, where a single character may consist of 1 to 4 bytes.

@show typeof('a') sizeof('a') typeof("a") sizeof("a");
typeof('a') = Char
sizeof('a') = 4
typeof("a") = String
sizeof("a") = 1

For a non-ASCII string, the number of bytes and the number of characters differ:

asciistr = "Hello World!"
@show length(asciistr) ncodeunits(asciistr);
length(asciistr) = 12
ncodeunits(asciistr) = 12
str = "😄 Hellö 🎶"
@show length(str) ncodeunits(str);
length(str) = 9
ncodeunits(str) = 16

Iterating over a string iterates over the characters:

for i in str
    println(i, "  ", typeof(i))
end
😄  Char
   Char
H  Char
e  Char
l  Char
l  Char
ö  Char
   Char
🎶  Char
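
To see where the difference between the character count and the byte count comes from, one can ask each character how many UTF-8 code units it needs. A small sketch; the variable name bytes_per_char is illustrative.

```julia
str = "😄 Hellö 🎶"

# Number of UTF-8 bytes needed by each character
bytes_per_char = [ncodeunits(c) for c in str]

@show bytes_per_char
@show sum(bytes_per_char) == ncodeunits(str)

# The raw bytes of a single two-byte character
@show collect(codeunits("ö"));
```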

14.3.3 Concatenation of Strings

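String concatenation is associative and has the empty string as its neutral element, but it is not commutative. Strings with concatenation therefore form a non-commutative monoid.

For this reason, Julia writes concatenation multiplicatively with *, since + conventionally denotes commutative operations.

str * asciistr * str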
"😄 Hellö 🎶Hello World!😄 Hellö 🎶"

Powers with natural exponents are thus also defined.

str^3,  str^0
("😄 Hellö 🎶😄 Hellö 🎶😄 Hellö 🎶", "")
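
The monoid properties can be verified on a few short strings. This is only an illustrative check with arbitrarily chosen sample values a, b, and c, not a proof.

```julia
a, b, c = "ab", "cd", "ef"

@show (a * b) * c == a * (b * c)     # associative
@show a * "" == a == "" * a          # the empty string is the neutral element
@show a * b == b * a                 # not commutative
@show a^3 == repeat(a, 3)            # powers are repeated concatenation
@show string(a, b, c) == a * b * c;  # string() concatenates as well
```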

14.3.4 String Interpolation

The dollar sign has a special meaning inside string literals: it interpolates the value of a variable ($name) or of an arbitrary expression ($(expr)) into the string. This is frequently used in print() statements.

a = 33.4
b = "x"

s = "The result for $b is equal to $a and the doubled square root of it is $(2sqrt(a))\n"
"The result for x is equal to 33.4 and the doubled square root of it is 11.55854662143991\n"

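Interpolation inserts a value in its default text form. If the number format matters, one option is to round inside the interpolated expression, another is the @sprintf macro from the Printf standard library. The following sketch shows both; the variable names s1, s2, and s3 are illustrative.

```julia
using Printf   # standard library, provides @sprintf

a = 33.4

# Any expression can be interpolated, including a call that rounds the value
s1 = "rounded: $(round(2sqrt(a), digits = 2))"

# @sprintf gives C-style control over the number format
s2 = @sprintf("formatted: %.2f", 2sqrt(a))

# A literal dollar sign must be escaped with a backslash
s3 = "price: \$5"

@show s1 s2 s3;
```
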
14.3.5 Backslash Escape Sequences

The backslash \ also has a special function in string constants. Julia uses the backslash codings known from C and other languages for special characters, dollar signs, and backslashes themselves:

s = "This is how one gets \'quotes\" and a \$ sign and a\nline break and a \\ etc... "
print(s)
This is how one gets 'quotes" and a $ sign and a
line break and a \ etc... 

14.3.6 Triple Quotes

Strings may also be enclosed in triple quotes, preserving line breaks and embedded quotes:

s = """
 This should
be a "longer"  
  'text'.
"""

print(s)
 This should
be a "longer"  
  'text'.
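
One detail that the example above does not show: if all lines of a triple-quoted string, including the line with the closing quotes, are indented, Julia removes the common indentation. A minimal sketch:

```julia
s = """
    first line
      second line
    """

# The common indentation of 4 spaces is removed,
# only the 2 extra spaces of the second line remain
@show s;
```

This allows triple-quoted strings to be indented along with the surrounding code.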

14.3.7 Raw Strings

In a raw string, interpolation with $ and all backslash escape sequences except for \" are disabled:

s = raw"A $ and a \ and two \\ and a 'bla'..."
print(s)
A $ and a \ and two \\ and a 'bla'...
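
Raw strings are convenient for text that contains many backslashes, such as Windows paths. A brief comparison, with illustrative variable names:

```julia
# In a normal string literal every backslash has to be doubled
path1 = "C:\\temp\\new"

# In a raw string the backslashes are taken literally
path2 = raw"C:\temp\new"

@show path1 == path2
@show length(raw"\n") length("\n")   # two characters versus one newline
@show raw"$x" == "\$x";              # no interpolation in raw strings
```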

14.4 Further Functions for Characters and Strings (Selection)

14.4.1 Tests for Characters

@show isdigit('0') isletter('Ψ') isascii('\U2655') islowercase('α') 
@show isnumeric('½') iscntrl('\n') ispunct(';');
isdigit('0') = true
isletter('Ψ') = true
isascii('♕') = false
islowercase('α') = true
isnumeric('½') = true
iscntrl('\n') = true
ispunct(';') = true

14.4.2 Application to Strings

These tests can be used on strings with all(), any(), or count():

all(ispunct, ";.:")
true
any(isdigit, "It is 3 o'clock! 🕒" )
true
count(islowercase, "Hello, du!!")
6
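
The same predicates also work with other higher-order functions from Base, for example filter() and findfirst(). A short sketch with an arbitrarily chosen sample string:

```julia
s = "It is 3 o'clock!"

@show filter(isletter, s)      # keep only the letters
@show count(isspace, s)        # number of whitespace characters
@show findfirst(isdigit, s)    # index of the first digit
@show all(isascii, s);         # does the string contain only ASCII characters?
```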

14.4.3 Other String Functions

@show startswith("Lampenschirm", "Lamp")  occursin("pensch", "Lampenschirm")  
@show endswith("Lampenschirm", "irm");
startswith("Lampenschirm", "Lamp") = true
occursin("pensch", "Lampenschirm") = true
endswith("Lampenschirm", "irm") = true
@show uppercase("Eis") lowercase("Eis")  titlecase("eiSen");
uppercase("Eis") = "EIS"
lowercase("Eis") = "eis"
titlecase("eiSen") = "Eisen"
# remove newline from end of string

@show chomp("Eis\n")  chomp("Eis");
chomp("Eis\n") = "Eis"
chomp("Eis") = "Eis"
split("π is irrational.")
3-element Vector{SubString{String}}:
 "π"
 "is"
 "irrational."
replace("π is irrational.", "is" => "is allegedly")
"π is allegedly irrational."

14.5 Indexing of Strings

Strings are immutable but indexable, with a few special features:

  • The index numbers the bytes of the string.
  • For a non-ASCII string, not all indices are valid, because a valid index must point to the first byte of a character.

Our example string:

str
"😄 Hellö 🎶"

The first character

str[1]
'😄': Unicode U+1F604 (category So: Symbol, other)

This character is 4 bytes long in UTF-8 encoding. Thus, 2, 3, and 4 are invalid indices.

str[2]
StringIndexError: invalid index [2], valid nearby indices [1]=>'😄', [5]=>' '

The next character starts only at the 5th byte:

str[5]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

When addressing substrings, both the start and the end of the range must be valid indices: the end index must also point to the first byte of a character, and that character is the last one in the substring.

str[1:7]
"😄 He"

The function eachindex() returns an iterator over the valid indices:

for i in eachindex(str)
    c = str[i]
    println("$i: $c")
end
1: 😄
5:  
6: H
7: e
8: l
9: l
10: ö
12:  
13: 🎶

As usual, collect() turns an iterator into a vector.

collect(eachindex(str))
9-element Vector{Int64}:
  1
  5
  6
  7
  8
  9
 10
 12
 13

The function nextind() returns the next valid index.

@show nextind(str, 1) nextind(str, 2);
nextind(str, 1) = 5
nextind(str, 2) = 5
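
Besides nextind(), Base offers a few more functions for moving between valid indices. The following sketch applies them to the example string, which is repeated here so that the block is self-contained.

```julia
str = "😄 Hellö 🎶"

@show isvalid(str, 1) isvalid(str, 2)   # does a character start at this index?
@show thisind(str, 3)                    # start of the character containing byte 3
@show prevind(str, 5)                    # previous valid index
@show nextind(str, 10)                   # 'ö' occupies bytes 10 and 11
@show lastindex(str)                     # index of the last character
@show str[1:prevind(str, 6)];            # everything before 'H'
```

With these functions, substring ranges can be computed without guessing how many bytes a character occupies.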

Why does Julia use a byte index instead of a character index? The main reason is the efficiency of indexing.

  • In a long string (e.g., book text), the position s[123455] can be found quickly with a byte index.
  • A character index would have to traverse the entire string in UTF-8 encoding to find the n-th character, since characters can be 1, 2, 3, or 4 bytes long.
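
If a character index is really needed, it can be computed, at the cost of walking over the string. A sketch, assuming a hypothetical helper named charindex that is not part of Base; it uses the three-argument form of nextind().

```julia
str = "😄 Hellö 🎶"

# Hypothetical helper: byte index of the n-th character, found by
# stepping over n characters from the start (the cost grows with n)
charindex(s, n) = nextind(s, 0, n)

i = charindex(str, 7)
@show i str[i]

# Direct byte indexing needs no walk over the string
@show str[10];
```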

Some functions return indices or ranges as results. They always return valid indices:

findfirst('l', str)
8
findfirst("Hel", str)
6:8
str2 = "αβγδϵ"^3
"αβγδϵαβγδϵαβγδϵ"
n = findfirst('γ', str2)
5

So you can continue searching from the next valid index after n=5:

findnext('γ', str2, nextind(str2, n))
15
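
Repeating this step in a loop collects all positions of a character. A minimal sketch; the function name allpositions is hypothetical and serves only to illustrate the pattern.

```julia
# Collect the byte indices of all occurrences of c in s
function allpositions(c::Char, s::AbstractString)
    positions = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(positions, i)
        # continue the search at the next valid index after the match
        i = findnext(c, s, nextind(s, i))
    end
    return positions
end

str2 = "αβγδϵ"^3
@show allpositions('γ', str2)
@show allpositions('x', str2);   # no match gives an empty vector
```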