Historically, there was a large variety of encodings, depending on manufacturer, country, programming language, operating system, and so on.
Still relevant today are:
The American Standard Code for Information Interchange (ASCII) was published as a standard in the USA in 1963.
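ASCII defines 128 characters:
- control characters such as newline, escape, end of transmission/file, delete
- letters a-z, A-Z
- digits 0-9
- punctuation, brackets, operators, and special characters .,:;?!"[{()}]+-*/<>=#$%&'\^_|~`@

The ISO 8859 family of 8-bit encodings extends ASCII with regional characters:

| Encoding | Region | Languages |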
|---|---|---|
| ISO 8859-1 (Latin-1) | Western Europe | German, French,…, Icelandic |
| ISO 8859-2 (Latin-2) | Eastern Europe | Slavic languages with Latin script |
| ISO 8859-3 (Latin-3) | Southern Europe | Turkish, Maltese,… |
| ISO 8859-4 (Latin-4) | Northern Europe | Estonian, Latvian, Lithuanian, Greenlandic, Sami |
| ISO 8859-5 (Latin/Cyrillic) | Eastern Europe | Slavic languages with Cyrillic script |
| ISO 8859-6 (Latin/Arabic) | | |
| ISO 8859-7 (Latin/Greek) | | |
| … | | |
| ISO 8859-15 (Latin-9) | Western Europe | 1999: revision of Latin-1, now including the euro sign |
The goal of the Unicode Consortium is a uniform encoding for all scripts worldwide.
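Each character is assigned a codepoint, which is simply a sequential number written in hexadecimal:
- U+XXXX (zeroth plane)
- U+XXXXXX (further planes)

Each plane ranges from U+XY0000 to U+XYFFFF, thus containing \(2^{16}=65\;536\) characters. The planes XY=00 to XY=10 are provided, giving a value range from U+0000 to U+10FFFF. So far, characters have been assigned in the planes
- U+0000 - U+FFFF,
- U+010000 - U+01FFFF,
- U+020000 - U+02FFFF,
- U+030000 - U+03FFFF, and
- U+0E0000 - U+0EFFFF.

Two ranges are compatible with older encodings:
- U+0000 to U+007F is identical to ASCII
- U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1)

In the standard, each character is described by its codepoint, a name, and a set of properties such as its category.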
In the Unicode standard, this looks like (simplified, only codepoint and name):
...
U+0041 LATIN CAPITAL LETTER A
U+0042 LATIN CAPITAL LETTER B
U+0043 LATIN CAPITAL LETTER C
U+0044 LATIN CAPITAL LETTER D
...
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
...
U+0641 ARABIC LETTER FEH
U+0642 ARABIC LETTER QAF
...
U+21B4 RIGHTWARDS ARROW WITH CORNER DOWNWARDS
...
What does ‘RIGHTWARDS ARROW WITH CORNER DOWNWARDS’ look like?
Julia uses \U... for input of Unicode codepoints.
'\U21b4'
'↴': Unicode U+21B4 (category So: Symbol, other)
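To connect the `\U...` notation with the plane structure described earlier, here is a minimal sketch that prints the codepoint and the plane of a few characters. The helper name `plane` is hypothetical (it is not part of Julia); it assumes only that the plane number is the part of the codepoint above the lowest 16 bits.

```julia
# Hypothetical helper (not in Base): the plane is the codepoint
# shifted right by 16 bits.
plane(c::Char) = Int(codepoint(c) >> 16)

for c in ['A', '\U21b4', '\U101D0', '\U1F604']
    # print the codepoint in the usual U+XXXX notation together with its plane
    hex = uppercase(string(codepoint(c), base = 16, pad = 4))
    println("U+", hex, "  plane ", plane(c))
end
```

The first two characters lie in the zeroth plane; the Phaistos Disc sign and the emoji lie in plane 1.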
If individual characters or scripts are not displayable in your browser, you must install appropriate fonts on your computer.
Alternatively, you can use the PDF version of this page. There, all fonts are embedded.
A small helper function:
# print n consecutive characters starting at c, 70 per line
function printuc(c, n)
    for i in 1:n
        print(c + i - 1)
        if i % 70 == 0
            print("\n")
        end
    end
end
printuc (generic function with 1 method)
Cyrillic
printuc('\U0400', 100)
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфх
цчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣ
Tamil
printuc('\U0be7', 20)
௧௨௩௪௫௬௭௮௯௰௱௲௳௴௵௶௷௸௹௺
Chess
printuc('\U2654', 12)
♔♕♖♗♘♙♚♛♜♝♞♟
Mathematical Operators
printuc('\U2200', 255)
∀∁∂∃∄∅∆∇∈∉∊∋∌∍∎∏∐∑−∓∔∕∖∗∘∙√∛∜∝∞∟∠∡∢∣∤∥∦∧∨∩∪∫∬∭∮∯∰∱∲∳∴∵∶∷∸∹∺∻∼∽∾∿≀≁≂≃≄≅
≆≇≈≉≊≋≌≍≎≏≐≑≒≓≔≕≖≗≘≙≚≛≜≝≞≟≠≡≢≣≤≥≦≧≨≩≪≫≬≭≮≯≰≱≲≳≴≵≶≷≸≹≺≻≼≽≾≿⊀⊁⊂⊃⊄⊅⊆⊇⊈⊉⊊⊋
⊌⊍⊎⊏⊐⊑⊒⊓⊔⊕⊖⊗⊘⊙⊚⊛⊜⊝⊞⊟⊠⊡⊢⊣⊤⊥⊦⊧⊨⊩⊪⊫⊬⊭⊮⊯⊰⊱⊲⊳⊴⊵⊶⊷⊸⊹⊺⊻⊼⊽⊾⊿⋀⋁⋂⋃⋄⋅⋆⋇⋈⋉⋊⋋⋌⋍⋎⋏⋐⋑
⋒⋓⋔⋕⋖⋗⋘⋙⋚⋛⋜⋝⋞⋟⋠⋡⋢⋣⋤⋥⋦⋧⋨⋩⋪⋫⋬⋭⋮⋯⋰⋱⋲⋳⋴⋵⋶⋷⋸⋹⋺⋻⋼⋽⋾
Runes
printuc('\U16a0', 40)
ᚠᚡᚢᚣᚤᚥᚦᚧᚨᚩᚪᚫᚬᚭᚮᚯᚰᚱᚲᚳᚴᚵᚶᚷᚸᚹᚺᚻᚼᚽᚾᚿᛀᛁᛂᛃᛄᛅᛆᛇ
Phaistos Disc
printuc('\U101D0', 46)
𐇐𐇑𐇒𐇓𐇔𐇕𐇖𐇗𐇘𐇙𐇚𐇛𐇜𐇝𐇞𐇟𐇠𐇡𐇢𐇣𐇤𐇥𐇦𐇧𐇨𐇩𐇪𐇫𐇬𐇭𐇮𐇯𐇰𐇱𐇲𐇳𐇴𐇵𐇶𐇷𐇸𐇹𐇺𐇻𐇼𐇽
Unicode transformation formats define how a sequence of codepoints is represented as a sequence of bytes.
Since codepoints are of different lengths, they cannot simply be written down one after the other. Where does one end and the next begin?
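In UTF-8, 1, 2, 3, or 4 full bytes are used for each codepoint.
With variable-length encoding, you must be able to recognize which byte sequences belong together:
- A byte of the form 0xxxxxxx is a 1-byte code.
- A byte of the form 110xxxxx starts a 2-byte code: 110xxxxx 10xxxxxx
- A byte of the form 1110xxxx starts a 3-byte code: 1110xxxx 10xxxxxx 10xxxxxx
- A byte of the form 11110xxx starts a 4-byte code: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- All continuation bytes have the form 10xxxxxx.

Thus, the space available for the codepoint (number of x) is:
- 1-byte code: 7 bits
- 2-byte code: 11 bits
- 3-byte code: 16 bits
- 4-byte code: 21 bits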
Thus, every ASCII text is automatically also a correctly encoded UTF-8 text.
If the 17 planes (equivalent to 21 bits, resulting in approximately 1.1 million possible characters) currently defined in Unicode are ever depleted, UTF-8 can be extended to include 5- and 6-byte code sequences.
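The byte patterns of UTF-8 can be inspected directly in Julia. The following is a minimal sketch; the helper name `utf8bits` is hypothetical, and it relies only on the Base functions `codeunits`, `bitstring`, and `ncodeunits`.

```julia
# Hypothetical helper: the UTF-8 bytes of a character as bit strings.
utf8bits(c::Char) = join(bitstring.(codeunits(string(c))), " ")

for c in ['a', 'é', '€', '😄']
    # number of bytes used and their bit patterns
    println(c, "  ", ncodeunits(c), " byte(s):  ", utf8bits(c))
end
```

The leading bits of the first byte show how many bytes belong to the character, and every following byte starts with `10`.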
The Char type encodes a single Unicode character.
Character literals are written in single quotes: 'a'.
A Char occupies 4 bytes of memory, and Chars can be converted to/from UInts:
UInt('a')
0x0000000000000061
b = Char(0x2656)
'♖': Unicode U+2656 (category So: Symbol, other)
String literals are written in double quotes: "a".
@show typeof('a') sizeof('a') typeof("a") sizeof("a");
typeof('a') = Char
sizeof('a') = 4
typeof("a") = String
sizeof("a") = 1
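As a short illustration of the conversion between characters and integers, the sketch below uses `codepoint` from Base and simple character arithmetic. The variable names are chosen for this example only.

```julia
c = 'a'
n = codepoint(c)        # the codepoint as a UInt32

println(Int(n))         # the codepoint as a decimal number
println(Char(n))        # converting back yields the character
println(c + 1)          # adding an integer moves to a later codepoint
println('z' - 'a')      # the difference of two Chars is an integer
println(sizeof(c))      # a Char always occupies 4 bytes
```

Arithmetic on a `Char` operates on its codepoint, which is what the helper function `printuc` above relies on.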
For a non-ASCII string, the number of bytes and the number of characters differ:
asciistr = "Hello World!"
@show length(asciistr) ncodeunits(asciistr);
length(asciistr) = 12
ncodeunits(asciistr) = 12
str = "😄 Hellö 🎶"
@show length(str) ncodeunits(str);
length(str) = 9
ncodeunits(str) = 16
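Where the difference comes from can be seen by counting the bytes of each character. This is a small sketch using `ncodeunits` on single characters; the variable name `bytes_per_char` is introduced here for illustration.

```julia
str = "😄 Hellö 🎶"

# number of UTF-8 bytes of every character in the string
bytes_per_char = [ncodeunits(c) for c in str]

@show bytes_per_char
@show length(bytes_per_char) == length(str)
@show sum(bytes_per_char) == ncodeunits(str)
```

The two emoji need 4 bytes each and the `ö` needs 2, so the 9 characters occupy 16 bytes.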
Iterating over a string iterates over the characters:
for i in str
    println(i, " ", typeof(i))
end
😄 Char
Char
H Char
e Char
l Char
l Char
ö Char
Char
🎶 Char
Strings with concatenation form a non-commutative monoid.
Therefore, Julia writes concatenation multiplicatively.
str * asciistr * str
"😄 Hellö 🎶Hello World!😄 Hellö 🎶"
Powers with natural exponents are thus also defined.
str^3, str^0
("😄 Hellö 🎶😄 Hellö 🎶😄 Hellö 🎶", "")
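The monoid properties can be checked with a few assertions. This is only a sketch with three arbitrary example strings, not a proof.

```julia
a, b, c = "ab", "cd", "ef"

@assert (a * b) * c == a * (b * c)        # concatenation is associative
@assert a * "" == a == "" * a             # the empty string is the neutral element
@assert a * b != b * a                    # but it is not commutative
@assert a^3 == a * a * a == repeat(a, 3)  # powers are repeated concatenation
@assert a^0 == ""                         # the zeroth power is the neutral element
println("all monoid checks passed")
```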
The dollar sign has a special function in string literals: it interpolates the value of a variable or expression into the string. This is used frequently in print() statements.
a = 33.4
b = "x"
s = "The result for $b is equal to $a and the doubled square root of it is $(2sqrt(a))\n"
"The result for x is equal to 33.4 and the doubled square root of it is 11.55854662143991\n"
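Interpolation produces the same result as passing the pieces to `string()`, and any expression may appear inside `$( )`. The following sketch illustrates this; the variable names `s1`, `s2`, and `s3` are chosen for this example.

```julia
a = 33.4
b = "x"

s1 = "The result for $b is $a"
s2 = string("The result for ", b, " is ", a)     # same string without interpolation
s3 = "rounded: $(round(2sqrt(a), digits = 2))"   # arbitrary expression inside $( )

println(s1)
println(s1 == s2)
println(s3)
```

Rounding inside the interpolated expression is a simple way to control the number of printed digits.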
The backslash \ also has a special function in string constants. Julia uses the backslash codings known from C and other languages for special characters, dollar signs, and backslashes themselves:
s = "This is how one gets \'quotes\" and a \$ sign and a\nline break and a \\ etc... "
print(s)
This is how one gets 'quotes" and a $ sign and a
line break and a \ etc...
Strings may also be enclosed in triple quotes, preserving line breaks and embedded quotes:
s = """
This should
be a "longer"
'text'.
"""
print(s)
This should
be a "longer"
'text'.
In a raw string, all backslash escape sequences except for \" are disabled:
s = raw"A $ and a \ and two \\ and a 'bla'..."
print(s)
A $ and a \ and two \\ and a 'bla'...
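Raw strings are convenient wherever many backslashes or dollar signs occur, for instance in Windows-style paths. The path below is a made-up example.

```julia
path1 = raw"C:\temp\new"     # backslashes are taken literally
path2 = "C:\\temp\\new"      # the same string with escaped backslashes

println(path1)
println(path1 == path2)
println(raw"$x" == "\$x")    # no interpolation in raw strings
```

In an ordinary string literal, `\t` and `\n` in this path would have been read as a tab and a line break.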
@show isdigit('0') isletter('Ψ') isascii('\U2655') islowercase('α')
@show isnumeric('½') iscntrl('\n') ispunct(';');
isdigit('0') = true
isletter('Ψ') = true
isascii('♕') = false
islowercase('α') = true
isnumeric('½') = true
iscntrl('\n') = true
ispunct(';') = true
These tests can be used on strings with all(), any(), or count():
all(ispunct, ";.:")
true
any(isdigit, "It is 3 o'clock! 🕒")
true
count(islowercase, "Hello, du!!")
6
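The character tests also combine with `filter()` and with negated or anonymous predicates. This is a small sketch on an arbitrary example string; the variable names are chosen here for illustration.

```julia
s = "Grüße, 3 Äpfel!"

digits_only = filter(isdigit, s)         # keep only the digits
n_nonascii  = count(!isascii, s)         # negate a predicate with !
n_upper     = count(isuppercase, s)
only_text   = all(c -> isletter(c) || isspace(c), "Hello du")

@show digits_only n_nonascii n_upper only_text
```

`filter()` applied to a string returns a string again, so the result can be processed further with the usual string functions.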
@show startswith("Lampenschirm", "Lamp") occursin("pensch", "Lampenschirm")
@show endswith("Lampenschirm", "irm");
startswith("Lampenschirm", "Lamp") = true
occursin("pensch", "Lampenschirm") = true
endswith("Lampenschirm", "irm") = true
@show uppercase("Eis") lowercase("Eis") titlecase("eiSen");
uppercase("Eis") = "EIS"
lowercase("Eis") = "eis"
titlecase("eiSen") = "Eisen"
# remove newline from end of string
@show chomp("Eis\n") chomp("Eis");
chomp("Eis\n") = "Eis"
chomp("Eis") = "Eis"
split("π is irrational.")
3-element Vector{SubString{String}}:
"π"
"is"
"irrational."
replace("π is irrational.", "is" => "is allegedly")
"π is allegedly irrational."
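These functions are typically chained. The sketch below takes apart a made-up configuration line; the input format and the variable names `line`, `fields`, `keyvals`, and `joined` are assumptions for this example.

```julia
line = "  name = Julia ; version = 1.10 \n"

# remove the trailing newline, split at ';' and trim surrounding whitespace
fields = strip.(split(chomp(line), ';'))

# split every field at '=' into a key and a value
keyvals = [strip.(split(f, '=')) for f in fields]

joined = join(uppercase.(fields), " | ")

@show fields keyvals joined
```

`split()` and `strip()` return `SubString`s, which refer to the original string without copying it and compare equal to ordinary strings with the same content.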
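Strings are immutable but indexable, with a few special features: the index counts bytes rather than characters, and only an index that points at the first byte of a character is valid.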
Our example string:
str
"😄 Hellö 🎶"
The first character
str[1]
'😄': Unicode U+1F604 (category So: Symbol, other)
This character is 4 bytes long in UTF-8 encoding. Thus, 2, 3, and 4 are invalid indices.
str[2]
StringIndexError: invalid index [2], valid nearby indices [1]=>'😄', [5]=>' '
Stacktrace:
[1] string_index_err(s::AbstractString, i::Int64)
@ Base ./strings/string.jl:12
[2] getindex_continued(s::String, i::Int64, u::UInt32)
@ Base ./strings/string.jl:473
[3] getindex(s::String, i::Int64)
@ Base ./strings/string.jl:465
[4] top-level scope
@ ~/Julia/Book26/JuliaBook/chapters/10_Strings.qmd:461
A new character begins only at the 5th byte:
str[5]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
When addressing substrings, both the start and the end must be valid indices: the end index must also point at the first byte of a character, and that character is the last one included in the substring.
str[1:7]
"😄 He"
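If a substring should be specified by a number of characters rather than by byte positions, the end index can be computed first. The following sketch uses the three-argument form of `nextind()` as well as `first()` and `last()` from Base; the variable names `n` and `stop` are chosen for this example.

```julia
str = "😄 Hellö 🎶"
n = 4

stop = nextind(str, 0, n)      # byte index of the n-th character

@show stop
@show str[1:stop]
@show first(str, n)            # the first n characters
@show last(str, 2)             # the last 2 characters
```

Both `str[1:stop]` and `first(str, n)` give the first four characters of the example string.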
The function eachindex() returns an iterator over the valid indices:
for i in eachindex(str)
    c = str[i]
    println("$i: $c")
end
1: 😄
5:
6: H
7: e
8: l
9: l
10: ö
12:
13: 🎶
As usual, collect() turns an iterator into a vector.
collect(eachindex(str))
9-element Vector{Int64}:
1
5
6
7
8
9
10
12
13
The function nextind() returns the next valid index.
@show nextind(str, 1) nextind(str, 2);
nextind(str, 1) = 5
nextind(str, 2) = 5
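Besides `nextind()`, Base offers a few related functions for moving safely through the byte indices. This is a short sketch on the example string, using `isvalid()`, `thisind()`, `prevind()`, and `lastindex()`.

```julia
str = "😄 Hellö 🎶"

@show isvalid(str, 1) isvalid(str, 2)   # is this byte index the start of a character?
@show thisind(str, 3)                   # start of the character that byte 3 belongs to
@show prevind(str, 5)                   # the valid index before 5
@show lastindex(str) ncodeunits(str)    # index of the last character vs. number of bytes
```

Checking with `isvalid()` or normalizing with `thisind()` avoids the `StringIndexError` shown above when an index comes from an untrusted source.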
Why does Julia use a byte index instead of a character index? The main reason is the efficiency of indexing.
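With a byte index, a position such as s[123455] can be found quickly, whereas a character index would require scanning the UTF-8 string from its beginning, because each character occupies between 1 and 4 bytes.
Some functions return indices or ranges as results. They always return valid indices: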
findfirst('l', str)
8
findfirst("Hel", str)
6:8
str2 = "αβγδϵ"^3
"αβγδϵαβγδϵαβγδϵ"
n = findfirst('γ', str2)
5
So you can continue searching from the next valid index after n=5:
findnext('γ', str2, nextind(str2, n))
15
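Repeating this step in a loop collects all positions of a character. The function name `all_positions` is hypothetical and serves only as an illustration of the `findfirst`/`findnext`/`nextind` pattern.

```julia
# Hypothetical helper: byte indices of all occurrences of the character c in s.
function all_positions(c::Char, s::AbstractString)
    hits = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(hits, i)
        # continue the search at the next valid index after the hit
        i = findnext(c, s, nextind(s, i))
    end
    return hits
end

str2 = "αβγδϵ"^3
@show all_positions('γ', str2)
@show all_positions('x', str2)
```

Because every Greek letter in `str2` occupies 2 bytes, the hits are 10 bytes apart.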