A tour of strings

June 2020 ∙ 15 minute read

Much has been written about encodings—and specifically Unicode—by programmers and for programmers, but it’s easy to get lost in the weeds and easier still to feel enlightened until the next text encoding bug bites. Recently, I’ve been exploring how text and strings are represented in different programming languages, and I think the various language-specific implementations are actually a critical component of why encodings are (still) so hard for programmers. There’s the standard adage that all programmers should know about encodings, which is true, and sure, some programmers still just don’t know (I certainly didn’t learn any of it in my formal education). But at a deeper level, text encodings are hard because: a) trying to represent the world’s writing is an ambitious task, b) different programming tasks favor differing representations, and c) different programming languages do subtly different things in how they represent and handle text.

So with that in mind, I present to you a brief tour of 6 different languages: JavaScript, Ruby, Go, Rust, Swift, and Haskell—and how each implements and represents text.

Why encode things?

First, it’s essential to understand that in order for written human languages to be used, stored, and shared on computers we have to have a system by which we can:

- turn characters, symbols, and other character-like-things into numbers, and
- turn those numbers back into the characters they represent.

The former is called encoding, the latter decoding, and critically we all have to agree on the big table that translates character-like-things to numbers and back again. The relevant big table that we all agree on (for the purposes of this article) is Unicode, but know that there are others[1]. All this is important because computers only know about numbers: everything you interact with on a computer has to eventually be represented as a number. Pixels? Yep. Sound? Yep. PDF files? Yep. Text? Yep, that’s what this is all about. Even the number 1 in text is represented as another number (the Unicode code point U+0031).
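
As a tiny illustration (a quick sketch in JavaScript, the first language on the tour below), the character "1" and the number 1 really are two different things to a computer:

> "1".codePointAt(0)         // the character "1" is stored as the number 49...
49
> (49).toString(16)          // ...which is 0x31, i.e. the code point U+0031
'31'
> String.fromCodePoint(0x31) // and decoding that number gives back the character
'1'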

What I’m not going to talk about but find interesting

I’m not going to talk about detecting encodings from arbitrary text or binary arrays of bytes. I’m also not going to talk about fonts and the rendering or display of characters and graphemes. Those are both topics for another post.

Things we expect from our strings

One more aside. Programmers have a bunch of different things they would like to do with strings. And because of this, we have some very high expectations for what our strings can do:

- tell us their length
- let us index into and slice them
- let us iterate over them (by byte, by code point, or by grapheme)
- reverse them
- convert between encodings

My explorations aren’t super rigorous (I mostly look at basic representation, enumeration, and indexing), but even just scratching the surface, the breadth of different decisions made is fascinating.

JavaScript

JavaScript, language darling of the web, uses UTF-16 to represent strings, despite the fact that virtually the entire modern web serves its text as UTF-8. This has some important consequences and surprises.

Quick: how long is this string: "hi 👋🏻"?

> "hi 👋🏻".length
7

Huh? What we really should have asked next is: in what units?

JavaScript strings are represented in UTF-16, and this representation leaks through into things like the length property and how you might go about indexing strings. I find the API much less obvious than in many of the other languages.

We can get at Unicode code points using the for...of string iterator, the spread operator, or Array.from.

> let arr = []; for (const c of "hi 👋🏻") { arr.push(c) }; arr
["h", "i", " ", "👋", "🏻"]
> [..."hi 👋🏻"  ].length
5
> [..."hi 👋🏻"]
["h", "i", " ", "👋", "🏻"]
> [..."hi 👋🏻"]  .reverse()
[ '🏻', '👋', ' ', 'i', 'h' ]

Be careful if you index as if the string was an array:

const str = "hi 👋🏻"; let arr = []; for (let i = 0; i < str.length; i++) { arr.push(str[i]) }; arr
(7) ["h", "i", " ", "�", "�", "�", "�"]

And good luck if you want to deal with UTF-8 instead (see TextEncoder and TextDecoder).
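
Here’s a minimal sketch (assuming a modern browser or Node where TextEncoder and TextDecoder are available as globals) of round-tripping a string through UTF-8:

> const bytes = new TextEncoder().encode("hi 👋🏻")  // TextEncoder always produces UTF-8
> bytes.length
11
> [...bytes].map(b => "0x" + b.toString(16))
["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]
> new TextDecoder("utf-8").decode(bytes)             // and back again
"hi 👋🏻"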

Ruby

Next there’s Ruby, a delightful and quirky old friend of a language. I don’t want to spend too much time on historical language versions, but I will briefly note that until Ruby 1.9 (and even then it was kinda broken), Ruby basically considered strings to be just arrays of bytes with some extra helper methods. Explicit encoding support was eventually added and has evolved over time; as of this writing, in Ruby 2.7, it’s relatively straightforward to deal with and understand text. Ruby has fully embraced UTF-8 as the default encoding for strings, but it’s easy to convert to/from other encodings if that’s your thing.

>> RUBY_VERSION
=> "2.7.1"

# Let's get the length again
> "hi 👋🏻".size
=> 5

# We can iterate over code points
>> "hi 👋🏻".codepoints.map{ |b| "0x%x" % b }
=> ["0x68", "0x69", "0x20", "0x1f44b", "0x1f3fb"]

# Or bytes
>> "hi 👋🏻".bytes.map{ |b| "0x%x" % b }
=> ["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]

# Or change to UTF-16 and do the same
# Notice how it takes a few more bytes
>> "hi 👋🏻".encode("UTF-16").bytes.map{ |b| "0x%x" % b }
=> ["0xfe", "0xff", "0x0", "0x68", "0x0", "0x69", "0x0", "0x20", "0xd8", "0x3d", "0xdc", "0x4b", "0xd8", "0x3c", "0xdf", "0xfb"]

# You can lookup elements like this
>> "hi 👋🏻"['👋🏻']
=> "👋🏻"

# Skipping the skin tone modifier is fine
>> "hi 👋🏻"['👋']
=> "👋"

# But reversing it doesn't quite work
>> "hi 👋🏻".reverse
=> "🏻👋 ih"

# You can look at characters
>> "hi 👋🏻".chars
=> ["h", "i", " ", "👋", "🏻"]

# Or even grapheme cluster
>> "hi 👋🏻".each_grapheme_cluster.to_a
=> ["h", "i", " ", "👋🏻"]
>> "hi 👋🏻".each_grapheme_cluster.reverse_each.to_a
=> ["👋🏻", " ", "i", "h"]

# It's interesting to see how something like the family emoji is constructed
>> "👨‍👩‍👧‍👧".chars
=> ["👨", "‍", "👩", "‍", "👧", "‍", "👧"]
>> "👨‍👩‍👧‍👧".reverse
=> "👧‍👧‍👩‍👨"
>> "👨‍👩‍👧‍👧".each_grapheme_cluster.to_a
=> ["👨‍👩‍👧‍👧"]

Haskell

Haskell has a number of data types that can be used to represent text: String, Text, and ByteString (the latter two also come in lazy and strict flavors, but I’ll spare you those details). Let’s start with String. You could use the OverloadedStrings language pragma, but I’m going to be explicit about the types for clarity (note: λ is just my prompt in ghci). String is literally a list of Char (it’s a type synonym: type String = [Char]), where Char is an enum of all the Unicode code points.

-- The wave with a skin tone is made up of two unicode code points
λ length ("hi 👋🏻" :: String)
5

-- ghci escapes the emoji rather than rendering them
λ reverse ("hi 👋🏻" :: String)
"\127995\128075 ih"

-- index into the string
λ ("hi 👋🏻" :: String) !! 0
'h'
λ ("hi 👋🏻" :: String) !! 3
'\128075'
λ ("hi 👋🏻" :: String) !! 4
'\127995'

-- Deconstruct the list
λ let _:_:_:x:_ = ("hi 👋🏻" :: String)
λ x
'\128075'

-- Check out the code points
λ import Numeric
λ foldr (\x acc -> "U+" <> showHex (fromEnum x) "" : acc ) [] ("hi 👋🏻" :: String)
["U+68","U+69","U+20","U+1f44b","U+1f3fb"]

Notice that the underlying representation is Unicode code points, not UTF-16 or UTF-8 as we’ve seen in the other languages so far.

The trouble with String is that the implementation is very inefficient. You’re better off with Text (though String is in the base library and used all over the place for things like error). The API is similar, and Text operates mostly like a list, but you can’t use list pattern matching in quite the same way.

λ import qualified Data.Text as T
λ T.length ("hi 👋🏻" :: T.Text)
5
λ T.index ("hi 👋🏻" :: T.Text) 3
'\128075'
λ T.index ("hi 👋🏻" :: T.Text) 4
'\127995'
λ T.reverse ("hi 👋🏻" :: T.Text)
"\127995\128075 ih

-- This doesn't work
λ let _:x:_ = ("hi 👋🏻" :: T.Text)
<interactive>:8:14-30: error:
    • Couldn't match expected type ‘[a]’ with actual type ‘T.Text’
    • In the expression: ("hi 👋🏻" :: T.Text)
      In a pattern binding: _ : x : _ = ("hi 👋🏻" :: T.Text)
    • Relevant bindings include x :: a (bound at <interactive>:8:7)

λ T.foldr (\x acc -> "U+" <> showHex (fromEnum x) "" : acc ) [] ("hi 👋🏻" :: T.Text)
["U+68","U+69","U+20","U+1f44b","U+1f3fb"]

ByteString should really just be called Bytes: it’s useful when dealing with true binary data (it can be even more efficient than Text for that reason), and it’s how we can represent UTF-8 or UTF-16 encoded bytes. The Data.Text.Encoding module lets you encode and decode Text to and from those representations. Notice that the types produced are all ByteString.

λ import Data.Text.Encoding
λ import qualified Data.ByteString as B
λ :t encodeUtf8
encodeUtf8 :: T.Text -> Data.ByteString.Internal.ByteString

-- encode to utf8
λ encodeUtf8 ("hi 👋🏻" :: T.Text)
"hi \240\159\145\139\240\159\143\187"

-- or, a bit easier to read bytes in hex
λ B.foldr (\x acc -> "0x" <> showHex x "" : acc) [] $ encodeUtf8 ("hi 👋🏻" :: T.Text)
["0x68","0x69","0x20","0xf0","0x9f","0x91","0x8b","0xf0","0x9f","0x8f","0xbb"]

-- encode to utf16
λ B.foldr (\x acc -> "0x" <> showHex x "" : acc) [] $ encodeUtf16BE ("hi 👋🏻" :: T.Text)
["0x0","0x68","0x0","0x69","0x0","0x20","0xd8","0x3d","0xdc","0x4b","0xd8","0x3c","0xdf","0xfb"]

-- ghci won't let me insert 👨‍👩‍👧‍👧 into a string literal. You can manually spell out the Unicode escapes, but if you paste it in, the REPL just takes the first code point and drops everything else:
λ "👨" :: T.Text
"\128104"

Notice that you have to pick little or big endian for UTF-16 using either encodeUtf16BE or encodeUtf16LE.

Swift

In Swift, we have another interesting point in the design landscape: a String (declared as a struct) is a collection of extended grapheme clusters with views into the various ways we might want to access the data. This actually seems quite sane to me, though I understand there are still some arguments about specific grapheme clusters.

import Cocoa

var str = "hi 👋🏻"

print(str)
print("grapheme count: \(str.count)")
print(Array(str))
print(str.reversed().map{ c in c })

print("unicode code points (scalars) count: \(str.unicodeScalars.count)")
print(str.unicodeScalars.map { i in String(format:"U+%x", i.value) })

print("utf8 count: \(str.utf8.count)")
print(Array(str.utf8).map(toHex))

print("utf16 count: \(str.utf16.count)")
print(Array(str.utf16).map(toHex))

func toHex(_ v: CVarArg) -> String {
    return String(format:"0x%x", v)
}

output

hi 👋🏻
grapheme count: 4
["h", "i", " ", "👋🏻"]
["👋🏻", " ", "i", "h"]
unicode code points (scalars) count: 5
["U+68", "U+69", "U+20", "U+1f44b", "U+1f3fb"]
utf8 count: 11
["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]
utf16 count: 7
["0x68", "0x69", "0x20", "0xd83d", "0xdc4b", "0xd83c", "0xdffb"]

This is fascinating because Swift is the first language on our tour where graphemes are the primary unit (you can get at them in Ruby, but the standard methods use code points).

Go

Go uses UTF-8 for string literals and throughout the standard library, including the strings package. You’re able to both get at the underlying UTF-8 bytes and iterate over the Unicode code points (runes). There’s a unicode/utf16 package if needed. However, you’re on your own if you want to reverse a string (make sure you iterate properly; hint: don’t index with len()).

package main

import "fmt"

func main() {
	str := "hi 👋🏻"
	fmt.Println(str)
	fmt.Printf("bytes: %v\n", []byte(str))
	fmt.Printf("length %v\n", len(str))

	fmt.Println()
	fmt.Println("iterate forward")
	for i, c := range str {
		fmt.Printf("%v: %v\n", i, c)
	}

	fmt.Println()
	fmt.Println("reverse")
	for i := 0; i < len(str); i++ {
		fmt.Printf("%v: %v\n", i, str[i])
	}
}

output

hi 👋🏻
bytes: [104 105 32 240 159 145 139 240 159 143 187]
length 11

iterate forward
0: 104
1: 105
2: 32
3: 128075
7: 127995

index byte-by-byte
0: 104
1: 105
2: 32
3: 240
4: 159
5: 145
6: 139
7: 240
8: 159
9: 143
10: 187

Rust

Rust’s std::string::String uses a UTF-8 representation and guarantees that strings contain only valid UTF-8 (you can use OsString if you must). Also, indexing into a String with an integer is not allowed: doing "hello"[0] is a compile-time error.

// ❯ rustc --version
// rustc 1.40.0 (73528e339 2019-12-16)

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let str = "hi 👋🏻";

    // println!("{}", str[0]); Indexing a str is not allowed, compile time error

    println!("{}", str);
    println!("UTF-8 len: {}", str.len());
    println!("UTF-8 bytes: {:?}", str.bytes().map({|b| return format!("0x{:x}", b)}).collect::<Vec<String>>());
    println!("code points len: {:?}", str.chars().count());
    println!("code points: {:?}", str.chars().collect::<Vec<char>>());

    // have to pull in a library to iterate by grapheme clusters.
    let graphemes = UnicodeSegmentation::graphemes(str, true).collect::<Vec<&str>>();
    println!("graphemes len: {:?}", graphemes.len());
    println!("graphemes: {:?}", graphemes);

    // operations like reverse can be done on any of the specific representations
    println!("graphemes reversed: {:?}", graphemes.into_iter().rev().collect::<Vec<&str>>());
}

output

hi 👋🏻
UTF-8 len: 11
UTF-8 bytes: ["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]
code points len: 5
code points: ['h', 'i', ' ', '👋', '🏻']
graphemes len: 4
graphemes: ["h", "i", " ", "👋🏻"]
graphemes reversed: ["👋🏻", " ", "i", "h"]

Conclusions

I’m not really sure what to conclude here other than to confirm that encodings are hard, particularly because so many programming languages use subtly different (and often leaky) abstractions. On the other hand, I find this kind of delightful and interesting. What’s your favorite way to work with text? What are some of the other interesting data points in the programming language design space for representing and manipulating strings?

[1] Popular encodings include: ASCII, ISO 8859, Windows-1252, Mac OS Roman, Shift JIS, and many, many others.

Building GitHub since 2011, programming language connoisseur, San Francisco resident, aspiring surfer, father of two, life partner to @ktkates—all words by me, Tim Clem.