UTF-8 Conversions with BitRank

July, 2023 ∙ 6 minute read

One problem I’ve run into a number of times over the years working on parsers and information retrieval systems is the need to identify positions in a document by multiple units. These positions are used, for example, to identify the location of a cursor, to highlight a matching term, to denote an identifier for lookup, or to do syntax highlighting. On a backend system, it is desirable to store only the most compact and efficient representation (e.g. UTF-8 and UTF-8 byte offsets), but frontends will often deal with characters (code points) and/or graphemes, and sometimes even represent strings with different encodings. This means there has to be a way to convert between UTF-8 byte offsets and code points, or in some cases between UTF-8 code units (bytes) and UTF-16 code units (words).

In blackbird, the code search engine I’ve been working on for the past few years, we store all documents as UTF-8 and represent the positions and spans of text in those documents as UTF-8 byte offsets. This is motivated by two things:

However, once search results appear in the frontend, we need to be able to identify positions by character offsets (not bytes) and by the more human-friendly line/column coordinates. This is what you expect out of a text editor or when viewing a blob on github.com. Additionally, our frontend is written in JavaScript, which uses UTF-16 to represent strings, so there is an encoding mismatch between the backend and the frontend, rendering UTF-8 byte offsets not so helpful. (Even the LSP specification used to require, and still defaults to, identifying spans with UTF-16 code units.)

The question is: how do you efficiently convert between these different unit systems?

Representing strings in UTF-8 and UTF-16

Here’s an example string so that we can visualize the differences between the two encodings and the units we want to convert between:

h✅😆x

Below is how this string is represented in memory in UTF-8. The ✅ emoji takes 3 bytes and 😆 takes 4 bytes to represent. NOTE: In UTF-8, each Unicode code point is represented using one to four one-byte (8-bit) code units.

   h |             ✅ |                  😆 |    x |
0x68 | 0xe2 0x9c 0x85 | 0xf0 0x9f 0x98 0x86 | 0x78 | <- hex representation (utf-8)
   0 |    1    2    3 |    4    5    6    7 |    8 | <- byte index
   0 |    1    2    3 |    4    5    6    7 |    8 | <- code unit offset (utf-8, so same as byte index)
   0 |              1 |                   2 |    3 | <- code point offset

And here’s what it looks like in UTF-16. The ✅ emoji takes 2 bytes (one UTF-16 code unit) and 😆 takes 4 bytes (two UTF-16 code units) to represent. NOTE: In UTF-16, each Unicode code point is represented with one or two 16-bit code units.

     h |     ✅ |            😆 |      x |
0x0068 | 0x2705 | 0xd83d 0xde06 | 0x0078 | <- hex representation (utf-16)
   0 1 |    2 3 |    4 5    6 7 |    8 9 | <- byte index
     0 |      1 |      2      3 |      4 | <- code unit offset (utf-16 uses words)
     0 |      1 |             2 |      3 | <- code point offset
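
Quick sanity check on these diagrams: Rust’s standard library can count all three unit systems for us. This throwaway test is mine, purely for illustration:

#[test]
fn unit_counts_for_the_example_string() {
    let s = "h✅😆x";
    assert_eq!(s.len(), 9);                  // UTF-8 code units (bytes)
    assert_eq!(s.chars().count(), 4);        // Unicode code points
    assert_eq!(s.encode_utf16().count(), 5); // UTF-16 code units (16-bit words)
}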

It’s clear now that locating, for example, the "x" by character offset (character meaning a Unicode code point), or in a UTF-16 encoded version of the string, is going to require a linear scan of the content, due to the variable-length encoding and the fact that converting between these units is content dependent. I’m sure you can think of a number of naive approaches to enable this conversion: lookup tables, re-encoding and measuring string lengths, ad hoc linear scanning when you need to, etc.; but there’s a clever solution that my colleague Alexander Neubeck has come up with called BitRank.

BitRank

BitRank allows efficient RANK operations on bit vectors. Bit vectors are a well-known succinct data structure that provides lossless compression close to the information-theoretic lower bound while still allowing efficient query operations. We use them in a number of places in blackbird.

For a bit vector, we define the operation rank(i) as the number of set bits (1s) in the range [0, i) (i.e. an exclusive rank). There’s some additional good background on rank (and select) in Rank-Select optimizations in BitMagic Library by Anatoliy Kuznetsov, but our implementation and use case are quite different.
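
To pin down the semantics, here’s a deliberately naive exclusive rank over a plain slice of bools. The sketch (and the rank_exclusive name) is mine, not blackbird’s API; a real BitRank structure precomputes per-block counts so a query runs in constant time instead of scanning:

/// Naive exclusive rank: the number of set bits in bits[0..i).
fn rank_exclusive(bits: &[bool], i: usize) -> usize {
    bits[..i].iter().filter(|&&b| b).count()
}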

Converting between UTF-8 byte offsets and code points

What we’re going to do is scan the content once to build up a couple of BitRank data structures. These will then let us efficiently convert 1) UTF-8 code units (byte offsets) to code points and 2) UTF-8 code units (byte offsets) to UTF-16 code units (word offsets).

[h, ✅,       😆,x]
[0,1,2,3,4,5,6,7,8] - byte idx
[1,0,0,1,0,0,0,1,1] - bit vector for utf8 -> code points
[1,0,0,1,1,0,0,1,1] - bit vector for utf8 code units -> utf16 code units

[  h, ✅,     😆,  x] - what this string would look like in utf-16
[0,1,2,3,4,5,6,7,8,9] - byte idx

NOTE: These data structures are going to be ~1/8 the size of the content (one bit per byte of content), plus maybe 10-20% overhead, depending on the BitRank implementation.

Now, we can compute ranks over these bit vectors and use them to convert from UTF-8 byte offsets to code point or UTF-16 code unit offsets as needed. For example, given byte index 8, what is the Unicode code point offset? We find the exclusive rank of the element at index 8 in the first bit vector; that rank is our code point offset.

[h,   ✅,     😆,x]
[0,1,2,3,4,5,6,7,8] - byte idx
                 ^
                 | byte idx: 8

[1,0,0,1,0,0,0,1,1] - bit vector for utf8 -> code points
[1,1,1,2,2,2,2,3,4] - rank
[0,1,1,1,2,2,2,2,3] - rank (exclusive)
                 ^
                 | idx: 8, rank: 3, code point offset is 3

[h,   ✅,     😆,x]
[0,1,2,3,4,5,6,7,8] - byte idx
[0,    1,      2,3] - code point offset
                 ^
                 | sure enough, our math checks out
                 | for byte idx: 8, the code point offset is 3
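
In code, using the naive rank_exclusive sketch from the BitRank section above (again, purely illustrative):

#[test]
fn byte_offset_8_maps_to_code_point_3() {
    // The utf8 -> code points bit vector from the diagram.
    let utf8_to_code_points = [true, false, false, true, false, false, false, true, true];
    assert_eq!(rank_exclusive(&utf8_to_code_points, 8), 3);
}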

OK, now let’s convert UTF-8 code units to UTF-16 code units at the same original offset. Again, we find the rank of the element at index 8, but this time looking in the UTF-8 to UTF-16 bit vector.

[  h, ✅,     😆,  x] - utf16 (takes one more byte to represent)
[h,   ✅,     😆,x] - utf8
[0,1,2,3,4,5,6,7,8] - byte idx
[1,0,0,1,1,0,0,1,1] - bit vector for utf8 code units -> utf16 code units
[1,1,1,2,3,3,3,4,5] - rank
[0,1,1,1,2,3,3,3,4] - rank (exclusive)
                 ^
                 | idx: 8, rank: 4, utf16 code unit offset is 4

[  h, ✅,     😆,  x] - utf16
[0,1,2,3,4,5,6,7,8,9] - byte idx
[  0,  1,  2,  3,  4] - utf16 code unit offset
                   ^
                   | check our math again:
                   | for byte idx: 8 in the utf8 repr,
                   | the utf16 code unit offset is 4
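
And the same lookup against the second bit vector, plus a check against Rust’s own UTF-16 encoder (still just an illustrative test of mine):

#[test]
fn byte_offset_8_maps_to_utf16_code_unit_4() {
    // The utf8 code units -> utf16 code units bit vector from the diagram.
    let utf8_to_utf16 = [true, false, false, true, true, false, false, true, true];
    assert_eq!(rank_exclusive(&utf8_to_utf16, 8), 4);
    // UTF-16 code unit 4 of the string is indeed the "x" (0x0078).
    assert_eq!("h✅😆x".encode_utf16().nth(4), Some(0x0078));
}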

You can also easily go the other way (from character offset to byte position). Given the code point offset 3 (the "x"): what is the UTF-8 byte offset in the original UTF-8 content?

[h,   ✅,     😆,x] - utf8
[0,    1,      2,3] - code point offset
                 ^
                 | what's the byte offset of the "x" at line: 0, col: 3?

Instead of rank, we now perform a select operation: find the first element that has rank 3; its index is the byte offset we’re looking for. (There’s a small naive sketch of select after the diagrams below.)

[1,0,0,1,0,0,0,1,1] - bit vector for utf8 -> code points
[1,1,1,2,2,2,2,3,4] - rank
[0,1,1,1,2,2,2,2,3] - rank (exclusive)
                 ^
                 | rank: 3 is at idx: 8

[h,   ✅,     😆,x] - utf8
[0,1,2,3,4,5,6,7,8] - byte idx
                 ^
                 | byte offset: 8
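
Here’s the same sort of naive sketch for select, just to make the operation concrete. As before, the name and the linear scan are mine; a real implementation reuses precomputed counts instead of scanning:

/// Naive select: the smallest index i whose exclusive rank equals k. For the
/// code point bit vector, that is the byte offset where the k-th character
/// (0-indexed) begins.
fn select(bits: &[bool], k: usize) -> Option<usize> {
    if k == 0 {
        return Some(0);
    }
    let mut seen = 0;
    for (i, &bit) in bits.iter().enumerate() {
        if bit {
            seen += 1;
            if seen == k {
                // The position just after the k-th set bit.
                return Some(i + 1);
            }
        }
    }
    None
}

#[test]
fn code_point_3_starts_at_byte_8() {
    let utf8_to_code_points = [true, false, false, true, false, false, false, true, true];
    assert_eq!(select(&utf8_to_code_points, 3), Some(8));
}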

Building the bit vectors

Building the bit vectors is straightforward. We iterate through the bytes of the content, jumping forward by the byte_len of each UTF-8 character. One of the cool things about UTF-8 is that it’s designed so that you can quickly determine the length of the character just by reading the first byte (first nibble really).

/// Returns the number of bytes this utf8 char occupies given the first byte of the utf8 encoding.
/// Returns 0 if the byte is not a valid first byte of a utf8 char.
fn utf8_width(c: u8) -> usize {
    // Each nibble of this constant holds the sequence length for the corresponding
    // value of the first byte's high nibble: 0x0-0x7 => 1 (ASCII), 0x8-0xB => 0
    // (continuation byte, not a valid first byte), 0xC-0xD => 2, 0xE => 3, 0xF => 4.
    const UTF8_WIDTH: u64 = 0x4322_0000_1111_1111;
    ((UTF8_WIDTH >> ((c >> 4) * 4)) & 0xf) as usize
}
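
For example, checking it against the first bytes from the example string in the diagrams above (an illustrative test of mine, not from the post):

#[test]
fn widths_of_the_example_first_bytes() {
    assert_eq!(utf8_width(0x68), 1); // 'h'
    assert_eq!(utf8_width(0xe2), 3); // first byte of ✅
    assert_eq!(utf8_width(0xf0), 4); // first byte of 😆
    assert_eq!(utf8_width(0x9c), 0); // a continuation byte is not a valid first byte
}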

For the UTF-8 to code point conversion, you set a bit at the final byte of each character, as in the bit vectors above. For the UTF-8 to UTF-16 code unit conversion, you do the same, with one extra step: a character that takes 4 bytes in UTF-8 requires two UTF-16 code units (a surrogate pair), so it gets one additional bit. You can set that extra bit before the boundary bit that’s always set (in the example above it lands on the character’s first byte).
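
Here’s roughly what that construction looks like, using plain Vec<bool> in place of the real BitRank builder. The function name and types are mine and blackbird’s actual implementation differs, but it produces exactly the two bit vectors from the diagrams above:

/// Build the two bit vectors described above, one bit per byte of content.
fn build_bit_vectors(content: &[u8]) -> (Vec<bool>, Vec<bool>) {
    let mut utf8_to_code_points = vec![false; content.len()];
    let mut utf8_to_utf16 = vec![false; content.len()];
    let mut i = 0;
    while i < content.len() {
        // Treat an invalid first byte as a 1-byte step so we always make progress.
        let width = utf8_width(content[i]).max(1);
        // Mark the final byte of the character: the exclusive rank at a character's
        // first byte then equals its code point (or UTF-16 code unit) offset.
        let last = (i + width - 1).min(content.len() - 1);
        utf8_to_code_points[last] = true;
        utf8_to_utf16[last] = true;
        if width == 4 {
            // Code points outside the BMP need a surrogate pair (two UTF-16 code
            // units), so they get one extra bit in the UTF-16 vector.
            utf8_to_utf16[i] = true;
        }
        i += width;
    }
    (utf8_to_code_points, utf8_to_utf16)
}

#[test]
fn bit_vectors_match_the_diagrams() {
    let (to_code_points, to_utf16) = build_bit_vectors("h✅😆x".as_bytes());
    assert_eq!(to_code_points, vec![true, false, false, true, false, false, false, true, true]);
    assert_eq!(to_utf16, vec![true, false, false, true, true, false, false, true, true]);
}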

Back to work!

And that’s about it! A clever data structure with excellent space and time complexity, happily working away behind the scenes to keep code search delightfully fast.



Special thanks to Alexander Neubeck for introducing me to the subject, writing our internal BitRank implementation, and helping me improve the utf8/16 conversion code. Also thanks to Rick Winfrey for reading and giving feedback on my early drafts of this post.

Building GitHub since 2011, programming language connoisseur, San Francisco resident, aspiring surfer, father of two, life partner to @ktkates—all words by me, Tim Clem.