Notes on UTF-8

UTF-8 is the most popular, to my mind, character encoding invented by Ken Thomson (co-inventor of Unix and Go language) and Rob Pike (co-inventor of Go language). If you use Rust and Python, strings are encoded in UTF-8 by default. Anyway, I got interested in Boyer-Moore string search algorithm. And of course I wanted to implement it for UTF-8 which is most practical. Before I could do that effectively, I had to understand UTF-8 internals.

Description

UTF-8 is pretty well described in RFC 3629. UTF-8 character size varies between 1-4 bytes. The exact size depends on a character, but personally I think that more common characters are encoded with less bytes. E.g. ASCII symbols are encoded with 1 byte, cyrillic, Lithuanian - 2 bytes, Japanese - 3, etc.

The maximum size of UTF-8 character code is 21 bits. Other 11 bits are used as metadata. This table sums up the encoding format:

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

where x are bits used to encode character code.

Example

Let's take character č and encode it to UTF-8. It's encoded with 2 bytes 0xC48D and the character code is 0x10D. We can test this with Rust:

let chars = vec![0xC4, 0x8D];
for c in String::from_utf8(chars).unwrap().chars() {
    println!("{} 0x{:X}", c, c as u32);
}

The output is:

č 0x10D

String operations

UTF-8 strings can be easily concatenated.

Character at position N access complexity is O(N). That's why Rust strings don't have index operator str[] because usually this operator means O(1) complexity.

String search is the same as in ASCII strings. Basically, you can do byte to byte search. Thus Boyer-Moore search should also work with UTF-8 strings without bigger problems.

Iterating UTF-8 strings is a bit more complicated than ASCII. You have to calculate how many bytes each character takes up.

Comments