UTF-8 is a variable-length character encoding and one implementation of Unicode, which is also known as the universal code. UTF-8 uses 1 to 4 bytes to encode each character, which saves storage space compared with a fixed-width four-byte encoding of Unicode such as UTF-32. The correspondence between UTF-8 byte length and Unicode code point range is as follows:
One byte (0x00-0x7F) -> U+0000 to U+007F
Two bytes (0xC280-0xDFBF) -> U+0080 to U+07FF
Three bytes (0xE0A080-0xEFBFBF) -> U+0800 to U+FFFF
Four bytes (0xF0908080-0xF48FBFBF) -> U+10000 to U+10FFFF
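The ranges above can be checked with a short Python sketch. The function name utf8_length is made up for this illustration and is not part of any library; the result is compared with Python's built-in str.encode as a sanity check.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a Unicode code point.

    Hypothetical helper for illustration; thresholds follow the table above.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


if __name__ == "__main__":
    for ch in ("A", "é", "中", "😀"):
        # Print the table-based length next to the built-in encoder's length.
        print(f"U+{ord(ch):04X}", utf8_length(ord(ch)), len(ch.encode("utf-8")))
```

For every character in the loop, the two printed lengths should agree.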
The characters U+0000 to U+007F (ASCII) are encoded as the single bytes 0x00 to 0x7F, so UTF-8 is ASCII-compatible. This means that a file containing only 7-bit ASCII characters is byte-for-byte identical in ASCII and in UTF-8.
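All characters above U+007F are encoded as a sequence of multiple bytes, and each byte carries marker bits in its high-order positions: the lead byte begins with 110, 1110, or 11110 to indicate a two-, three-, or four-byte sequence, and every continuation byte begins with 10. Most common Chinese characters are encoded in three bytes.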
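To make the marker bits concrete, here is a minimal sketch of a manual encoder in Python. The function name utf8_encode is an assumption for this example; in practice the built-in str.encode("utf-8") does this work. The lead byte carries the prefix 110, 1110, or 11110 depending on the sequence length, and each continuation byte carries the prefix 10 followed by six payload bits.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative only)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # One byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


if __name__ == "__main__":
    # The Chinese character U+4E2D falls in the three-byte range.
    encoded = utf8_encode(0x4E2D)
    print(encoded.hex(" "))                      # e4 b8 ad
    print(encoded == "中".encode("utf-8"))       # True
```

The three bytes e4 b8 ad are 11100100 10111000 10101101 in binary, which shows the 1110 lead prefix and the two 10 continuation prefixes.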