Understanding Unicode and Bytes
When working with text in programming, it's essential to understand how languages represent characters, even ones that seem simple, like letters and symbols. Behind the scenes, computers don't inherently understand "letters" or "characters." They work with numbers, and those numbers are stored as "bytes," which are just sequences of ones and zeros (binary data).
Bits
A bit is the smallest unit of data in a computer. It is called a binary digit because it can have one of two values: 0 and 1. This is the foundation of how computers store, process, and transmit data. Everything in a computer, from texts and images to sounds and videos, is ultimately represented using bits. Computers can represent more complex data by combining bits in various patterns.
One bit: Can represent two values: 0 or 1.
Two bits: Can represent four values: 00, 01, 10, or 11.
Three bits: Can represent eight values: 000, 001, 010, 011, 100, 101, 110, or 111.
The number of unique values that can be represented by a certain number of bits is calculated as 2^n, where n is the number of bits. For example:
1 bit → 2 possible values (2^1 = 2)
8 bits → 256 possible values (2^8 = 256)
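The 2^n rule is easy to verify with a left bit shift, since shifting 1 left by n positions multiplies it by 2 n times. A minimal Go sketch:

```go
package main

import "fmt"

func main() {
	// 1 << n computes 2^n: the number of unique values n bits can represent.
	for _, n := range []uint{1, 2, 3, 8} {
		fmt.Printf("%d bit(s) -> %d possible values\n", n, 1<<n)
	}
}
```

Running it prints 2, 4, 8, and 256, matching the lists above.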
Bytes Explained
A byte is a unit of digital information typically consisting of 8 bits (each bit is either a 0 or a 1). In programming, a byte can represent a small number but also other types of data. For example, a byte can store a number from 0 to 255, but it can also represent a specific letter, symbol, or character. This mapping between numbers and characters is standardised, which brings us to Unicode.
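As a quick illustration in Go, the same byte value can be read either as a number or as a character. The value 65, for example, maps to the letter "A":

```go
package main

import "fmt"

func main() {
	var b byte = 65 // a byte holds a number from 0 to 255

	fmt.Println(b)        // prints the numeric value: 65
	fmt.Printf("%c\n", b) // prints the same value as a character: A
}
```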
What is Unicode?
Unicode is a global standard that assigns a unique number (called a “code point”) to every character in almost every language in the world, including symbols and emojis. This allows computers to represent and process text consistently across different languages, systems, and platforms.
Each Unicode character has a specific code point, and the character can be represented as one or more bytes. For example:
English letters like “A” or “B” are stored in a single byte.
Complex characters, like Chinese or Japanese symbols, or emojis, may require multiple bytes to represent them.
So, when we talk about "Unicode support," we're referring to a system that can handle text from any language, regardless of how many bytes it takes to represent each character.
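Since Go source strings are UTF-8 encoded, you can see this difference in byte counts directly: the built-in len reports bytes, while converting to a rune slice counts characters (code points). A small sketch:

```go
package main

import "fmt"

func main() {
	// len(s) counts bytes; len([]rune(s)) counts characters (code points).
	for _, s := range []string{"A", "é", "日", "🙂"} {
		fmt.Printf("%q: %d byte(s), %d character(s)\n", s, len(s), len([]rune(s)))
	}
}
```

"A" takes one byte, "é" two, "日" three, and "🙂" four, yet each is a single character.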
Strings as Sequences of Bytes
In programming, a string is a sequence of characters, but under the hood, it’s actually a sequence of bytes. Each character in a string has one or more bytes that represent it.
In Go, a string is effectively a read-only slice of bytes. This means that if you index into a string, you get individual bytes, not characters, which can become tricky when working with multi-byte Unicode characters.
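A short Go sketch of this difference: indexing and len work on bytes, while ranging over a string decodes UTF-8 and yields runes (code points):

```go
package main

import "fmt"

func main() {
	s := "héllo"

	// len counts bytes, so the two-byte "é" makes this 6, not 5.
	fmt.Println(len(s)) // prints 6

	// Ranging over a string decodes UTF-8: r is a rune (code point),
	// and i is the byte index where that rune starts (note the jump after "é").
	for i, r := range s {
		fmt.Printf("byte index %d: %c\n", i, r)
	}
}
```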
Understanding these concepts is essential for working with text data, especially when handling multi-language applications or doing complex string manipulations. Happy coding!