Data in the computer is stored in bytes of 8 bits each. Data is sent out of the computer in bytes of 8 bits each. Data is received into the computer in bytes of 8 bits each.
A stream of bytes can be converted into a stream of sextets (6 bits per symbol); that is base64 encoding. A stream of sextets can be converted back into a stream of bytes; that is base64 decoding. The input is often ASCII text, but any sequence of bytes can be encoded. Because each sextet carries only 6 bits while each octet carries 8, and each sextet is written out as one character, a stream of base64 characters is longer than the stream of bytes it encodes. However, encoding into base64 and decoding from it is not as straightforward as just expressed.
This article explains the encoding and decoding of base64 with the C++ computer language. The first part of the article explains base64 encoding and decoding properly. The second part shows how some C++ features can be used to encode and decode base64. In this article, the words “octet” and “byte” are used interchangeably.
Article Content
- Moving up to Base 64
- Encoding Base64
- New Length
- Decoding Base64
- Transmission Error
- C++ Bit Features
- Conclusion
Moving up to Base 64
An alphabet or character set of 2 symbols can be represented with one bit per symbol. Let the alphabet symbols consist of: zero and one. In this case, zero is bit 0, and one is bit 1.
An alphabet or character set of 4 symbols can be represented with two bits per symbol. Let the alphabet symbols consist of: 0, 1, 2, 3. In this situation, 0 is 00, 1 is 01, 2 is 10, and 3 is 11.
An alphabet of 8 symbols can be represented with three bits per symbol. Let the alphabet symbols consist of: 0, 1, 2, 3, 4, 5, 6, 7. In this situation, 0 is 000, 1 is 001, 2 is 010, 3 is 011, 4 is 100, 5 is 101, 6 is 110 and 7 is 111.
An alphabet of 16 symbols can be represented with four bits per symbol. Let the alphabet symbols consist of: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. In this situation, 0 is 0000, 1 is 0001, 2 is 0010, 3 is 0011, 4 is 0100, 5 is 0101, 6 is 0110, 7 is 0111, 8 is 1000, 9 is 1001, A is 1010, B is 1011, C is 1100, D is 1101, E is 1110 and F is 1111.
An alphabet of 32 different symbols can be represented with five bits per symbol.
This leads us to an alphabet of 64 different symbols, which can be represented with six bits per symbol. There is a particular character set of 64 different symbols, called base64. In this set, the first 26 symbols are the 26 uppercase letters of the English alphabet, in order. They correspond to the binary numbers from 0 to 25, where each number is a sextet (six bits). The binary numbers from 26 to 51 correspond to the 26 lowercase letters of the English alphabet, in order. The binary numbers from 52 to 61 correspond to the 10 decimal digits, 0 to 9, in order.
The binary number for 62 is for the symbol +, and the binary number for 63 is for the symbol /. Base64 has different variants, and some variants use different symbols for 62 and 63; the URL-safe variant, for example, uses - and _ instead.
The base64 table, showing correspondences for the index, binary number, and character, is:
The Base64 Alphabet
Index | Binary | Char | Index | Binary | Char | Index | Binary | Char | Index | Binary | Char |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000000 | A | 16 | 010000 | Q | 32 | 100000 | g | 48 | 110000 | w |
1 | 000001 | B | 17 | 010001 | R | 33 | 100001 | h | 49 | 110001 | x |
2 | 000010 | C | 18 | 010010 | S | 34 | 100010 | i | 50 | 110010 | y |
3 | 000011 | D | 19 | 010011 | T | 35 | 100011 | j | 51 | 110011 | z |
4 | 000100 | E | 20 | 010100 | U | 36 | 100100 | k | 52 | 110100 | 0 |
5 | 000101 | F | 21 | 010101 | V | 37 | 100101 | l | 53 | 110101 | 1 |
6 | 000110 | G | 22 | 010110 | W | 38 | 100110 | m | 54 | 110110 | 2 |
7 | 000111 | H | 23 | 010111 | X | 39 | 100111 | n | 55 | 110111 | 3 |
8 | 001000 | I | 24 | 011000 | Y | 40 | 101000 | o | 56 | 111000 | 4 |
9 | 001001 | J | 25 | 011001 | Z | 41 | 101001 | p | 57 | 111001 | 5 |
10 | 001010 | K | 26 | 011010 | a | 42 | 101010 | q | 58 | 111010 | 6 |
11 | 001011 | L | 27 | 011011 | b | 43 | 101011 | r | 59 | 111011 | 7 |
12 | 001100 | M | 28 | 011100 | c | 44 | 101100 | s | 60 | 111100 | 8 |
13 | 001101 | N | 29 | 011101 | d | 45 | 101101 | t | 61 | 111101 | 9 |
14 | 001110 | O | 30 | 011110 | e | 46 | 101110 | u | 62 | 111110 | + |
15 | 001111 | P | 31 | 011111 | f | 47 | 101111 | v | 63 | 111111 | / |
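Because the table runs through the uppercase letters, the lowercase letters, and the digits in order, a character can also be computed from its index instead of being looked up. The following is only a sketch: the function name indexToChar is hypothetical, and the code assumes an ASCII-compatible character set in which letters and digits are contiguous.

```cpp
// Hypothetical helper: map a sextet value (0..63) to its base64 character.
// Assumes ASCII, where 'A'..'Z', 'a'..'z' and '0'..'9' are contiguous runs.
char indexToChar(unsigned index) {
    if (index < 26) return static_cast<char>('A' + index);         // 0..25  -> A..Z
    if (index < 52) return static_cast<char>('a' + (index - 26));  // 26..51 -> a..z
    if (index < 62) return static_cast<char>('0' + (index - 52));  // 52..61 -> 0..9
    return index == 62 ? '+' : '/';                                // 62 -> +, 63 -> /
}
```

For example, indexToChar(25) gives Z and indexToChar(61) gives 9, as in the table.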
Padding =
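Base64 output can actually contain a 65th symbol, the padding character =. It is not part of the 64-symbol alphabet and has no index or sextet value of its own; it is only appended to the output to complete a final group of four characters. Its ASCII code is 61 (00111101), whose low six bits, 111101, happen to equal the index of the base64 symbol 9. There is no conflict, because the symbol 9 is written to the output as the ASCII character 9 (00111001), not as = (see below).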
Encoding Base64
Sextet bit-fields
Consider the word:
dog
There are three ASCII bytes for this word, which are:
01100100 01101111 01100111
When joined, these 24 bits form 3 octets, but they also consist of 4 sextets, as follows:
011001 000110 111101 100111
From the base64 alphabet table above, these 4 sextets (25, 6, 61, and 39) are the symbols:
ZG9n
Notice that the encoding of “dog” into base64 is “ZG9n”, which is not human-readable.
Base64 encodes a sequence of 3 octets (bytes) into a sequence of 4 sextets. 3 octets or 4 sextets are 24 bits.
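The regrouping of 24 bits from 3 octets into 4 sextets can be sketched with shifts and masks on one 32-bit integer. This is a minimal illustration, not the approach developed later in the article; the function name toSextets is an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: split three octets into four sextet values (each 0..63).
std::array<std::uint8_t, 4> toSextets(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    // Join the three octets into one 24-bit group.
    std::uint32_t group = (std::uint32_t(b0) << 16) | (std::uint32_t(b1) << 8) | b2;
    // Cut the group into four 6-bit pieces; 0x3F is 111111.
    return {{
        static_cast<std::uint8_t>((group >> 18) & 0x3F),
        static_cast<std::uint8_t>((group >> 12) & 0x3F),
        static_cast<std::uint8_t>((group >> 6) & 0x3F),
        static_cast<std::uint8_t>(group & 0x3F)
    }};
}
```

Calling toSextets('d', 'o', 'g') yields 25, 6, 61, and 39, which are the indexes of Z, G, 9, and n.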
Consider now the following word:
it
There are two ASCII octets for this word, which are:
01101001 01110100
When joined, these 16 bits form 2 octets, but they consist of 2 sextets and 4 leftover bits. A stream of base64 characters is made up of sextets (6 bits per character). So, two zero bits have to be appended to these 16 bits to have 3 sextets, that is:
011010 010111 010000
That is not all. A base64 sequence is made up of groups of 4 characters, each group standing for 24 bits. The 18 bits above give only three characters (a, X, and Q). So, one padding character, =, is appended to complete the group of four, that is:
aXQ=
The padding character carries no data. The third character, Q, comes from a sextet that holds the last 4 bits of the second octet, followed by two zero bits.
Now, consider the following one-character word:
I
There is one ASCII octet for this word, which is:
01001001
This is 1 octet, but it consists of 1 sextet and 2 leftover bits. A stream of base64 characters is made up of sextets (6 bits per character). So, four zero bits have to be appended to these 8 bits to have 2 sextets, that is:
010010 010000
That is not all. A base64 sequence is made up of groups of 4 characters. The 12 bits above give only two characters (S and Q). So, two padding characters have to be appended to complete the group of four, that is:
SQ==
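The two short cases can be sketched in one function that encodes a final group of one or two bytes and completes it with = characters. The name encodeTail is hypothetical, and the function deliberately handles only these two cases.

```cpp
#include <string>

// Hypothetical helper: encode a final group of 1 or 2 bytes into 4 characters.
std::string encodeTail(const std::string& tail) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    if (tail.empty() || tail.size() > 2) return "";  // only 1- or 2-byte tails are handled

    unsigned char b0 = static_cast<unsigned char>(tail[0]);
    // A missing second byte is treated as zero bits.
    unsigned char b1 = static_cast<unsigned char>(tail.size() == 2 ? tail[1] : 0);

    std::string out;
    out += alphabet[b0 >> 2];                         // first 6 bits of b0
    out += alphabet[((b0 & 0x03) << 4) | (b1 >> 4)];  // last 2 bits of b0, first 4 bits of b1
    if (tail.size() == 2)
        out += alphabet[(b1 & 0x0F) << 2];            // last 4 bits of b1, then two zero bits
    else
        out += '=';                                   // no third sextet: pad
    out += '=';                                       // the fourth character is always padding here
    return out;
}
```

With this sketch, encodeTail("it") returns "aXQ=" and encodeTail("I") returns "SQ==".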
Output Stream of Base64
In the program, an array-of-chars of the base64 alphabet has to be made, where index 0 has the character of 8 bits, A; index 1 has the character of 8 bits, B; index 2 has the character of 8 bits, C, until index 63 has the character of 8 bits, / .
So, the output for the word of three characters, “dog”, will be “ZG9n” of four bytes, expressed in bits as
01011010 01000111 00111001 01101110
where Z is 01011010 of 8 bits; G is 01000111 of 8 bits; 9 is 00111001 of 8 bits, and n is 01101110 of 8 bits. This means that from three bytes of the original string, four bytes are outputted. These four bytes are values of the base64 alphabet array, where each value is a byte.
The output for the word of two characters, “it”, will be “aXQ=” of four bytes, expressed in bits as
01100001 01011000 01010001 00111101
where the first three bytes are obtained from the array and the last byte is the padding character. This means that from two bytes, four bytes are still outputted.
The output for the word of one character, “I”, will be “SQ==” of four bytes, expressed in bits as
01010011 01010001 00111101 00111101
This means that from one byte, four bytes are still outputted.
A sextet of 61 (111101) is outputted as 9 (00111001). The padding character is not looked up in the array; it is outputted directly as the ASCII character = (00111101).
New Length
There are three situations to consider here to have an estimate for the new length.
- The original length of the string is a multiple of 3, e.g., 3, 6, 9, 12, 15, etc. In this case, the new length will be exactly 133.33% of the original length because three octets end up as four octets.
- The original length of the string is two bytes long, or it ends with two bytes, after a multiple of 3. In this case, the new length will be above 133.33% of the original length because a string part of two octets ends up as four octets.
- The original length of the string is one byte long, or it ends with one byte after a multiple of 3. In this case, the new length will be above 133.33% of the original length (more above than the previous case), because a string part of one octet ends up as four octets.
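All three situations reduce to one formula: every started group of three input bytes produces four output bytes. A minimal sketch, with the hypothetical name encodedLength and ignoring any line separators:

```cpp
#include <cstddef>

// Hypothetical helper: length of the base64 output for n input bytes,
// not counting line separators. (n + 2) / 3 rounds n / 3 up.
std::size_t encodedLength(std::size_t n) {
    return 4 * ((n + 2) / 3);
}
```

So 3 bytes give 4 characters, while 1, 2, or 3 bytes all give 4, and 4 bytes give 8.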
Maximum Length of Line
After going from the original string through the base64 alphabet array and ending up with at least 133.33% as many octets, the MIME variant of base64 requires that no output line be more than 76 characters long. When an output line reaches 76 characters, a line separator has to be added before the next characters are added. A long output string therefore consists of lines of 76 characters each, except the last, which may be shorter. The line separator programmers use is likely the newline character, ‘\n’; but it is supposed to be “\r\n”.
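Breaking an encoded string into lines can be sketched as follows. The function name wrapLines and its width parameter are assumptions made for illustration; the separator used is “\r\n”.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: insert "\r\n" after every `width` characters
// (76 by default), with no separator after the last line.
std::string wrapLines(const std::string& encoded, std::size_t width = 76) {
    if (width == 0) return encoded;  // avoid an endless loop on a zero width
    std::string out;
    for (std::size_t pos = 0; pos < encoded.size(); pos += width) {
        if (pos != 0) out += "\r\n";
        out += encoded.substr(pos, width);  // substr stops at the end of the string
    }
    return out;
}
```

An 80-character string becomes one line of 76 characters, a separator, and a last line of 4 characters.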
Decoding Base64
To decode, do the reverse of encoding. Use the following algorithm:
- If the received string is longer than 76 characters (octets), split the long string into an array of strings, removing the line separator, which may be “\r\n” or ‘\n’.
- If there is more than one line, all the lines except the last have 76 characters each, which means they consist of complete groups of four characters each. Each group results in three original characters: the four characters are first converted to four sextets, using the base64 alphabet array in reverse, and the 24 bits of the four sextets are then regrouped into three octets.
- The last line, or the only line the string might have had, still consists of groups of four characters. The last group of four characters may result in one, two, or three characters. If the last two characters of the group are both =, the group results in one character. If only the last character is =, the group results in two characters. If there is no =, the group results in three characters. Any group of four characters before this last group is handled as in the previous step.
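The following is a minimal decoding sketch. It departs from the steps above in one respect, which is an assumption of this example: instead of splitting the input into lines first, it walks the characters one by one, skips anything that is not in the alphabet (including line separators), and stops at the first =. The names charToIndex and decodeBase64 are hypothetical, and an ASCII-compatible character set is assumed.

```cpp
#include <string>

// Hypothetical helper: sextet value of a base64 character, or -1 if not in the alphabet.
int charToIndex(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Hypothetical helper: decode well-formed base64 text back into bytes.
std::string decodeBase64(const std::string& input) {
    std::string out;
    unsigned buffer = 0;  // collects sextets; only the low `bits` bits are meaningful
    int bits = 0;         // number of bits collected but not yet output
    for (char c : input) {
        if (c == '=') break;    // padding marks the end of the data
        int value = charToIndex(c);
        if (value < 0) continue;  // skips '\r', '\n' and any other non-alphabet character
        buffer = (buffer << 6) | static_cast<unsigned>(value);
        bits += 6;
        if (bits >= 8) {          // a full octet is available
            bits -= 8;
            out += static_cast<char>((buffer >> bits) & 0xFF);
        }
    }
    return out;  // leftover bits (the appended zero bits) are dropped
}
```

With this sketch, “ZG9n” decodes to “dog”, “aXQ=” to “it”, and “SQ==” to “I”.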
Transmission Error
At the receiving end, any character that is neither part of the line separator nor a value of the base64 alphabet array indicates a transmission error and should be handled. Handling transmission errors is not addressed in this article. Note: The presence of the padding byte, =, at the end of the encoded data is not a transmission error.
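Although error handling is outside the scope of the article, detecting a stray character is simple. The sketch below only checks which characters are present; it does not check where = appears or whether the length is a multiple of four. The name looksLikeBase64 is hypothetical.

```cpp
#include <string>

// Hypothetical helper: true if every character is a base64 alphabet
// character, the padding character '=', or part of a line separator.
bool looksLikeBase64(const std::string& received) {
    static const std::string allowed =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=\r\n";
    return received.find_first_not_of(allowed) == std::string::npos;
}
```

A string such as “aX*=” fails the check because * is not in the alphabet.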
C++ Bit Features
Members of a struct that have an integer type can be declared as bit-fields, each given a number of bits other than 8. The following program illustrates this:
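#include <iostream>
using namespace std;

// Four bit-field members of 6 bits each: one group of four sextets.
struct S3 {
    unsigned int a:6;
    unsigned int b:6;
    unsigned int c:6;
    unsigned int d:6;
}s3;

int main()
{
    // The sextet values for "dog": Z, G, 9, n.
    s3.a = 25;
    s3.b = 6;
    s3.c = 61;
    s3.d = 39;
    cout<<s3.a<<", "<<s3.b<<", "<<s3.c<<", "<<s3.d <<endl;
    return 0;
}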
The output is:
25, 6, 61, 39
The output integers are as assigned. However, each occupies 6 bits in the memory and not 8 or 32 bits. Note how the number of bits is assigned, in the declaration, with the colon.
Extracting First 6 Bits from Octet
C++ does not have a function or operator to extract the first set of bits from an octet. To extract the first 6 bits, right-shift the content of the octet by 2 places. The vacated two bits on the left end are filled with zeros. The resulting octet, which should be an unsigned char, is now an integer, represented by the first 6 bits of the octet. Then assign the resulting octet to a struct bit-field member of 6 bits. The right shift operator is >>, not to be confused with the extraction operator of the cout object.
Assuming that the struct 6-bit bit-field member is s3.a, then the first 6 bits of the character ‘d’ are extracted as follows:
unsigned char ch1 = 'd';
ch1 = ch1 >>2;
s3.a = ch1;
The value of s3.a can now be used for indexing the base64 alphabet array.
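The same step can be written as a small function that returns the sextet directly, without going through a bit-field. This is only a sketch; the name firstSextet is an assumption.

```cpp
// Hypothetical helper: the first 6 bits of an octet, as a value from 0 to 63.
unsigned char firstSextet(unsigned char octet) {
    // Right-shifting by 2 drops the last two bits and fills the left end with zeros.
    return static_cast<unsigned char>(octet >> 2);
}
```

For ‘d’ (01100100) the result is 011001, which is 25, the index of Z.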
Producing Second Sextet from 3 Characters
The second six bits consist of the last two bits of the first octet and the first 4 bits of the second octet. The idea is to get the last two bits of the first octet into the fifth and sixth positions of its octet (counting from the right) and make the rest of the octet’s bits zero; then bit-wise OR it with the first four bits of the second octet, which have been right-shifted to the octet's end.
Left-shifting the last two bits to the fifth and sixth positions is done by the bit-wise left-shift operator, <<, which is not to be confused with the cout insertion operator. The following code segment left-shifts the last two bits of ‘d’ to the fifth and sixth positions:
unsigned char i = 'd';
i = i <<4;
At this point, the vacated bits have been filled with zeros, while the non-vacated shifted bits that are not required are still there. To make the rest of the bits in i zero, i has to be bit-wise ANDed with 00110000, which is the integer 48. The following statement does it:
i = i & 48;
The following code segment shifts the first four bits of the second octet to the last four bit positions:
unsigned char j = 'o';
j = j >>4;
The vacated bits have been filled with zeros. At this point, i has 8 bits, and j has 8 bits. All the 1’s in these two unsigned chars are now in their right positions. To get the char for the second sextet, these two 8-bit chars have to be bit-wise ORed, as follows:
unsigned char ch2 = i | j;
ch2 still has 8 bits. To make it six bits, it has to be assigned to a struct bit-field member of 6 bits. If the struct bit-field member is s3.b, then the assignment will be done as follows:
s3.b = ch2;
Henceforth, s3.b will be used instead of ch2 to index the base64 alphabet array.
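Put together as one function, the second sextet might be computed as sketched below. The name secondSextet is hypothetical. Note that the two parts are combined with bit-wise OR, and that the mask 00110000 is the integer 48.

```cpp
// Hypothetical helper: the second sextet, built from the last 2 bits of the
// first octet and the first 4 bits of the second octet.
unsigned char secondSextet(unsigned char first, unsigned char second) {
    // Move the last two bits of `first` left by 4 and clear everything else (48 is 00110000).
    unsigned char i = static_cast<unsigned char>((first << 4) & 48);
    // Move the first four bits of `second` to the last four positions.
    unsigned char j = static_cast<unsigned char>(second >> 4);
    // The set bits of i and j never overlap, so OR joins them.
    return static_cast<unsigned char>(i | j);
}
```

For ‘d’ and ‘o’ the result is 6 (G); for ‘i’ and ‘t’ it is 23 (X).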
Adding Two Zeros for Third Sextet
When the sequence to be encoded has two characters, two zero bits have to be added to the third sextet. Assume an octet is already prefixed by two zero bits, and the next four bits are the right bits. In order to make the last two bits of this octet two zeros, bit-wise AND the octet with 11111100, which is the integer 252, and store the result in ch3.
The last six bits of ch3 are now the required bits, though ch3 still consists of 8 bits. To make it six bits, it has to be assigned to a struct bit-field member of 6 bits. If the struct bit-field member is s3.c, then the assignment will be done as follows:
s3.c = ch3;
Henceforth, s3.c will be used instead of ch3 to index the base64 alphabet array.
The rest of the bit handling can be done as explained in this section.
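As a sketch of that remaining bit handling, the third and fourth sextets can be produced with the same shift-and-mask approach. The names thirdSextet and fourthSextet are hypothetical; passing 0 as the third octet covers the two-character case, where two zero bits are added.

```cpp
// Hypothetical helper: the third sextet, built from the last 4 bits of the
// second octet and the first 2 bits of the third octet.
unsigned char thirdSextet(unsigned char second, unsigned char third) {
    // 60 is 00111100: keep the last four bits of `second`, moved left by 2.
    return static_cast<unsigned char>(((second << 2) & 60) | (third >> 6));
}

// Hypothetical helper: the fourth sextet is the last 6 bits of the third octet.
unsigned char fourthSextet(unsigned char third) {
    return static_cast<unsigned char>(third & 63);  // 63 is 00111111
}
```

For “dog”, thirdSextet('o', 'g') is 61 (9) and fourthSextet('g') is 39 (n); for “it”, thirdSextet('t', 0) is 16 (Q).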
Base64 Alphabet Array
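For encoding, the array should hold the 64 characters of the base64 alphabet in index order, from A at index 0 to / at index 63.
Decoding is the reverse process, from character to index. So, an unordered map, with the characters as keys and the indexes as values, should be used for this structure.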
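One possible shape for these two structures is sketched below. The names kAlphabet and buildDecodeMap are assumptions made for this example; a std::string is used in place of a plain array of chars because it can be indexed in the same way.

```cpp
#include <string>
#include <unordered_map>

// The base64 alphabet in index order: kAlphabet[0] is 'A', kAlphabet[63] is '/'.
const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: build the reverse lookup, from character to index.
std::unordered_map<char, int> buildDecodeMap() {
    std::unordered_map<char, int> indexOf;
    for (int i = 0; i < 64; ++i) {
        indexOf[kAlphabet[i]] = i;
    }
    return indexOf;  // '=' is deliberately absent: it is padding, not a symbol
}
```

The map would normally be built once and reused for every character decoded.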
The String Class
The string class should be used for the whole unencoded and encoded sequences. The rest of the programming is normal C++ programming.
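As one way of tying the pieces together with the string class, a complete encoder might look like the sketch below. It uses shifts and masks directly rather than struct bit-fields, does not insert line separators, and the name encodeBase64 is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode any byte string to base64 with '=' padding.
std::string encodeBase64(const std::string& input) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(4 * ((input.size() + 2) / 3));

    for (std::size_t i = 0; i < input.size(); i += 3) {
        std::size_t rest = input.size() - i;  // bytes left in the input, at least 1
        unsigned char b0 = static_cast<unsigned char>(input[i]);
        // Missing bytes in the last group are treated as zero bits.
        unsigned char b1 = static_cast<unsigned char>(rest > 1 ? input[i + 1] : 0);
        unsigned char b2 = static_cast<unsigned char>(rest > 2 ? input[i + 2] : 0);

        out += alphabet[b0 >> 2];
        out += alphabet[((b0 & 0x03) << 4) | (b1 >> 4)];
        // The third and fourth characters become padding when their bytes are missing.
        out += rest > 1 ? alphabet[((b1 & 0x0F) << 2) | (b2 >> 6)] : '=';
        out += rest > 2 ? alphabet[b2 & 0x3F] : '=';
    }
    return out;
}
```

With this sketch, “dog” encodes to “ZG9n”, “it” to “aXQ=”, and “I” to “SQ==”.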
Conclusion
Base64 is a character set of 64 characters, where each character stands for 6 bits. For encoding, every three bytes of the original string are converted into four sextets of 6 bits each. These sextets are used as indexes into the base64 alphabet table for encoding. If the sequence, or its final part, consists of two characters, four output characters are still obtained, with the last one being the padding character, = (ASCII code 61). If it consists of one character, four output characters are still obtained, with the last two being padding characters.
Decoding does the reverse.