Unicode in C

Unicode is a globally used character encoding standard. It assigns a unique code to every character in every writing system in the world. Many other encoding standards exist, but none of them covers all of the world’s languages. Unicode fills that gap: it lets text in any combination of languages be stored and exchanged. Unicode support is required on web-based platforms and in languages and formats such as XML, Java, and JavaScript. The two most widely used Unicode encodings on PCs are UTF-8 and UTF-16.

UTF-8 is the most common encoding. It is built from 8-bit code units and uses one to four bytes per character. UTF-8 is the usual choice on Linux platforms. UTF-16 is built from 16-bit (2-byte) code units and uses one or two of them per character. Unicode gives every character a unique number, called a code point, between U+0000 and U+10FFFF. For instance, the code point of the letter “A” is U+0041. Let’s take another example to make this clearer. Say you have the two words “Come Home”. Each character, including the space between the words (U+0020), has its own code point. The resulting sequence of code points for “Come Home” is as follows:

U+0043 U+006F U+006D U+0065 U+0020 U+0048 U+006F U+006D U+0065

Let’s look at several examples to see how Unicode works in practice. We will use the C language to print the actual characters from their Unicode code points. Start the shell terminal on the Ubuntu 20.04 desktop with “Ctrl+Alt+T”. In the console, create a new C file for the Unicode examples with the “touch” command. An editor such as Vim or GNU Nano is needed to open the newly created file. We have used the GNU Nano editor to open the Unicode.c file.

Example 01:

Let’s start with a first example that prints the actual character for a Unicode code point. With the file open, we have added the header that the C program needs: the standard input/output header, stdio.h, is a must. After this, we have defined a main() function that takes void as its parameter list.

Now, we have added the eight-digit Unicode escape “\U0001f602” within the printf statement to print the character it stands for. The return 0 statement ends the main() function. Save the code with “Ctrl+S”, then quit the editor with “Ctrl+X”.

Now, we are back at the terminal console. Compile the newly created code file “Unicode.c” with the gcc compiler. The compilation is successful as no errors are reported. Run the program with the usual “./a.out” command in the console. A smiley appears on the Linux shell screen as the output of the C code. This means that the code point U+1F602, written as “\U0001f602” in the code, stands for this smiley.

Example 02:

Let’s look at another way of writing Unicode in C code. We have opened the same file once again and updated it. The header file and the definition of the main() function are the same as in the example above. The only difference is in the printf statement, where we have written the character as a sequence of “\x” hexadecimal escapes. The same format is also accepted by GNU utilities such as the shell’s printf, which is why we have used it. Save the updated code and quit the file via “Ctrl+S” and “Ctrl+X”.

After compiling and running the program, we get the skull and crossbones sign, which is commonly used to indicate a threat or warning. You can see the output on your console.

You can also write the character in the printf statement as a universal character name. We have opened the same file to update it and changed the format in the printf statement: “\u2620” replaces the “\x” form. Save your code once again to see the changes.

After compiling and running the Unicode.c file, we get the same result as above.

If you want to see the hexadecimal bytes behind the character used in this example, pipe the output of the shell’s printf command into “hexdump”.

Example 03:

Note that the escape “\x65” stands for the character “e” (U+0065), while “\x09” is the horizontal tab character, which the terminal displays as a gap of up to eight columns by default. So, we have opened the same file and replaced the previous escape with “\x65” to see how it works.

After saving the file, we have compiled and executed the code. We get the character “e” in return, as expected.

Let’s put a tab before the character “e”. Open the same file and update the printf statement with “\x09\x65”. Save the code to apply the changes.

Come back to the terminal, compile the code, and run it. The character “e” is displayed with a gap before it, which is the tab produced by “\x09”.

Example 04:

Let’s find out which character the escape “\x0965” stands for. Open the same file with the “nano” command and leave the overall code unchanged. The only change required is in the printf statement, where we have replaced the old escape with the new one, i.e., “\x0965”. After this, we have saved the code and returned to the shell for compilation.

When we compile this updated code, the compiler complains that the hex escape sequence is out of range. In C, a “\x” escape takes every hexadecimal digit that follows it, so “\x0965” is read as the single value 0x965, which does not fit in one char.

When we pass the same text to the “echo” command on the shell (with escape interpretation enabled, i.e., “echo -e” in Bash), it successfully outputs 65 preceded by a gap. The shell reads the format differently: it takes at most two hexadecimal digits after “\x”, so “\x09” becomes a tab and the remaining “65” is printed literally. That is why echo can display the text while the C compiler objects to it.

Example 05:

Let’s look at the last example of using Unicode in the code. Open the same file and update the line with the printf statement; the rest of the program is left unchanged. This time, the printf statement uses a different escape, i.e., “\u0965”. Save the updated code with Ctrl+S and quit via Ctrl+X to see which character this code point belongs to.

The code has been compiled with the “gcc” compiler. Running it prints “॥”, the Devanagari double danda (U+0965), which looks like the “OR” sign “||”.

Conclusion:

In this article, we have discussed the concept of Unicode and its two most common encodings, i.e., UTF-8 and UTF-16. We have also seen several examples that display Unicode characters in the shell using the C language. We hope this article answers your questions about Unicode.

About the author

Aqsa Yasin

I am a self-motivated information technology professional with a passion for writing. I am a technical writer and love to write for all Linux flavors and Windows.