CSCI 265 notes: character sets and unicode

The impact of the QWERTY keyboard can be seen in many of the programming languages originally developed by companies and universities in North America - both the keywords of the language and the character sets and symbols used in the language are driven substantially by keys held over from the old typewriter days.

The supported character sets tended to encode each character as a single 8-bit byte, using either the EBCDIC table (on IBM mainframes) or the ASCII table (on almost everything else).

There are two main issues we'll discuss here in character representation:
(i) incompatibilities across platforms in handling characters
(ii) the inability of those two character encodings to represent more than 256 characters, and thus the majority of the symbols and characters in use around the world.

Platform (in)compatibility in character sets

Because the EBCDIC and ASCII systems use different orderings for the characters represented, programs may behave differently when compiled on an EBCDIC platform than on an ASCII platform.

Example 1: in ASCII, the digit encodings have lower values than the uppercase alphabetic characters, which in turn have lower values than the lowercase alphabetic characters. In EBCDIC, on the other hand, the lowercase alphabetic characters come first, then the uppercase, then the digits. Thus the code below will produce opposite results on the two systems:

char c1 = 'x';
char c2 = 'X';
if (c1 < c2) printf("Lower\n");
else if (c1 > c2) printf("Higher\n");
else printf("Equal\n");

Example 2: in ASCII, the lowercase characters all appear in sequence a..z, as do the uppercase characters. In EBCDIC, on the other hand, there are gaps between characters i and j (and I,J), and more gaps between r and s (and R,S), with various symbols in the gaps. The code below will thus print a,b,...,z on an ASCII system, but will generate extra characters in between on an EBCDIC system:

for (char c = 'a'; c <= 'z'; c++) printf("%c,", c);

Furthermore, different platforms identify the end-of-line characters differently:
(i) Unix/Linux systems use a single linefeed character, '\n'
(ii) Windows/DOS systems use a carriage-return/linefeed pair, "\r\n"
(iii) classic MacOS used a single carriage return, '\r'

As a result, if programmers write code specifically for the type of character used on their system, the code may not work correctly if compiled on other systems.
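One way to guard against the end-of-line differences is to have the program normalize lines itself as it reads them. A minimal sketch in C++ (the helper name normalize_line is just for illustration):

#include <iostream>
#include <string>

// getline strips the trailing '\n'; if the file came from a Windows
// system a '\r' may still be left at the end of the line, so remove it
static void normalize_line(std::string &line) {
    if (!line.empty() && line.back() == '\r')
        line.pop_back();
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {
        normalize_line(line);
        std::cout << "[" << line << "]\n";  // brackets show the line's true extent
    }
    return 0;
}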

The inability of ASCII/EBCDIC to represent more than 256 characters

Because both systems are based on one character per 8-bit byte (with ASCII only using 7 of the bits), neither system can represent more than 256 characters.

To represent all the alphabets and special symbols used worldwide, a system is needed that can represent tens of thousands of different characters.

The solution is UNICODE, which currently defines over 136,000 characters, with hex code points running from 0 to 10FFFF (see unicode.org/charts or unicode-table.com/en/).

Different ranges of values have been set aside for the characters of over 100 languages and scripts, along with a wide range of special symbols (everything from math and mapping symbols to emoticons).

To maximize flexibility, some of these symbols are denoted as combining symbols - meaning they are combined with the previous character in a string to create a single more complex character (e.g. combining a letter with an accent mark).
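For instance, the following sketch prints the letter e followed by the combining acute accent (U+0301); a terminal that handles combining characters displays the two code points as the single character é. (This assumes the compiler's execution character set and the terminal are both UTF-8.)

#include <cstdio>

int main() {
    // two code points - 'e' and U+0301 (COMBINING ACUTE ACCENT) -
    // that render as one on-screen character
    const char *s = "e\u0301";
    printf("%s\n", s);
    return 0;
}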

It is also possible for programmers to assign their own custom-designed symbols, using the portions of the UNICODE range set aside as "private use" areas (e.g. E000-F8FF).

(Note that, as UNICODE gradually evolved, many groups developed their own standards for which values to use for specific characters, and the unicode standards committee has attempted to remain compatible with many of these, so you may find the same character represented by several different codes - e.g. the angstrom sign, 212B, looks identical to the letter Å, 00C5.)

Programming languages provide varying degrees of unicode support - more recent versions of languages are more likely either to support unicode natively or to provide libraries enabling its use.

The key issues most frequently encountered come down to ensuring programmers understand exactly which code values their programs and tools are using, and how those values will be interpreted elsewhere.

One aspect that has produced some anomalies is the set of "permanently unassigned" codes (the unicode "noncharacters"). These are the codes FDD0-FDEF plus the last two codes of every plane (FFFE-FFFF, 1FFFE-1FFFF, and so on up to 10FFFE-10FFFF), and their use should be avoided within code. However, some developers of editors and other tools have chosen to use codes in these ranges as sentinels (not unlike null terminators) within documents. When the documents are then used by other programs (not expecting these values, or using them in an incompatible fashion) the results can be unpredictable.
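One way to screen out these values is a small helper along the following lines (a sketch - the name is_noncharacter is just for illustration):

#include <cstdio>

// true for the 66 unicode "noncharacters": FDD0-FDEF plus the
// last two code points of each of the 17 planes (xFFFE and xFFFF)
static bool is_noncharacter(unsigned int c) {
    if (c >= 0xFDD0 && c <= 0xFDEF) return true;
    return (c & 0xFFFE) == 0xFFFE && c <= 0x10FFFF;
}

int main() {
    printf("%d %d %d\n",
           is_noncharacter(0x263A),    // 0: the happy face is fine
           is_noncharacter(0xFFFE),    // 1
           is_noncharacter(0x10FFFF)); // 1
    return 0;
}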

Below we briefly look at support in C++ and Perl.

Perl and unicode

Use of unicode in perl is relatively easy:
first, add the following line of code at the beginning of your script/module:
use feature 'unicode_strings';

Then you can specify unicode characters via their hex code using "\N{U+xxxx}", e.g. the following uses a happy face in the string for Hello World:

#! /usr/bin/perl -w
use strict;
use feature 'unicode_strings';
binmode(STDOUT, ':utf8');   # print output as utf8 (avoids "wide character" warnings)

my $face = "\N{U+263a}";                 # the happy face
my $str = "Hello W" . "$face" . "rld!";  # the . concatenates strings
print "$str\n";
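Run in a UTF-8 terminal, the script prints Hello W☺rld! (whether the face actually displays depends on the terminal's font).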

C++ and unicode

C++ uses the setlocale function (from the <clocale> library, inherited from C) to customize the handling of character sets, currencies, times, etc. The call below selects the user's default locale settings for all categories:
setlocale(LC_ALL, "");

For unicode characters, C++ uses "wide" characters (i.e. more than 1 byte), supported through the <cwchar> and <cwctype> libraries.

These include wide versions of most text handling routines, e.g. wprintf, fgetws, iswspace, etc.

The iostream library also provides wcin and wcout, i.e. wide versions of cin and cout.

Syntactically, individual character literals can be handled as follows:
wchar_t ch = 0x01F6; // assign a unicode character via its hex code

Wide string literals are prefixed with an L, e.g.
wprintf(L"blah blah blah");

The format strings for wide characters and wide strings are %lc and %ls, respectively, e.g.
wprintf(L"this is the char: %lc\n", ch);

Editors and unicode

Many editors provide mechanisms like macros that allow you to create command shortcuts - these are often useful for creating shortcuts to insert unicode characters you commonly use.

In vim, for example, you can create macros either by recording them during normal editing, or by specifying them using the nmap command in your .vimrc file.

E.g. recording and "playing back" a macro to insert a smiley face (a sketch of the keystrokes, recording into register s):

   qs                    start recording keystrokes into register s
   i<ctrl-v>u263a<esc>   insert the smiley (in insert mode, ctrl-v, then u, then the hex code)
   q                     stop recording
   @s                    play the macro back (inserting another smiley)

We can use the vim configuration file (.vimrc in your home directory) to set up a collection of such macro mappings that will be available whenever you use vim.

The mapping command to put in the .vimrc file (using the smiley example again) is:

   nmap @s i<c-v>u263a<esc>
Note that nmap uses the notation <c-v> to indicate control v, and <esc> to indicate escape.
(The n in nmap indicates the command works from vim 'normal' mode.)

Bash and unicode

You can often enter unicode characters directly on the command line in bash using the sequence
control-shift-u followed by the unicode hex value (e.g. 263a), then enter or space.
(This input sequence is actually provided by the terminal/input method on many Linux systems, not by bash itself, so it may not be available everywhere.)

Within bash scripts, you can either enter the unicode character directly in the script, using whatever mechanism your editor provides, e.g. for adding the smiley in vim:
smiley="☺"; echo "$smiley" # i <ctrl-v> u 263a <esc> to get the smiley part

or by using the character code, e.g.
printf "\u263a\n"   # bash's printf expands \uXXXX escapes (bash 4.2 and later)