The character sets supported tended to be encoded as single 8-bit bytes, using either the EBCDIC table (on IBM mainframes) or the ASCII table (on most other systems).
There are two main issues we'll discuss here in character representation:
(i) incompatibilities across platforms in handling characters
(ii) the inability of those two character encodings to represent more
than 256 characters, and thus their inability to represent the majority of the
symbols and characters in use around the world.
Example 1: in ASCII, the digit encodings have lower values than the uppercase alphabetic characters, which in turn have lower values than the lowercase alphabetic characters. In EBCDIC, on the other hand, the lowercase alphabetic characters come first, then the uppercase, then the digits. Thus the code below will produce opposite results on the two systems:
    char c1 = 'x';
    char c2 = 'X';
    if (c1 < c2) printf("Lower\n");
    else if (c1 > c2) printf("Higher\n");
    else printf("Equal\n");
Example 2: in ASCII, the lowercase characters all appear in sequence a..z, as do the uppercase characters. In EBCDIC, on the other hand, there are gaps between characters i and j (and I,J), and more gaps between r and s (and R,S), with various symbols in the gaps. The code below will thus print a,b,...,z on an ASCII system, but will generate extra characters in between on an EBCDIC system:
for (char c = 'a'; c <='z'; c++) printf("%c,", c);
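Furthermore, different platforms mark the end of a line differently: Unix-like systems use a line feed (LF), Windows uses a carriage return followed by a line feed (CR LF), and classic Mac OS used a carriage return (CR) alone.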
To represent all the alphabets and special symbols used world wide, a system is needed that can represent tens of thousands of different characters.
The solution is UNICODE, which is currently set to represent over 136,000 characters, with character codes (in hex) ranging from 0 to 10FFFF (see unicode.org/charts or unicode-table.com/en/ ).
Different ranges of values have been set aside for symbols in over 100 languages, and also include a wide range of special symbols (everything from math and mapping symbols to emoticons).
To maximize flexibility, some of these symbols are denoted as combining symbols - meaning they are combined with the previous character in a string to create a single more complex character (e.g. combining a letter with an accent mark).
It is also possible for the programmer to assign custom designed symbols, using portions of the UNICODE range that are not in use already.
(Note that, as UNICODE gradually evolved, many groups developed their own standards for what values to use for specific characters, and the unicode standards committee has attempted to support many of these. As a result, you may find the same character represented with a variety of different codes.)
Programming languages provide varying degrees of unicode support - with more recent versions of languages more likely to either support unicode innately or to provide libraries to enable the use of unicode.
The key issues most frequently encountered are ensuring programmers understand:
One aspect that has produced some anomalies is the set of "permanently unassigned" codes (noncharacters). These are the last two codes in each of the 17 planes, i.e. FFFE-FFFF, 1FFFE-1FFFF, ..., FFFFE-FFFFF, 10FFFE-10FFFF, along with the block FDD0-FDEF, and their use should be avoided within code. However, some developers of editors and other tools have chosen to use codes in these ranges as sentinels (not unlike null terminators) within documents. When the documents are then used by other programs (not expecting these values, or using them in an incompatible fashion) the results can be unpredictable.
Below we briefly look at support in C++.
C++ and unicode
C++ uses the setlocale function (from the <clocale> library) to customize
the use of character sets, currencies, times, etc. The call below selects
all locale categories (LC_ALL) and, by passing an empty string, takes their
values from the user's environment:
setlocale(LC_ALL, "");
For unicode characters, C++ uses "wide" characters (i.e. more than 1 byte), supported through the cwchar and cwctype libraries.
These include wide versions of most text handling routines, e.g. wprintf, fgetws, iswspace, etc.
The iostream library also provides wcin and wcout, i.e. wide versions of cin and cout.
Syntactically, individual character literals can be handled as follows:
wchar_t ch = 0x001f6; // assign a unicode character
Wide string literals are prefixed with an L, e.g.
wprintf(L"blah blah blah");
The format strings for wide characters and wide strings are %lc and %ls, respectively,
e.g.
wprintf(L"this is the char: %lc\n", ch);
Example: hello world with unicode smiley face for the o's
    #include <cstdio>
    #include <cwchar>
    #include <clocale>

    int main()
    {
        setlocale(LC_ALL, "");
        wchar_t smiley = 0x00263a;
        wprintf(L"Hell%lc W%lc%ls\n", smiley, smiley, L"rld");
        return 0;
    }
Editors and unicode
Many editors provide mechanisms like macros that allow you to create command shortcuts - these are often useful for creating shortcuts to insert unicode characters you commonly use.
In vim, for example, you can create macros either by recording them during normal editing, or by specifying them using the nmap command in your .vimrc file.
E.g. recording and "playing back" a macro to insert a smiley face:
(i) press the i key (to enter insert mode)
(ii) press control-V (to mark the start of a text item)
(iii) enter U263a (the unicode for the smiley)
(iv) press escape (to leave insert mode)
Thus the sequence to record the macro is
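    press q
    press s
    press i
    press control-V
    enter U263a
    press escape
    press q
Here the first q followed by s starts recording into register s, and the final q stops recording. To play the macro back, type @s in normal mode.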
We can use the vim configuration file (.vimrc in your home directory) to set up a collection of such macro mappings that will be available whenever you use vim.
The mapping command to put in the .vimrc file (using the smiley example again) is:
    nmap @s i<c-v>U263a<esc>
Note that nmap uses the notation <c-v> to indicate control v, and <esc> to indicate escape.
Bash and unicode
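You can enter unicode characters directly on the command line using the sequence
control-shift-u followed by the unicode hex value (e.g. 263a), then hit enter.
(This input method is provided by many Linux terminal emulators rather than by bash itself, so it may not be available in every environment.)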
Within bash scripts, you can either enter the unicode character directly in the script
using whatever mechanism your editor provides, e.g. for adding the smiley in vim
    smiley="☺"; echo "$smiley" # i <ctrl-v> U 263a <esc> to get the smiley part
or use the character code, e.g.
    printf "\\u263a"
(The \u escape in printf requires bash 4.2 or later.)