Marco Iannaccone wrote:
I'd like to start using Unicode (especially UTF-8) in my C programs, and
would like some information on how to start.
Can you point me to some documents (possibly online) explaining Unicode
and UTF-8, and how I can use them in my programs (writing to and reading
from files, from the console, processing Unicode strings and chars
inside the program, etc.)?
C provides a concept of wide characters (arrays of wchar_t) and
multibyte characters (arrays of char where each character may take up
more than one byte). The C standard defines functions for converting
between wide and multibyte representations. The standard does not
specify what encoding these two representational forms take.
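The restartable form of these functions, mbrtowc from <wchar.h>, converts
one character at a time. A minimal sketch of counting the characters in a
multibyte string with it follows. The name mb_char_count is made up for
illustration, and the result depends on whatever locale is in effect.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): count the characters
   in a multibyte string under the current locale's encoding.
   Returns (size_t)-1 if the string holds an invalid or incomplete
   multibyte sequence. */
size_t mb_char_count(const char *s)
{
    assert(s != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);    /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* null character reached */

        s += n;                         /* skip the bytes just consumed */
        remaining -= n;
        count++;
    }
    return count;
}
```

In the default "C" locale a plain ASCII string such as "abc" counts as 3.
What happens to bytes above 0x7F depends on the locale that has been set.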
On at least one platform, depending on the current locale setting, the
wide characters built in to C represent Unicode characters, and the
multibyte characters represent the UTF-8 form.
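C99 also defines a macro, __STDC_ISO_10646__, which an implementation may
set to promise that wchar_t values are ISO/IEC 10646 (Unicode) code points.
A small sketch of checking it follows. The function name is made up for
illustration, and a missing macro only means that no promise is made, not
that wchar_t is definitely something else.

```c
/* Hypothetical helper (not a standard function): report the
   ISO/IEC 10646 revision that wchar_t is documented to follow,
   or 0 if the implementation makes no such promise. */
long wchar_iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;    /* a constant of the form yyyymmL */
#else
    return 0L;
#endif
}
```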
The following program attempts to set the locale to en_AU.UTF-8, which
means Australian English in UTF-8 encoding. The language portion doesn't
matter, just the encoding does. It then takes a UTF-8 string (which
happens to contain Simplified Chinese characters), and converts it to
the wide character representation, which on my platform is equivalent to
Unicode.
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    wchar_t ucs2[5];

    if(!setlocale(LC_ALL, "en_AU.UTF-8"))
    {
        printf("Unable to set locale to Australian English in UTF-8\n");
        return EXIT_FAILURE;
    }

    /* The UTF-8 representation of string "水调歌头"
       (four Chinese characters pronounced shui3 diao4 ge1 tou2) */
    const char *utf8 = "\xE6\xB0\xB4\xE8\xB0\x83\xE6\xAD\x8C\xE5\xA4\xB4";

    /* mbstowcs returns (size_t)-1 if the input is not valid in the
       current locale's encoding */
    if(mbstowcs(ucs2, utf8, sizeof ucs2 / sizeof *ucs2) == (size_t)-1)
    {
        printf("Unable to convert the UTF-8 string\n");
        return EXIT_FAILURE;
    }

    printf("UTF-8: ");
    for(const char *p = utf8; *p; p++)
        printf("%02X ", (unsigned)(unsigned char)*p);
    printf("\n");

    printf("Unicode: ");
    for(wchar_t *p = ucs2; *p; p++)
        printf("U+%04lX ", (unsigned long) *p);
    printf("\n");

    return 0;
}
[sbiber@eagle c]$ c99 -Wall utf8ucs2.c -o utf8ucs2
[sbiber@eagle c]$ ./utf8ucs2
UTF-8: E6 B0 B4 E8 B0 83 E6 AD 8C E5 A4 B4
Unicode: U+6C34 U+8C03 U+6B4C U+5934
I'd be interested to know how widely this technique works. Is it
portable?
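If it turns out not to be, one fallback is to decode UTF-8 by hand, which
depends neither on the locale nor on how wchar_t is encoded. A minimal
sketch follows. The names utf8_decode and utf8_to_code_points are made up
for illustration and are not standard functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at str.
   Stores the code point in *cp and returns the number of bytes
   consumed (1 to 4), or 0 if the sequence is malformed. */
size_t utf8_decode(const char *str, uint32_t *cp)
{
    assert(str != NULL && cp != NULL);

    const unsigned char *s = (const unsigned char *)str;
    size_t len;
    uint32_t c, min;

    if (s[0] < 0x80) {                    /* single byte (ASCII) */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: 2 bytes */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: 3 bytes */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: 4 bytes */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                         /* stray continuation or bad lead byte */
    }

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                     /* also stops at the terminating '\0' */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                         /* overlong, out of range, or surrogate */

    *cp = c;
    return len;
}

/* Hypothetical helper: decode a null-terminated UTF-8 string into at
   most max code points. Returns the number stored, or (size_t)-1 on
   malformed input. */
size_t utf8_to_code_points(const char *s, uint32_t *out, size_t max)
{
    assert(s != NULL && out != NULL);

    size_t count = 0;
    while (*s && count < max) {
        size_t n = utf8_decode(s, &out[count]);
        if (n == 0)
            return (size_t)-1;
        s += n;
        count++;
    }
    return count;
}
```

Given the UTF-8 string from the program above, utf8_to_code_points should
store 0x6C34, 0x8C03, 0x6B4C and 0x5934, the same values mbstowcs printed.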
--
Simon.