First, what everyone is looking for: the free tools. So here is one,
a free UTF8 convertor tool. I wrote it
using strictly WIN32 API calls. It should work on any version
of Windows from Win98 and Win NT4.0 on up (ME, 2000, XP). I developed
it on an XP system, but the WIN32 API is common to all the listed OS's.
It is easy to use. It even shows the Windows character from your
code page for each converted UTF8 byte, so you can just select, copy,
and paste these bytes into your html document. Why? Simple: most editors
do NOT support UTF8. They typically display every byte as a character
from the Windows character set for the current code page. But most current
browsers can handle UTF8 data and properly display the characters.
(A meta tag whose content includes charset=UTF-8 specifies this.)
So what is UNICODE? It is an international standard to represent the
characters of all written languages. But to accomplish that you need
more than the 256 values one byte can hold. So along comes UTF8, a method for
encoding these higher values into byte wise data such that the original
values can be decoded back. And it neatly makes ASCII become … the
bytes they were originally. But that is where things take two different
paths. For the western European languages (latin alphabet based) the
latin 8859-1 works well. It uses the values from 0xA0 to 0xFF for
the various letters with diacritical marks. UTF8 encodes these values
into 2 bytes. With 8859-1 every byte is rendered as one character; with
UTF8, multi byte values are recognized and decoded (converted) to the
value of whichever UNICODE character they represent. UTF8 supports the
middle eastern, Asian, etc. languages, all in byte wise data. UNICODE by
itself does not, since its values do not fit in one byte.
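To illustrate the two paths, here is a minimal sketch in C of how a single 8859-1 byte maps to UTF8. The function name latin1_to_utf8 is made up for this example and is not taken from the convertor tool. It relies on the fact that the 8859-1 byte values are the same as the first 256 UNICODE values.

```c
#include <stddef.h>

/* Hypothetical helper (name chosen for illustration only).
   Converts one ISO 8859-1 byte to UTF-8.
   Writes 1 or 2 bytes to out and returns how many were written. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        /* ASCII stays exactly the byte it was. */
        out[0] = c;
        return 1;
    }
    /* 0x80..0xFF becomes two bytes: 110000xx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}
```

So the letter A (0x41) stays one byte, while the 8859-1 byte 0xE9 (e with an acute accent) becomes the two bytes C3 A9.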
How do you use this? Look up the desired character in the UNICODE
charts and enter the value of this character in the convertor. Click
the 'Convert' button and you get the hex byte values, plus you get
displayable characters for these bytes. (They will be some odd
characters, but just copy and paste them to your document.) There is a
nice chart describing the UNICODE to UTF8 mapping in Wikipedia. I based
my convertor tool on this chart.
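The conversion step can be sketched in portable C roughly as follows. This is not the source of the convertor tool, only an illustration of the standard UNICODE to UTF8 mapping; the function name utf8_encode and its return convention are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper (name chosen for illustration only).
   Encodes the UNICODE value cp as UTF-8 into out (at most 4 bytes).
   Returns the number of bytes written, or 0 if cp is not encodable
   (a surrogate value or anything above 0x10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the value 0x2026 (the horizontal ellipsis) comes out as the three bytes E2 80 A6. An editor using code page 1252 would show those bytes as the odd looking â€¦, which is exactly what gets copied and pasted.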
And "UTF-8 stands for Unicode Transformation Format-8.
It is an octet (8-bit) lossless encoding of Unicode characters."
A quick quiz. This page is UTF-8 and I have used one special character.
Unless you are viewing this with a very old browser (say from the mid
1990's or earlier) you will see the character normally. (Those stuck
in the past are seeing some garbage symbols in the page.) Those
with current browsers, try to find the character. Give up? It's two
paragraphs up where I say "And it neatly makes ASCII become …"
A note on the convertor application. The ability to
show printable (hence copyable) characters comes from the fact that
the Windows character set for your code page has printable characters
for most bytes from 0x80 to 0xFF, thus they can be copied. ISO 8859-1,
by contrast, has no printable characters for 0x80 through 0x9F, and
those values sit inside the 0x80 to 0xBF range used for the trailing
bytes of multi byte UTF-8 characters!
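To see which role a given byte plays in UTF-8, a small classifier like the one below can help. It is a simplified sketch: the name utf8_sequence_length is invented for this example, and it only looks at the bit pattern of one byte, so it does not reject the few lead byte values (such as 0xC0 and 0xC1) that a strict decoder would refuse.

```c
/* Hypothetical helper (name chosen for illustration only).
   Given one byte, returns the length of the UTF-8 sequence it starts,
   or 0 if the byte cannot start a sequence (a trailing byte in the
   0x80..0xBF range, or 0xF8 and above). */
int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: plain ASCII */
    if (b < 0xC0) return 0;   /* 10xxxxxx: trailing byte only */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    return 0;                 /* never used in UTF-8 */
}
```

Every byte from 0x80 through 0xBF returns 0 here, meaning it only ever appears after a lead byte.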
Or, to make it easy, here is the convertor as a "Cloud
Computing" application.
This page is a work in progress. So come back for more in the future.