mdaems wrote: As I don't understand all this stuff about UTF8
There are three different types of character encoding I'm aware of at the moment:
ANSI: single-byte
MBCS: Multi-byte character sets
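UNICODE: two-byte character set (what Windows calls "Unicode" is really the UTF-16 encoding)
ANSI uses codes 0-255 to represent a maximum of 256 different characters. The "codepage" decides which code stands for which character.
UNICODE, in the Windows sense, uses two bytes per character, which is very simple (no wonder Microsoft used it for NT4/2000/XP/Vista), and allows 65536 different characters. Strictly speaking this is UTF-16: characters beyond those 65536 are stored as a pair of two-byte units (a "surrogate pair"), but for most everyday text you can think of it as two bytes per character.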
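To make the code page idea concrete, here is a small sketch in Python (used purely for illustration, it is not part of the Delphi toolchain discussed here). It decodes the same byte value under two code pages; the choice of cp1252 and cp1251 is just an example.

```python
# The same single byte stands for different characters,
# depending on which code page is used to interpret it.
raw = bytes([0xE4])

western = raw.decode("cp1252")   # Western European code page
cyrillic = raw.decode("cp1251")  # Cyrillic code page

# Print the resulting Unicode code points rather than the glyphs,
# so the output does not depend on the console's own encoding.
print(f"cp1252: U+{ord(western):04X}")
print(f"cp1251: U+{ord(cyrillic):04X}")
```

Under cp1252 the byte is the Latin letter a with diaeresis, under cp1251 it is the Cyrillic letter de.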
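As a rough check of the two-bytes-per-character picture, the following Python sketch (illustration only; the helper name `utf16_bytes` is made up for this example) counts the bytes of a few characters in UTF-16, the encoding Windows uses for its wide strings. Characters outside the first 65536 codes need a second two-byte unit.

```python
def utf16_bytes(ch: str) -> int:
    """Return the number of bytes one character takes in UTF-16 (no byte order mark)."""
    return len(ch.encode("utf-16-le"))

samples = {
    "letter A": "A",
    "euro sign": "\u20ac",
    "emoji (outside the first 65536 codes)": "\U0001F600",
}

for name, ch in samples.items():
    print(name, utf16_bytes(ch))
```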
UTF8 is one (not *the*) multi-byte encoding. The reason for using MBCS is that in most western languages the characters with codes below 128 are by far the most used in any text. The space ($20), the characters 'a' to 'z' and 'A' to 'Z' as well as '0'..'9' are all in that range.
So MBCS, in this case UTF8, uses one, two, three or even four bytes per character, which makes the byte length of a string unpredictable without decoding the whole string (good reason for Microsoft(r) "640 kb is enough for everybody" Windows(tm) not to use it :-)). You can think of the bytes above 127 as 'escape' codes: a lead byte announces how many continuation bytes follow. Two-byte sequences cover the codes up to $7FF, three-byte sequences the codes up to $FFFF, and four-byte sequences everything above that.
Windows NT/2000/XP/Vista uses UNICODE internally for all functions. The 'A' versions of the API functions are just wrappers that convert their arguments to UNICODE and then call the 'W' version. Most functions that accept strings come in such a pair. Example: LoadStringA() and LoadStringW()
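The variable width is easy to see in a few lines of Python (again only an illustration; `utf8_bytes` and the sample text are made up for this sketch). It shows the byte count per character and why the character count of a string tells you nothing definite about its byte count.

```python
def utf8_bytes(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# One character each, from four different ranges.
print(utf8_bytes("A"))           # plain ASCII letter
print(utf8_bytes("\u00e9"))      # e with acute accent
print(utf8_bytes("\u20ac"))      # euro sign
print(utf8_bytes("\U0001F600"))  # emoji, outside the first 65536 codes

# A short mixed string: character count and byte count differ.
text = "na\u00efve \u20ac5"
print(len(text), utf8_bytes(text))
```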
As a Delphi programmer you will mostly need these two functions:
UTF8Encode() to convert a UNICODE string (WideString) into a UTF8String
UTF8Decode() to convert a UTF8String into a WideString
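The round trip these two functions perform can be sketched in Python, with `str.encode` and `bytes.decode` standing in for UTF8Encode() and UTF8Decode(). This is only an analogy to show the idea, not the Delphi calls themselves.

```python
wide = "Gr\u00fc\u00dfe"      # German "Gruesse" written with u-umlaut and sharp s

utf8 = wide.encode("utf-8")   # roughly the job of UTF8Encode()
back = utf8.decode("utf-8")   # roughly the job of UTF8Decode()

print(len(wide), len(utf8))   # characters vs. bytes
print(back == wide)           # the round trip loses nothing
```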
Most importantly, when dealing with UNICODE you should never ever touch any string function from the VCL! These always use AnsiStrings. Assigning to an AnsiString internally converts the text to the system's ANSI code page (cp1252 on a western European Windows, cp1251 on a Russian one, and so on), which means: whatever character isn't among the 256 characters of that code page is changed to a question mark ('?'). There is a freeware conversion of the popular (unix) library 'libiconv' available at:
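The question mark effect can be reproduced outside Delphi. The Python sketch below (illustration only) squeezes a Cyrillic word into the western code page cp1252; the `errors="replace"` option is Python's way of substituting '?', and is only assumed here to resemble what the implicit WideString to AnsiString conversion does.

```python
# Cyrillic "Privet" followed by a space and a plain ASCII letter.
text = "\u041f\u0440\u0438\u0432\u0435\u0442 A"

# Force the text into a single-byte western code page.
# Characters that do not exist in cp1252 are replaced by '?'.
lossy = text.encode("cp1252", errors="replace")

print(lossy)
```

Only the space and the 'A' survive; the original Cyrillic letters cannot be recovered from the result.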
www.yunqa.de
The benefits of UTF8 are: less memory consumption, since characters such as digits and the letters a-z take only one byte each. UTF8 can also encode every Unicode character (more than a million possible codes), while ANSI only offers 256 and two-byte UNICODE only 65536 without resorting to surrogate pairs.
The downside of UTF8 is: there is no fixed relation between the number of characters and the number of bytes a string occupies. The maximum is three bytes for any character within the first 65536 codes and four bytes for the rarer ones beyond that. In databases the CHAR type is typically limited to 255 bytes. Using UTF8 you can therefore only guarantee that 255 DIV 3 = 85 characters fit into such a field (255 DIV 4 = 63 if four-byte characters may occur).
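The field size arithmetic, plus one way to cut a string to a byte limit without splitting a character in the middle, can be sketched like this in Python. Both helper names are made up for this example, and the worst-case bytes per character is left as a parameter.

```python
def guaranteed_chars(byte_limit: int, max_bytes_per_char: int) -> int:
    """Number of characters that always fit, assuming the worst case for each one."""
    return byte_limit // max_bytes_per_char

def truncate_utf8(text: str, byte_limit: int) -> str:
    """Cut text so that its UTF-8 form fits into byte_limit bytes."""
    encoded = text.encode("utf-8")[:byte_limit]
    # The cut may land inside a multi-byte sequence; errors="ignore"
    # drops that incomplete tail instead of raising an exception.
    return encoded.decode("utf-8", errors="ignore")

print(guaranteed_chars(255, 3))
print(guaranteed_chars(255, 4))

# 100 euro signs (three bytes each) squeezed into a 255-byte field.
fitted = truncate_utf8("\u20ac" * 100, 255)
print(len(fitted), len(fitted.encode("utf-8")))
```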
I hope this little introduction is of some use to you. I'd also like to point you to the VCL source code (Delphi Professional or Enterprise only!), which implements one or two important unicode functions: ${BDS}\source\win32\rtl\common\WideStrings.pas and WideStrUtils.pas