Turkish character problem

boreas · Post by **boreas** » 19.12.2011, 11:14

Hello,

I am currently using ZEOS834 + FB2.5 + DelphiXE. My VARCHAR codecs are WIN1254. I entered the definitons codepage=WIN1254 and LC_ALL=WIN1254 to properties section under zconnection.If i use AnsiUpperCase and AnsiLowerCase commands in Zquery by using programs property, I am able to send Turkish character however, if i use sql.clear and sql add commands and write the SQL directly, the Turkish characters are are sent problematically and I cant make search.

zquery1.sql.clear;
zquery1.sql.text:='select * from musteri where UPPER(adi) like UPPER(:AD)';
zquery1.open;
zquery1.active;
zquery1.params[0].value := ansiuppercase(edit1.text);
zquery1.refresh;
// Correct result

zquery1.sql.clear;
zquery1.sql.text:='select * from musteri where UPPER(adi) like UPPER('+chr(39)+ansiuppercase(edit1.text)+chr(39)+')';
zquery1.open;
zquery1.active;
// Empty Result

I ll be glad if I recieve help about this.

Regards
atically.

Post by **EgonHugeist** » 30.12.2011, 16:01

boreas,

your problem is well known for me. Why it happens:

If you use a Parameter-Query Zeos does an UTF8Encoding/Decoding for your Parameters. That's why it works well.

If you use an Direct SQL-Query Zeos does not know which Data you want to Select/update/insert at runtime. The Problem occours with several Characters (Russian, Chinese ... Turkish) and it seems it's Database-independent. I made a patch to handle it, but it fails when you use Binary-Data because the UTF8Encoding runs over the whole Query and changes the binary Data. So it wasn't helpful and I'll remove it now.

It seems Zeos must preprepare the Query-Statements, check for Binary-Data, and send a pepared Query to the database...

It mean's if you use the UTF8Encode(edit1.text) it will works well i think. Please try it as a work around.

Best regards

Post by **mdaems** » 30.12.2011, 17:30

It seems Zeos must preprepare the Query-Statements, check for Binary-Data, and send a pepared Query to the database...

That's exactly what should be done in an ideal world! Actually, this prepreparation should disassemble sql statements moving all literal strings to parameter-like fields and only at the very last moment (ie. right before passing to the database dll) the query should be re-assembled. Implying all sql should be passed around as a special TZSqlObject from the component layer through the dbc layer to the plaindriver layer.

But can we find a task group of 4 community members who want to design this change and implement it afterwards? And can this change be done keeping database drivers we can't really test ourselves?
Why the task group? Because it's a very complex thing and once you start coding the idea should be approved by some more 'pairs of eyes'.

Mark

Post by **EgonHugeist** » 02.01.2012, 12:23

EgonHugeist wrote: If you use an Direct SQL-Query Zeos does not know which Data you want to Select/update/insert at runtime. The Problem occours with several Characters (Russian, Chinese ... Turkish) and it seems it's Database-independent.

mdaems wrote:
It seems Zeos must preprepare the Query-Statements, check for Binary-Data, and send a pepared Query to the database...
That's exactly what should be done in an ideal world! Actually, this prepreparation should disassemble sql statements moving all literal strings to parameter-like fields and only at the very last moment (ie. right before passing to the database dll) the query should be re-assembled. Implying all sql should be passed around as a special TZSqlObject from the component layer through the dbc layer to the plaindriver layer.

But can we find a task group of 4 community members who want to design this change and implement it afterwards? And can this change be done keeping database drivers we can't really test ourselves?
Why the task group? Because it's a very complex thing and once you start coding the idea should be approved by some more 'pairs of eyes'.

Mark

If there is a taskforce, i'll join to help. I think doing something is better than nothing...

So what can you do Mark? First you are THE 'pairs of eyes'

And i think i would be helpfull to make a Spezial Portal-Thread/FAQ which discribes this Problem here. Tell the users how to handle it, why it ocours(means Zeos em self is not the Problem here. This is the handling of the Drivers-dll's Zeos is using for. The Problems are the result of mixing its 1-Byte AnsiChars and 2/x-Byte PWideChars Delphi 2009 and higher is using) Your Alpha 7.0.x forum is full of Thread's which are result of this problem. That was why i felt encouraged to solve this Problem. I'm a german developer so i personally and my users DO NOT HAVE this problem here. Make a thread to advertise some Delphi12Up developers which join this taskforce and to amass only Bug's and Questions here. Also i think you make a special 7.0x User-Patches thread to difference which patch is for which Version...

Regards

marsupilami · Post by **marsupilami** » 13.01.2012, 15:28

Hello boreas, mdaems and EgonHugeist,

I am no expert at zeos. But shouldn't boreas use
CODEPAGE=UTF8
in TZConnection.Properties when he uses a Delphi that uses Unicode? This way Firebird can do all the conversion that is needed and all that the lower levels of Zeos have to do is advertise the resulting VARCHAR and BLOB sup type 1 fields as Unicode instead of ANSI?
I did a patch that worked like that once and it helped the guy that I was doing it for but did not help a second guy completely...
Mabe you can have a look at the patch and maybe that is a start for solving this problem until you can implement the way that it should be done in an ideal world ;o)
Best regards,

Jan

Post by **EgonHugeist** » 14.01.2012, 13:22

marsupilami,

Please add this patch to the Attachements of your post.. Let's see if it is helpfull.

marsupilami · Post by **marsupilami** » 16.01.2012, 20:27

Hello EgonHugeist,

I attached the patch. Maybe it is not completely right. But it might go into the right direction as it solved unicode handling for the standard Delphi TDBEdit. It seems that it solved problems for one guy whereas another guy seemed to have problems with some controls (see http://zeos.firmos.at/viewtopic.php?p=13717#13717).
Suggestions and critics are welcome

Best regards,

Jan

Post by **EgonHugeist** » 17.01.2012, 18:47

The way your gone (CharSetName := Info.Values['isc_dpb_lc_ctype'];)is a nice idea.

BUT:

boreas wrote: My VARCHAR codecs are WIN1254 !!!

All your VARCHAR-Fields are stUnicodeString fields. This way does'nt realy solve the problem boreas, has. So why this?

His CharacterSet do not need an UTF8Encoding. Which means on calling a non Parameter(unprepared)-Statement the ZDbcInterbase6Utils.PrepareStatement uses the UTF8Encode(your SQL-String)-Function. Which turns some characters to UTF8-Decodable Characters. This is what your patch does too(Enable VARCHAR as stUnicodeString).

Your patch and all others (MySQL, PostgreSQL, SQLite-not yet) USE UTF8Encode/Decode as an StringToAnsi-Adapter. Is this the right way?

So we need generally a CharacterSet-Conversation-Class which rearanges the 2-4Byte Chars from Delphi2009( and later) to Ansi-Database-Compatible-Characters in dependency of the used Field-CharacterSet or Database-Default-CharacterSet( if not Set). This is a Zeos Problem generally...

Why this:

the stUnicodeString allways (actual) does the UTF8-Decoding/Encoding to handle the Byte-Convertation (the 1Byte-AnsiChars (Database-Dll's uses) and the Delphi-nByte-Chars). Now what happens if another non-Zeos application reads this data and the Collumn-collation is a non-UTF/Unicode one? The result mus be some som corect and some cryptical UTF8 Chars!

Actualy Zeos ignors all this "Column-spezial-Code-Pages" the developer has set before! And this is a amazing Problem!

So what can we do? We have to know which Columns of your given SQL-Statement is in use and which String-handling is the right one here. This means we have to disassamble each Query by intelligent way....

Michael

marsupilami · Post by **marsupilami** » 17.01.2012, 22:24

Hello Michael,

EgonHugeist wrote:Your patch and all others (MySQL, PostgreSQL, SQLite-not yet) USE UTF8Encode/Decode as an StringToAnsi-Adapter.

Actually, this isn't what is happening. The patch doesn't use UTF8Encode anywhere.

I don't know how other database servers work. But with Firebird it is like this, if I understand the documentation correctly: When you connect to the server, you specify what kind of character encoding (and codepage and so on) your appliaction uses. You specify this using the parameter isc_dpb_lc_ctype.
Once you are connected, firebird transliterates all strings from the database encoding to the connection encoding. From version 2.1 on this also happens for text blobs.
In the other direction Firebird also converts your strings to the coding system specified in the database. From the Firebird point of view this is a necessity because in the database you can specify different character sets for different columns if you want to, whereas the connection character set is fixed.

So the patch takes the value of isc_dpb_lc_ctype and asks the database, if the specified character set is a unicode character set. If it is a unicode character set, it declares all columns to be unicode columns to the upper layers of Zeos because the encoding of these columns is unicode and Firebird expects to get unicode in return. If the conenction character set isn't unicode it must be ANSI. And in this case the patch declares the coluns to be ANSI columns because the encoding is ANSI.

EgonHugeist wrote:All your VARCHAR-Fields are stUnicodeString fields. This way does'nt realy solve the problem boreas, has. So why this?

Well - actually this way should solve Boreas problem. Because this way he asks Firebird to send everything in unicode and to expect everything in unicode. And there is no harm in doing that because his Delphi is fully unicode compatible.

And - there was some kind of a bug in Zeos 7, if I remember correctly: With Firebird the SQL statements are sent to the database in UTF-8, even if you specify an ANSI connection character set. I don't know why it is like this but that will almost always give some kind of trouble because firebird treat the incoming UTF-8 data as being encoded in ANSI.
But I have to be more careful: I saw this behavour in Delphi XE. I don't know what Zeos 7 will do if it is used in older Versions of Delphi, which use ANSI for their string encodings.

I hope I could clarify what the patch tries to do and how this could help Boreas.
Best regards,

Jan

Post by **EgonHugeist** » 18.01.2012, 20:11

Jan,

i understand you rigth, i'm shure. So let me explain what i mean:

Open Unit ZDbcInterbase6Utils.Pas.
Then goto:
procedure TZParamsSQLDA.EncodeString(Code: Smallint; const Index: Word;
const Str: String);

Set an break-point on: with FXSQLDA.sqlvar[Index] do

Look carefully what happen's here?

All WideStrings your patch is accessing now, where boiled down to a 1Byte-RawByteString by usin UTF8Encode. (This is not a FireBird-Problem it's a generaly problem)
So why this happens?
The FXSQLDA-Record only supports the PAnsiChar you have to know.
If we send nByte-Chars the post are currupt. We need a WideToAnsi-String translation here or we lost some String-Data. On executing UTF8Encode(AString) the result is Ansi-compatible. But the AnsiCodepage turns from your OS-specific CodePage to 6500 or 6501 (UTF8-CodePage).
Example:

Make a sample-project. Create a FireBird-Database and add a Test-Table. Now arrange your Componets to your Test-Database. Take a DBEdit and link it to one VARCHAR-Field. Now do an Insert with your ZTable or ZQuery-Component. Type in some german Characters like: "kk ääää ööö üü ßßß" and then post this with your Component. Take the FlameRobin and open the Table of your Test-Database. Select the data of this table and take a look what was happen to your intyped characters...

Would you be so kind and test this?

Not FireBird arranges the String-Type casting it's Zeos. Or am i wrong?

Sorry i've to correct myself. I found this:

codepage = utf8 equivalent to the "CHARACTER_SETS SET utf8".
it does not change anything on the MySQL database and applies only to the session initiated by TZConnection.
This lasts as TZConnection.connected = true.
Thus, "SET CHARACTER_SET utf8" tells the server, “future incoming messages from this client are in character set utf8.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)
Also according to the manual of MySQL this conversion is slow ...

But what happens if you execute an non-UTF8-Prepared Query directly? The Proplem should be the same. Please correct me..

Best regards,

Michael

marsupilami · Post by **marsupilami** » 19.01.2012, 15:32

Hello Michael,

I am sorry that my answer is somewhat lengthy. But I see no other way how to say everything and be precise at the same time.

EgonHugeist wrote:We need a WideToAnsi-String translation here or we lost some String-Data. On executing UTF8Encode(AString) the result is Ansi-compatible. But the AnsiCodepage turns from your OS-specific CodePage to 6500 or 6501 (UTF8-CodePage).

It seems that we both use some different kind of terminology. For me the term character set conversion means a transformation for transferring a character from one representation style to another representation style where the number assigned to the character may change or where the character may be not transferrable at all. So changing from CP850 to WIN1252 is a character set conversion because the ä changes from number $84 in CP850 to $E4 in WIN1252. The "┴" from CP850 is not transferrable at all. The same applies for a conversion WIN1252 <-> Unicode. Normally you need some kind of a translation table for being able to do these.
When it comes to Unicode, the translation from UTF16/UCS16 to UTF8 is a change in the coding style and no character set conversion, because it will always happen without a loss of data and the characters remain at the same position in the character set tables. The ä will always be $E4. Just the way, the $E4 is encoded changes. For translating between UTF16 (Delphi, Windows) and UTF8 (Firebird) you don't need translation tables - you just need a quite simple algorithm for reencoding numbers - much like it is necessary for converting numbers between big endian and little endian.

So for me the use of UTF8Encode is not a character set translation but just a change in the encoding style of a unicode String.

EgonHugeist wrote:Open Unit ZDbcInterbase6Utils.Pas.
Then goto:
procedure TZParamsSQLDA.EncodeString(Code: Smallint; const Index:
[...]
Look carefully what happen's here?

So, using this, I had a look at TZParamsSQLDA.EncodeString. This routine makes some assumptions that may be true or may not be true.

When it finds a Delphi or Pascal that uses the ansistring for its string type it just copies the incoming data to the Firebird XSQLDA-Structure and hopes that the encoding is correct for what the user specified in the connection settings via isc_dpb_lc_ctype.

And it seems that somebody has transported this way of assuming to the Delphis that use a UnicodeString for their String type. Now it assumes that the user has set the isc_dpb_lc_ctype parameter to UTF8. It just uses UTF8Encode to convert from the UTF16 representation, that is used by Delphi, to the UTF8 representation that Firebird uses, no matter what the connection character set really is.

So this all comes to some questions as to what the Firebird DBC driver can expect from the user and what the user can expect from the DBC driver.
With Zeos 6 it was the users responsibility to tell the DBC driver what encoding he needs for Delphi to work (WIN1252/BIG5/...). UTF8 normally was no valid choice for Delphi. It was a valid choice for Lazarus, as Lazarus uses UTF8. This also can be a codepage that is different from the database codepage because Firebird will do the translation.
What is the users responsibility with Zeos 7? With a Delphi that primarily uses AnsiStrings it will remain the same responsibility, because Zeos can not detect the Character set to be used reliably.
But what is the responsibility of a user with a unicode supporting Delphi?
-> Is the user supposed to use CHARSET=UTF8 to tell the DBC driver that unicode support is required for correct functionality?
-> Or is the DBC driver required to detect what kind of character set the user specified and to use UTF8Encode/UTF8Decode when it is UTF8 and to use something like WideToAnsi/AnsiToWide when it is something else?
-> Or should the DBC driver enforce the use of Unicode (UTF8) when there is a unicode Delphi and not allow anything else?

EgonHugeist wrote:Would you be so kind and test this?

So for your experiment: I did not do it. But I did something like it before. The outcome of the experiment very much depends on what value you set for the CHARSET parameter.
With CHARSET=WIN1252 writing will not work correctly. I can only assume, this is because TZParamsSQLDA.EncodeString silently encodes the target String as UTF8 whereas the database expects it to be in your selected codepage (WIN1252). Reading will work because the strings will be delivered in the correct codepage by the database server and be converted to UCS16 by Delphi.
With CHARSET=UTF8 writing will work properly. This is because now TZParamsSQLDA.EncodeString deliveres exactly the right encoding that the database expects. But reading will not work because the UTF8-encoded strings are just advertised as being ANSI and are not converted to the correct ANSI codepage.
When you apply my patch, reading will work too, because then the Strings will be treated as being Unicode. You get a situation where reading and writing works, regardless of what the database encoding for VARCHARs is.

For who does what codepage conversion:
Right now there are one or two conversions happening, when writing and you need full Unicode support:
1) Delphi to Firebird Client = Unicode(UCS16)<->Unicode(UTF8) -> In My book this is no Codepage Conversion because it is just a change in how the numbers are encoded. In your book this is a codepage Conversion. This one is necessary, when UTF8 is the only Unicode representation that the database understands, as it is with firebird.
2) Firebird Client to Firebird Engine = Unicode(UTF8)<->Database Encoding (UTF8/WIN1252/BIG5) -> This one is necessary for Firebird to be able to support clients that request different character set encodings.

For the use of UTF8Encode/UTF8Decode: You simply need to use that when the only Unicode encoding that your database understands is UTF8. Otherwise you will never be able to use the databases unicode functionality.
And also I think that the DBC driver is the right point to use UTF8Encode/UTF8Decode as only the driver knows what flavors of unicode are supported by the underlying database and what conversions are needed.

EgonHugeist wrote:But what happens if you execute an non-UTF8-Prepared Query directly? The Proplem should be the same. Please correct me..

Hmmm - I don't know, but then I don't need to. I don't know what MySQL does. With Firebird you specify the character set, you want to use when you connect and can not change it afterwards.

So - again I am sorry for this lengthy text. But maybe if you can answer my questions of what is to be the expected behaviour for the DBC driver, then maybe I can modify it to behave the way it is supposed to.

Best regards,

Jan

Post by **EgonHugeist** » 19.01.2012, 17:54

Ok Jan,

the in/out-String handling does FireBird.

So what can i say, your direction of your patch should be the right way to seperate where this Encoding is neccassary and where not. In boreas case the Statement.ExecuteQuery UTF8-Encoding is the problem. A Paramter-Like Query call's the TZInterbase6PreparedStatement where this UTFEncoding messes. So we've to saperete this handling exacly. This case your patch actually messes too. But i think you can integrate this slippy. Can you do this too?

Now we basically start from the premise that we have set ClientCodePage=UTF8 and we're using your patch and we've clearly separeted the use of UTFEncode in the ExecuteQuery-Statemnt. All Component-Handled in&out-Data proper to the database-data.

What happens if we send an UpdateStatement directly?
Like ZQuery1.SQL.Test := 'update Table1 set Col1='ン－手入' where ....; ? This normaly requires an UTF8Encoding, right?

Using a UpdateStatement and a generally UTF8-handling involves problems with binary-Data! Which declares the post of Mark and me made above. Could'nt it happen automaticly? I'm shure it can be done if we dissassamble the Query. Actually i've worked out a fast !workaround!(no perfect sollution) for this problem. But it's not ready for all PlainDrivers so i'll present it later...

So we can conclude combinating your completed-patch and my little solution should be better than all things actualy where done here.

But Jan there is almost one Big-Problem:

Each Ansi is able to ingest characters like 'ン－手入'. Boreas uses the Codepage WIN1254 which can ingest this characters too. So what happens with these charactes if no UTF8 was set and Delphi12Up is used? They turn to ?-?? or somthing like this. This is result of WideString to AnsiString. Here a strong String-Translation-Class would be wonderfull... Like i wrote before.

Doesn't it force us to accept only UTF8 or UTF16-ClientCodepages as minimum requrement since Delphi12 and later?

btw: What happens with Predefined String-Fields in your Patch? They do not automaticaly turn to TWideString-Field which is needed if you change to UTF8-CodePage. Should'nt it be better to declare VARCHAR-Fields always as stUnicodeString and only on setting or reading the stUnicodeStrings we have to do a intervention in dependency of IsUnicodeConnection?

Best regards,

Michael

marsupilami · Post by **marsupilami** » 20.01.2012, 19:32

Hello Michael,

hmmm - for the start I would try to implement a consistent handling of Stringfields. Use (Ansi)StringField when you have an ansi codepage and use WideStringField when it is a Unicode Connection. I think it is the responsibility of the application developer to select the right one. But we could implement a default of UTF8 for the CHARSET parameter, when the user has not set anything else. This will always be better than leaving the way it is now: The firebird default is the special character set NONE.

Using characters like ン－手入 is almost always like asking for trouble. You will almost certainly have data loss or exceptions from the database when you don't have a connection type and storage type that can accomodate these characters. WIN1254 simply can not do this. But knowing this and doing the right things to prevent this is the responsibility of the developer.
It could be desireable to expose all String-Fields as WideStringFields if you want to prevent data loss and raise an exception in a case where dataloss occurs. But I am not sure if this is something that should be done.

Also I don't know what will happen to predefined string fields, if you change between Unicode and ANSI. But we can try once, th patch is finished ;o)

For the binary data, hmmm, this is my very personal opinion:

don't put binary data anywhere else than in blobs
if you still need to store binary data in a non-blob-field encode it using something like base64, base85 or something like that. I use Base64 to store GUIDs...

So, but if you still need to store binary data and don't like these options, then Firebird has the special character sets OCTETS and NONE, which can do these things, as they don't touch the contents of the incoming Data. So for these fields no character set conversion happens.
Columns which use these character sets could / should be exposed as TVarBytesField or TBytesField imho. Does Zeos have support for these Field types? But adding support for this would be a second patch.
If you need to do something like this and you still need unicode support: Hmm - declare the columns in the database as the ANSI Codepage that is used by your system. Then you will get a convesion that could look like this:
WIN1252 <-> Unicode <-> Win1252
This will lead to your data being in the database in the same order as in your windows Ansistring and without data loss.
I don't know what your patch does, but to me it seems impossible to get around character conversions when you are using any other way than the ones described above. So - you got me courious there ;o)

Best regards,

Jan

marsupilami · Post by **marsupilami** » 26.01.2012, 19:54

Hello Michael,

this is the second version of the patch. Strings should now be converted correctly. Maybe you can take a look at it? I did not look into the correct translation of blob fields yet.
Also I have no idea how to do a comprehensive test. Maybe you have an idea?
Best regards,

Jan

Post by **EgonHugeist** » 29.01.2012, 19:16

Jan

sorry i was bussy.. I'll take a look now.

Actually i see no possibility for a character-set based test. Have you checked your patches with the testsuite?

you can emulate a test by coding in info.Values['isc_dpb_lc_ctype'] := 'UTF8' in the interbase contructor...
Then run the TestSuites.

detailed answer later Jan

Best regards