TZPlainDriver.GetUnicodeCodePageName?

aehimself · Post by **aehimself** » 01.03.2024, 22:04

I think we have some errors in the implementations of GetUnicodeCodePageName at some protocols. As far as I understand, this function should return the Unicode (or closest, if it's not supported) code page to use for the specific protocol.

At the moment the following drivers return it:
- TZAdoPlainDriver: CP_UTF16. Perfect
- TZASAPlainDriver: UTF-8. Unicode is not supported, UTF8 is fine
- TZInterbaseFirebirdPlainDriver: UTF8. UNICODE_FSS is available as a codepage, it should be used instead
- TZMySQLPlainDriver: utf8. I have some memories that utf8mb4 should be used as utf8 is not a "real" utf8
- TZOraclePlainDriver: AL32UTF8. UTF16 is available as a codepage, it should be used instead
- TZPostgreSQLPlainDriver: UTF8. UNICODE is available as a codepage, it should be used instead
- TZProxyBaseDriver: UTF-16. Perfect
- TZSQLitePlainDriver: UTF-8. Unicode is not supported, UTF8 is fine
- TZProxyMgmtBaseDriver: UTF-16. Perfect

I can prepare a pull request to fix TZInterbaseFirebirdPlainDriver, TZMySQLPlainDriver, TZOraclePlainDriver and TZPostgreSQLPlainDriver; I just want to make sure we all agree on the above.

MJFShark · Post by **MJFShark** » 02.03.2024, 04:37

I may be reading the suggestion for Firebird incorrectly, but according to their docs UNICODE_FSS isn't recommended.

https://firebirdsql.org/file/documentat ... es-unicode

Also I believe the "Unicode" character set in PostgreSQL is just an alias for UTF8.

https://www.postgresql.org/docs/current ... #MULTIBYTE

-Mark

aehimself · Post by **aehimself** » 02.03.2024, 13:43

To be honest my main targets are MSSQL, MySQL and Oracle, I barely (or never) used the rest of the RDBMSes. So yeah, thank you for the insight!

As for Firebird I agree. As the document you linked clearly states it's not complete, whatever that means:

UTF8 comes with collations for many languages. UNICODE_FSS is more limited and was previously used mainly by Firebird internally for storing metadata.

In PostgreSQL I'd still change it to Unicode - especially if it's a server side alias. My reasoning behind is, maybe in the future Unicode will be implemented and fully operational and they change the alias to it's own, valid codepage. In the mean time, it's fully backwards compatible as earlier versions will simply interpret it as UTF8.

Thoughts?

MJFShark · Post by **MJFShark** » 02.03.2024, 14:19

I've never liked "Unicode" as a character set designator just because it's hard to know exactly what it means. Microsoft (and then Delphi) decided to use it as an alias for UTF-16LE decades ago, but MS reversed that decision a couple years ago and now you can see the much better list in notepad's save window (for example.) I use the new list in all my stuff as I find it more clear. I'd vote to keep UTF8 until Pg supports UTF16LE as a client-side charset.

I agree that for Oracle, specifically on Windows, it makes sense to use OCI_UTF16ID. I've been using the deprecated OCI_UTF16 mode with another Oracle Library for many years without issue. I haven't looked into client-side UTF-16 in other clients.

-Mark

marsupilami · Post by **marsupilami** » 06.03.2024, 09:00

Note: Please don't take this post as an offense. It is not meant that way.
I feel a bit like I see the repetiotion of a long and hard discussion I had with somebody else some years ago. When talking about Unicode it seems to be important do make the distinction between Unicode as a character set (a table that assigns numbers to printable symbols) and the encoding of these numbers into bytes (UTF16(BE/LE), UTF32, UTF-8, ...)

So - when I read the following:

aehimself wrote: ↑01.03.2024, 22:04 - TZASAPlainDriver: UTF-8. Unicode is not supported, UTF8 is fine

It kinda itches and I have to make this distinction again: Unicode *is* supported with this driver. The Encoding just isn't UTF16.

So - going back to GetUnicodeCodePageName: The functions purpose is to give the name that this driver (and the database behind it) use for a unicode compatible encoding. Depending on the database this can be UTF8, UTF-8, UTF16, AL32UTF8 or whatever.
The idea is to be able to (hopefully) transfer all characters as good as possible and not have any data loss on modern Unicode enabled IDEs (Lazarus, Delphi since the 2009 version(?)).

aehimself wrote: ↑02.03.2024, 13:43 In PostgreSQL I'd still change it to Unicode - especially if it's a server side alias. My reasoning behind is, maybe in the future Unicode will be implemented and fully operational and they change the alias to it's own, valid codepage. In the mean time, it's fully backwards compatible as earlier versions will simply interpret it as UTF8.

Well Unicode already is implemented - in the UTF8 encoding

Also I don't like the idea of using the "Unicode" alias. Assume we use the "Unicode" alias and for now this means we get UTF8 which we are prepared for. Everybody is happy. And then PG decides to remap "Unicode" to to UTF16LE. We start getting UTF16LE encoded data which we are not prepared for. Since there is no specification on how we can detect the usage UTF16 strings, we most probably will get an error and Zeos stops working out of the blue just beause somebody uses a new more modern PotgreSQL.
I think we should stick to using UTF8 in GetUnicodeCodepageName. If PG ever decides to change its behavior, we can prepare for UTF16 encoded datan and then we can change to use that representation when we support it too.

Regarding the use of UTF16 instead of AL32UTF8 on Oracle: It makes a lot of sense to use UTF16 encoded data on Delphi. When it comes to Lazarus I am not so sure, because Lazarus and FPC (currently) use a lot of UTF8 internally.
Also I had the impression that UTF16 support was not very good until the fixes of Mark? Maybe it makes sense to not use UTF16 as a default for Oracle in Zeos 8 and maybe postpone that switch to an (not yet planned) Zeos 8.1 or Zeos 9?
But then - this is a decision for the main Oracle users to make as I don't use Oracle at all.

aehimself · Post by **aehimself** » 06.03.2024, 18:49

marsupilami wrote: ↑06.03.2024, 09:00Note: Please don't take this post as an offense. It is not meant that way.

None taken, the exact reason the patch is not sent is that I wanted to have a discussion first. There are lot's of things I don't see / know what others might!

marsupilami wrote: ↑06.03.2024, 09:00When talking about Unicode it seems to be important do make the distinction between Unicode as a character set (a table that assigns numbers to printable symbols) and the encoding of these numbers into bytes (UTF16(BE/LE), UTF32, UTF-8, ...)

So - when I read the following:
aehimself wrote: ↑01.03.2024, 22:04 - TZASAPlainDriver: UTF-8. Unicode is not supported, UTF8 is fine
It kinda itches and I have to make this distinction again: Unicode *is* supported with this driver. The Encoding just isn't UTF16.

I can see the difference, but the name (and the usage) of the method still supports the idea to use the "latest and greatest".
This method returns the code page name supported by Zeos for said protocol, and this is exclusively what I am talking about. I don't have deep knowledge of how specific RDBMS systems store data behind the curtains but I can see what code pages I can select when connecting to a host with Zeos.
This method is called by TZAbstractPlainDriver.ValidateCharEncoding IF none was supplied; it simpy tells Zeos to use the "latest and greatest" if available:

Code: Select all

    {$IF defined(LCL) or defined(UNICODE) or not defined(MSWINDOWS)} //if the user didn't set it
    if ClientCharacterSet = '' then begin
      S := UpperCase(GetUnicodeCodePageName);

marsupilami wrote: ↑06.03.2024, 09:00I think we should stick to using UTF8 in GetUnicodeCodepageName.

That is what I started to think when I read your post. But in this case all drivers should return the UTF-8 charset name and also the name of the method should be changed as it is confusing from implementation perspective.

marsupilami wrote: ↑06.03.2024, 09:00Also I had the impression that UTF16 support was not very good until the fixes of Mark?

Yes, Michael's implementation had issues which Mark managed to "fix". I'm putting it in quotes as he mentioned that there is still data lost but I could not reproduce it even on the insanely huge datasets we have - from my (real world) perspective UTF-16 is fully functional on Oracle now. I'm thinking the data loss is only on edge cases...

MJFShark · Post by **MJFShark** » 06.03.2024, 19:35

On the oracle utf16 buffer issue:
The Oracle data loss is specific to surrogate pair characters in UTF16 mode. I only found it out when doing some "edge case" type testing. I don't suspect it would happen very often "in the wild", but I tend to err on the side of caution and I hate to leave a possible bug. I've also noticed that a certain paid db library I use doesn't handle it correctly and silently truncates...

On the GetUncodeCodePageName question:
I think I'd vote to keep the current charset handling as far as GetUnicodeCodePageName goes. It works very well "as is" and it's really just a matter of where that UTF16 conversion takes place (either in the db client itself or in Zeoslib.) It's easy to change it for the Oracle properties if desired.

-Mark

Zeoslib Portal

TZPlainDriver.GetUnicodeCodePageName?

TZPlainDriver.GetUnicodeCodePageName?

Re: TZPlainDriver.GetUnicodeCodePageName?

Re: TZPlainDriver.GetUnicodeCodePageName?

Re: TZPlainDriver.GetUnicodeCodePageName?

Re: TZPlainDriver.GetUnicodeCodePageName?

Re: TZPlainDriver.GetUnicodeCodePageName?

Re: TZPlainDriver.GetUnicodeCodePageName?