[patch_done] Strings like E'\\' are incorrectly tokenized

Post by **EgonHugeist** » 15.08.2012, 12:38

olehs,

did now check your patch. Great work again. I like that ZUrl.OnPropertiesChange idea.

Question: Now you added CheckEscapeSyntax(). As far as i can see do you check now only for 'E'. If i did understand you right in the past then a complete String like E'abcd\\' should be detectable or not? A quoted state will only be detactable if '/"/´ is used...

Can you explain the purpose to a postgre dummy a little bit more detailed?

Patch done Rev1648.

Michael

olehs · Post by **olehs** » 15.08.2012, 14:34

EgonHugeist,

Sorry, I didn't actually understand your question.

A quoted state is everything in '
'E' is a modifier.

Post by **EgonHugeist** » 16.08.2012, 21:07

olehs,

this is right and i completely understand this. Where i have my problems to understand is that the detection of E'\sf\' still returns ttWord, ttQuoted (or am i wrong?). Thought a complete ttQuoted should be the better result ?? Anyway you did a great job.

Michael

olehs · Post by **olehs** » 17.08.2012, 19:08

EgonHugeist,

yes, it should be ttQuoted.
I just didn't know how to make it right (how to detect Emp as ttWord, but E'mp' as ttQuoted).

Post by **EgonHugeist** » 18.08.2012, 09:08

olehs,

that's what i try to propose in the past. The one and only possibility i see is to upgrade the ttWord detection. E is a ttWordState. But the ttWord detection breaks if CharInSet[A..Z, a..z, 0..9] is invalid. What if we add a test for Quotes before leaving and making a TProcedure like TttQuotedNextToken = procedure(): TZToken; then we can add the same procedure to the ttWord AND ttQuoted detection...

Ohlehs the E'asfsf' is just a simple start you proposed to have. But there are some more like these: 0x'de23ef...'(ODBC-binary) or x'af0d....'(sql-standart binary) as binary (for all plains who do understand this syntax).

If such spezials are detectable then we can think about some extra states like ttBinary, if that makes sence...

What do yo think?

Michael

olehs · Post by **olehs** » 18.08.2012, 09:24

EgonHugeist,

I absolutely agree. E-modifier was just a 'must have' case, because if tokenized incorrectly - it broke the whole query processing.

It would be great to have methods for detecting such states.

Post by **EgonHugeist** » 19.08.2012, 11:49

olehs,

seems like you confirm my thoughts. The point why i think it can be helpfull is to avoid Encoding checks or something like this. (Ideas are welcome)

i propose you start with postgre for the bytea \x.... and so on (sysmbol-state upgrade?)
and i'll start with the x''/0x'' for the other Drivers. Also do i propose to make this detection optional. (Nobody missed it be for ): Only me!)

Much more interesting for me would be _UTF8() string cast detections...

What do you think, olehs?

Michael

olehs · Post by **olehs** » 19.08.2012, 15:14

EgonHugeist,

Nobody missed it be for

I guess nobody missed that detection before because it doesn't affect resulting statement, so there is no need to make it optional (at least someone uses tokinizer directly).

i'll start with the x''/0x'' for the other Drivers

How do you think will be the best? Do we have to allow one state to replace itself with another (SymbolState->QuotedState)?

Post by **EgonHugeist** » 19.08.2012, 19:52

olehs,

I guess nobody missed that detection before because it doesn't affect resulting statement, so there is no need to make it optional (at least someone uses tokinizer directly).

Agree. And i can't see disadvantages or issues to support such detections. A token detection E as ttWord + '\\ee\'as ttQuoted or x as ttWord '15af2b' as ttQuoted is (in my mind) useless except the function knows about issues like these and reassambles these Tokens again (IZSelectSchema, ZExpression for example).

How do you think will be the best? Do we have to allow one state to replace itself with another (SymbolState->QuotedState)?

Hmm i think the best would be:
- keeping the generic CharacterStates like they are (and after you latest patches).
- creating child detections of a state which the tokens do use first: \x0123af (PostgreSQL Binary) should be a ttSymbolState first. Also the octal and all others you've proposed before. So i'm not thinking about creating new TZToken objects. I'm only thinking about how to implement such child detections IF the TZPostgreSQLSymbolState(and all others).NextToken would be prepared for this.
Then the result of the detection could be a new one which we decide like:
Token.Value='_UTF8'/Token.Type=ttCharacterSet or Token.Value='x'12afed45'/Token.Type = ttBinary.

Just have a look to the ttKeyWord detection or the number-states(for integers/Hex... etc.)

Here i see generally no disadvantages if we make a fine tuning if the sub-detections. But this is now only my thoughts. Other/better porposals, olehs?

Michael

olehs · Post by **olehs** » 19.08.2012, 20:26

EgonHugeist,

as far as I know, there is only 4 types of string "modifiers" in PG:
E'abcd\n' - escaped strings
B'1001' - bit-strings
X'1FF' - hex-stings
U&'abc\0422' - strings with Unicode escapes
and all of them are just quoted strings.

So if I understand you right, TZPostgreSQLWordState will have to detect these sequences and call TZPostgreSQLQuoteState.NextToken? But in this case TZPostgreSQLQuoteState.NextToken will have to look back anyway to check the E-syntax.

And what do you think of string like
U&'d!0061t!+000061' UESCAPE '!'
Does it have to be just a single ttQuote too?

Post by **EgonHugeist** » 19.08.2012, 21:06

olehs,

So if I understand you right, TZPostgreSQLWordState will have to detect these sequences and call TZPostgreSQLQuoteState.NextToken?

Yes this is what i've tryed to say in the past too. In my mind should we export the TZPostgreSQLQuoteState.NextToken function to be able to call the same from different states like TZPostgreSQLQuoteState IF (example)

'E' was detected as ttWord. Then the next char is the #39(') and TZPostgreSQLQuoteState.NextToken breaks here. But if TZPostgreSQLQuoteState.NextToken knows about that 'E','B','X','U&' can be followed from a ttQuoted then we go on reading and change the result-state to an optional type...

But in this case TZPostgreSQLQuoteState.NextToken will have to look back anyway to check the E-syntax.

Yes this is the issue we have now. I think this look back to the 'E' is not really needed IF you agree with my proposal. The result of my proposal is either a complete ttQuoted/ttEscapeQuoted(which we can introduce here) or ttBinary or ttBit and so on..

And what do you think of string like
U&'d!0061t!+000061' UESCAPE '!'
Does it have to be just a single ttQuote too?

That's a good point i didn't know about before. I don't think a complete ttQuoted would be helpfull here or becomes to complicatate. You are the PostgreSQL hero here, not me.

The qestion for me is for what do you think we can use such detections?

I'm trying to explain what i want to do:
My PreprepareSQL was an idea to avoid double check's of still right encoded and escaped String from our Parameter-Queries.

Now we start from the premise we use an Ansi-Compiler.
TZQuery.SQL.Text := 'select * from Table1 where Field1 = :F1 and Field2 = 'Müller'.

If the connection was opened in UTF8 then Zeos takes care the the Parameter F1 is allway right encoded as UTF8String. But the Query will fail because the 'ü' of 'Müller' is not UTF8Encoded. Simply that was the reason for splitting the queries into tokens to avoid partial wrong encoded values. For the UnicodeIDE's this is not the case here, this has onother planned purpose. So i set the escapeMarks around the F1 Parameter-Value to avoid speed decrease of the function ZPlainString which calls DetectUTF8Encoding('Müller') first and decides what happens to the partial string value.

Now we send also binary data as strings to the Server and checking this token first. Which means we read the whole token once and then again to check the encoding. IF we know the next token is ttBinary or an Character-State then we do not need to check this again..

On the other hand it could be helpfull for everybody to have such a detection, or not? You are one of the guys who know obout the possibilities we have and others too.

Btw. the EncodeBytea function returns a string like '\x34dfa2' as binary encoded strings... With a starting backslash and no quotes. Does postgre understand the x'edfaee2455' syntax too?

Michael

olehs · Post by **olehs** » 19.08.2012, 22:11

EgonHugeist,

'E' was detected as ttWord. Then the next char is the #39(') and TZPostgreSQLQuoteState.NextToken breaks here. But if TZPostgreSQLQuoteState.NextToken knows about that 'E','B','X','U&' can be followed from a ttQuoted then we go on reading and change the result-state to an optional type...

Ok. I think this looks good and can be done.

I'm trying to explain what i want to do:

My PreprepareSQL was an idea to avoid double check's of still right encoded and escaped String from our Parameter-Queries.

Oh... now I understood the whole idea you put into PreprepareSQL processing.
I don't really know what to say... This seems like a big challenge. There can be so many issues like the one I patched at rev.1651. There was a query like

Code: Select all

select id as Код from orders

and TWordState couldn't detect non-latin alias. And now according to what you said, ttWord has to be checked for Encoding. And there may be surprises also in DDL statements and so on.

I think as a developer that uses Zeos, for me is better to manually control that encoding of SQL and in-parameters were the same.

Btw. the EncodeBytea function returns a string like '\x34dfa2' as binary encoded strings... With a starting backslash and no quotes. Does postgre understand the x'edfaee2455' syntax too?

PostgreSQL understands x'edfaee2455' as a bit string constant (and has nothing to BYTEA as I know). For example
select x'edfaee2455'
returns 1110110111111010111011100010010001010101
and
select x'edfaee2455'::int8
returns 1022117160021

olehs · Post by **olehs** » 20.08.2012, 13:18

EgonHugeist,

Ok. I think this looks good and can be done.

can you please look at this patch. It overrides CheckEscapeState and replaces the State. Can't say it's the fastest way, but the simplest.

Post by **EgonHugeist** » 20.08.2012, 14:32

olehs,

had a look now. Not my proposal but it seems to handle really everything we've talked about... Great job again. Any suggestions about different token types than ttQuoted?

We can add an option enumeration type like toAcceptModifiers and on TZTokenizer.TokenizeStreamToList we can check for this option. If this option is not set we reset the Token.Type(ttBinary) back to ttQuoted..

What do you think?

Commit your patch, olehs.

Michael

Btw. Actually i'm trying to complete the PostgreSQL RealPrepared-Statements (not ready yet, have some issues left. If you want to test them then exchange the TZPostgreSQLEmulatedPreparedStatement with TZPostgreSQLPreparedStatement on line 576 ZDbcPostgreSql.pas)

olehs · Post by **olehs** » 20.08.2012, 15:09

EgonHugeist,

We can add an option enumeration type like toAcceptModifiers and on TZTokenizer.TokenizeStreamToList we can check for this option. If this option is not set we reset the Token.Type(ttBinary) back to ttQuoted..

What do you think?

I absolutely don't mind

. Couldn't do it myself because I'm not familiar with other databases' specific syntax.

I also think there is no need for extra token type in PostgreSQL. ttQuote (and maybe ttEscapedQuoted) is quite enough.

Btw. Actually i'm trying to complete the PostgreSQL RealPrepared-Statements (not ready yet, have some issues left. If you want to test them then exchange the TZPostgreSQLEmulatedPreparedStatement with TZPostgreSQLPreparedStatement on line 576 ZDbcPostgreSql.pas)

Yes I saw it and think this is just a great job you are doing. If you need any help I will be glad to help you.