"Out of Memory" whilefetching big dataset
I'm developing an application for data migration between several systems. During testing I ran into an "Out of memory" error while fetching a big result set (about 360,000 records with about 100 columns).
The application exports data from the database into a CSV file (after some processing).
I thought about exporting in parts and appending to the result CSV file (through a LIMIT ... OFFSET clause).
But.
At the beginning I need to find out how many rows my query will return.
And?
I am stuck.
How can I solve that problem? How can I ask the database for the row count without actually fetching those records?
I have tried adding SELECT COUNT(*) FROM ( <base_statement> ) AS RecCount, but it doubles the time needed for the data extract (first pass for the record count, second for the partial data extract).
Any ideas?
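For reference, the COUNT(*) wrapper described above, run from Delphi, could look roughly like the sketch below. The helper name GetRecCount and the parameter names are invented for illustration; Query is assumed to be an already connected TZQuery.
Code:
// Minimal sketch of the COUNT(*) wrapper described above.
// Assumes: uses ZDataset; Query is a connected TZQuery.
// GetRecCount and the parameter names are invented for illustration.
function GetRecCount(Query: TZQuery; const ABaseStatement: string): Integer;
begin
  Query.Close;
  Query.SQL.Text := 'SELECT COUNT(*) FROM (' + ABaseStatement + ') AS RecCount';
  Query.Open;
  Result := Query.Fields[0].AsInteger;
  Query.Close;
end;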
Hi,
What do you mean by slow?
count(*) shouldn't be very slow. On my development machine a count on a similar table with about 500,000 records executes in under 200 ms.
Is your Postgres server configured correctly? What is the size of the shared_buffers parameter?
If you don't need the exact number of records you can try this:
Code:
select
  ( case
      when reltuples > 0
        then pg_relation_size('MY_TABLE')/(8192*relpages/reltuples)
      else 0
    end
  )::bigint as est_count
from pg_class
where oid = 'MY_TABLE'::regclass;

Fetching all records with a ZQuery requires a lot of memory in this case. Maybe you can try to first fetch the data to a local file with COPY and then parse it?
I can send you more details if you are interested.
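Running that estimate from Delphi is ordinary query handling. A minimal sketch, assuming a connected TZReadOnlyQuery named EstQuery and that 'MY_TABLE' is replaced by the real table name:
Code:
// Sketch: read the planner-based row estimate from pg_class.
// EstQuery is an assumed, already connected TZReadOnlyQuery.
// Uses ZDataset (TZReadOnlyQuery) and Dialogs (ShowMessage).
procedure ShowEstimatedCount(EstQuery: TZReadOnlyQuery);
begin
  EstQuery.SQL.Text :=
    'select (case when reltuples > 0 ' +
    ' then pg_relation_size(''MY_TABLE'')/(8192*relpages/reltuples) ' +
    ' else 0 end)::bigint as est_count ' +
    'from pg_class where oid = ''MY_TABLE''::regclass';
  EstQuery.Open;
  // est_count is only an estimate, not an exact row count
  ShowMessage('Estimated rows: ' + EstQuery.FieldByName('est_count').AsString);
  EstQuery.Close;
end;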
Fetching data from pg_class will not work, because I use a lot of joins, unions, and so on in one query (there are 25 different complicated queries collecting data from many tables).
By 'slow' I mean doubling the time needed to count and then extract the data - Postgres first has to run the complicated statement for counting, and then repeat the whole process with another statement to fetch the data.
Can I use COPY to extract data from the server onto a local disk? I mean, the user connects remotely to the server - can 'SELECT ... COPY' write the fetched data to the local disk?
I hope I explained my problem clearly enough ... my English is not good enough.
toyman,
As you're using pg you could do the preprocessing on the server side using a stored proc which returns only the fields to be exported.
Does your calculation really need all 100 columns of these 360,000 rows? If not, you can save memory by not selecting the unneeded ones.
You can also split the data by splitting only the 'base table' of your complex query. Usually even complex queries start from one table. By doing a (quick!!) count of the relevant records in this table you could decide which intervals to query in your partial exports.
Finally, you could try to use a TZReadOnlyQuery configured to be forward-only. In some cases this might give you a dataset without the internal ZeosLib cache.
Mark
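To make the 'count the base table' idea a little more concrete, here is a rough sketch that only plans the LIMIT/OFFSET slices; base_table, the chunk size and the component name are invented placeholders, and Query is assumed to be an already connected TZReadOnlyQuery.
Code:
// Sketch: count only the driving table (cheap), then derive how many
// LIMIT/OFFSET slices the partial exports will need.
procedure PlanChunks(Query: TZReadOnlyQuery);
var
  TotalRows, ChunkSize, Chunks: Integer;
begin
  Query.SQL.Text := 'SELECT COUNT(*) FROM base_table';  // placeholder table name
  Query.Open;
  TotalRows := Query.Fields[0].AsInteger;
  Query.Close;

  ChunkSize := 10000;
  Chunks := (TotalRows + ChunkSize - 1) div ChunkSize;  // integer ceiling
  // slice i (0-based) would then be exported with:
  //   LIMIT ChunkSize OFFSET i * ChunkSize
end;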
Maybe I didn't explain it clearly enough.
My program is a migration package between (currently) four database systems (all on different databases: Oracle, PostgreSQL, MySQL, DB2). Yes, all 100 columns have to be included in the dataset. This is caused by the input format of one of the supported systems: it imports data by column number and I cannot predict which of the columns are null, unimportant, or 'always empty'. Another disadvantage is that I did not compose these queries - I get them ready to fire (another crew of our team composes them and I get them 'as is'; in the future the queries will probably be imported dynamically from XML files).
I cannot always predict which table is the 'master' and can be used for row counting.
I will try TZReadOnlyQuery as you suggested. Can you tell me more about it? How does it work?
Basically the TZReadOnlyQuery works the same way the TZQuery does. Only posting changes back doesn't work, which allows fewer buffers to be created.
The special situation you need is when TZPostgreSQLStatement.CreateResultSet returns the NativeResultSet.
I think for a TZReadOnlyQuery this is the default situation.
Mark
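For what it's worth, once a TZReadOnlyQuery is open the export itself is plain TDataSet iteration. A stripped-down sketch follows; CSV quoting/escaping is omitted and all names are invented.
Code:
// Sketch: walk the read-only result once and write one CSV line per record.
// ROQuery is an assumed, already connected TZReadOnlyQuery (unit ZDataset).
procedure ExportToCsv(ROQuery: TZReadOnlyQuery; const FileName: string);
var
  F: TextFile;
  i: Integer;
  Line: string;
begin
  AssignFile(F, FileName);
  Rewrite(F);
  try
    ROQuery.Open;
    while not ROQuery.Eof do
    begin
      Line := '';
      for i := 0 to ROQuery.FieldCount - 1 do
      begin
        if i > 0 then
          Line := Line + ';';
        Line := Line + ROQuery.Fields[i].AsString;  // no quoting/escaping here
      end;
      WriteLn(F, Line);
      ROQuery.Next;
    end;
    ROQuery.Close;
  finally
    CloseFile(F);
  end;
end;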
It is hard to find a generic solution for your problem. You can fetch a select directly into a text file with something like

Code:
// ZReadOnlyQuery.SQL is a TStrings, so set its Text; note the doubled
// single quotes inside the Delphi string literal.
ZReadOnlyQuery.SQL.Text :=
  'copy (select something with join ...) to E''c:\\tmp\\copy_test.txt'' with CSV';
ZReadOnlyQuery.ExecSQL;

and avoid memory issues - you will have a local text file which is less demanding for further processing (e.g. you can simply count its lines to find out the record count).
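Counting the lines of the produced file, as suggested above, can be as simple as the sketch below (the function name is invented). Note that this version loads the whole file into memory, so a streaming read would be kinder for very large exports.
Code:
// Sketch: record count = number of lines in the exported CSV.
// Uses Classes (TStringList); loads the entire file into memory.
function CountCsvLines(const FileName: string): Integer;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(FileName);
    Result := Lines.Count;
  finally
    Lines.Free;
  end;
end;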
I found a solution. Something close enough to my expectations.
The solution is simple: make partial reads until the result set is empty.
The Delphi code looks like this:
Code:
procedure ExecSQL(ASQL: string);
var
  l_limit, l_offset: Integer;
begin
  l_offset := 0;
  l_limit := 10000;
  repeat
    Query.Close;  // close the previous chunk before changing the SQL
    Query.SQL.Clear;
    Query.SQL.Add('(');
    Query.SQL.Add(ASQL);
    Query.SQL.Add(')');
    Query.SQL.Add('LIMIT ' + IntToStr(l_limit) + ' OFFSET ' + IntToStr(l_offset));
    Query.Active := True;
    // export routine here
    l_offset := l_offset + l_limit;
  until Query.RecordCount = 0;
end;

I know it is not beautiful or extra clear, but for now it produces a good enough effect.
I hope it can help someone in the future.
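A hypothetical call of the routine above would simply pass the prepared statement text; the query below is only a placeholder. One thing worth keeping in mind: without a stable ORDER BY in the base statement, the LIMIT/OFFSET slices are not guaranteed to be consistent between the partial reads.
Code:
// Hypothetical usage; Query must be connected before ExecSQL runs.
ExecSQL('SELECT col1, col2 FROM some_table ORDER BY col1');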