Monday, February 20, 2012

How to import big GAMS result data into SQL Server

Recently I was involved a few models where GAMS was generating a lot of results, exceeding what we could easily store in Excel or Access. Here is an alternative: store the results in SQL server.

Here is a small two-liner example script that shows how we can efficiently import a very large parameter with 8 million records residing in a GDX file into a SQL server table.

$call gdx2txt.exe in=x.gdx id=p out=p2.txt tab
$call bcp testdb.dbo.result in p2.txt -c -S .\SQLEXPRESS –T
-h "ROWS_PER_BATCH=8000000,TABLOCK"

The first line exports the parameter from a GDX file to a tab delimited text file. The tool gdx2txt is a little bit more convenient than using GAMS PUT statements and it is also faster (using the same precision more than 5 times as fast).

The second line calls the BCP utility to import this text file.  The hints setting the batch size and requesting a table lock increase the performance.

The log looks like:

--- Job gdx2txt3.gms Start 02/24/12 23:09:02 WEX-WEI 23.7.3 x86_64/MS Windows    
GAMS Rev 237  Copyright (C) 1987-2011 GAMS Development. All rights reserved
Licensee: Erwin Kalvelagen                               G100803/0001CV-WIN
          Amsterdam Optimization Modeling Group                      DC4572
--- Starting compilation
--- gdx2txt3.gms(1) 2 Mb
--- call gdx2txt.exe in=x.gdx id=p out=p2.txt tab
GDX2TXT:Opening "x.gdx"
GDX2TXT:Symbol:p Uels:200
GDX2TXT: 100000 records (elapsed 172 millisec)
GDX2TXT: 200000 records (elapsed 320 millisec)

GDX2TXT: 7800000 records (elapsed 11201 millisec)
GDX2TXT: 7900000 records (elapsed 11343 millisec)
GDX2TXT: total records: 8000000 (elapsed 11484 millisec)
--- gdx2txt3.gms(2) 2 Mb
--- call bcp testdb.dbo.result in p2.txt -c -S .\SQLEXPRESS -T -h "ROWS_PER_BATCH=8000000,TABLOCK"

Starting copy...
1000 rows sent to SQL Server. Total sent: 1000
1000 rows sent to SQL Server. Total sent: 2000
1000 rows sent to SQL Server. Total sent: 3000

1000 rows sent to SQL Server. Total sent: 7998000
1000 rows sent to SQL Server. Total sent: 7999000
1000 rows sent to SQL Server. Total sent: 8000000

8000000 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total     : 24055  Average : (332571.19 rows per sec.)
--- Starting execution - empty program
*** Status: Normal completion
--- Job gdx2txt3.gms Stop 02/24/12 23:09:38 elapsed 0:00:35.736

The total turn around time for this test data was 35 seconds. About 11 seconds for GDX2TXT and 24 seconds for BCP. So the database import is only 2.5 times as slow compared to my time to generate the CSV file. That is very fast indeed. The table had four columns: three GAMS indices (converted to type NCHAR(4)) and a value (type REAL). No (SQL) indices or keys were used against the table so the inserts could be done at optimal speed. Also the db file sizes were extended so they did not need to grow during the inserts. As a result,  in practice the BCP timings may be a bit slower than shown here.

Note: dropping all rows by “DELETE FROM testdb.dbo.result” can take a long time (1.5 minutes on my machine). Much faster is “TRUNCATE TABLE testdb.dbo.result”.

Usage:

C:\projects\gdx2txt>gdx2txt
Usage:
  gdx2txt in=gdxfile [id=name] [out=txtfile] [csv|tab]
Write identifier from GDX file to an ascii file
Parameters:
   in=gdxfile  name of GDX file, required
   id=name     name of identifier, optional if gdx has only one id
   out=txtfile name of outputfile, optional, stdout is used otherwise
   csv         write ascii file in CSV format
   tab         write ascii file in tab delimited format
Example:
  > gamslib trnsport
  > gams trnsport gdx=trnsport
  > gdx2txt in=trnsport.gdx id=x
  > gdx2txt in=trnsport.gdx id=x out=x.csv csv

C:\projects\gdx2txt>bcp
usage: bcp {dbtable | query} {in | out | queryout | format} datafile
  [-m maxerrors]            [-f formatfile]          [-e errfile]
  [-F firstrow]             [-L lastrow]             [-b batchsize]
  [-n native type]          [-c character type]      [-w wide character type]
  [-N keep non-text native] [-V file format version] [-q quoted identifier]
  [-C code page specifier]  [-t field terminator]    [-r row terminator]
  [-i inputfile]            [-o outfile]             [-a packetsize]
  [-S server name]          [-U username]            [-P password]
  [-T trusted connection]   [-v version]             [-R regional enable]
  [-k keep null values]     [-E keep identity values]
  [-h "load hints"]         [-x generate xml format file]
  [-d database name]

C:\projects\gdx2txt>

No comments:

Post a Comment