Fooling around with TPC-H data, ADF and Hyperscale Serverless

In one of my last blogs, I wrote about my first encounter with the Azure Hyperscale Serverless offering. Now it’s time to dig a bit deeper and see what it’s up to.

Disclaimer: Azure Hyperscale Serverless is in preview, and one of the things that isn’t active yet is the auto shutdown. This means that it will stay online 24/7. And bill you for every second it’s online. In my case, this meant that my Visual Studio credits ran out and I couldn’t use my Azure subscription anymore. Keep this in mind when testing it out, especially if your credit card is connected to said subscription.


To get the data, I downloaded and compiled the TPC-H data generator and created 500 GB of data files in 5 iterations. The process generates 5 sets of 8 files, each set being about 100 GB. This way, I hoped to get some parallelism reading the data. The LineItem file is the biggest one, with not only the most records (over three billion) but also the largest size (a little over 400 GB).
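The chunked generation can be scripted. A sketch, assuming the dbgen binary from the TPC-H tools has been compiled in the working directory; its -C/-S flags split one scale factor across multiple runs. This version only prints the five invocations instead of running them:

```shell
# Sketch: build 500 GB of TPC-H data in 5 chunks so the load can parallelise.
# Assumes ./dbgen has already been compiled (TPC-H tools).
# -s = scale factor in GB, -C = total chunks, -S = which chunk this run builds.
SCALE=500
CHUNKS=5
CMDS=""
for STEP in $(seq 1 "$CHUNKS"); do
  CMD="./dbgen -s $SCALE -C $CHUNKS -S $STEP -f"
  CMDS="$CMDS$CMD
"
  echo "$CMD"   # replace echo with the bare command to actually generate
done
```

Each chunk can run on a different core (or machine), which is how the ~35 hours of generation time could be brought down.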

One set of TPC-H files

The storage account is a standard hot tier blob storage, no hierarchical namespace enabled. Uploading the files took about 10 hours, creating the files about 35 hours on a 10-year-old Intel Core i5 machine.

The Hyperscale database was configured in the Serverless option with a maximum of 16 cores, minimum was automatically set at 2 cores. The tables were created with the correct schema, no indexes, (foreign) keys or other tricks were used. You can find the schema I’ve used here.

Finally, I used Azure Data Factory copy flows to get the data from the storage to the database. The schema was imported from the database and some conversion can take place when reading the data and writing it to the database.

Example column and data type mapping (Orders table)

Those who know the TPC-H output will notice that I’m not writing about all the tables. I did load all of them, but for readability I’ll only discuss a small, a medium, a large and an extra large one. The smallest ones (region and nation) are discarded as their size doesn’t make sense in this test case.

Test 1: CSV to SQL

Copy data flows

My first try was to get all the data from the CSV files into the SQL instance. I created the flows to read the files and directly write into the database. As the tables have no indexes, I expected a lot from both data factory (that had no limits specified) and the database (that had to welcome all the data with open arms).

Default settings for the copy flows

Let’s see what happened. The smallest table (Suppliers, 723 MB) took a little under three minutes to load. It grew to almost 1,5 GB in size and when the 5 million rows were loaded, the performance was about 29.940 records per second.

One of the medium sized tables (Part) with 100 million rows (12,5 GB) took 52 minutes to load and had an average speed of 31.867 rows per second. This one grew to 21 GB in SQL data.

The Orders table (90 GB, 750 million rows) took over 6 hours to load, grew to 136 GB and averaged at 33.708 rows per second.

The really big one, LineItem, timed out after 12 hours. I forgot to change the maximum run time. In that time, it managed to transfer a little over 1 billion rows to the SQL database, growing the data from 136 to 178 GB. It averaged a speed of 23.321 rows per second.
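The per-second averages are simply rows divided by elapsed seconds; the LineItem figure, for instance, follows directly from the 12-hour timeout:

```python
# Reproduce the reported LineItem average: rows transferred before the
# 12-hour activity timeout, divided by the elapsed seconds.
rows_transferred = 1_007_478_970
elapsed_seconds = 12 * 60 * 60          # the 12:00:00 maximum run time
rows_per_second = rows_transferred / elapsed_seconds
```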

The different copy flows had 4 used DIUs and 2 used parallel copies. They had 20 peak connections at the source, 2 at the sink.

Let’s see the speed in a graph:

Rows per second when reading CSV into SQL

Now, the Azure storage reportedly works better with parquet files, so let’s try that!

CSV to Parquet

I created new copy data flows that read the CSV files and write them to parquet files. To give all the processes the ability to go as parallel as they wish, I’ve set the parquet sink to split the files in sets of 1 million rows. From a size point of view, maybe on the small side but file sizing is another discussion ;).

The smallest table (Suppliers, 723 MB) took a little under a minute to process. The volume shrunk to a little over 400 MB and when the 5 million rows were loaded, the performance averaged at 87.719 rows per second.

The medium sized table (Part) with 100 million rows (12,5 GB) took 5 minutes to load and had an average speed of 323.625 rows per second. This one shrunk to 4,5 GB in Parquet data.

The Orders table (90 GB, 750 million rows) took a little over 42 minutes to process, shrunk to 42 GB and averaged at 291.149 rows per second.

The really big one, LineItem, didn’t time out. It finished in 3 hours and 40 minutes, shrinking the data from 407 to 187 GB and averaging 226.588 rows per second.

The different copy flows had 20 used DIUs and 5 used parallel copies. They had 50 peak connections at the source, 5 at the sink. A huge increase compared to the CSV to SQL load, even though the flows had the same settings as the CSV to SQL ones. This is just Azure Data Factory having a better time copying data from one folder to another in a better(?) file format.

Let’s see this in a graph, compared to CSV to SQL:

Rows per second processed, comparison between CSV to SQL and CSV to Parquet

Parquet to SQL

Now to get the data from Parquet to SQL. As I found the standard settings for the data flows insufficient, I changed some settings to increase performance. Raising the DIU and copy parallelism settings greatly increases performance, but it also increases cost. As I found out when my subscription locked again.

The smallest table (Suppliers, 823 MB read) took a little under four minutes to load. It grew to almost 1,5 GB in size and when the 5 million rows were loaded, the performance was about 22.422 records per second.

Suppliers settings

One of the medium sized tables (Part) with 100 million rows (9 GB read) took 41 minutes to load and had an average speed of 39.952 rows per second. This one grew to 21 GB in SQL data. The part copy flow has the same settings as Suppliers.

The Orders table (81 GB read, 750 million rows) took about 2,5 hours to load, grew to 136 GB and averaged at 86.846 rows per second.

Orders settings

The really big one, LineItem, finished as well. With 375 GB read, it transferred the 3 billion rows in a little under 7,5 hours. The data grew to 532 GB in the SQL DB but the process managed 111.704 rows per second.

LineItems settings

For this last process, Suppliers and Part were running with a max DIU of 8, Orders had 16 and LineItem 32. That is a lot of computing power. The copy parallelism was set to 8 across all tables. This meant that peak connections to the target went up to 8 for most tables; LineItem got 16 to write to the SQL DB. Let’s add the graphs to the previous one and see the differences.

Rows per second over the three loading strategies

When you compare just the graphs, it seems that the best way forward is CSV to Parquet. Thing is, most people don’t speak parquet but speak SQL or Python. So you need to read the data again in either Synapse Studio, Synapse Serverless Database or even a dedicated SQL Pool to make the most out of the data.


As I started with, performance comes at a price.

Data Factory costs

You can see quite clearly that between February 26th and 27th, I spent more money on my data factory. Increasing the DIUs makes a huge difference.
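The relation between DIUs and cost is easy to estimate, since copy activities are billed in DIU-hours. A back-of-the-envelope sketch; the $0.25/DIU-hour rate is an assumption for illustration only, so check the current Azure Data Factory pricing page for your region:

```python
# Back-of-the-envelope ADF copy-activity cost, billed in DIU-hours.
# The rate below is an assumed example, not the actual price.
def copy_cost(dius, duration_hours, rate_per_diu_hour=0.25):
    return dius * duration_hours * rate_per_diu_hour

# LineItem, Parquet to SQL: 32 DIUs for roughly 7.46 hours (07:27:37)
lineitem = copy_cost(32, 7.46)
# The same duration at the default 4 DIUs would cost an eighth of that
baseline = copy_cost(4, 7.46)
```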

Azure SQL Database costs

My database costs went through the roof as well, no wonder with a Hyperscale on 24/7.

Azure storage costs

The costs of storing a little over 500 GB of data on storage is almost nothing compared to SQL data storage.

Resource usage

Now, how hard are my resources working? It’s ok that they cost money, but I’m expecting a lot in return. The data factory has shut down so hard, none of the graphs are returning any data, so let’s look at the Azure SQL database.

Loading CSV into the database

When the loading process starts, there’s a peak with 13% CPU (about 2 cores) and 58% Log IO. After the initial peak, the usage quickly drops down to almost a trickle. This concurs with the performance seen at the ADF side of things. It was calmly strolling along, reading and writing the data whilst taking in the spring smells and sun.

Loading Parquet into the database

In general, the graph looks more or less the same. But the CPU reaches almost 50% at its peak, so almost 8 out of the available 16 cores were used. The log IO almost hit the max but quickly decreased. What doesn’t cease to amaze me is that the data IO remains at 0%. As the data is being processed by the distributed compute behind the Hyperscale front end, I can only imagine that the default monitoring queries from Azure can’t read that part. Hopefully this will change in the near future, as it would be a very welcome addition to the other data when analysing tier and SKU settings.


To finish off this blog, I’ll leave you with three tables of data collected while the Data Factory processes were running. One of the things worth paying attention to is the amount of data read versus data written. It’s noteworthy that when reading parquet, more data is read than was written: the data needs to be decompressed, and that overhead shows up in the data read.


| | Supplier | Customer | Part | PartSupp | Orders | LineItem |
|---|---|---|---|---|---|---|
| Data Read | 723 MB | 12,438 GB | 12,411 GB | 61,979 GB | 90,77 GB | 136,155 GB |
| Files Read | 5 | 5 | 5 | 5 | 5 | 2 |
| Rows Read | 5.000.000 | 75.000.000 | 100.000.000 | 400.000.000 | 750.000.000 | 1.007.478.970 |
| Peak Connections (source) | 50 | 20 | 20 | 20 | 20 | 20 |
| Data Written | 1,305 GB | 22,425 GB | 21,285 GB | 111,6 GB | 136,349 GB | 178,885 GB |
| Rows Written | 5.000.000 | 75.000.000 | 100.000.000 | 400.000.000 | 750.000.000 | 1.007.405.315 |
| Peak Connections (sink) | 8 | 2 | 2 | 2 | 2 | 2 |
| Copy Duration | 00:02:47 | 00:45:17 | 00:52:18 | 03:34:44 | 06:10:50 | 12:00:00 |
| Throughput | 4,468 MB/s | 4,588 MB/s | 3,961 MB/s | 4,812 MB/s | 4,081 MB/s | 3,152 MB/s |
| Used DIUs | 8 | 4 | 4 | 4 | 4 | 4 |
| Used Parallel copies | 8 | 2 | 2 | 2 | 2 | 2 |
| Reading from source | 00:00:09 | 00:18:11 | 00:16:06 | 01:57:26 | 02:11:24 | 04:15:00 |
| Writing to sink | 00:01:21 | 00:32:19 | 00:28:53 | 02:22:18 | 04:06:42 | 05:51:55 |
| Rows per hour | 107.784.431 | 99.374.310 | 114.722.753 | 111.766.532 | 121.348.315 | 83.956.581 |
| Rows per minute | 1.796.407 | 1.656.238 | 1.912.046 | 1.862.776 | 2.022.472 | 1.399.276 |
| Rows per second | 29.940 | 27.604 | 31.867 | 31.046 | 33.708 | 23.321 |
Measurements from CSV into SQL

CSV to Parquet

| | Supplier | Customer | Part | PartSupp | Orders | LineItem |
|---|---|---|---|---|---|---|
| Data Read | 723 MB | 12,438 GB | 12,411 GB | 61,979 GB | 90,77 GB | 407,322 GB |
| Files Read | 5 | 5 | 5 | 5 | 5 | 5 |
| Rows Read | 5.000.000 | 75.000.000 | 100.000.000 | 400.000.000 | 750.000.000 | 3.000.028.242 |
| Peak Connections (source) | 50 | 50 | 50 | 50 | 50 | 50 |
| Data Written | 411,659 MB | 6,592 GB | 4,528 GB | 23,018 GB | 42,034 GB | 187,263 GB |
| Files Written | 5 | 75 | 100 | 400 | 750 | 3003 |
| Rows Written | 5.000.000 | 75.000.000 | 100.000.000 | 400.000.000 | 750.000.000 | 3.000.028.242 |
| Peak Connections (sink) | 5 | 5 | 5 | 5 | 5 | 5 |
| Copy Duration | 00:00:57 | 00:05:04 | 00:05:09 | 00:20:25 | 00:42:56 | 03:40:40 |
| Throughput | 13,919 MB/s | 41,738 MB/s | 40,825 MB/s | 50,844 MB/s | 35,319 MB/s | 30,776 MB/s |
| Used DIUs | 8 | 20 | 20 | 20 | 20 | 20 |
| Used Parallel copies | 5 | 5 | 5 | 5 | 5 | 5 |
| Reading from source | 00:00:06 | 00:01:55 | 00:01:33 | 00:11:23 | 00:20:46 | 01:35:24 |
| Writing to sink | 00:00:07 | 00:02:08 | 00:01:02 | 00:06:32 | 00:13:47 | 00:56:53 |
| Rows per hour | 315.789.474 | 888.157.895 | 1.165.048.544 | 1.175.510.204 | 1.048.136.646 | 815.717.649 |
| Rows per minute | 5.263.158 | 14.802.632 | 19.417.476 | 19.591.837 | 17.468.944 | 13.595.294 |
| Rows per second | 87.719 | 246.711 | 323.625 | 326.531 | 291.149 | 226.588 |
Measurements from CSV into Parquet

Parquet into SQL

| | Supplier | Customer | Part | PartSupp | Orders | LineItem |
|---|---|---|---|---|---|---|
| Data Read | 823,328 MB | 13,185 GB | 9,056 GB | 46,037 GB | 81,075 GB | 374,537 GB |
| Files Read | 5 | 75 | 100 | 400 | 750 | 3003 |
| Rows Read | 5.000.000 | 75.000.000 | 100.000.000 | 400.000.000 | 750.000.000 | 3.000.028.242 |
| Peak Connections (source) | 5 | 4 | 8 | 8 | 8 | 8 |
| Data Written | 1,305 GB | 22,425 GB | 21,285 GB | 111,6 GB | 136,349 GB | 532,717 GB |
| Rows Written | 5.000.000 | 75.000.000 | 100.000.000 | 400.000.000 | 750.000.000 | 3.000.028.242 |
| Peak Connections (sink) | 8 | 4 | 8 | 8 | 8 | 16 |
| Copy Duration | 00:03:43 | 00:21:03 | 00:41:43 | 02:42:05 | 02:23:56 | 07:27:37 |
| Throughput | 3,794 MB/s | 10,481 MB/s | 3,625 MB/s | 4,737 MB/s | 9,742 MB/s | 13,948 MB/s |
| Used DIUs | 4 | 8 | 8 | 8 | 16 | 32 |
| Used Parallel copies | 8 | 4 | 8 | 8 | 8 | 8 |
| Reading from source | 00:00:11 | 00:05:41 | 00:07:54 | 00:49:50 | 00:16:00 | 00:16:03 |
| Writing to sink | 00:02:15 | 00:18:56 | 00:21:30 | 01:11:41 | 02:07:44 | 07:23:18 |
| Rows per hour | 80.717.489 | 213.776.722 | 143.827.407 | 148.071.979 | 312.644.743 | 402.133.584 |
| Rows per minute | 1.345.291 | 3.562.945 | 2.397.123 | 2.467.866 | 5.210.746 | 6.702.226 |
| Rows per second | 22.422 | 59.382 | 39.952 | 41.131 | 86.846 | 111.704 |
Measurements from Parquet into SQL


Thank you so much for reading and many thanks to Henk van der Valk (T | B | L) for pointing me to the TPC-H data for more intense performance testing!
