Some time ago, I wrote a number of blog posts comparing the different Azure SQL options to give you an idea about performance, the differences between tiers and the differences between the Stock Keeping Units (SKUs). This was done by creating data in the database itself and reviewing the metrics. That worked fine and gave a good overview of the different tiers and SKUs. For reference, you can find those blogs here.
For the new series, I’ve thought of a new process that aligns more with my regular line of work: data warehousing. This means ingesting a lot of data and modelling it.
In this first blog I’ll take you through the process and scripts I’ve used and the different database tiers and SKUs. As usual, the numbers you’ll see in these blogs shouldn’t be taken as absolutes but compared with each other to find the best fit.
Disclaimer
This series is being written while I’m also working on a series about the Microsoft DP-700 certification exam. Testing the databases and working through the results takes time. Unlike the previous series, where I published on an almost weekly basis, this one won’t appear as frequently; I’m aiming for a post every two to three weeks.
Databases
I’ve decided to test most databases. The Basic and Free tiers have been skipped as they’re unable to hold all the data. Then there are options like memory optimised, compute optimised and confidential computing. I might add these later on if there are significant differences from the standard series, but for now I’ve decided not to test them.
I’ve tested the Standard and Premium DTU (Database Transaction Unit) tiers, each in two SKUs. I’ve then moved on to the vCore (processor) based databases: General Purpose, Hyperscale and Business Critical. Each of these is tested at the 4, 8 and 16 core SKUs, as these are the ones I’ve seen the most at customers. Finally, I’ve included the Azure SQL Managed Instance, again with 4, 8 and 16 cores.
Some databases, like the Managed Instance and Hyperscale, have a Premium option (which indicates faster hardware). I’ve included this as well.
To give you an indication of all the choices, here’s an overview. Don’t hold me to it when you’re reading this, as Microsoft can add or remove SKUs or even tiers before I’ve had the chance to update this blog. Feel free to comment if you think one of the numbers is wrong; you can also check the current list on your own server with the snippet below the table.
| Tier Name | Standard | Premium | Premium (Memory Optimized) | Compute Optimized | Confidential Computing |
|---|---|---|---|---|---|
| Basic | 1 | | | | |
| Standard | 9 | | | | |
| Premium | 6 | | | | |
| General Purpose (Provisioned) | 15 | 11 | 12 | | |
| General Purpose (Serverless) | 15 | | | | |
| Hyperscale (Provisioned) | 14 | 16 | 15 | 12 | |
| Hyperscale (Serverless) | 14 | | | | |
| Business Critical | 15 | 12 | | | |
| Managed Instance (General Purpose) | 8 | 8 | 6 | | |
| Managed Instance (Business Critical) | 7 | 15 | 15 | | |
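If you want to double-check the current list for yourself, you can ask a logical server which service objectives it offers. A minimal sketch, assuming the Az.Sql module is installed and using placeholder resource group and server names:
# List the service objectives (SKUs) the logical server currently offers
# "YourRG" and "YourServer" are placeholders for your own resource group and server
Get-AzSqlServerServiceObjective -ResourceGroupName "YourRG" -ServerName "YourServer" |
    Select-Object Edition, ServiceObjectiveName |
    Sort-Object Edition, ServiceObjectiveName
Note that this only covers a logical server; Managed Instance SKUs are not listed this way.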
In total, there are 226 options to choose from, not including the choices in data size or number of replicas. With all these options, the first question I always get is which one to choose. That question is something this blog series tries to help with. It won’t answer it, because every workload is different and requirements will differ. But in the end, what everyone wants is a very fast database for as little money as possible. Little spoiler alert: performance doesn’t come cheap. So what this blog aims to do is provide you with some insight into the performance differences and some inspiration to mimic your own workload and find the sweet spot.
Fabric SQL
There’s a special kid on the block nowadays: the Azure SQL database in Microsoft Fabric. I’ve decided not to feature that specific database in this series, as it has a different implementation and is, at the time of writing, still in public preview.
Ingesting
The data that will be ingested is a set of TPC-H CSV files. For those who follow my excursions in Fabric, the dataset will be familiar, as these are the same files: about 500 GB of raw text. To ingest these files, I’m using dbatools. If you want to learn more about this set of PowerShell cmdlets, click here! This tooling can read the files and write the data into my tables. The Import Data wizard fails on this process for the simple reason that it tries to load it all into memory, and my machines lack that amount of memory. This is why I switched to an Azure VM where I can add memory freely. I also added a very fast disk to make sure my local disk won’t be a bottleneck.
The VM: Standard D16lds v5 (16 vCPUs, 32 GiB memory)
Extra data disk: Premium SSD LRS, 700 GB, 5,000 IOPS, 200 MBps max throughput.
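Before the full script below, here’s a minimal sketch of what a single-file load with dbatools looks like. The server name, credential and file are placeholders, and it assumes the target table already exists:
$sqlCred = Get-Credential DbaToolsLogin
$server = Connect-DbaInstance -SqlInstance YourInstance.database.windows.net -SqlCredential $sqlCred
# Bulk load one pipe-delimited TPC-H file into an existing table
Import-DbaCsv -SqlInstance $server -Database sqldbmvpperf `
    -Path D:\TPCH\nation.tbl `
    -Delimiter '|' `
    -Table nation -Schema dbo -NoHeaderRow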
The tables are created from a script without indexes. I firmly believe that data ingestion is fastest when the table is a heap: just dump it all in, we’ll sort it later (pun intended). This only goes for full loads; if you’re running incremental loads, things are different, because you need to find the highest ID, latest date or whatever watermark you’re using. Doing that on a heap is ‘less efficient’ than on a properly indexed table, as the sketch below shows.
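To illustrate that incremental-load point, a hedged sketch (the index name is made up, and it reuses the orders table and the $Server connection from the scripts below): with a nonclustered index on the watermark column, finding the latest loaded value is an index seek instead of a scan over the whole heap.
# Hypothetical watermark index; not part of the full-load tests in this series
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE NONCLUSTERED INDEX IX_orders_orderdate ON dbo.orders (o_orderdate);"
# With the index in place, this lookup only touches the tail end of the index
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "SELECT MAX(o_orderdate) AS LastLoadedDate FROM dbo.orders;"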
Due to the size of my data, I created a database with 1 TB of reserved space. Not every SKU can handle that, which limited my choices at times.
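For reference, creating a database with that much reserved space looks roughly like this; a sketch with placeholder resource group and server names, and the edition and SKU change per test:
# Create the test database with 1 TB of reserved space (1 TB = 1099511627776 bytes)
New-AzSqlDatabase -ResourceGroupName "YourRG" `
    -ServerName "YourServer" `
    -DatabaseName "sqldbmvpperf" `
    -Edition "GeneralPurpose" `
    -RequestedServiceObjectiveName "GP_Gen5_8" `
    -MaxSizeBytes 1099511627776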
Techniques
To measure performance, I’m using two different techniques. The first one is serial ingestion of the different files I created. The code for this part is quite simple: create an array of SKUs, work through the array and write the data.
# Code setup
# - Connect to Azure Sql Instance
# - Set the sku of the Azure Sql Instance
# - Azure SQL VM
# - Standard S2
# - Standard S7
# - Premium P2
# - Premium P6
# - General Purpose Gen5 4 vCores
# - General Purpose Gen5 8 vCores
# - General Purpose Gen5 16 vCores
# - Hyperscale Gen5 4 vCores
# - Hyperscale Gen5 8 vCores
# - Hyperscale Gen5 16 vCores
# - Hyperscale Premium Gen5 4 vCores
# - Hyperscale Premium Gen5 8 vCores
# - Hyperscale Premium Gen5 16 vCores
# - Business Critical Gen5 4 vCores
# - Business Critical Gen5 8 vCores
# - Business Critical Gen5 16 vCores
# - Managed Instance General Purpose Gen5 4 vCores
# - Managed Instance General Purpose Gen5 8 vCores
# - Managed Instance General Purpose Gen5 16 vCores
# - Create a new table
# - Insert data into the table
# - Get the insert performance from the DBA Tools module
# - Drop the table
# - Change the sku of the Azure Sql Instance
# - Repeat, except for MI, that's manual labour
Clear-Host
Connect-AzAccount -Tenant 'YourTenant' -SubscriptionId 'YourSub'
Get-AzContext
#Connect to the Azure Sql Instance
$sqlCred = Get-Credential DbaToolsLogin
$server = Connect-DbaInstance -SqlInstance YourInstance.database.windows.net -SqlCredential $sqlCred
$skuArray = @("S2", "S7", "P2", "P6", "GP_Gen5_4", "GP_Gen5_8", "GP_Gen5_16",
"HS_Gen5_4", "HS_Gen5_8", "HS_Gen5_16", "HS_PRMS_4", "HS_PRMS_8", "HS_PRMS_16",
"BC_Gen5_4", "BC_Gen5_8", "BC_Gen5_16 vCores")
foreach ($sku in $skuArray){
IF($sku -eq "S2" -or $sku -eq "S7"){
$edition = "Standard"
}
IF($sku -eq "P2" -or $sku -eq "P6"){
$edition = "Premium"
}
IF($sku -eq "GP_Gen5_4" -or $sku -eq "GP_Gen5_8" -or $sku -eq "GP_Gen5_16"){
$edition = "GeneralPurpose"
}
IF ($sku -eq "HS_Gen5_4" -or $sku -eq "HS_Gen5_8" -or $sku -eq "HS_Gen5_16" -or $sku -eq "HS_PRMS_4" -or $sku -eq "HS_PRMS_8" -or $sku -eq "HS_PRMS_16"){
$edition = "Hyperscale"
}
IF ($sku -eq "BC_Gen5_4" -or $sku -eq "BC_Gen5_8" -or $sku -eq "BC_Gen5_16"){
$edition = "BusinessCritical"
}
Write-Host "Setting sku to $sku with Edition $edition" -ForegroundColor Green
Set-AzSqlDatabase -ResourceGroupName "YourRG" `
-DatabaseName "YourDB" `
-ServerName "YourServer" `
-Edition $edition `
-RequestedServiceObjectiveName $sku
Write-Host "Dropping old tables" -ForegroundColor Yellow
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS nation;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS region;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS supplier;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS part;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS customer;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS partsupp;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS orders;"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "DROP TABLE IF EXISTS lineitem;"
Write-Host "Creating a new tables" -ForegroundColor Green
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE nation ( n_nationkey INT, n_name NVARCHAR(25), n_regionkey INT, n_comment NVARCHAR(152), n_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE region ( r_regionkey INT, r_name NVARCHAR(25), r_comment NVARCHAR(152), r_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE supplier ( s_suppkey INT, s_name NVARCHAR(25), s_address NVARCHAR(40), s_nationkey INT, s_phone NVARCHAR(15), s_acctbal DECIMAL(15,2), s_comment NVARCHAR(101), s_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE part ( p_partkey INT, p_name NVARCHAR(55), p_mfgr NVARCHAR(25), p_brand NVARCHAR(10), p_type NVARCHAR(25), p_size INT, p_container NVARCHAR(10), p_retailprice DECIMAL(15,2), p_comment NVARCHAR(23), p_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE customer ( c_custkey INT, c_name NVARCHAR(25), c_address NVARCHAR(40), c_nationkey INT, c_phone NVARCHAR(15), c_acctbal DECIMAL(15,2), c_mktsegment NVARCHAR(10), c_comment NVARCHAR(117), c_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE partsupp ( ps_partkey INT, ps_suppkey INT, ps_availqty INT, ps_supplycost DECIMAL(15,2), ps_comment NVARCHAR(199), ps_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE orders ( o_orderkey INT, o_custkey INT, o_orderstatus NVARCHAR(1), o_totalprice DECIMAL(15,2), o_orderdate DATE, o_orderpriority NVARCHAR(15), o_clerk NVARCHAR(15), o_shippriority INT, o_comment NVARCHAR(79), o_empty varchar(1) NULL );"
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "CREATE TABLE lineitem ( l_orderkey INT, l_partkey INT, l_suppkey INT, l_linenumber INT, l_quantity DECIMAL(15,2), l_extendedprice DECIMAL(15,2), l_discount DECIMAL(15,2), l_tax DECIMAL(15,2), l_returnflag NVARCHAR(1), l_linestatus NVARCHAR(1), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE, l_shipinstruct NVARCHAR(25), l_shipmode NVARCHAR(10), l_comment NVARCHAR(44), l_empty varchar(1) NULL );"
Write-Host "Inserting data into the table" -ForegroundColor Green
#Create the logfile
$source = "YourFabulousComputer"
Out-File .\$edition$sku$source.txt -NoClobber
$fileNames = @("nation.tbl", "region.tbl", "supplier.tbl.1", "part.tbl.1", "customer.tbl.1", "partsupp.tbl.1", "orders.tbl.1", "lineitem.tbl.1")
foreach ($fileName in $fileNames){
Write-Host "Load $fileName" -ForegroundColor Green
$PeriodPos = $fileName.IndexOf(".")
$tableName = $fileName.Substring(0, $PeriodPos)
Import-DbaCsv -SqlInstance $Server -Database sqldbmvpperf `
-Path D:\TPCH\$fileName `
-Delimiter '|' `
-Table $tableName `
-Schema dbo `
-NoHeaderRow | Out-File .\$edition$sku$source.txt -Append
}
}
To really hammer the database, I’m also loading the files in parallel. You can do this in PowerShell 5 with code like this:
workflow Test-Workflow
{
Parallel
{
# Commands here; each command in this block runs in parallel
}
}
This will run the commands next to each other and hit your database hard, which is exactly what I’m after to see what it can do.
If you’re using PowerShell 7 (like me), you can create your code like this.
$Array = @(
    "File1",
    "File2",
    "File3"
)
$Array | ForEach-Object -Parallel {
DoWorkHere $_
}
First I’m creating an array. I’m piping this array into ForEach-Object with the -Parallel switch, which means it will start running the commands in parallel. The $_ passes the values from the array to the command. This sounds very simple, but I had some struggles getting it to run. Luckily, Claudio Silva came to the rescue and helped me fix the code. I learned a lot in the interactions with him, and it saved me at least two days of frustration and debugging to find out where things went wrong. The main struggle was reusing the login to the database; this was fixed by using the $using: scope modifier.
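To make the scoping issue concrete, a toy sketch (the $greeting variable is just for illustration): inside the parallel script block, variables from the calling scope are only reachable through $using:.
$greeting = "Hello from the parent scope"
1..3 | ForEach-Object -Parallel {
    # $greeting by itself is not visible in this runspace; $using:greeting is
    Write-Output ($using:greeting + " (iteration " + $_ + ")")
}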
As the code is repetitive for different files, I’ll show you the main part of the code.
$DatabaseTier = 'BusCrit'
$DatabaseSKU = '16Cores'
#Create the logfile
Out-File .\$DatabaseTier$DatabaseSKU.txt -NoClobber
Write-Host "Connect to Azure Sql DB" -ForegroundColor Green | Out-File .\$DatabaseTier$DatabaseSKU.txt -Append
#Connect to the Azure Sql Instance
$sqlcred = Get-Credential DbaToolsLogin
$Server = Connect-DbaInstance -SqlInstance sqlmvplearn.database.windows.net -SqlCredential $sqlcred
Write-Host "Clear DB" -ForegroundColor Green
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -File '.\tpc-h tables.sql'
Write-Host "Parallel Customers" -ForegroundColor Green
$cust = @(
"E:\tpch\files\customer.tbl.1",
"E:\tpch\files\customer.tbl.2",
"E:\tpch\files\customer.tbl.3",
"E:\tpch\files\customer.tbl.4",
"E:\tpch\files\customer.tbl.5")
$cust | foreach-object -parallel {
Import-DbaCsv -SqlInstance $using:Server -Database sqldbmvpperf -Path $_ -Delimiter '|' -Table customer -Schema dbo -NoHeaderRow #| Out-File .\$DatabaseTier$DatabaseSKU.txt -Append
}
After the login, I’m creating the array. This array is piped into ForEach-Object, where the SqlInstance parameter is populated by $using:Server; this is the $Server object created earlier. The Path parameter is the location of the file and is populated from $_, which represents each element of the $cust array.
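One more thing worth knowing: ForEach-Object -Parallel runs at most five script blocks at the same time by default. If you want to hit the database even harder, you can raise that with -ThrottleLimit; the value below is just an example.
$cust | ForEach-Object -Parallel {
    Import-DbaCsv -SqlInstance $using:Server -Database sqldbmvpperf -Path $_ -Delimiter '|' -Table customer -Schema dbo -NoHeaderRow
} -ThrottleLimit 8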
Cleansing and modelling
My personal preference is to drag data around as little as possible, so in this case I’m not using a staging, silver or cleansing layer. I’m going from the raw data straight to the star schema. And to get the data in, I’m using the MERGE statement. I know a lot of people have opinions about that, but for this use case, it works.
The code looks like this:
create or alter procedure sp_loadNations
as
BEGIN
SET NOCOUNT ON;
MERGE dim.nations AS Target
USING (
SELECT DISTINCT
n_nationkey as NationID,
n_name as NationName,
n_comment as NationComment
FROM dbo.nation) AS Source
ON Source.NationID = Target.NationID
WHEN MATCHED
AND (Source.NationName != Target.NationName
OR Source.NationComment != Target.NationComment)
THEN UPDATE
SET
Target.NationName = Source.NationName,
Target.NationComment = Source.NationComment,
Target.ChangeDate = SYSDATETIME()
WHEN NOT MATCHED BY SOURCE THEN UPDATE
SET Target.DeleteDate = SYSDATETIME()
WHEN NOT MATCHED BY TARGET
THEN
INSERT(NationID, NationName, NationComment, IsActual, CreateDate, ChangeDate, DeleteDate)
VALUES (Source.NationID, Source.NationName, Source.NationComment, 1, SYSDATETIME(), SYSDATETIME(), '2099-01-01');
END
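Loading the dimension is then just a stored procedure call, which slots into the same PowerShell tooling used for ingestion. For example, a sketch reusing the $Server connection from the scripts above:
Invoke-DbaQuery -SqlInstance $Server -Database sqldbmvpperf -Query "EXEC dbo.sp_loadNations;"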
Concluding
In short, this is the setup I’ve been using to see how much these databases can put up with. Do not take the numbers in the following blogs as absolutes, but compare them with each other. Whenever you’re using an external tool to ingest data, performance will be different; take into account that every tool adds some overhead. The results of the tests will follow in separate blog posts.