Mind your collation!

Today, I got a really nice question from one of my colleagues who ran into an issue. It took some time getting to the bottom of it, and it’s just to fun and nerdy not to share. Thanks again Wouter for the question!

Issue

The issue is quite simple. A text needs to be converted into a SHA2_256 hash for some authentication reasons. The example shown here is simplified. The thing is, the outcome of the hash isn’t accepted by the authorising party and when the input is checked via an online MD5 hashing site, there’s a difference between that output and that from the SQL Script.

First analysis.

Here’s the basic script we started out with.

declare @Unhashed_Secret nvarchar(200)

set @Unhashed_Secret = NEWID()

select
@Unhashed_Secret    as Unhashed_secret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_Secret),2))as Hashed_Secret

This one doesn’t provide the correct output. My first guess was to see what happens when we change the nvarchar to varchar.

declare @Unhashed_Secret varchar(200)
, @Unhashed_NSecret nvarchar(200);

set @Unhashed_Secret = NEWID()
set @Unhashed_NSecret = @Unhashed_Secret

select
@Unhashed_Secret    as Unhashed_secret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_Secret),2))as Hashed_Secret,
@Unhashed_NSecret    as Unhashed_Nsecret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_NSecret),2))as Hashed_NSecret

The output:

varchar
nvarchar

Not really that difficult to see that there are differences there. The top one (varchar) provided the correct output, the bottom one (nvarchar) was faulty. But why?

Digging in

Let’s see what’s happening with the length of the data and the datalength. The former is the LEN function I’m using a lot, the DATALENGTH() function was a bit new to me, but this shows the length of the physical data, which is something else than the data you see on your screen.

select
LEN(@Unhashed_Secret) as Unhashed_Secret_Length,
DATALENGTH(@Unhashed_Secret) as Hashed_Secret_Data_Length,
LEN(@Unhashed_NSecret) as Unhashed_NSecret_Length,
DATALENGTH(@Unhashed_NSecret) as Hashed_NSecret_Data_Length,
@Unhashed_Secret    as Unhashed_secret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_Secret),2))as Hashed_Secret,
@Unhashed_NSecret    as Unhashed_Nsecret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_NSecret),2))as Hashed_NSecret

When I ran this query, the first four columns showed this:

varchar

36 characters long, that’s what I’m expecting. And the data length is 36 as well. And the Nvarchar?

nvarchar

Ha! There we go. It’s got double the length! Check the SQL Server definition here. So this makes a difference, but what’s the difference? Let’s convert the GUID into a binary column to see more of the ‘background’ info.

select
LEN(@Unhashed_Secret) as Unhashed_Secret_Length,
DATALENGTH(@Unhashed_Secret) as Hashed_Secret_Length,
convert(binary(16), cast(@Unhashed_Secret as varchar(30))) as Convertedsecret,
LEN(@Unhashed_NSecret) as Unhashed_NSecret_Length,
DATALENGTH(@Unhashed_NSecret) as Hashed_NSecret_Length,
convert(binary(16), cast(@Unhashed_NSecret as nvarchar(30))) as ConvertedNsecret,
@Unhashed_Secret    as Unhashed_secret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_Secret),2))as Hashed_Secret,
@Unhashed_NSecret    as Unhashed_Nsecret,
lower(convert(varchar(32),hashbytes('SHA2_256',@Unhashed_NSecret),2))as Hashed_NSecret
varchar

The converted Secret (from varchar to binary) shows the binary value of the text.

nvarchar

The converted Nsecret is the same, but after every 2 characters in this output, there are two zeroes. These are the collation placeholders. When you look at the data itself in SSMS or ADS, there’s no difference. But in the background, there’s a huge difference! Not only in the number of characters but also in the value that’s being used for the conversion.

Finally

As mentioned at the start of this blog, this is a simplified example but it shows the impact a wrong data type can have. Whenever you’re faced with hashing columns, make sure you and the decrypting side are using the same data types. And whenever you’re designing tables, please think twice before choosing the Nvarchar datatype, as it will take up a lot of extra space!

One thought on “Mind your collation!

Leave a comment