Allegedly, Acuity had a data breach. That's the context that accompanied a massive trove of data that was sent to me 2 years ago now. I looked into it, tried to attribute and verify it, then put it in the "too hard basket" and moved on to more pressing issues. It was only this week as I desperately tried to make some space to process yet more data that I realised why I was short on space in the first place:
Ah, yeah - Acuity - that big blue 437GB blob. What follows is the process I went through trying to work out what on earth this thing is, the confusion surrounding the data, the shady characters dealing with it and ultimately, how it's now searchable in Have I Been Pwned (HIBP), which may be what brought you to this blog post in the first place.
One of the first things I do after receiving a data breach is to literally just Google it: acuity data breach. Which immediately yielded this top result from June:
Ah, so Acuity is a healthcare company. But wait - here's the next result:
That's not about healthcare, that's Acuity Brands. How many companies called "Acuity" that have been breached are there?! Let's see what references I have in my email:
Another one 🤦‍♂️ That "breach" could be circumstantial, so we'll call it a "maybe", but it's yet another Acuity with a question mark next to it. So how many "Acuity" companies are out there in total?! Just in the course of investigating this data, I came across a total of 6 of them that, as far as I can tell, are completely unrelated:
Ugh, great. We'll work through them and try to figure out where they fit into the picture in a moment, but first let's look at the actual data. We already know it's 437GB, but it's the breadth of column headings that's most stunning; here's all 414 of them:
Just by eyeballing these, it really doesn't feel like the sort of data that comes from a healthcare provider, a brands company or a scheduler. The other 3, however... Maybe.
Some more data points before going further:
On that final point, here's an example of what I'm talking about:
The last names are the same, as are the salutations. The physical addresses are spot on accurate in their structure as are the phone numbers; there are no spaces, no dashes and no other artifacts typical of millions of different humans entering data. This is clean - too clean.
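If you wanted to put a number on "too clean", a rough sketch like the one below would do it. It assumes the phone values have already been pulled into a list of strings, and `clean_phone_ratio` is a name made up purely for illustration. Human-entered data would normally score well below 1.0 here.

```python
import re

# Assumption: "clean" means the value is digits only, with no spaces,
# dashes, brackets or other artifacts of free-text entry.
DIGITS_ONLY = re.compile(r"\d+")

def clean_phone_ratio(phones):
    """Return the fraction of non-empty phone values that contain only digits."""
    values = [p for p in phones if p]
    if not values:
        return 0.0
    clean = sum(1 for p in values if DIGITS_ONLY.fullmatch(p))
    return clean / len(values)

# Made-up sample values, not taken from the data set
sample = ["5551234567", "555-123-4567", "(555) 123 4567", "5559876543"]
ratio = clean_phone_ratio(sample)
print(f"{ratio:.0%} of phone values are digits only")
```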
The "datasource" field is another interesting data point with the top 10 values being:
Each of these entries appeared at least hundreds of thousands of times, if not millions. Does that mean that Netflix, for example, provided customer data to this list? Almost certainly no, but it does feel reminiscent of the Acxiom / Live Ramp misattribution post I wrote a year ago where I listed full counts of a similar column. One of the top values there was also "TAGGED.COM" (also all in uppercase), alongside several other values that also appeared in both sources.
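For anyone wanting to run the same sort of tally on their own data, here's a minimal sketch of counting the most common values in a column. The "datasource" column name is from this data, but everything else is an assumption: that the rows arrive as CSV with a header line, and the `top_datasources` helper and sample rows are invented for the example.

```python
import csv
import io
from collections import Counter

def top_datasources(rows, n=10):
    """Count values in the 'datasource' column and return the n most common."""
    counts = Counter(
        row["datasource"].strip()
        for row in rows
        if row.get("datasource")  # skip rows where the field is empty or missing
    )
    return counts.most_common(n)

# Made-up sample standing in for the real file
sample_csv = io.StringIO(
    "email,datasource\n"
    "a@example.com,TAGGED.COM\n"
    "b@example.com,TAGGED.COM\n"
    "c@example.com,EXAMPLE-SOURCE\n"
    "d@example.com,\n"
)
top = top_datasources(csv.DictReader(sample_csv))
for value, count in top:
    print(f"{value}: {count}")
```

Because `csv.DictReader` streams one row at a time, the same function should cope with a very large file without loading it all into memory, as long as the number of distinct values stays manageable.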
Back to attribution and a post on a popular hacking forum jumps out:
Many things here line up, for example the column names that are very unique to this data source, including "estimatedincomecode", "del_point_check_digit" and "secondaryaddresspresent". The attribution is to the insurance company named "Acuity", but is that accurate? Insurance companies collect a lot of data as it's relevant to how they run their business, but that data is highly unlikely to include fields such as:
That's much more in the "data enrichment" space where a company sells a massive data set so that it can expand the profile data of the purchaser's existing customer base. It's a legitimate, honest, legal business model. It's also indistinguishable from this:
Hey, it's 437GB! And the column names line up! And it's called Acuity! Slightly different column count to mine (and similar but different to the hacker forum post), and slightly different email count, but the similarities remain striking. How I got to this resource is also interesting, having come via someone I was discussing the data with a couple of years ago:
The YouTube video is a walkthrough of a campaign management tool to send emails to customers. Could that indicate the data as coming from Acuity Ads (now Illumin)? No, not in and of itself; the walkthrough there isn't that dissimilar to other campaign tools I've used in the past. No matter how much I looked, I just couldn't find a solid lead back to Acuity Ads and anything even remotely related was merely circumstantial. It could be from them, but it could also be from many other places and the mere fact that a near-identical corpus of data was sitting there on an outright spam site only makes the whole mystery that much deeper. There was just one more interesting data point in that email:
i myself am in that dataset and i've been getting 100x more phishing/scam calls, emails, and physical mail
Let me end this with a best guess: this feels like the same situation as the massive Master Deeds incident in South Africa in 2017. In that case, a legally operating data aggregator (I think you know how I feel about those by now...) sold personal information to a real estate business who then left it publicly exposed. I say it feels the same because it's just such a clean set of data and it's clearly very comprehensive in terms of the columns. It's exactly what I'd expect a data aggregator to prepare and sell to other businesses so they could identify which of their existing customers like needlework.
In the past, publishing blog posts like this has helped identify an origin service and if that happens again here then I'll be sure to provide an update. For now, I've loaded it into HIBP and flagged it as a spam list, which means it won't impact the size of anyone's domains or bump them into a different subscription level. If you do have any interesting insights on this data, please leave a comment below and with any luck, one of the Acuity entities out there will emerge as the source.
Note: just after loading the data, I ran the calcs on how many of the addresses were pre-existing in HIBP. This seems like a statistically significant number 😲
So, 100% (just under actually, but it rounded up). Working through a bunch of sample addresses, they appeared across all sorts of other existing spam lists and dodgy data aggregator breaches. Who knows which ones came first, just more data in the big swimming pool of breaches. https://t.co/Ux2rw6uaAk
— Troy Hunt (@troyhunt) November 15, 2023
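The calc itself is nothing fancy. As a rough sketch only (this is not how HIBP does it internally, and `overlap_percentage` plus the sample addresses are invented for illustration), it boils down to a set intersection over normalised email addresses:

```python
def overlap_percentage(new_addresses, existing_addresses):
    """Percentage of distinct new addresses that already exist in another set."""
    # Assumption: addresses are compared case-insensitively after trimming
    new = {a.strip().lower() for a in new_addresses if a.strip()}
    if not new:
        return 0.0
    existing = {a.strip().lower() for a in existing_addresses}
    return 100 * len(new & existing) / len(new)

# Made-up sample addresses
new = ["a@example.com", "B@example.com", "c@example.com", "d@example.com"]
existing = ["a@example.com", "b@example.com", "c@example.com", "z@example.com"]
pct = overlap_percentage(new, existing)
print(f"{pct:.0f}% pre-existing")
```

Formatting to zero decimal places is also why a figure just under 100 can display as 100%: a value such as 99.6 rounds up.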