The climate cost of Snowflake

Author: Iris Meredith

Date published: 2024-08-25

I've recently been through a recruitment process for an NGO working in the climate space. Now, I've talked recruiting bias and such half to death on this blog, and while I'll no doubt continue to do so, I thought it might be interesting to touch on something different that came up in the interview: Snowflake and Databricks. Being myself, I was naturally a bit of a PostgreSQL fangirl in the interview, and while my interviewer didn't seem to disapprove, they did ask some pointed questions about data clouds like Snowflake and Databricks. I gave the honest answer: these tools are almost certainly overkill for the task at hand and are likely to waste significant amounts of money. This did not seem to go down all that well. There are, of course, many reasons why a person might be rejected from a role despite feedback lauding their technical experience, excellent communication skills and alignment on values (hell, maybe it wasn't even transphobia this time!), but some people in the know have indicated that going against the orthodoxy on data clouds may have been part of the reason.

Naturally, given that I'm currently driven mostly by spite and a bad attitude a mile wide, I wasn't very good at taking this lying down. It took me a moment to think of a good angle, but eventually some ideas came up...

Fermi calculations

Given the strong ideological attachment to Snowflake and Databricks, coupled with the relative lack of technical need for them, I thought it would be interesting to estimate the climate cost of using Snowflake as opposed to a different solution. Based on the data available, I've restricted my analysis to Snowflake running on AWS infrastructure in the Sydney region, and since storage isn't a huge contributor to total emissions, I've elected not to factor bulk storage into the estimates. Moreover, these are Fermi estimates: back-of-the-envelope numbers aiming to get a sense of the order of magnitude rather than a precise value. That said, let's begin!

A moderately-sized NGO in New Zealand might stabilise at around fifty employees. For a research/policy-focused organisation, between five and ten of these might be full-time technical staff, and another ten to fifteen might be policy staff or researchers who run SQL queries but aren't full-time technical staff. I've assumed eight full-time technical staff and twelve policy staff in this instance.

The next step is to estimate the total number of queries this team might emit over a year. This is difficult to estimate on many levels, but going on my own personal and professional work, I might emit between ten and thirty queries in an average day of coding, and might have three effective full-time days of coding a week. Policy analysts in the public service, in my experience, run far fewer: maybe three or four queries a week in total. Running with the midpoints of these numbers would put the technical staff alone at around 480 queries a week; to keep the estimate conservative, though, I'll assume the organisation as a whole emits just short of 200 queries a week, which, assuming 48 work weeks a year, gives a total of 9,600 queries a year.

If the NGO has externally facing applications, this number increases significantly. Let's assume the organisation maintains two or three online calculators, each of which emits one query every time it's used. At New Zealand scale, each of these calculators might need around a hundred hits a week to be useful as a policy/awareness tool. Assuming 2.5 online calculators running 52 weeks a year, that's another 13,000 queries, for a grand total of 22,600 queries emitted by the organisation in a year. Spread over the 8,766 hours in a year, that translates to approximately 2.6 queries an hour.
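
If you want to poke at these numbers yourself, the whole query-volume estimate fits in a few lines of Python. The constants are just the rough figures above, and the calculator count and hit rate in particular are pure guesswork:

    # Fermi estimate of annual query volume, using the rough figures above.
    WEEKLY_INTERNAL = 200            # conservative organisation-wide queries per week
    WORK_WEEKS = 48                  # staff take leave; assume 48 working weeks a year
    CALCULATORS = 2.5                # two or three public calculators, averaged
    HITS_PER_CALCULATOR_WEEK = 100   # hits each calculator needs to be worth running
    CALENDAR_WEEKS = 52              # the calculators, unlike the staff, never take leave
    HOURS_PER_YEAR = 8766

    internal = WEEKLY_INTERNAL * WORK_WEEKS                             # 9,600
    external = CALCULATORS * HITS_PER_CALCULATOR_WEEK * CALENDAR_WEEKS  # 13,000
    total = internal + external                                         # 22,600

    print(f"{total:,.0f} queries a year, or {total / HOURS_PER_YEAR:.1f} an hour")
    # -> 22,600 queries a year, or 2.6 an hour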

Now, Snowflake charges credits based on the size of the warehouse being used and the time it spends running, with a minimum charge of sixty seconds each time the warehouse resumes. At 22,600 queries a year, our hypothetical NGO is thus paying for a minimum of about 377 hours of compute per year. That minimum is... unlikely to be reached, though. With the default auto-suspend of ten minutes, and at only 2.6 queries an hour (so most queries wake the warehouse on their own), each query keeps the warehouse running for roughly ten minutes rather than one, which puts us at about ten times that figure, or roughly 3,770 hours.
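
As a quick sanity check on those two figures, assuming essentially every query triggers its own warehouse resume (plausible at 2.6 queries an hour):

    # Billed warehouse time per year under two regimes.
    QUERIES_PER_YEAR = 22_600

    minimum_hours = QUERIES_PER_YEAR * 60 / 3600    # 60-second minimum per resume: ~377 hours
    default_hours = QUERIES_PER_YEAR * 600 / 3600   # ten-minute default auto-suspend: ~3,767 hours

    print(f"{minimum_hours:,.0f} hours at the billing minimum, "
          f"{default_hours:,.0f} hours with default settings")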

Compute for Snowflake is a little difficult to track, given the virtualisation layer sitting on top of the base cloud it runs on. Conveniently for them, Snowflake doesn't actually quote the size of its instances on its website. We do, however, have pretty good estimates for the CO2 emissions of AWS EC2 instances, which you can find here. From those, we can get an estimate if we can figure out a rough equivalence between AWS instances and Snowflake warehouses. This website states that it's relatively well known that warehouses are built from underlying compute nodes, that on AWS each node is a c5d.2xlarge EC2 instance, and that the number of nodes doubles with each size increase in the virtual warehouse. Now, if our NGO were sensible, they'd just go with an extra-small warehouse, but who are we kidding? They aren't going to be sensible, and a medium warehouse is probably the smallest they're willing to accept, which means four nodes. This lets us calculate a rough estimate of our NGO's CO2 emissions.
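
If that doubling rule holds, the node counts per warehouse size look something like this; note that the c5d.2xlarge equivalence is the linked site's estimate, not an official Snowflake figure:

    # Nodes per warehouse size under the doubling rule described above; each node
    # is assumed to be roughly one c5d.2xlarge EC2 instance on AWS.
    SIZES = ["XS", "S", "M", "L", "XL"]
    nodes_per_warehouse = {size: 2 ** i for i, size in enumerate(SIZES)}
    print(nodes_per_warehouse)
    # -> {'XS': 1, 'S': 2, 'M': 4, 'L': 8, 'XL': 16}: a medium warehouse is four nodes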

Plugging the numbers into the calculator given here tells us that our hypothetical warehouse is generating, at a minimum, 83.4 kg of CO2-equivalent emissions a year. A more realistic estimate, scaling up to the roughly 3,770 warehouse-hours that default settings imply, is about a tonne of CO2 a year. This isn't an enormous amount of emissions (average per capita emissions in New Zealand are around six tonnes a year), but it's not exactly an amount of CO2 we can just ignore either, especially given that we're talking about a climate-focused NGO. Moreover, this is, quite bluntly, a pretty charitable estimate given the level of dysfunction we know exists in your average organisation.
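
For the curious, the "more realistic" figure is just the 377-hour minimum scaled up to the warehouse-hours implied by default settings:

    # Scale the calculator's 377-hour minimum up to the ~3,770 hours implied by
    # default auto-suspend settings; emissions scale with warehouse-hours.
    kg_at_minimum = 83.4
    kg_realistic = kg_at_minimum * 3_770 / 377    # ~834 kg, i.e. roughly a tonne a year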

Comparisons against other solutions

In the interests of comparing against an actual alternative, we can benchmark this against the equivalent estimates for a PostgreSQL instance sitting on an AWS virtual machine. It's important to note that most NGOs of this size will be working with data on the gigabyte scale, or a few terabytes at the upper limit. That is well within what PostgreSQL handles comfortably. Moreover, the queries aren't huge (we're usually pulling a few hundred or a few thousand rows at most), and by the estimates above we're only running two or three queries an hour. We can therefore probably get away with quite a small instance: a general-purpose m7g.medium might even be adequate for this workload. Running that for an entire year (8,766 hours) with very little optimisation would produce only 37 kg of CO2 equivalent. That's significantly less than even the minimum estimate above, and far less than the more realistic one. And this is a substitute that I'm more or less convinced would provide exactly the same value for an average NGO workload.
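
Putting the three estimates side by side (all figures in kg of CO2 equivalent per year, and all of them rough):

    # Rough annual comparison, in kg of CO2 equivalent, using the estimates above.
    snowflake_minimum = 83.4     # medium warehouse, 60-second-minimum billing only
    snowflake_realistic = 834    # the same warehouse with default auto-suspend settings
    postgres_always_on = 37      # one m7g.medium running all 8,766 hours of the year

    print(f"Snowflake at the minimum:  {snowflake_minimum / postgres_always_on:.1f}x the PostgreSQL box")
    print(f"Snowflake, realistically: {snowflake_realistic / postgres_always_on:.0f}x the PostgreSQL box")
    # -> roughly 2.3x at the (unreachable) minimum, about 23x at the realistic estimate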

At this point, Snowflake is coming off distinctly the worse for wear in this comparison. However, it gets worse...

Second-order effects

The PostgreSQL estimate is fixed: the box emits that amount of CO2 no matter what you do, at least until you need to upgrade it. The Snowflake estimate, by contrast, scales roughly linearly with usage: every time your query throughput increases, the problem gets worse. Moreover, it's much easier with Snowflake for usage to blow out of control. With that much compute power on tap, you can write inefficient queries, hit the warehouse ten times when once would have been sufficient, and engage in any number of behaviours that, with a PostgreSQL solution, would have immediate and obvious consequences. That's a complete disaster from an engineering and cost perspective, of course, but it also translates directly into further emissions.
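
To make that concrete, here's a toy model of the two curves. The Snowflake line simply scales the realistic estimate linearly from the 22,600-query baseline, and the PostgreSQL line stays flat until the box needs an upgrade; both are sketches, not measurements:

    # Toy model of how the two estimates respond to growing query volume.
    def postgres_kg(queries_per_year: float) -> float:
        return 37.0                                # the box runs all year regardless of load

    def snowflake_kg(queries_per_year: float) -> float:
        return 834.0 * queries_per_year / 22_600   # scales linearly with billed warehouse time

    for q in (22_600, 45_200, 90_400):
        print(f"{q:>7,} queries/year: PostgreSQL {postgres_kg(q):4.0f} kg, Snowflake {snowflake_kg(q):5.0f} kg")
    # doubling the query volume doubles the Snowflake estimate; the PostgreSQL one doesn't move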

Moreover, Snowflake and its ilk come with all manner of fancy features, bells and whistles and other distractions, all of which push the organisation towards consuming even more compute. There's even generative AI these days, which is a complete fucking environmental disaster. Sticking to PostgreSQL avoids these temptations, and from a behavioural perspective, that's probably the best way to avoid being led into foolish decisions.

Finally, if you were foolish enough to adopt Snowflake for a workload like this, that in itself is a massive red flag that your other technical decisions are highly questionable. What else are you wasting time and money on that's creating further emissions?

The ugly conclusion

If you're an organisation with a relatively small data workload, you probably shouldn't be using Snowflake. If you're a fucking environmental NGO, there's no way you should be using it. If you've decided to use it anyway, you're most likely a bunch of hypocrites, or foolish enough that it's unlikely that the work you're doing is going to be helpful in the fight against climate change.

In fairness, there may be some valid reasons to use Snowflake: specifically, it's defensible if using it prevents meaningfully more CO2 emissions than it generates. There isn't really a clear path to that, though: PostgreSQL is a very capable piece of software, and it's unclear what Snowflake would bring to the table that couldn't be done more easily and cheaply with it. Moreover, organisations are very good at lying to themselves, and it's far easier to imagine a use case for Snowflake and justify it with motivated reasoning than it is to find an actual reason, meaning that at least 90% of the justifications you're going to encounter are wrong. And if they're wrong, well, you're generating a small but non-trivial amount of emissions for no good reason, with the potential to become quite a large emitter for no good reason if you're foolish enough.

Just... please, for the love of God, think about your operations for five minutes before jumping on the fucking bandwagon. You can do all these detailed calculations about solar panels: why can't you manage it about your fucking cloud computing?
