By Harry Tan

How We Halved Our Snowflake Expenses: 15 Effective Cost-Optimization Techniques

In the transition from 2022 into 2023, many technology companies faced an uncomfortable truth: they had over-expanded and overstaffed without achieving the hyper-growth needed to sustain such investments. This led to an industry-wide focus on cost optimization. For organizations running Snowflake as their cloud data platform, the Snowflake bill is often one of the larger line items in operational costs.

My journey began at a SaaS company that found itself in a similar predicament. We implemented a hiring freeze and initiated a comprehensive review of our vendor contracts, seeking opportunities to discontinue or replace services with more cost-effective alternatives. We managed to cut our monthly Snowflake bill by approximately 50% within just five months, all without compromising the integrity of our ongoing projects. This remarkable achievement was the result of a series of carefully planned and executed changes, both minor and significant.

In this article, I will share the key strategies that contributed to our success, offering insights that you can potentially apply to your own Snowflake environment. While not all of these strategies may be directly applicable to your specific situation, they can serve as a valuable starting point for your own cost-optimization initiatives.

1. Merge Infrequently-Used Warehouses

At first, we created multiple warehouses so that each team could have its own, which made it easy to attribute Snowflake costs to individual teams. However, this approach increased costs: each warehouse carried its own overhead, and idle compute could not be shared across teams. Whenever a query finished running, its warehouse would sit idle for a period before shutting down, burning credits without doing useful work. By consolidating similar workloads into fewer warehouses, we improved resource utilization and reduced that overhead. Ultimately, we merged all read-only BI queries into a single large pool.
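If you want a quick read on which warehouses are barely used, a query along these lines against the ACCOUNT_USAGE schema can help surface consolidation candidates. This is only a sketch, and the 30-day window is an arbitrary choice:

    -- Credits consumed per warehouse over the last 30 days;
    -- warehouses near the bottom of this list are merge candidates.
    SELECT warehouse_name,
           SUM(credits_used) AS credits_last_30d
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits_last_30d ASC;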

2. Change Scaling Policies

Snowflake’s scaling policies dictate how multi-cluster virtual warehouses adapt to fluctuating workloads. They control how readily additional clusters are started as queries queue up and how quickly idle clusters are shut down again, which helps balance performance against cost.

Snowflake offers two scaling policies: Standard and Economy. We opted for the Economy policy for our backend ETL jobs, as these tasks are generally not time-sensitive and can afford to wait in a queue before execution. Our ETL warehouse is configured as a Small warehouse with a maximum of six clusters. Given the large number of ETL jobs we run daily, the cost savings from this change have been substantial.
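For reference, switching a multi-cluster warehouse to the Economy policy is a small change; the warehouse name and settings below are illustrative rather than our exact configuration:

    -- Use the Economy scaling policy so extra clusters only spin up
    -- when there is enough queued work to keep them busy.
    ALTER WAREHOUSE etl_wh SET
      WAREHOUSE_SIZE = 'SMALL'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 6
      SCALING_POLICY = 'ECONOMY';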

3. Remove Older Data History

While Snowflake’s data storage costs are relatively low — essentially passing on the S3 costs from AWS or equivalent storage costs from other cloud providers — the size of your tables can impact query performance. As tables grow, queries can take increasingly longer to execute, especially if the tables lack appropriate clustering keys. Implementing a data retention policy to periodically remove or archive older, non-essential data can mitigate this issue and optimize costs.
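A retention job can be as simple as a scheduled delete. The table, column, and three-year cutoff below are hypothetical and should be agreed with stakeholders before anything is removed:

    -- Hypothetical retention rule: drop event rows older than three years.
    DELETE FROM app_events
    WHERE created_at < DATEADD(year, -3, CURRENT_DATE());

    -- Optionally shorten Time Travel retention on large, rebuildable tables
    -- so Snowflake keeps less historical storage around.
    ALTER TABLE app_events SET DATA_RETENTION_TIME_IN_DAYS = 1;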

4. Ensure Tables Only Store Relevant Data

This strategy complements the previous one but focuses on data relevance. Conduct regular audits of your tables to ensure they contain only data that is pertinent to your current operations. In our experience, archiving data from churned customers proved beneficial. Additionally, we optimized tables related to features accessible only to higher-tier plans by removing data associated with customers on lower-tier plans.
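As an illustration, purging churned customers' rows might look like the sketch below; app_events, customers, and the 'churned' status value are stand-in names:

    -- Remove rows belonging to customers who have churned.
    DELETE FROM app_events
    WHERE customer_id IN (
      SELECT customer_id
      FROM customers
      WHERE status = 'churned'
    );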

5. Store Aggregated Data, Rather Than All Raw Data

Storing raw data can be both costly and unnecessary. Whenever possible, store only aggregated data that meets your analytical requirements, while purging or archiving the raw data. Although this step may not yield immediate, substantial cost savings, it sets the stage for future efficiencies. By ensuring that stakeholders require only aggregated metrics, we were able to offload entire raw data tables from Snowflake to more cost-effective storage solutions like S3. This had two key advantages: 1) it reduced costs associated with data ingestion and updates, and 2) it opened up opportunities to explore alternative data processing technologies, such as Spark.
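A minimal sketch of this pattern, assuming a raw_events table, a daily_event_counts rollup, and an external stage named raw_archive that points at S3:

    -- Keep only the aggregate that stakeholders actually query.
    CREATE OR REPLACE TABLE daily_event_counts AS
    SELECT event_date, customer_id, event_type, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY event_date, customer_id, event_type;

    -- Unload the raw rows to cheaper object storage, then drop or
    -- truncate raw_events once the archive has been verified.
    COPY INTO @raw_archive/raw_events/
    FROM raw_events
    FILE_FORMAT = (TYPE = PARQUET);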

6. Identify Unused Expensive Columns

Columns with complex data types can disproportionately consume storage and compute resources. It’s crucial to identify such columns and evaluate their utility. For example, we had a log table with a column dedicated to storing error messages. Despite the table’s relatively small row count, the size of these error messages was substantial. After confirming that no one was utilizing this error log, we dropped the column and rebuilt the table, reducing its size from 5.5TB to a mere 130GB.
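The rebuild itself can be a single statement along these lines; log_events and error_message are stand-ins for our real table and column names:

    -- Rebuild the table without the oversized column so the storage
    -- is actually reclaimed in freshly written micro-partitions.
    CREATE OR REPLACE TABLE log_events AS
    SELECT * EXCLUDE (error_message)
    FROM log_events;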

7. Periodic Reclustering Instead of Auto-Clustering

Auto-clustering is a convenient feature for optimizing query performance but can inflate costs. We adopted a manual, weekly reclustering strategy, a tip we owe to our Snowflake sales engineer. We set up two scheduled jobs: one to activate auto-clustering and another to deactivate it after a few hours. This approach proved effective for our dataset. Since our data is already well-clustered and updates tend to affect only certain micro-partitions, periodic reclustering suffices. Subsequent rounds of data updates usually impact the same micro-partitions, making this a time-efficient and cost-effective strategy.
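Here is a sketch of the two scheduled jobs, expressed as Snowflake tasks with placeholder table, warehouse, and timing values; you could just as well trigger the same ALTER TABLE statements from your orchestrator:

    -- Turn automatic clustering on for a few hours every Sunday...
    CREATE OR REPLACE TASK resume_recluster_big_table
      WAREHOUSE = etl_wh
      SCHEDULE = 'USING CRON 0 2 * * 0 UTC'
    AS
      ALTER TABLE big_table RESUME RECLUSTER;

    -- ...and switch it back off once the weekly churn has been absorbed.
    CREATE OR REPLACE TASK suspend_recluster_big_table
      WAREHOUSE = etl_wh
      SCHEDULE = 'USING CRON 0 6 * * 0 UTC'
    AS
      ALTER TABLE big_table SUSPEND RECLUSTER;

    -- Tasks are created suspended, so enable them explicitly.
    ALTER TASK resume_recluster_big_table RESUME;
    ALTER TASK suspend_recluster_big_table RESUME;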

8. Clustering with the Right Key

Clustering is one of the most potent performance-tuning techniques Snowflake offers, provided it’s executed correctly. Effective clustering requires data engineers to understand Snowflake’s micro-partitions, the specific data and read/write patterns, and the tools available for monitoring clustered data. I plan to delve deeper into this topic in a future article.

In our case, we grappled with tables that were several terabytes in size and contained billions of rows. These tables were not append-only; older records could be updated or deleted years after creation. Balancing cost and efficiency became a significant challenge. After extensive analysis and trial-and-error, we settled on using the month of record creation as our clustering key. Our ETL process now updates records created within the last three months, focusing on the most recent data, which is most relevant to our users. All changes are batch-updated over the weekend.
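For context, defining a month-of-creation clustering key and checking how well the table is clustered might look like this; big_table and created_at are placeholders:

    -- Cluster the multi-terabyte table by the month the record was created.
    ALTER TABLE big_table CLUSTER BY (DATE_TRUNC('month', created_at));

    -- Inspect clustering depth and partition overlap before and after changes.
    SELECT SYSTEM$CLUSTERING_INFORMATION('big_table');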

9. Choose the Right Merge-In Cadence for Huge Tables

For large tables, the timing and frequency of merge-in operations can significantly affect both performance and cost. Each merge rewrites the micro-partitions it touches, so merging a trickle of changes continuously means rewriting the same partitions over and over. Batching changes and choosing a cadence that matches how fresh the data actually needs to be can reduce both compute time and cost.
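As one illustration, batching changes and merging them a few times a day rather than continuously could be scheduled like this; the staging table, join key, and six-hour cadence are assumptions, not a recommendation for every workload:

    -- Merge staged changes into the large target table every six hours
    -- instead of continuously, so fewer micro-partitions get rewritten.
    CREATE OR REPLACE TASK merge_events_batch
      WAREHOUSE = etl_wh
      SCHEDULE = 'USING CRON 0 */6 * * * UTC'
    AS
      MERGE INTO events AS t
      USING events_staging AS s
        ON t.event_id = s.event_id
      WHEN MATCHED THEN UPDATE SET
        t.payload = s.payload,
        t.updated_at = s.updated_at
      WHEN NOT MATCHED THEN INSERT (event_id, payload, updated_at)
        VALUES (s.event_id, s.payload, s.updated_at);

    ALTER TASK merge_events_batch RESUME;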

10. Reduce Data Replication Between Regions

Minimizing data replication across geographical regions can lead to significant savings on storage and data transfer costs. Initially, we found Snowflake’s cost page in the legacy UI somewhat misleading: it listed replication under a “REPLICATION” warehouse and showed a separate “Data Transfer” volume. Although the replicated data volume might look small, the data transfer cost is $23 per TB, which added up quickly in our case.
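To see where that spend actually comes from, the ACCOUNT_USAGE views break out replication credits and bytes transferred per database; a sketch, with an arbitrary 30-day window:

    -- Replication credits and data volume moved per database, last 30 days.
    SELECT database_name,
           SUM(credits_used) AS replication_credits,
           SUM(bytes_transferred) / POWER(1024, 4) AS tb_transferred
    FROM snowflake.account_usage.replication_usage_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY database_name
    ORDER BY replication_credits DESC;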

11. Merge Snowflake Accounts

If your organization operates multiple Snowflake accounts, consolidating them can offer several advantages, including volume discounts and shared data processing. Initially, we had separate instances for internal and platform reporting, each located in different regions. These instances operated in isolation, leading to redundant data ingestion and processing. We decided to merge all data processing into one primary account and created a read-only account in the same region for shared data sets. This approach allowed us to maintain separate environments for internal and external reporting, mitigating the risk of unauthorized access to sensitive financial data. While the transition did require time and effort, the cost savings have been substantial.
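The read-only account can consume data through Snowflake's secure data sharing, which avoids maintaining a second copy of the pipelines. A minimal sketch with placeholder database, schema, share, and account names:

    -- In the primary account: share the curated reporting database.
    CREATE SHARE IF NOT EXISTS reporting_share;
    GRANT USAGE ON DATABASE analytics TO SHARE reporting_share;
    GRANT USAGE ON SCHEMA analytics.reporting TO SHARE reporting_share;
    GRANT SELECT ON ALL TABLES IN SCHEMA analytics.reporting TO SHARE reporting_share;
    ALTER SHARE reporting_share ADD ACCOUNTS = myorg.readonly_account;

    -- In the read-only account: mount the share as a local database.
    -- provider_account stands in for the primary account's identifier.
    CREATE DATABASE analytics_shared FROM SHARE provider_account.reporting_share;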

12. Downgrade the Snowflake Edition to Standard

It’s essential to align the Snowflake edition you’re using with your actual needs. If you find that you’re not leveraging enterprise-level features, consider downgrading to the Standard edition to achieve cost savings. For heavy Snowflake users, advanced features such as multi-cluster warehouses may be indispensable. However, if you’re setting up a new instance (for a new region or a specific user group, for example) and the workload is relatively light, the Standard edition may suffice. Enterprise list pricing is roughly 50% higher per credit than Standard, so opting for Standard cuts the per-credit cost by about a third.

13. Shorten the Auto-Suspend Waiting Time

By default, the auto-suspend setting for each Snowflake warehouse is five minutes, which is also the minimum selectable via the Snowflake UI. However, it can be lowered to 60 seconds with a SQL statement: ALTER WAREHOUSE my_wh SET AUTO_SUSPEND = 60;. Reducing the auto-suspend time to 60 seconds ensures that idle virtual warehouses are suspended more quickly, minimizing compute costs. The trade-off is warehouse caching: if users frequently query the same dataset at short intervals, the shorter auto-suspend window may prevent them from benefiting from cached data.

14. Separate Weekend Scheduling

Weekend workloads often differ significantly from weekday ones. Given that both our internal team and our clients (employees of other businesses) tend to be less active over the weekend, we optimized our data ingestion and processing schedules accordingly. Using Apache Airflow to orchestrate our ETL jobs, we implemented separate schedules for weekdays and weekends; Airflow's documentation on cron-based schedules covers how to set this up.
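We did this in Airflow, but the same idea can be expressed with Snowflake tasks if you schedule in-database; in this sketch, refresh_reporting_tables() is a hypothetical stored procedure standing in for the actual ETL, and the cadences are illustrative:

    -- Weekday refresh: hourly, Monday through Friday.
    CREATE OR REPLACE TASK refresh_reports_weekday
      WAREHOUSE = etl_wh
      SCHEDULE = 'USING CRON 0 * * * 1-5 UTC'
    AS
      CALL refresh_reporting_tables();

    -- Weekend refresh: only twice a day on Saturday and Sunday.
    CREATE OR REPLACE TASK refresh_reports_weekend
      WAREHOUSE = etl_wh
      SCHEDULE = 'USING CRON 0 6,18 * * 0,6 UTC'
    AS
      CALL refresh_reporting_tables();

    ALTER TASK refresh_reports_weekday RESUME;
    ALTER TASK refresh_reports_weekend RESUME;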

Before making any changes, it’s crucial to consult with data users to ensure that the new weekend schedule aligns with their needs. While it may be easier to ask for forgiveness than permission for one-time changes, obtaining agreement upfront is a more sustainable approach for long-term operations.

15. Review All Data Assets and Processes with Stakeholders

Periodic reviews of all data assets with key stakeholders can uncover opportunities for cost savings. Questions to consider during these reviews include: Is this dataset still in use? How frequently does this table need to be updated? Are there any backup or test tables that can be deleted? Are all tasks created actually necessary?

Bonus tips: beyond the 15 tactics above that reduce costs directly, here are two more tips on the operational side.

16. Set Up a Usage Quota in Cost Monitoring

We’ve found daily email updates on month-to-date and daily costs to be invaluable. As a growing company, we prefer a post-mortem review process over imposing strict SOPs and policies that could hinder our team’s productivity. If we notice excessive credit usage, we reach out to the individual to discuss the cost implications and suggest more efficient approaches. For example, one user changed a warehouse size from Small to 3X-Large to expedite query execution, unaware of the cost impact. Daily email checks help us maintain cost awareness and prompt timely interventions. Snowflake’s resource monitors are the built-in way to set up these quotas and notifications.
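A resource monitor with a monthly quota and notification triggers covers the basic alerting; the quota, thresholds, and warehouse name below are placeholders:

    -- Monthly credit quota with notifications as usage climbs.
    CREATE OR REPLACE RESOURCE MONITOR monthly_quota
      WITH CREDIT_QUOTA = 1000
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS
        ON 75 PERCENT DO NOTIFY
        ON 100 PERCENT DO NOTIFY;

    -- Attach the monitor to the warehouses it should watch.
    ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = monthly_quota;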

17. Have a Regular Review Session

I hold a weekly review session with my manager, who sponsors our cost optimization project. These sessions ensure we’re aligned in our objectives and strategies, and they also prevent any surprises when the monthly cost reports are finalized. If urgent actions are needed to control costs, having a manager who is well-informed and engaged facilitates quick review and approval of new initiatives.

Summary: Transforming Data Costs into Data Smiles

In an era where technology companies are increasingly focused on cost optimization, Snowflake, as a leading cloud data platform, often represents a significant line item in operational budgets. This article serves as a guide to strategically reducing Snowflake costs based on our own success story at Data Smiles. We managed to cut our monthly Snowflake bill by approximately 50% within just five months, without compromising any ongoing projects.

We’ve shared a range of strategies, from reducing data replication between regions and merging warehouses, to fine-tuning scaling policies and data retention practices. Whether you’re a Snowflake novice or a seasoned user, these tips offer a starting point for your own cost-optimization journey. And if you’d prefer to leave it to the experts, Data Smiles is here to assist you in turning your data expenses into data smiles.

Last but not least, I’d like to take a moment to introduce my consulting firm, Data Smiles (datasmiles.com). We specialize in data engineering solutions and have a proven track record of helping clients optimize their Snowflake costs, among other services. Our team of experts is well-versed in the intricacies of Snowflake’s pricing model and can tailor a cost-reduction plan that aligns with your business objectives. If you found these tips useful but would prefer to have seasoned professionals handle the optimization process for you, don’t hesitate to reach out.
