
With a diverse range of data sources, including CRM applications like SFDC and social media apps like Twitter, pouring immense volumes of structured and semi-structured data into organizations, the traditional data warehouse architecture is no longer a good fit for enterprises.
The tightly coupled storage and compute resources of older data warehouses frustrate data analysts and strain the systems. Analysts have to wait 24 hours or more for data to flow into the warehouse before it is ready for analysis, and even longer to run complex queries.
As data volumes grow, ETL (Extract, Transform, Load) times keep increasing, and the warehouse spends longer in batch mode, which leaves users less time to query it.
While working at a global telecom provider, this was a continuous battle! Every other morning, analysts would come to the office to find the batch jobs still running. A minor failure the previous night, taking half an hour to an hour to fix, meant users were back at their desks ready to start their trend analysis but, oh my god, they could not use the system because the data was not ready yet and Teradata's batch mode had to continue. On good days, with no failures, the jobs would complete by 6 am. A single failure taking a couple of hours to fix would stretch this to 6 pm. Yes, a 12-hour delay!
Problems with Conventional Data Warehouses:
- Performance issues while simultaneously trying to load and query data
- Inefficiency in handling varied data sources
- Expensive, slow and painful process of data recovery
- Lack of a single source of truth, causing inconsistent, untrustworthy data and poor data sharing
How Snowflake solves these problems:
Purpose-built for the cloud, Snowflake has a unique, scalable architecture that separates its storage and compute resources. It is a hybrid that blends the shared-nothing and shared-disk architectures of traditional warehouses with massively parallel processing (MPP) capability. Snowflake offers independently scalable, virtually unlimited storage and compute resources as a pay-as-you-go service.
Snowflake has 3 layers to its “multi-cluster, shared data” architecture –
- Compute Layer (also known as Virtual Warehouses; resizable, scalable and elastic)
- Storage Layer (uses a hybrid columnar, compressed storage mechanism)
- Services Layer (handles all data management functionalities — metadata, security, optimization)
Source: Online Snowflake Resources
Solution 1: High Performance with High Elasticity and High Availability
- Snowflake's Virtual Warehouses (VWs) can be scaled up, down or out independently of storage requirements, avoiding the compromises of a shared-nothing architecture where compute and storage must grow together
- You can create as many VWs as you like: for example, a separate XS virtual warehouse for ETL processing, a Large VW for your reporting needs, an XL for your data science queries, and so on. This ensures your ETL processes run smoothly even while users are querying the reporting data, without impacting performance
- VWs come in various (T-shirt) sizes from XS up to 4XL. The size can be changed at runtime based on your workload requirements, and a VW can be auto-suspended when not in use. For example, at night when more batch jobs are running, the VW can be resized from L to XL and switched back once the batch jobs complete. This ensures queries are not stuck and data is available on time (see the SQL sketch after this list)
- VWs also have a multi-cluster feature that helps with query concurrency. For example, at month-end when more users are hitting the reporting VW, additional clusters of the same VW size are spun up automatically so every user can query instantly; when concurrency drops, it scales back in automatically
- Snowflake uses columnar storage, which improves query performance because only the columns a query needs are read, as opposed to entire rows in a conventional relational system
- Snowflake exploits data caching for better performance without requiring users to perform partitioning, indexing or stats gathering. For example, if the underlying data has not changed, the same query run by multiple users is served from the result cache, which greatly reduces I/O operations. Warehouse caching is another way of improving subsequent query performance: data read from tables is kept on the VW's local SSDs, so it need not be fetched from the table again. Bear in mind that when a virtual warehouse is suspended or resized, this cached data is lost, so you will need to weigh cache-driven performance against the extra credits of keeping a warehouse running
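Below is a minimal SQL sketch of the virtual warehouse behaviour described above; the warehouse names, sizes and timeouts are illustrative assumptions, not recommendations.

```sql
-- Illustrative names and settings only.
-- A dedicated warehouse for ETL, suspended automatically when idle
CREATE WAREHOUSE etl_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 300        -- suspend after 5 minutes of inactivity
       AUTO_RESUME = TRUE;

-- A multi-cluster warehouse for reporting that scales out under concurrency
CREATE WAREHOUSE reporting_wh
  WITH WAREHOUSE_SIZE = 'LARGE'
       MIN_CLUSTER_COUNT = 1
       MAX_CLUSTER_COUNT = 3     -- extra clusters spin up at month-end peaks
       SCALING_POLICY = 'STANDARD';

-- Resize at runtime for the nightly batch window, then scale back down
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
```

Because storage sits in its own layer, both warehouses can query the same data without contending with each other.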
Solution 2: Snowflake can Smartly Handle Volume, Variety and Velocity of Incoming Data
Over the years, data sources have expanded beyond transactional operations to include exponential volumes from websites, mobile apps, games, social media and even machine-generated data from IoT devices. Moreover, a significant volume of this data arrives in a semi-structured format which traditional on-prem DMS (Data Management Systems) are not equipped to handle.
- Snowflake comes as a Software-as-a-Service platform, which gives you more time to focus on your data rather than grooming disparate infrastructure. Snowflake takes care of tuning the knobs and of compressing and encrypting the data in transit and at rest
- Snowflake's dynamic elasticity and separate compute and storage layers let queries run without affecting workloads that are processing large volumes of data
- Snowflake comprehensively supports ANSI SQL and ACID transactions (unlike Hive) and can easily handle structured data coming from RDBMS sources and flat/CSV files
- Snowflake's VARIANT datatype allows semi-structured data coming from JSON, XML, Parquet or ORC to be stored in Snowflake tables. Flattening VARIANT data is another cool Snowflake feature, which converts semi-structured data into a relational representation for analytical use cases (see the sketch after this list)
- Snowflake's Snowpipe helps handle the velocity of data by continuously processing it in micro-batches and making it available to users while the data is still fresh
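Here is a minimal sketch of the semi-structured features mentioned above; the table, stage and pipe names are hypothetical.

```sql
-- Hypothetical object names, for illustration only.
CREATE TABLE raw_events (payload VARIANT);

-- Query semi-structured JSON with path notation and a cast
SELECT payload:customer.name::STRING AS customer_name
FROM raw_events;

-- Flatten a nested array into relational rows
SELECT f.value:sku::STRING AS sku,
       f.value:qty::NUMBER  AS qty
FROM raw_events,
     LATERAL FLATTEN(input => payload:line_items) f;

-- Snowpipe: continuously load fresh JSON files as they land in a stage
CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events
  FROM @events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```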
Did you know?
1 petabyte = 1 million gigabytes ≈ 500 billion textbook pages ≈ 58,333 HD movies (each ~2 hours long)
Solution 3: Recovering an Object was Never Simple before Snowflake!
- Snowflake's UNDROP feature is unique in itself. It allows you to restore a table, a schema or an entire database that was dropped by mistake or on purpose. You do not have to depend on administrators or wait for days to get it done; a simple UNDROP TABLE/SCHEMA/DATABASE command comes to your rescue
- Snowflake's time travel feature allows you to recover an earlier version of an object, for example to reverse unwanted updates. You can simply use a query ID, a timestamp or an offset (to go xx minutes or xx days backwards) to get your original data back on your own. Time travel lets you go back up to 90 days to retrieve data (see the sketch after this list)
- Snowflake's Fail-safe feature helps restore data for up to 7 days after the time travel period has lapsed. Unlike time travel, Fail-safe data can only be accessed by Snowflake personnel and is retrieved only for exceptional incidents such as a natural disaster or a security breach
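A minimal sketch of UNDROP and time travel, using hypothetical object names and a placeholder query ID:

```sql
-- Hypothetical names; the query ID below is a placeholder.
-- Restore accidentally dropped objects
UNDROP TABLE sales;
UNDROP SCHEMA staging;
UNDROP DATABASE analytics_db;

-- Time travel: query the table as it was 30 minutes ago (offset is in seconds)
SELECT * FROM sales AT (OFFSET => -60 * 30);

-- ...or as of a specific timestamp
SELECT * FROM sales AT (TIMESTAMP => '2023-01-15 08:00:00'::TIMESTAMP_LTZ);

-- ...or just before a specific (bad) statement ran, identified by its query ID
SELECT * FROM sales BEFORE (STATEMENT => '01a2b3c4-0000-1111-2222-333344445555');

-- Recover an earlier state into a new table by cloning a time-travel snapshot
CREATE TABLE sales_restored CLONE sales AT (OFFSET => -60 * 30);
```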
Solution 4: Coherent Data Centralization, Democratization and Sharing with Snowflake
- Snowflake consolidates data warehouses, data marts, and data lakes into a single source of truth (central data storage accessible by all, based on privileges) and democratizes data to empower users for better analytics
- Snowflake is cloud-agnostic, so distributing data across cloud providers and regions for high availability and disaster recovery is a breeze
- Cloning, and in particular zero-copy cloning, is another great Snowflake feature. You can clone a table, a schema or an entire database without any additional cost: cloning in Snowflake only creates additional metadata pointing to the same underlying data, so no extra storage is incurred. A perfect use case is spinning up dev and test environments from prod. In specific cases, an additional anonymized database can be cloned to mask sensitive prod data; masking changes the data, so that copy does incur storage charges, but multiple databases can then be zero-copy-cloned from it for different test cases
- Snowflake's data sharing capability allows a Snowflake provider to give read-only access to certain objects to multiple consumers, and these consumers may or may not have a Snowflake account. A single copy of the shared objects (tables, schemas, secure views, databases) is referenced, so no additional storage costs are incurred; consumer accounts are charged only for querying. If a consumer does not have a Snowflake account, a reader account is created and the querying cost is borne by the provider. This feature significantly reduces the effort and time organizations spend sharing data via downloaded files, emails, etc. If the source data changes in Snowflake, consumers see the updated data within minutes (a sketch of cloning and sharing commands follows this list)
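A minimal sketch of zero-copy cloning and data sharing; the database, table, share and account names are assumptions for illustration.

```sql
-- Hypothetical names, for illustration only.
-- Zero-copy clone: only metadata is created, no data is duplicated
CREATE DATABASE dev_db CLONE prod_db;
CREATE TABLE customers_test CLONE prod_db.public.customers;

-- Data sharing: expose read-only objects to a consumer account
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE prod_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA prod_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE prod_db.public.customers TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_org.consumer_account;
```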
At the end of the day, technology platforms have to act as enablers for the business, helping it visualize insights and make the right decisions for the present and the future. Snowflake helps build an integral data foundation for exactly that!
Neeta Valecha, Technical Project Manager at Enquero
13 years of IT experience in Data Warehousing, Big Data, Data Analysis and ETL tools, along with Project Management, Software Development and IT Operations Management. She is Snowflake certified, loves tech innovations, follows minimalism and adores cats!