Data Warehouse (OLTP vs OLAP, Traditional vs Modern Data Warehousing Solutions)

Data Warehouse: a central store of information that can be queried and used for analysis.

Data generated by the various software systems a company uses (finance applications, HR systems, email, Excel sheets) is collected and put through an ETL (extract, transform, load) process, where it is cleaned, transformed, validated, and aggregated. The integrated data is stored in a special database called a data warehouse, where it is arranged into dimensions and facts. This data then becomes the source for data mining activities, which draw conclusions about the business process and ultimately support decision making.
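The extract, transform, and load steps described above can be sketched in miniature. The source records, field names, and table below are all invented for illustration; a real pipeline would read from the operational systems and load into the warehouse engine.

```python
import sqlite3

# Hypothetical raw records extracted from different source systems
# (finance app, HR system, spreadsheets); names and fields are illustrative.
raw_records = [
    {"dept": "Finance", "amount": "1200.50", "month": "2024-01"},
    {"dept": "finance", "amount": "300.00",  "month": "2024-01"},
    {"dept": "HR",      "amount": None,      "month": "2024-01"},  # incomplete row
]

def extract():
    return raw_records

def transform(rows):
    cleaned = []
    for row in rows:
        if row["amount"] is None:           # validate: drop incomplete rows
            continue
        cleaned.append({
            "dept": row["dept"].title(),    # clean: normalize department names
            "amount": float(row["amount"]), # transform: cast text to numeric
            "month": row["month"],
        })
    return cleaned

def load(rows, conn):
    conn.execute("CREATE TABLE fact_spend (dept TEXT, amount REAL, month TEXT)")
    conn.executemany("INSERT INTO fact_spend VALUES (:dept, :amount, :month)", rows)

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM fact_spend").fetchone()[0]
print(total)  # 1500.5 -- integrated figure ready for analysis
```

The bad HR row is dropped during validation, and the inconsistent department spellings are normalized before loading, which is exactly the kind of quality work ETL exists to do.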

Concerns about using a DW, and the need for one:
1. It is an extra piece of software that needs maintenance, which also comes with added cost.
2. The data is already available in the operational systems.
3. On the other hand, extracting information directly from operational data is complicated.


Data processing solutions:

OLTP (Online Transaction Processing) systems are called transactional systems: they record transactions, handling the day-to-day operations of the enterprise.

OLAP (Online Analytical Processing) systems are called analytical systems: they analyze the information collected in a database (often fed from OLTP databases) for the purpose of making management decisions.
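The two workloads can be contrasted against the same store. This sketch uses SQLite and invented order data: the OLTP side is many small writes, the OLAP side is one scan-and-aggregate read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP style: many small writes, each recording one business transaction.
for customer, amount in [("alice", 20.0), ("bob", 35.0), ("alice", 15.0)]:
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", (customer, amount))
conn.commit()

# OLAP style: a read that scans and aggregates history to support a decision.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 35.0), ('bob', 35.0)]
```

In practice the two workloads run on separate systems tuned differently: OLTP schemas are normalized for fast single-row writes, while OLAP schemas are shaped for large scans and aggregations.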


Why data warehousing in the cloud?

  • Cost (pay-as-you-go models available)
  • Scalability
  • No hardware maintenance
  • No large capital expense
  • Integration with other cloud services
  • Time to market
  • Massively parallel processing (MPP)


In modern data warehousing solutions, data comes from both structured and unstructured sources and is ingested into storage. Azure Data Factory is one such service: it can collect data from different sources and ingest it into a storage layer.
For storage, more and more customers are now using a Data Lake architecture, which gives you a cost-effective way to store data that can be accessed by data scientists, data analysts, and data engineers. This data has to be prepared and cleaned, because data quality is an important aspect of a modern data warehousing system. A dimensional model of the data, a structural model, is then created so that people can understand its shape, structure, and schema, providing a single source of truth on which the business can rely and against which it can ask questions. Finally, BI, reporting, and advanced analytics workloads are built on top of the solution.
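A dimensional model separates numeric measures (facts) from descriptive attributes (dimensions). A minimal star-schema sketch, with invented product and sales tables, shows how a business question becomes a fact-to-dimension join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes the business understands.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
# Fact table: numeric measures keyed by dimension ids.
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, qty INTEGER, revenue REAL)")

conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Laptop", "Hardware"), (2, "Office Suite", "Software")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 2, 2400.0), (2, 5, 500.0), (1, 1, 1200.0)])

# A business question ("revenue by category") answered via the star schema.
rows = conn.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)  # [('Hardware', 3600.0), ('Software', 500.0)]
```

Because every fact row points back to a dimension row, the same fact table can answer many different questions simply by grouping on different dimension attributes.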

The compute of Azure SQL Data Warehouse can be completely paused for maximum savings whenever it is not in use. Note that when you pause compute you are still charged for storage; but if you are not querying the data, you can stop compute and pay only for storage. Azure Data Lake storage is very cheap, so you now have the flexibility to pull in all of your data without filtering. This raw data gives you a lot of flexibility in the future: it can answer business questions that you do not even know about yet. This was not possible in the old database architecture, where compute and storage were coupled, so data either had to be filtered out or an archiving strategy was used to save cost. With this architecture, the pattern is a very common way to achieve a modern data warehousing solution, and it is why many customers are moving their data warehousing to the cloud. (A data warehousing unit, or DWU, is a blended measure of CPU + memory + IO.)
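The effect of decoupling compute from storage can be estimated with a quick cost model. The hourly rates below are invented placeholders, not Azure pricing; the point is only that pausing eliminates the compute charge while the (much smaller) storage charge continues.

```python
# Hypothetical hourly rates -- placeholders, NOT real Azure prices.
COMPUTE_PER_HOUR = 1.20   # billed only while compute is running
STORAGE_PER_HOUR = 0.05   # billed continuously, paused or not

def monthly_cost(active_hours, hours_in_month=730):
    compute = COMPUTE_PER_HOUR * active_hours    # nothing billed while paused
    storage = STORAGE_PER_HOUR * hours_in_month  # storage is always billed
    return round(compute + storage, 2)

always_on      = monthly_cost(active_hours=730)  # compute never paused
business_hours = monthly_cost(active_hours=200)  # paused nights and weekends
print(always_on, business_hours)  # 912.5 276.5
```

Under these made-up rates, pausing outside business hours cuts the bill by roughly two thirds, while all of the data remains durably stored and queryable as soon as compute resumes.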

Azure Data Factory is a serverless integration product. It has many components and can bring data into your storage solution, landing it in Azure Data Lake Storage. Azure Data Lake Storage is really a big data store: this is where data comes together and is combined, cleaned, and transformed. Data Lake storage is also the cheapest option for storage.

Databricks is another service we can use to explore the data in the Data Lake. It lets us query and explore the data in many languages, such as Python, R, Scala, and SQL, so customers can choose according to their familiarity or skill set; they have plenty of choice for data preparation. Databricks can write this data into Azure SQL Data Warehouse, or it can work with Azure Analysis Services. Azure Databricks also lets you create a Spark environment: Spark clusters give you a lot of power not just for exploring the Data Lake but also for preparing and cleaning the data, so you get the best of both worlds. Finally, after bringing the data into a big data store and doing the preparation and exploration, we are ready to publish the data for the organization to consume.
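In a Databricks notebook this exploration and cleaning step would typically run in PySpark at cluster scale; as a local stand-in, the same logic in plain Python with invented Data Lake rows:

```python
# Hypothetical raw rows as they might land in the Data Lake;
# columns and values are invented for illustration.
raw = [
    {"user": "a", "country": "US", "spend": "10.5"},
    {"user": "b", "country": None, "spend": "7.0"},   # missing attribute
    {"user": "c", "country": "de", "spend": "x"},     # unparsable measure
]

def clean(rows):
    out = []
    for r in rows:
        try:
            spend = float(r["spend"])   # drop rows whose measure won't parse
        except (TypeError, ValueError):
            continue
        if r["country"] is None:        # drop rows missing a key attribute
            continue
        out.append({"user": r["user"],
                    "country": r["country"].upper(),  # normalize codes
                    "spend": spend})
    return out

cleaned = clean(raw)
print(cleaned)  # [{'user': 'a', 'country': 'US', 'spend': 10.5}]
```

The shape of the work is the same whether it runs here or on a Spark cluster: profile the raw data, decide the quality rules, and emit a cleaned dataset ready to publish into the warehouse or the analysis layer.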

Source: Microsoft


