Modern Data Warehouse
A Data Warehouse is a place where we store a large portion of our data in a relatively accessible environment and format for analysts and others to use. It typically consolidates data from a variety of sources and databases, and this data is used for Analysis, Online Analytical Processing (OLAP), and Reporting.
Transactional databases are not well suited for analytics. To perform analytics, a Data Warehouse is required.
Data Warehouses have to handle big data. Big data is generally described as data that exhibits the five V's: Volume, Velocity, Variety, Veracity, and Value.
A Modern Data Warehouse might contain a mixture of relational and non-relational data, including image files, video streams, and IoT sensor data such as feeds from traffic-counting devices.
Modern Data Warehousing usually combines Batch processing and Stream processing.
Large-scale businesses need some combination of historical data, which is associated with Batch processing, and real-time, up-to-date data, which requires Stream processing. This means that Modern Data Warehouses incorporate both Batch and Stream processing.
In Batch processing, we wait for a certain amount of raw data to pile up before running an ETL job. That is to say, the data is anywhere from an hour to a few days old before it is made available for analysis. Batch ETL jobs typically run on a set schedule, say every 24 hours, or once the amount of data reaches a specific threshold.
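The batch ETL pattern described above can be sketched in a few lines of Python. This is a minimal, illustrative example, not a production pipeline: the field names (`customer`, `amount`) and the transform steps are assumptions made for the sake of the sketch.

```python
from datetime import datetime, timezone

def run_batch_etl(raw_rows):
    """Transform a batch of raw source rows into warehouse-ready records.

    Extract has already happened (raw_rows came from a source system);
    this function does the Transform step and returns rows for the Load step.
    """
    loaded = []
    for row in raw_rows:
        # Transform: normalize fields and stamp the load time.
        record = {
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
            "loaded_at": datetime.now(timezone.utc).isoformat(),
        }
        loaded.append(record)
    return loaded

# A batch that has "piled up" since the last scheduled run.
batch = [{"customer": " Alice ", "amount": "19.99"},
         {"customer": "BOB", "amount": "5.00"}]
warehouse_rows = run_batch_etl(batch)
print(len(warehouse_rows))  # 2
```

In a real deployment, a scheduler such as cron or an orchestration tool would invoke this job every 24 hours or when a size threshold is crossed, which is why the loaded data is always somewhat stale.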
Stream processing is used to query continuous data streams and to detect conditions quickly, within a small window of time after the data is received. The detection period varies from a few milliseconds to minutes. It goes by many names, such as Real-Time Analytics, Streaming Analytics, Complex Event Processing, Real-Time Streaming Analytics, and Event Processing. There is a great deal of streaming data available, such as customer transactions, activities, and website visits, and it will grow even faster with the spread of IoT devices around the world.
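As a rough sketch of the condition-detection idea above, the following Python snippet evaluates each event as it arrives and fires when the sum of recent transaction amounts in a sliding window exceeds a threshold. The window size and threshold are illustrative assumptions, not values from the source.

```python
from collections import deque

class ThresholdDetector:
    """Detect when recent stream activity exceeds a threshold.

    Keeps a sliding window of the last N event amounts; each incoming
    event is checked immediately, which is the essence of stream
    processing (versus waiting for a batch to accumulate).
    """
    def __init__(self, window_size=3, threshold=100.0):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def on_event(self, amount):
        """Process one event; return True if the condition fires."""
        self.window.append(amount)
        return sum(self.window) > self.threshold

detector = ThresholdDetector()
alerts = [detector.on_event(a) for a in [40.0, 30.0, 50.0, 10.0]]
print(alerts)  # [False, False, True, False]
```

Real systems (e.g. Complex Event Processing engines) express such conditions declaratively over time windows, but the per-event evaluation loop is the same underlying idea.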
When designing a Data Warehouse, optimizing performance starts with identifying the types of workload that will be executed in the warehouse.
In 2007, the Gartner Group identified six distinct workload types that are commonly found in Data Warehouses. The first is continuous data loading, which creates an environment very similar to OLTP (online transaction processing) because indexes and other optimization structures have to be continuously updated. This can have a significant effect on summary and aggregate data, which is used to support dashboards and pre-built reports. The same mechanisms can also be used for batch data loading.
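To see why continuous loading resembles OLTP, consider the toy sketch below: every incoming row forces an immediate, incremental update of a summary aggregate that dashboards would read, rather than a periodic bulk rebuild. The table and column names here are invented for illustration.

```python
class SalesSummary:
    """Maintain a running aggregate that is updated on every load.

    Each ingested row updates the per-region total immediately,
    mirroring how continuous loading must keep indexes and summary
    structures current row by row, just as an OLTP system would.
    """
    def __init__(self):
        self.totals = {}  # region -> running sales total

    def ingest(self, row):
        """Load one row and keep the aggregate current."""
        region = row["region"]
        self.totals[region] = self.totals.get(region, 0.0) + row["amount"]

summary = SalesSummary()
for row in [{"region": "east", "amount": 100.0},
            {"region": "west", "amount": 50.0},
            {"region": "east", "amount": 25.0}]:
    summary.ingest(row)
print(summary.totals)  # {'east': 125.0, 'west': 50.0}
```

Under batch loading, by contrast, the same aggregate could be recomputed once per day, which is far cheaper per row but leaves dashboards stale between runs.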
For Data Warehouses used exclusively as reporting systems, new types of storage partitioning and other optimization structures may need to be created. This is particularly relevant for business analytics performed by business professionals with little or no programming experience, who use the data in random, unpredictable ways, making it impossible to optimize and tune their queries in advance. This gives rise to a workload category called analytic and BI-oriented functions.
Traditional Data Warehouses in a typical IT environment ingest, model, and store data through ETL jobs. These jobs move large amounts of data in a batch-oriented manner and are most commonly scheduled to run daily. Running them daily means that, at best, the warehouse data is a few hours old, but more typically it is a day old or older. Because ETL jobs consume significant CPU, memory, disk space, and network bandwidth, it is difficult to justify running them more than once a day. At a time when application programming interfaces (APIs) were not as prevalent as they are now, ETL tools were the only solution for operational use cases.
A data mart is focused on a single functional area of an organization and contains a subset of the data stored in a Data Warehouse. Data marts usually draw data from only a few sources compared to a Data Warehouse, and they are smaller in size and more flexible.
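The subsetting relationship between a warehouse and a mart can be sketched as a filter plus a projection. The department and column names below are purely illustrative assumptions.

```python
# Warehouse rows span many functional areas and many columns.
warehouse = [
    {"dept": "marketing", "campaign": "spring", "spend": 1000.0},
    {"dept": "finance",   "campaign": None,     "spend": 0.0},
    {"dept": "marketing", "campaign": "fall",   "spend": 800.0},
]

def build_marketing_mart(rows):
    """Derive a data mart: filter to one functional area (marketing)
    and project only the columns that area actually needs."""
    return [{"campaign": r["campaign"], "spend": r["spend"]}
            for r in rows if r["dept"] == "marketing"]

mart = build_marketing_mart(warehouse)
print(len(mart))  # 2
```

In practice the mart would be materialized as its own tables (or a separate database), refreshed from the warehouse, but the filter-and-project relationship is the core of the concept.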
The Real-Time Data Warehouse describes a system that reflects the state of its source data in real time. A Real-Time Data Warehouse ingests data with low latency, so users always see current information.
Governments mandate how long organizations must retain their data and historical information. A Data Warehouse built for such retention requirements contains data volumes many times larger than those of Traditional Data Warehouses, which requires paying close attention to storage structures and partitioning. By contrast, an Analytic OLTP warehouse is one that has been created to be all things to all people.
Operational data is periodically replicated to a Data Warehouse for analytics, but given how competitive enterprises are today, there is also a need for real-time reporting on operational data.
So the database needs to provide performance guarantees for transactional workloads while, at the same time, efficiently evaluating complex analytical queries. Optimizing for these varied and contradictory requirements may not be possible, so it is important to be aware of what kinds of workloads the Data Warehouse is supporting. As a direct result of the need for real-time data loading and automated transactions, the best data source might be one or more transactional databases rather than the Data Warehouse.
Source : Percipio and Microsoft