Posts

SQL Data Cleaning in Real world Applications

Data Cleaning refers to the massaging of raw data to make it usable and ready for analysis.  T he process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. Examples: Data could be in un-standardized formats like in a different units of currency and it requires to be normalized to ensure comparing equally across records.  Data could be in string data types and it requires to cast each column in order to run computations. Data could be having different date formats and it requires to be standardized according to the country specifics. Removing irrelevant data Deduplicating the data or removing duplicates in other words Dealing with missing data Filtering data outliers Validating the data Most of the data scientists and analysts spend most of their time in prepping the data. Effective Data Cleaning Strategies Review t

SQL Window Functions - Part 1 (Basic functions overview)

Image
Windows Function : A window function is a calculation across a set of rows in a table that are somehow related to the current row.   A window function is similar to aggregate functions combined with group by clauses but have one key difference:  Window functions retain the total number of rows between the input table and the output table (or result).  Behind the scenes, the window function is able to access more than just the current row of the query result. Use cases : 1. T o measure trends or changes over rows or records in your data. 2. To rank a column for outreach or prioritization. Partition by:  A subclause of the OVER clause. Similar to GROUP BY. Over:  Typically precedes the partition by that signals what to “GROUP BY”. Aggregates:  Aggregate functions that are used in window functions, too (e.g., sum, count, avg). Row_number():  Ranking function where each row gets a different number. Rank():  Ranking function where a row could get the same rank if they have the same value. D

Slowly Changing Dimensions

Image
A slowly changing dimension (SCD) is one that appropriately manages the change of dimension members over time. It applies when the values of a business entity change over time, and not on a set schedule. For example, a customer SCD will have a set of columns that can change and  a set of fields to track when the record was added and modified. Types of SCD - Type 0 – Fixed Dimension No changes allowed, dimension never changes Type 1 – No History Update record directly, there is no record of historical values, only current state Type 2 – Row Versioning Track changes as version records with current flag & active dates and other metadata Type 3 – Previous Value column Track change to a specific attribute, add a column to show the previous value, which is updated as further changes occur Type 4 – History Table Show current value in dimension table but track all changes in separate table Type 6 – Hybrid SCD Utilise techniques from SCD Types 1, 2 and 3 to track change Star schema design

Azure Data Factory - Chapter 2

Image
  Azure custom roles If the Azure built-in roles don't meet the specific needs of your organization, you can create your own custom roles. Click this link to see an overview of custom roles .