SQL Data Cleaning in Real world Applications

Data Cleaning refers to the massaging of raw data to make it usable and ready for analysis. 

The process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.

Examples:

  • Data could be in un-standardized formats like in a different units of currency and it requires to be normalized to ensure comparing equally across records. 
  • Data could be in string data types and it requires to cast each column in order to run computations.
  • Data could be having different date formats and it requires to be standardized according to the country specifics.
  • Removing irrelevant data
  • Deduplicating the data or removing duplicates in other words
  • Dealing with missing data
  • Filtering data outliers
  • Validating the data

Most of the data scientists and analysts spend most of their time in prepping the data.

Effective Data Cleaning Strategies

  • Review the data that you have at hand.
  • Understand the different data fields being used and what are the data types , do these align with your data needs.
  • Build a plan to convert the data you currently have to the data you need. Define actions and data cleaning techniques that needs to be applied.
  • Building a plan for analysis is important after the planning. Also, visualize the plan to solve the problem  

Effective Data Cleaning Methods (most commonly used functions)

  • LEFT : Extracts number of characters from a string starting from left
  • RIGHT : Extracts number of characters from a string starting from right
  • SUBSTR : Extracts substring from a string (starting at any position)
  • POSITION : Returns the position of the first occurrence of a substring in a string
  • STRPOS: Returns the position of a substring withing a string
  • CONCAT : Adds two or more expressions together
  • CAST : Converts a value of the data type into a required (specific) data type
  • COALESCE: Returns the first non-null value in a list
Check out more functions here
LEFT(student_data, 8) AS student_id
RIGHT(student_course, 3) AS course_id
SUBSTR(string,start,length) i.e SUBSTR(student_data,10,2) as gender
CONCAT(month,'-',day,'-',year) as date             
CAST(salary AS int)
POSITION(substring IN string) i.e. POSITION("$" IN student_sal) as sal
SELECT POSITION('S' IN 'SUSTAINABLE')AS POSITION_S;  
OUTPUT: 1
SELECT POSITION( 'DELHI' IN 'NEW DELHI'AS POSITION_DELHI;  
OUTPUT: 5
SELECT STRPOS('w3resource', 'so')AS "Position of substring";     OUTPUT : 5
SELECT COALESCE(NULL, 4, 2, 'W3Schools.com');  OUTPUT: 4

Comments