Data Pipelines
A data pipeline moves and transforms data in a series of steps or stages. It defines the flow of data from one part of a system to another, and what happens to the data as it moves through this flow. The data moving through a pipeline can come from anywhere and be of any scale.
Components of a Data Pipeline (a minimal sketch follows this list):
- Source
- Destination
- Data flow
- Processing
- Workflow
- Monitoring
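To make these components concrete, here is a minimal Python sketch of a toy pipeline. All names (read_source, process, and so on) are illustrative, not from any specific framework: the source yields raw records, processing turns them into typed values, the destination receives them, the run_pipeline function is the workflow, and a simple log stands in for monitoring.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")  # monitoring: log each record moved

def read_source():
    """Source: yield raw records (here, a hard-coded list for illustration)."""
    for record in [{"user": "a", "amount": "10"}, {"user": "b", "amount": "25"}]:
        yield record

def process(record):
    """Processing: convert raw fields into typed, meaningful values."""
    return {"user": record["user"], "amount": int(record["amount"])}

def write_destination(record, sink):
    """Destination: append the processed record to the sink."""
    sink.append(record)

def run_pipeline():
    """Workflow: the data flow from source through processing to destination."""
    sink = []
    for raw in read_source():
        processed = process(raw)
        write_destination(processed, sink)
        log.info("moved record for user=%s", processed["user"])  # monitoring
    return sink

if __name__ == "__main__":
    print(run_pipeline())
```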
Challenges to building a Data Pipeline:
Netflix, for example, has built its own data pipeline. However, building your own data pipeline is difficult and time-consuming.
Here are some common challenges to creating a data pipeline in-house:
- Connection
- Flexibility
- Centralization
- Latency
Types of Data Processing:
- Batch processing
- Stream processing

Processing converts raw data into meaningful information that can provide some insight. A short code sketch follows each of the two lists below.
Batch processing
- Buffering and processing data in groups. Example: a credit card bill, where transactions are accumulated and then processed together
- Large volumes of data can be processed at a convenient time
- Time delay between ingesting data and getting results
- Used to perform complex analytics
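A minimal batch-processing sketch in Python, continuing the toy example above. The buffered transactions and the per-user totals are made up for illustration; the point is that a whole group of accumulated records is processed in one pass, some time after the data was ingested.

```python
from collections import defaultdict

def process_batch(transactions):
    """Process an accumulated group of records in one pass (batch)."""
    totals = defaultdict(int)
    for tx in transactions:
        totals[tx["user"]] += tx["amount"]
    return dict(totals)

# Transactions buffered over a billing period (illustrative data).
buffered = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 25},
    {"user": "a", "amount": 5},
]

# Run at a convenient time, e.g. end of month; results are delayed
# relative to when the data arrived.
print(process_batch(buffered))  # {'a': 15, 'b': 25}
```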
Stream processing
- Handles data in real time; data is processed as soon as it arrives
- Can process only small volumes of data in real time
- Ideal for time-critical operations that require an instant response
- Examples: stock market feeds, YouTube, Netflix, etc.
- Used for simple functions, aggregates, etc.
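A matching stream-processing sketch, again with made-up data. Each record is handled the moment it "arrives", updating a running aggregate instead of waiting for a full batch; in practice the events would come from a queue or socket rather than a list.

```python
from collections import defaultdict

def stream_events(events):
    """Simulate records arriving one at a time."""
    yield from events

# Illustrative event stream; a real system would read from a queue or socket.
events = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 25},
    {"user": "a", "amount": 5},
]

running_totals = defaultdict(int)
for event in stream_events(events):
    # Process each record as soon as it arrives (no buffering).
    running_totals[event["user"]] += event["amount"]
    print(f"after event: {dict(running_totals)}")
```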
There are two main tools in the Microsoft Azure cloud platform for creating data pipelines:
- The first is Azure Data Factory.
- The other is Azure Synapse Analytics, where pipelines are referred to as Synapse Pipelines in the Synapse workspace.
A pipeline in Azure Data Factory or Synapse is a logical grouping of activities such as data movement, data transformation, and control flow. The activities inside a pipeline are the actions we perform on the data. For example (a simplified pipeline definition follows this list):
- A Copy Data activity is used to load data from an on-prem SQL Server into Azure Data Lake
- A Data Flow activity extracts data from the Data Lake, transforms it, and loads it into Synapse
- A Control Flow activity runs copy data or data flow activities iteratively
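For a feel of what such a pipeline looks like, here is a simplified sketch of an Azure Data Factory pipeline definition, written as a Python dict mirroring the JSON that ADF uses. The pipeline and dataset names (LoadSalesData, OnPremSqlTable, DataLakeFolder) are hypothetical, and several required pieces (linked services, integration runtimes, triggers) are omitted for brevity.

```python
# Simplified sketch of an ADF pipeline definition (JSON expressed as a dict).
# Names are hypothetical; a real pipeline also needs linked services,
# integration runtimes, and other properties omitted here.
pipeline = {
    "name": "LoadSalesData",
    "properties": {
        "activities": [
            {
                "name": "CopyFromSqlToLake",
                "type": "Copy",  # data movement activity
                "inputs": [{"referenceName": "OnPremSqlTable",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "DataLakeFolder",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "SqlSource"},
                    "sink": {"type": "ParquetSink"},
                },
            },
        ]
    },
}
```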
Sources: IBM, GeeksforGeeks, Udacity