Source Credits: IBM documentation, IBM Skills Network
Understanding the Difference between Small vs Big Data
Big Data Lifecycle
Did you know?
The V's of Big Data:
1. Velocity: The speed at which data arrives.
2. Volume: The increase in the amount of data stored over time.
3. Variety: The diversity of data.
4. Veracity: The quality and certainty of data.
5. Value: The benefit data creates when it is collected, processed, and stored correctly.
Impact of Big Data: Day-to-day use cases
Linear vs Parallel Processing:
- In a typical analytics cycle, a computer stores data, moves it from storage into compute capacity (which includes memory), and writes the results back to storage once they are computed.
- With Big Data, you have more data than will fit on a single computer.
- Parallelism, or parallel processing, is best understood by comparing it to linear processing.
- Linear processing is the traditional method of computing, in which the problem statement is broken into a set of instructions that are executed sequentially until all of them complete successfully. If an error occurs in any one instruction, the entire sequence is re-executed from the beginning once the error has been resolved. This makes linear processing well suited to minor computing tasks, but inefficient and time-consuming for complex problems such as Big Data.
- In parallel processing, the problem statement is again broken down into a set of executable instructions, but the instructions are distributed to multiple execution nodes of equal processing power and executed in parallel. Because the instructions run on separate execution nodes, errors can be fixed and re-executed locally, independent of the other instructions. A minimal sketch contrasting the two approaches follows this list.
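To make the contrast concrete, here is a minimal sketch (not from the course material) that computes the same result linearly in one process and in parallel across worker processes using Python's multiprocessing module; the data size, chunking, and function names are illustrative assumptions.

```python
# Contrast linear and parallel processing on the same task:
# summing the squares of a large list of numbers.
from multiprocessing import Pool

def sum_of_squares(chunk):
    """One 'instruction set' that a single execution node would run."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Linear processing: one node executes the whole problem sequentially.
    linear_result = sum_of_squares(data)

    # Parallel processing: the problem is split into chunks that are
    # distributed to multiple worker processes of equal capability.
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(n_workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)

    parallel_result = sum(partial_sums)
    assert linear_result == parallel_result
```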
Advantages of Parallel Processing:
Data scaling: Data scaling is a technique to manage, store, and process the overflow of data.
You can always get a larger single-node computer, but when your data is growing exponentially it will eventually outgrow whatever capacity is available.
Increasing the capacity of a single node is called scaling up; a parallel system instead adds more nodes and spreads the data across them, which is called scaling out, as the sketch below illustrates.
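As a rough illustration of scaling out, the following toy sketch (not an HDFS or Spark API; the node count and record names are made up) hash-partitions records across a small set of hypothetical nodes so that no single node has to hold the whole data set.

```python
# Toy sketch of scaling out: records are hash-partitioned across nodes.
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size

def node_for(record_key, num_nodes=NUM_NODES):
    """Pick the node that will store a record, based on its key."""
    return hash(record_key) % num_nodes

records = [("user_%d" % i, i * 10) for i in range(12)]

partitions = defaultdict(list)
for key, value in records:
    partitions[node_for(key)].append((key, value))

for node_id, part in sorted(partitions.items()):
    print(f"node {node_id} holds {len(part)} records")
```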
Fault tolerance refers to the ability of a system to continue operating without interruption when one or more of its components fail.
This applies to Hadoop's primary data storage system, HDFS, and to similar storage systems (such as Amazon S3 and Azure Blob Storage).
Consider the first three partitions of a data set, labelled P1, P2, and P3, which reside on the first node.
In such a system, copies of each of these partitions are also stored on other nodes within the cluster.
If the first node ever goes down, you can add a new node to the cluster and recover the lost partitions by copying data from the other nodes where copies of P1, P2, and P3 are stored.
This is an extraordinarily complex maintenance process, but the Hadoop filesystem is a robust, time-tested framework that can be reliable to five nines (99.999%).
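The replication idea can be sketched with a toy simulation; this is not the actual HDFS implementation, and the node and partition names simply mirror the P1/P2/P3 example above.

```python
# Toy simulation of replication and recovery: partitions P1-P3 live on
# node1, with copies on other nodes, so a failed node can be rebuilt
# from the surviving replicas.
cluster = {
    "node1": {"P1", "P2", "P3"},
    "node2": {"P1", "P2"},   # replicas
    "node3": {"P3", "P1"},   # replicas
}

def recover(failed_node, replacement_node, cluster):
    """Rebuild the failed node's partitions by copying surviving replicas."""
    lost = cluster.pop(failed_node)
    recovered = set()
    for partition in lost:
        # Find any other node that still holds a copy of this partition.
        if any(partition in parts for parts in cluster.values()):
            recovered.add(partition)
    cluster[replacement_node] = recovered
    return recovered

print(recover("node1", "node4", cluster))  # all three partitions recovered
```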
Key data technologies include Hadoop, HDFS, MapReduce, Spark, Cloudera, and Databricks.
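For a sense of how MapReduce-style processing looks in Spark, here is a minimal PySpark word-count sketch; it assumes the pyspark package and a local Spark runtime are available, and the sample lines are invented.

```python
# Minimal PySpark word count: map each word to (word, 1),
# then reduce by key to count occurrences in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "big data needs parallel processing",
    "parallel processing needs many nodes",
])

counts = (lines.flatMap(lambda line: line.split())  # map: split lines into words
               .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

print(counts.collect())
spark.stop()
```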
There are six main categories of Big Data tools, namely:
- Data technologies
- Analytics and visualization
- Business Intelligence
- Cloud providers
- NoSQL databases
- Programming tools
Each tooling category plays a very specific and critical role in the Big Data life cycle.
Several major commercial and open source vendors provide tools and support for Big Data processing.
Companies rely heavily on Big Data to differentiate themselves from the competition, and
the retail, insurance, telecom, manufacturing, automotive, and finance industries all leverage Big Data in several ways to reduce cost, increase customer satisfaction, and make competitive business decisions.