Apache Spark, Spark SQL, PySpark, and Python
Apache Spark is a distributed cluster-computing engine. Spark spreads large volumes of data and large computations across a cluster of machines, and several APIs/languages can be used to express a computation on Spark.
Spark SQL is a Spark module for structured data processing that also acts as a distributed SQL query engine, letting you execute SQL queries on Spark. It gives Spark more information about the structure of both the data and the computation being performed, which Spark uses to apply extra optimizations, and it provides a programming abstraction called the DataFrame.
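As a minimal sketch of this workflow (assuming a local Spark installation and a hypothetical `people.json` input file with `name` and `age` fields):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load structured data into a DataFrame and register it as a temporary
# view so it can be queried with SQL.
df = spark.read.json("people.json")  # hypothetical input file
df.createOrReplaceTempView("people")

# Spark SQL executes the query as a distributed computation and returns
# the result as another DataFrame.
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```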
PySpark is the Python interface/API for Apache Spark, bringing Apache Spark and Python together. It allows you to write Spark applications using Python APIs, and provides the PySpark shell for interactively analyzing data in a distributed environment. Under the hood, PySpark executes the same optimized code as Spark SQL. PySpark also features quite a few libraries for writing efficient programs.
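A short sketch of the same kind of computation expressed through PySpark's DataFrame API rather than SQL (the data and column names here are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pyspark-example").getOrCreate()

# Build a DataFrame directly from Python data.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 17)],
    ["name", "age"],
)

# The DataFrame API is optimized by the same engine as Spark SQL,
# so this is equivalent to the SQL query above.
df.filter(col("age") >= 18).select("name").show()
```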
Python is an open source, high-level, general-purpose programming language that is interpreted, interactive, and object-oriented. It can be applied to many different classes of problems and combines considerable power with very clear syntax.
Python string formatting and variable substitution can be used to wrap Spark SQL statements. In the following example, `budget` is a variable that has been previously defined as the name of a database; Python's string literal formatting with "f-strings" lets you embed the variable's value within a SQL statement.
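A minimal sketch of this pattern (the database name `budget_db` is hypothetical, standing in for whatever `budget` was set to earlier):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fstring-sql").getOrCreate()

# `budget` was defined earlier as the name of a database; the f-string
# substitutes its value into the SQL text before Spark parses the query.
budget = "budget_db"  # hypothetical database name

spark.sql(f"USE {budget}")
tables = spark.sql(f"SHOW TABLES IN {budget}")
tables.show()
```

Note that f-string substitution builds the SQL text before Spark ever sees it, so it should only be used with trusted values such as names you defined yourself, not with arbitrary user input.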