Thursday, January 8, 2015

Spark glossary

Data structures

RDD is a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce. As a data structure, an RDD is a lazy, read-only, partitioned collection of records created through deterministic transformations (e.g., map, join and group-by) on other RDDs.
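The properties above — lazy, read-only, partitioned, built by deterministic transformations — can be illustrated with a toy sketch. This is not Spark's API; `ToyRDD` and its methods are invented here purely to make the data-structure idea concrete:

```python
# A toy sketch (NOT Spark code) of the RDD idea: a read-only, partitioned
# collection that records how it was derived and computes only on demand.

class ToyRDD:
    def __init__(self, partitions=None, parent=None, transform=None):
        self._partitions = partitions   # only set for source RDDs
        self.parent = parent            # the RDD this one was derived from
        self.transform = transform      # deterministic per-partition function

    @classmethod
    def from_list(cls, data, num_partitions=2):
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return cls(partitions=parts)

    def map(self, f):
        # Lazy: no data is touched here; we only record the transformation.
        return ToyRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def num_partitions(self):
        if self._partitions is not None:
            return len(self._partitions)
        return self.parent.num_partitions()

    def compute(self, i):
        if self._partitions is not None:
            return self._partitions[i]
        return self.transform(self.parent.compute(i))

    def collect(self):
        # An "action": forces evaluation of every partition.
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]

rdd = ToyRDD.from_list([1, 2, 3, 4]).map(lambda x: x * 10)
print(rdd.collect())  # [10, 20, 30, 40]
```

Note that calling `map` creates a new `ToyRDD` rather than modifying the old one — the read-only property — and nothing is computed until the `collect` action runs.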

The lineage of an RDD is the series of transformations used to build it. More broadly, the term also refers to the information needed to derive an RDD from other RDDs. Logging an RDD's lineage allows Spark to reconstruct lost partitions without requiring costly checkpointing and rollback.
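A toy illustration (again, not Spark internals) of why logging lineage is enough for recovery: because transformations are deterministic, a lost partition can always be recomputed by replaying the recorded transformations against the source data:

```python
# Partitions of a source dataset, and a recorded lineage of two
# deterministic transformations (a map followed by a filter).
source = [[1, 2], [3, 4], [5, 6]]
lineage = [lambda p: [x * 2 for x in p],        # map: double each record
           lambda p: [x for x in p if x > 4]]   # filter: keep values > 4

def compute(partition_index):
    """Replay the lineage to (re)derive one partition from its source."""
    part = source[partition_index]
    for transform in lineage:
        part = transform(part)
    return part

cache = {i: compute(i) for i in range(len(source))}
print(cache)        # {0: [], 1: [6, 8], 2: [10, 12]}

del cache[1]        # simulate losing one cached partition
cache[1] = compute(1)   # recover it from the lineage alone -- no checkpoint
print(cache[1])     # [6, 8]
```

Only the lost partition is recomputed; the others are untouched, which is what makes this cheaper than rolling the whole dataset back to a checkpoint.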

RDDs vs. DSM. In distributed shared memory (DSM) models, applications read and write to arbitrary locations in a global address space. DSM is a very general abstraction, but this generality makes it harder to implement efficiently and fault-tolerantly on commodity clusters. The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained bulk transformations, while DSM allows fine-grained reads and writes to each memory location. This restriction is what makes efficient fault tolerance possible: RDDs do not need to incur the overhead of checkpointing, because they can be recovered from their lineage.

The execution model

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program, called the driver program. A worker node is a node that can run application code in the cluster. Each worker node can host several executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. The number of executors a node can host is limited by the number of cores on that node.

A task is a unit of work that will be sent to one executor. A job is a parallel computation consisting of multiple tasks, spawned in response to a Spark action such as reduce. Each job is divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce).
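The job/stage/task relationship described above can be sketched with a toy scheduler. This is not how Spark's scheduler is implemented; it only models the shape: a job is split into stages that run in order, and each stage is a set of independent tasks run in parallel threads, like tasks on an executor:

```python
# Toy model of the execution hierarchy (NOT Spark internals):
# job -> ordered stages -> one task per partition, run in threads.
from concurrent.futures import ThreadPoolExecutor

def run_job(partitions, stages, threads_per_executor=2):
    for stage in stages:
        # Stages depend on each other, so they run sequentially; the
        # tasks within a stage are independent and run in parallel.
        with ThreadPoolExecutor(max_workers=threads_per_executor) as executor:
            partitions = list(executor.map(stage, partitions))
    return partitions

data = [[1, 2], [3, 4], [5, 6]]                 # one "partition" per task
stages = [lambda part: [x * x for x in part],   # stage 1: square (map-like)
          lambda part: [sum(part)]]             # stage 2: local sum (reduce-like)
print(run_job(data, stages))  # [[5], [25], [61]]
```

The thread pool here plays the role of one executor running its tasks in multiple threads; a real cluster would distribute the tasks of a stage across many executors on many worker nodes.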
