Data structures
RDD is a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce. As a data structure, an RDD is a lazy, read-only, partitioned collection of records created through deterministic transformations (e.g., map, join and group-by) on other RDDs.The lineage of an RDD is the series of transformations used to build it. More vaguely, it also refers to the information needed to derive an RDD from other RDDs. Logging lineages allows an RDD to reconstruct lost partitions without requiring costly checkpointing and rollback.
RDDs vs. DSM. In DSM models, applications read and write to arbitrary locations in a global address space. DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters. The main difference between RDDs and DSM is that RDDs can only be created (“written”) through bulk transformations, while DSM allows reads and writes to each memory location. This restriction of RDDs allows for more efficient fault tolerance. In particular, RDDs do not need to incur the overhead of checkpointing, as they can be recovered using lineages.
No comments:
Post a Comment