1. Create RDDs
2. RDD Persistence and Caching
What is RDD persistence? Why do we need to call cache() or persist() on an RDD? What is the difference between the cache() and persist() methods in Spark?
What is RDD Persistence and Caching
The difference between cache() and persist() is that cache() always uses the default storage level MEMORY_ONLY, while persist() lets us choose from several storage levels (described below).
Benefits of RDD persistence:
- Time efficient
- Cost efficient
- Reduced execution time
Storage levels of Persisted RDDs
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- DISK_ONLY
How to unpersist an RDD in Spark?
Use the RDD.unpersist() method.
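To make this concrete, here is a minimal sketch of caching, unpersisting, and then persisting an RDD with an explicit storage level (the app name and data are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")  # illustrative app name

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

rdd.cache()                                # equivalent to persist() with the default MEMORY_ONLY level
rdd.count()                                # the first action computes the RDD and stores it

rdd.unpersist()                            # release the cached blocks
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # persist() lets us choose one of the storage levels above
rdd.count()
```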
3. RDD Features
3.1 In-memory computation
3.2 Lazy Evaluation
3.3 Fault Tolerance
3.4 Immutability
3.5 Persistence
3.6 Partitioning
3.7 Parallel
RDDs process data in parallel over the cluster.
3.8 Location-Stickiness
3.9 Coarse-grained Operation
3.10 Typed
We can have RDDs of various types, such as RDD[Int], RDD[Long], and RDD[String].
3.11 No limitation
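A short sketch, with illustrative data, of several of these features together: partitioning, lazy evaluation, immutability, and parallel in-memory computation:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-features")  # illustrative app name

nums = sc.parallelize(range(10), 2)   # partitioning: data split into 2 logical partitions
doubled = nums.map(lambda x: x * 2)   # lazy evaluation: nothing is computed yet
                                      # immutability: nums is never modified, map returns a new RDD
print(nums.getNumPartitions())        # 2
print(doubled.collect())              # the action triggers the parallel, in-memory computation
```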
4. Paired RDD
The transformation operations on paired RDDs include:
- groupByKey
- reduceByKey
- join
- leftOuterJoin
- rightOuterJoin
Whereas action operations include countByKey, collectAsMap, and lookup.
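A hedged sketch of the join and leftOuterJoin transformations listed above, assuming an existing SparkContext `sc` and illustrative data:

```python
ages = sc.parallelize([("alice", 30), ("bob", 25)])
cities = sc.parallelize([("alice", "Paris")])

print(ages.join(cities).collect())           # e.g. [('alice', (30, 'Paris'))]  -- inner join on the key
print(ages.leftOuterJoin(cities).collect())  # e.g. [('alice', (30, 'Paris')), ('bob', (25, None))]
```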
4.1 Objective
A Spark paired RDD is designed so that each dataset is divided into logical partitions; further, each partition may be computed on a different node of the cluster.
4.2 Spark Paired RDD
from pyspark import SparkConf, SparkContext
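Building on that import, a minimal sketch of creating the SparkContext used in the following snippets (master URL and app name are illustrative):

```python
conf = SparkConf().setMaster("local[*]").setAppName("paired-rdd-demo")  # illustrative names
sc = SparkContext(conf=conf)
```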
4.3 Create Spark Paired RDD
a. In Python
pairs = lines.map(lambda x: (x.split(" ")[0], x))
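For context, a hedged end-to-end version of that line, building `lines` from illustrative data and keying each line by its first word:

```python
lines = sc.parallelize(["hello spark", "hello world", "goodbye spark"])
pairs = lines.map(lambda x: (x.split(" ")[0], x))
print(pairs.collect())
# e.g. [('hello', 'hello spark'), ('hello', 'hello world'), ('goodbye', 'goodbye spark')]
```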
4.4 Paired RDD Operations
No. | Operation | Description |
---|---|---|
. | Transformation Operations | |
. | map / flatMap / mapPartitions | … |
1. | groupByKey() | Groups all the values with the same key, e.g. rdd.groupByKey() |
2. | reduceByKey(func) | Combines values with the same key, e.g. rdd.reduceByKey(lambda x, y: x + y) |
3. | combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combines values with the same key, while allowing the result type to differ from the input value type. |
4. | mapValues(func) | Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning. |
5. | keys() | Return an RDD with the keys of each tuple. |
6. | values() | Return an RDD with the values of each tuple. |
7. | sortByKey(ascending=True, numPartitions=None, keyfunc=…) | Returns an RDD sorted by the key. |
. | Action Operations | |
8. | countByKey() | Counts the number of elements for each key. |
9. | collectAsMap() | Collects the result as a map to provide easy lookup. |
10. | lookup(key) | Returns all values associated with the provided key. |
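Of these, combineByKey is the most general; a hedged sketch (with illustrative data) computing a per-key average by combining values into (sum, count) pairs:

```python
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

sum_count = rdd.combineByKey(
    lambda v: (v, 1),                         # createCombiner: wrap the first value seen for a key
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue: fold another value into the combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners: merge combiners across partitions
)

averages = sum_count.mapValues(lambda s: s[0] / s[1])
print(averages.collect())   # e.g. [('a', 2.0), ('b', 2.0)]
```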
(1). reduceByKey(fun) & groupByKey
lines = sc.textFile("/Users/blair/ghome/github/spark3.0/pyspark/spark-src/word_count.text", 2)
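A hedged continuation of that snippet: a word count with reduceByKey, and the same result with groupByKey for comparison:

```python
words = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))

counts = words.reduceByKey(lambda x, y: x + y)               # combine values per key before the shuffle
grouped = words.groupByKey().mapValues(lambda vs: sum(vs))   # same result, but shuffles every value

print(counts.collect())
```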
(2). mapValues(fun)
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
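Continuing from that RDD, a minimal mapValues example; the keys stay untouched and only the values change:

```python
incremented = rdd.mapValues(lambda v: v + 1)
print(incremented.collect())   # e.g. [('a', 2), ('b', 2), ('a', 2)]
```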
(3). keys(), values()
m = sc.parallelize([(1, 2), (3, 4)]).keys()
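Collecting the keys, plus the matching values() call, for that example:

```python
print(m.collect())                                           # [1, 3]
print(sc.parallelize([(1, 2), (3, 4)]).values().collect())   # [2, 4]
```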
(4). sortByKey
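A minimal sortByKey sketch with illustrative data, assuming an existing SparkContext `sc`:

```python
tmp = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])
print(tmp.sortByKey().collect())                 # [('a', 1), ('b', 2), ('c', 3)]
print(tmp.sortByKey(ascending=False).collect())  # [('c', 3), ('b', 2), ('a', 1)]
```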