1. Create RDDs
2. RDD Persistence and Caching
What is RDD persistence? Why do we need to call cache() or persist() on an RDD? What is the difference between the cache() and persist() methods in Spark?
What is RDD Persistence and Caching
The difference between cache() and persist() is that cache() always uses the default storage level MEMORY_ONLY, while persist() lets us choose from several storage levels (described below).
Benefits of RDD persistence:
- Time efficient
- Cost efficient
- Reduced execution time
Storage levels of Persisted RDDs
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- DISK_ONLY
How to unpersist an RDD in Spark?
Use the RDD.unpersist() method.
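To make this concrete, here is a minimal sketch of caching, unpersisting, and then persisting an RDD with an explicit storage level (the app name and data are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persistence-demo")  # illustrative app name

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

rdd.cache()                                # equivalent to persist() with the default MEMORY_ONLY level
rdd.count()                                # the first action computes the RDD and stores it

rdd.unpersist()                            # release the cached blocks
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # persist() lets us choose one of the storage levels above
rdd.count()
```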
3. RDD Features
3.1 In-memory computation
3.2 Lazy Evaluation
3.3 Fault Tolerance
3.4 Immutability
3.5 Persistence
3.6 Partitioning
3.7 Parallel
RDDs process data in parallel over the cluster.
3.8 Location-Stickiness
3.9 Coarse-grained Operation
3.10 Typed
We can have RDDs of various types, such as RDD[Int], RDD[Long], and RDD[String].
3.11 No limitation
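A short sketch, with illustrative data, of several of these features together: partitioning, lazy evaluation, immutability, and parallel in-memory computation:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-features")  # illustrative app name

nums = sc.parallelize(range(10), 2)   # partitioning: data split into 2 logical partitions
doubled = nums.map(lambda x: x * 2)   # lazy evaluation: nothing is computed yet
                                      # immutability: nums is never modified, map returns a new RDD
print(nums.getNumPartitions())        # 2
print(doubled.collect())              # the action triggers the parallel, in-memory computation
```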
4. Paired RDD
The transformation operations on paired RDDs include:
- groupByKey
- reduceByKey
- join
- leftOuterJoin
- rightOuterJoin
Whereas action operations include countByKey, collectAsMap, and lookup.
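A hedged sketch of the join and leftOuterJoin transformations listed above, assuming an existing SparkContext `sc` and illustrative data:

```python
ages = sc.parallelize([("alice", 30), ("bob", 25)])
cities = sc.parallelize([("alice", "Paris")])

print(ages.join(cities).collect())           # e.g. [('alice', (30, 'Paris'))]  -- inner join on the key
print(ages.leftOuterJoin(cities).collect())  # e.g. [('alice', (30, 'Paris')), ('bob', (25, None))]
```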
4.1 Objective
A Spark paired RDD is designed so that each dataset is divided into logical partitions; further, each partition may be computed on a different node of the cluster.
4.2 Spark Paired RDD
from pyspark import SparkConf, SparkContext
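Building on that import, a minimal sketch of creating the SparkContext used in the following snippets (master URL and app name are illustrative):

```python
conf = SparkConf().setMaster("local[*]").setAppName("paired-rdd-demo")  # illustrative names
sc = SparkContext(conf=conf)
```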
4.3 Create Spark Paired RDD
a. In Python
pairs = lines.map(lambda x: (x.split(" ")[0], x))
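For context, a hedged end-to-end version of that line, building `lines` from illustrative data and keying each line by its first word:

```python
lines = sc.parallelize(["hello spark", "hello world", "goodbye spark"])
pairs = lines.map(lambda x: (x.split(" ")[0], x))
print(pairs.collect())
# e.g. [('hello', 'hello spark'), ('hello', 'hello world'), ('goodbye', 'goodbye spark')]
```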
4.4 Paired RDD Operations
No. | Operation | Description |
---|---|---|
. | Transformation Operations | |
. | map / flatMap / mapPartitions | … |
1. | groupByKey() | Groups all the values with the same key, e.g. rdd.groupByKey() |
2. | reduceByKey(func) | Combines values with the same key, e.g. rdd.reduceByKey(lambda x, y: x + y) |
3. | combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combines values with the same key, while allowing the result type to differ from the input value type. |
4. | mapValues(func) | Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD’s partitioning. |
5. | keys() | Return an RDD with the keys of each tuple. |
6. | values() | Return an RDD with the values of each tuple. |
7. | sortByKey(ascending=True, numPartitions=None, keyfunc=…) | Returns an RDD sorted by the key. |
. | Action Operations | |
8. | countByKey() | Counts the number of elements for each key. |
9. | collectAsMap() | Collects the result as a map to provide easy lookup. |
10. | lookup(key) | Returns all values associated with the provided key. |
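Of these, combineByKey is the most general; a hedged sketch (with illustrative data) computing a per-key average by combining values into (sum, count) pairs:

```python
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

sum_count = rdd.combineByKey(
    lambda v: (v, 1),                         # createCombiner: wrap the first value seen for a key
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue: fold another value into the combiner
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners: merge combiners across partitions
)

averages = sum_count.mapValues(lambda s: s[0] / s[1])
print(averages.collect())   # e.g. [('a', 2.0), ('b', 2.0)]
```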
(1). reduceByKey(fun) & groupByKey
lines = sc.textFile("/Users/blair/ghome/github/spark3.0/pyspark/spark-src/word_count.text", 2)
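A hedged continuation of that snippet: a word count with reduceByKey, and the same result with groupByKey for comparison:

```python
words = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))

counts = words.reduceByKey(lambda x, y: x + y)               # combine values per key before the shuffle
grouped = words.groupByKey().mapValues(lambda vs: sum(vs))   # same result, but shuffles every value

print(counts.collect())
```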
(2). mapValues(fun)
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
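Continuing from that RDD, a minimal mapValues example; the keys stay untouched and only the values change:

```python
incremented = rdd.mapValues(lambda v: v + 1)
print(incremented.collect())   # e.g. [('a', 2), ('b', 2), ('a', 2)]
```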
(3). keys(), values()
m = sc.parallelize([(1, 2), (3, 4)]).keys()
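Collecting the keys, plus the matching values() call, for that example:

```python
print(m.collect())                                           # [1, 3]
print(sc.parallelize([(1, 2), (3, 4)]).values().collect())   # [2, 4]
```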
(4). sortByKey
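A minimal sortByKey sketch with illustrative data, assuming an existing SparkContext `sc`:

```python
tmp = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])
print(tmp.sortByKey().collect())                 # [('a', 1), ('b', 2), ('c', 3)]
print(tmp.sortByKey(ascending=False).collect())  # [('c', 3), ('b', 2), ('a', 1)]
```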