Creating RDDs: parallelize | textFile | HDFS
An example:
from pyspark import SparkConf, SparkContext

# Configure the application name and run Spark locally using all available cores
config = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=config)

# Create an RDD from an in-memory Python collection
rdd = sc.parallelize(["Hello World", "Hello Spark"])
# Split each line into words and flatten the results
rdd2 = rdd.flatMap(lambda x: x.split(" "))
# Pair each word with a count of 1
rdd3 = rdd2.map(lambda x: (x, 1))
# Sum the counts for each word
rdd4 = rdd3.reduceByKey(lambda x, y: x + y)
# Collect the results to the driver and print them
print(rdd4.collect())
sc.stop()
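The heading also lists textFile and HDFS as ways to create an RDD. A minimal sketch of reading from a file with the same SparkContext (before calling sc.stop()); the local path and the HDFS host, port, and path below are placeholders to adjust to your environment:

# Create an RDD from a local text file, one element per line; "data.txt" is a placeholder path
rdd_local = sc.textFile("data.txt")

# The same API reads from HDFS; the namenode address and path are hypothetical
rdd_hdfs = sc.textFile("hdfs://namenode:8020/user/spark/data.txt")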