Spark: how many records, users, items, item categories, and distinct dates are in the dataset? Implement it with Spark SQL, the RDD API, and the DataFrame API.
Date: 2024-02-24 16:58:41
Sure. Below is Scala Spark code that counts the records, users, items, item categories, and dates in the dataset three ways: with Spark SQL, with the RDD API, and with the DataFrame API:
```scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Data Analysis").getOrCreate()
// Read the dataset (assumed to be a headerless CSV with columns user_id,item_id,category_id,date)
val dataRDD = spark.sparkContext.textFile("path/to/dataset.csv")
val dataDF = spark.read.csv("path/to/dataset.csv").toDF("user_id", "item_id", "category_id", "date")
// Count with the RDD API (split each line once and cache, rather than re-splitting per field)
val fieldsRDD = dataRDD.map(_.split(",")).cache()
val numRecordsRDD = dataRDD.count()
val numUsersRDD = fieldsRDD.map(_(0)).distinct().count()
val numItemsRDD = fieldsRDD.map(_(1)).distinct().count()
val numCategoriesRDD = fieldsRDD.map(_(2)).distinct().count()
val numDatesRDD = fieldsRDD.map(_(3)).distinct().count()
// Count with the DataFrame API
val numRecordsDF = dataDF.count()
val numUsersDF = dataDF.select("user_id").distinct().count()
val numItemsDF = dataDF.select("item_id").distinct().count()
val numCategoriesDF = dataDF.select("category_id").distinct().count()
val numDatesDF = dataDF.select("date").distinct().count()
// Count with Spark SQL via a temporary view
dataDF.createOrReplaceTempView("data")
val numRecordsSQL = spark.sql("SELECT COUNT(*) FROM data").first().getLong(0)
val numUsersSQL = spark.sql("SELECT COUNT(DISTINCT user_id) FROM data").first().getLong(0)
val numItemsSQL = spark.sql("SELECT COUNT(DISTINCT item_id) FROM data").first().getLong(0)
val numCategoriesSQL = spark.sql("SELECT COUNT(DISTINCT category_id) FROM data").first().getLong(0)
val numDatesSQL = spark.sql("SELECT COUNT(DISTINCT date) FROM data").first().getLong(0)
// Print the results
println(s"Number of records (RDD): $numRecordsRDD")
println(s"Number of users (RDD): $numUsersRDD")
println(s"Number of items (RDD): $numItemsRDD")
println(s"Number of categories (RDD): $numCategoriesRDD")
println(s"Number of dates (RDD): $numDatesRDD")
println(s"Number of records (DataFrame): $numRecordsDF")
println(s"Number of users (DataFrame): $numUsersDF")
println(s"Number of items (DataFrame): $numItemsDF")
println(s"Number of categories (DataFrame): $numCategoriesDF")
println(s"Number of dates (DataFrame): $numDatesDF")
println(s"Number of records (Spark SQL): $numRecordsSQL")
println(s"Number of users (Spark SQL): $numUsersSQL")
println(s"Number of items (Spark SQL): $numItemsSQL")
println(s"Number of categories (Spark SQL): $numCategoriesSQL")
println(s"Number of dates (Spark SQL): $numDatesSQL")
```
Replace "path/to/dataset.csv" with the actual path to your dataset. The code first reads the dataset with a SparkSession, exposing it as an RDD, as a DataFrame, and as a Spark SQL temporary view. It then computes each count with the RDD API, the DataFrame API, and Spark SQL, and finally prints the results; all three approaches should agree.
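The code assumes a headerless, four-column CSV. As a purely hypothetical illustration (these rows are made up, not taken from the original dataset), the input file might look like:

```csv
10001,20001,3001,2017-11-25
10001,20002,3002,2017-11-26
10002,20001,3001,2017-11-25
```

If your file does have a header row, the RDD counts above would include it as a record; in that case skip the first line on the RDD side, and read the DataFrame with `.option("header", "true")` instead of calling `toDF` with explicit column names.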