PySpark RDD Programming: Midterm Exam Solutions Explained

Updated 2024-06-25 · 455 KB · DOCX

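This resource is a lab report (or course project) on data processing with PySpark RDDs. It covers counting the students and the courses in a department, computing one student's average score, counting the courses each student takes, counting the enrollment of a given course, computing the average score of every course, and using an accumulator to count the students enrolled in a given course. It also includes a standalone Spark program that merges two text files and removes duplicate lines, and a program that computes the average score across several subjects.

The snippets assume that `sc` is an active SparkContext and that `data` is an RDD of text lines of the form `name,course,score`, as the `split(",")` indexing in the code implies.

1) Count the students in the department. Code:

```python
# Extract the student name (first field) from each line
students = data.map(lambda line: line.split(",")[0])
# Deduplicate, then count
total_students = students.distinct().count()
print("Total students in the department:", total_students)
```

Process and result: each line of the data file is split on commas, the student name is extracted, distinct() removes duplicates, and count() returns the number of unique students.

2) Count the courses offered by the department. Code:

```python
# Extract the course name (second field) from each line
courses = data.map(lambda line: line.split(",")[1])
# Deduplicate, then count
total_courses = courses.distinct().count()
print("Total courses in the department:", total_courses)
```

Process and result: the approach mirrors the student count; the course name is extracted, deduplicated, and counted.

3) Compute Tom's average score. Code:

```python
# Keep only Tom's rows (exact match on the name field) and extract the score
tom_grades = (data.map(lambda line: line.split(","))
                  .filter(lambda fields: fields[0] == "Tom")
                  .map(lambda fields: float(fields[2])))
# sum() returns 0 on an empty RDD, whereas reduce() would raise an error
average_tom_grade = tom_grades.sum() / max(tom_grades.count(), 1)
print("Tom's average score:", average_tom_grade)
```

Process and result: the rows whose name field equals "Tom" are selected, the scores are converted to floats, and their sum is divided by their count. Matching on the name field rather than on the whole line keeps names that merely contain "Tom" out of the result.

4) Count the courses taken by each student. Code:

```python
# Build (student, course) pairs and count the pairs per student
student_courses = data.map(lambda line: (line.split(",")[0], line.split(",")[1]))
course_counts = student_courses.countByKey()
for student, count in course_counts.items():
    print(f"{student} has taken {count} courses.")
```

Process and result: the data is converted to (student, course) key-value pairs, and countByKey() returns the number of pairs for each student as a dictionary on the driver.

5) Count the students who take the DataBase course. Code:

```python
# Keep only the rows whose course field is DataBase
database_students = data.filter(lambda line: line.split(",")[1] == "DataBase")
# Count the rows
database_students_count = database_students.count()
print("Number of students who took Database:", database_students_count)
```

Process and result: the rows for the DataBase course are selected and counted. The row count equals the number of students provided each student appears at most once per course.

6) Compute the average score of every course. Code:

```python
# Build (course, score) pairs
course_scores = data.map(lambda line: (line.split(",")[1], float(line.split(",")[2])))
# Carry a (sum, count) pair per course, then divide to get the mean
averages = (course_scores.mapValues(lambda score: (score, 1))
                         .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                         .mapValues(lambda total: total[0] / total[1]))
for course, average in averages.collect():
    print(f"Average score for {course}: {average}")
```

Process and result: each score is paired with a count of 1, reduceByKey() adds up the scores and the counts per course, and the sum is divided by the count to give the course average.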
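The logic of tasks 1 to 6 can be checked without a Spark cluster by mirroring it in plain Python. The following is a minimal sketch under stated assumptions: the sample records are made up (they are not the exam data), the `name,course,score` layout is inferred from the snippets, and the helper names `parse`, `student_average`, and `course_averages` are hypothetical.

```python
from collections import defaultdict

# Made-up sample in the assumed "name,course,score" layout (not the real exam data).
SAMPLE_LINES = [
    "Tom,DataBase,80",
    "Tom,Algorithm,50",
    "Tom,DataStructure,60",
    "Jim,DataBase,90",
    "Jim,Algorithm,60",
    "Tommy,DataBase,70",
]


def parse(line):
    """Split one record into (name, course, score)."""
    name, course, score = line.split(",")
    return name, course, float(score)


def student_average(lines, student):
    """Average score of one student, matched exactly on the name field."""
    scores = [score for name, _, score in map(parse, lines) if name == student]
    return sum(scores) / len(scores) if scores else None


def course_averages(lines):
    """Fold a (sum, count) pair per course, then divide: the reduceByKey pattern."""
    totals = defaultdict(lambda: (0.0, 0))
    for _, course, score in map(parse, lines):
        running_sum, running_count = totals[course]
        totals[course] = (running_sum + score, running_count + 1)
    return {course: s / n for course, (s, n) in totals.items()}


records = [parse(line) for line in SAMPLE_LINES]
total_students = len({name for name, _, _ in records})      # task 1
total_courses = len({course for _, course, _ in records})   # task 2
database_enrollment = sum(1 for _, course, _ in records if course == "DataBase")  # task 5

print(total_students, total_courses, database_enrollment)
print(student_average(SAMPLE_LINES, "Tom"))
print(course_averages(SAMPLE_LINES))
```

On this sample, a substring test such as `'Tom' in line` would also pick up Tommy's row and report 65.0 for Tom instead of 190/3, which is why the sketch compares the name field exactly.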
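7) Use an accumulator to count the students who take the DataBase course. Code:

```python
from pyspark import AccumulatorParam

class Counter(AccumulatorParam):
    def zero(self, value):
        return 0

    def addInPlace(self, val1, val2):
        return val1 + val2

database_counter = sc.accumulator(0, Counter())
# Add 1 for every row whose course field is DataBase, 0 otherwise
data.foreach(lambda line: database_counter.add(1 if line.split(",")[1] == "DataBase" else 0))
print("Number of students who took Database:", database_counter.value)
```

Process and result: a custom accumulator, Counter, is created; the data is traversed with foreach(), 1 is added for each DataBase row, and the accumulator value is printed on the driver. For a plain integer count, `sc.accumulator(0)` without a custom AccumulatorParam also works.

8) Write a standalone application that removes duplicate data. Code:

```python
from pyspark import SparkContext

# A standalone application creates its own SparkContext
sc = SparkContext(appName="MergeAndDeduplicate")
# Merge the two files (union() takes a list of RDDs), then drop duplicate lines
merged_data = sc.union([sc.textFile("A.txt"), sc.textFile("B.txt")])
unique_data = merged_data.distinct()
# saveAsTextFile() writes a directory of part files named C.txt, not a single file
unique_data.saveAsTextFile("C.txt")
sc.stop()
```

Process and result: union() merges the two files, distinct() removes the duplicates, and the result is saved under the output path C.txt.

9) Compute the average score. Code:

```python
# Score files for the individual subjects
files = ["Algorithm.txt", "Database.txt", "Python.txt"]
total_scores = 0.0
num_students = 0
for path in files:
    # Assumes each line holds a single numeric score; cached because the RDD is used twice
    scores = sc.textFile(path).map(lambda line: float(line.strip())).cache()
    total_scores += scores.sum()
    num_students += scores.count()
# Guard against empty input
average_scores = total_scores / num_students if num_students else 0.0
print("Average score of all students:", average_scores)
```

Process and result: each file is read, the scores are summed and the score entries are counted across all files, and the total is divided by the count to give the overall average.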
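The merge-and-deduplicate step and the overall average can likewise be mirrored in plain Python to sanity-check expected results before running the Spark job. This is only a sketch under assumptions: the sample lines and scores are invented, the helper names `merge_distinct` and `overall_average` are hypothetical, and unlike distinct() in Spark, which gives no ordering guarantee, this version keeps first-seen order.

```python
def merge_distinct(*sources):
    """Concatenate several lists of lines and drop duplicates (union + distinct)."""
    seen = set()
    merged = []
    for source in sources:
        for line in source:
            if line not in seen:
                seen.add(line)
                merged.append(line)
    return merged


def overall_average(scores_by_file):
    """Total of all scores divided by the number of score entries, or None if empty."""
    total = sum(sum(scores) for scores in scores_by_file.values())
    count = sum(len(scores) for scores in scores_by_file.values())
    return total / count if count else None


# Invented stand-ins for the contents of A.txt and B.txt.
file_a = ["20170101 x", "20170102 y", "20170103 x"]
file_b = ["20170101 y", "20170102 y", "20170103 x"]
merged = merge_distinct(file_a, file_b)
print(merged)

# Invented scores, one list per subject file.
sample_scores = {
    "Algorithm.txt": [88.0, 92.0],
    "Database.txt": [95.0],
    "Python.txt": [82.0, 83.0],
}
print(overall_average(sample_scores))
```

Dividing the grand total by the grand count weights every score entry equally, so a subject with more entries contributes more to the result than it would in an average of per-subject averages.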
