"Apache Spark实现大规模数据处理：一个60TB生产案例分析"


"Apache Spark at Scale: A 60 TB production use case", part of the "藏经阁" series, is a presentation by Sital Kedia of Facebook on using Apache Spark for entity ranking in a large-scale production environment. It covers the previous Hive implementation, the transition to Spark, a performance comparison, reliability and performance improvements, and configuration tuning.

Entity ranking serves real-time queries that rank entities such as users, places, and pages. Raw features are generated offline using Hive and then loaded onto the system for real-time query processing.

The document outlines the challenges of the previous Hive implementation, such as the INSERT OVERWRITE T command, and how Spark was adopted to address them. It includes a detailed performance comparison of the two implementations, describes the reliability and performance improvements made for the Spark implementation, and explains the configuration tuning done for this workload.

Overall, it is a practical reference for organizations considering a move from a Hive-based pipeline to Apache Spark for similar large-scale workloads.

Spark implementation

SELECT TRANSFORM (shard_id, ...)
USING 'indexer'
AS shard_id, status
FROM (
    SELECT entity_id % SHARDS AS shard_id, entity_id, target_id, AGG(...)
    FROM input_table
    WHERE ...
    GROUP BY shard_id, entity_id, feature_id, target_id
    CLUSTER BY shard_id
) AS T

Data flow: input table → indexed HDFS files

• Single job with 2 stages

• Shuffles 90 TB+ compressed intermediate data
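
The two stages of the query on this slide can be illustrated with a minimal pure-Python sketch. This is not the presenter's code and does not use Spark: the shard count, the sample rows, the use of summation in place of the unspecified AGG(...), and all function names (`aggregate_features`, `cluster_by_shard`, `run_indexer`) are assumptions made for illustration. The group key follows the GROUP BY clause of the query.

```python
from collections import defaultdict

SHARDS = 4  # hypothetical shard count; the slide does not give the value


def aggregate_features(rows):
    """Stage 1: assign each row a shard and aggregate per group key.

    `rows` is an iterable of (entity_id, feature_id, target_id, value) tuples.
    Summation stands in for the unspecified AGG(...) on the slide.
    """
    totals = defaultdict(float)
    for entity_id, feature_id, target_id, value in rows:
        shard_id = entity_id % SHARDS
        totals[(shard_id, entity_id, feature_id, target_id)] += value
    return totals


def cluster_by_shard(totals):
    """Stage 2: bring all aggregated rows of one shard together, sorted."""
    shards = defaultdict(list)
    for (shard_id, entity_id, feature_id, target_id), total in totals.items():
        shards[shard_id].append((entity_id, feature_id, target_id, total))
    return {shard_id: sorted(records) for shard_id, records in shards.items()}


def run_indexer(shards):
    """Stand-in for the external 'indexer' script: emits (shard_id, status)."""
    return [(shard_id, f"indexed {len(records)} rows")
            for shard_id, records in sorted(shards.items())]


# Made-up sample rows: (entity_id, feature_id, target_id, value)
rows = [
    (1, 10, 100, 0.5),
    (1, 10, 100, 1.5),
    (5, 10, 200, 2.0),
    (2, 11, 300, 1.0),
]
result = run_indexer(cluster_by_shard(aggregate_features(rows)))
```

In the real job, the regrouping between the two stages is the shuffle that moves the 90 TB+ of compressed intermediate data; here it is just an in-memory dictionary regrouping.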
