How to log in to Hive as the Hue user via PySpark
Date: 2023-11-16 17:03:48
You can use the `config` method on the `pyspark.sql.SparkSession` builder to set the Hive Metastore URI and the Hadoop user. The steps are as follows:
1. Import the `SparkSession` class:
```python
from pyspark.sql import SparkSession
```
2. Create a `SparkSession` object:
```python
spark = SparkSession.builder \
    .appName("HiveTest") \
    .config("hive.metastore.uris", "thrift://<hive_metastore_host>:9083") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://<hdfs_namenode_host>:8020") \
    .config("spark.hadoop.fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem") \
    .config("spark.hadoop.hadoop.security.authentication", "kerberos") \
    .config("spark.hadoop.hadoop.security.authorization", "true") \
    .config("spark.hadoop.hadoop.security.auth_to_local", "RULE:[2:$1@$0](.*@<REALM>)s/@.*//\nDEFAULT") \
    .config("spark.yarn.access.hadoopFileSystems", "hdfs://<hdfs_namenode_host>:8020") \
    .config("spark.hadoop.yarn.resourcemanager.hostname", "<yarn_resourcemanager_host>") \
    .config("spark.hadoop.yarn.resourcemanager.address", "<yarn_resourcemanager_host>:8032") \
    .config("spark.hadoop.yarn.resourcemanager.scheduler.address", "<yarn_resourcemanager_host>:8030") \
    .config("spark.hadoop.yarn.resourcemanager.resource-tracker.address", "<yarn_resourcemanager_host>:8031") \
    .config("spark.hadoop.yarn.resourcemanager.admin.address", "<yarn_resourcemanager_host>:8033") \
    .config("spark.hadoop.yarn.resourcemanager.webapp.address", "<yarn_resourcemanager_host>:8088") \
    .config("spark.hadoop.yarn.nodemanager.address", "<yarn_nodemanager_host>:8042") \
    .config("spark.hadoop.yarn.nodemanager.local-dirs", "/mnt/yarn/nm") \
    .config("spark.hadoop.yarn.nodemanager.log-dirs", "/mnt/yarn/logs") \
    .config("spark.hadoop.yarn.nodemanager.remote-app-log-dir", "/app-logs") \
    .config("spark.hadoop.yarn.nodemanager.remote-app-log-dir-suffix", "/logs") \
    .config("spark.hadoop.yarn.timeline-service.enabled", "false") \
    .config("spark.hadoop.yarn.timeline-service.hostname", "<timeline_service_host>") \
    .config("spark.hadoop.yarn.timeline-service.address", "<timeline_service_host>:10200") \
    .config("spark.hadoop.yarn.timeline-service.webapp.address", "<timeline_service_host>:8188") \
    .config("spark.hadoop.yarn.timeline-service.store-class", "org.apache.hadoop.yarn.server.timeline.MemoryTimelineStore") \
    .config("spark.hadoop.yarn.timeline-service.ttl-enable", "true") \
    .config("spark.hadoop.yarn.timeline-service.ttl-ms", "120000") \
    .config("spark.hadoop.yarn.timeline-service.ttl-interval-ms", "60000") \
    .config("spark.hadoop.yarn.timeline-service.ttl-check-interval-ms", "60000") \
    .config("spark.hadoop.yarn.timeline-service.entity-group-fs-store.active-dir", "/yarn/timeline") \
    .config("spark.hadoop.yarn.timeline-service.entity-group-fs-store.done-dir", "/yarn/timeline/done") \
    .config("spark.hadoop.yarn.timeline-service.generic-application-history.store-class", "org.apache.hadoop.yarn.server.applicationhistoryservice.NullApplicationHistoryStore") \
    .config("spark.hadoop.yarn.timeline-service.version", "1.0") \
    .enableHiveSupport() \
    .getOrCreate()
```
Here, `hive.metastore.uris` gives the address of the Hive Metastore, `spark.hadoop.fs.defaultFS` sets Hadoop's default file system (HDFS), and `spark.hadoop.hadoop.security.authentication` selects Kerberos authentication. The `auth_to_local` value is a newline-separated list of rules that map Kerberos principals to local user names, falling back to `DEFAULT`. Most of the YARN and timeline-service settings above are cluster-specific and can usually be left to the cluster's own configuration files.
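To make the `auth_to_local` rule above more concrete: `RULE:[2:$1@$0](.*@<REALM>)s/@.*//` takes a two-component principal such as `hue/gateway01@REALM`, formats it as `hue@REALM`, and strips the realm suffix, yielding the short name `hue`. A rough plain-Python illustration of this behavior (the function name and the sample realm `EXAMPLE.COM` are hypothetical; Hadoop's actual rule engine is more elaborate):

```python
import re

def principal_to_short_name(principal, realm="EXAMPLE.COM"):
    """Approximate the auth_to_local rule RULE:[2:$1@$0](.*@REALM)s/@.*//
    for a principal like 'user/host@REALM'. Illustration only."""
    # Parse "user", optional "/host", and "@REALM" components.
    m = re.match(r"([^/@]+)(?:/[^@]+)?@(.+)$", principal)
    if not m:
        return principal  # not a Kerberos principal; leave unchanged
    # [2:$1@$0] formats the principal as "user@REALM"
    formatted = f"{m.group(1)}@{m.group(2)}"
    # (.*@REALM) filters to the local realm; s/@.*// strips the realm suffix
    if re.fullmatch(rf".*@{re.escape(realm)}", formatted):
        return re.sub(r"@.*", "", formatted)
    return principal  # foreign realm: rule does not match

print(principal_to_short_name("hue/gateway01@EXAMPLE.COM"))  # hue
```

This is why the rule string in the builder must separate `RULE:...` and `DEFAULT` with a newline: they are two distinct entries in the rule list.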
3. Set the Hadoop user (this should happen before the `SparkSession` is created):
```python
import os
os.environ['HADOOP_USER_NAME'] = '<hue_user>'
```
This sets the `HADOOP_USER_NAME` environment variable to the Hue user. Note two caveats: the variable must be set before the `SparkSession` (and its underlying JVM) starts, and it is only honored under simple authentication; with Kerberos enabled, the effective user is determined by the Kerberos ticket (e.g. obtained via `kinit` or a keytab), not by this variable.
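One way to keep the ordering right is to group the environment setup and the connection settings in one place before touching the builder. A minimal sketch (the helper name `build_spark_conf` and the host names are hypothetical, not from the original):

```python
import os

def build_spark_conf(metastore_uri, hdfs_uri, hadoop_user):
    """Hypothetical helper: set HADOOP_USER_NAME *before* the JVM starts,
    then return the core Spark configs as a dict."""
    # Must run before SparkSession.builder.getOrCreate() launches the JVM;
    # only honored with simple (non-Kerberos) authentication.
    os.environ["HADOOP_USER_NAME"] = hadoop_user
    return {
        "hive.metastore.uris": metastore_uri,
        "spark.hadoop.fs.defaultFS": hdfs_uri,
    }

conf = build_spark_conf("thrift://metastore:9083", "hdfs://namenode:8020", "hue")

# The dict can then be applied to the builder in a loop:
#   builder = SparkSession.builder.appName("HiveTest")
#   for k, v in conf.items():
#       builder = builder.config(k, v)
#   spark = builder.enableHiveSupport().getOrCreate()
print(os.environ["HADOOP_USER_NAME"])  # hue
```

Collecting the settings in a dict also makes it easy to swap between environments (dev vs. production) without editing the builder chain itself.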
4. Use the `SparkSession` object to run Hive operations:
```python
df = spark.sql("SELECT * FROM <hive_db>.<hive_table>")
df.show()
```
Here, `spark.sql` executes the Hive query and returns the result as a DataFrame, which `show` then prints.
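Because `spark.sql` takes a raw SQL string, interpolating the database and table names directly (as in the `<hive_db>.<hive_table>` placeholders above) deserves a guard when the names come from user input: table identifiers cannot be bound as query parameters. A hypothetical validation helper (`qualified_table` is not part of PySpark; it is a sketch):

```python
import re

def qualified_table(db, table):
    """Hypothetical guard: validate identifiers before interpolating them
    into a spark.sql() string, where parameter binding is unavailable."""
    ident = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
    for name in (db, table):
        if not ident.match(name):
            raise ValueError(f"invalid identifier: {name!r}")
    return f"{db}.{table}"

query = f"SELECT * FROM {qualified_table('sales_db', 'orders')}"
print(query)  # SELECT * FROM sales_db.orders
# The query string would then be passed to spark.sql(query) as in step 4.
```

Rejecting anything outside `[A-Za-z0-9_]` keeps stray quotes, semicolons, or comment markers out of the generated SQL.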