pandas将数据写入Hive的方法
时间: 2024-01-23 13:03:14 浏览: 313
可以使用PyHive库将pandas数据写入Hive。以下是一个示例代码:
```python
from pyhive import hive
import pandas as pd
# 创建连接
conn = hive.Connection(host='localhost', port=10000, username='hiveuser')
# 创建表
query = """
CREATE TABLE mytable (
col1 STRING,
col2 INT
)
"""
with conn.cursor() as cursor:
cursor.execute(query)
# 将pandas数据写入表
df = pd.DataFrame({'col1': ['foo', 'bar'], 'col2': [1, 2]})
with conn.cursor() as cursor:
cursor.execute("USE mydatabase")
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
cursor.execute("SET hive.exec.max.dynamic.partitions=10000")
cursor.execute("SET hive.exec.max.dynamic.partitions.pernode=10000")
cursor.execute("SET hive.enforce.bucketing=true")
cursor.execute("SET hive.mapred.mode=nonstrict")
cursor.execute("SET hive.optimize.index.filter=true")
cursor.execute("SET hive.optimize.ppd=true")
cursor.execute("SET hive.vectorized.execution.enabled=true")
cursor.execute("SET hive.vectorized.execution.reduce.enabled=true")
cursor.execute("SET hive.vectorized.execution.reduce.groupby.enabled=true")
cursor.execute("SET hive.vectorized.execution.reduce.groupby.fixed.ordered=false")
cursor.execute("SET hive.vectorized.execution.reduce.groupby.variable.estimated=false")
cursor.execute("SET hive.vectorized.execution.reduce.groupby.variable.exact=false")
cursor.execute("SET hive.vectorized.execution.reduce.groupby.variable.force=false")
cursor.execute("SET hive.vectorized.execution.reduce.groupby.variable.width=32768")
cursor.execute("SET hive.vectorized.execution.row.filter.enabled=true")
cursor.execute("SET hive.vectorized.execution.row.filter.pushdown=true")
cursor.execute("SET hive.vectorized.groupby.checkinterval=4096")
cursor.execute("SET hive.cbo.enable=true")
cursor.execute("SET hive.stats.fetch.column.stats=true")
cursor.execute("SET hive.stats.fetch.partition.stats=true")
cursor.execute("SET hive.compute.query.using.stats=true")
cursor.execute("SET hive.stats.join.factor=1.0")
cursor.execute("SET hive.stats.key.prefix=stats_")
cursor.execute("SET hive.stats.ndv.error=0.05")
cursor.execute("SET hive.stats.reliable=true")
cursor.execute("SET hive.stats.autogather=true")
cursor.execute("SET hive.stats.autogather.interval=10000")
cursor.execute("SET hive.stats.autogather.maxsize=10000")
df.to_sql(name='mytable', con=conn, if_exists='append', index=False)
# 关闭连接
conn.close()
```
请注意,代码中的连接信息需要根据您的环境进行修改。另外,如果需要写入分区表,可以在`to_sql`方法中使用`partition_by`参数指定分区列。
阅读全文