In what ways can a DataFrame and a Dataset each be created? Give code examples.
Posted: 2023-04-28 14:06:13
Both DataFrame and Dataset can be created in the following ways:
1. By reading data from a local file
2. By reading data from a database
3. By converting from an RDD
Examples:
1. Creating a DataFrame by reading a local file (PySpark):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_dataframe").getOrCreate()
# Read a CSV file, treating the first row as the header and inferring column types
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
```
2. Creating a Dataset by reading a local file. The typed Dataset API exists only in Scala and Java (PySpark offers only the untyped DataFrame), so this example uses Scala:
```scala
import org.apache.spark.sql.SparkSession

// Case class describing one row of the CSV file
case class Person(name: String, age: Int, gender: String)

val spark = SparkSession.builder.appName("create_dataset").getOrCreate()
import spark.implicits._  // provides the encoder needed by .as[Person]

val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/file.csv")
  .as[Person]
```
3. Creating a DataFrame from an RDD (PySpark):
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("create_dataframe_from_rdd").getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "John"), (2, "Jane"), (3, "Bob")])

# Explicit schema: column names and types for the tuples in the RDD
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])
df = spark.createDataFrame(rdd, schema)
```
4. Creating a Dataset from an RDD (again in Scala, since the Dataset API is not available in PySpark):
```scala
import org.apache.spark.sql.SparkSession

case class Person(id: Int, name: String)

val spark = SparkSession.builder.appName("create_dataset_from_rdd").getOrCreate()
import spark.implicits._  // enables toDF and .as[Person]

val rdd = spark.sparkContext.parallelize(Seq((1, "John"), (2, "Jane"), (3, "Bob")))
// Name the columns to match the Person fields, then convert to a typed Dataset
val ds = rdd.toDF("id", "name").as[Person]
```