Writing a Spark data generator for the fields title: String, author: String, price: Double, sales: Int, rating: Double, publishTime: String
Posted: 2024-03-25 17:38:24 · Views: 61
Here is a simple Spark data generator that produces records with the fields `title`, `author`, `price`, `sales`, `rating`, and `publishTime`.
```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
import scala.util.Random

object DataGenerator {
  def generateData(spark: SparkSession, numRecords: Int): DataFrame = {
    // Candidate values to sample from
    val titles = Seq("The Great Gatsby", "To Kill a Mockingbird", "1984", "Pride and Prejudice", "The Catcher in the Rye", "The Hobbit", "The Lord of the Rings", "Animal Farm", "The Diary of a Young Girl", "The Hitchhiker's Guide to the Galaxy")
    val authors = Seq("F. Scott Fitzgerald", "Harper Lee", "George Orwell", "Jane Austen", "J.D. Salinger", "J.R.R. Tolkien", "George Orwell", "J.R.R. Tolkien", "Anne Frank", "Douglas Adams")
    val prices = Seq(9.99, 12.99, 14.99, 19.99, 24.99, 29.99, 34.99, 39.99, 44.99, 49.99)
    val sales = Seq(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000)
    val ratings = Seq(3.5, 4.0, 4.5, 5.0)
    val dates = Seq("2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01")

    val random = new Random()
    val data = (1 to numRecords).map { _ =>
      val title = titles(random.nextInt(titles.length))
      val author = authors(random.nextInt(authors.length))
      val price = prices(random.nextInt(prices.length))
      val sale = sales(random.nextInt(sales.length))
      val rating = ratings(random.nextInt(ratings.length))
      // Append a zero-padded random time of day to a random date;
      // publishTime is kept as a String, matching the requested schema
      val time = f"${random.nextInt(24)}%02d:${random.nextInt(60)}%02d:${random.nextInt(60)}%02d"
      val publishTime = s"${dates(random.nextInt(dates.length))} $time"
      Row(title, author, price, sale, rating, publishTime)
    }

    // Build the schema explicitly; deriving it from an empty DataFrame of Rows
    // does not compile because Spark cannot infer a schema for Row
    val schema = StructType(Seq(
      StructField("title", StringType),
      StructField("author", StringType),
      StructField("price", DoubleType),
      StructField("sales", IntegerType),
      StructField("rating", DoubleType),
      StructField("publishTime", StringType)
    ))

    // createDataFrame has no overload for a local Seq[Row] with a schema,
    // so parallelize the rows into an RDD first
    spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  }
}
```
Call the `generateData` function to generate the desired number of records, for example:
```scala
val numRecords = 1000
val data = DataGenerator.generateData(spark, numRecords)
data.show()
```
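Once generated, the DataFrame can be queried like any other. As a quick sanity check (a hypothetical example, not part of the original answer), you might aggregate total sales and average rating per author:

```scala
import org.apache.spark.sql.functions.{sum, avg, desc}

// Assumes an active SparkSession named `spark` is in scope
val books = DataGenerator.generateData(spark, 1000)
books.groupBy("author")
  .agg(sum("sales").as("totalSales"), avg("rating").as("avgRating"))
  .orderBy(desc("totalSales"))
  .show(5)
```

Because titles and authors are sampled independently, the pairings will not always be real-world book/author matches; that is usually fine for load or query testing, but worth knowing if the data feeds anything user-facing.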