How can Java Spark read Excel data?
Posted: 2023-09-02 11:06:23 · Views: 126
Java Spark can read Excel data in several ways:
1. Use the Apache POI library to read the Excel file, then convert the data into a Spark DataFrame.
2. Use an Excel data source plugin for Spark SQL, which loads data directly from Excel files into a Spark DataFrame.
3. Use a third-party library, such as the spark-excel package or CData's Excel JDBC driver.
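As a sketch of the data-source approach, reading through the spark-excel plugin looks roughly like this. The format name `com.crealytics.spark.excel` and the `header`/`dataAddress` options come from that library's documentation; this assumes the `com.crealytics:spark-excel` dependency is on the classpath, so verify the details against the version you use:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExcelExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-excel reader")
                .master("local[*]")
                .getOrCreate();

        // Requires the com.crealytics:spark-excel dependency on the classpath.
        // "header" tells the reader to use the first row as column names;
        // "dataAddress" selects which sheet (and cell range) to read.
        Dataset<Row> df = spark.read()
                .format("com.crealytics.spark.excel")
                .option("header", "true")
                .option("dataAddress", "'Sheet1'!A1")
                .load("path/to/excel/file.xlsx");

        df.show();
        spark.stop();
    }
}
```

Compared with the POI approach below, this keeps all schema inference and row conversion inside the data source, at the cost of an extra dependency.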
Here is example code that reads Excel data with the Apache POI library:
```java
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExcelReader {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("Excel Reader")
                .master("local[*]")
                .getOrCreate();

        String filePath = "path/to/excel/file.xlsx";
        String sheetName = "Sheet1";

        // DataFormatter renders any cell type (numeric, boolean, date) as the
        // string shown in Excel, avoiding the exception that
        // getStringCellValue() throws on non-string cells
        DataFormatter formatter = new DataFormatter();

        List<StructField> fields = new ArrayList<>();
        List<Row> rows = new ArrayList<>();

        // try-with-resources closes the workbook even if an exception is thrown.
        // POI's Row class clashes with Spark's Row, so it is referenced by its
        // fully qualified name below.
        try (Workbook workbook = new XSSFWorkbook(filePath)) {
            Sheet sheet = workbook.getSheet(sheetName);

            // Build the schema from the header row (all columns as nullable strings)
            org.apache.poi.ss.usermodel.Row headerRow = sheet.getRow(0);
            int numColumns = headerRow.getLastCellNum();
            for (Cell cell : headerRow) {
                fields.add(DataTypes.createStructField(
                        formatter.formatCellValue(cell), DataTypes.StringType, true));
            }

            // Read the data rows; fetch cells by index so blank cells
            // do not shift values into the wrong column
            for (int i = 1; i <= sheet.getLastRowNum(); i++) {
                org.apache.poi.ss.usermodel.Row row = sheet.getRow(i);
                if (row == null) {
                    continue; // skip completely empty rows
                }
                Object[] values = new Object[numColumns];
                for (int j = 0; j < numColumns; j++) {
                    Cell cell = row.getCell(j,
                            org.apache.poi.ss.usermodel.Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                    values[j] = formatter.formatCellValue(cell);
                }
                rows.add(RowFactory.create(values));
            }
        }

        // Create and show the DataFrame
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();

        spark.stop();
    }
}
```
Note: this code only applies to XLSX-format Excel files. To read the older XLS format, use HSSFWorkbook instead of XSSFWorkbook.
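If the file might be in either format, POI's WorkbookFactory can detect the format from the file contents and return the right workbook type. A minimal sketch (requires the POI and poi-ooxml dependencies; the file path is a placeholder):

```java
import java.io.File;

import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class AnyExcelReader {
    public static void main(String[] args) throws Exception {
        // WorkbookFactory inspects the file header and returns an
        // HSSFWorkbook for .xls or an XSSFWorkbook for .xlsx
        try (Workbook workbook = WorkbookFactory.create(new File("path/to/excel/file.xls"))) {
            System.out.println("Sheets: " + workbook.getNumberOfSheets());
        }
    }
}
```

This lets the reading code above stay format-agnostic, since everything after opening the workbook goes through the common `org.apache.poi.ss.usermodel` interfaces.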