Is there anything wrong with the SQL above? Walk through its logic.
### SQL Code Analysis and Summary of Issues
#### Main functionality
This SQL script processes and aggregates billing data. Its main steps are:
1. **Read raw data**: read order details, billing records, and related data from several source tables.
2. **Clean and transform**: parse a JSON field with a UDF (user-defined function) and apply the necessary cleaning and conversion.
3. **Build intermediate results**: create temporary views to hold intermediate computations.
4. **Aggregate and summarize**: roll the data up along several dimensions (province, city, product number, and so on).
5. **Write the output**: save the final result to the given HDFS path.
#### Step-by-step logic and potential issues
1. **Extracting the date parameters**
```scala
val date = args(0)
val year = date.substring(0, 4) // yyyy
val yearmonth_start = date.substring(0, 4) + "01" // yyyymm01
val yearmonth_end = date.substring(0, 6) // yyyymm
```
- **Issue**: the names `yearmonth_start` and `yearmonth_end` are confusing; something like `start_month` and `end_month` would read better. Note also that the `// yyyymm01` comment is wrong: `date.substring(0, 4) + "01"` yields `yyyy01`, i.e. January of the run year. A renaming sketch follows.
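A minimal sketch, keeping the original derivations and fixing only the names and comments:
```scala
val date = args(0)
require(date.length >= 6, s"expected a yyyymm... date argument, got '$date'")
val year       = date.substring(0, 4) // yyyy
val startMonth = year + "01"          // yyyy01: January of the run year
val endMonth   = date.substring(0, 6) // yyyymm: the month being processed
```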
2. **Reading order details and parsing the JSON field**
```scala
val frame = spark.sql(
s"""
|select
| atomic_power_orders,
| app_id,
| application_id,
| whether
|from ods_spyw.t_ods_tyyy_application_detail
|where month='${yearmonth_end}'
| and del_flag='0'
| and atomic_power_orders is not null
""".stripMargin)
.withColumn("orderNo", getAppidOrderNoJson($"atomic_power_orders"))
.createOrReplaceTempView("clear_json_object")
```
- **Issue**: the `getAppidOrderNoJson` UDF must handle null values and invalid JSON strings gracefully in every case; otherwise a single malformed row can fail the whole job.
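The UDF's implementation is not shown in the original, so the following is only a minimal null-safe sketch. It assumes `atomic_power_orders` holds a JSON array of objects carrying an `orderNo` key (the field layout is an assumption) and that the extracted order numbers are joined with `'/'`, matching the `split(orderNo, '/')` used in the next step:
```scala
import scala.util.Try
import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical null-safe sketch: returns null (instead of throwing) for null,
// blank, or unparsable input, so one bad row cannot fail the whole job.
// The assumed JSON shape is an array of objects with an "orderNo" key.
val getAppidOrderNoJson = udf { (json: String) =>
  Option(json).map(_.trim).filter(_.nonEmpty).flatMap { s =>
    Try(parse(s)).toOption.flatMap {
      case JArray(items) =>
        val codes = items.flatMap { item =>
          (item \ "orderNo") match {
            case JString(v) if v.nonEmpty => Some(v)
            case _                        => None
          }
        }
        if (codes.nonEmpty) Some(codes.mkString("/")) else None
      case _ => None // not a JSON array: treat as invalid
    }
  }.orNull
}
```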
3. **Generating the order-detail table**
```scala
spark.sql(
s"""
|select
| app_id,
| application_id,
| order_code,
| whether
|from clear_json_object
|lateral view explode(split(orderNo, '/')) v_ods_ehome_reqinterfacelog as order_code
""".stripMargin)
.createOrReplaceTempView("application_detail")
```
- **Issue**: `split(orderNo, '/')` assumes the order numbers in `orderNo` are slash-separated. If the actual data uses a different delimiter, or contains leading, trailing, or doubled slashes, this yields wrong or empty tokens; a defensive sketch follows.
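A slightly more defensive variant of the same query that trims each token and drops empties, so trailing or doubled slashes cannot inject blank order codes (a sketch, keeping the original view names):
```scala
spark.sql(
  """
    |select
    |  app_id,
    |  application_id,
    |  trim(order_code) as order_code,
    |  whether
    |from clear_json_object
    |lateral view explode(split(orderNo, '/')) v as order_code
    |where trim(order_code) != ''
  """.stripMargin)
  .createOrReplaceTempView("application_detail")
```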
4. **Flagging orders as single- or multi-recorded ('单记'/'多记')**
```scala
spark.sql(
s"""
|with flag_data as (
| select
| order_code,
| if(count(distinct application_id) >= 2, concat(order_code, '-', count(distinct application_id), '-多记'), '单记') as flag
| from application_detail
| where order_code is not null and order_code != ''
| group by order_code
|)
|select
| order_code,
| app_id,
| app_type,
| app_name,
| flag,
| business_model_flag
|from (
| select
| t0.app_id,
| t1.app_name,
| t0.application_id,
| t0.order_code,
| t1.id,
| t1.business_model_flag,
| t1.app_type,
| t0.flag
| from (
| select
| data_flag0.app_id,
| data_flag0.application_id,
| data_flag0.order_code,
| data_flag1.flag
| from (
| select
| app_id,
| application_id,
| order_code
| from application_detail
| where order_code is not null and order_code != ''
| group by application_id, order_code, app_id
| ) data_flag0
| left join (
| select
| order_code,
| flag
| from flag_data
| ) data_flag1
| on data_flag0.order_code = data_flag1.order_code
| ) t0
| left join (
| select
| id,
| business_model_flag,
| case
| when app_type = '1' then '云眼平台'
| when app_type = '2' then '云眼应用'
| when app_type = '2' then '云眼国标'
| else '其他'
| end as app_type,
| app_name
| from ods_spyw.t_ods_tyyy_application
| where month = '${yearmonth_end}'
| and del_flag = '0'
| group by id, business_model_flag, app_name, app_type
| ) t1
| on t0.application_id = t1.id
|) st0
|group by order_code, app_id, app_type, app_name, flag, business_model_flag
""".stripMargin)
.createOrReplaceTempView("application_detail")
```
- **Issue**: `case when app_type = '2' then '云眼应用' when app_type = '2' then '云眼国标'` repeats the same condition, so the second branch can never fire and no row is ever labeled '云眼国标'. The mapping should be checked and corrected, as in the sketch below.
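A corrected sketch of the mapping. The original never reveals which code should map to '云眼国标', so the `'3'` below is purely an assumption to be verified against the `app_type` data dictionary:
```scala
// Hypothetical fix: '3' for 云眼国标 is an assumed code, not confirmed by the source.
val appTypeExpr =
  """case
    |  when app_type = '1' then '云眼平台'
    |  when app_type = '2' then '云眼应用'
    |  when app_type = '3' then '云眼国标'
    |  else '其他'
    |end as app_type""".stripMargin
```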
5. **Reading and filtering billing records**
```scala
spark.sql(
s"""
|select
| t0.order_code,
| sum(t0.payable_price) as payable_price,
| t0.product_no,
| t0.settlement_status_str,
| t0.apply_status_str,
| t0.consumer_company_province,
| t0.consumer_company_city_name
|from (
| select
| order_code,
| payable_price,
| product_no,
| settlement_status_str,
| apply_status_str,
| consumer_company_province,
| consumer_company_city_name
| from ods_spw.t_ods_tyyy_ebp_bill
| where replace(payment_day, '-', '') = '${yearmonth_end}'
| and product_no in ('PRO20240223092833350', 'PRO20240223090418735', ...)
| and apply_status_str = '对账完成'
| and settlement_status_str = '开始结算'
|) t0
|inner join (
| select
| order_code
| from ods_spyw.t_ods_tyyy_ebp_order
| where month = '202410'
|) t1
|on t0.order_code = t1.order_code
|group by t0.order_code, t0.payable_price, t0.product_no, t0.settlement_status_str, t0.apply_status_str, t0.consumer_company_province, t0.consumer_company_city_name
""".stripMargin)
.createOrReplaceTempView("ebp_order_detail")
```
- **Issue**: the `group by` clause includes `t0.payable_price`, so every distinct `payable_price` forms its own group and the `sum` is effectively a no-op. If the goal is to group by `order_code` (plus the descriptive columns) and sum `payable_price`, drop `t0.payable_price` from the `group by`. Note also that the inner join filters on the hard-coded literal `month = '202410'` rather than `'${yearmonth_end}'`, which pins every run to October 2024. A corrected sketch follows.
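A sketch with `payable_price` removed from the grouping and the month parameterized (the elided product list is kept from the original):
```scala
spark.sql(
  s"""
     |select
     |  t0.order_code,
     |  sum(t0.payable_price) as payable_price,
     |  t0.product_no,
     |  t0.settlement_status_str,
     |  t0.apply_status_str,
     |  t0.consumer_company_province,
     |  t0.consumer_company_city_name
     |from (
     |  select
     |    order_code, payable_price, product_no, settlement_status_str,
     |    apply_status_str, consumer_company_province, consumer_company_city_name
     |  from ods_spw.t_ods_tyyy_ebp_bill
     |  where replace(payment_day, '-', '') = '${yearmonth_end}'
     |    and product_no in ('PRO20240223092833350', 'PRO20240223090418735', ...)
     |    and apply_status_str = '对账完成'
     |    and settlement_status_str = '开始结算'
     |) t0
     |inner join (
     |  select order_code
     |  from ods_spyw.t_ods_tyyy_ebp_order
     |  where month = '${yearmonth_end}'  -- was hard-coded '202410'
     |) t1
     |on t0.order_code = t1.order_code
     |group by t0.order_code, t0.product_no, t0.settlement_status_str,
     |  t0.apply_status_str, t0.consumer_company_province, t0.consumer_company_city_name
  """.stripMargin)
  .createOrReplaceTempView("ebp_order_detail")
```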
6. **Building the wide table**
```scala
spark.sql(
"""
|select /*+ mapjoin(ebp_order_detail)*/
| ebp_order_detail.order_code,
| ebp_order_detail.payable_price,
| ebp_order_detail.product_no,
| ebp_order_detail.settlement_status_str,
| ebp_order_detail.apply_status_str,
| ebp_order_detail.consumer_company_province,
| ebp_order_detail.consumer_company_city_name,
| application_detail.order_code,
| application_detail.app_id,
| application_detail.app_type,
| application_detail.app_name,
| application_detail.flag,
| application_detail.business_model_flag,
| 1 as num_flag,
| '${yearmonth_end}' as month
|from ebp_order_detail
|left join application_detail
|on ebp_order_detail.order_code = application_detail.order_code
""".stripMargin)
.createOrReplaceTempView("wide_table")
```
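Two things are worth flagging here. First, as quoted, the query string is not an s-interpolated string, so `'${yearmonth_end}'` would reach Spark as literal text rather than the month value (possibly just a transcription slip). Second, the select list keeps both `ebp_order_detail.order_code` and `application_detail.order_code`, so the view ends up with two columns named `order_code`, and any later reference to that name is ambiguous. A sketch fixing both, dropping the redundant second `order_code` (it always equals the join key or is NULL); the mapjoin hint is discussed in the summary:
```scala
spark.sql(
  s"""
     |select
     |  ebp_order_detail.order_code,
     |  ebp_order_detail.payable_price,
     |  ebp_order_detail.product_no,
     |  ebp_order_detail.settlement_status_str,
     |  ebp_order_detail.apply_status_str,
     |  ebp_order_detail.consumer_company_province,
     |  ebp_order_detail.consumer_company_city_name,
     |  application_detail.app_id,
     |  application_detail.app_type,
     |  application_detail.app_name,
     |  application_detail.flag,
     |  application_detail.business_model_flag,
     |  1 as num_flag,
     |  '${yearmonth_end}' as month
     |from ebp_order_detail
     |left join application_detail
     |on ebp_order_detail.order_code = application_detail.order_code
  """.stripMargin)
  .createOrReplaceTempView("wide_table")
```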
7. **Aggregating and summarizing the data**
```scala
spark.sql(
"""
|select
| consumer_company_province,
| consumer_company_city_name,
| app_type,
| sum(payable_price) as payable_price
|from (
| select
| consumer_company_province,
| consumer_company_city_name,
| app_type,
| flag,
| (sum(payable_price) / if(flag = '单记', 1, sum(num_flag))) as payable_price
| from wide_table
| group by consumer_company_province, consumer_company_city_name, app_type, flag
|) t0
|group by consumer_company_province, consumer_company_city_name, app_type
""".stripMargin)
.show(1000, false)
.write.mode(SaveMode.Overwrite).option("header", "true").option("nullValue", "NULL").csv(s"$HDFS_PATH/year_province_apptype/$date")
```
- **Issue**: the divisor in `sum(payable_price) / if(flag = '单记', 1, sum(num_flag))` deserves scrutiny. Because `num_flag` is hard-coded to 1, `sum(num_flag)` over a non-empty group is at least 1, so a literal divide-by-zero is unlikely; the riskier case is `flag` being NULL for orders with no match in `application_detail`, which silently falls through to the `sum(num_flag)` branch. Separately, `.show()` returns `Unit`, so the chained `.write...` call cannot compile as written; the DataFrame must be bound to a `val` first (see the sketch below).
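For intuition on the divisor: an order recorded by, say, three applications appears three times in `wide_table` after the join, so within its group `sum(payable_price)` is triple the real amount while `sum(num_flag)` is 3, and the division restores the single amount; '单记' rows divide by 1. A sketch of this step with the show/write chaining fixed (assuming `SaveMode`, `HDFS_PATH`, and `date` are in scope as in the original script):
```scala
// Bind the result to a val so it can be both displayed and written out;
// .show() returns Unit, so the original chained .write would not compile.
val provinceAgg = spark.sql(
  """
    |select
    |  consumer_company_province,
    |  consumer_company_city_name,
    |  app_type,
    |  sum(payable_price) as payable_price
    |from (
    |  select
    |    consumer_company_province,
    |    consumer_company_city_name,
    |    app_type,
    |    flag,
    |    (sum(payable_price) / if(flag = '单记', 1, sum(num_flag))) as payable_price
    |  from wide_table
    |  group by consumer_company_province, consumer_company_city_name, app_type, flag
    |) t0
    |group by consumer_company_province, consumer_company_city_name, app_type
  """.stripMargin)

provinceAgg.show(1000, false)
provinceAgg.write.mode(SaveMode.Overwrite)
  .option("header", "true").option("nullValue", "NULL")
  .csv(s"$HDFS_PATH/year_province_apptype/$date")
```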
8. **Aggregating by month, province, city, and product number**
```scala
spark.sql(
"""
|select
| month,
| consumer_company_province,
| consumer_company_city_name,
| product_no,
| sum(payable_price) as payable_price
|from (
| select
| month,
| consumer_company_province,
| consumer_company_city_name,
| product_no,
| flag,
| (sum(payable_price) / if(flag = '单记', 1, sum(num_flag))) as payable_price
| from wide_table
| group by month, consumer_company_province, consumer_company_city_name, product_no, flag
|) t0
|group by month, consumer_company_province, consumer_company_city_name, product_no
""".stripMargin)
.show(1000)
```
9. **Aggregating by month and product number**
```scala
spark.sql(
"""
|select
| month,
| product_no,
| sum(payable_price) as payable_price
|from (
| select
| month,
| product_no,
| flag,
| (sum(payable_price) / if(flag = '单记', 1, sum(num_flag))) as payable_price
| from wide_table
| group by month, product_no, flag
|) t0
|group by month, product_no
""".stripMargin)
.show(1000, false)
```
10. **Aggregating by month and flag**
```scala
spark.sql(
"""
|select
| month,
| flag,
| sum(payable_price) as sum_payable_price,
| if(flag = '单记', 1, sum(num_flag)) as num_flag,
| (sum(payable_price) / if(flag = '单记', 1, sum(num_flag))) as payable_price
|from wide_table
|group by month, flag
""".stripMargin)
.show(1000)
```
### Summary
- **Naming**: improve variable names to make the code easier to follow.
- **Logic errors**: fix the duplicated `case when` condition and the unnecessary `group by` columns.
- **Data handling**: make sure edge cases are covered, in particular null values and suspect divisors.
- **Performance**: consider a broadcast join to speed up JOINs between a small table and a large one; a sketch follows.
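On the last point, a minimal sketch of an explicit broadcast for the step-6 join. One caveat: for a left outer join Spark can only broadcast the right (build) side, so the original `/*+ mapjoin(ebp_order_detail) */` hint on the preserved left side is likely ignored; if `application_detail` is the small table, broadcast that instead:
```scala
import org.apache.spark.sql.functions.broadcast

// Assumes application_detail is small enough to fit in executor memory.
val wide = spark.table("ebp_order_detail")
  .join(broadcast(spark.table("application_detail")), Seq("order_code"), "left")
wide.createOrReplaceTempView("wide_table")
```
Joining with `Seq("order_code")` also coalesces the key into a single column, which sidesteps the duplicate-name issue noted in step 6.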