spark sql regexp_replace
时间: 2023-10-07 11:10:04 浏览: 127
Spark SQL中的`regexp_replace`函数用于在字符串中使用正则表达式来替换匹配的部分。
使用方法如下:
```scala
import org.apache.spark.sql.functions._
val df = spark.sql("SELECT 'Hello, World!' as str")
df.select(regexp_replace(col("str"), "[aeiou]", "*")).show()
```
这将输出:
```
+-------------------+
|regexp_replace(str)|
+-------------------+
|H*ll*, W*rld! |
+-------------------+
```
在上述示例中,`regexp_replace`函数用于将字符串中的元音字母替换为星号(`*`)。正则表达式`[aeiou]`表示匹配任何一个元音字母。函数接受两个参数:要替换的字符串列和用于替换的字符串。在这里,我们将匹配到的元音字母替换为星号(`*`)。
相关问题
spark regexp_replace
正则表达式在 Spark 中的使用通常与 `regexp_replace` 函数结合使用。该函数允许您使用正则表达式模式替换字符串中的匹配项。
以下是 `regexp_replace` 函数的使用示例:
```scala
import org.apache.spark.sql.functions._
val df = Seq(("Hello, World!"), ("How are you?")).toDF("text")
df.show()
val replacedDF = df.withColumn("replaced_text", regexp_replace(col("text"), "[aeiou]", ""))
replacedDF.show()
```
这将产生以下输出:
```
+--------------+
| text|
+--------------+
|Hello, World! |
| How are you? |
+--------------+
+--------------+-------------+
| text|replaced_text|
+--------------+-------------+
|Hello, World! | Hll, Wrld!|
| How are you? | Hw r y? |
+--------------+-------------+
```
这个示例将在 DataFrame 的 "text" 列中查找元音字母,并用空字符串替换它们。在 `regexp_replace` 函数中,第一个参数是要替换的列,第二个参数是要匹配的正则表达式模式,第三个参数是用于替换匹配项的字符串。
希望这可以帮助到您!如果还有其他问题,请随时提问。
mismatched input 'WITH' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 1191) == SQL == WITH TAB_A AS ( select DISTINCT A.IMSI, coalesce(B.COUNTRY,C.COUNTRY) GJ, coalesce(B.OPERATOR,C.OPERATOR) YYS from ( SELECT A.START_TIME, A.END_TIME, A.MSISDN, A.IMSI, A.CALLING_GT, A.OPERATE_CODE, A.RESULT, A.ERROR_CODE, substr(a.CALLING_GT,1,6) CALLING_GT_6, substr(a.CALLING_GT,1,5) CALLING_GT_5, substr(a.CALLING_GT,1,4) CALLING_GT_4 FROM spark_odc_dwd.D_ENS_GMAP_MM a WHERE p_hour='2023060600' ) a left join ( SELECT DISTINCT regexp_replace(E164_IR21,'[^0-9]+','') E164_IR21, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,6) E164_IR21_6, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,5) E164_IR21_5, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,4) E164_IR21_4, COUNTRY, VENDOR_NAME AS OPERATOR FROM spark_odc_data.A_DM_CO_BA_VR_OPT_GT A) b on A.CALLING_GT_5=b.E164_IR21_5 left join ( SELECT DISTINCT regexp_replace(E164_IR21,'[^0-9]+','') E164_IR21, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,6) E164_IR21_6, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,5) E164_IR21_5, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,4) E164_IR21_4, COUNTRY, VENDOR_NAME AS OPERATOR FROM spark_odc_data.A_DM_CO_BA_VR_OPT_GT A) c on A.CALLING_GT_4=c.E164_IR21_4)
这个报错与前一个类似,是因为WITH语句不被支持或者不被SQL版本所支持。你可以将WITH语句改写成子查询的形式,例如:
```
SELECT DISTINCT A.IMSI, coalesce(B.COUNTRY,C.COUNTRY) AS GJ, coalesce(B.OPERATOR,C.OPERATOR) AS YYS
FROM (
SELECT START_TIME, END_TIME, MSISDN, IMSI, CALLING_GT, OPERATE_CODE, RESULT, ERROR_CODE, substr(CALLING_GT,1,6) AS CALLING_GT_6, substr(CALLING_GT,1,5) AS CALLING_GT_5, substr(CALLING_GT,1,4) AS CALLING_GT_4
FROM spark_odc_dwd.D_ENS_GMAP_MM
WHERE p_hour='2023060600'
) A
LEFT JOIN (
SELECT DISTINCT regexp_replace(E164_IR21,'[^0-9]+','') AS E164_IR21, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,6) AS E164_IR21_6, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,5) AS E164_IR21_5, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,4) AS E164_IR21_4, COUNTRY, VENDOR_NAME AS OPERATOR
FROM spark_odc_data.A_DM_CO_BA_VR_OPT_GT
) B ON A.CALLING_GT_5=B.E164_IR21_5
LEFT JOIN (
SELECT DISTINCT regexp_replace(E164_IR21,'[^0-9]+','') AS E164_IR21, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,6) AS E164_IR21_6, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,5) AS E164_IR21_5, SUBSTR(regexp_replace(E164_IR21,'[^0-9]+',''),1,4) AS E164_IR21_4, COUNTRY, VENDOR_NAME AS OPERATOR
FROM spark_odc_data.A_DM_CO_BA_VR_OPT_GT
) C ON A.CALLING_GT_4=C.E164_IR21_4;
```
这样就可以避免使用WITH语句而得到正确的结果。
阅读全文