How can I ignore double quotes when reading a CSV file in PySpark?

Posted by uubf1zoe on 2024-01-06 in Spark

I have a CSV file that looks like this:

    ID, Tags
    1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
    2,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",""Remarks"": "" EUC Personal Desktop"""
    3,"""Remarks"": "" EUC Personal Desktop"",""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"""

Sample Tags values look like this:

    "cluster": "csp-csp-cluster","kubernetes_node": ""
    "Project": "S4","Type": "Dev"
    "ApplicationID": "BSN0002278","ClusterId": "0803-071102-1n1kbom","ClusterName": "Riskman Quaitification Cluster","Creator": "[email protected]","DatabricksEnvironment": "workerenv-6764199050889752","Environment": "DEV","Remark": "EDP","RequestNo": "RITM0215195","Solution": "Riskman","Vendor": "Databricks","databricks-instance-name": "e82e5f2df9ae479bbb76766c48bdf3cc","x_Environment": "DEV"
    "ApplicationID": "APP0005199","Environment": "Dev","Remark": "Buy@energy","RequestNo": "RITM0270765"
    "ApplicationID": "","Environment": "","Remark": "","RequestNo": ""
    "APM_ID": "empty","ApplicationID": "Empty","Environment": "PROD","Remark": "Logs","RequestNo": "empty"


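I want to read it into a Spark DataFrame so that each field holds exactly the value written in the CSV, but without the quotes, and I also need to create a new column named ApplicationID.
Expected output: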

    ID, Tags,ApplicationID
    1,ApplicationID : 9AAG033396,Environment : PROD,Remarks: EUC Personal Desktop,9AAG033396
    2,ApplicationID : 9AAG033396,Environment : PROD,Remarks: EUC Personal Desktop,9AAG033396
    3,Remarks: EUC Personal Desktop,ApplicationID : 9AAG033396,Environment : PROD,9AAG033396


The output I actually get:

    ID, Tags,ApplicationID
    1,"""ApplicationID "": ""9AAG033396"",9AAG033396
    2,"""ApplicationID "": ""9AAG033396"",9AAG033396
    3,"""Remarks"": "" EUC Personal Desktop"",null


This is how I read it in PySpark:

    from pyspark.sql.functions import when, substring

    df = spark.read\
        .option("header", "true")\
        .option("inferSchema", "true")\
        .option("delimiter", ",")\
        .option("escapeQuotes", "true")\
        .option("multiLine", "true")\
        .option('quote', '"')\
        .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")

    # Take 10 characters starting at position 23 whenever the tag string mentions ApplicationID
    df2 = df.withColumn("ApplicationID", when(df.Tags.contains("ApplicationID"), substring('Tags', 23, 10))
        .otherwise(''))

I would like to know whether there is a way to achieve this with the escape option.

Answer #1 (g6baxovj)

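escape is exactly the option you are looking for. Set it to " so that a doubled quote ("") inside a quoted field is read as a single literal double quote.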

    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("escape", '"')  # a quote inside a quoted field is written as ""
        .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/onetimeazamortizecostnew/onetimeazamortizecostnew/20230901-20231031/onetimeazamortizecostnew_17fcdeca-81ca-43e7-b181-36bc379e9644.csv")
    )
    df.show(truncate=False)

    +---+----------------------------------------------------------------------------------------+
    |ID | Tags                                                                                   |
    +---+----------------------------------------------------------------------------------------+
    |1  |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
    |2  |"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"|
    |3  |"Remarks": " EUC Personal Desktop","ApplicationID ": "9AAG033396","Environment ": "PROD"|
    +---+----------------------------------------------------------------------------------------+
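
If you want to check the quoting rule without a cluster, here is a minimal sketch using Python's standard csv module instead of Spark. It is only an illustration of the same convention (a doubled quote inside a quoted field stands for one literal quote); the variable names are mine, and the row is copied from the question.

```python
import csv
import io

# Header and first data row, copied from the question's CSV.
raw = (
    'ID, Tags\n'
    '1,"""ApplicationID "": ""9AAG033396"",""Environment "": ""PROD"",'
    '""Remarks"": "" EUC Personal Desktop"""\n'
)

# doublequote=True (the csv module default) reads "" inside a quoted
# field as a single literal ", which is the rule discussed above.
rows = list(csv.reader(io.StringIO(raw), quotechar='"', doublequote=True))
header, first = rows[0], rows[1]

print(header)    # the second column name keeps its leading space
print(first[1])  # the Tags value with the outer CSV quoting resolved
```

The commas inside the Tags value do not split the row, because the whole value sits inside one quoted field.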

If you wrap that string in curly braces, it can be parsed as JSON, and then you can easily extract ApplicationID.

    from pyspark.sql.functions import col, concat, from_json, lit
    from pyspark.sql.types import StringType, StructField, StructType

    # Field names keep the trailing spaces that appear in the file
    schema = StructType([StructField(c, StringType()) for c in ["ApplicationID ", "Environment ", "Remarks"]])

    # Wrap the raw value in braces so it parses as a JSON object;
    # the source column is " Tags" (leading space taken from the header)
    df.withColumn("Tags", from_json(concat(lit("{"), col(" Tags"), lit("}")), schema))\
        .withColumn("ApplicationID", col("Tags.ApplicationID "))\
        .show(truncate=False)

    +---+-----------------------------------------+-------------+
    |ID |Tags                                     |ApplicationID|
    +---+-----------------------------------------+-------------+
    |1  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
    |2  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
    |3  |{9AAG033396, PROD,  EUC Personal Desktop}|9AAG033396   |
    +---+-----------------------------------------+-------------+
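
The brace-wrapping step can be tried in plain Python as well. This is just a sketch with the standard json module, not the Spark code path, and the names tags, parsed and application_id are my own.

```python
import json

# Tags value from row 3 of the sample, after the CSV quoting is resolved.
tags = '"Remarks": " EUC Personal Desktop","ApplicationID ": "9AAG033396","Environment ": "PROD"'

# The value is a comma-separated list of "key": "value" pairs,
# so adding braces around it yields a valid JSON object.
parsed = json.loads("{" + tags + "}")

# The key keeps its trailing space, exactly as written in the file.
application_id = parsed["ApplicationID "]
print(application_id)
```

Key order does not matter once the string is parsed, which is why row 3 (where Remarks comes first) works too, unlike a fixed-position substring.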


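Note: your CSV file appears to contain some odd whitespace (for example "ApplicationID " with a trailing space). I'm not sure whether that is really the case in your data, but I wrote the code samples to match it.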
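
Since the sample Tags values in the question have different keys from row to row, and the spacing around keys may not be consistent, one possible way to handle both is to parse each value into a dictionary and trim the whitespace. Below is a rough plain-Python sketch; parse_tags is a hypothetical helper of mine, not part of any library. On the Spark side, from_json also accepts a MapType(StringType(), StringType()) schema, which might serve a similar purpose when the keys are not known up front, though I have not run that against your file.

```python
import json
from typing import Dict, Optional


def parse_tags(raw: Optional[str]) -> Dict[str, str]:
    """Parse a Tags string into a dict with trimmed keys and values.

    Hypothetical helper: returns an empty dict for missing or malformed input.
    """
    if not raw:
        return {}
    try:
        parsed = json.loads("{" + raw + "}")
    except json.JSONDecodeError:
        return {}
    # Strip stray whitespace such as the trailing space in "ApplicationID ".
    return {str(k).strip(): str(v).strip() for k, v in parsed.items()}


with_id = '"ApplicationID ": "9AAG033396","Environment ": "PROD","Remarks": " EUC Personal Desktop"'
without_id = '"Project": "S4","Type": "Dev"'

print(parse_tags(with_id).get("ApplicationID"))
print(parse_tags(without_id).get("ApplicationID"))
```

Rows without an ApplicationID key simply yield None here, instead of a wrong substring.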
