为什么我的pyspark正则表达式只给出第一行?

g52tjvyc  于 2021-07-14  发布在  Spark
关注(0)|答案(1)|浏览(308)

从这个答案中获得灵感:https://stackoverflow.com/a/61444594/4367851 我已经能够在sparkDataframe中将我的.txt文件拆分为列。然而,它只给了我第一个游戏-即使示例.txt文件包含更多。
我的代码:

basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
    selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
    withColumn("Event", col("new")[0]).\
    withColumn("White", col("new")[2]).\
    withColumn("Black", col("new")[3]).\
    withColumn("Result", col("new")[4]).\
    withColumn("UTCDate", col("new")[5]).\
    withColumn("UTCTime", col("new")[6]).\
    withColumn("WhiteElo", col("new")[7]).\
    withColumn("BlackElo", col("new")[8]).\
    withColumn("WhiteRatingDiff", col("new")[9]).\
    withColumn("BlackRatingDiff", col("new")[10]).\
    withColumn("ECO", col("new")[11]).\
    withColumn("Opening", col("new")[12]).\
    withColumn("TimeControl", col("new")[13]).\
    withColumn("Termination", col("new")[14]).\
    drop("new")

basefile.show()

输出:

+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|               Event|          White|            Black|        Result|             UTCDate|             UTCTime|         WhiteElo|         BlackElo|     WhiteRatingDiff|     BlackRatingDiff|        ECO|             Opening|         TimeControl|         Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+

输入文件:

[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]

1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0

[Event "Rated Classical game"]
.
.
.

每场比赛都以 [Event 所以我觉得它应该是可行的,因为文件有重复的结构,唉,我不能让它工作。
额外积分:
我其实不需要移动列表,所以如果更容易的话可以删除它们。
我只希望在将“”转换为sparkDataframe后,为每个新行显示“”中的内容。
非常感谢。

3yhwsihp

3yhwsihp1#

wholetextfiles将每个文件读入单个记录。如果只读取一个文件,结果将是一个只有一行的rdd,包含整个文本文件。问题中的regexp逻辑每行只返回一个结果,这将是文件中的第一个条目。
最好的解决方案可能是在操作系统级别将文件拆分为每个游戏一个文件(例如这里),这样spark就可以并行读取多个游戏。但如果单个文件不是太大,也可以在pyspark中拆分游戏:
读取文件:

basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()

创建列列表并使用regexp\u extract将此列表转换为列表达式列表:

from pyspark.sql import functions as F

cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]

提取数据:
将整个文件拆分为一系列游戏
将此数组分解为单个记录
删除每个记录中的换行符,以便正则表达式工作
使用上面定义的列表达式来提取数据

basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
  .selectExpr("explode(game) as game") \
  .withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
  .select(cols) \
  .show(truncate=False)

输出(对于包含三个游戏副本的输入文件):

+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event                |White|Black  |Result|UTCDate   |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening                         |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0   |2012.12.31|23:01:03|1639    |1403    |+5             |-8             |C00|French Defense: Normal Variation|600+8      |Normal     |
|Rated Classical game2|BFG9k|mamalak|1-0   |2012.12.31|23:01:03|1639    |1403    |+5             |-8             |C00|French Defense: Normal Variation|600+8      |Normal     |
|Rated Classical game3|BFG9k|mamalak|1-0   |2012.12.31|23:01:03|1639    |1403    |+5             |-8             |C00|French Defense: Normal Variation|600+8      |Normal     |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+

相关问题