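I need to convert PGN files to JSON so that I can use Spark to turn them into a Spark DataFrame and eventually build a graph. I have written a Python script that parses them into a DataFrame with pandas, but it is too slow: about 56 minutes for 170k games (I initially estimated 30 minutes, but after profiling I estimate 56). I also tried this repo: https://github.com/jonathancauchi/pgn-to-json-parser which gave me JSON files, but took 69 minutes for 170k games.
I can change the .pgn extension to .txt and it seems to work exactly the same, so I assume there is more support for converting .txt to JSON, but I am not sure.
I think Spark would be faster than "plain" Python, but I don't know how to do the conversion. A sample game is below. There are 2 billion games, though, so none of my current approaches will work: with the pgn-to-json parser it would take nearly 2 years. Ideally, I would go from .txt straight to a Spark DataFrame and skip JSON entirely.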
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
Edit: added a profile for 20k games.
ncalls tottime percall cumtime percall filename:lineno(function)
1 26.828 26.828 395.848 395.848 /Users/danieljones/Documents – Daniel’s iMac/GitHub/ST446Project/ParsePGN.py:11(parse_pgn)
20000 0.798 0.000 289.203 0.014 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/frame.py:7614(append)
20000 0.098 0.000 199.489 0.010 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:70(concat)
20000 0.480 0.000 126.548 0.006 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:295(__init__)
100002 0.212 0.000 122.178 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5199(_protect_consolidate)
80002 0.076 0.000 122.177 0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5210(_consolidate_inplace)
40000 0.079 0.000 122.063 0.003 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5218(_consolidate)
80002 0.170 0.000 121.830 0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/generic.py:5213(f)
100001 0.223 0.000 99.829 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:986(_consolidate_inplace)
59999 0.451 0.000 96.718 0.002 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:1898(_consolidate)
80002 0.138 0.000 96.599 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:970(consolidate)
79999 52.602 0.001 91.913 0.001 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/internals/managers.py:1915(_merge_blocks)
20000 7.432 0.000 79.741 0.004 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/chess/pgn.py:1323(read_game)
20000 0.361 0.000 72.843 0.004 /Users/danieljones/opt/anaconda3/envs/LSE/lib/python3.6/site-packages/pandas/core/reshape/concat.py:456(get_result)
I'm not sure whether "cumtime" is the best column to sort by, but the append step seems to take a lot of time.
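On the sorting question: cumtime includes time spent in callees, so wrappers such as parse_pgn and append float to the top, while tottime shows where time is spent in a function's own body (in the profile above, _merge_blocks at 52.6 s). A minimal sketch of producing a report sorted either way with the standard cProfile and pstats modules follows; the helper name profile_call and the work function are hypothetical and only there for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, sort_by="tottime", limit=10):
    """Run func(*args) under cProfile and return (result, report text).

    sort_by is passed to pstats; "tottime" ranks by time spent inside each
    function itself, "cumtime" by time including everything it calls.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(sort_by).print_stats(limit)
    return result, buffer.getvalue()


def work(n):
    # Hypothetical stand-in for the real parsing function.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(work, 100_000, sort_by="tottime")
    print(report)
```

Comparing the two orderings usually makes it clearer whether the cost sits in your own loop or in a library call it triggers.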
Here is my script:
import chess.pgn
import pandas as pd


def parse_pgn(pgn):
    """Read up to 20,000 games from an open PGN file into a DataFrame of edges."""
    games = []
    i = 0
    edges_df = pd.DataFrame(columns=["Event", "Round", "WhitePlayer", "BlackPlayer", "Result", "BlackElo",
                                     "Opening", "TimeControl", "Date", "Time", "WhiteElo"])
    while i < 20000:
        first_game = chess.pgn.read_game(pgn)
        if first_game is not None:
            Event = first_game.headers["Event"]
            Round = first_game.headers["Round"]
            White_player = first_game.headers["White"]
            Black_player = first_game.headers["Black"]
            Result = first_game.headers["Result"]  # Add condition to split this
            if Result == "1-0":
                Result = White_player
            elif Result == "1/2-1/2":  # PGN records a draw as "1/2-1/2", not "0-0"
                Result = "Draw"
            else:
                Result = Black_player
            BlackELO = first_game.headers["BlackElo"]
            Opening = first_game.headers["Opening"]
            TimeControl = first_game.headers["TimeControl"]
            UTCDate = first_game.headers["UTCDate"]
            UTCTime = first_game.headers["UTCTime"]
            WhiteELO = first_game.headers["WhiteElo"]
            # DataFrame.append builds a new frame on every call (the hot spot in the profile above)
            edges_df = edges_df.append({"Event": Event,
                                        "Round": Round,
                                        "WhitePlayer": White_player,
                                        "BlackPlayer": Black_player,
                                        "Result": Result,
                                        "BlackElo": BlackELO,
                                        "Opening": Opening,
                                        "TimeControl": TimeControl,
                                        "Date": UTCDate,
                                        "Time": UTCTime,
                                        "WhiteElo": WhiteELO,  # key must match the "WhiteElo" column
                                        }, ignore_index=True)
            games.append(first_game)
            i += 1
        else:
            break  # read_game returns None at end of file; stop instead of looping forever
    return edges_df
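Edit 2: changed the append approach to a dictionary. 20k games now take 78 seconds. Most of the time-consuming methods seem to come from the chess package, such as checking for legal moves and reading the board layout. None of that matters for my end goal, so I wonder whether I can stop using the package and split the file into individual games myself, perhaps on [Event, since that marks the start of each game.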
1 Answer
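Do not .append to a pandas.DataFrame inside a loop if you want a short running time; this is a well-known performance pitfall. Instead, first collect your dicts in an iterable and then create the pandas.DataFrame from it once. I would use collections.deque (from the built-in collections module), as it is designed for fast .append.
I compared the two approaches, appending to the DataFrame in the loop versus collecting dicts in a deque and building the DataFrame at the end. Both produce equal pandas.DataFrames, and I timed them with the built-in timeit module.
The deque-based version was more than 500 times faster. Of course, your mileage may vary, but I suggest giving this optimization a try.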
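The answer's original comparison code is not preserved here, so the following is only a minimal sketch of the collect-then-build pattern it describes, not the author's code. The function names and the choice of columns are assumptions for illustration, and pandas is imported lazily so the collection step runs without it. Note also that DataFrame.append was removed in pandas 2.0, so on current versions this pattern is needed anyway.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional


def collect_rows(header_dicts: Iterable[Dict[str, str]]) -> Deque[Dict[str, Optional[str]]]:
    """Accumulate one plain dict per game; no DataFrame is touched in the loop."""
    rows: Deque[Dict[str, Optional[str]]] = deque()
    for headers in header_dicts:
        # Hypothetical column selection; missing tags become None.
        rows.append({
            "WhitePlayer": headers.get("White"),
            "BlackPlayer": headers.get("Black"),
            "Result": headers.get("Result"),
        })
    return rows


def rows_to_dataframe(rows):
    """Build the DataFrame once, after the loop has finished."""
    import pandas as pd  # lazy import; assumes pandas is installed

    return pd.DataFrame(list(rows))
```

Appending to a deque is cheap and constant-time, whereas each DataFrame.append call constructs a whole new frame, which is why the cost of the original loop grows so quickly with the number of games.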