Splitting a Git log with regex in Python

2ul0zpep  posted on 2023-05-12  in Git

I want to use Python's re module to split a Git log string like the following:

commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

commit 4f113912741f753c75a44f18790ff5903e910fad
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 14 17:55:55 2023 +0800

    Add test files

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    Second commit test

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    First commit

From that, I would like to get an array of commits like this:

[
'
commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

',
'
commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

',
...
]

I am struggling to find a clean, general pattern that matches each commit.
Any ideas are welcome.
Thanks

6yoyoihd1#

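Frame the problem as locating blocks that have a known start and end pattern.
Then define where a block starts and where it ends; here, both are anchored on the commit hash.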

import re

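# A block starts at "commit <40 hex digits>" and runs lazily up to the next such header or the end of the string
rgx = r'(commit\s[0-9a-f]{40}.*?)(?=commit\s[0-9a-f]{40}|\Z)'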

text = '''commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

commit 4f113912741f753c75a44f18790ff5903e910fad
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 14 17:55:55 2023 +0800

    Add test files

commit 87053deb6ad07fa1ea6dd7a5acfee075ce5b6322
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 14 15:16:57 2023 +0800

    Add cat.jpg
'''

re.findall(rgx, text, re.DOTALL)

This gives the expected output:

['commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\nAuthor: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\nDate:   Fri Apr 28 18:58:00 2023 +0800\n\n    new cat\n\n',
 'commit 9274b33435238122c8d6d389e73266f6a3e68745\nAuthor: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\nDate:   Wed Apr 19 11:04:04 2023 +0800\n\n    meow\n\n',
 'commit 4f113912741f753c75a44f18790ff5903e910fad\nAuthor: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\nDate:   Fri Apr 14 17:55:55 2023 +0800\n\n    Add test files\n\n',
 'commit 87053deb6ad07fa1ea6dd7a5acfee075ce5b6322\nAuthor: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\nDate:   Fri Apr 14 15:16:57 2023 +0800\n\n    Add cat.jpg\n']

Edit: note that the end of the input is handled with the \Z sentinel.
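
A possible variation on the same idea, as a minimal sketch: anchor the header to a whole line with re.MULTILINE, so that a message containing the word "commit" mid-line is not treated as a boundary. The helper name split_commits and the shortened sample string are illustrative assumptions, and the pattern assumes plain git log output with 40-character SHA-1 hashes and no decorations after the hash.

```python
import re

# A commit header is a full line: "commit " plus a 40-character hex hash.
# Assumption: plain `git log` output, SHA-1 hashes, nothing after the hash.
HEADER = re.compile(r'^commit [0-9a-f]{40}$', re.MULTILINE)


def split_commits(log_text):
    """Return one string per commit, each starting at its header line."""
    starts = [m.start() for m in HEADER.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[s:e] for s, e in zip(starts, ends)]


sample = (
    "commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\n"
    "Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    "Date:   Fri Apr 28 18:58:00 2023 +0800\n"
    "\n"
    "    new cat\n"
    "\n"
    "commit 9274b33435238122c8d6d389e73266f6a3e68745\n"
    "Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    "Date:   Wed Apr 19 11:04:04 2023 +0800\n"
    "\n"
    "    Second commit test\n"
)

commits = split_commits(sample)
print(len(commits))
```

Slicing between header positions keeps every character of the input, so joining the pieces reproduces the original text.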

tyu7yeag2#

If the excerpt you showed is the output of your own git log command, you can also define your own format string.
For example:

git log --pretty="--commit--%ncommit %H%nAuthor: %an <%ae>%nDate:   %ad%n%n%w(80,4,4)%B"

should give you output like this:

--commit--
commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

--commit--
commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

--commit--
commit 4f113912741f753c75a44f18790ff5903e910fad
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 14 17:55:55 2023 +0800

    Add test files

--commit--
commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    Second commit test

--commit--
commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    First commit

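You can then simply split the output on ^--commit--$, or even skip regex altogether and split on "--commit--" (this should be fine as long as your delimiter is distinctive enough that it will not show up in a commit message), or on "\n--commit--\n" and then handle the first line separately, or ...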
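
As a rough sketch of the first option, the splitting could look like the following in Python. The function names split_on_delimiter and read_log are hypothetical, and read_log assumes git is installed and that repo_path points at a repository; only the format string comes from the command above.

```python
import re
import subprocess

# Same format string as in the git log command above.
FORMAT = "--commit--%ncommit %H%nAuthor: %an <%ae>%nDate:   %ad%n%n%w(80,4,4)%B"


def split_on_delimiter(log_text):
    """Split on lines that consist only of --commit--; drop empty pieces."""
    parts = re.split(r'^--commit--$\n?', log_text, flags=re.MULTILINE)
    return [part for part in parts if part.strip()]


def read_log(repo_path="."):
    """Run git log with the custom format (needs git and a repository)."""
    return subprocess.check_output(
        ["git", "log", "--pretty=" + FORMAT], cwd=repo_path, text=True
    )


sample = (
    "--commit--\n"
    "commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\n"
    "Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    "Date:   Fri Apr 28 18:58:00 2023 +0800\n"
    "\n"
    "    new cat\n"
    "\n"
    "--commit--\n"
    "commit 9274b33435238122c8d6d389e73266f6a3e68745\n"
    "Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    "Date:   Wed Apr 19 11:04:04 2023 +0800\n"
    "\n"
    "    meow\n"
    "\n"
)

chunks = split_on_delimiter(sample)
print(len(chunks))
```

In real use, split_on_delimiter(read_log()) would replace the hard-coded sample.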

Note: see this answer for how %w(...) works in a --pretty string.

cwxwcias3#

You can use Python's re module to split the string on the commit keyword. Here is an example:

import re

git_log = '''
commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

'''

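# Split only where a whole line is a commit header, so the word "commit"
# inside a message is ignored (zero-width splits need Python 3.7+)
commits = re.split(r'(?m)^(?=commit [0-9a-f]{40}$)', git_log)
commits = [commit.strip() for commit in commits if commit.strip()]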

for commit in commits:
    print(commit)
    print()

dy1byipe4#

Maybe you can try re.split, like this:

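import re

text = git.logs()  # git.logs() is a placeholder for your own log source
ret = re.split('\n', text)
# Assumes every commit spans exactly 6 lines: header, author, date, blank, one-line message, blank
ret_u_want = ['\n'.join(ret[i:i+6]) for i in range(0, len(ret), 6)]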

You can also use the plain split method, without re:

text = git.logs()  # git.logs() is a placeholder for your own log source
ret = text.split('\n')
ret_u_want = ['\n'.join(ret[i:i+6]) for i in range(0, len(ret), 6)]
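
The fixed step of 6 only works when every message is a single line. A hedged alternative, sketched below, keeps the line-by-line approach but starts a new group whenever a line is a commit header. The name group_by_header and the extra body line in the sample are illustrative assumptions.

```python
import re

# Assumption: header lines are exactly "commit " plus a 40-character hex hash.
HEADER = re.compile(r'commit [0-9a-f]{40}')


def group_by_header(log_text):
    """Start a new group at each header line instead of every 6 lines."""
    groups = []
    for line in log_text.split('\n'):
        if HEADER.fullmatch(line):
            groups.append([line])
        elif groups:
            # Lines before the first header (if any) are ignored.
            groups[-1].append(line)
    return ['\n'.join(group) for group in groups]


sample = (
    "commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\n"
    "Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    "Date:   Fri Apr 28 18:58:00 2023 +0800\n"
    "\n"
    "    new cat\n"
    "\n"
    "    A longer body line.\n"
    "\n"
    "commit 9274b33435238122c8d6d389e73266f6a3e68745\n"
    "Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    "Date:   Wed Apr 19 11:04:04 2023 +0800\n"
    "\n"
    "    meow\n"
)

groups = group_by_header(sample)
print(len(groups))
```

Here the first commit spans 8 lines, which the 6-line slicing would cut in the wrong place.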

brc7rcf05#

If you are the one generating the output, you do not need a regex at all. You can split on the null byte.

#!/usr/bin/env python3

import subprocess

log = subprocess.check_output(["git", "log", "-z"]).decode().split("\x00")
print(log)

Or at least, I have never seen such a control character in a commit message.
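
For testing without a repository, the splitting step can be separated from the subprocess call. This is a minimal sketch: the helper name split_nul_separated is hypothetical, the sample bytes only imitate NUL-separated output, and dropping empty chunks and decoding with errors="replace" are defensive assumptions rather than documented requirements.

```python
def split_nul_separated(raw):
    """Split NUL-separated log output (bytes) into one string per commit."""
    # errors="replace" is an assumption: it avoids a crash on non-UTF-8 bytes.
    text = raw.decode("utf-8", errors="replace")
    # Drop empty chunks in case the output starts or ends with a NUL.
    return [chunk for chunk in text.split("\x00") if chunk]


sample = (
    b"commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\n"
    b"Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    b"Date:   Fri Apr 28 18:58:00 2023 +0800\n"
    b"\n"
    b"    new cat\n"
    b"\x00"
    b"commit 9274b33435238122c8d6d389e73266f6a3e68745\n"
    b"Author: ISAAC.NEWTON <ISAAC.NEWTON@GOOGLE.COM>\n"
    b"Date:   Wed Apr 19 11:04:04 2023 +0800\n"
    b"\n"
    b"    meow\n"
)

entries = split_nul_separated(sample)
print(len(entries))
```

With a real repository, the bytes returned by subprocess.check_output(["git", "log", "-z"]) would be passed in place of sample.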
