Regex: extracting data from a file read is not working

Posted by iqih9akk on 2022-11-18 in Other


I have a midx file that has the same format as an XML file.
I am reading this file:

import re

f = open('visus8.midx', 'r')
regexCommand = re.compile("/\/(\S\w*)*.JPG/im")
for line in f:
    matches = regexCommand.findall(str(line))
    print(matches)

The file has

<dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" color="#f4be4bff" quad="0 365.189 5614.56 5.9402e-14 5617.89 3728.26 331.293 3920.91" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175 -0.998922 0.024509 -0.015672" t="-2.536858 -5.009510 91.514963" lat="35.944029617344619" lon="-90.450638476132283" alt="91.672617112396139" />

as one of its tags, and I want to extract

/home/siddharth/Desktop/testing/DJI_0726.JPG

from filenames="".
I am unable to do so. Could you tell me whether my regex is wrong or whether something else is wrong?
Here is half of the midx file:

<dataset typename="IdxMultipleDataset" logic_box="0 7252 0 8683" physic_box="0.24874641550219023 0.24875126191231167 0.6071205757248886 0.6071264043899676">
    <slam width="5472" height="3648" dtype="uint8[3]" calibration="4256.023438 2735.799316 1824.087646" />
    <field name='voronoi'><code>
        output=voronoi()</code>
    </field>
    <translate x="0.24874641550219023" y="0.60712057572488864">
        <scale x="6.6824682454607912e-10" y="6.6824682454607912e-10">
            <translate x="-0" y="-5.9402018165207208e-14">
                <svg width="1048" height="1254" viewBox="0 0 7252 8683">
                    <g stroke="#000000" stroke-width="1" fill="#ffff00" fill-opacity="0.3">
                        <poi point="2710.006104,2372.072998" />
                        <poi point="2795.450439,3354.056396" />
                        <poi point="2846.955566,4015.307861" />
                        <poi point="2914.414307,4897.018555" />
                        <poi point="3015.048584,6234.411133" />
                        <poi point="4570.675293,6449.748047" />
                        <poi point="4437.736328,4984.978027" />
                        <poi point="4387.470703,4050.677002" />
                    </g>
                </svg>
                <dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" color="#f4be4bff" quad="0 365.189 5614.56 5.9402e-14 5617.89 3728.26 331.293 3920.91" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175 -0.998922 0.024509 -0.015672" t="-2.536858 -5.009510 91.514963" lat="35.944029617344619" lon="-90.450638476132283" alt="91.672617112396139" />

Thank you

regex

Source: https://stackoverflow.com/questions/74273550/extracting-data-from-the-file-read-not-working

2 Answers


#1 xkrw2x1b

You can use a single capture group and make the pattern a bit more specific, without using a repeated group at all:

<dataset\b[^<>]* filenames="(\S+\.JPG)"

Example

import re

pattern = r'<dataset\b[^<>]* filenames="(\S+\.JPG)"'
s = "...."
print(re.findall(pattern, s))

Output

['/home/siddharth/Desktop/testing/DJI_0726.JPG']
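The same pattern can also be applied while iterating over the file line by line, as the question does. The following is only a sketch: `io.StringIO` stands in for `open('visus8.midx', 'r')`, and the sample content is shortened from the tags shown in the question.

```python
import io
import re

# Pattern from this answer: capture the filenames="..." value of a <dataset ...> tag.
pattern = re.compile(r'<dataset\b[^<>]* filenames="(\S+\.JPG)"')

# Stand-in for open('visus8.midx', 'r'); content shortened from the question's sample.
sample = io.StringIO(
    '<svg width="1048" height="1254" viewBox="0 0 7252 8683">\n'
    '<dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" '
    'color="#f4be4bff" filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" '
    'lat="35.944029617344619" />\n'
)

paths = []
for line in sample:
    # With one capture group, findall returns only the group, i.e. the path.
    paths.extend(pattern.findall(line))

print(paths)
```

Lines without a `filenames` attribute simply contribute nothing to `paths`, so no extra check for empty results is needed inside the loop.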

Posted 2022-11-18

#2 c3frrgcw

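There are several problems with the regex pattern.
In Python's re.compile(pattern, flags=...), you pass regex flags as an argument instead of putting them into the pattern. The leading '/' and the trailing '/im' in "/\/(\S\w*)*.JPG/im" are interpreted by Python as part of the pattern, so the regex tries to match "JPG/im" literally and fails.
In a regex pattern, '.' has a special meaning (any single character), so it must be escaped to match a literal dot.
There is no need to put a backslash before a forward slash.
You want to capture repeated occurrences of a slash followed by non-whitespace characters, so the slash has to be inside the capture group.
If you adjust your regex pattern according to the points above, you get: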

regexCommand = re.compile(r"(/\S*)*\.JPG", re.I | re.M)

This then gives you a result (without '.JPG', because the extension is not inside the capture group).
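To make the difference concrete, here is a small sketch that runs both the original and the adjusted pattern against one line. The `line` value is shortened from the question's sample tag, and the variable names are chosen only for illustration.

```python
import re

line = 'filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" q="0.036175"'

# Original pattern: the leading '/' and the trailing '/im' are matched literally.
original = re.compile(r"/\/(\S\w*)*.JPG/im")
original_matches = original.findall(line)

# Adjusted pattern: flags passed as an argument, dot escaped, slash inside the group.
adjusted = re.compile(r"(/\S*)*\.JPG", re.I | re.M)
adjusted_matches = adjusted.findall(line)

print(original_matches)  # the line contains no '//' and no 'JPG/im', so nothing matches
print(adjusted_matches)  # only the group content is returned, without '.JPG'
```

Because `findall` returns the capture group rather than the whole match, the second list holds the path up to, but not including, the extension.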
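Note that in the regex above you can drop the '*' after (/\S*), because the group already captures every '/' in the path before '.JPG', and that using '\w*' would not cover the non-word characters that are allowed in file paths.
So, if you want to extract any absolute path (starting with '/') of a JPG image, including the '.JPG' extension, you can use: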

regexCommand = re.compile(r"/\S*\.JPG", flags=re.I | re.M)

Or use a more specific regex, as suggested in the other answer.
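As a rough illustration of the simpler pattern, the sketch below applies it to two lines. The first is shortened from the question's sample; the second, with a lowercase extension, is a hypothetical line added only to show the effect of re.I.

```python
import re

# Simpler pattern from this answer; re.I also accepts a lowercase extension.
regex_command = re.compile(r"/\S*\.JPG", flags=re.I | re.M)

lines = [
    '<dataset url="/home/siddharth/Desktop/testing/VisusSlamFiles/idx/0000.idx" '
    'filenames="/home/siddharth/Desktop/testing/DJI_0726.JPG" lat="35.944029617344619" />',
    '<dataset filenames="/tmp/example/photo_0001.jpg" />',  # hypothetical line
]

found = []
for line in lines:
    # No capture group here, so findall returns the whole match, extension included.
    found.extend(regex_command.findall(line))

print(found)
```

The `.idx` URL in the first line is skipped because no '.JPG' follows it within the same run of non-whitespace characters.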

Posted 2022-11-18
