Regular expression to extract substrings in Python pandas

Posted by wko9yo5t on 2022-11-18 in Python

I have a DataFrame with a column named "New", shown below:

import pandas as pd

df = pd.DataFrame({'New' : ['emerald shines bright(happy)(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)', 'this is just a text'],
'UI': ['AOT', 'BOT', 'LOV', 'HAP', 'NON']})

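Now I want to extract the various IDs (for example, the IDs that follow "ABCED" and "AxYBD", and the one in the "http" string) into another column.
But when I use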

df['New_col'] = df['New'].str.extract(r'.*\((.*)\).*',expand=True)

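it does not work well: for (ABCED ID - 1234556), the entire contents of the brackets are returned instead of just the number. More importantly, the http ID 234555 is not returned at all.
Also, could someone show how to clean the first column by removing the bracketed IDs, so that the result looks like this: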

                            New   UI    New_col
0  emerald shines bright(happy)  AOT    1234556
1            honey in the bread  BOT  123467890
2          http/ABCED/id/234555  LOV     234555
3              healing strenght  HAP    1234556
4           this is just a text  NON
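
For anyone who wants to reproduce the behaviour described above outside pandas, here is a minimal sketch using Python's built-in `re` module, whose regex syntax is what the default pandas string methods accept. The `samples` list and the `first_group` helper are illustrative names, not part of the original post.

```python
import re

# The pattern from the attempt in the question.
pattern = r'.*\((.*)\).*'

samples = [
    'emerald shines bright(happy)(ABCED ID - 1234556)',
    'http/ABCED/id/234555',
]

def first_group(text):
    """Return capture group 1, or None when the pattern does not match."""
    m = re.match(pattern, text)
    return m.group(1) if m else None

results = [first_group(s) for s in samples]
print(results)  # ['ABCED ID - 1234556', None]
```

The group `(.*)` captures everything between the last pair of brackets, so the label comes back along with the digits, and a string without brackets produces no match at all.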

regex

Source: https://stackoverflow.com/questions/74220370/regular-expression-to-extract-substrings-in-python-pandas

4 Answers

e5njpo681#

This is probably not the most elegant answer, but I think it does what you want,
based on the new criteria.

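import re

import pandas as pd

df = pd.DataFrame({'New' : ['emerald shines bright(happy)(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)', 'this is just a text'],
'UI': ['AOT', 'BOT', 'LOV', 'HAP', 'NON']})

def grab_id(row):
    # Match either a bracketed "(<letters> ID - <digits>)" or a "/<digits>" path segment.
    text = re.findall(r'\(([A-Za-z]+)\sID\s-\s?(\d+)\)|/([0-9]+)', row)
    if text:
        if text[0][0]:
            # Bracketed form matched: the digits are in the second group.
            return text[0][1]
        else:
            # Path form matched: the digits are in the third group.
            return text[0][2]
    else:
        # No ID found in this row.
        return ""


def remove_ID_in_brackets(row):
    # Drop the whole "(<letters> ID - <digits>)" chunk; other brackets such as "(happy)" stay.
    text = re.sub(r'\(([A-Za-z]+)\sID\s-\s?(\d+)\)', '', row)
    return text

df['New_Col'] = df['New'].apply(grab_id)
df['New'] = df['New'].apply(remove_ID_in_brackets)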

After running this, New_Col holds the extracted IDs and the bracketed ID text has been removed from New.
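
As a rough check of this approach without pandas, the sketch below runs the same two regexes over the question's sample strings. It is a variant, not the code above: it uses named groups (`bracket_id` and `url_id`, names chosen here for illustration) so the result does not depend on tuple positions.

```python
import re

# Same two alternatives as in the answer, with hypothetical group names.
BRACKETED_ID = r'\([A-Za-z]+\sID\s-\s?(?P<bracket_id>\d+)\)'
ID_PATTERN = re.compile(BRACKETED_ID + r'|/(?P<url_id>[0-9]+)')
# For cleaning we only need to match the bracketed chunk as a whole.
BRACKET_ONLY = re.compile(r'\([A-Za-z]+\sID\s-\s?\d+\)')

rows = [
    'emerald shines bright(happy)(ABCED ID - 1234556)',
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'healing strenght(AxYBD ID -1234556)',
    'this is just a text',
]

def grab_id(text):
    """Return the first ID found, or an empty string if there is none."""
    m = ID_PATTERN.search(text)
    if m is None:
        return ''
    return m.group('bracket_id') or m.group('url_id')

ids = [grab_id(r) for r in rows]
cleaned = [BRACKET_ONLY.sub('', r) for r in rows]

print(ids)      # ['1234556', '123467890', '234555', '1234556', '']
print(cleaned)
```

Both helpers take a plain string, so they could be passed to `df['New'].apply(...)` in the same way as the functions in the answer.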

ui7jx7zq2#

You can use the following code to do this:

reg_expression = r'.*\(.*ID\s*-\s*(.*)\)|http\/.*\/id\/(\d*)'
# findall returns an empty list for rows without an ID, so guard before indexing.
extract_text = lambda row: (row[0][0] or row[0][1]) if row else ''

df['New_col'] = df['New'].str.findall(reg_expression).apply(extract_text)

Explanation:

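Based on your dummy examples, you need to capture two patterns:

HTTP pattern: http\/.*\/id\/(\d*)

For example, http/ABCED/id/234555

Non-HTTP pattern: .*\(.*ID\s*-\s*(.*)\)

For example, emerald shines bright(ABCED ID - 1234556)
The two are combined into a single regex with the or (|) operator.
Because the combined regex has two capture groups, findall returns a tuple for each match with only one of the groups filled in, so a lambda function is used to pick the non-empty value.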
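
To make the tuple shape concrete, here is a small sketch with Python's `re.findall`, which returns the same kind of per-match tuples as `str.findall` does per row. The `samples` list and the `pick` helper are illustrative names; the third sample is the question's row that contains no ID.

```python
import re

reg_expression = r'.*\(.*ID\s*-\s*(.*)\)|http\/.*\/id\/(\d*)'

samples = [
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'this is just a text',
]

# One list per input string; each match is a (group 1, group 2) tuple.
matches = [re.findall(reg_expression, s) for s in samples]
print(matches)
# [[('123467890', '')], [('', '234555')], []]

def pick(match_list):
    """Take whichever group is filled in the first match, or '' if no match."""
    if not match_list:
        return ''
    first, second = match_list[0]
    return first or second

picked = [pick(m) for m in matches]
print(picked)  # ['123467890', '234555', '']
```

The empty list for the last sample is why indexing `row[0]` without a guard raises an IndexError on rows that contain no ID.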

sq1bmfud3#

You can use

import pandas as pd
df = pd.DataFrame({'New' : ['emerald shines bright(ABCED ID - 1234556)', 'honey in the bread(ABCED ID - 123467890)','http/ABCED/id/234555', 'healing strenght(AxYBD ID -1234556)'], 'UI': ['AOT', 'BOT', 'LOV', 'HAP']})
df['New_col'] = df['New'].str.extract(r'.*(?:\(\D*|http\S*/id/)(\d+)',expand=False)

Output:

>>> print(df.to_string())
                                         New   UI    New_col
0  emerald shines bright(ABCED ID - 1234556)  AOT    1234556
1   honey in the bread(ABCED ID - 123467890)  BOT  123467890
2                       http/ABCED/id/234555  LOV     234555
3        healing strenght(AxYBD ID -1234556)  HAP    1234556

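See the regex demo. Details:

.* - any zero or more characters other than line break characters, as many as possible
(?:\(\D*|http\S*/id/) - either a ( followed by zero or more non-digit characters, or http followed by zero or more non-whitespace characters and then /id/
(\d+) - Group 1: one or more digits.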
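
The DataFrame in this answer leaves out two cases from the question: the row with an extra "(happy)" bracket and the row with no ID. The sketch below, using plain `re` and an illustrative `extract_id` helper, suggests the same pattern copes with both. With `str.extract`, a row that does not match comes back as NaN rather than None.

```python
import re

pattern = re.compile(r'.*(?:\(\D*|http\S*/id/)(\d+)')

rows = [
    'emerald shines bright(happy)(ABCED ID - 1234556)',
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'healing strenght(AxYBD ID -1234556)',
    'this is just a text',
]

def extract_id(text):
    """Return the digits captured by group 1, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

ids = [extract_id(r) for r in rows]
print(ids)  # ['1234556', '123467890', '234555', '1234556', None]
```

If an empty string is preferred over NaN in the new column, chaining `.fillna('')` after `str.extract(..., expand=False)` is one option.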

vcudknz34#

r'[i,d,I,D]{2}.*?(\d.*?)\D' might help.

Edit: it looks like you need the letters rather than the digits, in which case: /?\(?(\w{5}) ?/?[i,d,I,D]{2}
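
Two caveats about the first pattern are worth checking before using it. Inside a character class the commas are literal, so `[i,d,I,D]` also matches a comma, and the trailing `\D` requires a non-digit after the number, which the http example does not have. The sketch below shows this and tries a tighter variant; the variant is an assumption about the intent (a case-insensitive "id" followed by the digits), not something stated in the answer, and `search_group` is an illustrative helper.

```python
import re

original = re.compile(r'[i,d,I,D]{2}.*?(\d.*?)\D')
# Hypothetical variant: the word "id" in any case, optional separators, then digits.
variant = re.compile(r'\bid\b\W*(\d+)', re.IGNORECASE)

rows = [
    'emerald shines bright(happy)(ABCED ID - 1234556)',
    'honey in the bread(ABCED ID - 123467890)',
    'http/ABCED/id/234555',
    'healing strenght(AxYBD ID -1234556)',
    'this is just a text',
]

def search_group(regex, text):
    """Return capture group 1 of the first match, or None."""
    m = regex.search(text)
    return m.group(1) if m else None

original_ids = [search_group(original, r) for r in rows]
variant_ids = [search_group(variant, r) for r in rows]

print(original_ids)  # ['1234556', '123467890', None, '1234556', None]
print(variant_ids)   # ['1234556', '123467890', '234555', '1234556', None]
```

The original pattern misses the http row because nothing follows the final digit, while the variant picks it up.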
