regex: Using a regular expression to remove common company suffixes from a list of companies

Posted by ztyzrc3y on 2023-01-21 in Other

I have the following code, which I use to generate a list of common company suffixes:

import re
from cleanco import typesources
import string

def generate_common_suffixes():
    unique_items = []
    company_suffixes_raw = typesources()
    for item in company_suffixes_raw:
        for i in item:
            if i.lower() not in unique_items:
                unique_items.append(i.lower())

    unique_items.extend(['holding'])
    return unique_items

I then try to remove these suffixes from a list of company names with the following code:

company_name = ['SAMSUNG ÊLECTRONICS Holding, LTD', 'Apple inc',
                'FIIG Securities Limited Asset Management Arm',
                'First Eagle Alternative Credit, LLC',
                'Global Credit Investments', 'Seatown', 'Sona Asset Management']

suffixes = generate_common_suffixes()

cleaned_names = []

for company in company_name:
    for suffix in suffixes:
        new = re.sub(r'\b{}\b'.format(re.escape(suffix)), '', company)
    cleaned_names.append(new)

I keep getting an unchanged list of company names, even though I know the suffixes are there.

Alternate attempt

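I also tried an alternative approach that simply finds each word and replaces it without using regex, but I don't understand why it removes parts of the company name itself. For example, it removes the first 3 letters of Samsung.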

for word in common_words:
    name = name.replace(word, "")

Any help is greatly appreciated!


svmlkihl1#

import unicodedata
from cleanco import basename
import re

company_names = ['SAMSUNG ÊLECTRONICS Holding, LTD',
                 'Apple inc',
                 'FIIG Securities Limited Asset Management Arm',
                 'First Eagle Alternative Credit, LLC',
                 'Global Credit Investments',
                 'Seatown',
                 'Sona Asset Management']

suffix = ["holding"] # "Common words"? You can add more

cleaned_names = []
for company_name in company_names:
    # To Lower
    company_name = company_name.lower()
    # Fix unicode
    company_name = unicodedata.normalize('NFKD', company_name).encode('ASCII', 'ignore').decode()
    # Remove punctuation
    company_name = re.sub(r'[^\w\s]', '', company_name)
    # Remove suffixes
    company_name = basename(company_name)
    # Remove common words
    for word in suffix:
        company_name = re.sub(fr"\b{word}\b", '', company_name)
    # Save
    cleaned_names.append(company_name)

print(cleaned_names)

Output:

['samsung aalectronics ', 'apple', 'fiig securities limited asset management arm', 'first eagle alternative credit', 'global credit investments', 'seatown', 'sona asset management']
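
As for why the two attempts in the question behave the way they do: in the regex loop, every pass calls re.sub on the untouched `company`, so `new` only ever holds the result for the last suffix in the list. The pattern is also case-sensitive, while the suffix list is lower-cased, so 'LTD' or 'Holding' never match. The str.replace version has no word boundaries at all, so a short entry is deleted wherever it occurs inside a longer word. Below is a minimal sketch using only the standard library that addresses both points. The helper name `strip_suffixes` and the short `sample_suffixes` list are hypothetical stand-ins for the list built by generate_common_suffixes(), not part of cleanco.

```python
import re

# Hypothetical sample; in the question this list comes from generate_common_suffixes().
sample_suffixes = ["ltd", "inc", "llc", "holding", "sa"]


def strip_suffixes(name, suffixes):
    """Remove whole-word suffixes from name, ignoring case."""
    # Build one alternation, longest entries first, so a long entry is
    # tried before any shorter entry that is a prefix of it.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # Lookarounds instead of \b: they also behave for entries that
    # start or end with punctuation, such as "s.a.".
    pattern = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation),
                         flags=re.IGNORECASE)
    cleaned = pattern.sub("", name)
    # Drop leftover punctuation, then collapse repeated whitespace.
    cleaned = re.sub(r"[^\w\s]", "", cleaned)
    return " ".join(cleaned.split())


company_names = ['SAMSUNG ÊLECTRONICS Holding, LTD', 'Apple inc',
                 'First Eagle Alternative Credit, LLC', 'Seatown']

cleaned_names = [strip_suffixes(n, sample_suffixes) for n in company_names]
print(cleaned_names)

# Plain substring replacement, for comparison: "sa" is removed from
# inside "samsung" because nothing requires it to be a whole word.
print("samsung".replace("sa", ""))
```

Compiling a single case-insensitive pattern applies every suffix in one pass, which avoids the overwritten `new` variable entirely. Unlike the cleanco-based code above, this sketch keeps the original casing and accented characters; add the lower-casing and unicodedata steps if you want normalized output.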
