Extracting the domain name from a URL in Python

Posted by ovfsdjhp on 2022-12-25 in Python


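I am trying to extract domain names from a list of URLs, much as in https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can be about anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on.
The variety of domains rules out a regex like the one shown in "how to get domain name from URL": my script will run on a huge number of URLs from real network traffic, so the regex would have to be enormous to catch every kind of domain mentioned above.
Unfortunately, my web research did not turn up an efficient solution.
Does anyone know how to do this?
Any help would be appreciated!
Thanks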

python

Source: https://stackoverflow.com/questions/44021846/extract-domain-name-from-url-in-python

6 Answers


qvk1mo1f1#

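Use tldextract. It goes a step beyond urlparse: it accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL, using the Public Suffix List.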

>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'
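
To illustrate what suffix-aware splitting means, here is a minimal standard-library sketch. It assumes a tiny hand-written suffix set (TOY_SUFFIXES) in place of the real Public Suffix List, and the name registered_label is made up for this example; it is not part of tldextract.

```python
from urllib.parse import urlsplit

# Toy suffix set for illustration only. A real implementation would load
# the full Public Suffix List. "ac.us" is included only to mirror the
# question's example and is an assumption, not a checked suffix.
TOY_SUFFIXES = {"com", "net", "info", "uk", "co.uk", "us", "ac.us"}


def registered_label(url, suffixes=TOY_SUFFIXES):
    """Return the label directly left of the longest known suffix."""
    # urlsplit only fills in hostname when "//" precedes the host
    if "//" not in url:
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Candidates run from the longest tail to the shortest, so the
    # first hit is the longest matching suffix.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return labels[i - 1] if i > 0 else ""
    return ""


for sample in ("m.docs.google.com",
               "www.someisotericdomain.innersite.mall.co.uk",
               "http://forums.news.cnn.com/"):
    print(sample, "=>", registered_label(sample))
```

The longest-suffix rule is what makes mall.co.uk resolve to mall rather than co. With a suffix missing from the set, the function returns an empty string instead of guessing.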


t5fffqht2#

It looks like you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) to parse the URL and then take its netloc.
Then, using split, you can easily pull the domain name out of the netloc.
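
A minimal sketch of that idea, using only the standard library. The helper names host_labels and naive_domain are made up for this example, and prepending "http://" to bare hosts is an assumption needed because urlparse only fills in netloc when a scheme or "//" is present.

```python
from urllib.parse import urlparse


def host_labels(url):
    """Split the host part of a URL into its dot-separated labels."""
    # Assumption: bare hosts such as "m.google.com" get a default scheme
    if "://" not in url:
        url = "http://" + url
    netloc = urlparse(url).netloc  # may look like "user@host:8080"
    host = netloc.split("@")[-1].split(":")[0]
    return host.split(".")


def naive_domain(url):
    """Take the second-to-last label, which ignores suffixes like co.uk."""
    labels = host_labels(url)
    return labels[-2] if len(labels) >= 2 else labels[0]


print(naive_domain("https://stackoverflow.com/questions/18331948"))  # stackoverflow
print(naive_domain("m.docs.google.com"))                             # google
print(naive_domain("www.mall.co.uk"))                                # co
```

The last call shows the limit of plain splitting: for a two-part suffix such as co.uk the second-to-last label is co, not mall, so a list of known suffixes is still needed for the question's harder cases.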


fjnneemd3#

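A simple solution using str.split (no regular expression is actually needed):

def domain_name(url):
    # drop everything up to "www." and the scheme, then keep the first label
    return url.split("www.")[-1].split("//")[-1].split(".")[0]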


nbnkbykc4#

For a regex approach, see the pattern at:
https://regex101.com/r/WQXFy6/5
Note that you have to take care of special cases such as co.uk yourself.
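
As a rough sketch of how a regex plus a special case for co.uk-style suffixes could look: the pattern HOST_RE, the SECOND_LEVEL set, and the function name domain_by_regex below are all assumptions written for this example, not the pattern behind the regex101 link.

```python
import re

# Hypothetical pattern: optional scheme, optional userinfo, then the host
# up to the first "/", ":", "?" or "#".
HOST_RE = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?(?:[^@/]*@)?([^/:?#]+)",
    re.IGNORECASE,
)

# Assumed short list of second-level labels used under country codes
SECOND_LEVEL = {"co", "ac", "gov", "edu"}


def domain_by_regex(url):
    match = HOST_RE.match(url)
    if not match:
        return ""
    labels = match.group(1).lower().split(".")
    # Heuristic: "x.co.uk" style hosts have a known second-level label
    # followed by a two-letter country code.
    if len(labels) >= 3 and labels[-2] in SECOND_LEVEL and len(labels[-1]) == 2:
        return labels[-3]
    return labels[-2] if len(labels) >= 2 else labels[0]


print(domain_by_regex("www.someisotericdomain.innersite.mall.co.uk"))  # mall
print(domain_by_regex("www.ouruniversity.department.mit.ac.us"))       # mit
print(domain_by_regex("https://regex101.com/r/WQXFy6/5"))              # regex101
```

This is only a heuristic: any suffix outside the assumed SECOND_LEVEL set is treated as a single-label TLD, which is why a maintained suffix list is more reliable for real traffic.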


tkclm6bt5#

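Take a look at the replace and split methods.
PS: this only works for simple links such as https://youtube.com (output = youtube) and www.user.ru.com (output = user).

def domain_name(url):
    # [-1] instead of [1] so that "https://www.youtube.com" and bare hosts also work
    return url.replace("www.", "http://").split("//")[-1].split(".")[0]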


njthzxwz6#

import re

def getDomain(url: str) -> str:
    '''
        Return the last two labels of the host in a url, e.g. 'cnn.com'.
        Multi-part suffixes such as co.uk are not handled.
    '''
    # take out the protocol
    if '://' in url:
        url = url.split('://', 1)[1]

    # take out paths and routes
    host = url.split('/', 1)[0]

    # take out the port
    host = re.sub(r':[0-9]+$', '', host)

    # select only the domain
    return '.'.join(host.split('.')[-2:])

