python-3.x How to remove https, www and everything else to get only the domain?

Posted by vddsk6oq on 2023-07-01 in Python


I am able to get the URLs from the search results page using the script below:

import urllib.parse

import requests
from requests_html import HTMLSession


def get_source(url):
    """Return the response for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
            or None if the request failed.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
        return None


def scrape_google(query):
    """Return the non-Google links found on the search results page."""

    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)
    if response is None:
        # The request failed, so there is nothing to filter
        return []

    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.',
                      'https://play.google.')

    # str.startswith accepts a tuple of prefixes
    return [url for url in links if not url.startswith(google_domains)]

Now I want to get the plain domain names, without https, www or anything similar, like this:

wiki.org
itroasters.com

I also need to remove any duplicates.
Can anyone help me get the expected result?
Thanks

python-3.x

Source: https://stackoverflow.com/questions/76586331/how-to-remove-https-www-and-everything-to-get-only-domain

2 Answers


bwitn5fc #1

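The question does not explain the use case for removing the leading 'www.' from the netloc/path, and doing so may be unwise.
Here is a pattern that isolates the netloc/path, with or without the 'www.' prefix. This code also handles URLs that have no scheme. Any other prefixes that need to be removed can be added to the PREFIXES list: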

from urllib.parse import urlparse
from typing import Iterator

# a list of prefixes to remove
# there's only one initially but this construct means that others
# could be added without needing to adjust the runtime code
PREFIXES = [
    'www.'
]

google_domains = [
    'https://www.google.',
    'https://google.',
    'https://webcache.googleusercontent.',
    'http://webcache.googleusercontent.',
    'https://policies.google.',
    'https://support.google.',
    'https://maps.google.',
    'https://play.google.',
]

def strip_scheme(urls: list[str], remove: bool=False) -> Iterator[str]:
    for url in urls:
        _, netloc, path, *_ = urlparse(url)
        rv = netloc or path
        if remove:
            for prefix in PREFIXES:
                if rv.startswith(prefix):
                    rv = rv[len(prefix):]
                    break
        yield rv

print(set(strip_scheme(google_domains)))
print()
print(set(strip_scheme(google_domains, True)))

Output (these are sets, so the element order may vary):

{'maps.google.', 'support.google.', 'webcache.googleusercontent.', 'google.', 'www.google.', 'policies.google.', 'play.google.'}

{'maps.google.', 'support.google.', 'webcache.googleusercontent.', 'google.', 'policies.google.', 'play.google.'}
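To connect this back to the question, here is a minimal sketch that applies the same idea to full URLs with paths and removes duplicates while keeping the first-seen order. The helper names bare_host and unique_hosts and the sample links are assumptions made for illustration, not part of the answer above, and the sketch assumes only http and https URLs occur. It prepends "//" to scheme-less input because urlsplit only recognises a network location after "//"; without that step, an input such as "itroasters.com/about" would come back with its path attached.

```python
from urllib.parse import urlsplit


def bare_host(url: str) -> str:
    """Return the host part of url without a leading 'www.' (hypothetical helper)."""
    # Assumption: inputs are http/https URLs or scheme-less host/path strings.
    if not url.startswith(("http://", "https://", "//")):
        url = "//" + url
    host = urlsplit(url).netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def unique_hosts(urls):
    """Deduplicate hosts while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(bare_host(u) for u in urls))


# Sample links are made up, based on the domains named in the question
links = [
    "https://www.wiki.org/page",
    "itroasters.com/about",
    "http://wiki.org/",
    "https://www.itroasters.com",
]
print(unique_hosts(links))  # ['wiki.org', 'itroasters.com']
```

Using dict.fromkeys instead of set keeps the result order stable, which can matter if the links come back ranked from a search page.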

Answered 2023-07-01

o2rvlv0m #2

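To strip "https://" and "www." from URLs in Python 3 and collect the unique hosts, you can use the urllib.parse module. Other subdomains are kept as they are. Here is an example: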

from urllib.parse import urlparse

# Sample URLs
urls = [
  "https://www.example.com",
  "http://subdomain.example.com",
  "https://www.another-example.com",
  "https://sub.sub.domain.example.com"
]

# Set to store unique domains
unique_domains = set()

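# Extract domains from URLs
for url in urls:
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    # Strip only a leading "www."; str.replace would also remove
    # "www." from the middle of a host name
    if domain.startswith("www."):
        domain = domain[len("www."):]
    unique_domains.add(domain)

# Convert set to a list
domain_list = list(unique_domains)

print(domain_list)

This script uses the urlparse function from the urllib.parse module to parse each URL. It then reads the netloc attribute, which holds the domain and subdomain information. A prefix check removes a leading "www." if one is present. The hosts are stored in a set (unique_domains) to eliminate duplicates. Finally, the set is converted to a list (domain_list) for printing; because sets are unordered, the order of the list is arbitrary.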

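Note that this script assumes the URLs are well formed and follow the standard URL structure. It may not handle every possible variation or edge case, and may need adjusting for your specific requirements.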
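As an illustration of the edge cases mentioned above, the sketch below uses the hostname attribute of the parse result instead of netloc. hostname is lowercased and excludes any port or user info, and it is None when the input has no network location. The function name normalized_host is hypothetical and only serves this example.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalized_host(url: str) -> Optional[str]:
    """Return the lowercased host without port, user info or leading 'www.'."""
    # hostname is None when the URL has no network location
    host = urlsplit(url).hostname
    if host is None:
        return None
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(normalized_host("https://WWW.Example.com:8443/path"))  # example.com
print(normalized_host("http://user:pw@sub.example.com/"))    # sub.example.com
print(normalized_host("not a url"))                          # None
```

This still keeps subdomains such as "sub.". Reducing a host to its registrable domain (for example, "sub.example.co.uk" to "example.co.uk") needs a list of public suffixes, which the standard library does not provide.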

Hope this helps!

Answered 2023-07-01
