langchain: AttributeError: 'str' object has no attribute 'text' in the OpenAI document_loaders/audio parser

Posted by nkoocmlb 2 months ago in Other

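Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar issue and did not find one.
  • I am sure that this is a bug in LangChain rather than in my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).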

Example Code

Top of the file:

import logging
import os
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from enum import Enum
from typing import BinaryIO
from typing import cast, Literal, Union

from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser
from pydub import AudioSegment

from cache import conditional_lru_cache
from youtube.loader import YoutubeAudioLoader
....

    loader = GenericLoader(YoutubeAudioLoader([url], save_dir, proxy_servers),
                           OpenAIWhisperParser(api_key=get_settings().openai_api_key,
                                               language=lang.value,
                                               response_format="srt",
                                               temperature=0
                                               ))

YoutubeAudioLoader is my customization of the LangChain YoutubeAudioLoader; it allows YouTube to be accessed through a proxy.

import random
from typing import Iterable, List, Optional

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.blob_loaders.schema import Blob, BlobLoader

class YoutubeAudioLoader(BlobLoader):
    """Load YouTube urls as audio file(s), optionally through a proxy."""

    def __init__(self, urls: List[str], save_dir: str, proxy_servers: Optional[List[str]] = None):

        if not isinstance(urls, list):
            raise TypeError("urls must be a list")

        self.urls = urls
        self.save_dir = save_dir
        self.proxy_servers = proxy_servers

    def yield_blobs(self) -> Iterable[Blob]:
        """Yield audio blobs for each url."""

        try:
            import yt_dlp
        except ImportError:
            raise ImportError(
                "yt_dlp package not found, please install it with "
                "`pip install yt_dlp`"
            )

        # Use yt_dlp to download audio given a YouTube url
        ydl_opts = {
            "format": "m4a/bestaudio/best",
            "noplaylist": True,
            "outtmpl": self.save_dir + "/%(title)s.%(ext)s",
            "postprocessors": [
                {
                    "key": "FFmpegExtractAudio",
                    "preferredcodec": "m4a",
                }
            ],
            'netrc': True,
            'verbose': True,
            "extractor_args": {"youtube": "youtube:player_skip=webpage"}
        }

        # Route the download through a randomly chosen proxy, if any were given
        if self.proxy_servers:
            ydl_opts["proxy"] = random.choice(self.proxy_servers)

        for url in self.urls:
            # Download file
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                ydl.download(url)

        # Yield the written blobs
        loader = FileSystemBlobLoader(self.save_dir, glob="*.m4a")
        for blob in loader.yield_blobs():
            yield blob

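My workaround in the OpenAIWhisperParser class

# The transcript is an object with a .text attribute for some response
# formats and a plain str for others (e.g. "srt"), so handle both.
if hasattr(transcript, "text"):
    yield Document(
        page_content=transcript.text,
        metadata={"source": blob.source, "chunk": split_number},
    )
else:
    yield Document(
        page_content=transcript,
        metadata={"source": blob.source, "chunk": split_number},
    )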
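
The same workaround can be factored into a small helper so that the parser needs only one `yield`. This is a minimal sketch, assuming the transcript is either a plain `str` (as the traceback shows for `response_format="srt"`) or an object exposing a `.text` attribute. The helper name `transcript_to_text` and the sample values are hypothetical and not part of LangChain.

```python
from types import SimpleNamespace
from typing import Any


def transcript_to_text(transcript: Any) -> str:
    """Return the transcription text for either response shape.

    Hypothetical helper: a str is returned unchanged, anything else is
    assumed to carry the text in a `.text` attribute.
    """
    if isinstance(transcript, str):
        return transcript
    return transcript.text


# Stand-ins for the two shapes (made-up sample data, not real API output)
srt_response = "1\n00:00:00,000 --> 00:00:02,000\nHello\n"
json_response = SimpleNamespace(text="Hello")

srt_text = transcript_to_text(srt_response)
json_text = transcript_to_text(json_response)
```

Inside `lazy_parse`, the two branches would then collapse to a single `Document(page_content=transcript_to_text(transcript), ...)`.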

Error Message and Stack Trace (if applicable)

2024-08-09 04:54:51,225 [DEBUG] [AnyIO worker thread] HTTP Response: POST https://api.openai.com/v1/audio/transcriptions "200 OK" Headers([('date', 'Fri, 09 Aug 2024 04:54:51 GMT'), ('content-type', 'text/plain; charset=utf-8'), ('transfer-encoding', 'chunked'), ('connection', 'keep-alive'), ('openai-organization', 'user-imywxd1x3dz2koid5nl3pykg'), ('openai-processing-ms', '65120'), ('openai-version', '2020-10-01'), ('strict-transport-security', 'max-age=15552000; includeSubDomains; preload'), ('x-ratelimit-limit-requests', '50'), ('x-ratelimit-remaining-requests', '49'), ('x-ratelimit-reset-requests', '1.2s'), ('x-request-id', 'req_ee43e9b5d13b87213865e038c5cb2b27'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=fIoOXAGFHjq12ZFNqV2aJW9VpSZ7F.EEwLZCLjQE7xE-1723179291-1.0.1.1-QwPirU_LuFjrc4wkDkk9Trr5C9.th_1ZY3_DpiXDelVA7LMsWOyKBwyQ18l.4.H42VyroK.spHCXh.pW.1LZVA; path=/; expires=Fri, 09-Aug-24 05:24:51 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('x-content-type-options', 'nosniff'), ('set-cookie', '_cfuvid=LI.AshH8TiEGFHWzAy95eYdOziTNvrGLH9.bRjsl_d8-1723179291106-0.0.1.1-604800000; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '8b0524df2babc7a7-DUS'), ('content-encoding', 'br'), ('alt-svc', 'h3=":443"; ma=86400')])
2024-08-09 04:54:51,226 [DEBUG] [AnyIO worker thread] request_id: req_ee43e9b5d13b87213865e038c5cb2b27
2024-08-09 04:54:51,227 [DEBUG] [AnyIO worker thread] Could not read JSON from response data due to <class 'json.decoder.JSONDecodeError'> - Extra data: line 2 column 1 (char 2)
Transcribing part 1!
INFO:     172.18.0.1:59254 - "POST /youtube/summarize HTTP/1.0" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/app/routers/youtube.py", line 30, in yt_summarize
    transcription = yt_transcribe(request.url,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/app/transcribe/utils.py", line 69, in yt_transcribe
    docs = loader.load()
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/langchain_core/document_loaders/base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/langchain_community/document_loaders/generic.py", line 116, in lazy_load
    yield from self.blob_parser.lazy_parse(blob)
  File "/usr/local/lib/python3.12/site-packages/langchain_community/document_loaders/parsers/audio.py", line 132, in lazy_parse
    page_content=transcript.text,
                 ^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'text'

Description

  • I want to get YouTube video transcriptions in any of the supported formats (srt, txt, vtt, etc.) when using the OpenAI Whisper integration.

Answer #1 (nqwrtyyt)

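I created a small PR that fixes this class. It is strange, though: OpenAI's official documentation states that the call should return a Transcript object, but it actually returns a string. I can also reproduce this error on my machine.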
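
The debug log in the question is consistent with this: the response arrives with `content-type: text/plain`, and the client logs that it could not read JSON from it. Below is a minimal sketch of the distinction, assuming that the `text`, `srt`, and `vtt` formats come back as plain strings while `json` and `verbose_json` (and the default, when no format is given) come back as objects with a `.text` attribute. The function name `returns_plain_text` is hypothetical.

```python
from typing import Optional

# Assumption: these response formats are delivered as text/plain,
# so the client hands back a str instead of a parsed object.
PLAIN_TEXT_FORMATS = frozenset({"text", "srt", "vtt"})


def returns_plain_text(response_format: Optional[str]) -> bool:
    """Guess whether a transcription response will be a plain str.

    None stands for "no format requested", which is assumed to
    behave like "json".
    """
    return response_format in PLAIN_TEXT_FORMATS


srt_is_plain = returns_plain_text("srt")
default_is_plain = returns_plain_text(None)
```

A parser could use a check like this to decide up front whether to read `transcript.text` or use the transcript directly.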
