Python: How can I view the embeddings of documents stored in LangChain with Chroma (or any other DB)?

Posted by hfsqlsce on 2023-11-15 in Python

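When I use Chroma with LangChain and OpenAI embeddings, I can see everything except the embeddings of the documents. That field always shows None.
Here is the code: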

from tqdm import tqdm
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()  # assumed: the OpenAI embeddings mentioned above

for db_collection_name in tqdm(["class1-sub2-chap3", "class2-sub3-chap4"]):
    documents = []
    doc_ids = []

    for doc_index in range(3):
        cl, sub, chap = db_collection_name.split("-")
        content = f"This is {db_collection_name}-doc{doc_index}"
        doc = Document(page_content=content, metadata={"chunk_num": doc_index, "chapter": chap, "class": cl, "subject": sub})
        documents.append(doc)
        doc_ids.append(str(doc_index))

    # Initialize a Chroma instance with the documents of this collection
    db = Chroma.from_documents(
        collection_name=db_collection_name,
        documents=documents,
        ids=doc_ids,
        embedding=embeddings,
        persist_directory="./data",
    )

    db.persist()

When I call db.get(), everything looks as expected except that embeddings is None.

{'ids': ['0', '1', '2'],
 'embeddings': None,
 'documents': ['This is class1-sub2-chap3-doc0',
  'This is class1-sub2-chap3-doc1',
  'This is class1-sub2-chap3-doc2'],
 'metadatas': [{'chunk_num': 0,
   'chapter': 'chap3',
   'class': 'class1',
   'subject': 'sub2'},
  {'chunk_num': 1, 'chapter': 'chap3', 'class': 'class1', 'subject': 'sub2'},
  {'chunk_num': 2, 'chapter': 'chap3', 'class': 'class1', 'subject': 'sub2'}]}


My embeddings object also works fine, since it returns:

len(embeddings.embed_documents(["EMBED THIS"])[0])
>> 1536


Also, in my ./data directory there is an embeddings file named chroma-embeddings.parquet.
I tried the example given in the documentation, but it also shows None:

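# Import the Document class, the OpenAI embeddings, and the Chroma vector store
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma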

# Initial document content and id
initial_content = "This is an initial document content"
document_id = "doc1"

# Create an instance of Document with initial content and metadata
original_doc = Document(page_content=initial_content, metadata={"page": "0"})

# Initialize a Chroma instance with the original document
new_db = Chroma.from_documents(
    collection_name="test_collection",
    documents=[original_doc],
    embedding=OpenAIEmbeddings(),  # using the same embeddings as before
    ids=[document_id],
)


Here, too, new_db.get() gives None for the embeddings.

Answer #1 (ncecgwcz)

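You just need to specify that you also want the embeddings when calling .get. By default, Chroma's get returns only the documents and metadatas, so embeddings stays None unless you ask for it: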

# Get all embeddings
db._collection.get(include=['embeddings'])

# Get embeddings by document_id
db._collection.get(ids=['doc0', ..., 'docN'], include=['embeddings'])
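
To make the returned vectors easier to inspect, the call above can be wrapped in a small helper that maps each id to its embedding. This is only a sketch: `embeddings_by_id` is a hypothetical name, and `collection` is assumed to be a Chroma collection such as `db._collection` from the question.

```python
from typing import Any, Dict, Optional, Sequence


def embeddings_by_id(collection: Any, ids: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Map each document id to its stored embedding vector.

    `collection` is assumed to be a Chroma collection (e.g. `db._collection`);
    this helper is illustrative and not part of LangChain or Chroma.
    """
    # Chroma always returns "ids"; embeddings come back only when requested.
    wanted_ids = list(ids) if ids is not None else None
    result = collection.get(ids=wanted_ids, include=["embeddings"])
    # Pair every id with the vector at the same position in the result.
    return dict(zip(result["ids"], result["embeddings"]))
```

With the setup from the question, `embeddings_by_id(db._collection)` should give one vector per id ("0", "1", "2"), and each vector should have the same length that `embed_documents` reported, provided the same embedding model was used.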


Answer #2 (y53ybaqx)

Here is the solution:

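from langchain.docstore.document import Document
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load every file under ./document and split it into overlapping chunks
loader = DirectoryLoader("document", glob="**/*.*")
files = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,
)
docs = text_splitter.split_documents(files)

# Attach a per-chunk "topic" to the metadata, then embed and persist;
# `embedding` is assumed to be an embeddings object defined elsewhere
documents = [Document(page_content=doc.page_content, metadata={"topic": f"John's story{i}"}) for i, doc in enumerate(docs)]
db = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory="db")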

Please try running this code. I have checked that the metadata is added. Here is an image of the result.
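
One way to check the metadata without a screenshot is to pair each stored document with its metadata from the dict that `db.get()` returns. The sketch below assumes a result shaped like the output shown in the question; `pair_documents_with_metadata` is a hypothetical helper, and the sample dict is copied from the question's output rather than from a live database.

```python
from typing import Any, Dict, List, Tuple


def pair_documents_with_metadata(result: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Pair each document text with the metadata stored at the same position."""
    return list(zip(result["documents"], result["metadatas"]))


# Sample shaped like the db.get() output in the question (first two entries).
sample_result = {
    "ids": ["0", "1"],
    "embeddings": None,
    "documents": ["This is class1-sub2-chap3-doc0", "This is class1-sub2-chap3-doc1"],
    "metadatas": [
        {"chunk_num": 0, "chapter": "chap3", "class": "class1", "subject": "sub2"},
        {"chunk_num": 1, "chapter": "chap3", "class": "class1", "subject": "sub2"},
    ],
}

for text, metadata in pair_documents_with_metadata(sample_result):
    print(text, metadata)
```

In practice the argument would be `db.get()` itself, which includes documents and metadatas by default.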
