ollama batch embeddings get progressively worse as the batch size increases

tnkciper · posted 2 months ago in: Other
Follow (0) | Answers (5) | Views (31)

What is the problem?

I'm using the ollama Python library for all of this.
When I create embeddings with ollama.embed(), the embeddings get progressively worse as the batch grows, compared with creating them one at a time. There appears to be a jump at batch size 16 and above. All of my tests assume that the embeddings come back in the same order as the inputs, since I filed an issue about that a while ago (#6187) and it was resolved (#6187).
Since I'm using these embeddings for a RAG application, retrieval quality has to hold up for every embedding that gets inserted.
I ran the function "chunk_text" on the text of Peter Pan (https://www.gutenberg.org/files/16/16-h/16-h.htm) with chunk_size = 256 and max_characters = 65536 (256 chunks of 256 characters each).
I then ran the function "test" on the chunks from that call, with batch_size_list [2, 4, 8, 16, 32, 64, 128, 256].
Below is all the code, the results, and plots of some of the results.

import ollama
import numpy as np
import os
from typing import List
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

load_dotenv()

# Embedding model used was "bge-large:latest"
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
EPS=1e-4

def chunk_text(text: str, chunk_size: int, max_characters: int) -> List[str]:
    chunks = []
    for i in range(0, min(len(text), max_characters), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

# Used first few chapters of Peter Pan
text = ""

chunk_size = 256
# 256 is the max batch size that is defined later
chunks = chunk_text(text, chunk_size, chunk_size * 256)

def embed_string(s: str) -> np.ndarray:
    return np.array(ollama.embed(
        input=s,
        model=EMBEDDING_MODEL,
        truncate=False,
    )["embeddings"])[0]

def embed_list(strings: List[str]) -> np.ndarray:
    return np.array(ollama.embed(
        input=strings,
        model=EMBEDDING_MODEL,
        truncate=False,
    )["embeddings"])

def test(list_of_string: List[str], batch_sizes: List[int]) -> tuple:
    avg_distances = []
    avg_similarities = []

    max_distances = []
    min_similarities = []

    for batch_size in batch_sizes:
        print(f"Results for batch size: {batch_size}")
        singles = np.array([embed_string(s) for s in list_of_string[:batch_size]])
        as_list = embed_list(list_of_string[:batch_size])

        # Euclidean distance between each single embedding and its batched counterpart
        distances = []
        for single_embedding, as_list_embedding in zip(singles, as_list):
            distance = np.sqrt(((single_embedding - as_list_embedding) ** 2).sum())
            distances.append(distance)

        distances = np.array(distances)

        mean_distance = np.mean(distances)
        max_distance = np.max(distances)

        avg_distances.append(mean_distance)
        max_distances.append(max_distance)

        print("Euclidean Distance:")
        print(f"\tMean of euclidean distances: {mean_distance}")
        print(f"\tMax euclidean distance: {max_distance}")

        # Cosine similarity between each single embedding and its batched counterpart
        similarities = []
        for single_embedding, as_list_embedding in zip(singles, as_list):
            vector1 = single_embedding.reshape(1, -1)
            vector2 = as_list_embedding.reshape(1, -1)
            similarity = cosine_similarity(vector1, vector2)
            similarities.append(similarity)

        similarities = np.array(similarities)

        mean_similarity = np.mean(similarities)
        min_similarity = np.min(similarities)

        avg_similarities.append(mean_similarity)
        min_similarities.append(min_similarity)

        print("Cosine Similarity:")
        print(f"\tMean of cosine similarites: {mean_similarity}")
        print(f"\tMin cosine similarity: {min_similarity}")

        print("==========================================================")

    return (batch_sizes, avg_distances, avg_similarities, max_distances, min_similarities)
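As a side note, the two per-pair loops above can be collapsed into plain NumPy. A minimal sketch (assuming `singles` and `as_list` are `(n, d)` arrays like the ones built inside `test`) that produces the same four summary statistics:

```python
import numpy as np

def compare_embeddings(singles: np.ndarray, as_list: np.ndarray):
    """Row-wise Euclidean distance and cosine similarity between two (n, d) arrays."""
    # Euclidean distance of each row pair.
    distances = np.linalg.norm(singles - as_list, axis=1)
    # Cosine similarity: row-pair dot products over the product of the row norms.
    dots = np.einsum("ij,ij->i", singles, as_list)
    norms = np.linalg.norm(singles, axis=1) * np.linalg.norm(as_list, axis=1)
    similarities = dots / norms
    return distances.mean(), distances.max(), similarities.mean(), similarities.min()
```

This removes the sklearn dependency and the reshape-to-`(1, -1)` dance, since `cosine_similarity` was only ever called on one pair at a time.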

Results:

Results for batch size: 2
Euclidean Distance:
	Mean of euclidean distances: 0.0027100650691554615
	Max euclidean distance: 0.003069791207141852
Cosine Similarity:
	Mean of cosine similarites: 0.999996263071194
	Min cosine similarity: 0.99999528818776
==========================================================
Results for batch size: 4
Euclidean Distance:
	Mean of euclidean distances: 0.002698965850379388
	Max euclidean distance: 0.0032083663351101474
Cosine Similarity:
	Mean of cosine similarites: 0.9999962901587796
	Min cosine similarity: 0.9999948531925777
==========================================================
Results for batch size: 8
Euclidean Distance:
	Mean of euclidean distances: 0.003292175370207458
	Max euclidean distance: 0.0038060000679778546
Cosine Similarity:
	Mean of cosine similarites: 0.9999945197318343
	Min cosine similarity: 0.999992757181494
==========================================================
Results for batch size: 16
Euclidean Distance:
	Mean of euclidean distances: 0.11461230989305338
	Max euclidean distance: 1.136748198080119
Cosine Similarity:
	Mean of cosine similarites: 0.946342128810411
	Min cosine similarity: 0.35390177365096614
==========================================================
Results for batch size: 32
Euclidean Distance:
	Mean of euclidean distances: 0.08102131219835153
	Max euclidean distance: 0.8772319282635773
Cosine Similarity:
	Mean of cosine similarites: 0.9674320902167836
	Min cosine similarity: 0.6152323203539565
==========================================================
Results for batch size: 64
Euclidean Distance:
	Mean of euclidean distances: 0.09294858544026222
	Max euclidean distance: 1.1095371609913112
Cosine Similarity:
	Mean of cosine similarites: 0.960566610535093
	Min cosine similarity: 0.38446375093677954
==========================================================
Results for batch size: 128
Euclidean Distance:
	Mean of euclidean distances: 0.08298749768139443
	Max euclidean distance: 0.9481241092937398
Cosine Similarity:
	Mean of cosine similarites: 0.9696922749059266
	Min cosine similarity: 0.5505303906957912
==========================================================
Results for batch size: 256
Euclidean Distance:
	Mean of euclidean distances: 0.08726932907397295
	Max euclidean distance: 1.0992560821737951
Cosine Similarity:
	Mean of cosine similarites: 0.966701897336651
	Min cosine similarity: 0.39581805237428414
==========================================================
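Since the degradation above only shows up from batch size 16 onward, one client-side workaround until the cause is found is to cap the request size and concatenate the results. A sketch, not a fix; the cutoff of 8 is an assumption read off the results above, and `embed_fn` stands in for any callable mapping a list of strings to an `(n, d)` array, such as the `embed_list` function from the code:

```python
import numpy as np
from typing import Callable, List

def embed_in_sub_batches(strings: List[str],
                         embed_fn: Callable[[List[str]], np.ndarray],
                         max_batch: int = 8) -> np.ndarray:
    """Embed `strings` in sub-batches of at most `max_batch`, then stack the rows."""
    parts = [embed_fn(strings[i:i + max_batch])
             for i in range(0, len(strings), max_batch)]
    return np.concatenate(parts, axis=0)
```

Order is preserved because each sub-batch is embedded and concatenated in input order, so the same index-alignment assumption from #6187 carries over.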


dfddblmv1#

Hi @jorgetrejo36. I'd like to run your code and see whether I can reproduce this on macOS, but some parts are missing. Could you provide them?


bxpogfeg2#

Sorry about this issue; it's being looked into.


bvpmtnay3#

Which model are you using? I can't reproduce this with:

  "nomic-embed-text:latest",
  "paraphrase-multilingual:latest",
  "snowflake-arctic-embed:latest",
  "mxbai-embed-large:latest",
  "bge-large:latest",
  "all-minilm:l12",


ia2d9nvy4#

I'm using bge-large:latest.


nbewdwxp5#

@jorgetrejo36 is this still an issue?
