What is the problem?
I am using the ollama Python library to get all of these results.
When I create embeddings with ollama.embed(), the embeddings get progressively worse as the batch grows, compared to creating them one at a time. There appears to be a jump at batch size 16 and above. All of my tests assume the embeddings come back in the same order as the inputs, since I filed an issue about that a while ago ( #6187 ) and it was resolved ( #6187 ).
Since I am using these embeddings for a RAG application, retrieval performance has to hold up for every embedding that gets inserted.
I ran the function "chunk_text" on the text of Peter Pan ( https://www.gutenberg.org/files/16/16-h/16-h.htm ) with chunk_size = 256 and max_characters = 65536 (256 chunks of 256 characters each).
I ran the function "test" with the chunks from the call above and batch_size_list = [2, 4, 8, 16, 32, 64, 128, 256].
Below is all of the code, the results, and plots of some of the results.
```python
import ollama
import numpy as np
import os
from typing import List, Tuple
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

load_dotenv()

# Embedding model used was "bge-large:latest"
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
EPS = 1e-4


def chunk_text(text: str, chunk_size: int, max_characters: int) -> List[str]:
    """Split text into fixed-size chunks, reading at most max_characters."""
    chunks = []
    for i in range(0, min(len(text), max_characters), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks


# Used first few chapters of Peter Pan
text = ""
chunk_size = 256
# 256 is the max batch size that is defined later
chunks = chunk_text(text, chunk_size, chunk_size * 256)


def embed_string(s: str) -> np.ndarray:
    """Embed a single string and return its embedding vector."""
    return np.array(ollama.embed(
        input=s,
        model=EMBEDDING_MODEL,
        options={},
        truncate=False,
    )["embeddings"])[0]


def embed_list(s: List[str]) -> np.ndarray:
    """Embed a batch of strings and return a 2-D array of embeddings."""
    return np.array(ollama.embed(
        input=s,
        model=EMBEDDING_MODEL,
        options={},
        truncate=False,
    )["embeddings"])


def test(
    list_of_string: List[str], batch_sizes: List[int]
) -> Tuple[List[int], List[float], List[float], List[float], List[float]]:
    avg_distances = []
    avg_similarities = []
    max_distances = []
    min_similarities = []
    for batch_size in batch_sizes:
        print(f"Results for batch size: {batch_size}")
        # Embed the same strings one at a time and as a single batch
        singles = np.array([embed_string(s) for s in list_of_string[:batch_size]])
        as_list = embed_list(list_of_string[:batch_size])
        # Euclidean distance between each single embedding and its batched counterpart
        distances = []
        for single_embedding, as_list_embedding in zip(singles, as_list):
            distance = np.sqrt(((single_embedding - as_list_embedding) ** 2).sum())
            distances.append(distance)
        distances = np.array(distances)
        mean_dist = np.mean(distances)
        max_dist = np.max(distances)  # avoid shadowing the built-in max()
        avg_distances.append(mean_dist)
        max_distances.append(max_dist)
        print("Euclidean Distance:")
        print(f"\tMean of euclidean distances: {mean_dist}")
        print(f"\tMax euclidean distance: {max_dist}")
        # Cosine similarity between the same pairs
        similarities = []
        for single_embedding, as_list_embedding in zip(singles, as_list):
            vector1 = single_embedding.reshape(1, -1)
            vector2 = as_list_embedding.reshape(1, -1)
            similarities.append(cosine_similarity(vector1, vector2))
        similarities = np.array(similarities)
        mean_sim = np.mean(similarities)
        min_sim = np.min(similarities)  # avoid shadowing the built-in min()
        avg_similarities.append(mean_sim)
        min_similarities.append(min_sim)
        print("Cosine Similarity:")
        print(f"\tMean of cosine similarities: {mean_sim}")
        print(f"\tMin cosine similarity: {min_sim}")
        print("==========================================================")
    return (batch_sizes, avg_distances, avg_similarities, max_distances, min_similarities)
```
Results:

```
Results for batch size: 2
Euclidean Distance:
	Mean of euclidean distances: 0.0027100650691554615
	Max euclidean distance: 0.003069791207141852
Cosine Similarity:
	Mean of cosine similarities: 0.999996263071194
	Min cosine similarity: 0.99999528818776
==========================================================
Results for batch size: 4
Euclidean Distance:
	Mean of euclidean distances: 0.002698965850379388
	Max euclidean distance: 0.0032083663351101474
Cosine Similarity:
	Mean of cosine similarities: 0.9999962901587796
	Min cosine similarity: 0.9999948531925777
==========================================================
Results for batch size: 8
Euclidean Distance:
	Mean of euclidean distances: 0.003292175370207458
	Max euclidean distance: 0.0038060000679778546
Cosine Similarity:
	Mean of cosine similarities: 0.9999945197318343
	Min cosine similarity: 0.999992757181494
==========================================================
Results for batch size: 16
Euclidean Distance:
	Mean of euclidean distances: 0.11461230989305338
	Max euclidean distance: 1.136748198080119
Cosine Similarity:
	Mean of cosine similarities: 0.946342128810411
	Min cosine similarity: 0.35390177365096614
==========================================================
Results for batch size: 32
Euclidean Distance:
	Mean of euclidean distances: 0.08102131219835153
	Max euclidean distance: 0.8772319282635773
Cosine Similarity:
	Mean of cosine similarities: 0.9674320902167836
	Min cosine similarity: 0.6152323203539565
==========================================================
Results for batch size: 64
Euclidean Distance:
	Mean of euclidean distances: 0.09294858544026222
	Max euclidean distance: 1.1095371609913112
Cosine Similarity:
	Mean of cosine similarities: 0.960566610535093
	Min cosine similarity: 0.38446375093677954
==========================================================
Results for batch size: 128
Euclidean Distance:
	Mean of euclidean distances: 0.08298749768139443
	Max euclidean distance: 0.9481241092937398
Cosine Similarity:
	Mean of cosine similarities: 0.9696922749059266
	Min cosine similarity: 0.5505303906957912
==========================================================
Results for batch size: 256
Euclidean Distance:
	Mean of euclidean distances: 0.08726932907397295
	Max euclidean distance: 1.0992560821737951
Cosine Similarity:
	Mean of cosine similarities: 0.966701897336651
	Min cosine similarity: 0.39581805237428414
==========================================================
```
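Since the degradation only shows up at batch size 16 and above, one possible workaround (my own suggestion, not something from the ollama docs) is to cap the batch size actually sent to ollama.embed() and stitch the results back together in order. A sketch, assuming an `embed_list`-style function like the one defined above:

```python
from typing import Callable, List
import numpy as np

MAX_BATCH = 8  # below the size where degradation was observed


def embed_in_sub_batches(strings: List[str],
                         embed_fn: Callable[[List[str]], np.ndarray],
                         max_batch: int = MAX_BATCH) -> np.ndarray:
    """Embed strings in sub-batches of at most max_batch, preserving input order."""
    parts = [embed_fn(strings[i:i + max_batch])
             for i in range(0, len(strings), max_batch)]
    return np.concatenate(parts, axis=0)
```

Called as `embed_in_sub_batches(chunks, embed_list)`, this trades some throughput for the per-item quality measured at small batch sizes; whether that trade-off is acceptable depends on the application.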
5 replies
dfddblmv1#
Hi @jorgetrejo36. I wanted to run your code to see if I could reproduce this on macOS, but some parts are missing. Could you provide them?
bxpogfeg2#
Sorry about this one - looking into it.
bvpmtnay3#
Which model are you using? I was unable to reproduce it.
ia2d9nvy4#
I am using bge-large:latest.
nbewdwxp5#
@jorgetrejo36 Does the issue still exist?