def split_text_into_chunks(text, chunk_size):
    """Split text into chunks of the specified size."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Example usage
text = "Your long text here..."
chunk_size = 1000  # Increase the chunk size as needed
text_chunks = split_text_into_chunks(text, chunk_size)

# Ingest each chunk into Qdrant
for chunk in text_chunks:
    document = Document(
        id_=generate_unique_id(),  # Placeholder: ensure each document has a unique ID
        text=chunk,
        metadata={"your": "metadata"}  # Include only essential metadata
    )
    # Ingest the document into Qdrant
    # (upload_documents is illustrative pseudocode, not a real qdrant-client method)
    qdrant_client.upload_documents(collection_name="your_collection", documents=[document])
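The `generate_unique_id` and `upload_documents` names above are placeholders. A minimal, library-free sketch of the chunk-and-wrap step, using `uuid` for IDs and plain dicts as stand-ins for `Document` (the `make_documents` helper is hypothetical, not part of any library):

```python
import uuid

def split_text_into_chunks(text, chunk_size):
    """Split text into chunks of the specified size."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def make_documents(text, chunk_size, metadata=None):
    """Wrap each chunk in a dict with a unique ID (stand-in for a Document)."""
    return [
        {"id": str(uuid.uuid4()), "text": chunk, "metadata": metadata or {}}
        for chunk in split_text_into_chunks(text, chunk_size)
    ]

docs = make_documents("a" * 2500, chunk_size=1000)
print(len(docs))               # 3 chunks: 1000 + 1000 + 500 characters
print(len(docs[-1]["text"]))   # 500
```

With real LlamaIndex `Document` objects, the same dict fields map onto `id_`, `text`, and `metadata`.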
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.schema import TextNode
import qdrant_client

client = qdrant_client.QdrantClient()
vector_store = QdrantVectorStore(
    collection_name="your_collection",
    client=client,
    batch_size=64,   # Adjust batch size as needed
    parallel=1,      # Adjust parallel processes as needed
    max_retries=3    # Adjust max retries as needed
)
# Example usage
text = "Your long text here..."
chunk_size = 1000  # Increase the chunk size as needed
text_chunks = split_text_into_chunks(text, chunk_size)
nodes = [
    TextNode(id_=generate_unique_id(), text=chunk, metadata={"your": "metadata"})  # generate_unique_id is a placeholder
    for chunk in text_chunks
]

# Add nodes to the vector store (nodes need embeddings set before insertion)
vector_store.add(nodes)
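The `batch_size=64` setting controls how many points go to Qdrant per request. A library-free sketch of that batching logic (the `batched` helper is hypothetical, shown only to illustrate how 200 nodes would be grouped):

```python
def batched(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

nodes = list(range(200))  # stand-ins for 200 TextNodes
batches = list(batched(nodes, 64))
print([len(b) for b in batches])  # [64, 64, 64, 8]
```

Larger batches mean fewer round-trips but bigger request payloads; `parallel` then controls how many such batches are uploaded concurrently.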
from llama_index.core.node_parser import TokenTextSplitter

# Initialize the TokenTextSplitter with appropriate chunk size and overlap
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
# Example usage
text = "Your long text here..."
metadata_str = "Your metadata here..."
text_chunks = splitter.split_text_metadata_aware(text, metadata_str)

# Ingest each chunk into Qdrant
for chunk in text_chunks:
    document = Document(
        id_=generate_unique_id(),  # Placeholder: ensure each document has a unique ID
        text=chunk,
        metadata={"your": "metadata"}  # Include only essential metadata
    )
    # Ingest the document into Qdrant
    # (upload_documents is illustrative pseudocode, not a real qdrant-client method)
    qdrant_client.upload_documents(collection_name="your_collection", documents=[document])
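The point of `split_text_metadata_aware` is that it reserves room for the metadata string inside each chunk's budget. A simplified sketch of that budget arithmetic, at character rather than token granularity (the `effective_chunk_size` helper is hypothetical, for illustration only):

```python
def effective_chunk_size(chunk_size, metadata_str):
    """Room left for text once metadata is reserved inside each chunk."""
    budget = chunk_size - len(metadata_str)
    if budget <= 0:
        raise ValueError("Metadata alone fills the chunk; shrink it or raise chunk_size.")
    return budget

print(effective_chunk_size(1000, "source: report.pdf"))  # 982
```

This is why the original warning fires when metadata length approaches the chunk size: the text budget per chunk shrinks toward zero, so the splitter produces far more chunks.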
2 Answers

yyhrrdl81#
To resolve the warning about metadata length approaching the chunk size when ingesting text chunks into Qdrant, you should consider reducing the size of your metadata or increasing the chunk size. This helps prevent an explosion in the number of text chunks and excessive ingestion time.

One possible approach is to adjust the chunk size before ingestion, as the first code block above shows.

You can also use the QdrantVectorStore class from the LlamaIndex integration to manage the ingestion process more efficiently. It provides methods for handling batched uploads, parallel processing, and retries, which can improve ingestion performance; the second code block above shows how to use it.

In addition, the TokenTextSplitter class in LlamaIndex handles the chunking process more effectively, especially where metadata is involved: it accounts for the metadata length when splitting text into chunks. The third code block above shows how to use it.

By reducing the size of each text chunk, minimizing metadata, and using QdrantVectorStore and TokenTextSplitter for efficient ingestion, you can resolve the warning and improve ingestion performance.

ih99xse12#
Either decrease the amount of text you are putting into metadata (document.metadata = {}, or some smaller subset), or set the metadata excludes on your input nodes/documents:

document.excluded_llm_metadata_keys = ["key1", ...]
document.excluded_embed_metadata_keys = ["key1", ...]

It's taking forever because it's probably creating a ton of nodes 😓
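A library-free sketch of what those exclude lists do: the listed keys are dropped before metadata is serialized for the LLM or the embedding model, so they no longer inflate the per-chunk metadata length (the key names and the `visible_metadata` helper here are hypothetical):

```python
def visible_metadata(metadata, excluded_keys):
    """Return metadata with excluded keys stripped, as the LLM/embedder would see it."""
    return {k: v for k, v in metadata.items() if k not in excluded_keys}

metadata = {"title": "Report", "raw_html": "<div>...</div>", "page": 7}
excluded_embed_metadata_keys = ["raw_html"]

print(visible_metadata(metadata, excluded_embed_metadata_keys))
# {'title': 'Report', 'page': 7}
```

Excluding bulky keys like raw HTML this way keeps them retrievable in the store while stopping them from eating into the chunk-size budget.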