Bug Description
VectorStoreIndex raises a metadata type error when EntityExtractor is used as a transformation in an IngestionPipeline. EntityExtractor stores its entities as a list, but the flat-metadata check run when nodes are added to the vector store only accepts str, int, float, or None values.
Other extractors (e.g. QuestionsAnsweredExtractor, KeywordExtractor) return strings.
Version
0.10.10
Steps to Reproduce
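from llama_index.core import VectorStoreIndex
from llama_index.core.extractors import KeywordExtractor, QuestionsAnsweredExtractor, TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.extractors.entity import EntityExtractor

# llm, documents and storage_context are defined elsewhere (see the traceback for storage_context)
pipeline = IngestionPipeline(transformations=[
    SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=OpenAIEmbedding()),
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    EntityExtractor(prediction_threshold=0.5),
    KeywordExtractor(keywords=10, llm=llm),
])
nrma_home_nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nrma_home_nodes, storage_context=storage_context)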
Relevant Logs/Tracebacks
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-81-d5bf8332a4b1> in <cell line: 7>()
5 vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
6 storage_context = StorageContext.from_defaults(vector_store=vector_store)
----> 7 index = VectorStoreIndex(nrma_home_nodes, storage_context=storage_context)
7 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in __init__(self, nodes, use_async, store_nodes_override, embed_model, insert_batch_size, objects, index_struct, storage_context, callback_manager, transformations, show_progress, service_context, **kwargs)
72
73 self._insert_batch_size = insert_batch_size
---> 74 super().__init__(
75 nodes=nodes,
76 index_struct=index_struct,
/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/base.py in __init__(self, nodes, objects, index_struct, storage_context, callback_manager, transformations, show_progress, service_context, **kwargs)
89 if index_struct is None:
90 nodes = nodes or []
---> 91 index_struct = self.build_index_from_nodes(
92 nodes + objects # type: ignore
93 )
/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in build_index_from_nodes(self, nodes, **insert_kwargs)
305 )
306
--> 307 return self._build_index_from_nodes(nodes, **insert_kwargs)
308
309 def _insert(self, nodes: Sequence[BaseNode], **insert_kwargs: Any) -> None:
/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in _build_index_from_nodes(self, nodes, **insert_kwargs)
277 run_async_tasks(tasks)
278 else:
--> 279 self._add_nodes_to_index(
280 index_struct,
281 nodes,
/usr/local/lib/python3.10/dist-packages/llama_index/core/indices/vector_store/base.py in _add_nodes_to_index(self, index_struct, nodes, show_progress, **insert_kwargs)
231 for nodes_batch in iter_batch(nodes, self._insert_batch_size):
232 nodes_batch = self._get_node_with_embedding(nodes_batch, show_progress)
--> 233 new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
234
235 if not self._vector_store.stores_text or self._store_nodes_override:
/usr/local/lib/python3.10/dist-packages/llama_index/vector_stores/chroma/base.py in add(self, nodes, **add_kwargs)
236 for node in node_chunk:
237 embeddings.append(node.get_embedding())
--> 238 metadata_dict = node_to_metadata_dict(
239 node, remove_text=True, flat_metadata=self.flat_metadata
240 )
/usr/local/lib/python3.10/dist-packages/llama_index/core/vector_stores/utils.py in node_to_metadata_dict(node, remove_text, text_field, flat_metadata)
41
42 if flat_metadata:
---> 43 _validate_is_flat_dict(metadata)
44
45 # store entire node as json string - some minor text duplication
/usr/local/lib/python3.10/dist-packages/llama_index/core/vector_stores/utils.py in _validate_is_flat_dict(metadata_dict)
25 raise ValueError("Metadata key must be str!")
26 if not isinstance(val, (str, int, float, type(None))):
---> 27 raise ValueError(
28 f"Value for metadata {key} must be one of (str, int, float, None)"
29 )
ValueError: Value for metadata entities must be one of (str, int, float, None)
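The failing check can be reproduced without building an index. The sketch below paraphrases the two conditions visible in the last traceback frame; the real function lives in llama_index/core/vector_stores/utils.py and may differ in detail. The name `validate_is_flat_dict` and the sample metadata values are stand-ins for illustration only.

```python
from typing import Any, Dict


def validate_is_flat_dict(metadata: Dict[str, Any]) -> None:
    """Paraphrase of the flat-metadata check shown in the traceback."""
    for key, val in metadata.items():
        if not isinstance(key, str):
            raise ValueError("Metadata key must be str!")
        if not isinstance(val, (str, int, float, type(None))):
            raise ValueError(
                f"Value for metadata {key} must be one of (str, int, float, None)"
            )


# Illustrative values: a string-valued entry passes the check.
validate_is_flat_dict({"keywords": "home, insurance"})

# A list-valued entry, like the one EntityExtractor produces, is rejected.
try:
    validate_is_flat_dict({"entities": ["NRMA", "Sydney"]})
except ValueError as err:
    error_message = str(err)
```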
2 Answers
tvokkenx #1
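Hey @mike4lyf, back with another tech puzzle, I see! Hope you're doing well.
Thanks for reporting this. It looks like EntityExtractor returns a list of entities, which triggers the ValueError when VectorStoreIndex processes the metadata. As the traceback shows, metadata values must be one of str, int, float, or None.
The EntityExtractor class in the LlamaIndex codebase extracts entities from text and inserts them into node metadata. It uses the SpanMarker library for entity extraction, with tomaarsen/span-marker-mbert-base-multinerd as the default model, and returns the entities as a list of dictionaries. The relevant code is in llama-index-integrations/extractors/llama-index-extractors-entity/llama_index/extractors/entity/base.py.
To work around this, you can modify EntityExtractor to return a string representation of the entities instead of a list, for example by joining the entities with a delimiter. That ensures EntityExtractor returns a string that VectorStoreIndex can process.
Let me know whether this resolves your issue or if you need further help.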
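A minimal sketch of the joining approach, applied to the metadata after extraction rather than by editing EntityExtractor itself. `flatten_metadata` is a hypothetical helper, not part of LlamaIndex, and the sample values are made up. It assumes a delimiter-separated string is acceptable for how the metadata will be used later.

```python
from typing import Any, Dict


def flatten_metadata(metadata: Dict[str, Any], sep: str = ", ") -> Dict[str, Any]:
    """Return a copy in which list, tuple and set values are joined into one string."""
    flat: Dict[str, Any] = {}
    for key, val in metadata.items():
        if isinstance(val, (list, tuple, set)):
            # Sort the items so that sets produce a stable, repeatable string.
            flat[key] = sep.join(sorted(str(item) for item in val))
        else:
            # Scalars (str, int, float, None) are passed through unchanged.
            flat[key] = val
    return flat


# Illustrative metadata resembling what the extractors might attach to a node.
example = {"entities": ["Sydney", "NRMA"], "document_title": "Home policy"}
flattened = flatten_metadata(example)
```

With the nodes from the report, this could be applied as `node.metadata = flatten_metadata(node.metadata)` for each node in `nrma_home_nodes` before constructing the VectorStoreIndex.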
rwqw0loc #2
What is the current import for EntityExtractor, and which modules are supported? Thanks.