I'm trying to create a TensorFlow dataset that loads and preprocesses Parquet files, but when I map my preprocessing function over it I get the following error:
StagingError: in user code:
File "<ipython-input-22-245243856ef3>", line 2, in preprocess_data *
data = load_relevant_data_subset(path)
File "<ipython-input-20-0f01af668bc5>", line 3, in load_relevant_data_subset *
data = pd.read_parquet(pq_path, columns=data_columns)
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parquet.py", line 493, in read_parquet **
return impl.read(
File "/usr/local/lib/python3.9/dist-packages/pandas/io/parquet.py", line 240, in read
result = self.api.parquet.read_table(
File "/usr/local/lib/python3.9/dist-packages/pyarrow/parquet/__init__.py", line 2780, in read_table
dataset = _ParquetDatasetV2(
File "/usr/local/lib/python3.9/dist-packages/pyarrow/parquet/__init__.py", line 2368, in __init__
[fragment], schema=schema or fragment.physical_schema,
File "pyarrow/_dataset.pyx", line 898, in pyarrow._dataset.Fragment.physical_schema.__get__
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
ArrowInvalid: Called Open() on an uninitialized FileSource
This is the preprocessing function:
def preprocess_data(path, label):
    data = load_relevant_data_subset(path)
    data = tf.where(tf.math.is_nan(data),
                    tf.reduce_mean(tf.where(tf.math.is_nan(data), tf.zeros_like(data), data)),
                    data)
    target_size = (80, 543)
    data = tf.image.resize(data, target_size, method='bilinear')
    return data, label
I then build a list of file paths and the train_dataset:
file_paths = [os.path.join(root_path, p) for p in train['path'].tolist()]
labels = train['label'].tolist()
train_dataset = tf.data.Dataset.from_tensor_slices((file_paths, labels))
and then try to map the function over it:
train_dataset = train_dataset.map(preprocess_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
which raises the error above. Is there a way to fix this?
1 Answer
You need to wrap the preprocessing function in tf.numpy_function. Inside Dataset.map the function is traced as a TensorFlow graph, so path is a symbolic tensor rather than a Python string, which is why pandas/pyarrow cannot open the file. In addition, inside the wrapped function you must convert path from bytes to a string.
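A minimal sketch of that wrapping, reusing file_paths, labels and load_relevant_data_subset from the question. It assumes load_relevant_data_subset returns a float NumPy array of shape (n_frames, 543, 3) (x/y/z columns, as in the traceback) and that the labels are integers; adjust Tout and set_shape to match your data.

import tensorflow as tf

def preprocess_data(path, label):
    # Runs eagerly inside tf.numpy_function: `path` arrives as a NumPy bytes
    # object, so decode it to a plain Python string before pandas sees it.
    path = path.decode('utf-8')
    data = load_relevant_data_subset(path)
    data = tf.convert_to_tensor(data, dtype=tf.float32)
    # Replace NaNs with the mean of the remaining values, as in the question.
    mean = tf.reduce_mean(tf.where(tf.math.is_nan(data), tf.zeros_like(data), data))
    data = tf.where(tf.math.is_nan(data), mean, data)
    data = tf.image.resize(data, (80, 543), method='bilinear')
    return data.numpy(), label

def preprocess_data_tf(path, label):
    # Graph-friendly wrapper that Dataset.map can trace.
    data, label = tf.numpy_function(
        preprocess_data, [path, label], Tout=[tf.float32, tf.int32])
    # numpy_function loses static shape information, so restore it here
    # (3 channels is an assumption based on the x/y/z columns).
    data.set_shape((80, 543, 3))
    label.set_shape(())
    return data, label

train_dataset = tf.data.Dataset.from_tensor_slices((file_paths, labels))
train_dataset = train_dataset.map(preprocess_data_tf,
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)

Restoring the shapes with set_shape after tf.numpy_function matters if you batch the dataset or feed it into a Keras model, since both rely on known static shapes.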