I have a versioned Azure ML dataset. The data is append-only: a new batch arrives every week, so the dataset gets a new weekly version. It is organized in an Azure Blob container as follows:
.data
├── week1
│   └── data1.csv
└── week2
    └── data2.csv
# week1 data1.csv
country,code
United States,US
India,IN
United Kingdom,UK
# week2 data2.csv
country,code
China,CN
This dataset is registered in my Azure ML workspace. I also have a notebook in an Azure Databricks workspace where I access the dataset:
from azureml.core import Workspace, Datastore, Dataset
subscription_id = "###"
resource_group = "####"
workspace_name = "####"
workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = workspace.get_default_datastore()
dataset_ver1 = Dataset.get_by_name(workspace, name="demo_data", version=1)
print(dataset_ver1.to_pandas_dataframe())
# country code
# 0 United States US
# 1 India IN
# 2 United Kingdom UK
dataset_ver1.to_spark_dataframe().show(20)
# +--------------+----+
# | country|code|
# +--------------+----+
# | United States| US|
# | India| IN|
# |United Kingdom| UK|
# +--------------+----+
dataset_ver2 = Dataset.get_by_name(workspace, name="demo_data", version="latest")
print(dataset_ver2.to_pandas_dataframe())
# country code
# 0 United States US
# 1 India IN
# 2 United Kingdom UK
# 3 China CN
dataset_ver2.to_spark_dataframe().show(20)
# +--------------+----+
# | country|code|
# +--------------+----+
# | United States| US|
# | India| IN|
# |United Kingdom| UK|
# | China| CN|
# | United States| US|
# | India| IN|
# |United Kingdom| UK|
# | China| CN|
# +--------------+----+
Look at the Spark DataFrame output for version 2: every row is duplicated, while the pandas DataFrame looks as expected. Is this a bug in the Azure ML API, or am I doing something wrong?
Please help.