我正在使用NetCDF文件(ds),它有一个变量"zg"和4个维度(时间,级别,纬度,经度)。我试图选择一个级别上的坐标点。但是,当我进行选择并试图查看DataArray
时(使用ds.values
)或绘制时间序列(通过使用ds.values.plot()
),我的内核冻结并在某个时候死亡。我得到一个弹出错误"内核正在重新启动"。
我无法找出确切的问题,因为在绘制空间图(例如单个时间片ds[0,:,:].plot()
)时不会出现此问题。以前也不会出现此问题,因为我以前使用过多个CMIP6模型文件。
下面是我的代码:
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import numpy as np
import dask
xr.set_options(display_style='html')
import intake
from xmip.preprocessing import rename_cmip6,promote_empty_dims
import cftime
cat_url = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json" #not a local but object storage
col = intake.open_esm_datastore(cat_url)
#Using Geopotential hts for index calculation:
cat = col.search(experiment_id=['piControl'], table_id=['Amon'], variable_id=['zg'],source_id=['CESM2'],member_id = ['r1i1p1f1'], grid_label=['gn'])
z_kwargs = {'consolidated': True, 'use_cftime':True}
dset_dict = cat.to_dataset_dict(zarr_kwargs=z_kwargs)
#selecting the 500HPa level from the dataset:
dset_dict['CMIP.NCAR.CESM2.piControl.Amon.gn'] = dset_dict['CMIP.NCAR.CESM2.piControl.Amon.gn'].squeeze() #remove empty dims
GPT_500Hpa = dset_dict['CMIP.NCAR.CESM2.piControl.Amon.gn'].zg[:,5,:,:]
GPT_500Hpa[0].plot() #this works!
#Following don't work and kill the kernel:
GPT_500Hpa.mean(dim='time').plot()
GPT_500Hpa.mean(dim=('lon','lat')).plot()
#Not even the .value works:
GPT_500Hpa.mean(dim=('lon','lat')).value
#It doesn't even let me save it to a new netcdf file:
GPT_500Hpa.to_netcdf('/CMIP.NCAR.CESM2.piControl.Amon.gn_500Hpa.nc')
选择水平并对数据进行分块后,结果如下所示:
<xarray.DataArray 'zg' (time: 14400, lat: 192, lon: 288)>
dask.array<rechunk-merge, shape=(14400, 192, 288), dtype=float32, chunksize=(500, 192, 288), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
* lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
plev float64 5e+04
* time (time) object 0001-01-15 12:00:00 ... 1200-12-15 12:00:00
member_id <U8 'r1i1p1f1'
Attributes:
cell_measures: area: areacella
cell_methods: time: mean
comment: Geopotential is the sum of the specific gravitational pot...
description: Geopotential is the sum of the specific gravitational pot...
frequency: mon
id: zg
long_name: Geopotential Height
mipTable: Amon
out_name: zg
prov: Amon ((isd.003))
realm: atmos
standard_name: geopotential_height
time: time
time_label: time-mean
time_title: Temporal mean
title: Geopotential Height
type: real
units: m
variable_id: zg
当我试图查看这些值或绘制下面的一维时间序列时,内核会死机:
GPT_500Hpa[:,:,:].mean(dim=("lat","lon"),keep_attrs=True, skipna = True)
数据如下所示:
<xarray.DataArray 'zg' (time: 14400)>
dask.array<mean_agg-aggregate, shape=(14400,), dtype=float32, chunksize=(500,), chunktype=numpy.ndarray>
Coordinates:
plev float64 5e+04
* time (time) object 0001-01-15 12:00:00 ... 1200-12-15 12:00:00
member_id <U8 'r1i1p1f1'
Attributes:
cell_measures: area: areacella
cell_methods: time: mean
comment: Geopotential is the sum of the specific gravitational pot...
description: Geopotential is the sum of the specific gravitational pot...
frequency: mon
id: zg
long_name: Geopotential Height
mipTable: Amon
out_name: zg
prov: Amon ((isd.003))
realm: atmos
standard_name: geopotential_height
time: time
time_label: time-mean
time_title: Temporal mean
title: Geopotential Height
type: real
units: m
variable_id: zg
我已经升级了我的matplotlib以及pip和python,但都不起作用。
我降低了freetype的级别,假设我的matplotlib是问题的幕后黑手,但它没有帮助。
1条答案
按热度按时间n53p2ov01#
xr.DataArray.values
是一个属性,它将所有数据作为numpy数组返回。Numpy没有延迟加载的概念,所以当你引用这个属性时,你会强制xarray加载整个数组,这看起来会让你的内核崩溃。无论何时尝试访问数据的子集,尤其是当数据位于磁盘或远程设备上而尚未保存在内存中时,都应该使用xarray的Indexing and Selecting Data方法,例如
.sel
、.isel
、.loc
、.iloc
,或者直接对数据集进行切片。例如ds["varname"][0, :, 50:100]
。我推荐使用方法而不是切片,因为您可以通过名称而不是位置引用维度。另外,请注意,
ds.values.plot()
无论如何都不会工作,因为numpy数组没有绘图方法。"values"
,你将不能像现在这样使用属性访问来引用它,而是使用字典样式的键访问方法:ds["values"].plot()
最后,如果你在减少数据时遇到了麻烦,比如
ds.mean()
,数据集可能比内存大。你可以通过dask对数组进行分块来处理比内存大的数组。当你第一次访问数组时,chunk
使用一个分块方案来处理数据,其中每个块都适合内存。通常是几百个MB。请参阅working with dask上的文档了解更多信息。