Creating a netCDF file from CSV using xarray with a 3D variable

Posted by gg0vcinb on 2022-12-06 in Etcd

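I am trying to convert a CSV file with year, lat, lon, and pressure columns into a 3D netCDF pressure variable with dimensions (time, lat, lon).
However, my list contains repeated values, as shown below: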

year,lon,lat,pressure
1/1/00,79.4939,34.4713,11981569640
1/1/01,79.4939,34.4713,11870476671
1/1/02,79.4939,34.4713,11858633008
1/1/00,77.9513,35.5452,11254617090
1/1/01,77.9513,35.5452,11267424230
1/1/02,77.9513,35.5452,11297377976
1/1/00,77.9295,35.5188,1031160490

The same year, lon, and lat values recur across rows, each with its own pressure.
My first attempt was the straightforward approach:

import pandas as pd
import xarray as xr

csv_file = '.csv'
df = pd.read_csv(csv_file)
df = df.set_index(["year", "lon", "lat"])
# Named `ds` rather than `xr` so the xarray module is not shadowed
ds = df.to_xarray()
ds.to_netcdf('netcdf.nc')

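So I tried to follow "How to convert a csv file to grid with Xarray?", but it crashed.
I think I need to rearrange this CSV so that each (lon, lat) pair is unique and only the pressure value changes as a function of time.
Something like this: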

longitude,latitude,1/1/2000,1/1/2001,1/1/2002....
79.4939,34.4713 11981569640 ...
77.9513,35.5452 11254617090 ... 
77.9295,35.5188 1031160490 ...

I could then use pd.melt to create my netCDF:

df = pd.melt(df, id_vars=["year","lon", "lat"], var_name="year", value_name="PRESSURE")

Here is a two-year sample of my file:
https://1drv.ms/u/s!AhZf0QH5jEVSjWQ7WNCwJsrKBwor?e=UndUkV
This is the code I am using to try to get the result I want:

filename = '13.csv'
colnames = ['year','lon','lat','pressure']
df = pd.read_csv(filename, names = colnames)

# Unparseable dates become NaT instead of raising
df["year"] = pd.to_datetime(df["year"], errors='coerce')
# Named `ds` rather than `xr` so the xarray module is not shadowed
ds = df.set_index(['year','lon','lat']).to_xarray()

#ds['time'].attrs={'units':'hours since 2018-01-01'}
ds['lat'].attrs={'units':'degrees', 'long_name':'Latitude'}
ds['lon'].attrs={'units':'degrees', 'long_name':'Longitude'}
ds['pressure'].attrs={'units':'pa', 'long_name':'Pressure'}

ds.to_netcdf('my_netcdf.nc')

yi0zb3m4 #1

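So if you want this data in netCDF (or zarr/HDF5 or any other storage format for data on a regular grid), you have a few options.
The first is to continue with your current plan, in which case you definitely need to address the total size of the resulting hypercube somehow. You could use the sparse library and save the data in a format that supports sparse data. I don't recommend this approach, but if you really want a 3D irregular grid with your stations placed on it at irregular intervals, you can do this.
Alternatively, you could regrid your data to force it onto a regular grid. This would still produce very large, sparse data, but it would be somewhat more useful than irregularly spaced coordinates. It is a good option if, for example, you want to overlay the data on another gridded dataset. If you go this route, you should probably look into using pd.cut to discretize the lat/lon values into regular bins.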
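As a rough illustration of the pd.cut idea, here is a minimal sketch. The 0.5 degree spacing, the bin edges, the `lon_bin`/`lat_bin` column names, and the choice to average values that fall in the same cell are all assumptions made for this example; the sample rows are taken from the question.

```python
import numpy as np
import pandas as pd

# A few rows in the same layout as the question's CSV.
df = pd.DataFrame({
    "year": pd.to_datetime(["2000-01-01", "2001-01-01", "2000-01-01", "2001-01-01"]),
    "lon": [79.4939, 79.4939, 77.9513, 77.9513],
    "lat": [34.4713, 34.4713, 35.5452, 35.5452],
    "pressure": [11981569640, 11870476671, 11254617090, 11267424230],
})

step = 0.5  # assumed grid spacing in degrees
lon_edges = np.arange(77.0, 80.0 + step, step)
lat_edges = np.arange(34.0, 36.0 + step, step)
# Label each bin by its center so the result has numeric coordinates.
lon_centers = (lon_edges[:-1] + lon_edges[1:]) / 2
lat_centers = (lat_edges[:-1] + lat_edges[1:]) / 2

df["lon_bin"] = pd.cut(df["lon"], bins=lon_edges, labels=lon_centers).astype(float)
df["lat_bin"] = pd.cut(df["lat"], bins=lat_edges, labels=lat_centers).astype(float)

# Average stations that land in the same cell, then build the 3D array.
# Cells with no station are filled with NaN.
gridded = (
    df.groupby(["year", "lat_bin", "lon_bin"])["pressure"]
    .mean()
    .to_xarray()
)
print(gridded.dims)
```

With a fine step and many stations, most cells in `gridded` would be NaN, which is the sparsity problem described above.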
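The third option is to treat your observations/stations/whatever as a collection of points and assign each one a common "point ID". Lat/lon then become *attributes* of the point rather than indexing coordinates. This approach requires some adjustment in how you think about xarray/netCDF, but this kind of indexing is common for observational data, where you may have several perpendicular dimensions such as point ID, location time index, band, etc., while the location and timestamp of each observation are really variables indexed by those other dimensions.
To demonstrate this, I set up a small dataset with a structure similar to yours: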

import xarray as xr, numpy as np, pandas as pd

years = pd.date_range("2000-01-01", freq="YS", periods=3)
# generate 20 random stations on earth
n_stations = 20
lats = np.random.random(size=n_stations) * 180 - 90
lons = np.random.random(size=n_stations) * 360 - 180

# generate data for all combos of (lat, lon) pairs and time
pressure = (np.random.random(size=(n_stations * len(years))) * 1.1e10 + 1e9).astype(int)

df = pd.DataFrame({'year': (list(years) * n_stations), 'lat': [l for l in lats for _ in years], 'lon': [l for l in lons for _ in years], 'pressure': pressure})

It looks like this:

In [4]: df
Out[4]:
         year        lat         lon     pressure
0  2000-01-01  47.518457 -122.971638   6592720223
1  2001-01-01  47.518457 -122.971638   3181381723
2  2002-01-01  47.518457 -122.971638   4295719754
3  2000-01-01 -61.557495  -80.201070   3843828897
4  2001-01-01 -61.557495  -80.201070  11028409576
5  2002-01-01 -61.557495  -80.201070   2369538294
6  2000-01-01 -69.549806 -108.064884   4736968141
7  2001-01-01 -69.549806 -108.064884   5362327422
8  2002-01-01 -69.549806 -108.064884   5786865879
...
55 2001-01-01   7.065455  -56.622611   1159025195
56 2002-01-01   7.065455  -56.622611   2861490045
57 2000-01-01  10.176521  -93.359717  10668195383
58 2001-01-01  10.176521  -93.359717   6179278941
59 2002-01-01  10.176521  -93.359717   8096958866

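The key point here is that we need to restructure the data so that lat and lon move together with a new point index. There are many ways to assign this index, but if you have 2D data (here, point ID by time), a simple way is to unstack the data into a pandas DataFrame: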

In [11]: reshaped = df.set_index(['year', 'lat', 'lon']).pressure.unstack('year')
    ...: reshaped
Out[11]:
year                     2000-01-01   2001-01-01   2002-01-01
lat        lon
-69.549806 -108.064884   4736968141   5362327422   5786865879
-61.557495 -80.201070    3843828897  11028409576   2369538294
-26.232121 -42.518353   11071436453   3324450900  10017446009
-17.632865 -43.825574    9624163047   4327094339   5194657461
-10.397045  13.041766    3644097094   4970975759  10215709500
-5.046885  -160.372459  10848978249   5362828700   3165559292
 2.535630   105.366159   7565267947   9150340532   1244019860
 3.070028   54.610328    5774184805   2190428768   3410656879
 7.065455  -56.622611   10487542202   1159025195   2861490045
 10.176521 -93.359717   10668195383   6179278941   8096958866
 11.533859 -8.406768     2311635381   7860849630   9199114517
 15.157955 -113.279669  11984888049  10749492217   8554513278
 20.534460 -9.486914     4636773154  11988039892   7941587610
 32.064057 -55.641618    6209291077   7651976538   9282714003
 42.013715 -55.603621   10377165416  11385104693   7612481121
 43.445033  48.639165    7650284975   2174961057   5519531845
 47.518457 -122.971638   6592720223   3181381723   4295719754
 61.276641 -34.552255   11778765056   2864520584   8978044061
 71.118582  98.074277    8543534134   1709130344   4596373347
 86.568656 -32.057453    2511358407   5623460467  11854301741

Now we can drop the lat/lon index (we'll pick it back up later) and replace it with a station ID index:

In [12]: press_df = reshaped.reset_index(drop=True).rename_axis('station_id')
    ...: press_df
Out[12]:
year         2000-01-01   2001-01-01   2002-01-01
station_id
0            4736968141   5362327422   5786865879
1            3843828897  11028409576   2369538294
2           11071436453   3324450900  10017446009
3            9624163047   4327094339   5194657461
4            3644097094   4970975759  10215709500
5           10848978249   5362828700   3165559292
6            7565267947   9150340532   1244019860
7            5774184805   2190428768   3410656879
8           10487542202   1159025195   2861490045
9           10668195383   6179278941   8096958866
10           2311635381   7860849630   9199114517
11          11984888049  10749492217   8554513278
12           4636773154  11988039892   7941587610
13           6209291077   7651976538   9282714003
14          10377165416  11385104693   7612481121
15           7650284975   2174961057   5519531845
16           6592720223   3181381723   4295719754
17          11778765056   2864520584   8978044061
18           8543534134   1709130344   4596373347
19           2511358407   5623460467  11854301741

Now let's keep track of the lat/lons, keeping their order (and therefore the station_id values) consistent:

In [13]: latlons = reshaped.index.to_frame().reset_index(drop=True).rename_axis('station_id')
    ...: latlons
Out[13]:
                  lat         lon
station_id
0          -69.549806 -108.064884
1          -61.557495  -80.201070
2          -26.232121  -42.518353
3          -17.632865  -43.825574
4          -10.397045   13.041766
5           -5.046885 -160.372459
6            2.535630  105.366159
7            3.070028   54.610328
8            7.065455  -56.622611
9           10.176521  -93.359717
10          11.533859   -8.406768
11          15.157955 -113.279669
12          20.534460   -9.486914
13          32.064057  -55.641618
14          42.013715  -55.603621
15          43.445033   48.639165
16          47.518457 -122.971638
17          61.276641  -34.552255
18          71.118582   98.074277
19          86.568656  -32.057453

Now we can re-stack the table and convert it to an xarray DataArray:

In [16]: press_df = reshaped.reset_index(drop=True).rename_axis('station_id')
    ...: press_df
Out[16]:
year         2000-01-01   2001-01-01   2002-01-01
station_id
0            4736968141   5362327422   5786865879
1            3843828897  11028409576   2369538294
2           11071436453   3324450900  10017446009
3            9624163047   4327094339   5194657461
4            3644097094   4970975759  10215709500
5           10848978249   5362828700   3165559292
6            7565267947   9150340532   1244019860
7            5774184805   2190428768   3410656879
8           10487542202   1159025195   2861490045
9           10668195383   6179278941   8096958866
10           2311635381   7860849630   9199114517
11          11984888049  10749492217   8554513278
12           4636773154  11988039892   7941587610
13           6209291077   7651976538   9282714003
14          10377165416  11385104693   7612481121
15           7650284975   2174961057   5519531845
16           6592720223   3181381723   4295719754
17          11778765056   2864520584   8978044061
18           8543534134   1709130344   4596373347
19           2511358407   5623460467  11854301741

In [17]: press_da = press_df.stack().to_xarray()
    ...: press_da
Out[17]:
<xarray.DataArray (station_id: 20, year: 3)>
array([[ 4736968141,  5362327422,  5786865879],
       [ 3843828897, 11028409576,  2369538294],
       [11071436453,  3324450900, 10017446009],
       [ 9624163047,  4327094339,  5194657461],
       [ 3644097094,  4970975759, 10215709500],
       [10848978249,  5362828700,  3165559292],
       [ 7565267947,  9150340532,  1244019860],
       [ 5774184805,  2190428768,  3410656879],
       [10487542202,  1159025195,  2861490045],
       [10668195383,  6179278941,  8096958866],
       [ 2311635381,  7860849630,  9199114517],
       [11984888049, 10749492217,  8554513278],
       [ 4636773154, 11988039892,  7941587610],
       [ 6209291077,  7651976538,  9282714003],
       [10377165416, 11385104693,  7612481121],
       [ 7650284975,  2174961057,  5519531845],
       [ 6592720223,  3181381723,  4295719754],
       [11778765056,  2864520584,  8978044061],
       [ 8543534134,  1709130344,  4596373347],
       [ 2511358407,  5623460467, 11854301741]])
Coordinates:
  * station_id  (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
  * year        (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01

Note that the dimensions here are (station_id, year), not (lat, lon). We can add (lat, lon), indexed by station_id, as coordinates:

In [19]: press_da = press_da.assign_coords(**latlons.to_xarray())

In [20]: press_da
Out[20]:
<xarray.DataArray (station_id: 20, year: 3)>
array([[ 4736968141,  5362327422,  5786865879],
       [ 3843828897, 11028409576,  2369538294],
       [11071436453,  3324450900, 10017446009],
       [ 9624163047,  4327094339,  5194657461],
       [ 3644097094,  4970975759, 10215709500],
       [10848978249,  5362828700,  3165559292],
       [ 7565267947,  9150340532,  1244019860],
       [ 5774184805,  2190428768,  3410656879],
       [10487542202,  1159025195,  2861490045],
       [10668195383,  6179278941,  8096958866],
       [ 2311635381,  7860849630,  9199114517],
       [11984888049, 10749492217,  8554513278],
       [ 4636773154, 11988039892,  7941587610],
       [ 6209291077,  7651976538,  9282714003],
       [10377165416, 11385104693,  7612481121],
       [ 7650284975,  2174961057,  5519531845],
       [ 6592720223,  3181381723,  4295719754],
       [11778765056,  2864520584,  8978044061],
       [ 8543534134,  1709130344,  4596373347],
       [ 2511358407,  5623460467, 11854301741]])
Coordinates:
  * station_id  (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
  * year        (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
    lat         (station_id) float64 -69.55 -61.56 -26.23 ... 61.28 71.12 86.57
    lon         (station_id) float64 -108.1 -80.2 -42.52 ... -34.55 98.07 -32.06

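Now we have all the data, with year perpendicular to station ID, which makes analysis along the year dimension easy without having to deal with sparse arrays.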
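To make that concrete, here is a small sketch of two common operations on this layout: a reduction along `year`, and picking stations by location using the non-index `lat` coordinate. The tiny array below is a hypothetical stand-in for `press_da` with made-up values, not the data shown above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for press_da: 3 stations x 2 years, with lat/lon
# attached as non-index coordinates along station_id.
press_da = xr.DataArray(
    np.array([[10.0, 12.0], [20.0, 22.0], [30.0, 36.0]]),
    dims=("station_id", "year"),
    coords={
        "station_id": [0, 1, 2],
        "year": pd.to_datetime(["2000-01-01", "2001-01-01"]),
        "lat": ("station_id", [-10.0, 5.0, 40.0]),
        "lon": ("station_id", [13.0, 54.0, -55.0]),
    },
)

# Reduce along the year dimension: one mean value per station.
station_mean = press_da.mean("year")

# Keep only stations north of the equator, using lat as a mask.
northern = press_da.where(press_da.lat > 0, drop=True)
print(station_mean.values, northern.station_id.values)
```

Because lat and lon ride along with `station_id`, both results still carry the location of every remaining station.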
If you like, you can now document the DataArray and Dataset and then write to netCDF:

In [24]: import datetime
    ...: ds = press_da.to_dataset(name="pressure")
    ...: ds.pressure.attrs.update({
    ...:     "units": "big numbers",
    ...:     "long_name": "Pressure!",
    ...:     "cell_method": "random numbers",
    ...: })
    ...: ds.attrs.update({
    ...:     "created": str(datetime.datetime.now()),  # str: netCDF attrs cannot hold datetime objects
    ...:     "author": "me",
    ...:     "method": "moving random data around",
    ...:     "etc": "etc",
    ...: })

In [25]: ds
Out[25]:
<xarray.Dataset>
Dimensions:     (station_id: 20, year: 3)
Coordinates:
  * station_id  (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
  * year        (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
    lat         (station_id) float64 -69.55 -61.56 -26.23 ... 61.28 71.12 86.57
    lon         (station_id) float64 -108.1 -80.2 -42.52 ... -34.55 98.07 -32.06
Data variables:
    pressure    (station_id, year) int64 4736968141 5362327422 ... 11854301741
Attributes:
    created:  2022-11-08 10:36:39.875581
    author:   me
    method:   moving random data around
    etc:      etc
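For reference, the steps above can be collected into one function and applied to rows in the question's own CSV layout. This is only a sketch under stated assumptions: the helper name `stations_to_dataset` is made up for this example, the dates are assumed to be month/day/two-digit-year, and only stations with a value for every year are included.

```python
import io
import pandas as pd
import xarray as xr

# Sample rows copied from the question's CSV.
csv_text = """year,lon,lat,pressure
1/1/00,79.4939,34.4713,11981569640
1/1/01,79.4939,34.4713,11870476671
1/1/02,79.4939,34.4713,11858633008
1/1/00,77.9513,35.5452,11254617090
1/1/01,77.9513,35.5452,11267424230
1/1/02,77.9513,35.5452,11297377976
"""

def stations_to_dataset(df):
    """Hypothetical helper: long (year, lat, lon, pressure) table -> Dataset
    with dimensions (station_id, year) and lat/lon as station coordinates."""
    # One row per unique (lat, lon) pair, one column per year.
    reshaped = df.set_index(["year", "lat", "lon"])["pressure"].unstack("year")
    # Remember each station's location, in the same row order.
    latlons = reshaped.index.to_frame().reset_index(drop=True).rename_axis("station_id")
    # Replace the (lat, lon) index with an integer station_id.
    press_df = reshaped.reset_index(drop=True).rename_axis("station_id")
    press_da = press_df.stack().to_xarray()
    press_da = press_da.assign_coords(
        lat=latlons["lat"].to_xarray(),
        lon=latlons["lon"].to_xarray(),
    )
    return press_da.to_dataset(name="pressure")

df = pd.read_csv(io.StringIO(csv_text))
# Assumed date format: month/day/two-digit year.
df["year"] = pd.to_datetime(df["year"], format="%m/%d/%y")
ds = stations_to_dataset(df)
print(ds)
```

Calling `ds.to_netcdf("my_netcdf.nc")` on the result should then write the file, provided a netCDF backend such as netCDF4 or scipy is installed.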

k7fdbhmy #2

The requested task cannot be done directly with this data: it is not on a regular horizontal grid but was collected at scattered points.

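To put it on a regular grid you would have to interpolate. But the data density is very high in some regions and quite low in others, so choosing a regular grid with a very small step is not sensible: there are more than ~40000 unique longitude values and ~30000 unique latitude values. Essentially, putting this on a regular grid would mean a 40k x 30k array.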
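A quick back-of-the-envelope check of that size, assuming one 8-byte float64 value per grid cell (the dtype is an assumption for this estimate):

```python
n_lon = 40_000  # approximate number of unique longitudes quoted above
n_lat = 30_000  # approximate number of unique latitudes quoted above
bytes_per_value = 8  # assumed float64 storage

cells = n_lon * n_lat
gigabytes = cells * bytes_per_value / 1e9
print(f"{cells:,} cells, about {gigabytes:.1f} GB per time step")
# prints "1,200,000,000 cells, about 9.6 GB per time step"
```

Nearly all of those cells would be empty, since the number of actual stations is far smaller than the number of grid cells.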
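I would suggest simply making a netCDF file that contains all the points (irregularly spaced) and using that dataset for further analysis.
Here is some code that converts the input xlsx file to netCDF: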

#!/usr/bin/env ipython
import pandas as pd
import xarray as xr
# -----------------
# Read the spreadsheet and give the columns short names
df = pd.read_excel('13.xlsx')
df.columns = ['date', 'lon', 'lat', 'pres']
# Coerce every column to numeric; values that cannot be parsed become NaN
for cval in df.columns:
    df[cval] = pd.to_numeric(df[cval], errors='coerce')
# --------------------------------------
# One row per point: the DataFrame index becomes the only dimension
ddf = xr.Dataset.from_dataframe(df)
ddf.to_netcdf('simple_netcdf.nc')
