PySpark: imputing missing values in a time series

Posted by dxxyhpgq on 2023-05-06 in Spark

I am using PySpark to analyze some time series data.
My data looks like this:

Key | time   | value
--------------------
 A  |   t0   |  null
 A  |   t1   |  1.5
 A  |   t2   |  1.7
 B  |   t3   |  0.5
 B  |   t4   |  null
 B  |   t5   |  1.1
 C  |   t6   |  4.3
 C  |   t7   |  3.4
 C  |   t8   |  null
 C  |   t9   |  2.7

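It is safe to assume that the relationship between "time" and "value" is approximately linear.
I would like to interpolate the null values by training a linear regression on the remaining (time, value) data points for each key.
For example, fit a regression on (t6, 4.3), (t7, 3.4), (t9, 2.7) to fill the null value at t8.
Pandas has a df.interpolate() function, but I can't find anything similar in PySpark.

Note that t0-t9 are irregularly spaced.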

Source: https://stackoverflow.com/questions/57994022/pyspark-impute-missing-values-in-time-series

1 Answer

e5nszbig1#

The pandas API on Spark (pyspark.pandas) now has interpolate:

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0

>>> df.interpolate(method='linear')
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
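
One caveat for irregularly spaced times: in plain pandas, method='linear' ignores the index and treats rows as equally spaced, while method='index' weights by the index values. The sketch below uses plain pandas with made-up numeric timestamps (6.0, 7.0, 7.5, 9.0 stand in for t6-t9, which are not given in the question) to show the difference. Whether the pandas API on Spark accepts methods other than 'linear' is something to check in its docs.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric timestamps as the index; the values are key C from the question.
s = pd.Series([4.3, 3.4, np.nan, 2.7], index=[6.0, 7.0, 7.5, 9.0])

# 'linear' ignores the index: the gap is filled halfway between its neighbours.
by_position = s.interpolate(method="linear")

# 'index' uses the actual index distances between the neighbouring points.
by_index = s.interpolate(method="index")

print(by_position.loc[7.5], by_index.loc[7.5])
```

With these made-up timestamps the two results differ, because 7.5 is much closer to 7.0 than to 9.0.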

See the interpolate docs.
You can also consider using the Imputer (see the imputer docs).
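
For the per-key regression described in the question, one possible approach is to fit a line to the known points of each key and evaluate it at the missing times. The following is only a minimal sketch under assumptions: the function name fill_by_regression is hypothetical, the "time" column is assumed to be numeric, and the timestamps in the sample are made up since the question does not give t6-t9.

```python
import numpy as np
import pandas as pd


def fill_by_regression(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fill null 'value' entries of ONE key with a least-squares line.

    Hypothetical helper: assumes columns 'time' (numeric) and 'value'.
    """
    pdf = pdf.sort_values("time").copy()
    known = pdf["value"].notna()
    # A line needs at least two known points; otherwise leave the group as-is.
    if known.sum() >= 2 and not known.all():
        slope, intercept = np.polyfit(
            pdf.loc[known, "time"], pdf.loc[known, "value"], deg=1
        )
        pdf.loc[~known, "value"] = slope * pdf.loc[~known, "time"] + intercept
    return pdf


# Key C from the question, with made-up numeric timestamps.
sample = pd.DataFrame({
    "Key": ["C", "C", "C", "C"],
    "time": [6.0, 7.0, 7.5, 9.0],
    "value": [4.3, 3.4, None, 2.7],
})
filled = fill_by_regression(sample)
print(filled)
```

On a Spark DataFrame df with the same three columns, a function like this could be applied per key with df.groupBy("Key").applyInPandas(fill_by_regression, schema=df.schema). Each group is handed to the function as a pandas DataFrame, so every key's rows have to fit in memory on one executor.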
