Python与 Impala

n53p2ov0  于 2021-06-26  发布在  Impala
关注(0)|答案(1)|浏览(357)

我有一个impala表,我想用ibis查询它。该表如下所示:

id  | timestamp
-------------------
 A     | 5
 A     | 7
 A     | 3
 B     | 9
 B     | 5

我想去 group_by 此表根据 id 以及 timestamp range . 分组操作最终应该产生一个 grouped 然后可以对其应用聚合的对象。例如:
第1组条件: id == A; 4 < timestamp < 11 第2组条件: id == A; 1 < timestamp < 6 第3组条件: id == B; 4 < timestamp < 7 产生 grouped 对象包含以下组:
第1组:

id  | timestamp
-------------------
 A     | 5
 A     | 7

第2组:

id  | timestamp
-------------------
 A     | 5
 A     | 3

第3组:

id  | timestamp
-------------------
 B     | 5

一旦我有了组,我将执行各种聚合以获得最终结果。如果有人能帮我搞清楚这一组的话,我将不胜感激,即使是一个正规的Pandas表达也会很有帮助!

3zwtqj6y

3zwtqj6y1#

这里有一个例子 groupby (无下划线):

df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,3,4,5,6]})

为您的应用程序创建一个grouper列 timestamp .

df["my interval"] = (df["timestamp"] > 3 )&  (df["timestamp"] <5)
"you need some _data_ columns, i.e. those which you do not use for grouping"
df["dummy"] = 1 
df.groupby(["id", "my interval"]).agg("count")["dummy"]

或者您可以同时使用:

df["something that I need"] = df["my interval"] & (df["id"] == "b")
df.groupby(["something that I need"]).agg("count")["dummy"]

您可能还希望应用整数除法来生成时间间隔:

df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,13,14,25,26], "sales": [0,4,2,3,6,7]})
epoch = 10
df["my interval"] = epoch* (df["timestamp"] // epoch)
df.groupby(["my interval"]).agg(sum)["sales"]

编辑:

你的例子:

import pandas as pd
A = "A"
B = "B"
df = pd.DataFrame({"id":[A,A,A,B,B], "timestamp":[5,7,3,9,5]})
df["dummy"] = 1

解决方案:

grouper = (df["id"] == A) & (4 < df["timestamp"] ) & ( df["timestamp"] < 11)
df.groupby( grouper ).agg(sum)["dummy"]

或更好:

df[grouper]["dummy"].sum()

相关问题