Python pandas fills a DataFrame column with NaN even though the input is a column from another DataFrame

dgiusagp  posted on 2024-01-04 in Python

Here is the code:

import pandas as pd
text =  pd.DataFrame(["it", "never", "forget", "it", "hello", "listener's", "books", "at", "cya", "in", "the", "village", "deliberate", "mistake", "hello", "again", "i'd", "seen", "the", "thing", "and", "i'd", "love", "to", "check"])

c_mask = text[0] == "i'd"
v_mask = c_mask.shift(fill_value=False)

check_c = pd.DataFrame()
check_c["contractions"] = text[c_mask]
check_c["followup"] = text[v_mask]
print(check_c)

Out[46]
   contractions followup
16          i'd      NaN
21          i'd      NaN

I can't make sense of this!

check_c["contractions"] = text[c_mask]

check_c["followup"] = text[v_mask]


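As far as I can tell, these two lines do the same thing. Moreover, if I create the "followup" column first and "contractions" second, "followup" is filled correctly and "contractions" gets NaN instead. I thought it might be an index problem, but calling .reset_index() didn't help, and neither did converting the second line to a Series before adding it as a column. Can anyone explain what is going on, and why?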

6xfqseft #1

I managed to work around the problem by editing the second line:

check_c["followup"] = text.loc[v_mask,0].values

I think this is an index problem, but I'm still not sure. If anyone can explain what is actually happening there, I would be very grateful.

kx7yvsdv #2

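The two selections have different indexes because the mask is shifted. The new column is filled with NaNs because the rows selected by the second mask have index 17 and 22, and those labels do not exist in check_c, whose index is 16 and 21: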

print(text[v_mask])
       0
17  seen
22  love

print(text[c_mask])
      0
16  i'd
21  i'd
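
The alignment rule can be seen in isolation with a minimal sketch. The data and the names `df`, `same_index` and `other_index` below are made up for illustration and are not from the question. When a Series is assigned to a column of a DataFrame that already has an index, pandas matches values by index label rather than by position, so labels missing from the DataFrame's index produce NaN.

```python
import pandas as pd

# Hypothetical frame whose index labels are 16 and 21, like check_c in the question.
df = pd.DataFrame({"contractions": ["i'd", "i'd"]}, index=[16, 21])

# Same labels as df.index: the values land where expected.
same_index = pd.Series(["seen", "love"], index=[16, 21])
df["aligned"] = same_index

# Labels 17 and 22 are not in df.index, so nothing matches and the column is NaN.
other_index = pd.Series(["seen", "love"], index=[17, 22])
df["misaligned"] = other_index

# A plain NumPy array carries no labels, so it is assigned by position.
df["by_position"] = other_index.to_numpy()

print(df)
```

This is also why the `.values` workaround in answer #1 fills the column: it drops the index before the assignment.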

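The other problem is that text[v_mask] is a one-column DataFrame, so it does not give you a 1-D array. If you want to assign values the way your solution does, you need a Series: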

print(text[v_mask].to_numpy())
[['seen']
 ['love']]

print(text.loc[v_mask, 0].to_numpy())
['seen' 'love']


If "i'd" is the last value of the column, your solution does not work, because the array then contains only one element and a ValueError is raised:

text =  pd.DataFrame(["it", "never", "forget", "it", "hello", "listener's",
                      "books", "at", "cya", "in", "the", "village", "deliberate", 
                      "mistake", "hello", "again", "i'd", "seen", "the", "thing",
                      "and", "i'd"])

c_mask = text[0] == "i'd"
v_mask = c_mask.shift(fill_value=False)

print (text.loc[v_mask,0].values)
['seen']

check_c = pd.DataFrame()
check_c["contractions"] = text[c_mask]
check_c["followup"] = text.loc[v_mask,0].values


ValueError: Length of values (1) does not match length of index (2)

I suggest shifting first and then filtering:

c_mask = text[0] == "i'd"

check_c = pd.DataFrame()
check_c["contractions"] = text.loc[c_mask, 0]
check_c["followup"] = text.shift(-1).loc[c_mask, 0]
print(check_c)
   contractions followup
16          i'd     seen
21          i'd     None
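
The shift-then-filter idea can also be wrapped in a small helper. This is only a sketch under stated assumptions: it takes a Series rather than a one-column DataFrame, and the function name `following_words` and the shortened word list are hypothetical, not part of the answer above.

```python
import pandas as pd


def following_words(words: pd.Series, target: str) -> pd.DataFrame:
    """Pair each occurrence of `target` with the word that comes right after it."""
    mask = words == target
    # shift(-1) moves every value up one row while keeping the original index
    # labels, so row i now holds the word that originally followed row i.
    followup = words.shift(-1)
    # Both columns are filtered with the same mask, so their indexes match.
    return pd.DataFrame({
        "contractions": words[mask],
        "followup": followup[mask],
    })


# Hypothetical sample that ends with the target word.
words = pd.Series(["again", "i'd", "seen", "the", "thing", "and", "i'd"])
result = following_words(words, "i'd")
print(result)
```

Because both columns come from the same mask, a trailing "i'd" simply gets a missing value in "followup" instead of raising a ValueError.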

kmbjn2e3 #3

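The shift method returns a Series with the same dtype and index as the original, so in your case c_mask and v_mask are both boolean pandas Series.
You can therefore skip v_mask, shift the column itself to build the new column, and filter with c_mask. See the corrected code below.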

import pandas as pd

text = pd.DataFrame(["it", "never", "forget", "it", "hello", "listener's", "books", "at", "cya", "in", "the", "village", "deliberate", "mistake", "hello", "again", "i'd", "seen", "the", "thing", "and", "i'd", "love", "to", "check"])

c_mask = text[0] == "i'd"
print(c_mask)

check_c = pd.DataFrame()
print(check_c)
check_c["contractions"] = text[0]
print(check_c)

# Shift the column up by one row so each row holds the word that follows it
check_c["followup"] = text[0].shift(periods=-1)

# Filter rows where "i'd" is present in "contractions"
result_df = check_c.loc[c_mask]

print(result_df)
