Python pandas fills a DataFrame column with NaN even though the input is a column from another DataFrame

Posted by dgiusagp on 2024-01-04 in Python

The code is as follows:

    import pandas as pd

    text = pd.DataFrame(["it", "never", "forget", "it", "hello", "listener's", "books", "at", "cya", "in", "the", "village", "deliberate", "mistake", "hello", "again", "i'd", "seen", "the", "thing", "and", "i'd", "love", "to", "check"])
    c_mask = text[0] == "i'd"
    v_mask = c_mask.shift(fill_value=False)
    check_c = pd.DataFrame()
    check_c["contractions"] = text[c_mask]
    check_c["followup"] = text[v_mask]
    print(check_c)

    Out[46]
       contractions followup
    16          i'd      NaN
    21          i'd      NaN

I can't make sense of this at all!

    check_c["contractions"] = text[c_mask]
    check_c["followup"] = text[v_mask]


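As far as I can tell, these two lines do the same thing. Moreover, if I create the "followup" column first and "contractions" second, "followup" is filled correctly and "contractions" gets NaN instead. I thought it might be an indexing problem, but calling .reset_index() did not help, and neither did converting the second selection to a Series before adding it as a column. Can someone explain what is happening and why?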

6xfqseft #1

I managed to fix the problem by editing the second line:

    check_c["followup"] = text.loc[v_mask, 0].values

I think it is an indexing issue, but I am still not sure. If someone could explain what is actually going on there, I would really appreciate it.

kx7yvsdv #2

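The two selections have different indices because the mask is shifted. After the first assignment check_c has index labels 16 and 21, while text[v_mask] has labels 17 and 22. Column assignment aligns on index labels, and labels 17 and 22 do not exist in check_c, so the new column is filled with NaNs: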

    print(text[v_mask])
           0
    17  seen
    22  love

    print(text[c_mask])
          0
    16  i'd
    21  i'd
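
The alignment behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the asker's exact code: the names `target` and `followup` and the small hand-built index are assumptions made for illustration. It contrasts label-aligned assignment of a Series with positional assignment of a plain NumPy array.

```python
import pandas as pd

# Hypothetical miniature of the situation: the target frame has labels 16 and 21.
target = pd.DataFrame(index=[16, 21])
target["contractions"] = pd.Series(["i'd", "i'd"], index=[16, 21])

# A Series labelled 17 and 22 shares no label with the target,
# so label-aligned assignment leaves every row missing.
followup = pd.Series(["seen", "love"], index=[17, 22])
target["aligned"] = followup

# Dropping the index with to_numpy() makes the assignment positional instead.
target["positional"] = followup.to_numpy()

print(target)
```

The "aligned" column comes out entirely missing, while "positional" holds "seen" and "love". This is why swapping the order of the two assignments in the question only changes which column ends up as NaN.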

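Another issue is that text[v_mask] is a one-column DataFrame, so converting it to an array does not give a 1-D array. To assign the values the way your solution does, you need a Series: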

    print(text[v_mask].to_numpy())
    [['seen']
     ['love']]

    print(text.loc[v_mask, 0].to_numpy())
    ['seen' 'love']


If "i'd" is the last value in the column, your solution does not work: the array contains only one element, so a ValueError is raised:

    text = pd.DataFrame(["it", "never", "forget", "it", "hello", "listener's",
                         "books", "at", "cya", "in", "the", "village", "deliberate",
                         "mistake", "hello", "again", "i'd", "seen", "the", "thing",
                         "and", "i'd"])
    c_mask = text[0] == "i'd"
    v_mask = c_mask.shift(fill_value=False)
    print(text.loc[v_mask, 0].values)
    ['seen']

    check_c = pd.DataFrame()
    check_c["contractions"] = text[c_mask]
    check_c["followup"] = text.loc[v_mask, 0].values


ValueError: Length of values (1) does not match length of index (2)

I suggest shifting first and filtering afterwards:

    c_mask = text[0] == "i'd"
    check_c = pd.DataFrame()
    check_c["contractions"] = text.loc[c_mask, 0]
    check_c["followup"] = text.shift(-1).loc[c_mask, 0]
    print(check_c)
       contractions followup
    16          i'd     seen
    21          i'd     None
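
The shift-then-filter idea can also be wrapped in a small helper. This is only a sketch: the function name `pair_with_next` and the shortened word list are hypothetical and not part of the answer above. Both columns are selected with the same mask, so they share one index and no alignment mismatch can arise, even when the target is the last word.

```python
import pandas as pd

def pair_with_next(words, target):
    """Return rows equal to `target` alongside the word that follows each one."""
    s = pd.Series(words)
    mask = s == target
    # shift(-1) moves each value up one row, so row i holds the word at i + 1.
    # Both columns are filtered by the same mask and keep the same labels.
    return pd.DataFrame({
        "contractions": s[mask],
        "followup": s.shift(-1)[mask],
    })

# Hypothetical shortened input where the target is also the last word.
words = ["hello", "again", "i'd", "seen", "the", "thing", "and", "i'd"]
pairs = pair_with_next(words, "i'd")
print(pairs)
```

The last "i'd" has no following word, so its "followup" entry is a missing value rather than an error.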

kmbjn2e3 #3

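The point is that the shift method returns a Series with the same dtype as the original Series. In your case, both c_mask and v_mask are pandas Series of boolean values.
So instead of shifting the mask, you can shift the column itself to build the new column and then filter with c_mask. See the corrected code below.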

    import pandas as pd

    text = pd.DataFrame(["it", "never", "forget", "it", "hello", "listener's", "books", "at", "cya", "in", "the", "village", "deliberate", "mistake", "hello", "again", "i'd", "seen", "the", "thing", "and", "i'd", "love", "to", "check"])
    c_mask = text[0] == "i'd"
    print(c_mask)
    check_c = pd.DataFrame()
    print(check_c)
    check_c["contractions"] = text[0]
    print(check_c)
    # Shift the text column up by one row so each row holds the word that follows it
    # (fill_value=False puts False in the last row, which has no following word)
    check_c["followup"] = text[0].shift(periods=-1, fill_value=False)
    # Keep only the rows where "contractions" is "i'd"
    result_df = check_c.loc[c_mask]
    print(result_df)
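
To see what shifting does to the mask itself, the sketch below uses a hypothetical six-word Series (the name `words` and its contents are assumptions for illustration). Shifting a boolean mask moves the True values down one row while the index labels stay put, which is why the shifted mask selects rows with different labels than the original mask.

```python
import pandas as pd

# Hypothetical short input; labels are the default 0..5.
words = pd.Series(["again", "i'd", "seen", "the", "i'd", "love"])

c_mask = words == "i'd"
# Move every True one row down; the first row is filled with False.
v_mask = c_mask.shift(fill_value=False)

print(c_mask[c_mask].index.tolist())  # labels selected by the original mask
print(v_mask[v_mask].index.tolist())  # labels selected by the shifted mask
print(words[v_mask].tolist())         # the words that follow each "i'd"
```

The two masks select disjoint labels, so anything picked with one mask cannot be aligned by label with something picked with the other.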
