Replace a single row with multiple rows in a pandas DataFrame by splitting a column within that row [duplicate]

wwwo4jvm, posted on 2023-09-29 in Other

This question already has answers here:

Split (explode) pandas dataframe string entry to separate rows (27 answers)
Closed 6 days ago.
I have a dataframe that looks like this:

    key        term         notes    Source
    156349471  Aasdasd      Bleen    20623750
    213740505  dfgdfgdfg    Blox     33052911
    171645239  rtertertert  sdffd    15805072|24361871|28885000
    156134219  cvdv         dsfsdf   20305092|21259293|21905055|23136149
    205936689  ddfg         dfsewr   34480604
    205947819  xvcbfghf     svdst    34480604
    213902333  jfghd        xcvsd    35020164
    156133836  cvbcvb       xcvsfg   21907755|30098279
    156349486  cvbcvb       xcv      24880025
    156134727  dfgdfgdfg    sdfgdfs  24001450

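What I'm trying to do is create a new dataframe from this one in which every row with multiple entries in the Source column (separated by "|") is turned into multiple rows, one per entry, with the rest of the row left unchanged. So this row:

    171645239  rtertertert  sdffd    15805072|24361871|28885000

would become: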

    171645239  rtertertert  sdffd    15805072
    171645239  rtertertert  sdffd    24361871
    171645239  rtertertert  sdffd    28885000

So for the whole example above, the 10 rows would become 16 rows.
This is the code I tried:

    new_data = []
    for _, row in master_df.iterrows():
        for src in row['Source'].split('|'):
            new_data.append([row['key', 'term', 'notes', 'Source'], src])
    new_df = pd.DataFrame(new_data, columns=['key', 'term', 'notes', 'Source', 'src'])
    print(new_df)

This is the error I get:

      File "notations.py", line 70, in <module>
        new_data.append([row['key', 'term', 'notes', 'Source'], src])
                         ~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "XX\venv\Lib\site-packages\pandas\core\series.py", line 1072, in __getitem__
        return self._get_with(key)
               ^^^^^^^^^^^^^^^^^^^
      File "XX\venv\Lib\site-packages\pandas\core\series.py", line 1082, in _get_with
        return self._get_values_tuple(key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "XX\venv\Lib\site-packages\pandas\core\series.py", line 1126, in _get_values_tuple
        raise KeyError("key of type tuple not found and not a MultiIndex")
    KeyError: 'key of type tuple not found and not a MultiIndex'

This code works:

    import pandas as pd

    diddly = {
        'A': ['gone1', 'gone2'],
        'B': ['PMID1|PMID2', 'PMID3|PMID4']
    }
    df = pd.DataFrame(diddly)
    print(diddly)

    new_data = []
    for _, row in df.iterrows():
        for pmid in row['B'].split('|'):
            new_data.append([row['A'], pmid])
    new_df = pd.DataFrame(new_data, columns=['Gone', 'PMID'])
    print(new_df)

Output:

       Gone   PMID
    0  gone1  PMID1
    1  gone1  PMID2
    2  gone2  PMID3
    3  gone2  PMID4

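So I wonder whether the problem is that my dataframe has more than two columns in the failing case, but I'm not an expert.
Any help would be appreciated!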


h79rfbju1#

Code

Use the following code:

    df.assign(Source=df['Source'].str.split('|')).explode('Source')

Output:

       key        term         notes    Source
    0  156349471  Aasdasd      Bleen    20623750
    1  213740505  dfgdfgdfg    Blox     33052911
    2  171645239  rtertertert  sdffd    15805072
    2  171645239  rtertertert  sdffd    24361871
    2  171645239  rtertertert  sdffd    28885000
    3  156134219  cvdv         dsfsdf   20305092
    3  156134219  cvdv         dsfsdf   21259293
    3  156134219  cvdv         dsfsdf   21905055
    3  156134219  cvdv         dsfsdf   23136149
    4  205936689  ddfg         dfsewr   34480604
    5  205947819  xvcbfghf     svdst    34480604
    6  213902333  jfghd        xcvsd    35020164
    7  156133836  cvbcvb       xcvsfg   21907755
    7  156133836  cvbcvb       xcvsfg   30098279
    8  156349486  cvbcvb       xcv      24880025
    9  156134727  dfgdfgdfg    sdfgdfs  24001450

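The exploded rows keep the index of the row they came from, which is why labels such as 2, 3 and 7 repeat. If you want a fresh 0..n-1 index, call reset_index(drop=True) on the result above.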
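As a minimal sketch of the whole chain, here is the split, explode and index reset applied to a three-row subset of the question's data. The variable name `exploded` is only illustrative and not part of the original answer.

```python
import pandas as pd

# Three rows taken from the question's data, for illustration only.
df = pd.DataFrame({
    "key": [156349471, 171645239, 156133836],
    "term": ["Aasdasd", "rtertertert", "cvbcvb"],
    "notes": ["Bleen", "sdffd", "xcvsfg"],
    "Source": ["20623750", "15805072|24361871|28885000", "21907755|30098279"],
})

exploded = (
    df.assign(Source=df["Source"].str.split("|"))  # each cell becomes a list of strings
      .explode("Source")                           # one output row per list element
      .reset_index(drop=True)                      # replace repeated labels with 0..n-1
)
print(exploded)
```

Note that the values in Source remain strings after the split; convert them with astype(int) afterwards if numeric IDs are needed.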

Example

    import pandas as pd

    data = {
        'key': [156349471, 213740505, 171645239, 156134219, 205936689,
                205947819, 213902333, 156133836, 156349486, 156134727],
        'term': ['Aasdasd', 'dfgdfgdfg', 'rtertertert', 'cvdv', 'ddfg',
                 'xvcbfghf', 'jfghd', 'cvbcvb', 'cvbcvb', 'dfgdfgdfg'],
        'notes': ['Bleen', 'Blox', 'sdffd', 'dsfsdf', 'dfsewr',
                  'svdst', 'xcvsd', 'xcvsfg', 'xcv', 'sdfgdfs'],
        'Source': ['20623750', '33052911', '15805072|24361871|28885000',
                   '20305092|21259293|21905055|23136149', '34480604',
                   '34480604', '35020164', '21907755|30098279', '24880025', '24001450'],
    }
    df = pd.DataFrame(data)
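As for the KeyError in the question: the number of columns is not the cause. `row['key', 'term', 'notes', 'Source']` hands a single tuple to the Series as one key, and since the row has no MultiIndex that lookup fails. Passing a list of labels selects several values instead. The following is only a sketch of how the original loop might be repaired, using a two-row stand-in for `master_df`; explode remains the shorter route.

```python
import pandas as pd

# Two-row stand-in for the question's master_df (illustrative data only).
master_df = pd.DataFrame({
    "key": [171645239, 205936689],
    "term": ["rtertertert", "ddfg"],
    "notes": ["sdffd", "dfsewr"],
    "Source": ["15805072|24361871|28885000", "34480604"],
})

new_data = []
for _, row in master_df.iterrows():
    for src in row["Source"].split("|"):
        # A list of labels (double brackets) selects several values from the row;
        # a bare tuple would be treated as one key and raise KeyError.
        new_data.append(row[["key", "term", "notes"]].tolist() + [src])

# Four values per record, so four column names.
new_df = pd.DataFrame(new_data, columns=["key", "term", "notes", "Source"])
print(new_df)
```

Each record now holds the three unchanged values plus one split entry, so the column list has four names rather than the five used in the question.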