How can I convert a column-based hierarchy into a parent-child list in Pandas?

Posted by iecba09b on 2023-01-15 in Other

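I am trying to use the Pandas library to convert a hierarchy stored in a column format with a fixed number of columns (many of them empty) into an adjacency list of child and parent nodes.

Example hierarchy

Here is a made-up example with 5 hierarchy levels: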

                         Books
                     /     |     \
             Science     (null)      (null)
               /           |           \
      Astronomy          (null)          Pictures       
         /  \              |                      \
Astrophysics Cosmology   (null)                    Astronomy
      /         \          |                       /    |    \
  (null)        (null)   Amateurs_Astronomy   Galaxies Stars Astronauts

data.csv

id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
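
To reproduce the example without creating a file, the same rows can be read from an in-memory string. This is only a convenience sketch; the name `CSV_TEXT` is an assumption and not part of the original question.

```python
import io

import pandas as pd

# Same content as data.csv above, held in a string (CSV_TEXT is a hypothetical name).
CSV_TEXT = """id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Empty cells are parsed as NaN; these are the levels that must be skipped.
print(df.isna().sum())
```

Counting the NaN cells per column is a quick way to see which levels are skipped in at least one row.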

What I have tried

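I started by adding a column that stores a uuid for each existing node.
[EDIT, following mozway's comment]
The problem with this approach is that it assigns different uuids to the same node:

  • Rows 1 and 2 share the same levels 1, 2 and 3, so they should have the same uuid in pk_level_3.
  • Likewise, rows 4, 5 and 6 should have the same uuids in pk_level_3 and pk_level_4.

import uuid

import pandas as pd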

df = pd.read_csv('data.csv')

# iterate over each column in the dataframe to add a new column,
# containing a uuid each time the csv row has a value for this level:
for col in df.columns:
    if df[col].isnull().sum() > 0:
        new_col = 'pk_' + col
        df[new_col] = None
        # fill the new column with uuid only for non-null values of the original column
        df.loc[df[col].notnull(), new_col] = df.loc[df[col].notnull(), col].apply(lambda x: uuid.uuid4())

I also do not know how to find the parent of each node while skipping all the empty nodes.
Do you know how I could get the following result?

this_node,parent_node,this_node_uuid,parent_node_uuid
Science,Books,books/science-node-uuid,books-node-uuid
Astronomy,Science,books/science/astronomy-node-uuid,books/science-node-uuid
Astrophysics,Astronomy,books/science/astronomy/astrophysics-node-uuid,books/science/astronomy-node-uuid
Amateurs_Astronomy,Books,books/amateurs_astronomy-node-uuid,books-node-uuid

(...)

Answer 1 (qfe3c7zg)

Here is one way to generate a uuid per value and level, and then build the adjacency list:

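import uuid
from collections import defaultdict

# Each new key gets a fresh random uuid the first time it is looked up.
mapper = defaultdict(uuid.uuid4)

# Long format: one row per (id, level, node). 'id' has to be the index before
# stacking; reset_index() then names the unnamed index level 'level_1', and
# that column holds the original level name.
df2 = (df.set_index('id').stack().dropna().reset_index(name='node')
         .assign(uuid=lambda d: d.groupby(['level_1', 'node']).ngroup().map(mapper))
      )

# Pair every node with the next filled-in level of the same row.
out = (df2[['node', 'uuid']]
 .join(df2.groupby('id')[['node', 'uuid']].shift(-1).add_prefix('parent_'))
 .dropna()
 [['node', 'parent_node', 'uuid', 'parent_uuid']]
)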

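Output (because of shift(-1), the column labelled parent_node holds the next level down, i.e. the child of node):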

node         parent_node                                  uuid                           parent_uuid
0       Books             Science  73299f14-db0b-49ac-8050-13ba909fbbf9  d5eabe29-9822-4cd5-832f-e7a69630ed1a
1     Science           Astronomy  d5eabe29-9822-4cd5-832f-e7a69630ed1a  f72718d8-99d0-4160-ab2b-c4d990c103bc
2   Astronomy        Astrophysics  f72718d8-99d0-4160-ab2b-c4d990c103bc  03f6af50-df0f-4762-8791-3c06103dae62
4       Books             Science  73299f14-db0b-49ac-8050-13ba909fbbf9  d5eabe29-9822-4cd5-832f-e7a69630ed1a
5     Science           Astronomy  d5eabe29-9822-4cd5-832f-e7a69630ed1a  f72718d8-99d0-4160-ab2b-c4d990c103bc
6   Astronomy           Cosmology  f72718d8-99d0-4160-ab2b-c4d990c103bc  27de8aa5-5805-41f0-b127-e1c962328398
8       Books  Amateurs_Astronomy  73299f14-db0b-49ac-8050-13ba909fbbf9  af5763c3-9f55-4815-88c8-3996bd2407db
10      Books            Pictures  73299f14-db0b-49ac-8050-13ba909fbbf9  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
11   Pictures           Astronomy  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
12  Astronomy            Galaxies  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43  68a8cb4f-def5-492d-b497-318a074a1f15
14      Books            Pictures  73299f14-db0b-49ac-8050-13ba909fbbf9  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
15   Pictures           Astronomy  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
16  Astronomy               Stars  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43  9d823bdd-fd3e-43a3-8756-51160490c8ed
18      Books            Pictures  73299f14-db0b-49ac-8050-13ba909fbbf9  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5
19   Pictures           Astronomy  7cbc093c-b34c-4d45-8e38-24cc68b6ccc5  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43
20  Astronomy          Astronauts  41bf967b-d6ca-4da7-b5ad-3ec05ceefd43  609e708f-60cd-4928-863c-d41255330981
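Graph:

import networkx as nx

# `out` is the adjacency list shown above; graph nodes are identified by uuid.
G = nx.from_pandas_edgelist(out, source='uuid', target='parent_uuid', create_using=nx.DiGraph)
# mapper is keyed by group number, so the uuid -> label mapping is taken from df2.
nx.set_node_attributes(G, dict(zip(df2['uuid'], df2['node'])), name='label')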
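
Grouping by level and value gives one uuid to every occurrence of a label at a given level, whereas the expected result in the question identifies nodes by their full path (books/science/astronomy-node-uuid). A minimal sketch of that variant follows, assuming deterministic `uuid.uuid5` identifiers derived from the path are acceptable. The names `CSV_TEXT`, `path_uuid` and `edges` are illustrative and not from the original answer.

```python
import io
import uuid

import pandas as pd

# Same rows as data.csv in the question (CSV_TEXT is a hypothetical name).
CSV_TEXT = """id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))
level_cols = [c for c in df.columns if c.startswith("level_")]


def path_uuid(path):
    # uuid5 is deterministic: the same path always yields the same UUID.
    return uuid.uuid5(uuid.NAMESPACE_URL, "/".join(path).lower())


rows = []
for values in df[level_cols].itertuples(index=False, name=None):
    # Keep only the filled-in levels, so empty levels are skipped.
    nodes = [v for v in values if pd.notna(v)]
    for depth in range(1, len(nodes)):
        rows.append({
            "this_node": nodes[depth],
            "parent_node": nodes[depth - 1],
            "this_node_uuid": path_uuid(nodes[:depth + 1]),
            "parent_node_uuid": path_uuid(nodes[:depth]),
        })

# Rows of the CSV that share a prefix produce the same edge; keep each once.
edges = pd.DataFrame(rows).drop_duplicates(ignore_index=True)
print(edges[["this_node", "parent_node"]])
```

With this keying, the Astronomy node under Science and the one under Pictures get different identifiers because their paths differ, while repeated prefixes such as Books/Science collapse into a single edge.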

Answer 2 (kuhbmx9i)

How do you want to generate the uuids here? The code below builds the node/parent pairs for each row:

def build_hierarchy(df):
    return pd.concat([df.shift(-1), df], keys=['node', 'parent'], axis=1)

out = (df.set_index('id').stack()
         .groupby(level='id', group_keys=False).apply(build_hierarchy)
         .droplevel(1).reset_index())

Output:

>>> out
    id                node              parent
0    1             Science               Books
1    1           Astronomy             Science
2    1        Astrophysics           Astronomy
3    1                None        Astrophysics
4    2             Science               Books
5    2           Astronomy             Science
6    2           Cosmology           Astronomy
7    2                None           Cosmology
8    3  Amateurs_Astronomy               Books
9    3                None  Amateurs_Astronomy
10   4            Pictures               Books
11   4           Astronomy            Pictures
12   4            Galaxies           Astronomy
13   4                None            Galaxies
14   5            Pictures               Books
15   5           Astronomy            Pictures
16   5               Stars           Astronomy
17   5                None               Stars
18   6            Pictures               Books
19   6           Astronomy            Pictures
20   6          Astronauts           Astronomy
21   6                None          Astronauts
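
The rows with a None node in the output above come from pairing the last filled-in level of each row with nothing. As one possible variation (a sketch, not the answerer's code), the same node/parent pairs can be built with `melt` and a grouped `shift`, dropping the empty cells explicitly. The names `CSV_TEXT`, `long_df` and `pairs` are illustrative.

```python
import io

import pandas as pd

# Same rows as data.csv in the question (CSV_TEXT is a hypothetical name).
CSV_TEXT = """id,level_1,level_2,level_3,level_4,level_5
1,Books,Science,Astronomy,Astrophysics,
2,Books,Science,Astronomy,Cosmology,
3,Books,,,,Amateurs_Astronomy
4,Books,,Pictures,Astronomy,Galaxies
5,Books,,Pictures,Astronomy,Stars
6,Books,,Pictures,Astronomy,Astronauts
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

long_df = (
    df.melt(id_vars="id", var_name="level", value_name="node")
      .dropna(subset=["node"])           # discard the empty levels
      # melt emits the level columns in order, so a stable sort on id
      # keeps the levels of each row in their original order.
      .sort_values("id", kind="stable")
      .reset_index(drop=True)
)

# Within each id, the previous filled-in level is the parent.
long_df["parent"] = long_df.groupby("id")["node"].shift(1)

# The root of each row has no parent and is dropped.
pairs = (
    long_df.dropna(subset=["parent"])[["id", "node", "parent"]]
           .reset_index(drop=True)
)
print(pairs)
```

Appending `.drop(columns="id").drop_duplicates()` to `pairs` would reduce the result to one row per distinct node/parent pair.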

Answer 3 (64jmpszr)

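# df1 is the DataFrame read from data.csv
df1 = pd.read_csv('data.csv')

def function1(ss: pd.Series):
    # Keep only complete windows of two consecutive nodes
    return ss.tolist() if ss.size > 1 else None

df11 = df1.set_index('id').apply(lambda ss: pd.Series(ss.dropna().rolling(2, 2))
                          .apply(function1).dropna().tolist(), axis=1)\
    .explode().drop_duplicates()
df12 = pd.DataFrame(df11.tolist(), columns=['node', 'parent_node'])
# Note: the built-in id() returns the identity of the Python string object,
# not a value derived from the label, so it is not stable across runs.
df12.assign(uuid=df12.node.map(id)).assign(parent_uuid=df12.parent_node.map(id))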

Output:

       node         parent_node           uuid    parent_uuid
0      Books             Science  2437636899760  2437636912432
1    Science           Astronomy  2437636912432  2437636913072
2  Astronomy        Astrophysics  2437636913072  2437636914288
3  Astronomy           Cosmology  2437636913072  2437636909360
4      Books  Amateurs_Astronomy  2437636899760  2437649183760
5      Books            Pictures  2437636899760  2437649161840
6   Pictures           Astronomy  2437649161840  2437649163120
7  Astronomy            Galaxies  2437649163120  2437649165552
8  Astronomy               Stars  2437649163120  2437649167344
9  Astronomy          Astronauts  2437649163120  2437649162864
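
Since `id()` reports object identity rather than anything derived from the label, one hedged alternative is to assign identifiers from an explicit dictionary. The sketch below uses a small hand-written edge list (the `edges` frame and `node_ids` mapping are illustrative names) and gives every distinct label one random UUID. Keying by label alone merges nodes that share a name, such as the two Astronomy nodes, so a key built from the full path would be needed to keep them apart.

```python
import uuid

import pandas as pd

# A tiny hand-written edge list, only for illustration.
edges = pd.DataFrame({
    "node": ["Science", "Astronomy", "Pictures", "Astronomy"],
    "parent_node": ["Books", "Science", "Books", "Pictures"],
})

# One random UUID per distinct label, shared by both columns.
labels = pd.unique(edges[["node", "parent_node"]].to_numpy().ravel())
node_ids = {label: uuid.uuid4() for label in labels}

edges["uuid"] = edges["node"].map(node_ids)
edges["parent_uuid"] = edges["parent_node"].map(node_ids)
print(edges)
```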
