如何使用pandas数据框定义sankey图的结构

bzzcjhmw  于 2023-04-10  发布在  其他
关注(0)|答案(1)|浏览(147)

这听起来像是一个非常宽泛的问题,但如果你让我描述一些细节,我可以向你保证,这是非常具体的,也是令人沮丧、沮丧和愤怒的。
下面的图描述了苏格兰选举,它基于www.example.com中的代码plot.ly:

图1:

数据集1:

data = [['Source','Target','Value','Color','Node, Label','Link Color'],
        [0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
        [0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
        [0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],
        [1,5,14,'#7FC241','Leave+Yes – 14','rgba(219, 233, 246, 0.5)'],
        [1,6,1,'#D3D3D3','Didn’t vote in at least one referendum – 21','rgba(73, 148, 206, 1)'],
        [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)'],
        [2,5,3,'#449E9E','39 – Yes','rgba(250, 188, 19, 1)'],
        [2,6,17,'#D3D3D3','14 – Don’t know / would not vote','rgba(250, 188, 19, 0.5)'],
        [2,7,2,'','','rgba(250, 188, 19, 0.5)'],
        [3,5,3,'','','rgba(127, 194, 65, 1)'],
        [3,6,9,'','','rgba(127, 194, 65, 0.5)'],
        [3,7,2,'','','rgba(127, 194, 65, 0.5)'],
        [4,5,5,'','','rgba(211, 211, 211, 0.5)'],
        [4,6,9,'','','rgba(211, 211, 211, 0.5)'],
        [4,7,8,'','','rgba(211, 211, 211, 0.5)']
        ]

剧情构建方式:

我从各种来源收集了一些关于sankey图表行为的重要细节,例如:

正如你将在下面的细节中看到的,节点,标签和颜色并没有按照源 Dataframe 的结构化顺序应用到图表中。* 其中一些 * 是完美的,因为你有各种各样的元素来描述同一个节点,如颜色,目标,值和链接颜色。一个节点'Remain+No – 28'看起来像这样:

数据集的附带部分看起来像这样:

[0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
[0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
[0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],

因此,源的这一部分描述了一个节点[0],其中有三个对应的目标[5, 6, 7]和三个值为[20, 3, 5]的链接。'#F27420'是橙子(ish)节点的颜色,颜色'rgba(253, 227, 212, 0.5)''rgba(242, 116, 32, 1)''rgba(253, 227, 212, 0.5)'描述从节点到一些目标的链接的颜色。上述样本中未使用的信息是:

数据样本2(部分)

[-,-,--'-------','---------------','-------------------'],
[-,-,-,'#4994CE','Leave+No – 16','-------------------'],
[-,-,-,'#FABC13','Remain+Yes – 21','-------------------'],

并且该信息被用作图的其余元素被引入。
那么,问题是什么呢?在下面的进一步细节中,你会看到,只要数据集中的一行新数据插入一个新链接,并对其他元素(颜色,标签)进行其他更改(如果该信息尚未使用),一切都是有意义的。我将使用两个截图来更具体地说明,这两个截图来自我制作的设置,左边是plot,右边是代码:
下面的数据示例按照上面描述的逻辑生成下面的图表:

数据样本3

data = [['Source','Target','Value','Color','Node, Label','Link Color'],
        [0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
        [0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
        [0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],
        [1,5,14,'#7FC241','Leave+Yes – 14','rgba(219, 233, 246, 0.5)'],
        [1,6,1,'#D3D3D3','Didn’t vote in at least one referendum – 21','rgba(73, 148, 206, 1)']]

屏幕截图1 -具有数据样本3的部分图

问题:

在数据集中添加行[1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)']会在源[5]和目标[7]之间生成一个新链接,但同时将颜色和标签应用到目标5。我认为下一个应用到图表的标签是'Remain+Yes – 21',因为它还没有被使用。但这里发生的是标签'46 – No'应用到目标5。为什么

截图2 -数据样本3的部分图+ [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)']

你如何辨别什么是源,什么是基于该 Dataframe 的目标?
我知道这个问题很奇怪,也很难回答,但我希望有人能给我一个建议,我也知道dataframe可能不是sankey图表的最佳来源,也许json可以代替?

完整的代码和数据示例,可轻松复制和粘贴到Jupyter Notebook:

import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Original data
data = [['Source','Target','Value','Color','Node, Label','Link Color'],
    [0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
    [0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
    [0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],
    [1,5,14,'#7FC241','Leave+Yes – 14','rgba(219, 233, 246, 0.5)'],
    [1,6,1,'#D3D3D3','Didn’t vote in at least one referendum – 21','rgba(73, 148, 206, 1)'],
    [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)'],
    [2,5,3,'#449E9E','39 – Yes','rgba(250, 188, 19, 1)'],
    [2,6,17,'#D3D3D3','14 – Don’t know / would not vote','rgba(250, 188, 19, 0.5)'],
    [2,7,2,'','','rgba(250, 188, 19, 0.5)'],
    [3,5,3,'','','rgba(127, 194, 65, 1)'],
    [3,6,9,'','','rgba(127, 194, 65, 0.5)'],
    [3,7,2,'','','rgba(127, 194, 65, 0.5)'],
    [4,5,5,'','','rgba(211, 211, 211, 0.5)'],
    [4,6,9,'','','rgba(211, 211, 211, 0.5)'],
    [4,7,8,'','','rgba(211, 211, 211, 0.5)']
    ]


headers = data.pop(0)
df = pd.DataFrame(data, columns = headers)
scottish_df = df

data_trace = dict(
    type='sankey',
    domain = dict(
      x =  [0,1],
      y =  [0,1]
    ),
    orientation = "h",
    valueformat = ".0f",
    node = dict(
      pad = 10,
      thickness = 30,
      line = dict(
        color = "black",
        width = 0
      ),
      label =  scottish_df['Node, Label'].dropna(axis=0, how='any'),
      color = scottish_df['Color']
    ),
    link = dict(
      source = scottish_df['Source'].dropna(axis=0, how='any'),
      target = scottish_df['Target'].dropna(axis=0, how='any'),
      value = scottish_df['Value'].dropna(axis=0, how='any'),
      color = scottish_df['Link Color'].dropna(axis=0, how='any'),
  )
)

layout =  dict(
    title = "Scottish Referendum Voters who now want Independence",
    height = 772,
    font = dict(
      size = 10
    ),    
)

fig = dict(data=[data_trace], layout=layout)
iplot(fig, validate=False)
syqv5f0l

syqv5f0l1#

这个问题看起来确实很奇怪,但只有在你分析plotly中的sankey图是如何创建的之后才会出现:
创建sankey图时,向其发送:
1.节点列表
1.链接列表
这些列表彼此有界。当您创建5长度的节点列表时,任何边都将知道0,1,2,3,4的开始和结束。在您的程序中,您错误地创建了节点-您创建了链接列表,然后遍历它并创建节点。看看您的图表。它有两个黑色节点,其中包含undefined。您的数据集的长度是多少...是的,5。你的节点索引在4上结束,没有真正定义目标节点。你在数据集中添加第六个列表,- bingo!-存在nodes[5]!只需尝试在数据集中添加另一个新行:
[1,7,1,'#FF0000','WAKA','rgba(219, 233, 246,0.5)']
你会看到另一个黑条被染成了红色。你有五个节点(因为你有5个链接,你通过迭代链接列表来创建节点),但是链接的目标索引是5,6,7。你可以用两种方法来解决这个问题:
1.将数据集中的Target更改为2,3,4
1.分别创建节点和链接 (正确方法)
我希望我帮助你在你的问题和情节创作的理解(什么是更重要的海事组织)。

**编辑:**以下是创建单独节点/链路的示例(注意data_trace中的node部分只使用nodes_df数据,data_trace中的link部分只使用links_df数据,nodes_dflinks_df长度不相等):

import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

nodes = [
    ['ID', 'Label', 'Color'],
    [0,'Remain+No – 28','#F27420'],
    [1,'Leave+No – 16','#4994CE'],
    [2,'Remain+Yes – 21','#FABC13'],
    [3,'Leave+Yes – 14','#7FC241'],
    [4,'Didn’t vote in at least one referendum – 21','#D3D3D3'],
    [5,'46 – No','#8A5988']
]
links = [
    ['Source','Target','Value','Link Color'],
    [0,3,20,'rgba(253, 227, 212, 0.5)'],
    [0,4,3,'rgba(242, 116, 32, 1)'],
    [0,2,5,'rgba(253, 227, 212, 0.5)'],
    [1,5,14,'rgba(219, 233, 246, 0.5)'],
    [1,3,1,'rgba(73, 148, 206, 1)'],
    [1,4,1,'rgba(219, 233, 246,0.5)'],
    [1,2,10,'rgba(8, 233, 246,0.5)'],
    [1,3,5,'rgba(219, 77, 246,0.5)'],
    [1,5,12,'rgba(219, 4, 246,0.5)']
]

nodes_headers = nodes.pop(0)
nodes_df = pd.DataFrame(nodes, columns = nodes_headers)
links_headers = links.pop(0)
links_df = pd.DataFrame(links, columns = links_headers)

data_trace = dict(
    type='sankey',
    domain = dict(
      x =  [0,1],
      y =  [0,1]
    ),
    orientation = "h",
    valueformat = ".0f",
    node = dict(
      pad = 10,
      thickness = 30,
      line = dict(
        color = "black",
        width = 0
      ),
      label =  nodes_df['Label'].dropna(axis=0, how='any'),
      color = nodes_df['Color']
    ),
    link = dict(
      source = links_df['Source'].dropna(axis=0, how='any'),
      target = links_df['Target'].dropna(axis=0, how='any'),
      value = links_df['Value'].dropna(axis=0, how='any'),
      color = links_df['Link Color'].dropna(axis=0, how='any'),
  )
)

layout =  dict(
    title = "Scottish Referendum Voters who now want Independence",
    height = 772,
    font = dict(
      size = 10
    ),    
)

fig = dict(data=[data_trace], layout=layout)
iplot(fig, validate=False)

**编辑2:**让我们更深入地了解:)sankey图中的节点和链接几乎是完全独立的。唯一限制它们的信息是源中的索引-链接中的目标。因此我们可以为它们创建许多节点而没有链接(只需将Edit 1代码中的节点/链接替换为它):

nodes = [
    ['ID', 'Label', 'Color'],
    [0,'Remain+No – 28','#F27420'],
    [1,'Leave+No – 16','#4994CE'],
    [2,'Remain+Yes – 21','#FABC13'],
    [3,'Leave+Yes – 14','#7FC241'],
    [4,'Didn’t vote in at least one referendum – 21','#D3D3D3'],
    [5,'46 – No','#8A5988'],
    [6,'WAKA1','#8A5988'],
    [7,'WAKA2','#8A5988'],
    [8,'WAKA3','#8A5988'],
    [9,'WAKA4','#8A5988'],
    [10,'WAKA5','#8A5988'],
    [11,'WAKA6','#8A5988'],

]
links = [
    ['Source','Target','Value','Link Color'],
    [0,3,20,'rgba(253, 227, 212, 0.5)'],
    [0,4,3,'rgba(242, 116, 32, 1)'],
    [0,2,5,'rgba(253, 227, 212, 0.5)'],
    [1,5,14,'rgba(219, 233, 246, 0.5)'],
    [1,3,1,'rgba(73, 148, 206, 1)'],
    [1,4,1,'rgba(219, 233, 246,0.5)'],
    [1,2,10,'rgba(8, 233, 246,0.5)'],
    [1,3,5,'rgba(219, 77, 246,0.5)'],
    [1,5,12,'rgba(219, 4, 246,0.5)']
]

并且这些节点不会出现在图中。
我们只能创建没有节点的链接:

nodes = [
    ['ID', 'Label', 'Color'],
]
links = [
    ['Source','Target','Value','Link Color'],
    [0,3,20,'rgba(253, 227, 212, 0.5)'],
    [0,4,3,'rgba(242, 116, 32, 1)'],
    [0,2,5,'rgba(253, 227, 212, 0.5)'],
    [1,5,14,'rgba(219, 233, 246, 0.5)'],
    [1,3,1,'rgba(73, 148, 206, 1)'],
    [1,4,1,'rgba(219, 233, 246,0.5)'],
    [1,2,10,'rgba(8, 233, 246,0.5)'],
    [1,3,5,'rgba(219, 77, 246,0.5)'],
    [1,5,12,'rgba(219, 4, 246,0.5)']
]

我们将只有从无处到无处的链接。
如果你想添加**(1)一个带链接的新源,你应该在nodes中添加一个新列表,计算它的索引(这就是为什么我有ID列),并在links中添加一个新列表,其中Source等于节点索引。
如果你想为现有的节点添加
(2)**一个新的目标-只需在links中添加一个新的列表,并正确地写它的SourceTarget

[1,100500,10,'rgba(219, 233, 246,0.5)'],
    [1,100501,10,'rgba(8, 233, 246,0.5)'],
    [1,100502,10,'rgba(219, 77, 246,0.5)'],
    [1,100503,10,'rgba(219, 4, 246,0.5)']

(Here我为4个新目标创建了4个新链接。源是所有目标的索引为1的节点)。

**(3+4):**Sankey图没有源和目标的区别,所有的图都是Sankey的节点,每个节点都可以是源,也可以是目标,请看:

nodes = [
    ['ID', 'Label', 'Color'],
    [0,'WAKA WANNA BE SOURCE','#F27420'],
    [1,'WAKA WANNA BE TARGET','#4994CE'],
    [2,'WAKA DON\'T KNOW WHO WANNA BE','#FABC13'],

]
links = [
    ['Source','Target','Value','Link Color'],
    [0,1,10,'rgba(253, 227, 212, 1)'],
    [0,2,10,'rgba(242, 116, 32, 1)'],
    [2,1,10,'rgba(253, 227, 212, 1)'],
]

在这里,您将看到3列Sankey图。0节点是源,1是目标,2节点是1的源和2的目标。

相关问题