Erlang: why have a supervision tree rather than a single centralized supervisor?

qnyhuwrf posted on 2022-12-08 in Erlang

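I am learning Elixir and Erlang/OTP, and I want to understand why supervision trees matter when building highly available systems.
I understand the role a supervisor plays in managing the lifecycle of worker processes. What I still don't see is why some applications organize their supervisors into a hierarchy instead of having a single supervisor manage every worker. Does such a structure have any practical benefit?
Borrowing an example from the book Programming Elixir: in which situations would we prefer the first structure over the second?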

1.  MainSupervisor
    ├── StashWorker
    └── SubSupervisor
        └── SequenceWorker

2.  MainSupervisor
    ├── StashWorker
    └── SequenceWorker
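
For concreteness, the two layouts could be written roughly as follows. This is only a sketch: the book's actual workers are not reproduced here, so `StashWorker` and `SequenceWorker` are stand-in Agents, and the initial values are assumptions made for illustration.

```elixir
# Stand-in workers (assumed for illustration, not the book's implementations).
defmodule StashWorker do
  use Agent
  def start_link(initial), do: Agent.start_link(fn -> initial end, name: __MODULE__)
end

defmodule SequenceWorker do
  use Agent
  def start_link(_arg), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
end

# Structure 1 only: a nested supervisor that owns SequenceWorker.
defmodule SubSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    Supervisor.init([SequenceWorker], strategy: :one_for_one)
  end
end

# Structure 1: MainSupervisor -> [StashWorker, SubSupervisor -> [SequenceWorker]]
{:ok, main} =
  Supervisor.start_link([{StashWorker, 0}, SubSupervisor], strategy: :one_for_one)

# Structure 2 would instead pass a flat child list to the same call:
#   [{StashWorker, 0}, SequenceWorker]
```

In the first layout, `SubSupervisor` can have its own strategy and restart limits, independent of how `StashWorker` is handled.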

um6iljoc1#

What you are probably overlooking is the famous “let it crash” philosophy, which makes process crashes and restarts first-class citizens in OTP. We treat a process crash not as a failure but as an opportunity to start over from a known-good state, without having to handle every error by hand.
The main reason is to allow finer-grained control over what gets restarted on failure. That is what supervision strategies are for. Or, as @Andree restated it in the comments:
“By organizing supervision in hierarchies, we allow finer-grained control over how the system should respond should a subset of the system fail.”
Imagine an application with one process responsible for a remote connection and a bunch of processes that all use this resource. When the connection process crashes, its supervisor restarts it, but its pid changes, which means every process that relied on the old pid has to be restarted as well. With the :rest_for_one strategy this comes out of the box: when a child terminates, the supervisor also restarts every child that was started after it.
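
A minimal sketch of that scenario, assuming two hypothetical modules, `Demo.Conn` (standing in for the connection owner) and `Demo.Client` (a dependent process that remembers the connection pid it saw at start-up):

```elixir
defmodule Demo.Conn do
  use Agent
  # Stand-in for the process that owns the remote connection.
  def start_link(_arg), do: Agent.start_link(fn -> :connected end, name: __MODULE__)
end

defmodule Demo.Client do
  use Agent
  # Remembers the connection pid it saw when it started.
  def start_link(_arg) do
    Agent.start_link(fn -> Process.whereis(Demo.Conn) end, name: __MODULE__)
  end

  def conn_pid, do: Agent.get(__MODULE__, & &1)
end

# Order matters: children listed after Demo.Conn are restarted along with it.
{:ok, sup} = Supervisor.start_link([Demo.Conn, Demo.Client], strategy: :rest_for_one)

old_conn = Process.whereis(Demo.Conn)
old_client = Process.whereis(Demo.Client)
ref = Process.monitor(old_client)

# Simulate a crash of the connection process.
Process.exit(old_conn, :kill)

# The supervisor takes the old client down as part of the restart...
receive do
  {:DOWN, ^ref, :process, ^old_client, _reason} -> :ok
after
  1_000 -> raise "client was not restarted"
end

# ...and this synchronous call is answered only once that restart has finished.
_ = Supervisor.which_children(sup)

new_conn = Process.whereis(Demo.Conn)
new_client = Process.whereis(Demo.Client)
```

After the restart, both pids differ from the old ones, and the new client holds the new connection pid. Had `Demo.Client` been listed before `Demo.Conn`, it would have been left untouched and would still hold the stale pid.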
Another approach to this particular example would be to manage the connection in a process supervised in another part of the tree and, on connection issues, manually crash the supervisor of the pools that use this connection, so that all of them are reinitialized.
Moreover, we might want to crash the process handling this connection on purpose in order to reinitialize it. Instead of writing defensive code like if no_conn, do: reload_config_and_restart_connection, we just let it crash and let the supervision tree reinitialize it with the new, proper config.
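
As a rough illustration of the non-defensive style, here is a sketch of a hypothetical `ConnWorker`. The `:my_app`/`:endpoint` configuration keys and the `send_over/2` helper are assumptions that stand in for real configuration and I/O.

```elixir
defmodule ConnWorker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def request(payload), do: GenServer.call(__MODULE__, {:request, payload})

  @impl true
  def init(:ok) do
    # Runs on every (re)start, so fresh configuration is picked up here.
    # :my_app / :endpoint are hypothetical configuration keys.
    {:ok, %{endpoint: Application.get_env(:my_app, :endpoint, "localhost")}}
  end

  @impl true
  def handle_call({:request, payload}, _from, state) do
    # No defensive branch: if the match fails, the process crashes
    # and its supervisor starts a fresh one through init/1.
    {:ok, reply} = send_over(state.endpoint, payload)
    {:reply, reply, state}
  end

  # Stand-in for real I/O; :boom simulates a broken connection.
  defp send_over(_endpoint, :boom), do: {:error, :closed}
  defp send_over(_endpoint, payload), do: {:ok, payload}
end

{:ok, _sup} = Supervisor.start_link([ConnWorker], strategy: :one_for_one)
:hello = ConnWorker.request(:hello)
```

Calling `ConnWorker.request(:boom)` makes the worker crash on the failed match. The caller sees an exit, and the supervisor starts a new worker that runs `init/1` again.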
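Last but not least, when a supervisor exceeds its restart intensity (more than max_restarts restarts within max_seconds), it shuts down its children and exits as well, propagating the failure up to its own supervisor. That way we can reinitialize a whole branch of the supervision tree without writing a line of code.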
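
The escalation can be observed with a small sketch. The module names `Escalation.Worker` and `Escalation.Sub` are hypothetical, and `max_restarts: 0` is chosen only to make the first crash escalate immediately; real systems would normally allow some restarts.

```elixir
defmodule Escalation.Worker do
  use Agent
  def start_link(_arg), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule Escalation.Sub do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # max_restarts: 0 means the first child crash already exceeds the limit,
    # so this supervisor gives up and exits instead of restarting the worker.
    Supervisor.init([Escalation.Worker], strategy: :one_for_one, max_restarts: 0)
  end
end

# The parent uses the default limits, so it restarts Escalation.Sub.
{:ok, _main} = Supervisor.start_link([Escalation.Sub], strategy: :one_for_one)

old_sub = Process.whereis(Escalation.Sub)
Process.exit(Process.whereis(Escalation.Worker), :kill)

# Poll until the parent has started a fresh Escalation.Sub.
restarted =
  Enum.any?(1..200, fn _ ->
    Process.sleep(10)
    pid = Process.whereis(Escalation.Sub)
    is_pid(pid) and pid != old_sub
  end)
```

Here the worker crash is not handled by `Escalation.Sub` itself. The whole branch is torn down and rebuilt by the parent supervisor.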
