TensorFlow: make computing higher-order gradients through apply_gradients possible

sqyvllje · asked 6 months ago in Other
    System information
  • TensorFlow version (you are using): 2.7 / 2.8
  • Are you willing to contribute it (Yes/No): No
    Describe the feature and the current behavior/state.

Currently, TF cannot compute gradients through optimizer.apply_gradients calls. I suspect this is caused by the underlying assign operation not being differentiable. However, the update itself is definitely differentiable, and computing a gradient through it is a staple of several important lines of work, especially in Meta-Learning and other fields that actually make use of higher-order gradients.
Current behavior: While higher-order gradients can easily be implemented using e.g. nested gradient tapes (https://www.tensorflow.org/guide/advanced_autodiff#higher-order_gradients), they cannot be computed through gradient updates applied via optimizer.apply_gradients().
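For reference, the nested-tape pattern from the linked guide does work; a minimal sketch (toy values, second derivative of x**3):

import tensorflow as tf

x = tf.Variable(1.0)
with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = x * x * x
    # t1.gradient is called inside t2's context, so the gradient
    # computation itself is recorded and can be differentiated again.
    dy_dx = t1.gradient(y, x)      # 3 * x**2 -> 3.0
d2y_dx2 = t2.gradient(dy_dx, x)    # 6 * x    -> 6.0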
In theory one can stitch the chain rule together by hand: let theta be the parameters before the update, theta_dash the parameters after the update, and phi some upstream quantity that theta depends on and whose gradient one wants. Then [d_L/d_theta_dash] x [d_theta_dash/d_theta] x [d_theta/d_phi] can be assembled, because TF computes [d_L/d_theta_dash] and [d_theta/d_phi] naturally, and for plain SGD [d_theta_dash/d_theta] has a simple closed form. However, this workaround only works for SGD without momentum. For anything used in practice (say Adam as a default) it breaks down, because [d_theta_dash/d_theta] depends on the optimizer's specific parameter update rather than on the raw gradient, and that update is not naturally exposed. A minimal sketch of the SGD-only workaround follows below.
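A minimal sketch of that SGD-only workaround with toy data (w, x, y, lr are illustrative names): the inner update is written out functionally, as new tensors rather than an assign, so the outer tape can differentiate through it.

import tensorflow as tf

w = tf.Variable([[0.5]])                        # theta
x = tf.constant([[1.0]])
y = tf.constant([[2.0]])
lr = 0.1

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        inner_loss = tf.reduce_mean((x @ w - y) ** 2)
    g = inner_tape.gradient(inner_loss, w)      # recorded on outer_tape too
    w_dash = w - lr * g                         # theta_dash, built without assign
    outer_loss = tf.reduce_mean((x @ w_dash - y) ** 2)

# d outer_loss / d theta flows through the functional update (plain SGD only).
hypergrad = outer_tape.gradient(outer_loss, w)

With a stateful optimizer like Adam, the analogous functional update would need the optimizer's internal moment estimates, which apply_gradients does not expose in a differentiable way.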
In summary: TF currently makes it extremely difficult and messy to implement hypergradients through gradient updates. This could be partially addressed by exposing the differentiable parameter updates computed by optimizers.
Roughly, what should be possible can be summarized like this:

import tensorflow as tf
from tensorflow.keras import optimizers

model = some_model()  # some_model, loss, x, x_prime are placeholders
inner_optimizer = optimizers.Adam()

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        y = model(x)
        inner_loss = loss(y)
    inner_gradients = inner_tape.gradient(inner_loss, model.trainable_weights)
    # Inner update step -- outer_tape should be able to differentiate through this.
    inner_optimizer.apply_gradients(zip(inner_gradients, model.trainable_weights))

    outer_loss = loss(model(x_prime))
outer_gradients = outer_tape.gradient(outer_loss, model.trainable_weights)  # !notworking

Will this change the current API? How?

This should not change the API at all. In my opinion, gradient computation through the gradient updates in apply_gradients seems natural given the graph nature of TF. Worst case, this would add a boolean flag argument to apply_gradients to enable/disable gradients through the computation, as sketched below.
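Purely as an illustration of that worst case (a hypothetical sketch; no such argument exists in tf.keras.optimizers today, and the name differentiable is invented here):

# Hypothetical API sketch -- `differentiable` is NOT a real argument of
# apply_gradients; it is invented to illustrate the proposed opt-in flag.
inner_optimizer.apply_gradients(
    zip(inner_gradients, model.trainable_weights),
    differentiable=True,  # would record the update on any active GradientTape
)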

Who will benefit from this feature?

Primarily the Meta-Learning community, and researchers and users relying on advances in this field. Two very impactful papers showcase why this would be important:

Also Google itself, as the developer of TF: even research at Google has to default to PyTorch for work on topics like the above, see e.g. https://github.com/googleinterns/commentaries

Any Other info.

Nothing technical, just love for TF as an awesome library that I hope stays ahead!


o7jaxewo1#

I think this feature is key to competing with PyTorch for everyone who wants to do research in the meta-learning space.
After almost a year, I don't think any progress has been made, and, as @LJKS highlighted, only the SGD-only workaround is currently viable. Moreover, to wrap the training logic in a single @tf.function you need to create N copies of the model, where N is the number of inner_update_steps (see https://github.com/siavash-khodadadeh/UMTRA-Release/blob/master/models/maml/maml.py#L63), which leads to high memory consumption; a minimal sketch of this pattern follows below.
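A minimal sketch of that N-copies pattern (build_model and inner_update_steps are placeholder names):

import tensorflow as tf

def build_model():
    return tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])

inner_update_steps = 5
# One model copy per inner step, so all N weight sets coexist inside a
# single @tf.function trace and memory grows linearly with the step count.
step_models = [build_model() for _ in range(inner_update_steps)]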
Currently, there is no way to use TensorFlow efficiently for meta-learning 😔
