Attention softmax - typically applied over the last dimension (the key sequence) of the attention score matrix, so that each query's attention weights sum to 1
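
As a minimal sketch of what "softmax over the last dimension" means here, the example below normalizes a score matrix shaped [num_queries][num_keys] row by row, so each query gets a probability distribution over the keys. The function names (softmax_last_dim, attention_weights), the toy inputs, and the use of scaled dot-product scores are assumptions made for illustration; the note above does not specify how the scores are produced.

```python
import math


def softmax_last_dim(scores):
    """Softmax over the last dimension of a 2D list [num_queries][num_keys]."""
    weights = []
    for row in scores:
        row_max = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attention_weights(queries, keys):
    """Assumed scaled dot-product scores, normalized over the key axis."""
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    scores = [
        [scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys]
        for q in queries
    ]
    return softmax_last_dim(scores)


# Hypothetical toy data: 2 queries, 3 keys, feature size 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(queries, keys)

for row in weights:
    print([round(w, 4) for w in row], "sum =", round(sum(row), 6))
```

Normalizing over the key axis, rather than the query axis, is what lets each output position be read as a weighted average of the values. In array libraries the same operation is usually expressed by passing the last axis (often written as -1) to the softmax call.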