Example
Comparing to the existing implementation in Transformers.jl
See the code in NeuralAttentionlib's tests, where we compare the output and gradient of NeuralAttentionlib against the `MultiheadAttention` layer from Transformers.jl. This should provide enough knowledge for implementing a multi-head QKV attention layer with a DL framework like Flux.jl.
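As a rough sketch (not the exact test code), the layer could be assembled from standard Flux `Dense` projections around NeuralAttentionlib's `multihead_qkv_attention(head, q, k, v)` functional interface. The layer name `MyMultiheadQKVAttention`, the chosen sizes, and the field layout below are illustrative assumptions, not the package's API:

```julia
using Flux
using NeuralAttentionlib: multihead_qkv_attention

# Hypothetical multi-head QKV attention layer: four Dense projections
# around the functional attention call. `hidden` must be divisible by `head`.
struct MyMultiheadQKVAttention
    head::Int
    q_proj::Dense
    k_proj::Dense
    v_proj::Dense
    o_proj::Dense
end

Flux.@functor MyMultiheadQKVAttention

MyMultiheadQKVAttention(head::Int, hidden::Int) = MyMultiheadQKVAttention(
    head,
    Dense(hidden => hidden),  # query projection
    Dense(hidden => hidden),  # key projection
    Dense(hidden => hidden),  # value projection
    Dense(hidden => hidden),  # output projection
)

# q, k, v are feature-major arrays of size (hidden, seq_len, batch),
# the usual Flux / Transformers.jl layout.
function (m::MyMultiheadQKVAttention)(q, k, v)
    a = multihead_qkv_attention(m.head, m.q_proj(q), m.k_proj(k), m.v_proj(v))
    return m.o_proj(a)
end

# Usage: self-attention over a small random batch.
layer = MyMultiheadQKVAttention(4, 64)
x = randn(Float32, 64, 10, 2)  # (hidden, seq_len, batch)
y = layer(x, x, x)             # output has the same shape as x
```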