Example
Comparison with the existing implementation in Transformers.jl
See the code in NeuralAttentionlib's tests, where we compare the outputs and gradients from NeuralAttentionlib against the MultiheadAttention layer from Transformers.jl. This should provide enough knowledge for implementing a multi-head QKV attention layer with a DL framework like Flux.jl.
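For instance, a minimal sketch of such a layer built on Flux.jl might look like the following. The layer name `MultiheadQKVAttention`, its fields, and the projection sizes are illustrative choices, not part of either package's API; the only NeuralAttentionlib call assumed here is `multihead_qkv_attention(head, q, k, v, mask)` with an optional mask.

```julia
using Flux
using NeuralAttentionlib: multihead_qkv_attention

# Hypothetical multi-head QKV attention layer; names and fields are illustrative.
struct MultiheadQKVAttention{Q, K, V, O}
    head::Int
    q_proj::Q
    k_proj::K
    v_proj::V
    o_proj::O
end

# Only the projection layers carry trainable parameters.
Flux.@functor MultiheadQKVAttention (q_proj, k_proj, v_proj, o_proj)

function MultiheadQKVAttention(head::Int, hidden::Int)
    @assert hidden % head == 0 "hidden size must be divisible by the number of heads"
    return MultiheadQKVAttention(head,
        Dense(hidden => hidden),  # query projection
        Dense(hidden => hidden),  # key projection
        Dense(hidden => hidden),  # value projection
        Dense(hidden => hidden))  # output projection
end

# q, k, v: (hidden, seq_len, batch) arrays. The head split/merge and the
# scaled dot-product attention are handled by NeuralAttentionlib.
function (mha::MultiheadQKVAttention)(q, k, v, mask = nothing)
    a = multihead_qkv_attention(mha.head,
        mha.q_proj(q), mha.k_proj(k), mha.v_proj(v), mask)
    return mha.o_proj(a)
end
```

With this sketch, self-attention over an input `x` of size `(hidden, seq_len, batch)` is simply `mha(x, x, x)`, and an attention mask (e.g. `NeuralAttentionlib.CausalMask()`) can be passed as the fourth argument.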