Transformers.Layers

Layer building blocks of Transformers.jl. Most of the layers are designed to work with NamedTuples: a layer takes a NamedTuple as input, looks up the fields it needs as arguments for its computation, ignores any extra fields, stores the result back into the input NamedTuple under the appropriate names (conceptually, since NamedTuples are immutable), and returns it.

These layer types are mostly compatible with Flux.
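
For example, a rough sketch of the convention (the layer sizes are arbitrary, and the hidden_state field name follows this convention):

using Transformers.Layers

block = TransformerBlock(4, 32, 8, 64)   # 4 heads, hidden size 32, 8 per head, feed-forward size 64
x = randn(Float32, 32, 10, 2)            # (hidden_size, sequence length, batch)

# The block reads hidden_state, ignores the extra label field, and returns a
# NamedTuple with the updated hidden_state (extra fields are carried along).
y = block((hidden_state = x, label = 1))
size(y.hidden_state)                     # (32, 10, 2)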

API Reference

Transformers.Layers.CrossAttentionMethod
CrossAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
               dropout::Union{Nothing, Float64} = nothing, return_score = false)

Create a multi-head cross-attention layer with head heads and a per-head hidden size of head_hidden_size.
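
Example

A minimal sketch; the hidden_state (query side) and memory (key/value side) field names are assumed from the NamedTuple convention above, and the sizes are arbitrary:

using Transformers.Layers

ca = CrossAttention(4, 32)        # 4 heads, hidden size 32
q  = randn(Float32, 32, 5, 2)     # decoder side: (hidden_size, target length, batch)
kv = randn(Float32, 32, 7, 2)     # encoder side: (hidden_size, source length, batch)
nt = ca((hidden_state = q, memory = kv))
size(nt.hidden_state)             # (32, 5, 2)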

source
Transformers.Layers.EmbedType
Embed(hidden_size::Int, vocab_size::Int; scale = nothing)

An embedding layer that takes an array of integer indices / one-hot encodings and returns a multi-dimensional array of embedded vectors, scaled by scale (if given).

See also: EmbedDecoder

Example

julia> embed = Embed(7, 10; scale = 100)
Embed(7, 10, scale = 100)

julia> embed([1,3,5])
7×3 Matrix{Float32}:
  0.86955    1.14728    0.43275
 -0.378461  -0.112709   3.33885
 -1.61534   -2.55506    1.08488
 -0.833164   0.565268  -1.32531
  0.820126  -5.11536   -0.75666
 -2.13458    1.25796   -1.47247
  3.20416    0.872459   0.980557
source
Transformers.Layers.FixedLenPositionEmbedType
FixedLenPositionEmbed(hidden_size::Int, max_length::Int = 1024)

A trainable position embedding layer.

See also: SinCosPositionEmbed

Example

julia> pe = FixedLenPositionEmbed(7)
FixedLenPositionEmbed(7, 1024)

julia> pe(5)
7×5 Matrix{Float32}:
 -0.0330963    -0.0412815    -0.0110067    0.0299395   -0.0303213
  0.0203617    -0.000259752  -0.0300242    0.00573144   0.0147597
  0.00662918   -0.0222377    -9.40627f-5  -0.038285    -0.0467688
 -0.00358604    0.0344152     0.0101526   -0.00750311   0.0173139
  0.000689436   0.0116299    -0.00478128  -0.0331492    0.0148091
  0.000711651  -0.0198647    -0.0037188    0.00427536  -0.0172123
 -0.00987371   -0.0385056    -0.00103168   0.0578125    0.00286929

julia> pe([1,3])
7×2 Matrix{Float32}:
 -0.0330963    -0.0110067
  0.0203617    -0.0300242
  0.00662918   -9.40627f-5
 -0.00358604    0.0101526
  0.000689436  -0.00478128
  0.000711651  -0.0037188
 -0.00987371   -0.00103168

julia> pe(randn(3,3))
7×3 Matrix{Float32}:
 -0.0330963    -0.0412815    -0.0110067
  0.0203617    -0.000259752  -0.0300242
  0.00662918   -0.0222377    -9.40627f-5
 -0.00358604    0.0344152     0.0101526
  0.000689436   0.0116299    -0.00478128
  0.000711651  -0.0198647    -0.0037188
 -0.00987371   -0.0385056    -0.00103168
source
Transformers.Layers.ForkType
Fork(layers...)

A layer that applies each of the given layers to the same input and returns a Tuple of their outputs. For example, (Fork(dense1, dense2))(x) is equivalent to (dense1(x), dense2(x)).
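
Example

A small sketch using Flux Dense layers (any callable layers work):

using Transformers.Layers
using Flux: Dense

f = Fork(Dense(4 => 3), Dense(4 => 2))
x = randn(Float32, 4, 5)
y1, y2 = f(x)     # y1 is 3×5 and y2 is 2×5, both computed from the same x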

source
Transformers.Layers.NSplitType
NSplit(n::Integer, layer)

A layer that splits the output of layer into n parts along the first dimension and returns a Tuple. For example, (NSplit(2, dense))(x) is equivalent to y = dense(x); s1 = size(y, 1); (y[begin:div(s1, 2), :], y[div(s1, 2)+1:end, :]).
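
Example

A small sketch, splitting a Dense output into two halves along the first dimension:

using Transformers.Layers
using Flux: Dense

ns = NSplit(2, Dense(4 => 6))
x = randn(Float32, 4, 5)
h1, h2 = ns(x)    # the 6×5 output of the Dense layer is split into two 3×5 halves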

source
Transformers.Layers.PostNormTransformerBlockMethod
PostNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                         attention_dropout = nothing, dropout = nothing, return_score = false)

Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are parameters of SelfAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
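
Example

A minimal sketch; gelu is one possible activation, the sizes are arbitrary, and the hidden_state field name follows the NamedTuple convention above:

using Transformers.Layers
using Flux: gelu

block = PostNormTransformerBlock(gelu, 4, 32, 8, 64)   # 4 heads, hidden size 32, 8 per head, feed-forward size 64
x = randn(Float32, 32, 10, 2)                          # (hidden_size, sequence length, batch)
y = block((hidden_state = x,))
size(y.hidden_state)                                   # (32, 10, 2)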

source
Transformers.Layers.PostNormTransformerDecoderBlockMethod
PostNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                                attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                                return_score = false, return_self_attention_score = false)

Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are parameters of SelfAttention and CrossAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
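
Example

A minimal sketch; the hidden_state / memory field names are assumed from the NamedTuple convention above (memory holding the encoder output):

using Transformers.Layers
using Flux: gelu

block = PostNormTransformerDecoderBlock(gelu, 4, 32, 8, 64)
x = randn(Float32, 32, 5, 2)    # decoder input: (hidden_size, target length, batch)
m = randn(Float32, 32, 7, 2)    # encoder output: (hidden_size, source length, batch)
y = block((hidden_state = x, memory = m))
size(y.hidden_state)            # (32, 5, 2)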

source
Transformers.Layers.PreNormTransformerBlockMethod
PreNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, return_score = false)

Create a pre-LN transformer encoder block. head, hidden_size (and head_hidden_size) are parameters of SelfAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.

source
Transformers.Layers.PreNormTransformerDecoderBlockMethod
PreNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                               attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                               return_score = false, return_self_attention_score = false)

Create a pre-LN transformer decoder block. head, hidden_size (and head_hidden_size) are parameters of SelfAttention and CrossAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.

source
Transformers.Layers.SelfAttentionMethod
SelfAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
              dropout::Union{Nothing, Float64} = nothing, return_score = false, causal = false)

Create a multi-head self-attention layer with head heads and a per-head hidden size of head_hidden_size.
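
Example

A minimal sketch of causal self attention over a batch of sequences (hidden_state as in the NamedTuple convention above):

using Transformers.Layers

sa = SelfAttention(4, 32; causal = true)   # 4 heads, hidden size 32
x = randn(Float32, 32, 10, 2)              # (hidden_size, sequence length, batch)
y = sa((hidden_state = x,))
size(y.hidden_state)                       # (32, 10, 2)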

source
Transformers.Layers.SinCosPositionEmbedType
SinCosPositionEmbed(hidden_size::Int)

The absolute sinusoidal (sin/cos) position embedding.

See also: FixedLenPositionEmbed

Example

julia> pe = SinCosPositionEmbed(7)
SinCosPositionEmbed(default_position_func(static(7)), 7, normalized = false)

julia> pe(5)
7×5 Matrix{Float32}:
 0.0  0.841471      0.909297      0.14112     -0.756802
 1.0  0.540302     -0.416147     -0.989992    -0.653644
 0.0  0.0719065     0.143441      0.214232     0.283915
 1.0  0.997411      0.989659      0.976783     0.95885
 0.0  0.00517945    0.0103588     0.0155378    0.0207164
 1.0  0.999987      0.999946      0.999879     0.999785
 0.0  0.000372759   0.000745519   0.00111828   0.00149104

julia> pe([1,3])
7×2 Matrix{Float32}:
 0.0   0.909297
 1.0  -0.416147
 0.0   0.143441
 1.0   0.989659
 0.0   0.0103588
 1.0   0.999946
 0.0   0.000745519

julia> pe(randn(3,3))
7×3 Matrix{Float64}:
 0.0  0.841471      0.909297
 1.0  0.540302     -0.416147
 0.0  0.0719065     0.143441
 1.0  0.997411      0.989659
 0.0  0.00517945    0.0103588
 1.0  0.999987      0.999946
 0.0  0.000372759   0.000745519
source
Transformers.Layers.TransformerMethod
Transformer(T::Type{<:AbstractTransformerBlock}, n::Int, args...; kwargs...)

Create a stack of n transformer blocks, each constructed with T(args...; kwargs...).
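
Example

A sketch stacking three encoder blocks; the trailing arguments are forwarded to TransformerBlock, and hidden_state follows the NamedTuple convention above:

using Transformers.Layers

trf = Transformer(TransformerBlock, 3, 4, 32, 8, 64)   # 3 blocks, each: 4 heads, hidden 32, 8 per head, FFN 64
x = randn(Float32, 32, 10, 2)
y = trf((hidden_state = x,))
size(y.hidden_state)                                   # (32, 10, 2)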

source
Transformers.Layers.TransformerBlockMethod
TransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                 attention_dropout = nothing, dropout = nothing, return_score = false)

Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are parameters of SelfAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.

source
Transformers.Layers.TransformerDecoderBlockMethod
TransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                        return_score = false, return_self_attention_score = false)

Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are parameters of SelfAttention and CrossAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.

source