Transformers.Layers
Layer building blocks of Transformers.jl. Most of the layers are designed to work with NamedTuples: a layer takes a NamedTuple as input, picks out the fields it needs for its computation, ignores any extra fields, stores the result back into the NamedTuple under the appropriate names (conceptually, since NamedTuples are immutable a new NamedTuple is returned), and returns it.
These layer types are mostly compatible with Flux.
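For example, a transformer block can be driven entirely through a NamedTuple. The sketch below is illustrative rather than doctest output; it assumes the conventional hidden_state field name (optional fields such as attention_mask are simply omitted) and arbitrary sizes.
using Transformers.Layers: TransformerBlock

block = TransformerBlock(8, 512, 2048)            # 8 heads, hidden size 512, feed-forward size 2048
x  = randn(Float32, 512, 10, 2)                   # (hidden_size, sequence_length, batch)
nt = (hidden_state = x,)                          # extra fields would be passed through untouched
out = block(nt)                                   # returns a NamedTuple as well
size(out.hidden_state)                            # (512, 10, 2)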
API Reference
Transformers.Layers.ApplyEmbed — Type
ApplyEmbed([apply = .+,] embed)
A layer that computes an embedding and applies it to the input (with apply, by default .+). Typically used with position embeddings.
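A hedged usage sketch, assuming ApplyEmbed(embed) computes apply(x, embed(x)), so the default .+ adds a position embedding onto the token embedding:
using Transformers.Layers: Embed, ApplyEmbed, SinCosPositionEmbed

word_embed = Embed(7, 10)
pos_embed  = ApplyEmbed(SinCosPositionEmbed(7))   # apply defaults to .+
x = word_embed([1, 3, 5])                         # 7×3 token embeddings
h = pos_embed(x)                                  # 7×3, with the sin/cos position embedding added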
Transformers.Layers.CrossAttention — Method
CrossAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
               dropout::Union{Nothing, Float64} = nothing, return_score = false)
Create a multi-head cross-attention layer with head heads and head_hidden_size per head.
Transformers.Layers.CrossAttention — Method
CrossAttention(atten_op::AbstractAttenOp, head::Int, hidden_size::Int, head_hidden_size::Int)
Create a cross-attention layer with the given atten_op.
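A minimal construction sketch following the two signatures above; how the layer is fed (conventionally a NamedTuple carrying the decoder hidden_state and the encoder output under memory) is an assumption stated here, not verified output.
using Transformers.Layers: CrossAttention

ca      = CrossAttention(8, 512)                 # 8 heads, hidden size 512, 64 per head
ca_drop = CrossAttention(8, 512; dropout = 0.1)  # with attention dropout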
Transformers.Layers.Embed — Type
Embed(hidden_size::Int, vocab_size::Int; scale = nothing)
An embedding layer that takes an array of integers / one-hot encodings and returns an array of embedded vectors, multiplied by scale if it is given.
See also: EmbedDecoder
Example
julia> embed = Embed(7, 10; scale = 100)
Embed(7, 10, scale = 100)
julia> embed([1,3,5])
7×3 Matrix{Float32}:
0.86955 1.14728 0.43275
-0.378461 -0.112709 3.33885
-1.61534 -2.55506 1.08488
-0.833164 0.565268 -1.32531
0.820126 -5.11536 -0.75666
-2.13458 1.25796 -1.47247
3.20416 0.872459 0.980557
Transformers.Layers.EmbedDecoder — Type
EmbedDecoder(embed::Embed; bias = false)
A layer that shares weights with the embedding layer embed and returns the logits.
See also: Embed
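A hedged sketch of the weight tying, assuming the decoder maps hidden_size × length hidden states back to vocab_size × length logits through the shared (transposed) embedding matrix:
using Transformers.Layers: Embed, EmbedDecoder

embed   = Embed(7, 10)           # hidden_size = 7, vocab_size = 10
decoder = EmbedDecoder(embed)    # shares the 7×10 embedding weights
h = embed([1, 3, 5])             # 7×3 hidden states
logits = decoder(h)              # 10×3 logits over the vocabulary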
Transformers.Layers.FixedLenPositionEmbed — Type
FixedLenPositionEmbed(hidden_size::Int, max_length::Int = 1024)
A trainable position embedding layer with a maximum length of max_length.
See also: SinCosPositionEmbed
Example
julia> pe = FixedLenPositionEmbed(7)
FixedLenPositionEmbed(7, 1024)
julia> pe(5)
7×5 Matrix{Float32}:
-0.0330963 -0.0412815 -0.0110067 0.0299395 -0.0303213
0.0203617 -0.000259752 -0.0300242 0.00573144 0.0147597
0.00662918 -0.0222377 -9.40627f-5 -0.038285 -0.0467688
-0.00358604 0.0344152 0.0101526 -0.00750311 0.0173139
0.000689436 0.0116299 -0.00478128 -0.0331492 0.0148091
0.000711651 -0.0198647 -0.0037188 0.00427536 -0.0172123
-0.00987371 -0.0385056 -0.00103168 0.0578125 0.00286929
julia> pe([1,3])
7×2 Matrix{Float32}:
-0.0330963 -0.0110067
0.0203617 -0.0300242
0.00662918 -9.40627f-5
-0.00358604 0.0101526
0.000689436 -0.00478128
0.000711651 -0.0037188
-0.00987371 -0.00103168
julia> pe(randn(3,3))
7×3 Matrix{Float32}:
-0.0330963 -0.0412815 -0.0110067
0.0203617 -0.000259752 -0.0300242
0.00662918 -0.0222377 -9.40627f-5
-0.00358604 0.0344152 0.0101526
0.000689436 0.0116299 -0.00478128
0.000711651 -0.0198647 -0.0037188
-0.00987371 -0.0385056 -0.00103168
Transformers.Layers.Fork — Type
Fork(layers...)
A layer that applies each of the given layers to the same input and returns a Tuple. For example, (Fork(dense1, dense2))(x) is equivalent to (dense1(x), dense2(x)).
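A runnable sketch of that equivalence, using Flux's Dense for the branches (any layers accepting the same input would do):
using Flux: Dense
using Transformers.Layers: Fork

f = Fork(Dense(4 => 3), Dense(4 => 2))
x = randn(Float32, 4, 5)
y1, y2 = f(x)                    # a Tuple: the 3×5 and 2×5 outputs of the two branches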
Transformers.Layers.NSplit — Type
NSplit(n::Integer, layer)
A layer that splits the result of layer into n equal parts along the first dimension and returns a Tuple. For example, (NSplit(2, dense))(x) is equivalent to y = dense(x); s1 = size(y, 1); (y[begin:div(s1, 2), :], y[div(s1, 2)+1:end, :]).
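A sketch of the split, assuming an even n-way partition of the first dimension as described above:
using Flux: Dense
using Transformers.Layers: NSplit

ns = NSplit(2, Dense(4 => 6))
x = randn(Float32, 4, 5)
a, b = ns(x)                     # each 3×5: the 6×5 Dense output split into two halves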
Transformers.Layers.PostNormTransformerBlock — Method
PostNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                         attention_dropout = nothing, dropout = nothing, return_score = false)
Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention layer. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
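A construction-and-call sketch. The positional arguments follow the signature above; feeding the block a NamedTuple with a hidden_state field is assumed to follow the module-wide convention described at the top of this page.
using Transformers.Layers: PostNormTransformerBlock

block = PostNormTransformerBlock(8, 512, 64, 2048)   # 8 heads, 512 hidden, 64 per head, 2048 FFN
nt = (hidden_state = randn(Float32, 512, 10, 2),)
size(block(nt).hidden_state)                         # (512, 10, 2)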
Transformers.Layers.PostNormTransformerDecoderBlock — Method
PostNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                                attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                                return_score = false, return_self_attention_score = false)
Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention and CrossAttention layers. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.PreNormTransformerBlock — Method
PreNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, return_score = false)
Create a pre-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention layer. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.PreNormTransformerDecoderBlock — Method
PreNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                               attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                               return_score = false, return_self_attention_score = false)
Create a pre-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention and CrossAttention layers. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.SelfAttention — Method
SelfAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
              dropout::Union{Nothing, Float64} = nothing, return_score = false, causal = false)
Create a multi-head self-attention layer with head heads and head_hidden_size per head.
Transformers.Layers.SelfAttention — Method
SelfAttention(atten_op::AbstractAttenOp, head::Int, hidden_size::Int, head_hidden_size::Int)
Create a self-attention layer with the given atten_op.
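A minimal sketch, again assuming the NamedTuple convention and that the result comes back under hidden_state:
using Transformers.Layers: SelfAttention

sa = SelfAttention(8, 512; causal = true)            # causal masking for decoder-style use
nt = (hidden_state = randn(Float32, 512, 10, 2),)
size(sa(nt).hidden_state)                            # (512, 10, 2)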
Transformers.Layers.SinCosPositionEmbed — Type
SinCosPositionEmbed(hidden_size::Int)
The non-trainable absolute sin/cos position embedding.
See also: FixedLenPositionEmbed
Example
julia> pe = SinCosPositionEmbed(7)
SinCosPositionEmbed(default_position_func(static(7)), 7, normalized = false)
julia> pe(5)
7×5 Matrix{Float32}:
0.0 0.841471 0.909297 0.14112 -0.756802
1.0 0.540302 -0.416147 -0.989992 -0.653644
0.0 0.0719065 0.143441 0.214232 0.283915
1.0 0.997411 0.989659 0.976783 0.95885
0.0 0.00517945 0.0103588 0.0155378 0.0207164
1.0 0.999987 0.999946 0.999879 0.999785
0.0 0.000372759 0.000745519 0.00111828 0.00149104
julia> pe([1,3])
7×2 Matrix{Float32}:
0.0 0.909297
1.0 -0.416147
0.0 0.143441
1.0 0.989659
0.0 0.0103588
1.0 0.999946
0.0 0.000745519
julia> pe(randn(3,3))
7×3 Matrix{Float64}:
0.0 0.841471 0.909297
1.0 0.540302 -0.416147
0.0 0.0719065 0.143441
1.0 0.997411 0.989659
0.0 0.00517945 0.0103588
1.0 0.999987 0.999946
0.0 0.000372759 0.000745519
Transformers.Layers.Transformer — Method
Transformer(T::Type{<:AbstractTransformerBlock}, n::Int, args...; kwargs...)
Create n layers of transformer blocks, each constructed with T(args...; kwargs...).
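For example, a 6-layer post-LN encoder stack can be built by passing the block type and its constructor arguments (a sketch with arbitrary sizes; the NamedTuple call is assumed to chain through the blocks):
using Transformers.Layers: Transformer, TransformerBlock

enc = Transformer(TransformerBlock, 6, 8, 512, 2048)  # 6 blocks, each TransformerBlock(8, 512, 2048)
nt  = (hidden_state = randn(Float32, 512, 10, 2),)
size(enc(nt).hidden_state)                            # (512, 10, 2)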
Transformers.Layers.TransformerBlock — Method
TransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                 attention_dropout = nothing, dropout = nothing, return_score = false)
Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention layer. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.TransformerDecoderBlock — Method
TransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                        return_score = false, return_self_attention_score = false)
Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention and CrossAttention layers. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
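A hedged sketch of a decoder block attending over encoder output; the memory field name for the cross-attention input follows the library's NamedTuple convention and is assumed here rather than verified.
using Transformers.Layers: TransformerDecoderBlock

dec = TransformerDecoderBlock(8, 512, 2048)
nt  = (hidden_state = randn(Float32, 512, 10, 2),    # decoder input
       memory       = randn(Float32, 512, 12, 2))    # encoder output for cross attention
size(dec(nt).hidden_state)                           # (512, 10, 2)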
Transformers.Layers.no_dropout — Function
no_dropout(model)
Create a new model sharing all parameters with model but with all dropout disabled.
Transformers.Layers.set_dropout — Function
set_dropout(model, p)
Create a new model sharing all parameters with model but with every dropout probability set to p.
Transformers.Layers.testmode — Method
testmode(model)
Create a new model sharing all parameters with model but configured for testing. Currently this is equivalent to no_dropout.
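A usage sketch of the three utilities; model can be any layer stack from this module (the Transformer call and the 0.1 dropout below are arbitrary choices for illustration):
using Transformers.Layers: Transformer, TransformerBlock, no_dropout, set_dropout, testmode

model   = Transformer(TransformerBlock, 2, 8, 512, 2048; dropout = 0.1)
model_p = set_dropout(model, 0.2)    # same parameters, dropout probability 0.2
model_0 = no_dropout(model)          # same parameters, all dropout disabled
model_t = testmode(model)            # currently identical to no_dropout(model)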