Transformers.Layers
Layer building blocks of Transformers.jl. Most of the layers are designed to work with NamedTuples: a layer takes a NamedTuple as input, looks up the fields it needs by name, ignores any extra fields, stores its results in the NamedTuple under the corresponding names (conceptually, since NamedTuples are immutable), and returns it.
These layer types are mostly compatible with Flux.
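As a sketch of this calling convention (the field name hidden_state, the feature-first array layout, and the ability to omit an attention mask are assumptions based on the description above, not guarantees from this page):

using Transformers.Layers

sa = Layers.SelfAttention(4, 32)               # 4 heads, hidden size 32
x  = randn(Float32, 32, 10, 2)                 # (hidden_size, sequence length, batch)
nt = sa((hidden_state = x, extra = :ignored))  # extra fields are ignored, per the description above
size(nt.hidden_state)                          # (32, 10, 2)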
API Reference
Transformers.Layers.ApplyEmbed — Type
ApplyEmbed([apply = .+,] embed)
A layer that computes an embedding with embed and combines it with the input using apply (broadcasted addition by default). Typically used to apply position embeddings.
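A minimal sketch of that pattern, assuming ApplyEmbed(pe) applied to an embedded array adds pe's output to it (the direct-call form is an assumption; only the constructor signature is documented above):

using Transformers.Layers

embed   = Layers.Embed(7, 10)                # token embeddings: hidden size 7, vocab size 10
pe      = Layers.SinCosPositionEmbed(7)      # position embeddings with the same hidden size
add_pos = Layers.ApplyEmbed(pe)              # apply defaults to broadcasted addition

h = embed([1, 3, 5])                         # 7×3 embedded tokens
h_with_pos = add_pos(h)                      # assumed equivalent to h .+ pe(h)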
Transformers.Layers.CrossAttention — Method
CrossAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
               dropout::Union{Nothing, Float64} = nothing, return_score = false)
Create a multi-head cross-attention layer with head heads and head_hidden_size hidden units per head.
Transformers.Layers.CrossAttention — Method
CrossAttention(atten_op::AbstractAttenOp, head::Int, hidden_size::Int, head_hidden_size::Int)
Create a cross-attention layer with the given atten_op.
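A hedged sketch of how a cross-attention layer is typically called, assuming the input NamedTuple carries the decoder-side state in hidden_state and the encoder output in memory (these field names are assumptions based on the NamedTuple convention above):

using Transformers.Layers

ca  = Layers.CrossAttention(4, 32)          # 4 heads, hidden size 32, 8 hidden units per head
q   = randn(Float32, 32, 5, 2)              # decoder-side hidden state: (hidden, target length, batch)
mem = randn(Float32, 32, 9, 2)              # encoder output: (hidden, source length, batch)
nt  = ca((hidden_state = q, memory = mem))
size(nt.hidden_state)                       # (32, 5, 2)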
Transformers.Layers.Embed — Type
Embed(hidden_size::Int, vocab_size::Int; scale = nothing)
An embedding layer that takes an array of integers or one-hot encodings and returns an array of embedding vectors, optionally multiplied by scale.
See also: EmbedDecoder
Example
julia> embed = Embed(7, 10; scale = 100)
Embed(7, 10, scale = 100)
julia> embed([1,3,5])
7×3 Matrix{Float32}:
0.86955 1.14728 0.43275
-0.378461 -0.112709 3.33885
-1.61534 -2.55506 1.08488
-0.833164 0.565268 -1.32531
0.820126 -5.11536 -0.75666
-2.13458 1.25796 -1.47247
3.20416 0.872459 0.980557
Transformers.Layers.EmbedDecoder — Type
EmbedDecoder(embed::Embed; bias = false)
A layer that shares weights with an embedding layer embed and returns the logits.
See also: Embed
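A sketch of weight tying with EmbedDecoder, assuming it can be applied directly to a hidden-state array to produce vocabulary logits (the call form and output shape are assumptions; only the constructor is documented above):

using Transformers.Layers

embed   = Layers.Embed(7, 10)           # hidden size 7, vocab size 10
decoder = Layers.EmbedDecoder(embed)    # ties the output projection to the embedding weights

h      = randn(Float32, 7, 3)           # hidden states for 3 positions
logits = decoder(h)                     # assumed to be a 10×3 array of logits over the vocabulary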
Transformers.Layers.FixedLenPositionEmbed — Type
FixedLenPositionEmbed(hidden_size::Int, max_length::Int = 1024)
A trainable position embedding layer with a fixed maximum length.
See also: SinCosPositionEmbed
Example
julia> pe = FixedLenPositionEmbed(7)
FixedLenPositionEmbed(7, 1024)
julia> pe(5)
7×5 Matrix{Float32}:
-0.0330963 -0.0412815 -0.0110067 0.0299395 -0.0303213
0.0203617 -0.000259752 -0.0300242 0.00573144 0.0147597
0.00662918 -0.0222377 -9.40627f-5 -0.038285 -0.0467688
-0.00358604 0.0344152 0.0101526 -0.00750311 0.0173139
0.000689436 0.0116299 -0.00478128 -0.0331492 0.0148091
0.000711651 -0.0198647 -0.0037188 0.00427536 -0.0172123
-0.00987371 -0.0385056 -0.00103168 0.0578125 0.00286929
julia> pe([1,3])
7×2 Matrix{Float32}:
-0.0330963 -0.0110067
0.0203617 -0.0300242
0.00662918 -9.40627f-5
-0.00358604 0.0101526
0.000689436 -0.00478128
0.000711651 -0.0037188
-0.00987371 -0.00103168
julia> pe(randn(3,3))
7×3 Matrix{Float32}:
-0.0330963 -0.0412815 -0.0110067
0.0203617 -0.000259752 -0.0300242
0.00662918 -0.0222377 -9.40627f-5
-0.00358604 0.0344152 0.0101526
0.000689436 0.0116299 -0.00478128
0.000711651 -0.0198647 -0.0037188
-0.00987371 -0.0385056 -0.00103168
Transformers.Layers.Fork — Type
Fork(layers...)
A layer that applies each of the given layers to the same input and returns a Tuple. For example, (Fork(dense1, dense2))(x) is equivalent to (dense1(x), dense2(x)).
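A concrete sketch using Flux.Dense for the branches (Flux is not documented on this page; it is used here only for illustration):

using Flux, Transformers.Layers

fork = Layers.Fork(Dense(4 => 3), Dense(4 => 2))
x = randn(Float32, 4, 5)
y1, y2 = fork(x)            # y1 is 3×5, y2 is 2×5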
Transformers.Layers.NSplit — Type
NSplit(n::Integer, layer)
A layer that splits the output of layer into n parts along the first dimension and returns a Tuple. For example, (NSplit(2, dense))(x) is equivalent to y = dense(x); s1 = size(y, 1); (y[begin:div(s1, 2), :], y[div(s1, 2)+1:end, :]).
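A sketch of the same idea with a concrete layer (again using Flux.Dense purely for illustration; the even split described above is assumed):

using Flux, Transformers.Layers

proj = Layers.NSplit(2, Dense(4 => 6))   # the Dense output has 6 rows, split into two 3-row halves
x = randn(Float32, 4, 5)
a, b = proj(x)                           # a and b are each 3×5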
Transformers.Layers.PostNormTransformerBlock — Method
PostNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                         attention_dropout = nothing, dropout = nothing, return_score = false)
Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of SelfAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
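A hedged construction/usage sketch showing the optional activation argument (gelu is re-exported by Flux from NNlib; the NamedTuple field name is an assumption, as in the introduction):

using Flux, Transformers.Layers

block = Layers.PostNormTransformerBlock(gelu, 4, 32, 8, 128)   # act, head, hidden, head_hidden, intermediate
x  = randn(Float32, 32, 10, 2)
nt = block((hidden_state = x,))
size(nt.hidden_state)                                          # (32, 10, 2)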
Transformers.Layers.PostNormTransformerDecoderBlock — Method
PostNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                                attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                                return_score = false, return_self_attention_score = false)
Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of SelfAttention and CrossAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
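A hedged sketch of calling a decoder block with encoder output, assuming the fields hidden_state and memory:

using Transformers.Layers

dec = Layers.PostNormTransformerDecoderBlock(4, 32, 8, 128)
tgt = randn(Float32, 32, 6, 2)                  # decoder-side hidden state
mem = randn(Float32, 32, 9, 2)                  # encoder output
nt  = dec((hidden_state = tgt, memory = mem))
size(nt.hidden_state)                           # (32, 6, 2)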
Transformers.Layers.PreNormTransformerBlock — Method
PreNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, return_score = false)
Create a pre-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of SelfAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.PreNormTransformerDecoderBlock — Method
PreNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                               attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                               return_score = false, return_self_attention_score = false)
Create a pre-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of SelfAttention and CrossAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.SelfAttention — Method
SelfAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
              dropout::Union{Nothing, Float64} = nothing, return_score = false, causal = false)
Create a multi-head self-attention layer with head heads and head_hidden_size hidden units per head.
Transformers.Layers.SelfAttention — Method
SelfAttention(atten_op::AbstractAttenOp, head::Int, hidden_size::Int, head_hidden_size::Int)
Create a self-attention layer with the given atten_op.
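A usage sketch with causal masking enabled (the hidden_state field name is an assumption, as noted in the introduction):

using Transformers.Layers

causal_sa = Layers.SelfAttention(4, 32; causal = true)   # each position attends only to itself and earlier positions
x  = randn(Float32, 32, 10, 2)
nt = causal_sa((hidden_state = x,))
size(nt.hidden_state)                                    # (32, 10, 2)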
Transformers.Layers.SinCosPositionEmbed — Type
SinCosPositionEmbed(hidden_size::Int)
The absolute sinusoidal (sin/cos) position embedding.
See also: FixedLenPositionEmbed
Example
julia> pe = SinCosPositionEmbed(7)
SinCosPositionEmbed(default_position_func(static(7)), 7, normalized = false)
julia> pe(5)
7×5 Matrix{Float32}:
0.0 0.841471 0.909297 0.14112 -0.756802
1.0 0.540302 -0.416147 -0.989992 -0.653644
0.0 0.0719065 0.143441 0.214232 0.283915
1.0 0.997411 0.989659 0.976783 0.95885
0.0 0.00517945 0.0103588 0.0155378 0.0207164
1.0 0.999987 0.999946 0.999879 0.999785
0.0 0.000372759 0.000745519 0.00111828 0.00149104
julia> pe([1,3])
7×2 Matrix{Float32}:
0.0 0.909297
1.0 -0.416147
0.0 0.143441
1.0 0.989659
0.0 0.0103588
1.0 0.999946
0.0 0.000745519
julia> pe(randn(3,3))
7×3 Matrix{Float64}:
0.0 0.841471 0.909297
1.0 0.540302 -0.416147
0.0 0.0719065 0.143441
1.0 0.997411 0.989659
0.0 0.00517945 0.0103588
1.0 0.999987 0.999946
0.0 0.000372759 0.000745519
Transformers.Layers.Transformer — Method
Transformer(T::Type{<:AbstractTransformerBlock}, n::Int, args...; kwargs...)
Create a stack of n transformer blocks, each constructed with T(args...; kwargs...).
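For example, a sketch stacking 3 post-LN encoder blocks (the NamedTuple field name is an assumption, as in the introduction):

using Transformers.Layers

trf = Layers.Transformer(Layers.TransformerBlock, 3, 4, 32, 8, 128)   # 3 × TransformerBlock(4, 32, 8, 128)
x   = randn(Float32, 32, 10, 2)
nt  = trf((hidden_state = x,))
size(nt.hidden_state)                                                 # (32, 10, 2)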
Transformers.Layers.TransformerBlock — Method
TransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                 attention_dropout = nothing, dropout = nothing, return_score = false)
Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of SelfAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.TransformerDecoderBlock — Method
TransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                        return_score = false, return_self_attention_score = false)
Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of SelfAttention and CrossAttention. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.no_dropout — Function
no_dropout(model)
Create a new model that shares all parameters with model but has all dropout disabled.
Transformers.Layers.set_dropout — Function
set_dropout(model, p)
Create a new model that shares all parameters with model but has every dropout probability set to p.
Transformers.Layers.testmode — Method
testmode(model)
Create a new model that shares all parameters with model but is configured for testing. Currently this is equivalent to no_dropout.
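A sketch combining these helpers (only the documented behavior is assumed; the dropout keyword arguments come from the block constructors above):

using Transformers.Layers

model       = Layers.TransformerBlock(4, 32, 8, 128; dropout = 0.1, attention_dropout = 0.1)
eval_model  = Layers.no_dropout(model)        # same parameters, all dropout disabled
tuned_model = Layers.set_dropout(model, 0.2)  # same parameters, every dropout probability set to 0.2
test_model  = Layers.testmode(model)          # currently equivalent to no_dropout(model)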