Transformers.Layers
Layer building blocks of Transformers.jl. Most of the layers are designed to work with NamedTuples: a layer takes a NamedTuple as input, picks out the fields it needs for its computation, ignores any extra fields, stores the result back into the NamedTuple under the appropriate names (conceptually, since NamedTuples are immutable a new NamedTuple is returned), and returns it.
These layer types are mostly compatible with Flux.
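For example, a transformer block can be driven entirely through a NamedTuple. The sketch below is illustrative rather than doctest output; it assumes the conventional hidden_state field name (optional fields such as attention_mask are simply omitted) and arbitrary sizes.
using Transformers.Layers: TransformerBlock

block = TransformerBlock(8, 512, 2048)            # 8 heads, hidden size 512, feed-forward size 2048
x  = randn(Float32, 512, 10, 2)                   # (hidden_size, sequence_length, batch)
nt = (hidden_state = x,)                          # extra fields would be passed through untouched
out = block(nt)                                   # returns a NamedTuple as well
size(out.hidden_state)                            # (512, 10, 2)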
API Reference
Transformers.Layers.ApplyEmbed — Type
ApplyEmbed([apply = .+,] embed)
A layer that computes an embedding and applies it to the input (with apply, by default .+). Typically used with position embeddings.
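A hedged usage sketch, assuming ApplyEmbed(embed) computes apply(x, embed(x)), so the default .+ adds a position embedding onto the token embedding:
using Transformers.Layers: Embed, ApplyEmbed, SinCosPositionEmbed

word_embed = Embed(7, 10)
pos_embed  = ApplyEmbed(SinCosPositionEmbed(7))   # apply defaults to .+
x = word_embed([1, 3, 5])                         # 7×3 token embeddings
h = pos_embed(x)                                  # 7×3, with the sin/cos position embedding added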
Transformers.Layers.CrossAttention — Method
CrossAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
               dropout::Union{Nothing, Float64} = nothing, return_score = false)
Create a multi-head cross-attention layer with head heads and head_hidden_size per head.
Transformers.Layers.CrossAttention — Method
CrossAttention(atten_op::AbstractAttenOp, head::Int, hidden_size::Int, head_hidden_size::Int)
Create a cross-attention layer with the given atten_op.
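A minimal construction sketch following the two signatures above; how the layer is fed (conventionally a NamedTuple carrying the decoder hidden_state and the encoder output under memory) is an assumption stated here, not verified output.
using Transformers.Layers: CrossAttention

ca      = CrossAttention(8, 512)                 # 8 heads, hidden size 512, 64 per head
ca_drop = CrossAttention(8, 512; dropout = 0.1)  # with attention dropout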
Transformers.Layers.Embed — Type
Embed(hidden_size::Int, vocab_size::Int; scale = nothing)
An embedding layer that takes an array of integers / one-hot encodings and returns an array of embedded vectors, multiplied by scale if it is given.
See also: EmbedDecoder
Example
julia> embed = Embed(7, 10; scale = 100)
Embed(7, 10, scale = 100)
julia> embed([1,3,5])
7×3 Matrix{Float32}:
0.86955 1.14728 0.43275
-0.378461 -0.112709 3.33885
-1.61534 -2.55506 1.08488
-0.833164 0.565268 -1.32531
0.820126 -5.11536 -0.75666
-2.13458 1.25796 -1.47247
3.20416 0.872459 0.980557
Transformers.Layers.EmbedDecoder — Type
EmbedDecoder(embed::Embed; bias = false)
A layer that shares weights with the embedding layer embed and returns the logits.
See also: Embed
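A hedged sketch of the weight tying, assuming the decoder maps hidden_size × length hidden states back to vocab_size × length logits through the shared (transposed) embedding matrix:
using Transformers.Layers: Embed, EmbedDecoder

embed   = Embed(7, 10)           # hidden_size = 7, vocab_size = 10
decoder = EmbedDecoder(embed)    # shares the 7×10 embedding weights
h = embed([1, 3, 5])             # 7×3 hidden states
logits = decoder(h)              # 10×3 logits over the vocabulary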
Transformers.Layers.FixedLenPositionEmbed — Type
FixedLenPositionEmbed(hidden_size::Int, max_length::Int = 1024)
A trainable position embedding layer with a maximum length of max_length.
See also: SinCosPositionEmbed
Example
julia> pe = FixedLenPositionEmbed(7)
FixedLenPositionEmbed(7, 1024)
julia> pe(5)
7×5 Matrix{Float32}:
-0.0330963 -0.0412815 -0.0110067 0.0299395 -0.0303213
0.0203617 -0.000259752 -0.0300242 0.00573144 0.0147597
0.00662918 -0.0222377 -9.40627f-5 -0.038285 -0.0467688
-0.00358604 0.0344152 0.0101526 -0.00750311 0.0173139
0.000689436 0.0116299 -0.00478128 -0.0331492 0.0148091
0.000711651 -0.0198647 -0.0037188 0.00427536 -0.0172123
-0.00987371 -0.0385056 -0.00103168 0.0578125 0.00286929
julia> pe([1,3])
7×2 Matrix{Float32}:
-0.0330963 -0.0110067
0.0203617 -0.0300242
0.00662918 -9.40627f-5
-0.00358604 0.0101526
0.000689436 -0.00478128
0.000711651 -0.0037188
-0.00987371 -0.00103168
julia> pe(randn(3,3))
7×3 Matrix{Float32}:
-0.0330963 -0.0412815 -0.0110067
0.0203617 -0.000259752 -0.0300242
0.00662918 -0.0222377 -9.40627f-5
-0.00358604 0.0344152 0.0101526
0.000689436 0.0116299 -0.00478128
0.000711651 -0.0198647 -0.0037188
-0.00987371 -0.0385056 -0.00103168
Transformers.Layers.Fork — Type
Fork(layers...)
A layer that applies each of the given layers to the same input and returns a Tuple. For example, (Fork(dense1, dense2))(x) is equivalent to (dense1(x), dense2(x)).
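A runnable sketch of that equivalence, using Flux's Dense for the branches (any layers accepting the same input would do):
using Flux: Dense
using Transformers.Layers: Fork

f = Fork(Dense(4 => 3), Dense(4 => 2))
x = randn(Float32, 4, 5)
y1, y2 = f(x)                    # a Tuple: the 3×5 and 2×5 outputs of the two branches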
Transformers.Layers.NSplit — Type
NSplit(n::Integer, layer)
A layer that splits the result of layer into n equal parts along the first dimension and returns a Tuple. For example, (NSplit(2, dense))(x) is equivalent to y = dense(x); s1 = size(y, 1); (y[begin:div(s1, 2), :], y[div(s1, 2)+1:end, :]).
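A sketch of the split, assuming an even n-way partition of the first dimension as described above:
using Flux: Dense
using Transformers.Layers: NSplit

ns = NSplit(2, Dense(4 => 6))
x = randn(Float32, 4, 5)
a, b = ns(x)                     # each 3×5: the 6×5 Dense output split into two halves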
Transformers.Layers.PostNormTransformerBlock — Method
PostNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                         attention_dropout = nothing, dropout = nothing, return_score = false)
Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention layer. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
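A construction-and-call sketch. The positional arguments follow the signature above; feeding the block a NamedTuple with a hidden_state field is assumed to follow the module-wide convention described at the top of this page.
using Transformers.Layers: PostNormTransformerBlock

block = PostNormTransformerBlock(8, 512, 64, 2048)   # 8 heads, 512 hidden, 64 per head, 2048 FFN
nt = (hidden_state = randn(Float32, 512, 10, 2),)
size(block(nt).hidden_state)                         # (512, 10, 2)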
Transformers.Layers.PostNormTransformerDecoderBlock — Method
PostNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                                attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                                return_score = false, return_self_attention_score = false)
Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention and CrossAttention layers. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.PreNormTransformerBlock — Method
PreNormTransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, return_score = false)
Create a pre-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention layer. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.PreNormTransformerDecoderBlock — Method
PreNormTransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                               attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                               return_score = false, return_self_attention_score = false)
Create a pre-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention and CrossAttention layers. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.SelfAttention — Method
SelfAttention(head::Int, hidden_size::Int [, head_hidden_size::Int = hidden_size ÷ head ];
              dropout::Union{Nothing, Float64} = nothing, return_score = false, causal = false)
Create a multi-head self-attention layer with head heads and head_hidden_size per head.
Transformers.Layers.SelfAttention — Method
SelfAttention(atten_op::AbstractAttenOp, head::Int, hidden_size::Int, head_hidden_size::Int)
Create a self-attention layer with the given atten_op.
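A minimal sketch, again assuming the NamedTuple convention and that the result comes back under hidden_state:
using Transformers.Layers: SelfAttention

sa = SelfAttention(8, 512; causal = true)            # causal masking for decoder-style use
nt = (hidden_state = randn(Float32, 512, 10, 2),)
size(sa(nt).hidden_state)                            # (512, 10, 2)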
Transformers.Layers.SinCosPositionEmbed — Type
SinCosPositionEmbed(hidden_size::Int)
The non-trainable absolute sin/cos position embedding.
See also: FixedLenPositionEmbed
Example
julia> pe = SinCosPositionEmbed(7)
SinCosPositionEmbed(default_position_func(static(7)), 7, normalized = false)
julia> pe(5)
7×5 Matrix{Float32}:
0.0 0.841471 0.909297 0.14112 -0.756802
1.0 0.540302 -0.416147 -0.989992 -0.653644
0.0 0.0719065 0.143441 0.214232 0.283915
1.0 0.997411 0.989659 0.976783 0.95885
0.0 0.00517945 0.0103588 0.0155378 0.0207164
1.0 0.999987 0.999946 0.999879 0.999785
0.0 0.000372759 0.000745519 0.00111828 0.00149104
julia> pe([1,3])
7×2 Matrix{Float32}:
0.0 0.909297
1.0 -0.416147
0.0 0.143441
1.0 0.989659
0.0 0.0103588
1.0 0.999946
0.0 0.000745519
julia> pe(randn(3,3))
7×3 Matrix{Float64}:
0.0 0.841471 0.909297
1.0 0.540302 -0.416147
0.0 0.0719065 0.143441
1.0 0.997411 0.989659
0.0 0.00517945 0.0103588
1.0 0.999987 0.999946
0.0 0.000372759 0.000745519
Transformers.Layers.Transformer — Method
Transformer(T::Type{<:AbstractTransformerBlock}, n::Int, args...; kwargs...)
Create n layers of transformer blocks, each constructed with T(args...; kwargs...).
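For example, a 6-layer post-LN encoder stack can be built by passing the block type and its constructor arguments (a sketch with arbitrary sizes; the NamedTuple call is assumed to chain through the blocks):
using Transformers.Layers: Transformer, TransformerBlock

enc = Transformer(TransformerBlock, 6, 8, 512, 2048)  # 6 blocks, each TransformerBlock(8, 512, 2048)
nt  = (hidden_state = randn(Float32, 512, 10, 2),)
size(enc(nt).hidden_state)                            # (512, 10, 2)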
Transformers.Layers.TransformerBlock — Method
TransformerBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                 attention_dropout = nothing, dropout = nothing, return_score = false)
Create a post-LN transformer encoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention layer. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
Transformers.Layers.TransformerDecoderBlock — Method
TransformerDecoderBlock([act,] head::Int, hidden_size::Int [, head_hidden_size::Int], intermediate_size::Int;
                        attention_dropout = nothing, dropout = nothing, cross_attention_dropout = nothing,
                        return_score = false, return_self_attention_score = false)
Create a post-LN transformer decoder block. head, hidden_size (and head_hidden_size) are the parameters of the SelfAttention and CrossAttention layers. intermediate_size, hidden_size (and act) are used to create the two-layer feed-forward network.
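A hedged sketch of a decoder block attending over encoder output; the memory field name for the cross-attention input follows the library's NamedTuple convention and is assumed here rather than verified.
using Transformers.Layers: TransformerDecoderBlock

dec = TransformerDecoderBlock(8, 512, 2048)
nt  = (hidden_state = randn(Float32, 512, 10, 2),    # decoder input
       memory       = randn(Float32, 512, 12, 2))    # encoder output for cross attention
size(dec(nt).hidden_state)                           # (512, 10, 2)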
Transformers.Layers.no_dropout — Function
no_dropout(model)
Create a new model sharing all parameters with model but with all dropout disabled.
Transformers.Layers.set_dropout — Function
set_dropout(model, p)
Create a new model sharing all parameters with model but with every dropout probability set to p.
Transformers.Layers.testmode — Method
testmode(model)
Create a new model sharing all parameters with model but configured for testing. Currently this is equivalent to no_dropout.
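A usage sketch of the three utilities; model can be any layer stack from this module (the Transformer call and the 0.1 dropout below are arbitrary choices for illustration):
using Transformers.Layers: Transformer, TransformerBlock, no_dropout, set_dropout, testmode

model   = Transformer(TransformerBlock, 2, 8, 512, 2048; dropout = 0.1)
model_p = set_dropout(model, 0.2)    # same parameters, dropout probability 0.2
model_0 = no_dropout(model)          # same parameters, all dropout disabled
model_t = testmode(model)            # currently identical to no_dropout(model)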