API Reference
Functional
NeuralAttentionlib.alibi_position_embedding — Function
alibi_position_embedding(mask::Union{AbstractAttenMask, Nothing}, score, args...)
Add the non-trainable ALiBi position embedding to the attention score. The ALiBi embedding varies for each head, assuming the attention is a multi-head variant: the first dimension of the batch dimension of the attention score is treated as the head dimension (if used in single-head attention, the ALiBi values would vary across batches instead). mask can either be an attention mask or nothing. It is usually needed when there are gaps or prefix paddings in the samples.
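A minimal sketch of composing it into an attention call with $ (this follows the score-wrapper chaining pattern shown for masked_score and naive_qkv_attention below; the exact composition here is an illustrative assumption, not taken verbatim from the package docs):
julia> q = k = v = randn(64, 10, 2);  # (feature, length, batch)
julia> score = normalized_score(NNlib.softmax) $ alibi_position_embedding $ nothing $ dot_product_score;
julia> y = generic_multihead_qkv_attention(weighted_sum_mixing, score, 4, q, k, v);  # 4 heads, no mask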
NeuralAttentionlib.attention_score — Function
attention_score(f, args...) = f(args...)
Attention score api. Can be overloaded to provide a custom implementation with generic_qkv_attention. f is the score function.
See also: generic_qkv_attention, generic_multihead_qkv_attention, mixing
NeuralAttentionlib.biased_score — Function
biased_score(bias, score, args...)
Add a precomputed bias to the attention score. bias should have shape (key length, query length, ...), and size(bias, 1) == size(s, 1) == size(bias, 2) == size(s, 2) && ndims(bias) <= ndims(s), where s = score(args...), must hold.
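For example, a hedged sketch of adding a fixed bias (the composition below reuses the $-chaining pattern from the docstrings in this section; the bias values are random and purely illustrative):
julia> x = randn(32, 10, 2); bias = randn(10, 10);  # bias: (key length, query length)
julia> score = normalized_score(NNlib.softmax) $ biased_score $ bias $ scaled_dot_product_score;
julia> y = generic_qkv_attention(weighted_sum_mixing, score, x, x, x);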
NeuralAttentionlib.dot_product_score — Function
dot_product_score(q, k)
Dot-product attention score function. Equivalent to scaled_dot_product_score(q, k, 1).
See also: scaled_dot_product_score
NeuralAttentionlib.generic_grouped_query_attention — Function
generic_grouped_query_attention(mixingf, scoref, head, group, q, k, v, args...)
Generic version of grouped_query_attention. The mixing and score functions need to be specified.
NeuralAttentionlib.generic_multihead_qkv_attention — Function
generic_multihead_qkv_attention(mixingf, scoref, head, q, k, v, args...)
Generic version of multihead_qkv_attention. The mixing and score functions need to be specified.
NeuralAttentionlib.generic_qkv_attention — Function
generic_qkv_attention(mixingf, scoref, q, k, v, args...)
Generic version of naive_qkv_attention. The mixing and score functions need to be specified.
NeuralAttentionlib.get_sincos_position_embeddings — Function
get_sincos_position_embeddings(hidden_size::Integer, normalized::Bool, x)
Sinusoidal (sincos) position embeddings. x can be either an integer specifying the length or an array of position indices.
NeuralAttentionlib.grouped_query_attention — Function
grouped_query_attention(head, group, q, k, v, mask=nothing)
Similar to multihead_qkv_attention, but with only group key/value heads shared among the head query heads, so multiple query heads attend to the same keys/values.
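A minimal sketch (the feature-dimension layout assumed here, head * d for q and group * d for k/v, is an illustrative assumption):
julia> head, group, d, len, batch = 8, 2, 16, 10, 4;
julia> q = randn(head * d, len, batch); k = v = randn(group * d, len, batch);
julia> y = grouped_query_attention(head, group, q, k, v);  # output expected to be (head * d, len, batch)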
NeuralAttentionlib.l2norm — Function
l2norm([epsilon = 1e-5,] x)
Function that performs L2 normalization on x.
NeuralAttentionlib.layer_norm — Function
layer_norm([epsilon = 1e-5,] alpha, beta, x)
Function that performs layer normalization on x. alpha and beta can be a Vector, Number, or Nothing.
$layer_norm(α, β, x) = α\frac{(x - μ)}{σ} + β$
If both alpha and beta are Nothing, this is just a standardization applied along the first dimension.
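For example (a small sketch; with unit alpha and zero beta the result should match the standardize-only form, assuming the Vector forms broadcast over the first dimension as described):
julia> x = randn(4, 3);
julia> layer_norm(ones(4), zeros(4), x) ≈ layer_norm(nothing, nothing, x)
true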
NeuralAttentionlib.masked_score — Functionmasked_score(mask) = masked_score $ mask
masked_score(maskop, mask) = masked_score $ maskop $ mask
masked_score(maskop::AbstractMaskOp, mask::AbstractMask, score, args...)
Masked attention score api. Apply the mask according to maskop to the attention score computed from score(args...).
See also: naive_qkv_attention, SymLengthMask, BiLengthMask
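A small sketch of composing it into an attention call (this mirrors the equivalence documented for naive_qkv_attention below; GenericMaskOp() and CausalMask() are documented in the Mask section):
julia> x = randn(32, 10, 2);
julia> score = normalized_score(NNlib.softmax) $ masked_score(GenericMaskOp(), CausalMask()) $ scaled_dot_product_score;
julia> y = generic_qkv_attention(weighted_sum_mixing, score, x, x, x);  # causal self attention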
NeuralAttentionlib.merge_head — Function
merge_head(x)
Merge the head dimension that was split by split_head.
NeuralAttentionlib.mixing — Function
mixing(f, v, g, args...) = f(attention_score(g, args...), v)
Mixing function api. Can be overloaded to provide a custom implementation with generic_qkv_attention. f is the mixing function and g is the score function.
See also: generic_qkv_attention, generic_multihead_qkv_attention, attention_score
NeuralAttentionlib.move_head_dim_in — Function
move_head_dim_in(x::AbstractArray, nobatch=false)
Equivalent to permutedims(x, move_head_dim_in_perm(x, nobatch)).
See also: merge_head, move_head_dim_in_perm
NeuralAttentionlib.move_head_dim_in_perm — Functionmove_head_dim_in_perm(x::AbstractArray{T, N}, nobatch=false)
move_head_dim_in_perm(N::Int, nobatch=false)
Dimension order for permutedims to move the head dimension (created by split_head) from the batch dimension back next to the feature dimension (for merge_head). Return a tuple of integers of length N. nobatch specifies whether x has a batch dimension (nobatch = true means it does not).
Example
julia> Functional.move_head_dim_in_perm(5, false)
(1, 4, 2, 3, 5)
julia> Functional.move_head_dim_in_perm(5, true)
(1, 5, 2, 3, 4)
See also: merge_head, move_head_dim_in
NeuralAttentionlib.move_head_dim_out — Function
move_head_dim_out(x::AbstractArray, nobatch=false)
Equivalent to permutedims(x, move_head_dim_out_perm(x, nobatch)).
See also: split_head, move_head_dim_out_perm
NeuralAttentionlib.move_head_dim_out_perm — Functionmove_head_dim_out_perm(x::AbstractArray{T, N}, nobatch=false)
move_head_dim_out_perm(N::Int, nobatch=false)
Dimension order for permutedims to move the head dimension (created by split_head) to the batch dimension. Return a tuple of integers of length N. nobatch specifies whether x has a batch dimension (nobatch = true means it does not).
Example
julia> Functional.move_head_dim_out_perm(5, false)
(1, 3, 4, 2, 5)
julia> Functional.move_head_dim_out_perm(5, true)
(1, 3, 4, 5, 2)
See also: split_head, move_head_dim_out
NeuralAttentionlib.multihead_qkv_attention — Function
multihead_qkv_attention(head, q, k, v, mask=nothing)
Multihead version of naive_qkv_attention. The core operation for implementing a regular transformer layer.
NeuralAttentionlib.naive_qkv_attention — Function
naive_qkv_attention(q, k, v, mask=nothing)
The scaled dot-product attention of a regular transformer layer.
$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$
It's equivalent to generic_qkv_attention(weighted_sum_mixing, normalized_score(NNlib.softmax) $ masked_score(GenericMaskOp(), mask) $ scaled_dot_product_score, q, k, v).
Example
julia> fdim, ldim, bdim = 32, 10, 4;
julia> x = randn(fdim, ldim, bdim);
julia> y = naive_qkv_attention(x, x, x); # simple self attention
# no mask here
julia> z = generic_qkv_attention(weighted_sum_mixing, normalized_score(NNlib.softmax) $ scaled_dot_product_score, x, x, x);
julia> y ≈ z
true
See also: generic_qkv_attention
NeuralAttentionlib.normalized_score — Functionnormalized_score(norm) = normalized_score $ norm
normalized_score(norm, score, args...)
Normalized attention score api. norm is the normalize function (like softmax) and score is the function that computes the attention score from args....
See also: naive_qkv_attention
NeuralAttentionlib.rms_layer_norm — Function
rms_layer_norm([epsilon = 1e-5,] alpha, x)
Function that performs root-mean-square layer normalization on x. alpha can be a Vector, Number, or Nothing.
$rms_layer_norm(α, x) = α\frac{x}{\sqrt{\sum_{i=1}^{N} x_i^2 / N}}$
If alpha is Nothing, this is just normalization by the root mean square along the first dimension.
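For example (a small sketch; a unit alpha should match the alpha-less form):
julia> x = randn(4, 3);
julia> rms_layer_norm(ones(4), x) ≈ rms_layer_norm(nothing, x)
true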
NeuralAttentionlib.scalar_relative_position_embedding — Function
scalar_relative_position_embedding(relative_position_id_func, embedding_table, score, args...)
A relative position embedding that produces a trainable scalar bias for each value in the attention score. relative_position_id_func is a function that takes the attention score and returns a relative_position_id matrix of the same size as the attention score without the batch dimensions (normally (key length, query length)). This relative_position_id is used to index (or gather) the embedding_table. embedding_table is an array with multiple dimensions, where the first dimension is the number of possible "id"s and the remaining dimensions give a different value to each head. By default, the last dimension of the attention score is treated as the batch dimension and the dimensions between the last dimension and the "length" dimensions are treated as the head dimensions.
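A hedged sketch of the T5-style usage (the embedding_table layout (n_buckets, head) and the overall composition are illustrative assumptions; see t5_bucketed_position_id below):
julia> head, n_buckets = 4, 32;
julia> embedding_table = randn(n_buckets, head);
julia> score = normalized_score(NNlib.softmax) $ scalar_relative_position_embedding $ t5_bucketed_position_id(n_buckets, 128) $ embedding_table $ scaled_dot_product_score;
julia> y = generic_multihead_qkv_attention(weighted_sum_mixing, score, head, randn(64, 10, 2), randn(64, 10, 2), randn(64, 10, 2));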
NeuralAttentionlib.scaled_dot_product_score — Function
scaled_dot_product_score(q, k, s = sqrt(inv(size(k, 1))))
The scaled dot-product attention score function of a regular transformer layer.
$Score(Q, K) = \frac{QK^T}{\sqrt{d_k}}$
scaled_dot_product_score(f, q, k)
Apply a transform function f on q/k before the dot-product.
See also: naive_qkv_attention
NeuralAttentionlib.split_head — Function
split_head(head::Int, x)
Split the first dimension into head pieces of smaller vectors. Equivalent to reshape(x, :, head, tail(size(x))...).
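A small sketch of the usual round trip with the head-dimension helpers (shapes are illustrative):
julia> x = randn(32, 10, 4);  # (feature, length, batch)
julia> h = move_head_dim_out(split_head(8, x));  # head dimension moved next to the batch dimension
julia> merge_head(move_head_dim_in(h)) == x
true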
NeuralAttentionlib.t5_bucketed_position_id — Function
t5_bucketed_position_id(n_buckets::Int, max_distance::Int)
A relative_position_id_func used in the T5 Transformer model. Relative distances are assigned to logarithmically sized buckets, and any distance beyond max_distance is assigned to the same bucket.
See also: scalar_relative_position_embedding, t5_causal_bucketed_position_id
NeuralAttentionlib.t5_causal_bucketed_position_id — Function
t5_causal_bucketed_position_id(n_buckets::Int, max_distance::Int)
Same as t5_bucketed_position_id but only attends to the past. Should be used with CausalMask.
See also: scalar_relative_position_embedding, t5_bucketed_position_id
NeuralAttentionlib.weighted_sum_mixing — Function
weighted_sum_mixing(s, v)
The mixing function of a regular transformer layer. s is the attention score and v is the value of the QKV attention.
NeuralAttentionlib.with_rotary_position_embedding — Function
with_rotary_position_embedding([size,] x)
Apply rotary position embedding to x. Can take a size argument, in which case the rotary position embedding is only applied to x[1:size, :, ...]. Should be used with scaled_dot_product_score/dot_product_score.
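For example, a hedged sketch combining it with the transform form of scaled_dot_product_score documented above:
julia> q, k = randn(64, 10, 2), randn(64, 10, 2);
julia> s = scaled_dot_product_score(with_rotary_position_embedding, q, k);  # rotary applied to q and k before the dot-product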
NeuralAttentionlib.CausalGroupedQueryAttenOp — Type
struct CausalGroupedQueryAttenOp{F} <: AbstractAttenOp
    head::Int
    group::Int
    p::F
end
Structure for holding the parameters of grouped_query_attention.
(op::CausalGroupedQueryAttenOp)(q, k, v, mask = nothing)
Perform grouped query attention where mask is combined with a CausalMask.
NeuralAttentionlib.CausalGroupedQueryAttenOpWithScore — Type
Same as CausalGroupedQueryAttenOp but also returns the attention score.
NeuralAttentionlib.CausalMultiheadQKVAttenOp — Type
struct CausalMultiheadQKVAttenOp{F} <: AbstractAttenOp
    head::Int  # number of head
    p::F       # dropout probability
end
Structure for holding the parameters of multihead_qkv_attention.
(op::CausalMultiheadQKVAttenOp)(q, k, v, mask = nothing)
Perform multihead attention where mask is combined with a CausalMask.
NeuralAttentionlib.CausalMultiheadQKVAttenOpWithScore — Type
Same as CausalMultiheadQKVAttenOp but also returns the attention score.
NeuralAttentionlib.GroupedQueryAttenOp — Type
struct GroupedQueryAttenOp{F} <: AbstractAttenOp
    head::Int
    group::Int
    p::F
end
Structure for holding the parameters of grouped_query_attention.
(op::GroupedQueryAttenOp)(q, k, v, mask = nothing)
Perform grouped query attention.
NeuralAttentionlib.GroupedQueryAttenOpWithScore — Type
Same as GroupedQueryAttenOp but also returns the attention score.
NeuralAttentionlib.MultiheadQKVAttenOp — Type
struct MultiheadQKVAttenOp{F} <: AbstractAttenOp
    head::Int  # number of head
    p::F       # dropout probability
end
Structure for holding the parameters of multihead_qkv_attention.
(op::MultiheadQKVAttenOp)(q, k, v, mask = nothing)
Perform multihead attention.
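A hedged sketch of constructing and applying the op (the constructor arguments are assumed from the struct fields shown above; p = nothing is assumed to disable dropout):
julia> op = MultiheadQKVAttenOp(4, nothing);  # 4 heads, no dropout
julia> q = k = v = randn(64, 10, 2);
julia> y = op(q, k, v);  # or op(q, k, v, mask) with an attention mask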
NeuralAttentionlib.MultiheadQKVAttenOpWithScore — Type
Same as MultiheadQKVAttenOp but also returns the attention score.
NeuralAttentionlib.PrefixedFunction — Type
PrefixedFunction(f, args::NTuple{N}) <: Function
A type representing a partially-applied version of the function f, with the first N arguments fixed to the values args. In other words, PrefixedFunction(f, args) behaves similarly to (xs...) -> f(args..., xs...).
See also NeuralAttentionlib.:$.
NeuralAttentionlib.:$ — Method
f $ x
f $ x $ y $ ...
Partially apply one or more arguments to f. Return a PrefixedFunction.
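For example:
julia> add12 = (+) $ 1 $ 2;  # behaves like (xs...) -> +(1, 2, xs...)
julia> add12(4)
7
julia> add12 isa NeuralAttentionlib.PrefixedFunction
true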
Mask
NeuralAttentionlib.AbstractMaskOp — Type
AbstractMaskOp
Trait-like abstract type for holding operation-related arguments, defining how the mask should be applied to the input array.
NeuralAttentionlib.apply_mask — Method
apply_mask(op::GenericMaskOp, mask::AbstractMask, score)
Equivalent to op.apply(score, op.scale .* (op.flip ? .! mask : mask)).
Example
julia> x = randn(10, 10);
julia> m = CausalMask()
CausalMask()
julia> apply_mask(GenericMaskOp(.+, true, -1e9), m, x) ==  @. x + (!m * -1e9)
true
NeuralAttentionlib.apply_mask — Method
apply_mask(op::NaiveMaskOp, mask::AbstractMask, score)
Directly broadcast-multiply the mask onto the attention score, i.e. score .* mask.
NeuralAttentionlib.AbstractArrayMask — Type
AbstractArrayMask <: AbstractAttenMask
Abstract type for masks with array data.
NeuralAttentionlib.AbstractAttenMask — Type
AbstractAttenMask <: AbstractMask
Abstract type for mask data specifically for attention.
NeuralAttentionlib.AbstractDatalessMask — Type
AbstractDatalessMask <: AbstractAttenMask
Abstract type for masks without array data.
NeuralAttentionlib.AbstractMask — Type
AbstractMask
Abstract type for mask data.
NeuralAttentionlib.AbstractSeqMask — Type
AbstractSeqMask <: AbstractMask
Abstract type for mask data specifically for sequences.
NeuralAttentionlib.BandPartMask — Type
BandPartMask(l::Int, u::Int) <: AbstractAttenMask{DATALESS}
Attention mask that only allows the band-part values to pass.
Example
julia> trues(10, 10) .* BandPartMask(3, 5)
10×10 BitMatrix:
 1  1  1  1  1  1  0  0  0  0
 1  1  1  1  1  1  1  0  0  0
 1  1  1  1  1  1  1  1  0  0
 1  1  1  1  1  1  1  1  1  0
 0  1  1  1  1  1  1  1  1  1
 0  0  1  1  1  1  1  1  1  1
 0  0  0  1  1  1  1  1  1  1
 0  0  0  0  1  1  1  1  1  1
 0  0  0  0  0  1  1  1  1  1
 0  0  0  0  0  0  1  1  1  1
NeuralAttentionlib.BatchedMask — Type
BatchedMask(mask::AbstractMask) <: AbstractWrapperMask
Attention mask wrapper over an array mask for applying the same mask within the same batch.
Example
julia> m = SymLengthMask([2,3])
SymLengthMask{1, Vector{Int32}}(Int32[2, 3])
julia> trues(3,3, 2) .* m
3×3×2 BitArray{3}:
[:, :, 1] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 2] =
 1  1  1
 1  1  1
 1  1  1
julia> trues(3,3, 2, 2) .* m
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; mask require ndims(A) == 3")
Stacktrace:
[...]
julia> trues(3,3, 2, 2) .* BatchedMask(m) # 4-th dim become batch dim
3×3×2×2 BitArray{4}:
[:, :, 1, 1] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 2, 1] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 1, 2] =
 1  1  1
 1  1  1
 1  1  1
[:, :, 2, 2] =
 1  1  1
 1  1  1
 1  1  1
NeuralAttentionlib.BiLengthMask — Type
BiLengthMask(q_len::A, k_len::A) where {A <: AbstractArray{Int, N}} <: AbstractAttenMask{ARRAYDATA}
Attention mask specified by two arrays of integers that indicate the length dimension size.
Example
julia> bm = BiLengthMask([2,3], [3, 5])
BiLengthMask{1, Vector{Int32}}(Int32[2, 3], Int32[3, 5])
julia> trues(5,5, 2) .* bm
5×5×2 BitArray{3}:
[:, :, 1] =
 1  1  0  0  0
 1  1  0  0  0
 1  1  0  0  0
 0  0  0  0  0
 0  0  0  0  0
[:, :, 2] =
 1  1  1  0  0
 1  1  1  0  0
 1  1  1  0  0
 1  1  1  0  0
 1  1  1  0  0
See also: SymLengthMask, BiSeqMask, BatchedMask, RepeatMask
NeuralAttentionlib.BiSeqMask — Type
BiSeqMask(qmask::A1, kmask::A2) where {A1 <: AbstractSeqMask, A2 <: AbstractSeqMask} <: AbstractAttenMask
Take two sequence masks and construct an attention mask.
Example
julia> trues(7, 7, 2) .* Masks.BiSeqMask(Masks.LengthMask([3, 5]), Masks.RevLengthMask([3, 5]))
7×7×2 BitArray{3}:
[:, :, 1] =
 0  0  0  0  0  0  0
 0  0  0  0  0  0  0
 0  0  0  0  0  0  0
 0  0  0  0  0  0  0
 1  1  1  0  0  0  0
 1  1  1  0  0  0  0
 1  1  1  0  0  0  0
[:, :, 2] =
 0  0  0  0  0  0  0
 0  0  0  0  0  0  0
 1  1  1  1  1  0  0
 1  1  1  1  1  0  0
 1  1  1  1  1  0  0
 1  1  1  1  1  0  0
 1  1  1  1  1  0  0
See also: BiLengthMask, RevBiLengthMask
NeuralAttentionlib.CausalMask — Type
CausalMask() <: AbstractAttenMask{DATALESS}
Attention mask that blocks the future values.
Similar to applying LinearAlgebra.triu! on the score matrix.
Example
julia> trues(10, 10) .* CausalMask()
10×10 BitMatrix:
 1  1  1  1  1  1  1  1  1  1
 0  1  1  1  1  1  1  1  1  1
 0  0  1  1  1  1  1  1  1  1
 0  0  0  1  1  1  1  1  1  1
 0  0  0  0  1  1  1  1  1  1
 0  0  0  0  0  1  1  1  1  1
 0  0  0  0  0  0  1  1  1  1
 0  0  0  0  0  0  0  1  1  1
 0  0  0  0  0  0  0  0  1  1
 0  0  0  0  0  0  0  0  0  1
NeuralAttentionlib.GenericAttenMask — Type
GenericAttenMask <: AbstractAttenMask{ARRAYDATA}
Generic attention mask. Just a wrapper over an AbstractArray{Bool} for dispatch.
Example
julia> bitmask = rand(Bool, 10, 10)
10×10 Matrix{Bool}:
 1  0  1  1  0  0  1  0  1  1
 0  0  1  1  0  0  0  1  1  1
 0  1  0  1  0  1  0  0  1  0
 0  1  1  0  1  1  0  0  0  1
 1  0  1  1  1  0  0  0  0  1
 1  0  1  0  1  1  1  1  0  1
 0  0  0  1  1  1  0  1  1  1
 1  0  1  0  1  1  1  0  0  1
 0  1  0  1  0  0  1  1  0  1
 0  0  0  1  0  1  0  0  0  1
julia> trues(10, 10) .* GenericAttenMask(bitmask)
10×10 BitMatrix:
 1  0  1  1  0  0  1  0  1  1
 0  0  1  1  0  0  0  1  1  1
 0  1  0  1  0  1  0  0  1  0
 0  1  1  0  1  1  0  0  0  1
 1  0  1  1  1  0  0  0  0  1
 1  0  1  0  1  1  1  1  0  1
 0  0  0  1  1  1  0  1  1  1
 1  0  1  0  1  1  1  0  0  1
 0  1  0  1  0  0  1  1  0  1
 0  0  0  1  0  1  0  0  0  1
NeuralAttentionlib.GenericSeqMask — Type
GenericSeqMask(mask::AbstractArray{Bool}) <: AbstractSeqMask{ARRAYDATA}
Create a sequence mask from an array of Bool.
Example
julia> m = GenericSeqMask(rand(Bool, 10, 2))
GenericSeqMask{3, Array{Bool, 3}}([0 1 … 0 0;;; 1 0 … 1 0])
julia> trues(7, 10, 2) .* m
7×10×2 BitArray{3}:
[:, :, 1] =
 0  1  0  0  1  0  0  0  0  0
 0  1  0  0  1  0  0  0  0  0
 0  1  0  0  1  0  0  0  0  0
 0  1  0  0  1  0  0  0  0  0
 0  1  0  0  1  0  0  0  0  0
 0  1  0  0  1  0  0  0  0  0
 0  1  0  0  1  0  0  0  0  0
[:, :, 2] =
 1  0  1  1  0  1  1  1  1  0
 1  0  1  1  0  1  1  1  1  0
 1  0  1  1  0  1  1  1  1  0
 1  0  1  1  0  1  1  1  1  0
 1  0  1  1  0  1  1  1  1  0
 1  0  1  1  0  1  1  1  1  0
 1  0  1  1  0  1  1  1  1  0
julia> m.mask
1×10×2 Array{Bool, 3}:
[:, :, 1] =
 0  1  0  0  1  0  0  0  0  0
[:, :, 2] =
 1  0  1  1  0  1  1  1  1  0
NeuralAttentionlib.Indexer — Type
Indexer(m::AbstractMask, size::Dims{N}) <: AbstractArray{Bool, N}
Indexer(m::AbstractMask, size::Dims{N}, scale::T) <: AbstractArray{T, N}
A lazy array-like object that "materializes" the mask m with size and an optional scale, without size checks.
See also: GetIndexer
NeuralAttentionlib.LengthMask — Type
LengthMask(len::AbstractArray{Int, N}) <: AbstractSeqMask{ARRAYDATA}
A sequence mask specified by an array of integers that indicate the length dimension size. Can be converted to an attention mask (SymLengthMask, BiLengthMask) with AttenMask.
Example
julia> ones(7, 7, 2) .* LengthMask([3, 5])
7×7×2 Array{Float64, 3}:
[:, :, 1] =
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  0.0  0.0  0.0  0.0
[:, :, 2] =
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  0.0  0.0
NeuralAttentionlib.LocalMask — Type
LocalMask(width::Int) <: AbstractAttenMask{DATALESS}
Attention mask that only allows local (diagonal-like) values to pass.
width should be ≥ 0, and A .* LocalMask(1) is similar to Diagonal(A).
Example
julia> trues(10, 10) .* LocalMask(3)
10×10 BitMatrix:
 1  1  1  0  0  0  0  0  0  0
 1  1  1  1  0  0  0  0  0  0
 1  1  1  1  1  0  0  0  0  0
 0  1  1  1  1  1  0  0  0  0
 0  0  1  1  1  1  1  0  0  0
 0  0  0  1  1  1  1  1  0  0
 0  0  0  0  1  1  1  1  1  0
 0  0  0  0  0  1  1  1  1  1
 0  0  0  0  0  0  1  1  1  1
 0  0  0  0  0  0  0  1  1  1
NeuralAttentionlib.NoMask — Type
NoMask{T}() <: AbstractDatalessMask{T}
A no-op mask, provided only to satisfy the type constraints of wrapper masks. In general, use nothing instead of NoMask with apply_mask/mask_score to take the fast path.
NeuralAttentionlib.RandomMask — Type
RandomMask(p::Float32) <: AbstractAttenMask{DATALESS}
Attention mask that blocks values randomly.
p specifies the fraction of values to block, e.g. A .* RandomMask(0) is equivalent to identity(A) and A .* RandomMask(1) is equivalent to zero(A).
Example
julia> trues(10, 10) .* RandomMask(0.5)
10×10 BitMatrix:
 1  1  1  1  1  1  0  1  1  1
 0  0  1  0  1  0  0  0  1  0
 0  0  1  1  0  0  0  0  1  1
 1  0  1  1  1  0  0  1  0  1
 1  1  0  1  0  0  1  0  1  1
 0  1  1  1  1  0  1  0  1  1
 1  1  0  0  0  0  1  0  0  0
 0  0  1  0  1  1  0  1  1  0
 1  1  1  1  1  1  0  0  1  1
 0  0  1  0  1  1  0  0  1  0
julia> trues(10, 10) .* RandomMask(0.5)
10×10 BitMatrix:
 1  0  1  1  0  0  1  1  0  1
 0  1  0  1  1  1  0  0  1  1
 0  0  1  0  0  0  1  1  0  0
 0  0  0  0  1  0  0  1  1  1
 0  1  1  1  1  0  1  0  0  1
 1  0  0  1  1  0  0  0  1  1
 1  1  1  0  1  1  1  0  0  0
 0  0  1  1  0  0  1  1  1  0
 0  1  1  1  1  0  1  0  1  0
 0  0  1  0  0  0  0  1  1  1
NeuralAttentionlib.RepeatMask — Type
RepeatMask(mask::AbstractMask, num::Int) <: AbstractWrapperMask
Attention mask wrapper over an array mask for doing an inner repeat on the last dimension.
Example
julia> m = SymLengthMask([2,3])
SymLengthMask{1, Vector{Int32}}(Int32[2, 3])
julia> trues(3,3, 2) .* m
3×3×2 BitArray{3}:
[:, :, 1] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 2] =
 1  1  1
 1  1  1
 1  1  1
julia> trues(3,3, 4) .* m
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; mask require 3-th dimension to be 2, but get 4")
Stacktrace:
[...]
julia> trues(3,3, 4) .* RepeatMask(m, 2)
3×3×4 BitArray{3}:
[:, :, 1] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 2] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 3] =
 1  1  1
 1  1  1
 1  1  1
[:, :, 4] =
 1  1  1
 1  1  1
 1  1  1
NeuralAttentionlib.RevBiLengthMask — Type
RevBiLengthMask(q_len::A, k_len::A) where {A <: AbstractArray{Int, N}} <: AbstractAttenMask{ARRAYDATA}
BiLengthMask but counting from the end of the array; used for left padding.
Example
julia> bm = RevBiLengthMask([2,3], [3, 5])
RevBiLengthMask{1, Vector{Int32}}(Int32[2, 3], Int32[3, 5])
julia> trues(5,5, 2) .* bm
5×5×2 BitArray{3}:
[:, :, 1] =
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  1  1
 0  0  0  1  1
 0  0  0  1  1
[:, :, 2] =
 0  0  1  1  1
 0  0  1  1  1
 0  0  1  1  1
 0  0  1  1  1
 0  0  1  1  1
See also: RevLengthMask, RevSymLengthMask, BiSeqMask, BatchedMask, RepeatMask
NeuralAttentionlib.RevLengthMask — Type
RevLengthMask(len::AbstractArray{Int, N}) <: AbstractSeqMask{ARRAYDATA}
LengthMask but counting from the end of the array; used for left padding. Can be converted to an attention mask (RevSymLengthMask, RevBiLengthMask) with AttenMask.
Example
julia> ones(7, 7, 2) .* RevLengthMask([3, 5])
7×7×2 Array{Float64, 3}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
 0.0  0.0  0.0  0.0  1.0  1.0  1.0
[:, :, 2] =
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
 0.0  0.0  1.0  1.0  1.0  1.0  1.0
NeuralAttentionlib.RevSymLengthMask — Type
RevSymLengthMask(len::AbstractArray{Int, N}) <: AbstractAttenMask{ARRAYDATA}
SymLengthMask but counting from the end of the array; used for left padding.
Example
julia> m = RevSymLengthMask([2,3])
RevSymLengthMask{1, Vector{Int32}}(Int32[2, 3])
julia> trues(3,3, 2) .* m
3×3×2 BitArray{3}:
[:, :, 1] =
 0  0  0
 0  1  1
 0  1  1
[:, :, 2] =
 1  1  1
 1  1  1
 1  1  1
See also: BiLengthMask, BatchedMask, RepeatMask
NeuralAttentionlib.SymLengthMask — Type
SymLengthMask(len::AbstractArray{Int, N}) <: AbstractAttenMask{ARRAYDATA}
Attention mask specified by an array of integers that indicate the length dimension size, assuming the query length and key length are the same.
Example
julia> m = SymLengthMask([2,3])
SymLengthMask{1, Vector{Int32}}(Int32[2, 3])
julia> trues(3,3, 2) .* m
3×3×2 BitArray{3}:
[:, :, 1] =
 1  1  0
 1  1  0
 0  0  0
[:, :, 2] =
 1  1  1
 1  1  1
 1  1  1
See also: LengthMask, BiLengthMask, BatchedMask, RepeatMask
Base.:! — Method
!m::AbstractMask
Boolean not of an attention mask.
Base.:& — Method
m1::AbstractMask & m2::AbstractMask
Logical and of two attention masks.
Base.:| — Method
m1::AbstractMask | m2::AbstractMask
Logical or of two attention masks.
NeuralAttentionlib.AttenMask — Function
AttenMask(m::AbstractMask)
Convert a mask into the corresponding attention mask.
AttenMask(q_mask::AbstractSeqMask, k_mask::AbstractSeqMask)
Create an attention mask from two sequence masks, specifying the sequence mask for the "query" and the "key".
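For example, per the LengthMask docstring above, converting a sequence mask should agree with the corresponding attention mask (a hedged sketch):
julia> trues(3, 3, 2) .* AttenMask(LengthMask([2, 3])) == trues(3, 3, 2) .* SymLengthMask([2, 3])
true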
NeuralAttentionlib.GetIndexer — Function
GetIndexer(m::AbstractMask, destsize::Dims{N})
Return the Indexer of m and check whether the mask m can be applied to an array of size destsize.
NeuralAttentionlib.getmask — Function
getmask(m::AbstractMask, score, scale = 1)
Convert m into a mask array of AbstractArray for score with scale.
Example
julia> getmask(CausalMask(), randn(7,7), 2)
7×7 Matrix{Float64}:
 2.0  2.0  2.0  2.0  2.0  2.0  2.0
 0.0  2.0  2.0  2.0  2.0  2.0  2.0
 0.0  0.0  2.0  2.0  2.0  2.0  2.0
 0.0  0.0  0.0  2.0  2.0  2.0  2.0
 0.0  0.0  0.0  0.0  2.0  2.0  2.0
 0.0  0.0  0.0  0.0  0.0  2.0  2.0
 0.0  0.0  0.0  0.0  0.0  0.0  2.0
NeuralAttentionlib.lengths — Function
lengths(::AbstractSeqMask)
Get the number of trues of each batch in the sequence mask.
Matmul
NeuralAttentionlib.CollapsedDimsArray — Type
CollapsedDimsArray{T}(array, ni::Integer, nj::Integer) <: AbstractArray{T, 3}
Similar to a lazy reshape of array with collapsed_size.
NeuralAttentionlib.collapsed_size — Function
collapsed_size(x, ni, nj [, n])::Dim{3}
Collapse the dimensionality of x into 3 according to ni and nj, where ni and nj specify how many of the trailing dimensions are taken by the second and third collapsed dimensions, respectively.
(X1, X2, ..., Xk, Xk+1, Xk+2, ..., Xk+ni, Xk+ni+1, ..., Xn)
 |______dim1___|  |_________ni_________|  |______nj______|
Example
julia> x = randn(7,6,5,4,3,2);
julia> collapsed_size(x, 2, 2, 1)
42
julia> collapsed_size(x, 2, 2, 2)
20
julia> collapsed_size(x, 2, 2, 3)
6
julia> collapsed_size(x, 2, 2)
(42, 20, 6)
See also: noncollapsed_size
NeuralAttentionlib.collapseddims — Method
collapseddims(x::AbstractArray, xi, xj)
Reshape x into a 3-dim array, equivalent to reshape(x, collapsed_size(x, xi, xj)).
See also: collapsed_size
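For example (reusing the array from the collapsed_size example above):
julia> x = randn(7,6,5,4,3,2);
julia> size(collapseddims(x, 2, 2))
(42, 20, 6)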
NeuralAttentionlib.collapseddims — Method
collapseddims(ca::CollapsedDimsArray)
Remove the wrapper and actually reshape it.
See also: CollapsedDimsArray, unwrap_collapse
NeuralAttentionlib.matmul — Function
matmul(a::AbstractArray, b::AbstractArray, s::Number = 1)
Equivalent to s .* (a * b) if a and b are Vector or Matrix. For arrays with higher dimensions, it converts a and b to CollapsedDimsArray, performs batched matrix multiplication, and returns the result as a CollapsedDimsArray. This is useful for preserving the dimensionality. If the batch dimensions of a and b have different shapes, the shape of b is used for the batch dimension. Works with NNlib.batched_transpose and NNlib.batched_adjoint.
Example
# b-dim shape: (6,)
julia> a = CollapsedDimsArray(randn(3,4,2,3,6), 2, 1); size(a)
(12, 6, 6)
# b-dim shape: (3,1,2)
julia> b = CollapsedDimsArray(randn(6,2,3,1,2), 1, 3); size(b)
(6, 2, 6)
julia> c = matmul(a, b); size(c), typeof(c)
((12, 2, 6), CollapsedDimsArray{Float64, Array{Float64, 6}, Static.StaticInt{1}, Static.StaticInt{3}})
# b-dim shape: (3,1,2)
julia> d = unwrap_collapse(c); size(d), typeof(d)
((3, 4, 2, 3, 1, 2), Array{Float64, 6})
# equivalent to `batched_mul` but preserves the shape
julia> NNlib.batched_mul(collapseddims(a), collapseddims(b)) == collapseddims(matmul(a, b))
true
See also: CollapsedDimsArray, unwrap_collapse, collapseddims
NeuralAttentionlib.noncollapsed_size — Function
noncollapsed_size(x, ni, nj [, n])
Collapse the dimensionality of x into 3 according to ni and nj.
(X1, X2, ..., Xk, Xk+1, Xk+2, ..., Xk+ni, Xk+ni+1, ..., Xn)
 |______dim1___|  |_________ni_________|  |______nj______|
But take the size before the collapse, e.g. noncollapsed_size(x, ni, nj, 2) will be (Xk+1, Xk+2, ..., Xk+ni).
Example
julia> x = randn(7,6,5,4,3,2);
julia> noncollapsed_size(x, 2, 2, 1)
(7, 6)
julia> noncollapsed_size(x, 2, 2, 2)
(5, 4)
julia> noncollapsed_size(x, 2, 2, 3)
(3, 2)
julia> noncollapsed_size(x, 2, 2)
((7, 6), (5, 4), (3, 2))
See also: collapsed_size
NeuralAttentionlib.scaled_matmul — Function
scaled_matmul(a::AbstractArray, b::AbstractArray, s::Number = 1)
Basically equivalent to unwrap_collapse(matmul(a, b, s)), but not differentiable w.r.t. s.
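A hedged sketch with 3-dim inputs (assuming 3-dim arrays collapse to themselves, so this should reduce to a scaled batched_mul):
julia> a, b = randn(3, 4, 2), randn(4, 5, 2);
julia> scaled_matmul(a, b, 0.5) ≈ 0.5 .* NNlib.batched_mul(a, b)
true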
NeuralAttentionlib.unwrap_collapse — Function
unwrap_collapse(ca::CollapsedDimsArray)
Return the underlying array of a CollapsedDimsArray; otherwise just return the input.