Transformers.TextEncoders

Text processing module.

API Reference

Transformers.TextEncoders.TransformerTextEncoder - Type
struct TransformerTextEncoder{
    T<:AbstractTokenizer, V<:AbstractVocabulary{String}, P
} <: AbstractTransformerTextEncoder
    tokenizer::T
    vocab::V
    process::P
    startsym::String
    endsym::String
    padsym::String
    trunc::Union{Nothing, Int}
end

The text encoder for general transformers. It combines a tokenizer, a vocabulary, and a processing function, configured with a start symbol, an end symbol, a padding symbol, and a maximum length.

TransformerTextEncoder(tokenize, vocab, process; trunc = nothing,
                       startsym = "<s>", endsym = "</s>", unksym = "<unk>", padsym = "<pad>")

tokenize can be any tokenization function from WordTokenizers. vocab is either a list of words or a Vocab. process can be omitted, in which case a predefined processing pipeline is used. When vocab is a list, the special symbols (e.g. padsym) are added to the word list.
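
For instance, a minimal sketch of building an encoder from a plain word list with the processing function omitted (the word list, tokenizer choice, and trunc value below are illustrative, not part of the package):

julia> using Transformers.TextEncoders, WordTokenizers

julia> words = ["hello", "world"];  # hypothetical word list; the special symbols are appended to it automatically

julia> enc = TransformerTextEncoder(nltk_word_tokenize, words; trunc = 16);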

TransformerTextEncoder(f, e::TransformerTextEncoder)

Take a text encoder and create a new text encoder with the same configuration except for the processing function. f is a function that takes the encoder and returns a new processing function. This is useful for changing part of the processing function.

Example

julia> textenc = TransformerTextEncoder(labels; startsym, endsym, unksym,
                                        padsym = unksym, trunc = 100)
TransformerTextEncoder(
├─ TextTokenizer(default),
├─ vocab = Vocab{String, SizedArray}(size = 37678, unk = </unk>, unki = 1),
├─ startsym = <s>,
├─ endsym = </s>,
├─ padsym = </unk>,
├─ trunc = 100,
└─ process = Pipelines:
  ╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
  ╰─ target[token] := TextEncodeBase.with_head_tail(<s>, </s>)(target.token)
  ╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(100))(target.token)
  ╰─ target[token] := TextEncodeBase.trunc_and_pad(100, </unk>, tail, tail)(target.token)
  ╰─ target[token] := TextEncodeBase.nested2batch(target.token)
  ╰─ target := (target.token, target.attention_mask)
)

julia> TransformerTextEncoder(ans) do enc
           enc.process[1] |> TextEncoders.Pipelines(enc.process[4:5]) |> TextEncoders.PipeGet{(:token,)}()
       end
TransformerTextEncoder(
├─ TextTokenizer(default),
├─ vocab = Vocab{String, SizedArray}(size = 37678, unk = </unk>, unki = 1),
├─ startsym = <s>,
├─ endsym = </s>,
├─ padsym = </unk>,
├─ trunc = 100,
└─ process = Pipelines:
  ╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
  ╰─ target[token] := TextEncodeBase.trunc_and_pad(100, </unk>, tail, tail)(target.token)
  ╰─ target[token] := TextEncodeBase.nested2batch(target.token)
  ╰─ target := (target.token)
)
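
Once constructed, the encoder is used with encode and decode (documented below). A minimal usage sketch, reusing the textenc from the example above with a hypothetical input sentence:

julia> sample = encode(textenc, "an example sentence");

julia> decode(textenc, sample.token);  # one-hot tokens back to strings, including <s>, </s> and padding
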
TextEncodeBase.decode - Method
decode(e::AbstractTransformerTextEncoder, x::Union{
    Integer,
    OneHotArray,
    AbstractArray{<:Integer}
})

Decode the one-hot encoding or indices into String (or Array{String}) using the bound vocabulary.

decode(e::AbstractTransformerTextEncoder, x::AbstractArray)

Perform argmax(x; dims = 1) and then decode the resulting indices. x should be collected beforehand if it is on the GPU.
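
A minimal sketch, where textenc stands for any text encoder and logits for a (vocab_size, sequence_length, batch) array of model scores (both names are illustrative):

julia> words = decode(textenc, [2, 5, 7]);        # hypothetical indices -> Vector{String}

julia> preds = decode(textenc, collect(logits));  # argmax over the first dimension, then vocabulary lookup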

TextEncodeBase.decode - Method
decode(bertenc::BertTextEncoder, x)

Convert indices back to strings with the BERT vocabulary.

See also: encode

Example

julia> token = encode(bertenc, [["this is a sentence", "and another"]]).token;

julia> decode(bertenc, token)
9×1 Matrix{String}:
 "[CLS]"
 "this"
 "is"
 "a"
 "sentence"
 "[SEP]"
 "and"
 "another"
 "[SEP]"
TextEncodeBase.decode - Method
decode(gpt2enc::GPT2TextEncoder, x)

Convert indices back to strings with the GPT-2 vocabulary. This also maps the bytes back to their normal code ranges, so the resulting strings are not exactly the entries stored in the vocabulary.

See also: encode

Example

julia> token = encode(gpt2enc, [["this is a sentence", "and another"]]).token;

julia> decode(gpt2enc, token)
6×1 Matrix{String}:
 "this"
 " is"
 " a"
 " sentence"
 "and"
 " another"

julia> TextEncodeBase.decode_indices(gpt2enc, token)
6×1 Matrix{String}:
 "this"
 "Ġis"
 "Ġa"
 "Ġsentence"
 "and"
 "Ġanother"
TextEncodeBase.decode - Method
decode(t5enc::T5TextEncoder, x)

Convert indices back to strings with the T5 vocabulary.

See also: encode

Example

julia> token = encode(t5enc, [["this is a sentence", "and another"]]).token;

julia> decode(t5enc, token)
9×1 Matrix{String}:
 "▁this"
 "▁is"
 "▁"
 "a"
 "▁sentence"
 "</s>"
 "▁and"
 "▁another"
 "</s>"
TextEncodeBase.encode - Method
encode(e::AbstractTransformerTextEncoder, input::Union{
    String,                         # single sentence
    Vector{String},                 # batch of sentences
    Vector{Vector{String}},         # batch of multi-segment sentences
    Vector{Vector{Vector{String}}}  # batch of multi-sample multi-segment sentences
})

Tokenize the input and apply the processing function to the tokenized result. The input can be either a single String (one sample) or a nested vector of String up to depth 3 (a batch of samples). How batched input is transformed is defined by the bound processing function. The result of the processing function (its first element, if it returns a tuple) is converted into a one-hot encoding with the bound vocabulary.

encode(e::AbstractTransformerTextEncoder, src, trg)

Apply encode to src and trg and build the cross-attention mask. This is a convenience function for encoder-decoder tasks. It returns a @NamedTuple{encoder_input, decoder_input} where encoder_input is encode(e, src) and decoder_input is encode(e, trg) plus the cross-attention mask.
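
A minimal sketch of the encoder-decoder form, where textenc stands for any text encoder and the source/target sentences are hypothetical:

julia> src, trg = ["an example source"], ["an example target"];

julia> data = encode(textenc, src, trg);

julia> keys(data)
(:encoder_input, :decoder_input)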

TextEncodeBase.encode - Method
encode(::BertTextEncoder, ::String)

Encode a single sentence with the BERT text encoder. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 1}, segment::Vector{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.

encode(::BertTextEncoder, ::Vector{String})

Encode a batch of sentences with the BERT text encoder. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 2}, segment::Matrix{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.

encode(::BertTextEncoder, ::Vector{Vector{String}})

Encode a batch of segments with the BERT text encoder. The segments are concatenated into a batch of sentences, with separation tokens inserted and the corresponding indicators set in segment. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 2}, segment::Matrix{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.

encode(::BertTextEncoder, ::Vector{Vector{Vector{String}}})

Encode a batch of multi-sample segments with the BERT text encoder. Each datum must contain the same number of samples (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns a @NamedTuple{token::OneHotArray{K, 3}, segment::Array{Int, 3}, attention_mask::LengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before being fed to a transformer layer, or the attention must be set up so that it does not treat the second-to-last (end-1) dimension as another length dimension.

See also: decode, LengthMask

Example

julia> bertenc = HuggingFace.load_tokenizer("bert-base-cased")
BertTextEncoder(
├─ TextTokenizer(MatchTokenization(WordPieceTokenization(bert_cased_tokenizer, WordPiece(vocab_size = 28996, unk = [UNK], max_char = 100)), 5 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 28996, unk = [UNK], unki = 101),
├─ startsym = [CLS],
├─ endsym = [SEP],
├─ padsym = [PAD],
├─ trunc = 512,
└─ process = Pipelines:
  ╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
  ╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
  ╰─ target[(token, segment)] := SequenceTemplate{String}([CLS]:<type=1> Input[1]:<type=1> [SEP]:<type=1> (Input[2]:<type=2> [SEP]:<type=2>)...)(target.token)
  ╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(512))(target.token)
  ╰─ target[token] := TextEncodeBase.trunc_and_pad(512, [PAD], tail, tail)(target.token)
  ╰─ target[token] := TextEncodeBase.nested2batch(target.token)
  ╰─ target[segment] := TextEncodeBase.trunc_and_pad(512, 1, tail, tail)(target.segment)
  ╰─ target[segment] := TextEncodeBase.nested2batch(target.segment)
  ╰─ target := (target.token, target.segment, target.attention_mask)
)

julia> e = encode(bertenc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0;;;], segment = [1; 1; … ; 2; 2;;], attention_mask = NeuralAttentionlib.LengthMask{1, Vector{Int32}}(Int32[9]))

julia> typeof(e)
NamedTuple{(:token, :segment, :attention_mask), Tuple{OneHotArray{0x00007144, 2, 3, Matrix{OneHot{0x00007144}}}, Matrix{Int64}, NeuralAttentionlib.LengthMask{1, Vector{Int32}}}}
TextEncodeBase.encode - Method
encode(::GPT2TextEncoder, ::String)

Encode a single sentence with the GPT-2 text encoder. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 1}, attention_mask::RevLengthMask{1, Vector{Int32}}}.

encode(::GPT2TextEncoder, ::Vector{String})

Encode a batch of sentences with the GPT-2 text encoder. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 2}, attention_mask::RevLengthMask{1, Vector{Int32}}}.

encode(::GPT2TextEncoder, ::Vector{Vector{String}})

Encode a batch of segments with the GPT-2 text encoder. The segments are concatenated into a batch of sentences. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 2}, attention_mask::RevLengthMask{1, Vector{Int32}}}.

encode(::GPT2TextEncoder, ::Vector{Vector{Vector{String}}})

Encode a batch of multi-sample segments with the GPT-2 text encoder. Each datum must contain the same number of samples (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns a @NamedTuple{token::OneHotArray{K, 3}, attention_mask::RevLengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before being fed to a transformer layer, or the attention must be set up so that it does not treat the second-to-last (end-1) dimension as another length dimension.

See also: decode, RevLengthMask

Example

julia> gpt2enc = HuggingFace.load_tokenizer("gpt2")
GPT2TextEncoder(
├─ TextTokenizer(MatchTokenization(CodeNormalizer(BPETokenization(GPT2Tokenization, bpe = CachedBPE(BPE(50000 merges))), codemap = CodeMap{UInt8 => UInt16}(3 code-ranges)), 1 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 50257, unk = <unk>, unki = 0),
├─ codemap = CodeMap{UInt8 => UInt16}(3 code-ranges),
├─ startsym = <|endoftext|>,
├─ endsym = <|endoftext|>,
├─ padsym = <|endoftext|>,
├─ trunc = 1024,
└─ process = Pipelines:
  ╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
  ╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
  ╰─ target[token] := SequenceTemplate{String}((Input:<type=1>)...)(Val{1}(), target.token)
  ╰─ target[attention_mask] := (NeuralAttentionlib.RevLengthMask ∘ Transformers.TextEncoders.getlengths(1024))(target.token)
  ╰─ target[token] := TextEncodeBase.trunc_and_pad(1024, <|endoftext|>, head, head)(target.token)
  ╰─ target[token] := TextEncodeBase.nested2batch(target.token)
  ╰─ target := (target.token, target.attention_mask)
)

julia> e = encode(gpt2enc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0;;;], attention_mask = NeuralAttentionlib.RevLengthMask{1, Vector{Int32}}(Int32[6]))

julia> typeof(e)
NamedTuple{(:token, :attention_mask), Tuple{OneHotArray{0x0000c451, 2, 3, Matrix{OneHot{0x0000c451}}}, NeuralAttentionlib.RevLengthMask{1, Vector{Int32}}}}
TextEncodeBase.encode - Method
encode(::T5TextEncoder, ::String)

Encode a single sentence with the T5 text encoder. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 1}, attention_mask::LengthMask{1, Vector{Int32}}}.

encode(::T5TextEncoder, ::Vector{String})

Encode a batch of sentences with the T5 text encoder. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 2}, attention_mask::LengthMask{1, Vector{Int32}}}.

encode(::T5TextEncoder, ::Vector{Vector{String}})

Encode a batch of segments with the T5 text encoder. The segments are concatenated into a batch of sentences with a separation token. The default pipeline returns a @NamedTuple{token::OneHotArray{K, 2}, attention_mask::LengthMask{1, Vector{Int32}}}.

encode(::T5TextEncoder, ::Vector{Vector{Vector{String}}})

Encode a batch of multi-sample segments with the T5 text encoder. Each datum must contain the same number of samples (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns a @NamedTuple{token::OneHotArray{K, 3}, attention_mask::LengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before being fed to a transformer layer, or the attention must be set up so that it does not treat the second-to-last (end-1) dimension as another length dimension.

See also: decode, LengthMask

Example

julia> t5enc = HuggingFace.load_tokenizer("t5")
T5TextEncoder(
├─ TextTokenizer(MatchTokenization(PrecompiledNormalizer(WordReplaceNormalizer(UnigramTokenization(EachSplitTokenization(splitter = isspace), unigram = Unigram(vocab_size = 32100, unk = <unk>)), pattern = r"^(?!▁)(.*)$" => s"▁"), precompiled = PrecompiledNorm(...)), 103 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 32100, unk = <unk>, unki = 3),
├─ endsym = </s>,
├─ padsym = <pad>,
└─ process = Pipelines:
  ╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
  ╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
  ╰─ target[(token, segment)] := SequenceTemplate{String}(Input[1]:<type=1> </s>:<type=1> (Input[2]:<type=1> </s>:<type=1>)...)(target.token)
  ╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(nothing))(target.token)
  ╰─ target[token] := TextEncodeBase.trunc_and_pad(nothing, <pad>, tail, tail)(target.token)
  ╰─ target[token] := TextEncodeBase.nested2batch(target.token)
  ╰─ target := (target.token, target.attention_mask)
)

julia> e = encode(t5enc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 1; … ; 0 0 … 0 0; 0 0 … 0 0;;;], attention_mask = NeuralAttentionlib.LengthMask{1, Vector{Int32}}(Int32[9]))

julia> typeof(e)
NamedTuple{(:token, :attention_mask), Tuple{OneHotArray{0x00007d64, 2, 3, Matrix{OneHot{0x00007d64}}}, NeuralAttentionlib.LengthMask{1, Vector{Int32}}}}