Transformers.TextEncoders
Text processing module.
API Reference
Transformers.TextEncoders.BertTextEncoder
— Type
BertTextEncoder
The text encoder for the BERT model (WordPiece tokenization).
Transformers.TextEncoders.GPT2TextEncoder
— Type
GPT2TextEncoder
The text encoder for the GPT2 model (byte-level BytePairEncoding tokenization).
Transformers.TextEncoders.T5TextEncoder
— Type
T5TextEncoder
The text encoder for the T5 model (SentencePiece tokenization).
Transformers.TextEncoders.TransformerTextEncoder
— Type
struct TransformerTextEncoder{
T<:AbstractTokenizer, V<:AbstractVocabulary{String}, P
} <: AbstractTransformerTextEncoder
tokenizer::T
vocab::V
process::P
startsym::String
endsym::String
padsym::String
trunc::Union{Nothing, Int}
end
The text encoder for general transformers. It takes a tokenizer, a vocabulary, and a processing function, and is configured with a start symbol, an end symbol, a padding symbol, and a maximum length.
TransformerTextEncoder(tokenize, vocab, process; trunc = nothing,
    startsym = "<s>", endsym = "</s>", unksym = "<unk>", padsym = "<pad>")
tokenize can be any tokenization function from WordTokenizers. vocab is either a list of words or a Vocab. process can be omitted, in which case a predefined processing pipeline is used. When vocab is a list, the special symbols (e.g. padsym) are added to the word list.
TransformerTextEncoder(f, e::TransformerTextEncoder)
Take a text encoder and create a new text encoder with the same configuration except for the processing function. f is a function that takes the encoder and returns a new processing function. This is useful for replacing part of the processing pipeline, as in the sketch and example below.
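Before the full example below, a minimal hedged sketch of the positional form from the signature above (the word list and the choice of nltk_word_tokenize from WordTokenizers are hypothetical; process is omitted, so the predefined pipeline is used):

using WordTokenizers
using Transformers.TextEncoders

words = ["this", "is", "a", "sentence"]  # hypothetical word list; the special symbols are added automatically
enc = TransformerTextEncoder(nltk_word_tokenize, words; trunc = 8)
encode(enc, "this is a sentence")        # NamedTuple with the one-hot token array and the attention mask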
Example
julia> textenc = TransformerTextEncoder(labels; startsym, endsym, unksym,
padsym = unksym, trunc = 100)
TransformerTextEncoder(
├─ TextTokenizer(default),
├─ vocab = Vocab{String, SizedArray}(size = 37678, unk = </unk>, unki = 1),
├─ startsym = <s>,
├─ endsym = </s>,
├─ padsym = </unk>,
├─ trunc = 100,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := TextEncodeBase.with_head_tail(<s>, </s>)(target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(10))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(10, <pad>, tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token, target.attention_mask)
)
julia> TransformerTextEncoder(ans) do enc
enc.process[1] |> TextEncoders.Pipelines(enc.process[4:5]) |> TextEncoders.PipeGet{(:token,)}()
end
TransformerTextEncoder(
├─ TextTokenizer(default),
├─ vocab = Vocab{String, SizedArray}(size = 37678, unk = </unk>, unki = 1),
├─ startsym = <s>,
├─ endsym = </s>,
├─ padsym = </unk>,
├─ trunc = 100,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := TextEncodeBase.trunc_and_pad(10, <pad>, tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token)
)
TextEncodeBase.decode
— Method
decode(e::AbstractTransformerTextEncoder, x::Union{
Integer,
OneHotArray,
AbstractArray{<:Integer}
})
Decode the one-hot encoding or indices into String (or Array{String}) from the bound vocabulary.
decode(e::AbstractTransformerTextEncoder, x::AbstractArray)
Perform argmax(x; dims = 1) and then decode. x should be collected beforehand if it is on the GPU.
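As a hedged illustration (textenc and the random score matrix are hypothetical stand-ins for a bound encoder and a model's output scores):

using Transformers.TextEncoders

scores = rand(Float32, length(textenc.vocab), 5)  # hypothetical scores: vocabulary size × sequence length
decode(textenc, scores)                           # argmax over dims = 1, then vocabulary lookup
decode(textenc, [2, 7, 12])                       # plain indices are looked up directly

If the scores live on the GPU, collect them first, e.g. decode(textenc, collect(scores)).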
TextEncodeBase.decode
— Method
decode(bertenc::BertTextEncoder, x)
Convert indices back to strings with the BERT vocabulary.
See also: encode
Example
julia> token = encode(bertenc, [["this is a sentence", "and another"]]).token;
julia> decode(bertenc, token)
9×1 Matrix{String}:
"[CLS]"
"this"
"is"
"a"
"sentence"
"[SEP]"
"and"
"another"
"[SEP]"
TextEncodeBase.decode
— Method
decode(gpt2enc::GPT2TextEncoder, x)
Convert indices back to strings with the GPT2 vocabulary. This also maps the bytes back to the normal code ranges, so the resulting strings are not exactly the entries stored in the vocabulary.
See also: encode
Example
julia> token = encode(gpt2enc, [["this is a sentence", "and another"]]).token;
julia> decode(gpt2enc, token)
6×1 Matrix{String}:
"this"
" is"
" a"
" sentence"
"and"
" another"
julia> TextEncodeBase.decode_indices(gpt2enc, token)
6×1 Matrix{String}:
"this"
"Ġis"
"Ġa"
"Ġsentence"
"and"
"Ġanother"
TextEncodeBase.decode
— Method
decode(t5enc::T5TextEncoder, x)
Convert indices back to strings with the T5 vocabulary.
See also: encode
Example
julia> token = encode(t5enc, [["this is a sentence", "and another"]]).token;
julia> decode(t5enc, token)
9×1 Matrix{String}:
"▁this"
"▁is"
"▁"
"a"
"▁sentence"
"</s>"
"▁and"
"▁another"
"</s>"
TextEncodeBase.encode
— Method
encode(e::AbstractTransformerTextEncoder, input::Union{
String, # single sentence
Vector{String}, # batch of sentences
Vector{Vector{String}}, # batch of multi-segment sentences
Vector{Vector{Vector{String}}} # batch of multi-sample multi-segment sentences
})
Tokenize the input and apply the processing function to the tokenized result. The input can be either a single String (one sample) or a nested vector of String up to depth 3 (a batch of samples). How batch input is transformed is defined by the bound processing function. The result of the processing function (its first element, if it returns a tuple) is converted into a one-hot encoding with the bound vocabulary.
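The accepted input depths, as a brief hedged sketch (textenc and the sentences are hypothetical):

encode(textenc, "a single sentence")                             # one sample
encode(textenc, ["sentence one", "sentence two"])                # batch of sentences
encode(textenc, [["segment one", "segment two"], ["another"]])   # batch of multi-segment samples
encode(textenc, [[["seg 1", "seg 2"], ["seg 1"]]])               # batch of multi-sample multi-segment samples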
encode(e::AbstractTransformerTextEncoder, src, trg)
Apply encode to src and trg and build the cross-attention mask. This is a convenience function for encoder-decoder tasks. Return a @NamedTuple{encoder_input, decoder_input} where encoder_input is encode(e, src) and decoder_input is encode(e, trg) plus the cross-attention mask.
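A hedged sketch of the encoder-decoder form (textenc and the sentences are hypothetical; the exact name of the mask field inside decoder_input is not asserted here and can be inspected with keys):

src = ["a source sentence", "another source sentence"]
trg = ["a target sentence", "another target sentence"]
nt  = encode(textenc, src, trg)

nt.encoder_input.token    # same as encode(textenc, src).token
nt.decoder_input.token    # same as encode(textenc, trg).token
keys(nt.decoder_input)    # the extra entry is the cross-attention mask

nt.encoder_input is then passed to the encoder and nt.decoder_input to the decoder of a sequence-to-sequence model.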
TextEncodeBase.encode
— Method
encode(::BertTextEncoder, ::String)
Encode a single sentence with the BERT text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 1}, segment::Vector{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::BertTextEncoder, ::Vector{String})
Encode a batch of sentences with the BERT text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, segment::Matrix{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::BertTextEncoder, ::Vector{Vector{String}})
Encode a batch of segments with the BERT text encoder. The segments are concatenated together as a batch of sentences, with separation tokens inserted and the correct segment indicators in segment. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, segment::Matrix{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::BertTextEncoder, ::Vector{Vector{Vector{String}}})
Encode a batch of multi-sample segments with the BERT text encoder. The number of samples per datum needs to be the same (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns @NamedTuple{token::OneHotArray{K, 3}, segment::Array{Int, 3}, attention_mask::LengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before feeding it to a transformer layer, or make sure the attention does not treat the end-1 dimension as another length dimension.
See also: decode, LengthMask
Example
julia> bertenc = HuggingFace.load_tokenizer("bert-base-cased")
BertTextEncoder(
├─ TextTokenizer(MatchTokenization(WordPieceTokenization(bert_cased_tokenizer, WordPiece(vocab_size = 28996, unk = [UNK], max_char = 100)), 5 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 28996, unk = [UNK], unki = 101),
├─ startsym = [CLS],
├─ endsym = [SEP],
├─ padsym = [PAD],
├─ trunc = 512,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
╰─ target[(token, segment)] := SequenceTemplate{String}([CLS]:<type=1> Input[1]:<type=1> [SEP]:<type=1> (Input[2]:<type=2> [SEP]:<type=2>)...)(target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(512))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(512, [PAD], tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target[segment] := TextEncodeBase.trunc_and_pad(512, 1, tail, tail)(target.segment)
╰─ target[segment] := TextEncodeBase.nested2batch(target.segment)
╰─ target := (target.token, target.segment, target.attention_mask)
)
julia> e = encode(bertenc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0;;;], segment = [1; 1; … ; 2; 2;;], attention_mask = NeuralAttentionlib.LengthMask{1, Vector{Int32}}(Int32[9]))
julia> typeof(e)
NamedTuple{(:token, :segment, :attention_mask), Tuple{OneHotArray{0x00007144, 2, 3, Matrix{OneHot{0x00007144}}}, Matrix{Int64}, NeuralAttentionlib.LengthMask{1, Vector{Int32}}}}
TextEncodeBase.encode
— Method
encode(::GPT2TextEncoder, ::String)
Encode a single sentence with the GPT2 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 1}, attention_mask::RevLengthMask{1, Vector{Int32}}}.
encode(::GPT2TextEncoder, ::Vector{String})
Encode a batch of sentences with the GPT2 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::RevLengthMask{1, Vector{Int32}}}.
encode(::GPT2TextEncoder, ::Vector{Vector{String}})
Encode a batch of segments with the GPT2 text encoder. The segments are concatenated together as a batch of sentences. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::RevLengthMask{1, Vector{Int32}}}.
encode(::GPT2TextEncoder, ::Vector{Vector{Vector{String}}})
Encode a batch of multi-sample segments with the GPT2 text encoder. The number of samples per datum needs to be the same (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns @NamedTuple{token::OneHotArray{K, 3}, attention_mask::RevLengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before feeding it to a transformer layer, or make sure the attention does not treat the end-1 dimension as another length dimension.
See also: decode, RevLengthMask
Example
julia> gpt2enc = HuggingFace.load_tokenizer("gpt2")
GPT2TextEncoder(
├─ TextTokenizer(MatchTokenization(CodeNormalizer(BPETokenization(GPT2Tokenization, bpe = CachedBPE(BPE(50000 merges))), codemap = CodeMap{UInt8 => UInt16}(3 code-ranges)), 1 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 50257, unk = <unk>, unki = 0),
├─ codemap = CodeMap{UInt8 => UInt16}(3 code-ranges),
├─ startsym = <|endoftext|>,
├─ endsym = <|endoftext|>,
├─ padsym = <|endoftext|>,
├─ trunc = 1024,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
╰─ target[token] := SequenceTemplate{String}((Input:<type=1>)...)(Val{1}(), target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.RevLengthMask ∘ Transformers.TextEncoders.getlengths(1024))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(1024, <|endoftext|>, head, head)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token, target.attention_mask)
)
julia> e = encode(gpt2enc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0;;;], attention_mask = NeuralAttentionlib.RevLengthMask{1, Vector{Int32}}(Int32[6]))
julia> typeof(e)
NamedTuple{(:token, :attention_mask), Tuple{OneHotArray{0x0000c451, 2, 3, Matrix{OneHot{0x0000c451}}}, NeuralAttentionlib.RevLengthMask{1, Vector{Int32}}}}
TextEncodeBase.encode
— Method
encode(::T5TextEncoder, ::String)
Encode a single sentence with the T5 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 1}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::T5TextEncoder, ::Vector{String})
Encode a batch of sentences with the T5 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::T5TextEncoder, ::Vector{Vector{String}})
Encode a batch of segments with the T5 text encoder. The segments are concatenated together as a batch of sentences with a separation token. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::T5TextEncoder, ::Vector{Vector{Vector{String}}})
Encode a batch of multi-sample segments with the T5 text encoder. The number of samples per datum needs to be the same (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns @NamedTuple{token::OneHotArray{K, 3}, attention_mask::LengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before feeding it to a transformer layer, or make sure the attention does not treat the end-1 dimension as another length dimension.
See also: decode, LengthMask
Example
julia> t5enc = HuggingFace.load_tokenizer("t5")
T5TextEncoder(
├─ TextTokenizer(MatchTokenization(PrecompiledNormalizer(WordReplaceNormalizer(UnigramTokenization(EachSplitTokenization(splitter = isspace), unigram = Unigram(vocab_size = 32100, unk = <unk>)), pattern = r"^(?!▁)(.*)$" => s"▁"), precompiled = PrecompiledNorm(...)), 103 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 32100, unk = <unk>, unki = 3),
├─ endsym = </s>,
├─ padsym = <pad>,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
╰─ target[(token, segment)] := SequenceTemplate{String}(Input[1]:<type=1> </s>:<type=1> (Input[2]:<type=1> </s>:<type=1>)...)(target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(nothing))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(nothing, <pad>, tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token, target.attention_mask)
)
julia> e = encode(t5enc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 1; … ; 0 0 … 0 0; 0 0 … 0 0;;;], attention_mask = NeuralAttentionlib.LengthMask{1, Vector{Int32}}(Int32[9]))
julia> typeof(e)
NamedTuple{(:token, :attention_mask), Tuple{OneHotArray{0x00007d64, 2, 3, Matrix{OneHot{0x00007d64}}}, NeuralAttentionlib.LengthMask{1, Vector{Int32}}}}
Transformers.TextEncoders.bert_cased_tokenizer
— Method
bert_cased_tokenizer(input)
Google's BERT tokenizer, which preserves case during tokenization. Recommended for multilingual data.
Transformers.TextEncoders.bert_uncased_tokenizer
— Method
bert_uncased_tokenizer(input)
Google's BERT tokenizer, which lowercases the input before tokenization.
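For illustration, both tokenizers can be called directly on a string (the sentence is made up):

using Transformers.TextEncoders: bert_cased_tokenizer, bert_uncased_tokenizer

bert_cased_tokenizer("Peter lives in Berlin.")    # keeps the original casing
bert_uncased_tokenizer("Peter lives in Berlin.")  # lowercases before tokenizing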
Transformers.TextEncoders.gpt_tokenizer
— Method
gpt_tokenizer(x)
An alternative to the original tokenizer (the spaCy tokenizer) used in the GPT model.
Transformers.TextEncoders.text_standardize
— Method
text_standardize(text)
The standardization function from the original GPT code. It fixes some issues the spaCy tokenizer had on BooksCorpus and also does some whitespace standardization.
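A hedged usage sketch (the text is made up; standardization is applied before tokenization, following the original GPT preprocessing):

using Transformers.TextEncoders: gpt_tokenizer, text_standardize

raw = "mr. smith   hasn't finished reading the book ."
gpt_tokenizer(text_standardize(raw))  # normalize whitespace and problematic characters, then tokenize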