Transformers.TextEncoders

Text processing module.

API Reference

Transformers.TextEncoders.TrfTextEncoder - Type
struct TrfTextEncoder{
    T <: AbstractTokenizer,
    V <: AbstractVocabulary{String},
    C, A, EP, OP, DP, TP
} <: AbstractTransformerTextEncoder
    tokenizer::T
    vocab::V
    config::C
    annotate::A
    process::EP
    onehot::OP
    decode::DP
    textprocess::TP
end

The general text encoder. TrfTextEncoder has several fields that customize the encode/decode process:

  1. .annotate (defaults to TextEncoders.annotate_strings): annotates the input string for the tokenizer, e.g. a String is treated as a single sentence, not a single word.
  2. .process (defaults to TextEncodeBase.nestedcall(TextEncoders.string_getvalue)): the preprocessing function applied to the tokenization results, e.g. adding the special end-of-sentence token or computing the attention mask.
  3. .onehot (defaults to TextEncoders.lookup_first): applies one-hot encoding to the preprocessing result. The default behavior takes the first element of the preprocessing result and one-hot encodes it.
  4. .decode (defaults to identity): the function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary, such as GPT-2's byte-level vocabulary.
  5. .textprocess (defaults to TextEncodeBase.join_text): the function that joins the decoded result into complete sentence(s).
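The order in which the five fields are applied can be sketched with a toy encoder. This is a minimal illustration in plain Julia, not the library's implementation: `ToyEncoder`, `toy_tokenize`, `toy_lookup`, `toy_encode`, and `toy_decode` are hypothetical names, a whitespace split stands in for the tokenizer, and an index lookup stands in for one-hot encoding.

```julia
# Toy stand-in for the five customizable stages; none of these names come from Transformers.jl.
struct ToyEncoder{A, EP, OP, DP, TP}
    vocab::Vector{String}
    annotate::A
    process::EP
    onehot::OP
    decode::DP
    textprocess::TP
end

# Whitespace split stands in for a real tokenizer.
toy_tokenize(s::AbstractString) = String.(split(s))

# Index lookup stands in for one-hot encoding; unknown tokens map to "<unk>".
function toy_lookup(vocab, tokens)
    unk = findfirst(==("<unk>"), vocab)
    return [something(findfirst(==(t), vocab), unk) for t in tokens]
end

function toy_encode(e::ToyEncoder, text)
    annotated = e.annotate(text)          # .annotate: mark how the input should be read
    tokens = toy_tokenize(annotated)      # tokenizer
    processed = e.process(tokens)         # .process: e.g. add special tokens
    return e.onehot(e.vocab, processed)   # .onehot: map tokens to ids
end

function toy_decode(e::ToyEncoder, ids)
    tokens = [e.decode(e.vocab[i]) for i in ids]  # .decode: id -> token string
    return e.textprocess(tokens)                  # .textprocess: join into text
end

vocab = ["<unk>", "<s>", "</s>", "hello", "world"]
enc = ToyEncoder(
    vocab,
    identity,                        # annotate
    toks -> ["<s>"; toks; "</s>"],   # process: wrap with start/end tokens
    toy_lookup,                      # onehot
    identity,                        # decode
    toks -> join(toks, " "),         # textprocess
)

ids = toy_encode(enc, "hello world")
text = toy_decode(enc, ids)
```

Swapping any one function in the constructor call changes only that stage, which is the kind of customization the field list describes.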

Transformers.TextEncoders.TrfTextEncoder - Method
TrfTextEncoder(
    tokenizer     :: AbstractTokenizer ,
    vocab         :: AbstractVocabulary{String} ,
    [ annotate    =  TextEncoders.annotate_strings ,
    [ process     =  TextEncodeBase.nestedcall(TextEncoders.string_getvalue) ,
    [ onehot      =  TextEncoders.lookup_first ,
    [ decode      =  identity ,
    [ textprocess =  TextEncodeBase.join_text, ]]]]]
    ; config...)

Constructor of TrfTextEncoder. All keyword arguments are stored in the .config field.
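As a rough illustration of how a trailing `; config...` can gather arbitrary keyword arguments into a single value, the sketch below uses plain Julia only. `collect_config` is a hypothetical helper, and the keyword names `startsym`, `endsym`, and `trunc` are chosen for illustration; it does not show how the constructor itself stores them.

```julia
# Hypothetical helper: gather arbitrary keyword arguments into a NamedTuple.
collect_config(; config...) = values(config)

# The keyword names below are illustrative assumptions, not required settings.
cfg = collect_config(startsym = "<s>", endsym = "</s>", trunc = 512)

# Individual settings can then be read back by name.
cfg.trunc
```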

Transformers.TextEncoders.set_tokenizer - Method
set_tokenizer(builder, e::TrfTextEncoder)

Return a new text encoder with the tokenizer field replaced by builder(e). builder can return either an AbstractTokenizer or an AbstractTokenization.
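Because the builder is the first argument, a function of this shape can be called with Julia's do-block syntax. The sketch below shows that calling pattern on a toy immutable struct; `MiniEncoder` and `with_tokenizer` are hypothetical stand-ins, and plain functions play the role of tokenizers.

```julia
# Hypothetical immutable encoder; the names are illustrative only.
struct MiniEncoder{T}
    tokenizer::T
    vocab::Vector{String}
end

# Builder-first signature, so callers can use do-block syntax.
# A new value is returned; the original encoder is left untouched.
with_tokenizer(builder, e::MiniEncoder) = MiniEncoder(builder(e), e.vocab)

original = MiniEncoder(split, ["a", "b"])

updated = with_tokenizer(original) do e
    # The builder receives the whole encoder and can wrap its current tokenizer.
    s -> e.tokenizer(lowercase(s))
end

updated.tokenizer("A B")
```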

Transformers.TextEncoders.set_vocab - Method
set_vocab(builder, e::TrfTextEncoder)

Return a new text encoder with the vocab field replaced by builder(e). builder can return either an AbstractVocabulary{String} or an AbstractVector{String}.
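One way a function like this could accept either return type is to normalize the builder's result through dispatch. The sketch below is an assumption about the general technique, not the library's code: `ToyVocab`, `as_vocab`, `VocabHolder`, and `with_vocab` are hypothetical names.

```julia
# Hypothetical vocabulary wrapper; not the library's AbstractVocabulary.
struct ToyVocab
    list::Vector{String}
end

# Normalize whatever the builder returned into a ToyVocab.
as_vocab(v::ToyVocab) = v
as_vocab(v::AbstractVector{String}) = ToyVocab(collect(v))

struct VocabHolder
    vocab::ToyVocab
end

# Builder-first, returning a new holder rather than mutating the old one.
with_vocab(builder, e::VocabHolder) = VocabHolder(as_vocab(builder(e)))

base = VocabHolder(ToyVocab(["<unk>", "hello"]))

# Builder returning a plain vector of strings.
extended = with_vocab(e -> [e.vocab.list; "world"], base)

# Builder returning a ready-made vocabulary object.
reduced = with_vocab(e -> ToyVocab(["<unk>"]), base)
```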
