Transformers.TextEncoders
Text processing module.
API Reference
Transformers.TextEncoders.TrfTextEncoder — Type

    struct TrfTextEncoder{
        T <: AbstractTokenizer,
        V <: AbstractVocabulary{String},
        C, A, EP, OP, DP, TP
    } <: AbstractTransformerTextEncoder
        tokenizer::T
        vocab::V
        config::C
        annotate::A
        process::EP
        onehot::OP
        decode::DP
        textprocess::TP
    end

The general text encoder. TrfTextEncoder has multiple fields that can modify the encode/decode process:
.annotate (default: TextEncoders.annotate_strings): Annotates the input for the tokenizer,
e.g. a String is treated as a single sentence, not a single word.

.process (default: TextEncodeBase.nestedcall(TextEncoders.string_getvalue)): The pre-processing
function applied to the tokenization result, e.g. adding special end-of-sentence tokens, computing the attention mask...

.onehot (default: TextEncoders.lookup_first): Applies one-hot encoding to the pre-processed result;
the default behavior takes the first element of the pre-processed result and applies one-hot encoding to it.

.decode (default: identity): The function that converts each token id back to a string. This can
be used to handle tokenizers that use a different set of vocabulary, such as GPT-2's byte-level vocabulary.

.textprocess (default: TextEncodeBase.join_text): The function that joins the decoded result
into complete sentence(s).
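Conceptually, the fields above form a pipeline. The following is a simplified sketch of how they compose, for illustration only; it is not the library's actual implementation, and the two function names are hypothetical:

```julia
# Hedged sketch of the encode/decode flow, NOT the real library code.
# `e` is assumed to be a TrfTextEncoder.
function sketch_encode(e, input)
    annotated = e.annotate(input)       # mark input as sentence(s)/word(s)
    tokens    = e.tokenizer(annotated)  # run the wrapped tokenizer
    processed = e.process(tokens)       # add special tokens, masks, ...
    return e.onehot(processed)          # look up ids / one-hot encode
end

function sketch_decode(e, ids)
    tokens = e.decode(ids)              # token ids back to strings
    return e.textprocess(tokens)        # join into complete sentence(s)
end
```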
Transformers.TextEncoders.TrfTextEncoder — Method

    TrfTextEncoder(
        tokenizer :: AbstractTokenizer,
        vocab :: AbstractVocabulary{String},
        [ annotate = TextEncoders.annotate_string,
        [ process = TextEncodeBase.nestedcall(TextEncoders.string_getvalue),
        [ onehot = TextEncoders.lookup_first,
        [ decode = identity,
        [ textprocess = TextEncodeBase.join_text, ]]]]]
        ; config...)

Constructor of TrfTextEncoder. All keyword arguments are stored in the .config field.
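A minimal usage sketch, assuming a tokenizer and vocabulary are already available; the variables `tkr` and `vocab` and the keyword names shown are illustrative assumptions, not values defined in this document:

```julia
# Hedged sketch: `tkr` and `vocab` must come from elsewhere
# (e.g. a pretrained model); the keyword names are hypothetical.
enc = TrfTextEncoder(tkr, vocab; trunc = 128, startsym = "[CLS]", endsym = "[SEP]")
# All keyword arguments above end up in `enc.config`.
```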
Transformers.TextEncoders.bert_cased_tokenizer — Method

    bert_cased_tokenizer(input)

Google BERT tokenizer that preserves case during tokenization. Recommended for multilingual data.

Transformers.TextEncoders.bert_uncased_tokenizer — Method

    bert_uncased_tokenizer(input)

Google BERT tokenizer that lowercases the input before tokenization.

Transformers.TextEncoders.gpt_tokenizer — Method

    gpt_tokenizer(x)

An alternative to the original tokenizer (the spaCy tokenizer) used in the GPT model.
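The tokenizer functions above each take raw text and split it into tokens. A brief hedged sketch; the exact token output depends on the tokenizer rules and is not shown here:

```julia
# Illustrative only: compare cased vs. uncased BERT tokenization.
using Transformers.TextEncoders

text = "Göteborg is in Sweden."
cased_tokens   = bert_cased_tokenizer(text)    # case is preserved
uncased_tokens = bert_uncased_tokenizer(text)  # input lowercased first
gpt_tokens     = gpt_tokenizer(text)           # GPT-style tokenization
```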
Transformers.TextEncoders.set_annotate — Method

    set_annotate(builder, e::TrfTextEncoder)

Return a new text encoder with the annotate field replaced with builder(e).

Transformers.TextEncoders.set_config — Method

    set_config(builder, e::TrfTextEncoder)

Return a new text encoder with the config field replaced with builder(e).

Transformers.TextEncoders.set_decode — Method

    set_decode(builder, e::TrfTextEncoder)

Return a new text encoder with the decode field replaced with builder(e).

Transformers.TextEncoders.set_onehot — Method

    set_onehot(builder, e::TrfTextEncoder)

Return a new text encoder with the onehot field replaced with builder(e).

Transformers.TextEncoders.set_process — Method

    set_process(builder, e::TrfTextEncoder)

Return a new text encoder with the process field replaced with builder(e).

Transformers.TextEncoders.set_textprocess — Method

    set_textprocess(builder, e::TrfTextEncoder)

Return a new text encoder with the textprocess field replaced with builder(e).
Transformers.TextEncoders.set_tokenizer — Method

    set_tokenizer(builder, e::TrfTextEncoder)

Return a new text encoder with the tokenizer field replaced with builder(e). builder can return either an AbstractTokenizer or an AbstractTokenization.

Transformers.TextEncoders.set_vocab — Method

    set_vocab(builder, e::TrfTextEncoder)

Return a new text encoder with the vocab field replaced with builder(e). builder can return either an AbstractVocabulary{String} or an AbstractVector{String}.
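Since each set_* function takes the builder as its first argument, Julia's do-block syntax can be used. A hedged sketch, assuming `enc` is an existing TrfTextEncoder; the wrapped behavior shown is purely illustrative:

```julia
# Hedged sketch: `enc` is assumed to be a TrfTextEncoder from elsewhere.
# The builder receives the current encoder and returns the new field value.
enc2 = set_process(enc) do e
    # Illustrative: wrap the existing process function unchanged.
    x -> e.process(x)
end
```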
Transformers.TextEncoders.text_standardize — Method

    text_standardize(text)

The standardization function from the original GPT code. Fixes some issues the spaCy tokenizer had on the BooksCorpus and performs some whitespace standardization.
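A brief usage sketch: standardize raw text before tokenizing it with gpt_tokenizer. The exact normalization rules are defined by the GPT reference code, so no output is shown:

```julia
# Hedged sketch: clean up raw text, then tokenize GPT-style.
using Transformers.TextEncoders

raw     = "some   text  with    odd     spacing..."
cleaned = text_standardize(raw)   # whitespace and punctuation normalization
tokens  = gpt_tokenizer(cleaned)
```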