Transformers.TextEncoders
Text processing module.
API Reference
Transformers.TextEncoders.TrfTextEncoder
— Type

struct TrfTextEncoder{
    T <: AbstractTokenizer,
    V <: AbstractVocabulary{String},
    C, A, EP, OP, DP, TP
} <: AbstractTransformerTextEncoder
    tokenizer::T
    vocab::V
    config::C
    annotate::A
    process::EP
    onehot::OP
    decode::DP
    textprocess::TP
end
The general text encoder. TrfTextEncoder has multiple fields that modify the encode/decode process:

.annotate (default to TextEncoders.annotate_strings): annotates the input for the tokenizer, e.g. a String is treated as a single sentence, not a single word.

.process (default to TextEncodeBase.nestedcall(TextEncoders.string_getvalue)): the pre-processing function applied to the tokenization result, e.g. adding special end-of-sentence tokens, computing the attention mask...

.onehot (default to TextEncoders.lookup_first): applies one-hot encoding to the preprocessed result; the default behavior takes the first element of the preprocessed result and one-hot encodes it.

.decode (default to identity): the function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary, such as GPT-2's byte-level vocabulary.

.textprocess (default to TextEncodeBase.join_text): the function that joins the decoded result into complete sentence(s).
Transformers.TextEncoders.TrfTextEncoder
— Method

TrfTextEncoder(
    tokenizer :: AbstractTokenizer,
    vocab :: AbstractVocabulary{String},
    [ annotate = TextEncoders.annotate_string,
    [ process = TextEncodeBase.nestedcall(TextEncoders.string_getvalue),
    [ onehot = TextEncoders.lookup_first,
    [ decode = identity,
    [ textprocess = TextEncodeBase.join_text, ]]]]]
    ; config...)

Constructor of TrfTextEncoder. All keyword arguments are stored in the .config field.
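A sketch of how the keyword arguments flow into the .config field (this assumes `tkr` and `vocab` are a pre-built AbstractTokenizer and AbstractVocabulary{String}, e.g. taken from an existing pretrained encoder; the symbol names below are illustrative, not required):

```julia
using Transformers.TextEncoders

# `tkr` and `vocab` are assumed to exist already (hypothetical placeholders);
# any keyword argument not consumed positionally goes into `.config`.
enc = TrfTextEncoder(tkr, vocab; startsym = "[CLS]", endsym = "[SEP]")

# the keyword arguments are retrievable from the config field
enc.config.startsym
```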
Transformers.TextEncoders.bert_cased_tokenizer
— Method

bert_cased_tokenizer(input)

Google BERT tokenizer that retains case during tokenization. Recommended for multi-lingual data.
Transformers.TextEncoders.bert_uncased_tokenizer
— Method

bert_uncased_tokenizer(input)

Google BERT tokenizer that lowercases the input before tokenization.
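A hedged usage sketch of the two BERT tokenizers: each takes a single string and returns the pre-tokenized pieces (wordpiece splitting happens separately in the encoder pipeline); exact outputs are not asserted here:

```julia
using Transformers.TextEncoders: bert_cased_tokenizer, bert_uncased_tokenizer

# the uncased variant lowercases before splitting on whitespace and punctuation
bert_uncased_tokenizer("Peter Piper picked a peck!")

# the cased variant keeps the original casing (recommended for multi-lingual data)
bert_cased_tokenizer("Peter Piper picked a peck!")
```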
Transformers.TextEncoders.gpt_tokenizer
— Method

gpt_tokenizer(x)

An alternative to the original tokenizer (the spacy tokenizer) used in the GPT model.
Transformers.TextEncoders.set_annotate
— Method

set_annotate(builder, e::TrfTextEncoder)

Return a new text encoder with the annotate field replaced with builder(e).
Transformers.TextEncoders.set_config
— Method

set_config(builder, e::TrfTextEncoder)

Return a new text encoder with the config field replaced with builder(e).
Transformers.TextEncoders.set_decode
— Method

set_decode(builder, e::TrfTextEncoder)

Return a new text encoder with the decode field replaced with builder(e).
Transformers.TextEncoders.set_onehot
— Method

set_onehot(builder, e::TrfTextEncoder)

Return a new text encoder with the onehot field replaced with builder(e).
Transformers.TextEncoders.set_process
— Method

set_process(builder, e::TrfTextEncoder)

Return a new text encoder with the process field replaced with builder(e).
Transformers.TextEncoders.set_textprocess
— Method

set_textprocess(builder, e::TrfTextEncoder)

Return a new text encoder with the textprocess field replaced with builder(e).
Transformers.TextEncoders.set_tokenizer
— Method

set_tokenizer(builder, e::TrfTextEncoder)

Return a new text encoder with the tokenizer field replaced with builder(e). builder can return either an AbstractTokenizer or an AbstractTokenization.
Transformers.TextEncoders.set_vocab
— Method

set_vocab(builder, e::TrfTextEncoder)

Return a new text encoder with the vocab field replaced with builder(e). builder can return either an AbstractVocabulary{String} or an AbstractVector{String}.
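Because the builder is the first argument, the set_* functions compose naturally with Julia's do-block syntax. A sketch (assuming `enc` is an existing TrfTextEncoder, e.g. loaded from a pretrained model):

```julia
using Transformers.TextEncoders

# assume `enc` is an existing TrfTextEncoder; the builder receives the whole
# encoder, so it can wrap the old field value rather than replace it outright.
enc2 = TextEncoders.set_decode(enc) do e
    # hypothetical tweak: lowercase every decoded token
    ids -> lowercase.(e.decode(ids))
end
```

The builder-first convention also means these calls chain cleanly with `|>` or function composition when several fields need replacing.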
Transformers.TextEncoders.text_standardize
— Method

text_standardize(text)

The standardization function from the original GPT code. Fixes some issues the spacy tokenizer had on the books corpus and also does some whitespace standardization.
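A usage sketch: following the original GPT code, the function normalizes unusual unicode punctuation (dashes, ellipses, and similar) and collapses irregular whitespace before tokenization; the exact replacement table is not asserted here:

```julia
using Transformers.TextEncoders: text_standardize

# hypothetical input with the kind of punctuation the original GPT code cleans up
text_standardize("some text — with   odd…spacing")
```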