Transformers.TextEncoders
Text processing module.
API Reference
Transformers.TextEncoders.TrfTextEncoder
— Type

struct TrfTextEncoder{
    T <: AbstractTokenizer,
    V <: AbstractVocabulary{String},
    C, A, EP, OP, DP, TP
} <: AbstractTransformerTextEncoder
    tokenizer::T
    vocab::V
    config::C
    annotate::A
    process::EP
    onehot::OP
    decode::DP
    textprocess::TP
end
The general text encoder. TrfTextEncoder has multiple fields that modify the encode/decode process:

.annotate (default to TextEncoders.annotate_strings): annotates the input for the tokenizer, e.g. a String is treated as a single sentence, not a single word.

.process (default to TextEncodeBase.nestedcall(TextEncoders.string_getvalue)): the pre-processing function applied to the tokenization result, e.g. adding special end-of-sentence tokens, computing the attention mask...

.onehot (default to TextEncoders.lookup_first): applies one-hot encoding to the preprocessed result; the default behavior takes the first element of the preprocessed result and one-hot encodes it.

.decode (default to identity): the function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary, such as GPT-2's byte-level vocabulary.

.textprocess (default to TextEncodeBase.join_text): the function that joins the decoded result into complete sentence(s).
Transformers.TextEncoders.TrfTextEncoder
— Method

TrfTextEncoder(
    tokenizer :: AbstractTokenizer,
    vocab :: AbstractVocabulary{String},
    [ annotate = TextEncoders.annotate_string,
    [ process = TextEncodeBase.nestedcall(TextEncoders.string_getvalue),
    [ onehot = TextEncoders.lookup_first,
    [ decode = identity,
    [ textprocess = TextEncodeBase.join_text, ]]]]]
    ; config...)

Constructor of TrfTextEncoder. All keyword arguments are stored in the .config field.
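A sketch of how the keyword arguments flow into the .config field (this assumes `tkr` and `vocab` are a pre-built AbstractTokenizer and AbstractVocabulary{String}, e.g. taken from an existing pretrained encoder; the symbol names below are illustrative, not required):

```julia
using Transformers.TextEncoders

# `tkr` and `vocab` are assumed to exist already (hypothetical placeholders);
# any keyword argument not consumed positionally goes into `.config`.
enc = TrfTextEncoder(tkr, vocab; startsym = "[CLS]", endsym = "[SEP]")

# the keyword arguments are retrievable from the config field
enc.config.startsym
```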
Transformers.TextEncoders.bert_cased_tokenizer
— Method

bert_cased_tokenizer(input)

Google BERT tokenizer that retains case during tokenization. Recommended for multi-lingual data.
Transformers.TextEncoders.bert_uncased_tokenizer
— Method

bert_uncased_tokenizer(input)

Google BERT tokenizer that lowercases the input before tokenization.
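A hedged usage sketch of the two BERT tokenizers: each takes a single string and returns the pre-tokenized pieces (wordpiece splitting happens separately in the encoder pipeline); exact outputs are not asserted here:

```julia
using Transformers.TextEncoders: bert_cased_tokenizer, bert_uncased_tokenizer

# the uncased variant lowercases before splitting on whitespace and punctuation
bert_uncased_tokenizer("Peter Piper picked a peck!")

# the cased variant keeps the original casing (recommended for multi-lingual data)
bert_cased_tokenizer("Peter Piper picked a peck!")
```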
Transformers.TextEncoders.gpt_tokenizer
— Method

gpt_tokenizer(x)

An alternative to the original tokenizer (the spacy tokenizer) used in the GPT model.
Transformers.TextEncoders.set_annotate
— Method

set_annotate(builder, e::TrfTextEncoder)

Return a new text encoder with the annotate field replaced with builder(e).
Transformers.TextEncoders.set_config
— Method

set_config(builder, e::TrfTextEncoder)

Return a new text encoder with the config field replaced with builder(e).
Transformers.TextEncoders.set_decode
— Method

set_decode(builder, e::TrfTextEncoder)

Return a new text encoder with the decode field replaced with builder(e).
Transformers.TextEncoders.set_onehot
— Method

set_onehot(builder, e::TrfTextEncoder)

Return a new text encoder with the onehot field replaced with builder(e).
Transformers.TextEncoders.set_process
— Method

set_process(builder, e::TrfTextEncoder)

Return a new text encoder with the process field replaced with builder(e).
Transformers.TextEncoders.set_textprocess
— Method

set_textprocess(builder, e::TrfTextEncoder)

Return a new text encoder with the textprocess field replaced with builder(e).
Transformers.TextEncoders.set_tokenizer
— Method

set_tokenizer(builder, e::TrfTextEncoder)

Return a new text encoder with the tokenizer field replaced with builder(e). builder can return either an AbstractTokenizer or an AbstractTokenization.
Transformers.TextEncoders.set_vocab
— Method

set_vocab(builder, e::TrfTextEncoder)

Return a new text encoder with the vocab field replaced with builder(e). builder can return either an AbstractVocabulary{String} or an AbstractVector{String}.
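Because the builder is the first argument, the set_* functions compose naturally with Julia's do-block syntax. A sketch (assuming `enc` is an existing TrfTextEncoder, e.g. loaded from a pretrained model):

```julia
using Transformers.TextEncoders

# assume `enc` is an existing TrfTextEncoder; the builder receives the whole
# encoder, so it can wrap the old field value rather than replace it outright.
enc2 = TextEncoders.set_decode(enc) do e
    # hypothetical tweak: lowercase every decoded token
    ids -> lowercase.(e.decode(ids))
end
```

The builder-first convention also means these calls chain cleanly with `|>` or function composition when several fields need replacing.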
Transformers.TextEncoders.text_standardize
— Method

text_standardize(text)

The standardization function from the original GPT code. Fixes some issues the spacy tokenizer had on the books corpus and also does some whitespace standardization.
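A usage sketch: following the original GPT code, the function normalizes unusual unicode punctuation (dashes, ellipses, and similar) and collapses irregular whitespace before tokenization; the exact replacement table is not asserted here:

```julia
using Transformers.TextEncoders: text_standardize

# hypothetical input with the kind of punctuation the original GPT code cleans up
text_standardize("some text — with   odd…spacing")
```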