Transformers.TextEncoders
Text processing module.
API Reference
Transformers.TextEncoders.BertTextEncoder
— Type
BertTextEncoder
The text encoder for the BERT model (WordPiece tokenization).
Transformers.TextEncoders.GPT2TextEncoder
— Type
GPT2TextEncoder
The text encoder for the GPT2 model (byte-level BytePairEncoding tokenization).
Transformers.TextEncoders.T5TextEncoder
— Type
T5TextEncoder
The text encoder for the T5 model (SentencePiece tokenization).
Transformers.TextEncoders.TransformerTextEncoder
— Type
struct TransformerTextEncoder{
T<:AbstractTokenizer, V<:AbstractVocabulary{String}, P
} <: AbstractTransformerTextEncoder
tokenizer::T
vocab::V
process::P
startsym::String
endsym::String
padsym::String
trunc::Union{Nothing, Int}
end
The text encoder for general transformers. It takes a tokenizer, a vocabulary, and a processing function, and is configured with a start symbol, an end symbol, a padding symbol, and a maximum length.
TransformerTextEncoder(tokenize, vocab, process; trunc = nothing,
    startsym = "<s>", endsym = "</s>", unksym = "<unk>", padsym = "<pad>")
tokenize can be any tokenization function from WordTokenizers. vocab is either a list of words or a Vocab. process can be omitted, in which case a predefined processing pipeline is used. When vocab is a list, the special symbols (e.g. padsym) are added to the word list.
TransformerTextEncoder(f, e::TransformerTextEncoder)
Take a text encoder and create a new text encoder with the same configuration except for the processing function. f is a function that takes the encoder and returns a new processing function. This is useful for replacing part of the processing pipeline, as in the sketch and example below.
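Before the full example below, a minimal hedged sketch of the positional form from the signature above (the word list and the choice of nltk_word_tokenize from WordTokenizers are hypothetical; process is omitted, so the predefined pipeline is used):

using WordTokenizers
using Transformers.TextEncoders

words = ["this", "is", "a", "sentence"]  # hypothetical word list; the special symbols are added automatically
enc = TransformerTextEncoder(nltk_word_tokenize, words; trunc = 8)
encode(enc, "this is a sentence")        # NamedTuple with the one-hot token array and the attention mask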
Example
julia> textenc = TransformerTextEncoder(labels; startsym, endsym, unksym,
padsym = unksym, trunc = 100)
TransformerTextEncoder(
├─ TextTokenizer(default),
├─ vocab = Vocab{String, SizedArray}(size = 37678, unk = </unk>, unki = 1),
├─ startsym = <s>,
├─ endsym = </s>,
├─ padsym = </unk>,
├─ trunc = 100,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := TextEncodeBase.with_head_tail(<s>, </s>)(target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(10))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(10, <pad>, tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token, target.attention_mask)
)
julia> TransformerTextEncoder(ans) do enc
enc.process[1] |> TextEncoders.Pipelines(enc.process[4:5]) |> TextEncoders.PipeGet{(:token,)}()
end
TransformerTextEncoder(
├─ TextTokenizer(default),
├─ vocab = Vocab{String, SizedArray}(size = 37678, unk = </unk>, unki = 1),
├─ startsym = <s>,
├─ endsym = </s>,
├─ padsym = </unk>,
├─ trunc = 100,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := TextEncodeBase.trunc_and_pad(10, <pad>, tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token)
)
TextEncodeBase.decode
— Method
decode(e::AbstractTransformerTextEncoder, x::Union{
Integer,
OneHotArray,
AbstractArray{<:Integer}
})
Decode the one-hot encoding or indices into String (or Array{String}) from the bound vocabulary.
decode(e::AbstractTransformerTextEncoder, x::AbstractArray)
Perform argmax(x; dims = 1) and then decode. x should be collected beforehand if it is on the GPU.
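As a hedged illustration (textenc and the random score matrix are hypothetical stand-ins for a bound encoder and a model's output scores):

using Transformers.TextEncoders

scores = rand(Float32, length(textenc.vocab), 5)  # hypothetical scores: vocabulary size × sequence length
decode(textenc, scores)                           # argmax over dims = 1, then vocabulary lookup
decode(textenc, [2, 7, 12])                       # plain indices are looked up directly

If the scores live on the GPU, collect them first, e.g. decode(textenc, collect(scores)).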
TextEncodeBase.decode
— Method
decode(bertenc::BertTextEncoder, x)
Convert indices back to strings with the BERT vocabulary.
See also: encode
Example
julia> token = encode(bertenc, [["this is a sentence", "and another"]]).token;
julia> decode(bertenc, token)
9×1 Matrix{String}:
"[CLS]"
"this"
"is"
"a"
"sentence"
"[SEP]"
"and"
"another"
"[SEP]"
TextEncodeBase.decode
— Method
decode(gpt2enc::GPT2TextEncoder, x)
Convert indices back to strings with the GPT2 vocabulary. This also maps the bytes back to the normal code ranges, so the resulting strings are not exactly the entries stored in the vocabulary.
See also: encode
Example
julia> token = encode(gpt2enc, [["this is a sentence", "and another"]]).token;
julia> decode(gpt2enc, token)
6×1 Matrix{String}:
"this"
" is"
" a"
" sentence"
"and"
" another"
julia> TextEncodeBase.decode_indices(gpt2enc, token)
6×1 Matrix{String}:
"this"
"Ġis"
"Ġa"
"Ġsentence"
"and"
"Ġanother"
TextEncodeBase.decode
— Method
decode(t5enc::T5TextEncoder, x)
Convert indices back to strings with the T5 vocabulary.
See also: encode
Example
julia> token = encode(t5enc, [["this is a sentence", "and another"]]).token;
julia> decode(t5enc, token)
9×1 Matrix{String}:
"▁this"
"▁is"
"▁"
"a"
"▁sentence"
"</s>"
"▁and"
"▁another"
"</s>"
TextEncodeBase.encode
— Method
encode(e::AbstractTransformerTextEncoder, input::Union{
String, # single sentence
Vector{String}, # batch of sentences
Vector{Vector{String}}, # batch of multi-segment sentences
Vector{Vector{Vector{String}}} # batch of multi-sample multi-segment sentences
})
Tokenize the input and apply the processing function to the tokenized result. The input can be either a single String (one sample) or a nested vector of String up to depth 3 (a batch of samples). How batch input is transformed is defined by the bound processing function. The result of the processing function (its first element, if it returns a tuple) is converted into a one-hot encoding with the bound vocabulary.
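The accepted input depths, as a brief hedged sketch (textenc and the sentences are hypothetical):

encode(textenc, "a single sentence")                             # one sample
encode(textenc, ["sentence one", "sentence two"])                # batch of sentences
encode(textenc, [["segment one", "segment two"], ["another"]])   # batch of multi-segment samples
encode(textenc, [[["seg 1", "seg 2"], ["seg 1"]]])               # batch of multi-sample multi-segment samples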
encode(e::AbstractTransformerTextEncoder, src, trg)
Apply encode to src and trg and build the cross-attention mask. This is a convenience function for encoder-decoder tasks. Return a @NamedTuple{encoder_input, decoder_input} where encoder_input is encode(e, src) and decoder_input is encode(e, trg) plus the cross-attention mask.
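A hedged sketch of the encoder-decoder form (textenc and the sentences are hypothetical; the exact name of the mask field inside decoder_input is not asserted here and can be inspected with keys):

src = ["a source sentence", "another source sentence"]
trg = ["a target sentence", "another target sentence"]
nt  = encode(textenc, src, trg)

nt.encoder_input.token    # same as encode(textenc, src).token
nt.decoder_input.token    # same as encode(textenc, trg).token
keys(nt.decoder_input)    # the extra entry is the cross-attention mask

nt.encoder_input is then passed to the encoder and nt.decoder_input to the decoder of a sequence-to-sequence model.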
TextEncodeBase.encode
— Method
encode(::BertTextEncoder, ::String)
Encode a single sentence with the BERT text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 1}, segment::Vector{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::BertTextEncoder, ::Vector{String})
Encode a batch of sentences with the BERT text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, segment::Matrix{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::BertTextEncoder, ::Vector{Vector{String}})
Encode a batch of segments with the BERT text encoder. The segments are concatenated together as a batch of sentences, with separation tokens inserted and the correct segment indicators in segment. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, segment::Matrix{Int}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::BertTextEncoder, ::Vector{Vector{Vector{String}}})
Encode a batch of multi-sample segments with the BERT text encoder. The number of samples per datum needs to be the same (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns @NamedTuple{token::OneHotArray{K, 3}, segment::Array{Int, 3}, attention_mask::LengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before feeding it to a transformer layer, or make sure the attention does not treat the end-1 dimension as another length dimension.
See also: decode, LengthMask
Example
julia> bertenc = HuggingFace.load_tokenizer("bert-base-cased")
BertTextEncoder(
├─ TextTokenizer(MatchTokenization(WordPieceTokenization(bert_cased_tokenizer, WordPiece(vocab_size = 28996, unk = [UNK], max_char = 100)), 5 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 28996, unk = [UNK], unki = 101),
├─ startsym = [CLS],
├─ endsym = [SEP],
├─ padsym = [PAD],
├─ trunc = 512,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
╰─ target[(token, segment)] := SequenceTemplate{String}([CLS]:<type=1> Input[1]:<type=1> [SEP]:<type=1> (Input[2]:<type=2> [SEP]:<type=2>)...)(target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(512))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(512, [PAD], tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target[segment] := TextEncodeBase.trunc_and_pad(512, 1, tail, tail)(target.segment)
╰─ target[segment] := TextEncodeBase.nested2batch(target.segment)
╰─ target := (target.token, target.segment, target.attention_mask)
)
julia> e = encode(bertenc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0;;;], segment = [1; 1; … ; 2; 2;;], attention_mask = NeuralAttentionlib.LengthMask{1, Vector{Int32}}(Int32[9]))
julia> typeof(e)
NamedTuple{(:token, :segment, :attention_mask), Tuple{OneHotArray{0x00007144, 2, 3, Matrix{OneHot{0x00007144}}}, Matrix{Int64}, NeuralAttentionlib.LengthMask{1, Vector{Int32}}}}
TextEncodeBase.encode
— Method
encode(::GPT2TextEncoder, ::String)
Encode a single sentence with the GPT2 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 1}, attention_mask::RevLengthMask{1, Vector{Int32}}}.
encode(::GPT2TextEncoder, ::Vector{String})
Encode a batch of sentences with the GPT2 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::RevLengthMask{1, Vector{Int32}}}.
encode(::GPT2TextEncoder, ::Vector{Vector{String}})
Encode a batch of segments with the GPT2 text encoder. The segments are concatenated together as a batch of sentences. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::RevLengthMask{1, Vector{Int32}}}.
encode(::GPT2TextEncoder, ::Vector{Vector{Vector{String}}})
Encode a batch of multi-sample segments with the GPT2 text encoder. The number of samples per datum needs to be the same (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns @NamedTuple{token::OneHotArray{K, 3}, attention_mask::RevLengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before feeding it to a transformer layer, or make sure the attention does not treat the end-1 dimension as another length dimension.
See also: decode, RevLengthMask
Example
julia> gpt2enc = HuggingFace.load_tokenizer("gpt2")
GPT2TextEncoder(
├─ TextTokenizer(MatchTokenization(CodeNormalizer(BPETokenization(GPT2Tokenization, bpe = CachedBPE(BPE(50000 merges))), codemap = CodeMap{UInt8 => UInt16}(3 code-ranges)), 1 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 50257, unk = <unk>, unki = 0),
├─ codemap = CodeMap{UInt8 => UInt16}(3 code-ranges),
├─ startsym = <|endoftext|>,
├─ endsym = <|endoftext|>,
├─ padsym = <|endoftext|>,
├─ trunc = 1024,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
╰─ target[token] := SequenceTemplate{String}((Input:<type=1>)...)(Val{1}(), target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.RevLengthMask ∘ Transformers.TextEncoders.getlengths(1024))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(1024, <|endoftext|>, head, head)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token, target.attention_mask)
)
julia> e = encode(gpt2enc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 0; … ; 0 0 … 0 0; 0 0 … 0 0;;;], attention_mask = NeuralAttentionlib.RevLengthMask{1, Vector{Int32}}(Int32[6]))
julia> typeof(e)
NamedTuple{(:token, :attention_mask), Tuple{OneHotArray{0x0000c451, 2, 3, Matrix{OneHot{0x0000c451}}}, NeuralAttentionlib.RevLengthMask{1, Vector{Int32}}}}
TextEncodeBase.encode
— Method
encode(::T5TextEncoder, ::String)
Encode a single sentence with the T5 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 1}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::T5TextEncoder, ::Vector{String})
Encode a batch of sentences with the T5 text encoder. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::T5TextEncoder, ::Vector{Vector{String}})
Encode a batch of segments with the T5 text encoder. The segments are concatenated together as a batch of sentences with a separation token. The default pipeline returns @NamedTuple{token::OneHotArray{K, 2}, attention_mask::LengthMask{1, Vector{Int32}}}.
encode(::T5TextEncoder, ::Vector{Vector{Vector{String}}})
Encode a batch of multi-sample segments with the T5 text encoder. The number of samples per datum needs to be the same (e.g. length(batch[1]) == length(batch[2])). The default pipeline returns @NamedTuple{token::OneHotArray{K, 3}, attention_mask::LengthMask{2, Matrix{Int32}}}. Notice: if you want each sample to be independent of the others, the result needs to be reshaped before feeding it to a transformer layer, or make sure the attention does not treat the end-1 dimension as another length dimension.
See also: decode, LengthMask
Example
julia> t5enc = HuggingFace.load_tokenizer("t5")
T5TextEncoder(
├─ TextTokenizer(MatchTokenization(PrecompiledNormalizer(WordReplaceNormalizer(UnigramTokenization(EachSplitTokenization(splitter = isspace), unigram = Unigram(vocab_size = 32100, unk = <unk>)), pattern = r"^(?!▁)(.*)$" => s"▁"), precompiled = PrecompiledNorm(...)), 103 patterns)),
├─ vocab = Vocab{String, SizedArray}(size = 32100, unk = <unk>, unki = 3),
├─ endsym = </s>,
├─ padsym = <pad>,
└─ process = Pipelines:
╰─ target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
╰─ target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
╰─ target[(token, segment)] := SequenceTemplate{String}(Input[1]:<type=1> </s>:<type=1> (Input[2]:<type=1> </s>:<type=1>)...)(target.token)
╰─ target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(nothing))(target.token)
╰─ target[token] := TextEncodeBase.trunc_and_pad(nothing, <pad>, tail, tail)(target.token)
╰─ target[token] := TextEncodeBase.nested2batch(target.token)
╰─ target := (target.token, target.attention_mask)
)
julia> e = encode(t5enc, [["this is a sentence", "and another"]])
(token = [0 0 … 0 0; 0 0 … 0 1; … ; 0 0 … 0 0; 0 0 … 0 0;;;], attention_mask = NeuralAttentionlib.LengthMask{1, Vector{Int32}}(Int32[9]))
julia> typeof(e)
NamedTuple{(:token, :attention_mask), Tuple{OneHotArray{0x00007d64, 2, 3, Matrix{OneHot{0x00007d64}}}, NeuralAttentionlib.LengthMask{1, Vector{Int32}}}}
Transformers.TextEncoders.bert_cased_tokenizer
— Method
bert_cased_tokenizer(input)
Google's BERT tokenizer, which preserves case during tokenization. Recommended for multilingual data.
Transformers.TextEncoders.bert_uncased_tokenizer
— Method
bert_uncased_tokenizer(input)
Google's BERT tokenizer, which lowercases the input before tokenization.
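For illustration, both tokenizers can be called directly on a string (the sentence is made up):

using Transformers.TextEncoders: bert_cased_tokenizer, bert_uncased_tokenizer

bert_cased_tokenizer("Peter lives in Berlin.")    # keeps the original casing
bert_uncased_tokenizer("Peter lives in Berlin.")  # lowercases before tokenizing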
Transformers.TextEncoders.gpt_tokenizer
— Method
gpt_tokenizer(x)
An alternative to the original tokenizer (the spaCy tokenizer) used in the GPT model.
Transformers.TextEncoders.text_standardize
— Method
text_standardize(text)
The standardization function from the original GPT code. It fixes some issues the spaCy tokenizer had on BooksCorpus and also does some whitespace standardization.
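A hedged usage sketch (the text is made up; standardization is applied before tokenization, following the original GPT preprocessing):

using Transformers.TextEncoders: gpt_tokenizer, text_standardize

raw = "mr. smith   hasn't finished reading the book ."
gpt_tokenizer(text_standardize(raw))  # normalize whitespace and problematic characters, then tokenize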