API reference

TextEncodeBase.AbstractTokenization — Type

Abstract type for tokenization.

The tokenization procedure is separated into multiple TokenStages and recursive calls of splitting, wrap, and tokenize. splitting breaks a string into substrings, wrap marks the substrings with new TokenStages, and tokenize is responsible for the tokenization.

TextEncodeBase.AbstractTokenizer — Type

Abstract type for tokenizers.

Each tokenizer is linked with a tokenization (by defining tokenization(::Tokenizer) = Tokenization()). The overall framework dispatches on both the tokenizer and the tokenization, but most of the time we only add methods for the tokenization. This allows further composability and lets a given tokenizer intervene in the tokenization process.

TextEncodeBase.RepeatedTerm — Type
RepeatedTerm(terms::TemplateTerm...; dynamic_type_id = false)

A special term indicating that the enclosed sequence of terms can appear zero or more times. Cannot be nested. If dynamic_type_id is set, each repetition adds an offset to the type id of the repeated terms. The offset value is the repetition number, starting from 0, times dynamic_type_id.
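For illustration, a sketch combining RepeatedTerm with SequenceTemplate (see also the SequenceTemplate examples in this reference); the type-id values in the comments are derived by applying the offset rule above literally:

```julia
using TextEncodeBase

# The repeated segment carries type id 2; with dynamic_type_id = 1, the
# second repetition gets an extra offset of 1 * dynamic_type_id.
template = SequenceTemplate(
    ConstTerm("[CLS]", 1), InputTerm{String}(1), ConstTerm("[SEP]", 1),
    RepeatedTerm(InputTerm{String}(2), ConstTerm("[SEP]", 2); dynamic_type_id = 1)
)

tokens, type_ids = template(["a"], ["b"], ["c"])
# tokens should be ["[CLS]", "a", "[SEP]", "b", "[SEP]", "c", "[SEP]"];
# per the offset rule, type_ids should be [1, 1, 1, 2, 2, 3, 3].
```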

TextEncodeBase.SequenceTemplate — Type
SequenceTemplate(terms::TemplateTerm)(sequences...)

Construct a function from multiple TemplateTerms that indicates how to combine the input sequences. The function returns a tuple of the resulting sequence and a type id sequence (a special number associated with each template term).

Example

julia> SequenceTemplate(ConstTerm(-1), InputTerm{Int}(), ConstTerm(-2))(1:5)[1] == TextEncodeBase.with_head_tail(1:5, -1, -2)
true

julia> SequenceTemplate(ConstTerm(-1), InputTerm{Int}(), ConstTerm(-2))(1:5)
([-1, 1, 2, 3, 4, 5, -2], [1, 1, 1, 1, 1, 1, 1])

julia> bert_template = SequenceTemplate(
           ConstTerm("[CLS]", 1), InputTerm{String}(1), ConstTerm("[SEP]", 1),
           RepeatedTerm(InputTerm{String}(2), ConstTerm("[SEP]", 2))
       )
SequenceTemplate{String}([CLS]:<type=1> Input:<type=1> [SEP]:<type=1> (Input:<type=2> [SEP]:<type=2>)...)

julia> bert_template(["hello", "world"])
(["[CLS]", "hello", "world", "[SEP]"], [1, 1, 1, 1])

julia> bert_template(["hello", "world"], ["today", "is", "a", "good", "day"])
(["[CLS]", "hello", "world", "[SEP]", "today", "is", "a", "good", "day", "[SEP]"], [1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
TextEncodeBase.Splittability — Type

splittability trait

The splittability trait decides whether a given combination (tokenizer x tokenization x stage) is splittable or not (Splittable or UnSplittable). For example, DefaultTokenization and SentenceStage are splittable (i.e. splittability(::DefaultTokenization, ::SentenceStage) = Splittable()). The splittability changes the behavior of tokenize: if the combination is splittable, tokenize will call splitting on the input, wrap each splitting result, and recurse. Otherwise, it will directly call wrap and then recurse into tokenize.

TextEncodeBase.TextEncoder — Method
TextEncoder(builder, e::TextEncoder)

Given an encoder, return a new encoder that has the same tokenizer and vocabulary. builder is a function that takes an encoder and returns a new processing function.

TextEncodeBase.TokenStages — Type

Abstract type for types that wrap the input into a specific stage to control tokenization.

There are six builtin stages in TextEncodeBase (all abstract XStage <: TokenStages):

1. Document <: DocumentStage: the input string is a full document,
 and thus needs to be split into multiple sentences.
2. Sentence <: SentenceStage: the input string is a full sentence,
 and thus needs to be split into multiple parts (SubSentence/Word/Token).
3. SubSentence <: SubSentenceStage: special wrapper for the case where the tokenizer
 does not directly break the sentence into words/tokens and these pieces contain
 multiple words/tokens, but you need the information that they are not full sentences.
4. Word <: WordStage: the input string is a single word.
5. SubWord <: SubWordStage: similar to SubSentence, but for words.
6. Token <: TokenStage: the final piece of the tokenization process.
 Generally, it's used to mark the end of this piece, which should
 never be split further.

Each wrapper has two fields: x for the input and meta for extra information (nothing if not provided).
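For example, wrapping a raw string for the sentence stage (qualified names are used here, since the stage wrappers may not all be exported):

```julia
using TextEncodeBase

# Wrap a raw string as a sentence-stage input; meta defaults to nothing.
s = TextEncodeBase.Sentence("Peter Piper picked a peck of pickled peppers")
s.x     # the wrapped input string
s.meta  # nothing, since no meta information was provided
```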

TextEncodeBase.Vocab — Type
Vocab(data::Vector{<:AbstractString}, unk::AbstractString="[UNK]")

Constructor for Vocab. data is the list of vocabulary words and can be non-unique. The actual list will be the unique version of data (i.e. vocab.list = unique(data)). unk is the indicator word for all unknown words. unk can be either in or not in data, depending on the use case.
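For example (consistent with the lookup examples later in this reference):

```julia
using TextEncodeBase

# Duplicates are dropped, so vocab.list is ["a", "b", "c"].
vocab = Vocab(["a", "b", "c", "a", "b", "c"])

# Here the unk indicator "xxx" is itself a vocabulary word,
# so unknown strings map to its index (3) instead of 0.
vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
```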

TextEncodeBase.Vocab — Method
Vocab{T}(data::AbstractVector, unk) where T

Construct a Vocab with element type T. unk must be specified.
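A minimal sketch of the typed constructor:

```julia
using TextEncodeBase

# Element type given explicitly; unk is mandatory for this constructor.
v = Vocab{String}(["a", "b"], "[UNK]")
```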

TextEncodeBase.batch2nested — Method
batch2nested(x)

Convert a single array into a nested array.

See also: nested2batch

Example

julia> x = ["a" "d"; "b" "e"; "c" "f";;; "x" "u"; "y" "v"; "z" "w"; ]
3×2×2 Array{String, 3}:
[:, :, 1] =
 "a"  "d"
 "b"  "e"
 "c"  "f"

[:, :, 2] =
 "x"  "u"
 "y"  "v"
 "z"  "w"

julia> TextEncodeBase.batch2nested(x)
2-element Vector{Vector{Vector{String}}}:
 [["a", "b", "c"], ["d", "e", "f"]]
 [["x", "y", "z"], ["u", "v", "w"]]
TextEncodeBase.join_text — Function
join_text(x::AbstractArray [, delim [, last]])

Join the innermost arrays and preserve the outer array structure. If the innermost array is multi-dimensional, join the text along the first dimension.

Example

julia> TextEncodeBase.join_text([["a", "b", "c"], ['x', 'y', 'z']])
2-element Vector{String}:
 "abc"
 "xyz"

julia> TextEncodeBase.join_text([["a", "b", "c"], ['x', 'y', 'z']], " + ")
2-element Vector{String}:
 "a + b + c"
 "x + y + z"

julia> TextEncodeBase.join_text([[["a", "b", "c"], ['x', 'y', 'z']]], " + ", " = ")
1-element Vector{Vector{String}}:
 ["a + b = c", "x + y = z"]

julia> TextEncodeBase.join_text(["a" "d"; "b" "e"; "c" "f";;; "x" "u"; "y" "v"; "z" "w"; ], " + ", " = ")
2×2 Matrix{String}:
 "a + b = c"  "x + y = z"
 "d + e = f"  "u + v = w"
TextEncodeBase.lookup — Function
lookup(v::Vocab, x)

Look up x in v. The lookup direction depends on the type of x. If x is an integer, return the x-th word in the vocabulary list (i.e. v.list[x]), or the unknown word (v.unk) if x is out of bounds. If x is a string, return the index of x in the vocabulary list (i.e. findfirst(==(x), v.list)), or the unknown index if x is not found in the list. If the unknown word v.unk is in the list, the unknown index is its index; otherwise it is 0.

This function is bidirectional except for Vocab{<:Integer}. For an integer vocabulary, this function only gets the x-th word (v.list[x]). Use lookup(Int, v, x) for explicit index lookup.

Example

julia> vocab = Vocab(["a", "b", "c", "a", "b", "c"])
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = [UNK], unki = 0)

julia> vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = xxx, unki = 3)

julia> lookup(vocab, "b")
2

julia> lookup(vocab, "d")
0

julia> lookup(vocab_unk, "d")
3

julia> lookup(vocab, 1)
"a"

julia> lookup(vocab, 10000)
"[UNK]"

julia> lookup(vocab_unk, 10000)
"xxx"
TextEncodeBase.lookup — Method
lookup(Int, v::Vocab, x)

The explicit version of lookup(v, x). Look up the index of x in the vocabulary list. x should have the same type as the Vocab's element type.

Example

julia> vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = xxx, unki = 3)

julia> lookup(Int, vocab_unk, "b")
2
TextEncodeBase.lookup — Method
lookup(OneHot, v::Vocab, i)

Look up i and convert it into a one-hot representation.

Example

julia> lookup(OneHot, vocab, "a")
3-element OneHot{3}:
 1
 0
 0

julia> lookup(OneHot, vocab, ["a" "b"; "c" "d"])
3x2x2 OneHotArray{3, 3, Matrix{OneHot{0x00000003}}}:
[:, :, 1] =
 1  0
 0  0
 0  1

[:, :, 2] =
 0  0
 1  0
 0  0

julia> lookup(OneHot, vocab, 3)
ERROR: DomainError with c:
cannot convert `lookup(::Vocab, 3)` = "c" into one-hot representation.
Stacktrace:
[...]
TextEncodeBase.lookup — Method
lookup(v::Vocab, is::AbstractArray)

Recursively look up values from is.

Example

julia> lookup(vocab, ["b", "c", "a", "A", "[UNK]"])
5-element Vector{Int64}:
 2
 3
 1
 0
 0

julia> lookup(vocab, [1, "a", 0, "A", "[UNK]"])
5-element Vector{Any}:
  "a"
 1
  "[UNK]"
 0
 0
TextEncodeBase.lookup — Method
lookup(v::Vocab, i::OneHotArray)

Convert the one-hot representation back into words.

Example

julia> lookup(OneHot, vocab, ["a" "b"; "c" "d"])
3x2x2 OneHotArray{3, 3, Matrix{OneHot{0x00000003}}}:
[:, :, 1] =
 1  0
 0  0
 0  1

[:, :, 2] =
 0  0
 1  0
 0  0

julia> lookup(vocab, ans)
2×2 Matrix{String}:
 "a"  "b"
 "c"  "[UNK]"
TextEncodeBase.lookup — Method
lookup(::Type{T}, v::Vocab{T}, i::Integer) where T

The explicit version of lookup(v, i). Look up the word at index i in the vocabulary list. T should be the same type as the Vocab's element type. This method won't work on an integer vocabulary; use lookup(v, i) directly.

Example

julia> vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = xxx, unki = 3)

julia> lookup(String, vocab_unk, 1)
"a"
TextEncodeBase.matchsplits — Method
matchsplits(pattern::AbstractPattern, str::String)

Split str with the regular expression pattern. Return a lazy iterator whose elements are Tuple{Bool, SubString}. The Bool indicates whether the SubString is a match of pattern.

Example

julia> matchsplits(r"a|c", "abc"^3)
MatchSplitIterator(r"a|c", "abcabcabc")

julia> collect(matchsplits(r"a|c", "abc"^3))
9-element Vector{Tuple{Bool, SubString{String}}}:
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")
TextEncodeBase.matchsplits — Method
matchsplits(patterns::Vector{<:AbstractPattern}, str::String)

Split str with the list of regular expression patterns. Return a lazy iterator whose elements are Tuple{Bool, SubString}. The Bool indicates whether the SubString is a match of one of the patterns. The match order is specified by the list order.

Example

julia> matchsplits([r"a", r"c"], "abc"^3)
MatchSplits(Regex[r"a", r"c"], "abcabcabc")

julia> collect(matchsplits([r"a", r"c"], "abc"^3))
9-element Vector{Tuple{Bool, SubString{String}}}:
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")

julia> collect(matchsplits([r"ab", r"bc"], "abc"^3))
6-element Vector{Tuple{Bool, SubString{String}}}:
 (1, "ab")
 (0, "c")
 (1, "ab")
 (0, "c")
 (1, "ab")
 (0, "c")
TextEncodeBase.nested2batch — Method
nested2batch(x)

Convert a nested array into a single array.

See also: batch2nested

Example

julia> TextEncodeBase.nested2batch([[[1 2],[3 4]]])
1×2×2×1 Array{Int64, 4}:
[:, :, 1, 1] =
 1  2

[:, :, 2, 1] =
 3  4
TextEncodeBase.peek_sequence_sample_type — Method
peek_sequence_sample_type([T::Type,] x)

Non-recursive version of sequence_sample_type. Return -1 if x is an array of arrays whose elements have unknown types, in which case it is possible that sequence_sample_type(x[i]) == -2. Specify T to check whether x is a nested array with element type T. If T is not specified, every type that is not a subtype of AbstractArray counts as an element type.

see also: type_sequence_sample_type, sequence_sample_type

Example

julia> TextEncodeBase.peek_sequence_sample_type([1,2,3])
1

julia> peek_sequence_sample_type(Int, Any[[[1,2,3]]]), sequence_sample_type(Int, Any[[[1,2,3]]])
(-1, 3)

julia> peek_sequence_sample_type(Int, [[[1,2,3], "abc"]]), sequence_sample_type(Int, [[[1,2,3], "abc"]])
(-1, -2)
TextEncodeBase.sequence_sample_type — Method
sequence_sample_type([T::Type,] x)

Get the depth of the nested array. If a natural number is returned, x is a nested array where every element has the same depth. Return -2 if x is not a nested array or if the depths of its elements differ. The depth of an empty array is computed from its type, so sequence_sample_type(Any[]) is 1. Specify T to check whether x is a nested array with element type T. If T is not specified, every type that is not a subtype of AbstractArray counts as an element type.

see also: type_sequence_sample_type, peek_sequence_sample_type

Example

julia> sequence_sample_type([[1,2,3]])
2

julia> sequence_sample_type([[[2,3], [1]], Vector{Int}[]])
3

julia> sequence_sample_type([[[2,3], [1]], Any[]])
-2

julia> sequence_sample_type(Int, [[1,2], 3])
-2

julia> sequence_sample_type(Int, Any[[1,2], Int[]])
2
TextEncodeBase.splittability — Function
splittability(args...)

Return the splittability (Splittable/UnSplittable) of the given argument combination. Overload this function to make a TokenStages splittable.

TextEncodeBase.splitting — Function
splitting(t::AbstractTokenization, x::TokenStages)

Split x given its tokenization stage. For example, the default behavior at the document stage is splitting the document into sentences (with WordTokenizers.split_sentences).

Overload this method for custom tokenization.

TextEncodeBase.trunc_and_pad — Function
trunc_and_pad(x, maxn, pad)

Truncate x if its length exceeds maxn, and append pad to the end of x until all lengths are the same. x can be either a nested or a single array. If maxn is nothing, the largest length of the innermost arrays is used, and the behavior is then equal to trunc_or_pad with nothing.

trunc_and_pad(x, maxn, pad, trunc_end = :tail, pad_end = :tail)

trunc_end and pad_end specify whether the truncation and padding happen at the beginning or the end of the sentence. The value is either :tail (the end) or :head (the beginning).

trunc_and_pad(maxn, pad, trunc_end = :tail, pad_end = :tail)

Create a function that truncates its input to length <= maxn and appends pad until all inputs have equal length.

see also: trunc_or_pad

Example

julia> TextEncodeBase.trunc_and_pad(1:5, 7, -1)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> TextEncodeBase.trunc_and_pad([1:5, 2:7], 10, -1)
2-element Vector{Vector{Int64}}:
 [1, 2, 3, 4, 5, -1]
 [2, 3, 4, 5, 6, 7]

julia> TextEncodeBase.trunc_and_pad([1:5, [2:7, [1:2]]], nothing, -1)
2-element Vector{Vector}:
 [1, 2, 3, 4, 5, -1]
 Vector[[2, 3, 4, 5, 6, 7], [[1, 2, -1, -1, -1, -1]]]
TextEncodeBase.trunc_or_pad — Function
trunc_or_pad(x, n, pad)

Truncate x to length n, or append pad to the end of x until its length equals n. x can be either a nested or a single array. If n is nothing, the largest length of the innermost arrays is used.

trunc_or_pad(x, n, pad, trunc_end = :tail, pad_end = :tail)

trunc_end and pad_end specify whether the truncation and padding happen at the beginning or the end of the sentence. The value is either :tail (the end) or :head (the beginning).

trunc_or_pad(n, pad, trunc_end = :tail, pad_end = :tail)

Create a function that returns a new array with the truncated or padded values of its input.

see also: trunc_and_pad

Example

julia> TextEncodeBase.trunc_or_pad(1:5, 7, -1)
7-element Vector{Int64}:
  1
  2
  3
  4
  5
 -1
 -1

julia> TextEncodeBase.trunc_or_pad([1:5, 2:7], 10, -1)
2-element Vector{Vector{Int64}}:
 [1, 2, 3, 4, 5, -1, -1, -1, -1, -1]
 [2, 3, 4, 5, 6, 7, -1, -1, -1, -1]

julia> TextEncodeBase.trunc_or_pad([1:5, [2:7, [1:2]]], nothing, -1)
2-element Vector{Vector}:
 [1, 2, 3, 4, 5, -1]
 Vector[[2, 3, 4, 5, 6, 7], [[1, 2, -1, -1, -1, -1]]]
TextEncodeBase.type_sequence_sample_type — Method
type_sequence_sample_type([T::Type,] t::Type)

Get the depth of the nested array type. If a natural number is returned, t is a nested array type. Return -1 if the depth cannot be determined from the type alone, and return -2 if t is not a nested array type. Specify T to check whether t is a nested array type with element type T. If T is not specified, every type that is not a subtype of AbstractArray counts as an element type.

see also: sequence_sample_type, peek_sequence_sample_type

Example

julia> type_sequence_sample_type(Vector{Vector{Integer}})
2

julia> type_sequence_sample_type(Number, Array{Vector{Union{Float64, Int}}})
2

julia> type_sequence_sample_type(Int, Array{Vector{Union{Float64, Int}}})
-2
TextEncodeBase.with_head_tail — Method
with_head_tail(x, head, tail)

Return [head; x; tail]. head or tail is ignored if it is nothing. x can be a nested array.

Example

julia> TextEncodeBase.with_head_tail(1:5, -1, -2)
7-element Vector{Int64}:
 -1
  1
  2
  3
  4
  5
 -2

julia> TextEncodeBase.with_head_tail([1:5, 2:3], -1, -2)
2-element Vector{Vector{Int64}}:
 [-1, 1, 2, 3, 4, 5, -2]
 [-1, 2, 3, -2]
TextEncodeBase.wrap — Function
wrap(t::AbstractTokenization, s::TokenStages, x)

Mark the tokenization stage of x, which is part of the splitting result of s. For example, if we are doing simple whitespace tokenization and are at the sentence stage, then x is just a single word of s, and we thus return Word(x) (or Token(x)). Skipped if x is already a TokenStages. (This method only applies to splittable stages.)

Overload this method to control the tokenization process.

TextEncodeBase.wrap — Method
wrap(t::AbstractTokenization, x::TokenStages)

A handler for unsplittable stages (token/word/...).

Overload this method for a custom transform.

TextEncodeBase.@stage — Macro
@stage StageName
@stage StageName{A<:SomeType, B}
@stage StageName AbstractStage
@stage StageName{A<:SomeType, B} <: AbstractStage

Define a TokenStages type with two fields (x and meta), its single-argument constructor, and methods for setmeta and setvalue.

Equivalent to:

struct StageName{A<:SomeType, B} <: AbstractStage
    x::A
    meta::B
end

StageName(x) = StageName(x, nothing)
TextEncodeBase.setmeta(x::StageName, meta) = StageName(x.x, meta)
TextEncodeBase.setvalue(x::StageName, y) = StageName(y, x.meta)