API reference

TextEncodeBase.AbstractTokenization — Type

Abstract type for tokenization.

The tokenization procedure is separated into multiple TokenStages and recursive calls of splitting, wrap, and tokenize. splitting breaks a string into substrings, wrap marks the substrings with new TokenStages, and tokenize is responsible for the tokenization.

TextEncodeBase.AbstractTokenizer — Type

Abstract type for tokenizers.

Each tokenizer is linked with a tokenization (by defining tokenization(::Tokenizer) = Tokenization()). The overall framework dispatches on both the tokenizer and the tokenization, but most of the time we only add methods for the tokenization. This allows further composability and lets a given tokenizer intervene in the tokenization process.

TextEncodeBase.RepeatedTerm — Type
RepeatedTerm(terms::TemplateTerm...; dynamic_type_id = false)

A special term indicating that the enclosed sequence of terms can appear zero or more times. Cannot be nested. If dynamic_type_id is set, each repetition adds an offset to the type id of the repeated terms. The offset value is the repetition number, starting from 0, times dynamic_type_id.
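For illustration, a sketch combining RepeatedTerm with SequenceTemplate (see also the SequenceTemplate examples in this reference); the type-id values in the comments are derived by applying the offset rule above literally:

```julia
using TextEncodeBase

# The repeated segment carries type id 2; with dynamic_type_id = 1, the
# second repetition gets an extra offset of 1 * dynamic_type_id.
template = SequenceTemplate(
    ConstTerm("[CLS]", 1), InputTerm{String}(1), ConstTerm("[SEP]", 1),
    RepeatedTerm(InputTerm{String}(2), ConstTerm("[SEP]", 2); dynamic_type_id = 1)
)

tokens, type_ids = template(["a"], ["b"], ["c"])
# tokens should be ["[CLS]", "a", "[SEP]", "b", "[SEP]", "c", "[SEP]"];
# per the offset rule, type_ids should be [1, 1, 1, 2, 2, 3, 3].
```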

TextEncodeBase.SequenceTemplate — Type
SequenceTemplate(terms::TemplateTerm)(sequences...)

Construct a function from multiple TemplateTerms that indicates how to combine the input sequences. The function returns a tuple of the resulting sequence and a type id sequence (a special number associated with each template term).

Example

julia> SequenceTemplate(ConstTerm(-1), InputTerm{Int}(), ConstTerm(-2))(1:5)[1] == TextEncodeBase.with_head_tail(1:5, -1, -2)
true

julia> SequenceTemplate(ConstTerm(-1), InputTerm{Int}(), ConstTerm(-2))(1:5)
([-1, 1, 2, 3, 4, 5, -2], [1, 1, 1, 1, 1, 1, 1])

julia> bert_template = SequenceTemplate(
           ConstTerm("[CLS]", 1), InputTerm{String}(1), ConstTerm("[SEP]", 1),
           RepeatedTerm(InputTerm{String}(2), ConstTerm("[SEP]", 2))
       )
SequenceTemplate{String}([CLS]:<type=1> Input:<type=1> [SEP]:<type=1> (Input:<type=2> [SEP]:<type=2>)...)

julia> bert_template(["hello", "world"])
(["[CLS]", "hello", "world", "[SEP]"], [1, 1, 1, 1])

julia> bert_template(["hello", "world"], ["today", "is", "a", "good", "day"])
(["[CLS]", "hello", "world", "[SEP]", "today", "is", "a", "good", "day", "[SEP]"], [1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
TextEncodeBase.Splittability — Type

splittability trait

The splittability trait decides whether a given combination (tokenizer x tokenization x stage) is splittable or not (Splittable or UnSplittable). For example, DefaultTokenization and SentenceStage are splittable (i.e. splittability(::DefaultTokenization, ::SentenceStage) = Splittable()). The splittability changes the behavior of tokenize: if the combination is splittable, tokenize will call splitting on the input, wrap each splitting result, and recurse. Otherwise, it will directly call wrap and then recurse into tokenize.

TextEncodeBase.TextEncoder — Method
TextEncoder(builder, e::TextEncoder)

Given an encoder, return a new encoder that has the same tokenizer and vocabulary. builder is a function that takes an encoder and returns a new processing function.

TextEncodeBase.TokenStages — Type

Abstract type for types that wrap the input into a specific stage to control tokenization.

There are six builtin stages in TextEncodeBase (all abstract XStage <: TokenStages):

1. Document <: DocumentStage: the input string is a full document,
 and thus needs to be split into multiple sentences.
2. Sentence <: SentenceStage: the input string is a full sentence,
 and thus needs to be split into multiple parts (SubSentence/Word/Token).
3. SubSentence <: SubSentenceStage: special wrapper for the case where the tokenizer
 does not directly break the sentence into words/tokens and these pieces contain
 multiple words/tokens, but you need the information that they are not full sentences.
4. Word <: WordStage: the input string is a single word.
5. SubWord <: SubWordStage: similar to SubSentence, but for words.
6. Token <: TokenStage: the final piece of the tokenization process.
 Generally, it's used to mark the end of this piece, which should
 never be split further.

Each wrapper has two fields: x for the input and meta for extra information (nothing if not provided).
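For example, wrapping a raw string for the sentence stage (qualified names are used here, since the stage wrappers may not all be exported):

```julia
using TextEncodeBase

# Wrap a raw string as a sentence-stage input; meta defaults to nothing.
s = TextEncodeBase.Sentence("Peter Piper picked a peck of pickled peppers")
s.x     # the wrapped input string
s.meta  # nothing, since no meta information was provided
```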

TextEncodeBase.Vocab — Type
Vocab(data::Vector{<:AbstractString}, unk::AbstractString="[UNK]")

Constructor for Vocab. data is the list of vocabulary words and can be non-unique. The actual list will be the unique version of data (i.e. vocab.list = unique(data)). unk is the indicator word for all unknown words. unk can be either in or not in data, depending on the use case.
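For example (consistent with the lookup examples later in this reference):

```julia
using TextEncodeBase

# Duplicates are dropped, so vocab.list is ["a", "b", "c"].
vocab = Vocab(["a", "b", "c", "a", "b", "c"])

# Here the unk indicator "xxx" is itself a vocabulary word,
# so unknown strings map to its index (3) instead of 0.
vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
```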

TextEncodeBase.Vocab — Method
Vocab{T}(data::AbstractVector, unk) where T

Construct a Vocab with element type T. unk must be specified.
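A minimal sketch of the typed constructor:

```julia
using TextEncodeBase

# Element type given explicitly; unk is mandatory for this constructor.
v = Vocab{String}(["a", "b"], "[UNK]")
```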

TextEncodeBase.batch2nested — Method
batch2nested(x)

Convert a single array into a nested array.

See also: nested2batch

Example

julia> x = ["a" "d"; "b" "e"; "c" "f";;; "x" "u"; "y" "v"; "z" "w"; ]
3×2×2 Array{String, 3}:
[:, :, 1] =
 "a"  "d"
 "b"  "e"
 "c"  "f"

[:, :, 2] =
 "x"  "u"
 "y"  "v"
 "z"  "w"

julia> TextEncodeBase.batch2nested(x)
2-element Vector{Vector{Vector{String}}}:
 [["a", "b", "c"], ["d", "e", "f"]]
 [["x", "y", "z"], ["u", "v", "w"]]
TextEncodeBase.join_text — Function
join_text(x::AbstractArray [, delim [, last]])

Join the innermost arrays and preserve the outer array structure. If the innermost array is multi-dimensional, join the text along the first dimension.

Example

julia> TextEncodeBase.join_text([["a", "b", "c"], ['x', 'y', 'z']])
2-element Vector{String}:
 "abc"
 "xyz"

julia> TextEncodeBase.join_text([["a", "b", "c"], ['x', 'y', 'z']], " + ")
2-element Vector{String}:
 "a + b + c"
 "x + y + z"

julia> TextEncodeBase.join_text([[["a", "b", "c"], ['x', 'y', 'z']]], " + ", " = ")
1-element Vector{Vector{String}}:
 ["a + b = c", "x + y = z"]

julia> TextEncodeBase.join_text(["a" "d"; "b" "e"; "c" "f";;; "x" "u"; "y" "v"; "z" "w"; ], " + ", " = ")
2×2 Matrix{String}:
 "a + b = c"  "x + y = z"
 "d + e = f"  "u + v = w"
TextEncodeBase.lookup — Function
lookup(v::Vocab, x)

Look up x in v. The lookup direction depends on the type of x. If x is an integer, return the x-th word in the vocabulary list (i.e. v.list[x]), or the unknown word (v.unk) if x is out of bounds. If x is a string, return the index of x in the vocabulary list (i.e. findfirst(==(x), v.list)), or the unknown index if x is not found in the list. If the unknown word v.unk is in the list, the unknown index is its index; otherwise it is 0.

This function is bidirectional except for Vocab{<:Integer}. For an integer vocabulary, this function only gets the x-th word (v.list[x]). Use lookup(Int, v, x) for explicit index lookup.

Example

julia> vocab = Vocab(["a", "b", "c", "a", "b", "c"])
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = [UNK], unki = 0)

julia> vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = xxx, unki = 3)

julia> lookup(vocab, "b")
2

julia> lookup(vocab, "d")
0

julia> lookup(vocab_unk, "d")
3

julia> lookup(vocab, 1)
"a"

julia> lookup(vocab, 10000)
"[UNK]"

julia> lookup(vocab_unk, 10000)
"xxx"
TextEncodeBase.lookup — Method
lookup(Int, v::Vocab, x)

The explicit version of lookup(v, x). Look up the index of x in the vocabulary list. x should have the same type as the Vocab's element type.

Example

julia> vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = xxx, unki = 3)

julia> lookup(Int, vocab_unk, "b")
2
TextEncodeBase.lookup — Method
lookup(OneHot, v::Vocab, i)

Look up i and convert it into a one-hot representation.

Example

julia> lookup(OneHot, vocab, "a")
3-element OneHot{3}:
 1
 0
 0

julia> lookup(OneHot, vocab, ["a" "b"; "c" "d"])
3x2x2 OneHotArray{3, 3, Matrix{OneHot{0x00000003}}}:
[:, :, 1] =
 1  0
 0  0
 0  1

[:, :, 2] =
 0  0
 1  0
 0  0

julia> lookup(OneHot, vocab, 3)
ERROR: DomainError with c:
cannot convert `lookup(::Vocab, 3)` = "c" into one-hot representation.
Stacktrace:
[...]
TextEncodeBase.lookup — Method
lookup(v::Vocab, is::AbstractArray)

Recursively look up values from is.

Example

julia> lookup(vocab, ["b", "c", "a", "A", "[UNK]"])
5-element Vector{Int64}:
 2
 3
 1
 0
 0

julia> lookup(vocab, [1, "a", 0, "A", "[UNK]"])
5-element Vector{Any}:
  "a"
 1
  "[UNK]"
 0
 0
TextEncodeBase.lookup — Method
lookup(v::Vocab, i::OneHotArray)

Convert the one-hot representation back into words.

Example

julia> lookup(OneHot, vocab, ["a" "b"; "c" "d"])
3x2x2 OneHotArray{3, 3, Matrix{OneHot{0x00000003}}}:
[:, :, 1] =
 1  0
 0  0
 0  1

[:, :, 2] =
 0  0
 1  0
 0  0

julia> lookup(vocab, ans)
2×2 Matrix{String}:
 "a"  "b"
 "c"  "[UNK]"
TextEncodeBase.lookup — Method
lookup(::Type{T}, v::Vocab{T}, i::Integer) where T

The explicit version of lookup(v, i). Look up the word at index i in the vocabulary list. T should be the same type as the Vocab's element type. This method won't work on an integer vocabulary; use lookup(v, i) directly.

Example

julia> vocab_unk = Vocab(["a", "b", "xxx"], "xxx")
Vocab{String, StaticArrays.SizedVector{3, String, Vector{String}}}(size = 3, unk = xxx, unki = 3)

julia> lookup(String, vocab_unk, 1)
"a"
TextEncodeBase.matchsplits — Method
matchsplits(pattern::AbstractPattern, str::String)

Split str with the regular expression pattern. Return a lazy iterator whose elements are Tuple{Bool, SubString}. The Bool indicates whether the SubString is a match of pattern.

Example

julia> matchsplits(r"a|c", "abc"^3)
MatchSplitIterator(r"a|c", "abcabcabc")

julia> collect(matchsplits(r"a|c", "abc"^3))
9-element Vector{Tuple{Bool, SubString{String}}}:
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")
TextEncodeBase.matchsplits — Method
matchsplits(patterns::Vector{<:AbstractPattern}, str::String)

Split str with the list of regular expression patterns. Return a lazy iterator whose elements are Tuple{Bool, SubString}. The Bool indicates whether the SubString is a match of one of the patterns. The match order is specified by the list order.

Example

julia> matchsplits([r"a", r"c"], "abc"^3)
MatchSplits(Regex[r"a", r"c"], "abcabcabc")

julia> collect(matchsplits([r"a", r"c"], "abc"^3))
9-element Vector{Tuple{Bool, SubString{String}}}:
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")
 (1, "a")
 (0, "b")
 (1, "c")

julia> collect(matchsplits([r"ab", r"bc"], "abc"^3))
6-element Vector{Tuple{Bool, SubString{String}}}:
 (1, "ab")
 (0, "c")
 (1, "ab")
 (0, "c")
 (1, "ab")
 (0, "c")
TextEncodeBase.nested2batch — Method
nested2batch(x)

Convert a nested array into a single array.

See also: batch2nested

Example

julia> TextEncodeBase.nested2batch([[[1 2],[3 4]]])
1×2×2×1 Array{Int64, 4}:
[:, :, 1, 1] =
 1  2

[:, :, 2, 1] =
 3  4
TextEncodeBase.peek_sequence_sample_type — Method
peek_sequence_sample_type([T::Type,] x)

Non-recursive version of sequence_sample_type. Return -1 if x is an array of arrays whose elements have unknown types, in which case it is possible that sequence_sample_type(x[i]) == -2. Specify T to check whether x is a nested array with element type T. If T is not specified, every type that is not a subtype of AbstractArray counts as an element type.

see also: type_sequence_sample_type, sequence_sample_type

Example

julia> TextEncodeBase.peek_sequence_sample_type([1,2,3])
1

julia> peek_sequence_sample_type(Int, Any[[[1,2,3]]]), sequence_sample_type(Int, Any[[[1,2,3]]])
(-1, 3)

julia> peek_sequence_sample_type(Int, [[[1,2,3], "abc"]]), sequence_sample_type(Int, [[[1,2,3], "abc"]])
(-1, -2)
TextEncodeBase.sequence_sample_type — Method
sequence_sample_type([T::Type,] x)

Get the depth of the nested array. If a natural number is returned, x is a nested array where every element has the same depth. Return -2 if x is not a nested array or if the depths of its elements differ. The depth of an empty array is computed from its type, so sequence_sample_type(Any[]) is 1. Specify T to check whether x is a nested array with element type T. If T is not specified, every type that is not a subtype of AbstractArray counts as an element type.

see also: type_sequence_sample_type, peek_sequence_sample_type

Example

julia> sequence_sample_type([[1,2,3]])
2

julia> sequence_sample_type([[[2,3], [1]], Vector{Int}[]])
3

julia> sequence_sample_type([[[2,3], [1]], Any[]])
-2

julia> sequence_sample_type(Int, [[1,2], 3])
-2

julia> sequence_sample_type(Int, Any[[1,2], Int[]])
2
TextEncodeBase.splittability — Function
splittability(args...)

Return the splittability (Splittable/UnSplittable) of the given argument combination. Overload this function to make a TokenStages splittable.

TextEncodeBase.splitting — Function
splitting(t::AbstractTokenization, x::TokenStages)

Split x given its tokenization stage. For example, the default behavior at the document stage is splitting the document into sentences (with WordTokenizers.split_sentences).

Overload this method for custom tokenization.

TextEncodeBase.trunc_and_pad — Function
trunc_and_pad(x, maxn, pad)

Truncate x if its length exceeds maxn, and append pad to the end of x until all lengths are the same. x can be either a nested or a single array. If maxn is nothing, the largest length of the innermost arrays is used, and the behavior is then equal to trunc_or_pad with nothing.

trunc_and_pad(x, maxn, pad, trunc_end = :tail, pad_end = :tail)

trunc_end and pad_end specify whether the truncation and padding happen at the beginning or the end of the sentence. The value is either :tail (the end) or :head (the beginning).

trunc_and_pad(maxn, pad, trunc_end = :tail, pad_end = :tail)

Create a function that truncates its input to length <= maxn and appends pad until all inputs have equal length.

see also: trunc_or_pad

Example

julia> TextEncodeBase.trunc_and_pad(1:5, 7, -1)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> TextEncodeBase.trunc_and_pad([1:5, 2:7], 10, -1)
2-element Vector{Vector{Int64}}:
 [1, 2, 3, 4, 5, -1]
 [2, 3, 4, 5, 6, 7]

julia> TextEncodeBase.trunc_and_pad([1:5, [2:7, [1:2]]], nothing, -1)
2-element Vector{Vector}:
 [1, 2, 3, 4, 5, -1]
 Vector[[2, 3, 4, 5, 6, 7], [[1, 2, -1, -1, -1, -1]]]
TextEncodeBase.trunc_or_pad — Function
trunc_or_pad(x, n, pad)

Truncate x to length n, or append pad to the end of x until its length equals n. x can be either a nested or a single array. If n is nothing, the largest length of the innermost arrays is used.

trunc_or_pad(x, n, pad, trunc_end = :tail, pad_end = :tail)

trunc_end and pad_end specify whether the truncation and padding happen at the beginning or the end of the sentence. The value is either :tail (the end) or :head (the beginning).

trunc_or_pad(n, pad, trunc_end = :tail, pad_end = :tail)

Create a function that returns a new array with the truncated or padded values of its input.

see also: trunc_and_pad

Example

julia> TextEncodeBase.trunc_or_pad(1:5, 7, -1)
7-element Vector{Int64}:
  1
  2
  3
  4
  5
 -1
 -1

julia> TextEncodeBase.trunc_or_pad([1:5, 2:7], 10, -1)
2-element Vector{Vector{Int64}}:
 [1, 2, 3, 4, 5, -1, -1, -1, -1, -1]
 [2, 3, 4, 5, 6, 7, -1, -1, -1, -1]

julia> TextEncodeBase.trunc_or_pad([1:5, [2:7, [1:2]]], nothing, -1)
2-element Vector{Vector}:
 [1, 2, 3, 4, 5, -1]
 Vector[[2, 3, 4, 5, 6, 7], [[1, 2, -1, -1, -1, -1]]]
TextEncodeBase.type_sequence_sample_type — Method
type_sequence_sample_type([T::Type,] t::Type)

Get the depth of the nested array type. If a natural number is returned, t is a nested array type. Return -1 if the depth cannot be determined from the type alone, and return -2 if t is not a nested array type. Specify T to check whether t is a nested array type with element type T. If T is not specified, every type that is not a subtype of AbstractArray counts as an element type.

see also: sequence_sample_type, peek_sequence_sample_type

Example

julia> type_sequence_sample_type(Vector{Vector{Integer}})
2

julia> type_sequence_sample_type(Number, Array{Vector{Union{Float64, Int}}})
2

julia> type_sequence_sample_type(Int, Array{Vector{Union{Float64, Int}}})
-2
TextEncodeBase.with_head_tail — Method
with_head_tail(x, head, tail)

Return [head; x; tail]. head or tail is ignored if it is nothing. x can be a nested array.

Example

julia> TextEncodeBase.with_head_tail(1:5, -1, -2)
7-element Vector{Int64}:
 -1
  1
  2
  3
  4
  5
 -2

julia> TextEncodeBase.with_head_tail([1:5, 2:3], -1, -2)
2-element Vector{Vector{Int64}}:
 [-1, 1, 2, 3, 4, 5, -2]
 [-1, 2, 3, -2]
TextEncodeBase.wrap — Function
wrap(t::AbstractTokenization, s::TokenStages, x)

Mark the tokenization stage of x, which is part of the splitting result of s. For example, if we are doing simple whitespace tokenization and are at the sentence stage, then x is just a single word of s, and we thus return Word(x) (or Token(x)). Skipped if x is already a TokenStages. (This method only applies to splittable stages.)

Overload this method to control the tokenization process.

TextEncodeBase.wrap — Method
wrap(t::AbstractTokenization, x::TokenStages)

A handler for unsplittable stages (token/word/...).

Overload this method for a custom transform.

TextEncodeBase.@stage — Macro
@stage StageName
@stage StageName{A<:SomeType, B}
@stage StageName AbstractStage
@stage StageName{A<:SomeType, B} <: AbstractStage

Define a TokenStages type with two fields (x and meta), its single-argument constructor, and methods for setmeta and setvalue.

Equivalent to:

struct StageName{A<:SomeType, B} <: AbstractStage
    x::A
    meta::B
end

StageName(x) = StageName(x, nothing)
TextEncodeBase.setmeta(x::StageName, meta) = StageName(x.x, meta)
TextEncodeBase.setvalue(x::StageName, y) = StageName(y, x.meta)