Design
Tokenizer
The overall tokenizer framework is built on top of Julia's multiple dispatch. The main idea of the design is to make hijacking the tokenization process easier. This is done by dispatching on all of `AbstractTokenizer`, `AbstractTokenization`, and `TokenStages`, so that even if the tokenization and the input are the same, we can still define a new tokenizer and change the behavior of some part of that tokenization.
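As a rough, self-contained sketch of the idea (using toy stand-ins for the framework's types, not the package's real methods), specializing a generic function on the (tokenizer, tokenization, stage) combination lets a new tokenizer take over just one part of the process:

```julia
# Toy illustration of the dispatch idea; the type names mirror the framework
# but everything here is defined locally, not taken from the package.
abstract type AbstractTokenizer end
abstract type AbstractTokenization end
abstract type TokenStages end

struct DefaultTokenizer <: AbstractTokenizer end
struct MyTokenizer      <: AbstractTokenizer end
struct SomeTokenization <: AbstractTokenization end
struct Sentence <: TokenStages
    x::String
end

# behavior shared by every tokenizer for this tokenization and stage
process(::AbstractTokenizer, ::AbstractTokenization, s::Sentence) = split(s.x)

# MyTokenizer hijacks only this combination: same tokenization, same input,
# but sentences are lowercased before splitting
process(::MyTokenizer, ::AbstractTokenization, s::Sentence) = split(lowercase(s.x))

process(DefaultTokenizer(), SomeTokenization(), Sentence("Hello World"))  # ["Hello", "World"]
process(MyTokenizer(),      SomeTokenization(), Sentence("Hello World"))  # ["hello", "world"]
```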
TokenStages
`TokenStages` is an abstract type used to specify the kind of input. For example, we have `Document <: TokenStages` and `Sentence <: TokenStages`, so the input is not just a `String`, whose content we cannot otherwise identify. Every string should be wrapped in a `TokenStages` type explicitly. With the stages in mind, we can view the tokenization process as recursively splitting the string and wrapping each substring as another stage until the result is a `Token` type.
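For example (a minimal sketch; `Document`, `Sentence`, and `Token` are the stage types named above, imported explicitly here since they may not be exported):

```julia
using TextEncodeBase
using TextEncodeBase: Document, Sentence

# the same text means different things depending on the stage wrapping it
doc  = Document("Peter picked a peck. So did Mary.")
sent = Sentence("Peter picked a peck.")

# a tokenizer dispatches on the stage: a Document is split into Sentences,
# Sentences into words, and so on, until everything is wrapped as a Token
# tkr(doc)   # assuming `tkr` is an instance of some AbstractTokenizer
```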
Splittability
Not every `TokenStages` can be split into substrings; for example, most tokenizers won't split a word into subwords. Therefore, we define the `Splittability` trait. The splittability is codetermined by the `AbstractTokenizer`, `AbstractTokenization`, and `TokenStages` types: it is either `Splittable` or `UnSplittable`. If the input is splittable, there should be a `splitting` method defined for that combination. On the other hand, if it is unsplittable, the tokenize function directly calls `wrap` to transform the input into the next stage. There is also another argument of type `ParentStages` (`Union{Nothing, TokenStages}`) that can be used to tell whether the tokenize function is being called recursively.
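A hedged sketch of defining these pieces for a custom tokenization: `Splittable`, `splitting`, and `wrap` are named in the text above, while the `splittability` query function, the two-argument signatures, the `Word` stage, and the `s.x` field access are assumptions here (per the text, the real methods also dispatch on the tokenizer and the parent stage):

```julia
using TextEncodeBase
using TextEncodeBase: AbstractTokenization, Sentence, Word, Splittable

# Hypothetical tokenization whose Sentence stage is splittable on whitespace.
# The signatures are a sketch: the real methods may also take the tokenizer
# and the parent stage (ParentStages = Union{Nothing, TokenStages}).
struct WhitespaceTokenization <: AbstractTokenization end

# declare that this (tokenization, stage) combination is splittable
TextEncodeBase.splittability(::WhitespaceTokenization, ::Sentence) = Splittable()

# how to split it (field name `x` for the wrapped string is assumed)
TextEncodeBase.splitting(::WhitespaceTokenization, s::Sentence) = split(s.x)

# and how to wrap each resulting substring as the next stage
TextEncodeBase.wrap(::WhitespaceTokenization, ::Sentence, x) = Word(x)
```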
Vocabulary
The `Vocab` type takes two arguments: the list of words and a special token for all unknown words. The default constructor of `Vocab` copies the list and removes all duplicate words. It also tries to find the unknown token in the word list. If the unknown token is NOT in the word list, it will NOT be added; instead, when you `lookup` an unknown word with that `Vocab` object, it will return 0 as the index for all unknown words. Therefore, make sure the unknown token is in the word list beforehand.
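For example (the index values below assume the default constructor keeps the deduplicated list in its original order):

```julia
using TextEncodeBase

# the unknown token is in the list, so it has its own index
vocab = Vocab(["a", "b", "c", "[UNK]"], "[UNK]")
lookup(vocab, "b")    # => 2
lookup(vocab, "xyz")  # => 4, the index of "[UNK]"

# the unknown token is NOT in the list: every unknown word maps to index 0
vocab0 = Vocab(["a", "b", "c"], "[UNK]")
lookup(vocab0, "xyz") # => 0
```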
Pipelines
A `Pipelines` is a chain of `Pipeline`s, and a `Pipeline` is a function that takes two arguments: the input and a `NamedTuple`. Each `Pipeline` is attached to one or more symbols. It applies a transform function to its arguments, and the result is marked with those symbols to produce another `NamedTuple`, which is then merged into the input `NamedTuple`. So a `Pipelines` is a sequence of transform functions, where each transform result is marked with the given names.
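For example (a minimal sketch; it assumes `Pipeline` and `Pipelines` are available from the package, that pipelines compose with `|>`, and that the resulting `Pipelines` can be applied to the raw input starting from an empty `NamedTuple`):

```julia
using TextEncodeBase  # assuming Pipeline / Pipelines are available from here

# each Pipeline receives (input, named_tuple); its result is stored under the
# attached symbol and merged into the running NamedTuple
pipes = Pipeline{:lower}((x, nt) -> lowercase(x)) |>
        Pipeline{:tokens}((x, nt) -> split(nt.lower))

pipes("Hello World Again")
# results in a NamedTuple like
# (lower = "hello world again", tokens = ["hello", "world", "again"])
```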