BytePairEncoding.jl
Pure Julia implementation of the Byte Pair Encoding (BPE) method.
BytePairEncoding.BPELearner
BytePairEncoding.bbpe2tiktoken
BytePairEncoding.cl100k_base_regex
BytePairEncoding.count_words
BytePairEncoding.gpt2_codemap
BytePairEncoding.gpt2_regex
BytePairEncoding.load_gpt2
BytePairEncoding.load_tiktoken
BytePairEncoding.load_tiktoken_encoder
BytePairEncoding.o200k_base_regex
BytePairEncoding.tiktoken2bbpe
BytePairEncoding.BPELearner — Type
BPELearner(tokenization::AbstractTokenization; min_freq = 10, endsym = "</w>", sepsym = nothing)
Construct a learner with a tokenization which has BPETokenization and NoBPE inside.
(bper::BPELearner)(word_counts, n_merge)
Calling the learner on a word_counts dictionary (created by count_words) generates a new tokenization where NoBPE is replaced with the learned BPE.
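A minimal sketch of the full learning workflow. It assumes a placeholder corpus file "corpus.txt" and assumes BPETokenization can be constructed directly around NoBPE as the docstring describes; the exact constructor chain may differ from the package's current API.
using BytePairEncoding
using BytePairEncoding: BPELearner, BPETokenization, NoBPE, count_words

# Tokenization with BPETokenization and NoBPE inside, per the requirement above
# (assumed constructor form).
tokenization = BPETokenization(NoBPE())
# Keep only words occurring at least 5 times; mark word ends with "</w>".
bper = BPELearner(tokenization; min_freq = 5, endsym = "</w>")
# "corpus.txt" is a hypothetical path; each line of the file is treated as a document.
word_counts = count_words(bper, ["corpus.txt"])
# Perform 1000 merges; returns a tokenization where NoBPE is replaced by the learned BPE.
new_tokenization = bper(word_counts, 1000)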
BytePairEncoding.bbpe2tiktoken — Function
bbpe2tiktoken(tkr)
Convert a gpt2-like byte-level tokenizer (with bpe::BPE) to a tiktoken tokenizer (with bpe::TikToken). If there is a CodeNormalizer in the tokenizer, it will be removed accordingly.
See also: tiktoken2bbpe
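A hedged usage sketch, assuming the conversion is applied to the tokenizer returned by the documented load_gpt2:
using BytePairEncoding

gpt2_tkr = BytePairEncoding.load_gpt2()             # byte-level tokenizer with bpe::BPE
tik_tkr = BytePairEncoding.bbpe2tiktoken(gpt2_tkr)  # same tokenizer rebuilt around bpe::TikToken
tik_tkr("hello world")                              # should tokenize the same text identically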
BytePairEncoding.cl100k_base_regex — Method
The regex pattern used by OpenAI cl100k_base.
BytePairEncoding.count_words — Method
count_words(bper::BPELearner, files::AbstractVector)
Given a list of files (where each line of a file is treated as a (multi-sentence) document), tokenize those files and count the occurrence of each word token.
BytePairEncoding.gpt2_codemap — Method
The codemap used by OpenAI gpt2.
BytePairEncoding.gpt2_regex — Method
The regex pattern used by OpenAI gpt2.
BytePairEncoding.load_gpt2 — Method
load_gpt2()
Load the gpt2 tokenizer.
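For example, the returned tokenizer can be called directly on a string to get its subword tokens, with the same calling convention shown for load_tiktoken below:
using BytePairEncoding

tkr = BytePairEncoding.load_gpt2()
tokens = tkr("hello world")   # a Vector{String} of subword tokens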
BytePairEncoding.load_tiktoken — Method
load_tiktoken(name)
Load a tiktoken tokenizer. name can be "o200k_base", "cl100k_base", "p50k_base", "r50k_base", or "gpt2".
julia> tkr = BytePairEncoding.load_tiktoken("cl100k_base")
BPETokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns))
julia> tkr("hello world aaaaaaaaaaaa")
5-element Vector{String}:
"hello"
" world"
" a"
"aaaaaaaa"
"aaa"
BytePairEncoding.load_tiktoken_encoder — Method
load_tiktoken_encoder(name)
Load the tiktoken encoder (tokenizer + predefined vocabulary).
The encoded values are off by 1 compared to the Python/Rust tiktoken.
julia> enc = BytePairEncoding.load_tiktoken_encoder("cl100k_base")
┌ Warning: The maximum encoded value (`length(BPEEncoder.vocab)`) is larger than the number of possible tokens
│ because there are some "gaps" in the vocabulary. Be carefull if used to initialize embedding table.
└ @ BytePairEncoding
BPEEncoder(BPETokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns)), Vocab(size = 100277))
julia> enc.encode("hello world aaaaaaaaaaaa") # === enc(...)
5-element Vector{Int64}:
15340
1918
265
70541
33747
julia> enc.decode(enc("hello world aaaaaaaaaaaa"))
"hello world aaaaaaaaaaaa"
BytePairEncoding.o200k_base_regex — Method
The regex pattern used by OpenAI o200k_base.
BytePairEncoding.tiktoken2bbpe — Function
tiktoken2bbpe(tkr, codemap::Union{CodeMap, Nothing} = nothing)
Convert a tiktoken tokenizer (with bpe::TikToken) to a gpt2-like byte-level tokenizer (with bpe::BPE). If codemap is provided, the corresponding CodeNormalizer is added to the tokenizer.
See also: bbpe2tiktoken
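A hedged sketch of the reverse conversion, assuming gpt2_codemap() returns the CodeMap matching the gpt2 byte-level encoding:
using BytePairEncoding

tik_tkr = BytePairEncoding.load_tiktoken("gpt2")        # tokenizer with bpe::TikToken
# Re-attach the gpt2 codemap so the result carries a CodeNormalizer and bpe::BPE.
byte_tkr = BytePairEncoding.tiktoken2bbpe(tik_tkr, BytePairEncoding.gpt2_codemap())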