BytePairEncoding.jl
Pure Julia implementation of the Byte Pair Encoding (BPE) method.
BytePairEncoding.BPELearner
BytePairEncoding.bbpe2tiktoken
BytePairEncoding.cl100k_base_regex
BytePairEncoding.count_words
BytePairEncoding.gpt2_codemap
BytePairEncoding.gpt2_regex
BytePairEncoding.load_gpt2
BytePairEncoding.load_tiktoken
BytePairEncoding.load_tiktoken_encoder
BytePairEncoding.o200k_base_regex
BytePairEncoding.tiktoken2bbpe
BytePairEncoding.BPELearner — Type
BPELearner(tokenization::AbstractTokenization; min_freq = 10, endsym = "</w>", sepsym = nothing)
Construct a learner with a tokenization which has BPETokenization and NoBPE inside.
(bper::BPELearner)(word_counts, n_merge)
Calling the learner on a word_counts dictionary (created by count_words) generates a new tokenization where NoBPE is replaced with the learned BPE.
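A minimal sketch of the full learning workflow. It assumes a placeholder corpus file "corpus.txt" and assumes BPETokenization can be constructed directly around NoBPE as the docstring describes; the exact constructor chain may differ from the package's current API.
using BytePairEncoding
using BytePairEncoding: BPELearner, BPETokenization, NoBPE, count_words

# Tokenization with BPETokenization and NoBPE inside, per the requirement above
# (assumed constructor form).
tokenization = BPETokenization(NoBPE())
# Keep only words occurring at least 5 times; mark word ends with "</w>".
bper = BPELearner(tokenization; min_freq = 5, endsym = "</w>")
# "corpus.txt" is a hypothetical path; each line of the file is treated as a document.
word_counts = count_words(bper, ["corpus.txt"])
# Perform 1000 merges; returns a tokenization where NoBPE is replaced by the learned BPE.
new_tokenization = bper(word_counts, 1000)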
BytePairEncoding.bbpe2tiktoken — Function
bbpe2tiktoken(tkr)
Convert a gpt2-like byte-level tokenizer (with bpe::BPE) to a tiktoken tokenizer (with bpe::TikToken). If there is a CodeNormalizer in the tokenizer, it will be removed accordingly.
See also: tiktoken2bbpe
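A hedged usage sketch, assuming the conversion is applied to the tokenizer returned by the documented load_gpt2:
using BytePairEncoding

gpt2_tkr = BytePairEncoding.load_gpt2()             # byte-level tokenizer with bpe::BPE
tik_tkr = BytePairEncoding.bbpe2tiktoken(gpt2_tkr)  # same tokenizer rebuilt around bpe::TikToken
tik_tkr("hello world")                              # should tokenize the same text identically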
BytePairEncoding.cl100k_base_regex — Method
The regex pattern used by OpenAI cl100k_base.
BytePairEncoding.count_words — Method
count_words(bper::BPELearner, files::AbstractVector)
Given a list of files (where each line of a file is treated as a (multi-sentence) document), tokenize those files and count the occurrence of each word token.
BytePairEncoding.gpt2_codemap — Method
The codemap used by OpenAI gpt2.
BytePairEncoding.gpt2_regex — Method
The regex pattern used by OpenAI gpt2.
BytePairEncoding.load_gpt2 — Method
load_gpt2()
Load the gpt2 tokenizer.
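For example, the returned tokenizer can be called directly on a string to get its subword tokens, with the same calling convention shown for load_tiktoken below:
using BytePairEncoding

tkr = BytePairEncoding.load_gpt2()
tokens = tkr("hello world")   # a Vector{String} of subword tokens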
BytePairEncoding.load_tiktoken — Method
load_tiktoken(name)
Load a tiktoken tokenizer. name can be "o200k_base", "cl100k_base", "p50k_base", "r50k_base", or "gpt2".
julia> tkr = BytePairEncoding.load_tiktoken("cl100k_base")
BPETokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns))
julia> tkr("hello world aaaaaaaaaaaa")
5-element Vector{String}:
"hello"
" world"
" a"
"aaaaaaaa"
"aaa"
BytePairEncoding.load_tiktoken_encoder — Method
load_tiktoken_encoder(name)
Load the tiktoken encoder (tokenizer + predefined vocabulary).
The encoded values are off by 1 compared to the Python/Rust tiktoken.
julia> enc = BytePairEncoding.load_tiktoken_encoder("cl100k_base")
┌ Warning: The maximum encoded value (`length(BPEEncoder.vocab)`) is larger than the number of possible tokens
│ because there are some "gaps" in the vocabulary. Be carefull if used to initialize embedding table.
└ @ BytePairEncoding
BPEEncoder(BPETokenizer(MatchTokenization(BPETokenization(Cl100kBaseTokenization, bpe = TikTokenBPE(100256 merges)), 5 patterns)), Vocab(size = 100277))
julia> enc.encode("hello world aaaaaaaaaaaa") # === enc(...)
5-element Vector{Int64}:
15340
1918
265
70541
33747
julia> enc.decode(enc("hello world aaaaaaaaaaaa"))
"hello world aaaaaaaaaaaa"
BytePairEncoding.o200k_base_regex — Method
The regex pattern used by OpenAI o200k_base.
BytePairEncoding.tiktoken2bbpe — Function
tiktoken2bbpe(tkr, codemap::Union{CodeMap, Nothing} = nothing)
Convert a tiktoken tokenizer (with bpe::TikToken) to a gpt2-like byte-level tokenizer (with bpe::BPE). If codemap is provided, the corresponding CodeNormalizer is added to the tokenizer.
See also: bbpe2tiktoken
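A hedged sketch of the reverse conversion, assuming gpt2_codemap() returns the CodeMap matching the gpt2 byte-level encoding:
using BytePairEncoding

tik_tkr = BytePairEncoding.load_tiktoken("gpt2")        # tokenizer with bpe::TikToken
# Re-attach the gpt2 codemap so the result carries a CodeNormalizer and bpe::BPE.
byte_tkr = BytePairEncoding.tiktoken2bbpe(tik_tkr, BytePairEncoding.gpt2_codemap())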