Transformers.Pretrain

Functions for downloading and loading pretrained models.

Using Pretrains

For GPT and BERT, we provide a simple API to fetch the officially released pretrained weights and load them into our Julia Transformer implementation.

using Transformers
using Transformers.Pretrain
using Transformers.GenerativePreTrain
using Transformers.BidirectionalEncoder

# accept the DataDeps download prompt automatically (skip the interactive confirmation)
ENV["DATADEPS_ALWAYS_ACCEPT"] = true

# load everything in the pretrained model
bert_model, wordpiece, tokenizer = pretrain"Bert-uncased_L-12_H-768_A-12"

# load the model weights only
gpt_model = pretrain"gpt-OpenAIftlm:gpt_model"

# show the loaded models
show(bert_model)
show(gpt_model)
[ Info: loading pretrain bert model: uncased_L-12_H-768_A-12.tfbson
[ Info: loading pretrain gpt model: OpenAIftlm.npbson gpt_model
TransformerModel{Bert{Stack{NTuple{12, Transformer{MultiheadAttention{Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dropout{Float64, Colon}}, Flux.LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Transformers.Basic.PwFFN{Flux.Dense{typeof(NNlib.gelu), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, Flux.LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Flux.Dropout{Float64, Colon}}}, Symbol("((x, m) => x':(x, m)) => 12")}, Flux.Dropout{Float64, Colon}}}(
  embed = CompositeEmbedding(tok = Embed(768), segment = Embed(768), pe = PositionEmbedding(768, max_len=512), postprocessor = Positionwise(LayerNorm((768,)), Dropout(0.1))),
  transformers = Bert(layers=12, head=12, head_size=64, pwffn_size=3072, size=768),
  classifier =
    (
      pooler => Dense(768, 768, tanh)
      masklm => (
        transform => Chain(Dense(768, 768, gelu), LayerNorm((768,)))
        output_bias => Vector{Float32}
      )
      nextsentence => Chain(Dense(768, 2), logsoftmax)
    )
)
TransformerModel{Gpt{Stack{NTuple{12, Transformer{MultiheadAttention{Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Flux.Dropout{Float64, Colon}}, Flux.LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Transformers.Basic.PwFFN{Flux.Dense{typeof(NNlib.gelu), Matrix{Float32}, Vector{Float32}}, Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, Flux.LayerNorm{typeof(identity), Flux.Diagonal{Vector{Float32}}, Float32, 1}, Flux.Dropout{Float64, Colon}}}, Symbol("x':x => 12")}, Flux.Dropout{Float64, Colon}}}(
  embed = CompositeEmbedding(tok = Embed(768), pe = PositionEmbedding(768, max_len=512)),
  transformers = Gpt(layers=12, head=12, head_size=64, pwffn_size=3072, size=768)
)
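The loaded objects are meant to be used together to preprocess text and run the model. The following is a minimal sketch, assuming the Transformers.Basic API (Vocabulary, the tok/segment embedding keys, and the embed/transformers fields shown above); exact calls may differ between versions:

using Transformers.Basic # for Vocabulary

# build the vocabulary from the loaded wordpiece
vocab = Vocabulary(wordpiece)

# tokenize two sentences and split them into wordpieces
text1 = "Peter Piper picked a peck of pickled peppers" |> tokenizer |> wordpiece
text2 = "Fuzzy Wuzzy was a bear" |> tokenizer |> wordpiece

# concatenate with the special tokens BERT expects
text = ["[CLS]"; text1; "[SEP]"; text2; "[SEP]"]
token_indices = vocab(text)
segment_indices = [fill(1, length(text1) + 2); fill(2, length(text2) + 1)]

sample = (tok = token_indices, segment = segment_indices)

# run the embedding layer, then the transformer stack
bert_embedding = sample |> bert_model.embed
feature_tensors = bert_embedding |> bert_model.transformers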

A string with the pretrain prefix, i.e. pretrain"<type>-<model-name>:<item>", loads the specified item from a known pretrained file (see the list below). The <type> is matched case insensitively, so bert, Bert, BERT, or even bErT will all find the BERT pretrained models. On the other hand, <model-name> and <item> must exactly match an entry on the list; see the snippet below.
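For instance, this sketch loads only the wordpiece item of the uncased BERT-Base model (hedged: the item name is taken from the table below):

# <type> is case insensitive: "bert", "Bert", and "BERT" all refer to the same model
wordpiece = pretrain"BERT-uncased_L-12_H-768_A-12:wordpiece"

# <model-name> and <item> must match the table exactly;
# e.g. "uncased_l-12_h-768_a-12" or "WordPiece" would not be found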

Currently supported pretrained models:

| Type | Model | Model name | Supported items | Description |
|------|-------|------------|-----------------|-------------|
| Gpt | gpt | OpenAIftlm | gpt_model, bpe, vocab, tokenizer | GPT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with a 40000-entry uncased BPE vocabulary. |
| Bert | scibert | scibert_scivocab_uncased | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's uncased WordPiece tokenizer; vocabulary built from a scientific corpus. |
| Bert | scibert | scibert_basevocab_cased | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's cased WordPiece tokenizer. |
| Bert | scibert | scibert_basevocab_uncased | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's uncased WordPiece tokenizer. |
| Bert | scibert | scibert_scivocab_cased | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's cased WordPiece tokenizer; vocabulary built from a scientific corpus. |
| Bert | bert | cased_L-12_H-768_A-12 | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's cased WordPiece tokenizer. |
| Bert | bert | wwm_cased_L-24_H-1024_A-16 | bert_model, wordpiece, tokenizer | BERT model with 24 layers, 1024 hidden units, and 16 attention heads; subword tokenization with Google's cased WordPiece tokenizer; trained with whole-word masking. |
| Bert | bert | uncased_L-12_H-768_A-12 | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's uncased WordPiece tokenizer. |
| Bert | bert | multi_cased_L-12_H-768_A-12 | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's multilingual cased WordPiece tokenizer. |
| Bert | bert | wwm_uncased_L-24_H-1024_A-16 | bert_model, wordpiece, tokenizer | BERT model with 24 layers, 1024 hidden units, and 16 attention heads; subword tokenization with Google's uncased WordPiece tokenizer; trained with whole-word masking. |
| Bert | bert | multilingual_L-12_H-768_A-12 | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's multilingual WordPiece tokenizer. |
| Bert | bert | chinese_L-12_H-768_A-12 | bert_model, wordpiece, tokenizer | BERT model with 12 layers, 768 hidden units, and 12 attention heads; subword tokenization with Google's Chinese WordPiece tokenizer. |
| Bert | bert | cased_L-24_H-1024_A-16 | bert_model, wordpiece, tokenizer | BERT model with 24 layers, 1024 hidden units, and 16 attention heads; subword tokenization with Google's cased WordPiece tokenizer. |
| Bert | bert | uncased_L-24_H-1024_A-16 | bert_model, wordpiece, tokenizer | BERT model with 24 layers, 1024 hidden units, and 16 attention heads; subword tokenization with Google's uncased WordPiece tokenizer. |

If a public pretrained model you want is not on the list, please file an issue.

See the example folder for complete examples.

API reference

Transformers.Pretrain.pretrains — Function

pretrains(query::String = ""; detailed::Bool = false)

Show all available models. You can also query a specific model or model name. Show more detail with detailed = true.
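For example (a sketch following the signature above; the printed listing depends on the installed version):

using Transformers.Pretrain

# list every registered pretrained model
pretrains()

# narrow the listing to BERT entries and include the detailed description column
pretrains("bert"; detailed = true)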

Transformers.Pretrain.@pretrain_str — Macro
pretrain"model-description:item"

Convenient macro for loading data from a pretrained model. It uses DataDeps to download the model automatically if it is not found locally. The string should be in pretrain"<type>-<model-name>:<item>" format.

See also Pretrain.pretrains().
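For example, a hedged sketch using the GPT items listed in the table above: the item-less form unpacks every registered item of the model, while the :item form returns only the requested one.

# load every registered item of the OpenAI GPT model
gpt_model, bpe, vocab, tokenizer = pretrain"Gpt-OpenAIftlm"

# load a single item
bpe = pretrain"Gpt-OpenAIftlm:bpe"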
