Terminology

Explanation of the terms and naming used in these docs.

Prerequisites

Some terms that are helpful for understanding these docs.

1. PartialFunctions

This actually lives outside the scope of this package, but it is extremely useful for illustrating the overall design. We'll use the $ operator to denote partial function application (i.e. f $ x is equivalent to (arg...) -> f(x, arg...)).
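
As a minimal sketch (independent of any particular package), the same behavior can be written with an ordinary closure; the helper name partial below is hypothetical:

	# partial(f, x) behaves like `f $ x`: it fixes the first argument of `f`.
	partial(f, x) = (args...) -> f(x, args...)

	add3(a, b, c) = a + b + c
	g = partial(add3, 1)   # roughly `add3 $ 1`
	g(2, 3)                # == add3(1, 2, 3) == 6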

2. Feature / Length / Batch Dimension

In the context of the attention operation in deep learning, the input data can be viewed as a 3-dimensional array with a feature dimension, a length dimension, and a batch dimension (f-dim, l-dim, b-dim for short). Following Julia's multidimensional array implementation (column-major), the data is stored in an AbstractArray{T, 3} whose size is (f-dim, l-dim, b-dim).

For example, given 3 sentences as a batch, where each sentence has 10 words and we choose to represent a word with a vector of 32 elements, the data will be stored in a 3-dim array of size (32, 10, 3).

Generally speaking, batch stands for how many independent data samples you are going to run in one function call, usually just for performance/optimization needs. length means how many entries each data sample has, like the number of words in a sentence or the number of pixels in an image. feature is the number of values used to represent an entry.
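
For instance, the sentence example above can be sketched with a plain Julia array (the values here are just random placeholders):

	# A batch of 3 sentences, 10 words each, 32 features per word.
	x = randn(Float32, 32, 10, 3)
	size(x, 1)   # 32, the feature dimension (f-dim)
	size(x, 2)   # 10, the length dimension  (l-dim)
	size(x, 3)   #  3, the batch dimension   (b-dim)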

Attention

The overall attention operation can be viewed as three nested blocks:

	     (main input)
	        Value           Key             Query  (Extras...)
	+---------|--------------|----------------|------|||---- Attention Operation ---+
	|         |              |                |      |||                            |
	|         |              |                |      |||   multihead, ...           |
	|         |              |                |      |||                            |
	|   +-----|--------------|----------------|------|||-----------------------+    |
	|   |     |              |                |      |||                       |    |
	|   |     |          +---|----------------|------|||-------------+         |    |
	|   |     |          |   |                |      |||             |         |    |
	|   |     |          |   |  scoring func  |      |||             |         |    |
	|   |     |          |   +------>+<-------+<=======+             |         |    |
	|   |     |          |           |                               |         |    |
	|   |     |          |           | masked_score,                 |         |    |
	|   |     |          |           | normalized_score,             |         |    |
	|   |     |          |           | ...                           |         |    |
	|   |     |          |           |                               |         |    |
	|   |     |          +-----------|------------ Attention Score --+         |    |
	|   |     |                      |                                         |    |
	|   |     |     mixing func      |                                         |    |
	|   |     +--------->+<----------+                                         |    |
	|   |                |                                                     |    |
	|   +----------------|------------------------------- Mixing --------------+    |
	|                    |                                                          |
	+--------------------|----------------------------------------------------------+
	              Attentive Value
	               (main output)

The attention operation is actually a special way to "mix" (or "pick", in the common literature) the input information. In (probably) the first attention paper, attention is defined as a weighted sum of the input sequence given a word embedding. The idea was further generalized to QKV attention in the first transformer paper.

1. Attention Score

The attention score is used to decide how much each piece of input information will contribute to the output value, and also how many entries the attention operation will output. Operations that modify the attention score matrix should be considered part of this block, for example: different attention masks (local attention, random attention, ...), normalization (softmax, l2-norm, ...), and special attention variants that take other inputs (transformer decoder, relative position encoding, ...).
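
As an illustration only (not the package API), a scaled dot-product score with softmax normalization for a single data sample could look like the sketch below, assuming q of size (f, lq) and k of size (f, lk):

	using LinearAlgebra

	# Returns a (lk, lq) score matrix: column j holds the normalized
	# weights of all key entries for the j-th query entry.
	function dot_product_score(q, k)
	    score = (k' * q) ./ sqrt(size(k, 1))            # scaled dot product
	    exps = exp.(score .- maximum(score; dims=1))    # numerically stable softmax
	    return exps ./ sum(exps; dims=1)
	end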

2. Mixing

We refer to the operation that takes the attention score and the input value as "mixing". Usually it is just a weighted sum over the input values, using the attention score as the weights.
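
Under the same assumptions as the scoring sketch above (v of size (f, lk), score of size (lk, lq)), the weighted-sum mixing is just a matrix product; the function name is hypothetical:

	# Each output entry is a weighted sum of the value entries,
	# with the attention score as the weights. Result size: (f, lq).
	weighted_sum_mixing(score, v) = v * score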

3. Attention Operation

The whole scoring + mixing, together with other pre/post-processing, makes up an attention operation. Things like handling multiple heads should happen at this level.
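
Composing the two sketches above gives a hypothetical single-head attention operation; in a real implementation, multi-head handling, masking, and other pre/post-processing would be added at this level:

	attention(q, k, v) = weighted_sum_mixing(dot_product_score(q, k), v)

	q = randn(Float32, 32, 5)    # 5 query entries
	k = randn(Float32, 32, 10)   # 10 key entries
	v = randn(Float32, 32, 10)   # 10 value entries
	y = attention(q, k, v)       # size (32, 5)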

Attention Mask

Attention masks are a group of operations that modify the attention score.

1. Dataless mask

We use "dataless" to refer to masks that are independent to the input. For example, CausalMask works the same on each data regardless of the batch size or the data content.

2. Array mask

We call a mask that depends on the input an "array mask". For example, SymLengthMask is used to prevent the padding tokens from being considered in the attention operation, so each sample in the batch might have different mask values.
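
In contrast, a length-based mask has to be built from the data (here, the hypothetical per-sample lengths lens), so every sample in the batch can get a different mask; this only sketches the idea behind SymLengthMask, not its actual implementation:

	# Keep entry (i, j, b) only when both indices fall inside the b-th
	# sample's real (unpadded) length.
	length_mask(lens, lk, lq) = [i <= lens[b] && j <= lens[b]
	                             for i in 1:lk, j in 1:lq, b in eachindex(lens)]

	length_mask([10, 7, 9], 10, 10)   # boolean array of size (10, 10, 3)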