Deep learning: Attention and Transformers

Quiz • Computers • 5th Grade • Medium

Josiah Wang
9 questions
1.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
True or False:
Dot-product attention with softmax activation can be thought of as a soft form of dictionary lookup over matrices
True
False
Answer explanation
See Tutorial. The dot-product between queries and keys, passed through the softmax, gives a normalized weight vector which is used to extract a weighted combination of the values.
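As a concrete illustration of this "soft lookup" view, here is a minimal NumPy sketch of dot-product attention (not taken from the tutorial; the scaling by the square root of the dimension and the toy shapes are assumptions made for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Soft dictionary lookup: each query retrieves a weighted mix of the values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (num_queries, num_keys) similarities
    weights = softmax(scores, axis=-1)  # rows sum to 1: soft "lookup" weights
    return weights @ V                  # weighted combination of the values

# Toy example: 2 queries, 3 key/value pairs, dimension 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(dot_product_attention(Q, K, V).shape)  # (2, 4)
```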
2.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following are advantages of Transformers over Recurrent Sequence Models?
Better at learning long-range dependencies
Faster to train and run on modern hardware
Require many fewer parameters to achieve similar results
Answer explanation
Transformers are better able to learn long-range dependencies because the gradients for such relationships only pass through a few (vertically stacked) attention layers, rather than the full sequence (as in RNNs). Additionally, they do not rely on a limited-capacity memory to model long-range context, which aids inference and learning.
Modern GPU/TPU hardware is better equipped to handle a few large matrix ops than many small sequential ones.
3.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of these parts of the self-attention operation are calculated by passing inputs through an MLP?
Values
Keys
Queries
Word Embeddings
Answer explanation
We rely on the layer learning to map from the "fixed" input embedding space into "meaningful" subspaces, where relationships between tokens can be leveraged through the soft-indexing nature of attention. The values are also typically mapped so that subsequent layers can more easily learn mappings into richer subspaces.
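For instance, a sketch of how queries, keys and values might all be produced as learned projections of the fixed input embeddings (the names and sizes are illustrative assumptions, and a single linear map stands in for a deeper MLP):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k, seq_len = 8, 8, 5           # illustrative sizes

# Learned projection matrices (randomly initialised here for the sketch)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

X = rng.normal(size=(seq_len, d_model))   # "fixed" input word embeddings

# Queries, keys and values are all computed from the inputs by learned maps;
# the word embeddings themselves are not produced by this layer.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)          # (5, 8) (5, 8) (5, 8)
```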
4.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
True or False:
In practice we often use multi-headed self-attention. Intuitively, this works because it is more useful to learn multiple simpler transformations into distinct subspaces than one complicated transformation into a richer subspace.
True
False
Answer explanation
You might imagine a situation in which our word embedding somehow captures whether a word is a noun or a verb. Then it might be easier to learn long-range noun-noun and verb-verb relations if we learn two distinct, simpler subspaces (by using 2 attention heads) than one richer subspace, where the MLP would have to disentangle these notions. Additionally, it might make it easier for the transformed value-space to amplify certain semantically distinct properties which can be leveraged by subsequent layers.
5.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
We are creating a Transformer using multi-headed attention, such that input embeddings of dimension 128 match the output shape of our self-attention layer.
If we use multi-headed attention, with 4 heads, what dimensionality will the outputs of each head have?
32
64
128
512
Answer explanation
See Tutorial. Largely for the sake of convenience (to allow residual connections and make shapes easy to keep track of) we want the output shape to be 128. We achieve this by concatenating the outputs of the 4 attention heads, each of dimension 128/4 = 32. You might consider having each head output dimension 128 and then aggregating with a sum or max etc., but then you would be performing more compute and mixing up the information that the different heads might have extracted.
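A shape-only sketch of the concatenation described above (the sizes follow the question; the zero-filled per-head outputs are placeholders standing in for real attention heads):

```python
import numpy as np

d_model, num_heads, seq_len = 128, 4, 10
head_dim = d_model // num_heads           # 32 dimensions per head
assert head_dim * num_heads == d_model

# Placeholder per-head outputs; in a real layer each comes from its own attention head
head_outputs = [np.zeros((seq_len, head_dim)) for _ in range(num_heads)]

combined = np.concatenate(head_outputs, axis=-1)
print(combined.shape)                     # (10, 128): matches the input embedding size
```

In standard multi-head attention a final learned output projection usually follows the concatenation, again preserving the 128-dimensional shape.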
6.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
Why do we add (or concatenate etc.) position encodings to Transformer Inputs?
Because the dot-product attention operation is agnostic to token ordering
To increase robustness to adversarial attacks in the token embedding space
Answer explanation
The soft dot-product attention operation is equivariant to the ordering of the tokens in the sequence. As such, if we want Transformers to pay attention to absolute and relative positions, we can either add this information directly to the input tokens or within the layer. The former is more straightforward.
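A small NumPy check of this ordering-agnostic behaviour, using a toy self-attention with identity projections (an assumption made to keep the sketch short): permuting the input tokens simply permutes the output rows.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def toy_self_attention(X):
    # Identity projections for brevity; real layers use learned W_q, W_k, W_v
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]))
    return A @ X

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 8))               # 5 tokens, no position information
perm = rng.permutation(5)

# Permuting inputs first, or permuting outputs afterwards, gives the same result:
print(np.allclose(toy_self_attention(X[perm]), toy_self_attention(X)[perm]))  # True
```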
7.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following properties will a good position encoding ideally have:
Unique for all positions
Relative-distances are independent of absolute sequence position
Well-defined for arbitrary sequence lengths
Answer explanation
See Tutorial. Ideally, we'd like the following to hold for our choice of positional encoding:
1. The encodings should be deterministic and unique for every position in the sequence
2. Differences in encoding between two positions should depend only on their distance within the sequence, not on the sequence length
The above two properties allow the network to reason about absolute and relative positions within sequences in a consistent fashion. Finally, we also want the encodings to:
3. Be well-behaved for sequences of unseen length (e.g. defined over any domain, whilst maintaining the above properties)
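One common choice satisfying these properties is the sinusoidal encoding from the original Transformer paper; a minimal sketch (the base of 10000 and the even embedding dimension follow that paper's formulation):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    assert d_model % 2 == 0, "sketch assumes an even embedding dimension"
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=128)
print(pe.shape)  # (50, 128): unique per position, defined for any sequence length
```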
8.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
How have Transformers been successfully applied to generate images?
Treating neighboring pixels as an input sequence and autoregressing on a next-pixel-prediction problem
Learning to map sequences of flattened noise vectors (matching image size) to images
Answer explanation
See slides. Image Transformers achieved state-of-the-art image generation on ImageNet by posing image generation as an autoregressive sequence problem. A related proposal, using VQ-VAEs to learn discretized latent spaces as inputs to Transformers, has also been successful (see DALL·E).
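A toy illustration of the autoregressive framing assumed here (raster-scan pixel ordering on a tiny greyscale image; real models add many details such as channel ordering and more efficient attention patterns):

```python
import numpy as np

# Raster-scan flattening: an H x W image becomes a length-H*W token sequence,
# and the model is trained to predict pixel t+1 given pixels 1..t.
rng = np.random.default_rng(3)
image = rng.integers(0, 256, size=(4, 4))   # tiny greyscale image
sequence = image.flatten()                   # shape (16,)

inputs  = sequence[:-1]   # pixels 1..15 form the context
targets = sequence[1:]    # pixel t+1 is the prediction target at step t
print(inputs.shape, targets.shape)  # (15,) (15,)
```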
9.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
What is the time complexity of computing the attention output softmax(QK^T)V, where Q is N×d, K is M×d, and V is M×l?
O(NM(d+l))
O(NM^2dl)
O(NM^2d^2l)
O(NMd^2l)
Answer explanation
See Notes. Given two matrices A (N×M) and B (M×L), multiplying them has time complexity O(NML). Applying this to the question: the QK^T multiplication costs O(NMd); the softmax requires M operations per row (summing along the row, then dividing) over N rows, i.e. O(NM); finally we multiply the resulting N×M matrix with V, which is O(NMl). So we get O(NM(l+d+1)) = O(NM(l+d)).
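A quick numeric sanity check of that count (the sizes are arbitrary illustrative values):

```python
N, M, d, l = 32, 48, 64, 80    # illustrative sizes: Q is Nxd, K is Mxd, V is Mxl

qk_cost      = N * M * d       # Q @ K.T  -> (N, M)
softmax_cost = N * M           # normalise each of the N rows of length M
av_cost      = N * M * l       # (N, M) @ V -> (N, l)

total = qk_cost + softmax_cost + av_cost
print(total, N * M * (d + l + 1), total == N * M * (d + l + 1))  # identical counts
```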