
Deep learning: Attention and Transformers
Authored by Josiah Wang
9 questions
1.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
True or False:
Dot-product attention with softmax activation can be thought of as a soft form of dictionary lookup over matrices
True
False
Answer explanation
See Tutorial. The softmax over the dot-products between queries and keys gives a normalized weight vector, which is then used to extract a weighted combination of the values.
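For concreteness, here is a minimal NumPy sketch (not from the tutorial; array names and sizes are illustrative assumptions) of softmax dot-product attention viewed as a soft dictionary lookup: each query scores every key, and the softmax weights blend the corresponding values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Soft dictionary lookup: queries score keys, softmax weights blend values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                  # weighted combination of values

# Illustrative shapes (assumed): 3 queries, 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
out = dot_product_attention(Q, K, V)
print(out.shape)  # (3, 16): one blended value per query
```

A "hard" dictionary lookup would place all of its weight on a single key; the softmax relaxes this to a distribution over every key.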
2.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following are advantages of Transformers over Recurrent Sequence Models?
Better at learning long-range dependencies
Faster to train and run on modern hardware
Require many fewer parameters to achieve similar results
Answer explanation
Transformers are better able to learn long-range dependencies because the gradients for such relationships only pass through a few (vertically stacked) attention layers, rather than through the full sequence (as in RNNs). Additionally, they do not rely on a limited-capacity memory to model long-range context, which aids both inference and learning.
Modern GPU/TPU hardware is better equipped to handle a few large matrix ops than many small sequential ones.
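As an illustration of the hardware point (a rough sketch; names and sizes are assumed, not from the tutorial), the code below contrasts a recurrent update, which must loop over time steps sequentially, with self-attention over the same sequence, which reduces to a few large matrix multiplications that parallelise well on GPUs/TPUs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                      # sequence length and model width (illustrative)
X = rng.normal(size=(T, d))

# Recurrent model: each hidden state depends on the previous one,
# so the T updates cannot be computed in parallel.
W_h, W_x = rng.normal(size=(d, d)) * 0.01, rng.normal(size=(d, d)) * 0.01
h = np.zeros(d)
for t in range(T):                  # T small, strictly sequential updates
    h = np.tanh(W_h @ h + W_x @ X[t])

# Self-attention: every position attends to every other position
# via a handful of large matrix-matrix products, with no sequential loop.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)       # (T, T), computed in one go
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ V              # (T, d)
print(h.shape, attended.shape)
```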
3.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of these parts of the self-attention operation are calculated by passing inputs through an MLP?
Values
Keys
Queries
Word Embeddings
Answer explanation
We rely on the layer learning to map from the “fixed” input embedding space into “meaningful” subspaces, where relationships between tokens can be exploited through the soft-indexing nature of attention. The values are also typically projected so that subsequent layers can more easily learn mappings into richer subspaces.
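A minimal sketch (assumed names and shapes, not from the tutorial) of which pieces are learned mappings of the input: the word embeddings come from a lookup table, while the queries, keys, and values are each produced by passing those embeddings through their own learned projection.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_head = 1000, 128, 32   # illustrative sizes

# Token embeddings: a lookup table indexed by token id.
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([12, 7, 421, 7])
X = embedding_table[token_ids]                # (4, 128)

# Queries, keys and values: learned linear projections of the same inputs.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_q, X @ W_k, X @ W_v           # each (4, 32)
print(Q.shape, K.shape, V.shape)
```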
4.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
True or False:
In practice we often use multi-headed self-attention. Intuitively, this works because it is often more useful to learn multiple simpler transformations into distinct subspaces than one complicated transformation into a single richer subspace.
True
False
Answer explanation
You might imagine a situation in which our word embedding somehow captures whether a word is a noun or a verb. Then it might be easier to learn long-range noun-noun and verb-verb relations if we learn two distinct, simpler subspaces (by using 2 attention heads) than one richer subspace, where the MLP would have to disentangle these notions. Additionally, it might make it easier for the transformed value-space to amplify certain semantically distinct properties which can be leveraged by subsequent layers.
5.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
We are creating a Transformer using multi-headed attention, such that input embeddings of dimension 128 match the output shape of our self-attention layer.
If we use multi-headed attention, with 4 heads, what dimensionality will the outputs of each head have?
32
64
128
512
Answer explanation
See Tutorial. Largely for the sake of convenience (to allow residual connections and make shapes easy to keep track of) we want the output shape to be 128. We’ll achieve this by concatenating the outputs of the 4 attention heads. You might consider having their output dimension be 128, and then aggregating with a sum or max etc., but then you are performing more compute and mixing up the information which your different heads might have extracted.
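The arithmetic from the explanation, as a shape-checking sketch (all names assumed, not from the tutorial): with d_model = 128 and 4 heads, each head works in a 32-dimensional subspace, and concatenating the 4 head outputs restores the 128-dimensional width needed for the residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 128, 4
d_head = d_model // num_heads                 # 128 / 4 = 32 per head
T = 10                                        # sequence length (illustrative)
X = rng.normal(size=(T, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(num_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (T, 32)
    A = softmax(Q @ K.T / np.sqrt(d_head))    # (T, T) attention weights
    head_outputs.append(A @ V)                # (T, 32)

out = np.concatenate(head_outputs, axis=-1)   # (T, 128): matches the input width
print(out.shape)
```

In the full architecture an additional output projection typically mixes the concatenated heads, but it is the concatenation that makes the shapes line up with the residual path.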
6.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
Why do we add (or concatenate etc.) position encodings to Transformer inputs?
Because the dot-product attention operation is agnostic to token ordering
To increase robustness to adversarial attacks in the token embedding space
Answer explanation
The soft dot-product attention operation is equivariant to the ordering of the tokens in the sequence. As such, if we want Transformers to pay attention to absolute and relative positions, we can either add this information directly to the input tokens or inject it within the attention layers themselves. The former is more straightforward.
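A small check of the equivariance claim (illustrative code, not from the tutorial): shuffling the input tokens simply shuffles the attention outputs in the same way, so without position encodings the layer has no way to tell different orderings apart.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

perm = rng.permutation(T)
out = self_attention(X)
out_perm = self_attention(X[perm])            # attend over the shuffled sequence

# Permuting the inputs just permutes the outputs: the layer is order-equivariant.
print(np.allclose(out_perm, out[perm]))       # True
```

Adding a position-dependent encoding to X before the projections breaks this symmetry, which is exactly what we want.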
7.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following properties will a good position encoding ideally have?
Unique for all positions
Relative-distances are independent of absolute sequence position
Well-defined for arbitrary sequence lengths
Answer explanation
See Tutorial. Ideally, we'd like the following to hold for our choice of positional encoding:
1. The encodings should be deterministic and unique for every position in the sequence
2. Differences in encoding value between positions a fixed distance apart should not depend on the sequence length
The above two properties allow the network to reason about absolute and relative positions within sequences in a consistent fashion. Finally, we also want the encodings to:
3. Be well-behaved for sequences of unseen length (i.e. defined for arbitrary positions, whilst maintaining the above properties)
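As a sketch of one encoding that satisfies these properties, here is the sinusoidal scheme popularised by "Attention Is All You Need" (array names assumed): it is deterministic, gives every position a distinct vector, is defined for any position, and the encoding of position pos + k is a fixed linear function of the encoding of pos, independent of pos.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(num_positions)[:, None]          # (P, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d/2)
    angles = positions / (10000 ** (dims / d_model))       # (P, d/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(num_positions=50, d_model=64)
print(pe.shape)                                            # (50, 64)

# Deterministic and distinct: no two positions share an encoding here.
print(len(np.unique(pe.round(6), axis=0)))                 # 50

# Defined for arbitrary (even unseen) lengths: just evaluate more positions.
pe_long = sinusoidal_encoding(num_positions=5000, d_model=64)
print(pe_long.shape)                                       # (5000, 64)
```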