
Deep learning: Attention and Transformers
Authored by Josiah Wang
9 questions
1.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
True or False:
Dot-product attention with softmax activation can be thought of as a soft form of dictionary lookup over matrices
True
False
Answer explanation
See Tutorial. The softmax over the dot-products between queries and keys gives a normalized weight vector, which is then used to extract a weighted combination of the values.
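For concreteness, here is a minimal NumPy sketch (not from the tutorial; array names and sizes are illustrative assumptions) of softmax dot-product attention viewed as a soft dictionary lookup: each query scores every key, and the softmax weights blend the corresponding values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Soft dictionary lookup: queries score keys, softmax weights blend values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                  # weighted combination of values

# Illustrative shapes (assumed): 3 queries, 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
out = dot_product_attention(Q, K, V)
print(out.shape)  # (3, 16): one blended value per query
```

A "hard" dictionary lookup would place all of its weight on a single key; the softmax relaxes this to a distribution over every key.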
2.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following are advantages of Transformers over Recurrent Sequence Models?
Better at learning long-range dependencies
Faster to train and run on modern hardware
Require many fewer parameters to achieve similar results
Answer explanation
Transformers are better able to learn long-range dependencies because the gradients for such relationships only pass through a few (vertically stacked) attention layers, rather than through the full sequence (as in RNNs). Additionally, they do not rely on a limited-capacity memory to model long-range context, which aids both inference and learning.
Modern GPU/TPU hardware is better equipped to handle a few large matrix ops than many small sequential ones.
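As an illustration of the hardware point (a rough sketch; names and sizes are assumed, not from the tutorial), the code below contrasts a recurrent update, which must loop over time steps sequentially, with self-attention over the same sequence, which reduces to a few large matrix multiplications that parallelise well on GPUs/TPUs.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                      # sequence length and model width (illustrative)
X = rng.normal(size=(T, d))

# Recurrent model: each hidden state depends on the previous one,
# so the T updates cannot be computed in parallel.
W_h, W_x = rng.normal(size=(d, d)) * 0.01, rng.normal(size=(d, d)) * 0.01
h = np.zeros(d)
for t in range(T):                  # T small, strictly sequential updates
    h = np.tanh(W_h @ h + W_x @ X[t])

# Self-attention: every position attends to every other position
# via a handful of large matrix-matrix products, with no sequential loop.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)       # (T, T), computed in one go
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attended = weights @ V              # (T, d)
print(h.shape, attended.shape)
```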
3.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of these parts of the self-attention operation are calculated by passing inputs through an MLP?
Values
Keys
Queries
Word Embeddings
Answer explanation
We rely on the layer learning to map from the “fixed” input embedding space into “meaningful” subspaces, where relationships between tokens can be exploited through the soft-indexing nature of attention. The values are also typically projected so that subsequent layers can more easily learn mappings into richer subspaces.
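A minimal sketch (assumed names and shapes, not from the tutorial) of which pieces are learned mappings of the input: the word embeddings come from a lookup table, while the queries, keys, and values are each produced by passing those embeddings through their own learned projection.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_head = 1000, 128, 32   # illustrative sizes

# Token embeddings: a lookup table indexed by token id.
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([12, 7, 421, 7])
X = embedding_table[token_ids]                # (4, 128)

# Queries, keys and values: learned linear projections of the same inputs.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = X @ W_q, X @ W_k, X @ W_v           # each (4, 32)
print(Q.shape, K.shape, V.shape)
```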
4.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
True or False:
In practice we often use multi-headed self-attention. Intuitively, this works because it is often more useful to learn multiple simpler transformations into distinct subspaces than one complicated transformation into a single richer subspace.
True
False
Answer explanation
You might imagine a situation in which our word embedding somehow captures whether a word is a noun or a verb. Then it might be easier to learn long-range noun-noun and verb-verb relations if we learn two distinct, simpler subspaces (by using 2 attention heads) than one richer subspace, where the MLP would have to disentangle these notions. Additionally, it might make it easier for the transformed value-space to amplify certain semantically distinct properties which can be leveraged by subsequent layers.
5.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
We are creating a Transformer using multi-headed attention, such that input embeddings of dimension 128 match the output shape of our self-attention layer.
If we use multi-headed attention, with 4 heads, what dimensionality will the outputs of each head have?
32
64
128
512
Answer explanation
See Tutorial. Largely for the sake of convenience (to allow residual connections and make shapes easy to keep track of) we want the output shape to be 128. We’ll achieve this by concatenating the outputs of the 4 attention heads. You might consider having their output dimension be 128, and then aggregating with a sum or max etc., but then you are performing more compute and mixing up the information which your different heads might have extracted.
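The arithmetic from the explanation, as a shape-checking sketch (all names assumed, not from the tutorial): with d_model = 128 and 4 heads, each head works in a 32-dimensional subspace, and concatenating the 4 head outputs restores the 128-dimensional width needed for the residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads = 128, 4
d_head = d_model // num_heads                 # 128 / 4 = 32 per head
T = 10                                        # sequence length (illustrative)
X = rng.normal(size=(T, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_outputs = []
for _ in range(num_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (T, 32)
    A = softmax(Q @ K.T / np.sqrt(d_head))    # (T, T) attention weights
    head_outputs.append(A @ V)                # (T, 32)

out = np.concatenate(head_outputs, axis=-1)   # (T, 128): matches the input width
print(out.shape)
```

In the full architecture an additional output projection typically mixes the concatenated heads, but it is the concatenation that makes the shapes line up with the residual path.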
6.
MULTIPLE CHOICE QUESTION
1 min • 1 pt
Why do we add (or concatenate etc.) position encodings to Transformer inputs?
Because the dot-product attention operation is agnostic to token ordering
To increase robustness to adversarial attacks in the token embedding space
Answer explanation
The soft dot-product attention operation is equivariant to the ordering of the tokens in the sequence. As such, if we want Transformers to pay attention to absolute and relative positions, we can either add this information directly to the input tokens or inject it within the attention layers themselves. The former is more straightforward.
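A small check of the equivariance claim (illustrative code, not from the tutorial): shuffling the input tokens simply shuffles the attention outputs in the same way, so without position encodings the layer has no way to tell different orderings apart.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

perm = rng.permutation(T)
out = self_attention(X)
out_perm = self_attention(X[perm])            # attend over the shuffled sequence

# Permuting the inputs just permutes the outputs: the layer is order-equivariant.
print(np.allclose(out_perm, out[perm]))       # True
```

Adding a position-dependent encoding to X before the projections breaks this symmetry, which is exactly what we want.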
7.
MULTIPLE SELECT QUESTION
1 min • 1 pt
Which of the following properties will a good position encoding ideally have?
Unique for all positions
Relative-distances are independent of absolute sequence position
Well-defined for arbitrary sequence lengths
Answer explanation
See Tutorial. Ideally, we'd like the following to hold for our choice of positional encoding:
1. The encodings should be deterministic and unique for every position in the sequence
2. Differences in encoding value between positions a fixed distance apart should not depend on the sequence length
The above two properties allow the network to reason about absolute and relative positions within sequences in a consistent fashion. Finally, we also want the encodings to:
3. Be well-behaved for sequences of unseen length (i.e. defined for arbitrary positions, whilst maintaining the above properties)
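As a sketch of one encoding that satisfies these properties, here is the sinusoidal scheme popularised by "Attention Is All You Need" (array names assumed): it is deterministic, gives every position a distinct vector, is defined for any position, and the encoding of position pos + k is a fixed linear function of the encoding of pos, independent of pos.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(num_positions)[:, None]          # (P, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d/2)
    angles = positions / (10000 ** (dims / d_model))       # (P, d/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(num_positions=50, d_model=64)
print(pe.shape)                                            # (50, 64)

# Deterministic and distinct: no two positions share an encoding here.
print(len(np.unique(pe.round(6), axis=0)))                 # 50

# Defined for arbitrary (even unseen) lengths: just evaluate more positions.
pe_long = sinusoidal_encoding(num_positions=5000, d_model=64)
print(pe_long.shape)                                       # (5000, 64)
```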