CS4417/9117/9647 Final Exam Review Game

Authored by Marwa Elsayed



40 questions


1.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

[Image: the GloVe objective function]

Suppose you have a 10,000-word vocabulary and are learning 500-dimensional word embeddings. The GloVe model minimizes the objective shown above. True/False: X_ij is the number of times word j appears in the context of word i.

True

False
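The objective referenced in this question did not survive the export as text. For reference (not taken from the quiz itself), the standard GloVe objective from Pennington et al. (2014), with word vectors w_i, context vectors w̃_j, and bias terms, is:

```latex
\min \; \sum_{i,j=1}^{10000} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
```

Here f is a weighting function with f(X_ij) = 0 when X_ij = 0, so pairs that never co-occur contribute nothing to the sum.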

2.

MULTIPLE SELECT QUESTION

45 sec • 1 pt

Which of these equations do you think should hold for a good word embedding? (Check all that apply.)

[Images: four candidate equations, shown as answer choices]

3.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

True/False: Suppose you learn a word embedding for a vocabulary of 60000 words. Then the embedding vectors could be 60000 dimensional, so as to capture the full range of variation and meaning in those words.

True

False

4.

MULTIPLE SELECT QUESTION

45 sec • 1 pt

In beam search, if you increase the beam width B, which of the following would you expect to be true?

Beam search will use up more memory.

Beam search will converge after fewer steps.

Beam search will generally find better solutions (i.e. do a better job maximizing P(y∣x)).

Beam search will run more quickly.
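To make the trade-offs in this question concrete, here is a minimal beam-search sketch over a toy next-token model (all names and numbers are illustrative, not from the quiz): keeping B hypotheses per step means memory and per-step work grow with B, while a larger B explores more candidates and so tends to find sequences with higher log P(y|x).

```python
import heapq
from math import log

def beam_search(next_logprob, vocab, B, max_len):
    # Keep the B highest-scoring partial sequences at each step.
    # Larger B -> more candidates held in memory and slower steps,
    # but a better chance of maximizing log P(y | x).
    beams = [(0.0, [])]  # (cumulative log-probability, token list)
    for _ in range(max_len):
        candidates = [
            (score + next_logprob(seq, tok), seq + [tok])
            for score, seq in beams
            for tok in vocab
        ]
        beams = heapq.nlargest(B, candidates, key=lambda c: c[0])  # prune to width B
    return beams

# Toy next-token model that slightly prefers "a" over "b" everywhere.
probs = {"a": 0.6, "b": 0.4}
best = beam_search(lambda seq, tok: log(probs[tok]), ["a", "b"], B=2, max_len=3)
# best[0] is the highest-scoring hypothesis: ["a", "a", "a"]
```

Note that beam search does not "converge after fewer steps" with larger B; it runs the same number of steps, just with more hypotheses per step.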

5.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

[Image: encoder-decoder model for machine translation]

Consider using this encoder-decoder model for machine translation.

True/False: This model is a “conditional language model” in the sense that the decoder portion (shown in purple) is modeling the probability of the output sentence y given the input sentence x.

True

False

6.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

In machine translation, if we carry out beam search without using sentence (length) normalization, the algorithm will tend to output overly short translations.

True

False
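The short-output bias can be checked with a quick calculation (toy per-token probabilities, assumed for illustration): every conditional token probability is below 1, so each extra token makes log P(y|x) more negative, and the raw objective favors shorter candidates; dividing the score by length**alpha removes that bias.

```python
from math import log

# A 3-token candidate with per-token probability 0.5 vs. an 8-token
# candidate whose tokens are individually more likely (0.6 each).
short_lp = 3 * log(0.5)
long_lp = 8 * log(0.6)

# Raw objective: the shorter candidate wins simply for having fewer
# factors below 1, even though its tokens are less likely.
print(short_lp > long_lp)  # True

# Length-normalized objective (alpha around 0.7 is common in practice):
alpha = 0.7
print(long_lp / 8**alpha > short_lp / 3**alpha)  # True: longer candidate now wins
```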

7.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

A Transformer network, unlike its predecessors (RNNs, GRUs, and LSTMs), can process an entire sentence at the same time (a parallel architecture).

True

False
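The parallelism behind this question can be sketched with scaled dot-product self-attention in NumPy (a simplified single-head illustration, not the quiz's own material): one matrix product scores every pair of positions at once, so no position waits for the previous one as in a recurrent network.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over a whole sentence X (T x d):
    # all T positions are projected and attended in parallel, with no
    # sequential recurrence as in an RNN/GRU/LSTM.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
T, d = 5, 4  # 5-token sentence, 4-dimensional embeddings
X = rng.normal(size=(T, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
# out has shape (5, 4): every position's representation computed in one pass
```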
