Data 151-175

Authored by Michael Caponpon

Professional Development


25 questions


1.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates on advertisement blocks. You've been developing everything in your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate your existing training pipelines to Google Cloud. What should you do?

Use Vertex AI for training existing Spark ML models

Rewrite your models in TensorFlow, and start using Vertex AI

Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery

Answer explanation

Correct Answer:

Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

Why This Works

  • Dataproc is Google's managed Hadoop and Spark service, which allows you to run your existing Spark ML pipelines without major code changes.

  • BigQuery Connector for Spark lets Spark read directly from BigQuery, avoiding the need to export data manually.

  • This approach maintains continuity with your existing Spark ML pipeline while ensuring a smooth migration to Google Cloud.

2.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?

BigQuery

Cloud Bigtable

Cloud Datastore

Cloud SQL for PostgreSQL

Answer explanation

Correct Answer:

BigQuery

Why BigQuery?

BigQuery is the best choice because:

  • Scalability: It can efficiently handle 40 TB of data and continuous hourly updates.

  • Geospatial Processing: It has native GIS functions (ST_GEOGPOINT, ST_INTERSECTS, ST_DISTANCE, etc.), which are essential for analyzing ship locations in GeoJSON format.

  • Machine Learning Integration: BigQuery ML lets you train ML models directly without moving data to another platform.

  • Dashboards & Visualization: It integrates well with Looker, Data Studio, and third-party BI tools for real-time dashboards.
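To make the "native functionality" point concrete, here is the shape of a query the dashboard could run, combining BigQuery GIS and BigQuery ML. Every project, dataset, table, model, and column name below is hypothetical; the SQL is held in a Python string only so it can sit next to a client call.

```python
# Hypothetical dashboard query: score the last hour of telemetry with a
# BigQuery ML model and aggregate at-risk ships per region. All names
# are illustrative placeholders.
delay_risk_sql = """
SELECT
  region,
  COUNT(*) AS ships_at_risk,
  ARRAY_AGG(ship_id) AS ship_ids
FROM ML.PREDICT(
  MODEL `my-project.shipping.delay_model`,
  (
    SELECT
      ship_id,
      region,
      ST_GEOGFROMGEOJSON(location_geojson) AS position,
      speed_knots
    FROM `my-project.shipping.telemetry`
    WHERE ts > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  )
)
WHERE predicted_delay = TRUE
GROUP BY region
"""

# The string would be submitted via the BigQuery client, e.g.
# google.cloud.bigquery.Client().query(delay_risk_sql)
```

`ST_GEOGFROMGEOJSON` parses the GeoJSON locations natively, and `ML.PREDICT` runs inference in place, which is exactly the combination the question asks for.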

3.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?

Consume the stream of data in Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes and send an alert if the average is less than 4000 messages.

Consume the stream of data in Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.

Use Kafka Connect to link your Kafka message queue to Pub/Sub. Use a Dataflow template to write your messages from Pub/Sub to Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Bigtable in the last hour. If that number falls below 4000, send an alert.

Use Kafka Connect to link your Kafka message queue to Pub/Sub. Use a Dataflow template to write your messages from Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.

Answer explanation

Correct Answer:

Consume the stream of data in Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes and send an alert if the average is less than 4000 messages.

Why This is the Best Option?

  • Streaming Analytics with Dataflow:

    • Google Cloud Dataflow (Apache Beam) can process real-time Kafka streams efficiently.

  • Sliding Window for Continuous Monitoring:

    • A sliding window (1-hour window, advancing every 5 minutes) ensures near-real-time detection of a drop in the message rate, rather than waiting for a fixed window to close.

  • Immediate Alerts:

    • Dataflow can trigger an alert as soon as the moving average falls below 4000 messages per second.
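The sliding-window semantics can be simulated in a few lines of plain Python. This is a toy sketch of the logic, not a Beam pipeline: each element is the measured message rate for one 5-minute bucket, and a window "closes" every bucket once an hour of data has accumulated.

```python
from collections import deque

WINDOW_BUCKETS = 12   # 12 x 5 minutes = 1 hour
THRESHOLD = 4000      # messages per second

def alerts(rates_per_bucket):
    """Return (bucket_index, average) pairs where the 1-hour moving
    average of the message rate fell below the threshold."""
    window = deque(maxlen=WINDOW_BUCKETS)
    fired = []
    for i, rate in enumerate(rates_per_bucket):
        window.append(rate)
        if len(window) == WINDOW_BUCKETS:       # a full hour of data
            avg = sum(window) / WINDOW_BUCKETS
            if avg < THRESHOLD:
                fired.append((i, avg))
    return fired

# Steady 5000 msg/s, then traffic collapses to 1000 msg/s: the alert
# fires a few buckets into the drop, once the hourly average dips
# below 4000, rather than waiting a full fixed hour.
alerts([5000] * 12 + [1000] * 12)
```

This is why the sliding window beats a fixed window for alerting: each 5-minute advance re-evaluates the last full hour, so a sustained drop is caught within minutes of crossing the threshold.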

4.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure. What should you do?

Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.

Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.

Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.

Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.

Answer explanation

Correct Answer:

Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.

Why This is the Best Option?

  • Cloud SQL High Availability (HA) Setup:

    • Cloud SQL HA uses a primary instance with a failover replica in a different zone in the same region.

    • In the event of a zone failure, automatic failover occurs to the replica.

  • Automatic Failover Mechanism:

    • Cloud SQL HA relies on regional persistent disks to replicate data synchronously.

    • Failover is automated and minimizes downtime.

5.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data integration systems to address the requirements. The key requirements are: the ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured; support for publish/subscribe semantics on hundreds of topics; and retention of per-key ordering. Which system should you choose?

Apache Kafka

Cloud Storage

Dataflow

Firebase Cloud Messaging

Answer explanation

Correct Answer:

Apache Kafka

Why Apache Kafka?

Apache Kafka is the best choice because it meets all the key requirements:

  1. Seek to a particular offset

    • Kafka allows consumers to seek to a specific offset in a topic.

    • You can rewind to the start of all data ever captured if retention policies allow.

  2. Publish/Subscribe Semantics

    • Kafka natively supports pub/sub with multiple topics and partitions.

    • Can easily handle hundreds of topics efficiently.

  3. Per-Key Ordering

    • Kafka guarantees ordering within a partition for messages with the same key.

    • Ensures that events for the same key (e.g., a specific user or device) are processed in order.
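The per-key ordering guarantee follows from how the producer assigns keys to partitions, which this toy sketch illustrates. Kafka itself hashes keys with murmur2; `crc32` here is just a stand-in, and the keys and event names are made up.

```python
import zlib

# Toy model of Kafka's key-based partitioning: every event for a given
# key is routed to the same partition, and each partition is an
# append-only log, so per-key order is preserved.
NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Kafka's default partitioner uses murmur2; crc32 is a stand-in.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [("user-7", "login"), ("device-2", "ping"),
          ("user-7", "purchase"), ("device-2", "pong")]
for key, event in events:
    partitions[partition_for(key)].append((key, event))

# All of user-7's events sit in one partition, in the order sent.
```

Consumers reading a partition sequentially therefore see each key's events in production order, even though there is no ordering guarantee across partitions.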

6.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service. What should you do?

Deploy a Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://

Deploy a Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://

Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://

Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://

Answer explanation

Correct Answer:

Deploy a Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://.

Why This Answer?

1. Managed Service → Google Cloud Dataproc

  • Dataproc is a fully managed service for running Apache Hadoop and Spark.

  • It automatically handles scaling, fault tolerance, and cluster lifecycle.

2. Cost Optimization → Preemptible Workers

  • Using 50% preemptible workers reduces costs significantly.

  • Preemptible VMs are much cheaper than standard instances but can be terminated anytime.

  • Dataproc handles failures gracefully by rescheduling failed jobs.

3. Storage Optimization → Cloud Storage (gs://) instead of HDFS

  • Cloud Storage is more cost-effective and durable than HDFS.

  • Eliminates the need for managing an HDFS cluster.

  • Dataproc natively integrates with Cloud Storage using the Cloud Storage Connector.

  • Simply update Hadoop/Spark scripts to reference gs:// instead of hdfs://.

4. Standard Persistent Disk is Sufficient

  • Standard persistent disks provide enough performance for batch workloads.

  • SSD persistent disks (as in the second option) increase costs unnecessarily unless the workload is very I/O-intensive.
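A quick back-of-the-envelope calculation shows why 50% preemptible workers cuts cost so sharply. The hourly prices below are hypothetical placeholders, not current GCP list prices.

```python
# Illustrative cost sketch for a 10-worker Dataproc cluster.
# Prices are assumed placeholders, not real GCP pricing.
STANDARD_PRICE = 0.20     # $/hour per standard worker (assumed)
PREEMPTIBLE_PRICE = 0.04  # $/hour per preemptible worker (assumed)

def cluster_cost(workers: int, preemptible_fraction: float) -> float:
    preemptible = int(workers * preemptible_fraction)
    standard = workers - preemptible
    return standard * STANDARD_PRICE + preemptible * PREEMPTIBLE_PRICE

all_standard = cluster_cost(10, 0.0)   # 10 standard workers
half_preempt = cluster_cost(10, 0.5)   # 5 standard + 5 preemptible
```

With these assumed prices, moving half the workers to preemptible VMs drops the hourly worker cost by 40%, while Dataproc's job rescheduling absorbs any preemptions for long-running batch work.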

7.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters and received an area under the curve (AUC) of 0.87 on the validation set. You want to increase the AUC of the model. What should you do?

Perform hyperparameter tuning

Train a classifier with deep neural networks, because neural networks would always beat SVMs

Deploy the model and measure the real-world AUC; it's always higher because of generalization

Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC

Answer explanation

Correct Answer:

Perform hyperparameter tuning

Why This Answer?

Hyperparameter tuning is a crucial step in improving the performance of machine learning models, including Support Vector Machines (SVMs). Since the current model has an AUC of 0.87, optimizing hyperparameters can help boost performance further.

Key hyperparameters to tune for an SVM:

  • Kernel type (linear, polynomial, radial basis function (RBF), sigmoid)

  • C (Regularization Parameter) → Controls the trade-off between maximizing the margin and minimizing classification errors.

  • Gamma (for RBF and polynomial kernels) → Controls the influence of a single training example.

  • Degree (for polynomial kernel) → Determines the complexity of the decision boundary.

Tuning these using Grid Search or Bayesian Optimization can maximize AUC by finding the best combination of hyperparameters.
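A grid search over exactly these hyperparameters can be sketched with scikit-learn. The synthetic dataset and the grid values below are illustrative only, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the team's validation data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid over C and gamma for an RBF-kernel SVM.
param_grid = {
    "C": [0.1, 1, 10],             # regularization strength
    "gamma": ["scale", 0.1, 1.0],  # RBF kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X, y)

# search.best_params_ holds the AUC-maximizing combination;
# search.best_score_ is the cross-validated AUC it achieved.
```

Scoring with `roc_auc` makes the search optimize the metric the question cares about directly, rather than plain accuracy.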
