
Data 151-175
Authored by Michael Caponpon
Professional Development
25 questions
1.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
You work for an advertising company, and you've developed a Spark ML model to predict click-through rates on advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?
Use Vertex AI for training existing Spark ML models
Rewrite your models in TensorFlow, and start using Vertex AI
Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery
Answer explanation
Correct Answer:
✅ Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
Why This Works
Dataproc is Google's managed Hadoop and Spark service, which allows you to run your existing Spark ML pipelines without major code changes.
BigQuery Connector for Spark lets Spark read directly from BigQuery, avoiding the need to export data manually.
This approach maintains continuity with your existing Spark ML pipeline while ensuring a smooth migration to Google Cloud.
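Because the Spark ML code itself is unchanged, the migration mostly amounts to swapping the input source. A minimal sketch of reading BigQuery directly from a Dataproc Spark job via the spark-bigquery connector (the project, dataset, and table names below are hypothetical placeholders; recent Dataproc images ship the connector by default):

```python
# Hypothetical sketch: read training data straight from BigQuery inside a
# Dataproc Spark job using the spark-bigquery connector (no manual export).
# The table name is a placeholder.

def load_clicks(spark, table: str = "my-project.ads.click_events"):
    """Return a Spark DataFrame backed directly by a BigQuery table."""
    return (
        spark.read.format("bigquery")
        .option("table", table)
        .load()
    )

# The rest of the existing Spark ML pipeline is unchanged:
#   df = load_clicks(spark)
#   model = existing_pipeline.fit(df)
```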
2.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?
BigQuery
Cloud Bigtable
Cloud Datastore
Cloud SQL for PostgreSQL
Answer explanation
Correct Answer:
✅ BigQuery
Why BigQuery?
BigQuery is the best choice because:
Scalability: It can efficiently handle 40 TB of data and continuous hourly updates.
Geospatial Processing: It has native GIS functions (ST_GEOGPOINT, ST_INTERSECTS, ST_DISTANCE, etc.), which are essential for analyzing ship locations in GeoJSON format.
Machine Learning Integration: BigQuery ML lets you train ML models directly without moving data to another platform.
Dashboards & Visualization: It integrates well with Looker, Data Studio, and third-party BI tools for real-time dashboards.
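As an illustration, prediction and geospatial processing can live in a single BigQuery query that feeds the dashboard. The model, dataset, table, and column names below are hypothetical:

```python
# Hypothetical BigQuery SQL combining BigQuery ML prediction with native GIS
# functions. Model, dataset, table, and column names are placeholders.

DELAYED_SHIPS_BY_REGION = """
SELECT
  region,
  COUNT(*) AS ships_at_risk
FROM ML.PREDICT(
  MODEL `shipping.delay_model`,
  (
    SELECT *, ST_GEOGFROMGEOJSON(location_geojson) AS position
    FROM `shipping.telemetry`
  )
)
WHERE predicted_is_delayed = 1  -- BigQuery ML prefixes the label column
GROUP BY region
"""
```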
3.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
You operate an IoT pipeline built around Apache Kafka that normally receives around 5000 messages per second. You want to use Google Cloud Platform to create an alert as soon as the moving average over 1 hour drops below 4000 messages per second. What should you do?
Consume the stream of data in Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes and send an alert if the average is less than 4000 messages.
Consume the stream of data in Dataflow using Kafka IO. Set a fixed time window of 1 hour. Compute the average when the window closes, and send an alert if the average is less than 4000 messages.
Use Kafka Connect to link your Kafka message queue to Pub/Sub. Use a Dataflow template to write your messages from Pub/Sub to Bigtable. Use Cloud Scheduler to run a script every hour that counts the number of rows created in Bigtable in the last hour. If that number falls below 4000, send an alert.
Use Kafka Connect to link your Kafka message queue to Pub/Sub. Use a Dataflow template to write your messages from Pub/Sub to BigQuery. Use Cloud Scheduler to run a script every five minutes that counts the number of rows created in BigQuery in the last hour. If that number falls below 4000, send an alert.
Answer explanation
Correct Answer:
✅ Consume the stream of data in Dataflow using Kafka IO. Set a sliding time window of 1 hour every 5 minutes. Compute the average when the window closes and send an alert if the average is less than 4000 messages.
Why This Is the Best Option
Streaming Analytics with Dataflow:
Google Cloud Dataflow (Apache Beam) can process real-time Kafka streams efficiently.
Sliding Window for Continuous Monitoring:
A sliding window (1-hour window, advancing every 5 minutes) ensures near-real-time detection of a drop in message rate, rather than waiting up to a full hour for a fixed window to close.
Immediate Alerts:
Dataflow can trigger an alert as soon as the moving average falls below 4000 messages per second.
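The windowing arithmetic can be sketched in plain Python. The real pipeline would use Beam's SlidingWindows in Dataflow; the traffic numbers below are purely illustrative:

```python
# Pure-Python sketch of the sliding-window logic: a 1-hour window advancing
# every 5 minutes, alerting when the average message rate falls below the
# threshold. Traffic numbers are illustrative.

WINDOW_MINUTES = 60
SLIDE_MINUTES = 5
THRESHOLD = 4000  # messages per second

def windows_below_threshold(per_minute_rates,
                            window=WINDOW_MINUTES,
                            slide=SLIDE_MINUTES,
                            threshold=THRESHOLD):
    """per_minute_rates: average msg/s, one entry per minute.
    Returns the start minutes of windows whose average is below threshold."""
    alerts = []
    for start in range(0, len(per_minute_rates) - window + 1, slide):
        window_avg = sum(per_minute_rates[start:start + window]) / window
        if window_avg < threshold:
            alerts.append(start)
    return alerts

# Two hours of traffic: healthy for the first hour, degraded afterwards.
rates = [5000] * 60 + [2000] * 60
print(windows_below_threshold(rates))
# → [25, 30, 35, 40, 45, 50, 55, 60]
```

A fixed 1-hour window would surface the same drop only once per hour; the sliding window flags it within one 5-minute slide of the moving average crossing the threshold.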
4.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure. What should you do?
Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.
Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.
Answer explanation
Correct Answer:
✅ Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
Why This Is the Best Option
Cloud SQL High Availability (HA) Setup:
Cloud SQL HA uses a primary instance with a failover replica in a different zone in the same region.
In the event of a zone failure, automatic failover occurs to the replica.
Automatic Failover Mechanism:
Cloud SQL HA relies on regional persistent disks to replicate data synchronously.
Failover is automated and minimizes downtime.
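In the current Cloud SQL Admin API, this primary-plus-standby-in-another-zone arrangement is requested declaratively: setting the instance's availability type to REGIONAL provisions the synchronously replicated standby. A hedged sketch of the instance configuration (all names and the machine tier are hypothetical):

```python
# Hypothetical sketch of a Cloud SQL Admin API instance body for high
# availability: availabilityType REGIONAL keeps a synchronously replicated
# standby in a different zone of the same region. Names are placeholders.

ha_instance = {
    "name": "orders-mysql",
    "region": "us-central1",
    "databaseVersion": "MYSQL_8_0",
    "settings": {
        "tier": "db-n1-standard-2",
        "availabilityType": "REGIONAL",  # standby in another zone
        "backupConfiguration": {"enabled": True, "binaryLogEnabled": True},
    },
}
```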
5.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data integration systems to address the requirements. The key requirements are:
The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured
Support for publish/subscribe semantics on hundreds of topics
Retention of per-key ordering
Which system should you choose?
Apache Kafka
Cloud Storage
Dataflow
Firebase Cloud Messaging
Answer explanation
Correct Answer:
✅ Apache Kafka
Why Apache Kafka?
Apache Kafka is the best choice because it meets all the key requirements:
Seek to a particular offset
Kafka allows consumers to seek to a specific offset in a topic.
You can rewind to the start of all data ever captured if retention policies allow.
Publish/Subscribe Semantics
Kafka natively supports pub/sub with multiple topics and partitions.
Can easily handle hundreds of topics efficiently.
Per-Key Ordering
Kafka guarantees ordering within a partition for messages with the same key.
Ensures that events for the same key (e.g., a specific user or device) are processed in order.
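The per-key ordering guarantee follows from deterministic key-to-partition assignment. Kafka's default producer partitioner actually uses murmur2 hashing; the md5-based hash below is a stand-in to illustrate the idea:

```python
import hashlib

# Sketch of why Kafka preserves per-key ordering: the producer maps each
# message key to a fixed partition, and ordering is guaranteed within a
# partition. (Kafka's real default partitioner uses murmur2; md5 here is
# a stand-in for illustration.)

NUM_PARTITIONS = 12

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every message for the same key lands on the same partition, so events
# for that key are consumed in the order they were produced.
assert partition_for("device-42") == partition_for("device-42")
```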
6.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service. What should you do?
Deploy a Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
Deploy a Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://
Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://
Answer explanation
Correct Answer:
✅ Deploy a Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://.
Why This Answer?
1. Managed Service → Google Cloud Dataproc
Dataproc is a fully managed service for running Apache Hadoop and Spark.
It automatically handles scaling, fault tolerance, and cluster lifecycle.
2. Cost Optimization → Preemptible Workers
Using 50% preemptible workers reduces costs significantly.
Preemptible VMs are much cheaper than standard instances but can be terminated anytime.
Dataproc handles preemptions gracefully by rescheduling the lost work on remaining workers.
3. Storage Optimization → Cloud Storage (gs://) instead of HDFS
Cloud Storage is more cost-effective and durable than HDFS.
Eliminates the need for managing an HDFS cluster.
Dataproc natively integrates with Cloud Storage using the Cloud Storage Connector.
Simply update Hadoop/Spark scripts to reference gs:// instead of hdfs://.
4. Standard Persistent Disk is Sufficient
Standard persistent disks provide enough performance for batch workloads.
SSD persistent disks (as in the second option) increase costs unnecessarily unless the workload is very I/O-intensive.
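The script change mentioned above is mechanical: only the path scheme needs updating. A sketch with a hypothetical bucket name:

```python
# Sketch: after data moves to Cloud Storage, Hadoop/Spark scripts only need
# their path scheme rewritten from hdfs:// to gs://. Bucket name is a
# placeholder.

def to_gcs(path: str, bucket: str = "my-dataproc-data") -> str:
    """Rewrite an HDFS URI to the equivalent Cloud Storage URI."""
    return path.replace("hdfs://", f"gs://{bucket}", 1)

print(to_gcs("hdfs:///warehouse/events/2024.parquet"))
# → gs://my-dataproc-data/warehouse/events/2024.parquet
```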
7.
MULTIPLE CHOICE QUESTION
30 sec • 1 pt
Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters and received an area under the curve (AUC) of 0.87 on the validation set. You want to increase the AUC of the model. What should you do?
Perform hyperparameter tuning
Train a classifier with deep neural networks, because neural networks would always beat SVMs
Deploy the model and measure the real-world AUC; it's always higher because of generalization
Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC
Answer explanation
Correct Answer:
✅ Perform hyperparameter tuning
Why This Answer?
Hyperparameter tuning is a crucial step in improving the performance of machine learning models, including Support Vector Machines (SVMs). Since the current model has an AUC of 0.87, optimizing hyperparameters can help boost performance further.
Key hyperparameters to tune for an SVM:
Kernel type (linear, polynomial, radial basis function (RBF), sigmoid)
C (Regularization Parameter) → Controls the trade-off between maximizing the margin and minimizing classification errors.
Gamma (for RBF and polynomial kernels) → Controls the influence of a single training example.
Degree (for polynomial kernel) → Determines the complexity of the decision boundary.
Tuning these with grid search or Bayesian optimization helps find the combination of hyperparameters that maximizes AUC.
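A minimal sketch of what grid search does: evaluate every combination in the grid and keep the best. The scorer here is a toy stand-in for training an SVM and computing validation AUC; in practice you would use something like scikit-learn's GridSearchCV:

```python
from itertools import product

# Pure-Python sketch of grid search over SVM hyperparameters. evaluate_auc
# is a stand-in for "train an SVM with these params, score AUC on the
# validation set".

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
    "kernel": ["rbf", "poly"],
}

def grid_search(grid, evaluate_auc):
    """Return (best_params, best_auc) over the Cartesian product of the grid."""
    names = list(grid)
    best_params, best_auc = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        auc = evaluate_auc(params)
        if auc > best_auc:
            best_params, best_auc = params, auc
    return best_params, best_auc

def toy_auc(p):
    # Toy scorer that peaks at C=1, gamma=0.1, kernel="rbf".
    return (0.9
            - 0.01 * abs(p["C"] - 1)
            - abs(p["gamma"] - 0.1)
            - (0.05 if p["kernel"] != "rbf" else 0.0))

best, best_auc = grid_search(param_grid, toy_auc)
print(best)
# → {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
```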