CertyIQ - Google - Prof Data Eng - pt 5 University Quiz

1.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally. Because large parts of the globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?

Edge TPUs as sensor devices for storing and transmitting the messages.

Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.

An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.

A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.

2.

MULTIPLE SELECT QUESTION

45 sec • 1 pt

You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this? (Choose two.)

Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.

Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.

Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.

Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.

Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.

3.

MULTIPLE SELECT QUESTION

15 mins • 1 pt

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

Denormalize the data as much as possible.

Preserve the structure of the data as much as possible.

Use BigQuery UPDATE to further reduce the size of the dataset.

Develop a data pipeline where status updates are appended to BigQuery instead of updated.

Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.

4.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

Create a Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.

Store the data in BigQuery. Access the data using the BigQuery Connector on Dataproc and Compute Engine.

Store the data in a regional Cloud Storage bucket. Access the bucket directly using Dataproc, BigQuery, and Compute Engine.

Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Dataproc, BigQuery, and Compute Engine.

5.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

Store and process the entire dataset in BigQuery.

Store and process the entire dataset in Bigtable.

Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.

Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.

6.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?

Use Cloud Vision AutoML with the existing dataset.

Use Cloud Vision AutoML, but reduce your dataset twice.

Use Cloud Vision API by providing custom labels as recognition hints.

Train your own image recognition model leveraging transfer learning techniques.

7.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do?

Use Cloud TPUs without any additional adjustment to your code.

Use Cloud TPUs after implementing GPU kernel support for your custom ops.

Use Cloud GPUs after implementing GPU kernel support for your custom ops.

Stay on CPUs, and increase the size of the cluster you're training your model on.

8.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

Increase the share of the test sample in the train-test split.

Try to collect more data and increase the size of your dataset.

Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.

Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of the vocabularies or n-grams used.
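For reference, the quantities named in this question (a shuffled 90/10 train/test split and the RMSE on each part) can be sketched as follows. This is a minimal illustration only: the toy data, the `shuffle_split` helper, and the stand-in `predict` function are all hypothetical and are not part of the quiz.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-squared error: square root of the mean squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def shuffle_split(examples, train_fraction=0.9, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict(x):
    """Hypothetical stand-in for a trained model: always off by exactly 1."""
    return 2.0 * x + 1.0


# Hypothetical toy data where the true relationship is y = 2x.
examples = [(x, 2.0 * x) for x in range(100)]
train, test = shuffle_split(examples)

train_rmse = rmse([y for _, y in train], [predict(x) for x, _ in train])
test_rmse = rmse([y for _, y in test], [predict(x) for x, _ in test])
print(f"train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

Comparing the two values on the same scale is what the question relies on: the gap between train and test error is the signal to interpret.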

9.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.

Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.

Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.

Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.

10.

MULTIPLE CHOICE QUESTION

15 mins • 1 pt

The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error. What should you do?

Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit.

Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.

Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job.

Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.
