Google Professional Data Engineer - Part 2

Assessment • Quiz • Computers • Professional Development • Hard

Created by Steven Wong

137 questions

1.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You are developing an Apache Beam pipeline to extract data from a Cloud SQL instance by using JdbcIO. You have two projects running in Google Cloud. The pipeline will be deployed and executed on Dataflow in Project A. The Cloud SQL instance is running in Project B and does not have a public IP address. After deploying the pipeline, you noticed that the pipeline failed to extract data from the Cloud SQL instance due to a connection failure. You verified that VPC Service Controls and shared VPC are not in use in these projects. You want to resolve this error while ensuring that the data does not go through the public internet. What should you do?

Set up VPC Network Peering between Project A and Project B. Add a firewall rule to allow the peered subnet range to access all instances on the network.

Turn off the external IP addresses on the Dataflow worker. Enable Cloud NAT in Project A.

Add the external IP addresses of the Dataflow worker as authorized networks in the Cloud SQL instance.

Set up VPC Network Peering between Project A and Project B. Create a Compute Engine instance without external IP address in Project B on the peered subnet to serve as a proxy server to the Cloud SQL database.

Answer explanation

Cloud SQL supports private IP addresses through private service access. When you create a Cloud SQL instance, Cloud SQL creates the instance within its own virtual private cloud (VPC), called the Cloud SQL VPC. Enabling private IP requires setting up a peering connection between the Cloud SQL VPC and your VPC network.
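
For context, here is a minimal Apache Beam (Python) sketch of how the pipeline could read Cloud SQL over private connectivity once the networking is in place: the Dataflow workers are pinned to a subnetwork that can route to the Cloud SQL private IP, and public worker IPs are disabled so the data stays off the public internet. Project, subnet, IP, table, and credential values below are hypothetical.

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Worker networking: run on a subnetwork that can reach the Cloud SQL private IP
# and disable public worker IPs (same effect as passing --no_use_public_ips).
options = PipelineOptions(
    runner="DataflowRunner",
    project="project-a",                                    # hypothetical project ID
    region="us-central1",
    temp_location="gs://project-a-dataflow/tmp",            # hypothetical bucket
    subnetwork="regions/us-central1/subnetworks/peered-subnet",
    use_public_ips=False,
)

with beam.Pipeline(options=options) as p:
    _ = p | "ReadFromCloudSQL" >> ReadFromJdbc(
        table_name="orders",                                # hypothetical table
        driver_class_name="org.postgresql.Driver",
        jdbc_url="jdbc:postgresql://10.10.0.3:5432/appdb",  # Cloud SQL private IP
        username="beam_user",
        password="change-me",
    )
```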

2.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You currently have transactional data stored on-premises in a PostgreSQL database. To modernize your data environment, you want to run transactional workloads and support analytics needs with a single database. You need to move to Google Cloud without changing database management systems, and minimize cost and complexity. What should you do?

Migrate and modernize your database with Cloud Spanner.

Migrate your workloads to AlloyDB for PostgreSQL.

Migrate to BigQuery to optimize analytics.

Migrate your PostgreSQL database to Cloud SQL for PostgreSQL.

Answer explanation

The data is currently transactional and stored on-premises in a PostgreSQL database, and the goal is to modernize to a database that supports both transactional workloads and analytics without changing database management systems. Cloud SQL for PostgreSQL minimizes cost and complexity, and the analytics needs can be met with BigQuery federated queries over Cloud SQL for PostgreSQL (https://cloud.google.com/bigquery/docs/federated-queries-intro), which keeps the overall cost low.
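
As an illustration, here is a short Python sketch (using the google-cloud-bigquery client) of a federated query against Cloud SQL for PostgreSQL. The connection ID, table, and column names are hypothetical and assume a BigQuery connection resource to the Cloud SQL instance has already been created.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# EXTERNAL_QUERY pushes the inner statement down to the Cloud SQL (PostgreSQL)
# instance through a BigQuery connection resource, assumed here to be "us.postgres_conn".
sql = """
SELECT customer_id, SUM(amount) AS total_spend
FROM EXTERNAL_QUERY(
  'us.postgres_conn',
  'SELECT customer_id, amount FROM orders;'
)
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 10
"""

for row in client.query(sql).result():
    print(row["customer_id"], row["total_spend"])
```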

3.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You have a Standard Tier Memorystore for Redis instance deployed in a production environment. You need to simulate a Redis instance failover in the most accurate disaster recovery situation, and ensure that the failover has no impact on production data. What should you do?

Create a Standard Tier Memorystore for Redis instance in the development environment. Initiate a manual failover by using the limited-data-loss data protection mode.

Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode.

Increase one replica to Redis instance in production environment. Initiate a manual failover by using the force-data-loss data protection mode.

Initiate a manual failover by using the limited-data-loss data protection mode to the Memorystore for Redis instance in the production environment.

Answer explanation

The best option is B: create a Standard Tier Memorystore for Redis instance in a development environment and initiate a manual failover using the force-data-loss data protection mode. The key points:
• The failover should be tested in a separate development environment, not production, to avoid impacting real data.
• The force-data-loss mode triggers a full failover regardless of replication state, which is the most accurate simulation of a disaster.
• Limited-data-loss mode blocks the failover when too much unreplicated data would be lost, so it does not exercise the worst-case recovery path.
• Increasing replicas in production and failing over (C) risks losing real production data.
• Failing over production (D) also risks impacting real data and traffic.
Option B isolates the test from production and uses the most rigorous failover mode to fully validate disaster recovery capabilities.
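
For reference, here is a hedged Python sketch of triggering such a test failover with the google-cloud-redis client; the project, location, and instance names are hypothetical, and the call assumes the standard FailoverInstance API with the force-data-loss protection mode.

```python
from google.cloud import redis_v1

client = redis_v1.CloudRedisClient()

# Hypothetical dev-environment test instance.
name = "projects/my-dev-project/locations/us-central1/instances/redis-failover-test"

# force-data-loss triggers a full failover regardless of replication lag,
# which is the most realistic disaster-recovery simulation.
operation = client.failover_instance(
    request=redis_v1.FailoverInstanceRequest(
        name=name,
        data_protection_mode=redis_v1.FailoverInstanceRequest.DataProtectionMode.FORCE_DATA_LOSS,
    )
)
operation.result()  # blocks until the failover completes
print("Failover finished for", name)
```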

4.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You have a data processing application that runs on Google Kubernetes Engine (GKE). Containers need to be launched with their latest available configurations from a container registry. Your GKE nodes need to have GPUs, local SSDs, and 8 Gbps bandwidth. You want to efficiently provision the data processing infrastructure and manage the deployment process. What should you do?

Use Compute Engine startup scripts to pull container images, and use gcloud commands to provision the infrastructure.

Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images.

Use GKE to autoscale containers, and use gcloud commands to provision the infrastructure.

Use Dataflow to provision the data pipeline, and use Cloud Scheduler to run the job.

Answer explanation

Cloud Build can run Terraform to declaratively provision the GKE infrastructure (node pools with GPUs, local SSDs, and the required bandwidth) and deploy the containers with their latest images from the container registry. This manages both the infrastructure provisioning and the deployment process efficiently and repeatably, which the other options do not.

5.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You want to create a machine learning model using BigQuery ML and create an endpoint for hosting the model using Vertex AI. This will enable the processing of continuous streaming data in near-real time from multiple vendors. The data may contain invalid values. What should you do?

Create a new BigQuery dataset and use streaming inserts to land the data from multiple vendors. Configure your BigQuery ML model to use the "ingestion" dataset as the training data.

Use BigQuery streaming inserts to land the data from multiple vendors where your BigQuery dataset ML model is deployed.

Create a Pub/Sub topic and send all vendor data to it. Connect a Cloud Function to the topic to process the data and store it in BigQuery.

Create a Pub/Sub topic and send all vendor data to it. Use Dataflow to process and sanitize the Pub/Sub data and stream it to BigQuery.

Answer explanation

Dataflow provides a scalable and flexible way to process and clean the incoming data in real-time before loading it into BigQuery.
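
To make this concrete, here is a minimal Python Apache Beam sketch of the Pub/Sub-to-BigQuery path with a sanitization step; the topic, table, field names, and validity rule are hypothetical.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_and_clean(message: bytes):
    """Parse a Pub/Sub message and drop records with invalid values."""
    record = json.loads(message.decode("utf-8"))
    price = record.get("price")
    if price is None or float(price) < 0:       # hypothetical validity rule
        return                                  # skip invalid rows
    yield {"vendor_id": record["vendor_id"], "price": float(price)}


options = PipelineOptions(streaming=True)       # plus the usual Dataflow options

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/vendor-data")   # hypothetical topic
        | "ParseAndClean" >> beam.FlatMap(parse_and_clean)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.vendor_events",             # hypothetical table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
        )
    )
```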

6.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

Use Vertex AI for training existing Spark ML models

Rewrite your models on TensorFlow, and start using Vertex AI

Use Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery

Answer explanation

Option C is the most rapid way to migrate your existing training pipelines to Google Cloud: it lets you keep your existing Spark ML models, take advantage of the scalability and performance of Dataproc, and read data directly from BigQuery, which is a more efficient way to process large datasets than exporting them first.
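
For illustration, a brief PySpark sketch of reading training data directly from BigQuery on Dataproc via the spark-bigquery connector; the table name is hypothetical, and the existing Spark ML training code is assumed to remain unchanged.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ctr-training").getOrCreate()

# Read the training data directly from BigQuery via the spark-bigquery connector,
# which Dataproc images ship with (or which can be added as a connector jar).
clicks = (
    spark.read.format("bigquery")
    .option("table", "my-project.ads.click_events")   # hypothetical table
    .load()
)

# The existing Spark ML feature engineering and training code can stay as is,
# e.g. assembling features and fitting the current click-through-rate model on `clicks`.
```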

7.

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest streaming data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with as little latency as possible. What should you do?

Setup a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub, and write the data to BigQuery.

Use a proxy host in the VPC in Google Cloud connecting to Kafka. Write a Dataflow pipeline, read data from the proxy host, and write the data to BigQuery.

Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.

Setup a Kafka Connect bridge between Kafka and Pub/Sub. Write a Dataflow pipeline, read the data from Pub/Sub, and write the data to BigQuery.

Answer explanation

Latency: Option C, with direct integration between Kafka and Dataflow, offers lower latency by eliminating intermediate steps. Flexibility: Custom Dataflow pipelines (Option C) provide more control over data processing and optimization compared to using a pre-built template.
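
As a sketch, here is a minimal Python Apache Beam pipeline reading directly from the on-premises Kafka cluster (reachable over the interconnect) and streaming into BigQuery; the broker address, topic, table, and single-column schema are hypothetical.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions


def to_row(kv):
    """Kafka records arrive as (key, value) byte pairs; keep the value as a string."""
    _key, value = kv
    return {"payload": value.decode("utf-8")}   # hypothetical single-column schema


options = PipelineOptions(streaming=True)       # plus the usual Dataflow options

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "kafka.corp.internal:9092"},  # reachable over the interconnect
            topics=["ad-events"],                # hypothetical topic
        )
        | "Format" >> beam.Map(to_row)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:streaming.ad_events",    # hypothetical table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
        )
    )
```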
