What mechanism allows Spark Declarative Pipelines to efficiently process only new data in subsequent runs?

Auto Loader combined with checkpoints to track ingestion progress

Manual watermarking defined in SQL queries

Full table rewrites on each pipeline run

Periodic batch scheduling that limits how often pipelines run
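
As a rough illustration of the idea behind this question, the sketch below models checkpoint-based incremental ingestion in plain Python. It is a toy model, not the Databricks or Auto Loader API: the function `ingest_new_files`, the in-memory `checkpoint` set, and the file names are all hypothetical.

```python
# Toy model of checkpoint-tracked ingestion (hypothetical names, not a Databricks API).
def ingest_new_files(available_files, checkpoint):
    """Return files not yet recorded in the checkpoint, then record them."""
    new_files = [f for f in sorted(available_files) if f not in checkpoint]
    checkpoint.update(new_files)  # remember what has been processed
    return new_files


checkpoint = set()

# First run: every file is new.
first_run = ingest_new_files({"a.json", "b.json"}, checkpoint)

# Second run: one more file has landed; only that file is picked up.
second_run = ingest_new_files({"a.json", "b.json", "c.json"}, checkpoint)

print(first_run)   # ['a.json', 'b.json']
print(second_run)  # ['c.json']
```

The point of the sketch is only that progress is persisted between runs, so a later run can skip work that is already done.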

What does the AUTO CDC INTO syntax in Lakeflow Declarative Pipelines accomplish?

Simplifies change data capture by incrementally applying inserts, updates, and deletes to a target table

Deletes all data in the pipeline

Automatically creates dashboards for CDC data
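
To make the change data capture idea concrete, here is a minimal pure-Python sketch of applying ordered inserts, updates, and deletes to a keyed target. It is an assumption-laden toy, not the AUTO CDC implementation: the function `apply_changes`, the event layout (`id`, `seq`, `op`, `row`), and the sample data are invented for illustration.

```python
# Toy model of applying a change feed to a keyed target (hypothetical event shape).
def apply_changes(target, events, key="id", sequence_by="seq"):
    """Apply change events in sequence order; the latest event per key wins."""
    for event in sorted(events, key=lambda e: e[sequence_by]):
        if event["op"] == "DELETE":
            target.pop(event[key], None)       # remove the row if present
        else:
            target[event[key]] = event["row"]  # INSERT or UPDATE acts as an upsert
    return target


target = {}
events = [
    {"id": 1, "seq": 1, "op": "INSERT", "row": {"name": "Ada"}},
    {"id": 2, "seq": 2, "op": "INSERT", "row": {"name": "Bo"}},
    {"id": 1, "seq": 3, "op": "UPDATE", "row": {"name": "Ada L."}},
    {"id": 2, "seq": 4, "op": "DELETE", "row": None},
]
apply_changes(target, events)
print(target)  # {1: {'name': 'Ada L.'}}
```

Sorting by a sequence column before applying is what lets out-of-order events resolve to a consistent final state in this sketch.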

When a streaming table in Spark Declarative Pipelines processes new data from its source, how is that data written to the target table?

New records are appended incrementally to the streaming table

Records are merged into the table using MERGE INTO semantics

Existing data in the table is completely overwritten

Data is stored only in a temporary view until the next refresh

In Spark Declarative Pipelines, what is the primary role of a materialized view?

Producing aggregated or derived results from upstream tables

Raw data ingestion from external sources

Enforcing schemas on incoming data

Tracking every historical change to records over time

Which of the following statements best describes the core purpose of Lakeflow Spark Declarative Pipelines?

It acts as a declarative framework that lets you define incremental batch or streaming data pipelines in SQL or Python, while handling orchestration, incremental processing, and failure recovery automatically.

It is only for batch ETL jobs and does not support real-time or streaming data.

It only handles ingestion, and transformations must be handled by separate Spark jobs outside the framework.

It’s a low-level library requiring manual orchestration of Spark Structured Streaming jobs.

In Lakeflow Spark Declarative Pipelines, what is the primary purpose of adding an expectation when defining a streaming table or materialized view?

To apply a data quality constraint that validates each record as it flows through, and optionally drop or flag invalid records

To automatically partition data based on quality checks

To convert JSON strings into structured columns for better querying

To enforce a static schema so that all data must match specified columns exactly
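
The per-record validation described in this question can be sketched in plain Python. This is a toy model under stated assumptions, not the pipeline engine: `apply_expectation` and its `on_violation` values (`"warn"`, `"drop"`, `"fail"`) are hypothetical names chosen to echo the keep, drop, and fail behaviors an expectation can have.

```python
# Toy model of a data quality expectation (hypothetical helper, not a Databricks API).
def apply_expectation(rows, predicate, on_violation="warn"):
    """Validate each row; return (rows kept, number of violations)."""
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if on_violation == "warn":
            kept.append(row)  # keep the invalid row, but count it
        elif on_violation == "fail":
            raise ValueError(f"expectation violated by row: {row}")
        # on_violation == "drop": the invalid row is skipped


    return kept, violations


rows = [{"id": 1}, {"id": None}, {"id": 3}]
has_id = lambda row: row["id"] is not None

warned, warn_count = apply_expectation(rows, has_id, on_violation="warn")
dropped, drop_count = apply_expectation(rows, has_id, on_violation="drop")

print(len(warned), warn_count)    # 3 1
print(len(dropped), drop_count)   # 2 1
```

In both modes the violation is counted, which is the part that would surface as a data quality metric.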

You define a Lakeflow Spark Declarative Pipeline where:
- A bronze streaming table ingests new JSON files as they arrive in cloud storage.
- A downstream silver streaming table applies transformations such as filtering and enrichment on the bronze streaming table.
- A materialized view aggregates the transformed results.

When new files arrive in cloud storage and the pipeline is run, what happens?

Only the new files are ingested, downstream streaming tables are updated incrementally, and the materialized view is incrementally updated when possible.

Only the new files are ingested and downstream streaming tables are updated incrementally, but the materialized view is always fully recomputed.

Only the new files are ingested and downstream streaming tables are updated incrementally, but the materialized view must be manually refreshed.

The entire pipeline is re-run from scratch, reprocessing all historical data to produce updated results.

A team has an incremental batch Spark Declarative Pipeline that processes new files daily. The data source begins delivering files continuously, and the team wants near-real-time processing without rewriting transformations. What change is required?

Change the pipeline trigger from scheduled to continuous execution

Add manual watermark logic to every query

Rewrite all tables as streaming-only SQL

Convert the pipeline to a notebook-based streaming job

What is the purpose of the event log in Lakeflow Spark Declarative Pipelines, and what information does it provide?

It tracks pipeline runs, including start time, end time, number of records processed, and any errors or warnings from expectations or transformations

It records user access and permissions changes to pipeline definitions

It logs only schema changes made to target tables during schema evolution

It stores the raw data ingested by streaming tables before processing

You are building a Spark Declarative Pipeline with the following requirements:
- A Silver streaming table orders_silver already exists and is updated incrementally
- You want to create a Gold-layer dataset that aggregates order counts by customer
- The aggregated results should be stored as an object and incrementally updated when possible as new data arrives
- You want the pipeline engine to manage refresh logic automatically

Which SQL definition best meets these requirements?

CREATE OR REFRESH MATERIALIZED VIEW customer_order_summary AS SELECT customer_id, COUNT(order_id) AS order_count FROM orders_silver GROUP BY customer_id;

CREATE STREAMING TABLE customer_order_summary AS SELECT customer_id, COUNT(order_id) AS order_count FROM STREAM(orders_silver) GROUP BY customer_id;

INSERT INTO customer_order_summary SELECT customer_id, COUNT(order_id) FROM orders_silver GROUP BY customer_id;

CREATE OR REPLACE VIEW customer_order_summary AS SELECT customer_id, COUNT(order_id) AS order_count FROM orders_silver GROUP BY customer_id;

You are building a Spark Declarative Pipeline with the following datasets:
- A streaming table orders_stream that ingests new orders as they arrive.
- A static table customers_dim containing customer attributes that change infrequently.

You want to enrich each incoming order with customer details while allowing the pipeline to handle incremental processing and execution order. Which approach best fits Spark Declarative Pipelines?

Join the streaming table with the static table in a declarative transformation and let the pipeline manage incremental updates

Manually cache the static table inside the pipeline for each run

Run a separate batch job to periodically join and overwrite the results

Convert customers_dim into a streaming table so both inputs are streaming

Lakeflow Spark Pipelines

Authored by โปร แกรมเมอร์

English

University

Used 7+ times

20 questions

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

In Spark Declarative Pipelines, what is the primary difference between a streaming table and a materialized view?

Streaming tables are used only for raw data, while materialized views are used only for final reporting tables

Streaming tables always recompute all data on each run, while materialized views never recompute data

Streaming tables ingest and incrementally process incoming data from a source, while materialized views incrementally maintain the results of a query over upstream tables when possible

Streaming tables are used only for batch workloads, while materialized views are used only for streaming workloads

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You are migrating a traditional batch ETL workflow, in which the entire dataset is fully reprocessed each time the job runs, into a Lakeflow Spark Declarative Pipeline using a Bronze > Silver > Gold architecture. In the existing workflow:
- Raw files are ingested, cleaned, joined, and aggregated in a single batch job.
- Each run reprocesses all historical data, not just newly arrived data.

Which approach best reflects how this workflow should be redesigned using Spark Declarative Pipelines?

Keep the single batch job and run it more frequently to reduce data latency

Split the logic into separate batch jobs and manually orchestrate them in sequence

Define a Bronze streaming table for raw ingestion, Silver tables for cleansing and enrichment, and Gold materialized views for aggregations, allowing the pipeline to manage dependencies and incremental updates

Convert the batch SQL into Python and execute it unchanged in a pipeline notebook

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You have a Lakeflow Spark Declarative Pipeline that includes streaming tables, downstream transformations, and materialized views.
You want to:
- Delete all pipeline checkpoints
- Clear all data from streaming tables
- Reprocess all source data from scratch
- Fully rebuild all downstream tables and materialized views

Which action should you take?

Select delete all and run the pipeline

Manually delete the pipeline and start it again

Run the pipeline with a full table refresh

Run the pipeline with different settings

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

When running a Spark Declarative Pipeline for the second time after landing new data, how many rows should be processed?

Zero rows, requiring a manual refresh

The original rows only

All rows in the source volume

Only the new rows added since the last run

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

When using batch notebook-based ETL for large data volumes, what processing or cost consideration often motivates teams to migrate to Spark Declarative Pipelines?

Batch notebook ETL often fully reprocesses data on each run, increasing compute cost, whereas Spark Declarative Pipelines manage incremental processing automatically

Batch notebook ETL does not support distributed processing, while Spark Declarative Pipelines do

Batch notebook ETL cannot use Auto Loader, while Spark Declarative Pipelines can

Batch notebook ETL cannot be scheduled, while Spark Declarative Pipelines can

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

Assume you have JSON log files arriving continuously in cloud storage at /Volumes/logs/events. You want to ingest them into a streaming table called events_bronze.

Which SQL statement correctly defines this streaming table in Databricks SQL?

CREATE STREAMING TABLE events_bronze AS SELECT * FROM read_files('/Volumes/logs/events', format => 'json');

CREATE STREAMING TABLE events_bronze AS SELECT * FROM read_files('/Volumes/logs/events', format => 'json') SCHEDULE EVERY 1 HOUR;

CREATE OR REFRESH STREAMING TABLE events_bronze AS SELECT * FROM STREAM read_files('/Volumes/logs/events', format => 'json');

CREATE OR REFRESH TABLE events_bronze AS SELECT * FROM read_files('/Volumes/logs/events', format => 'json');

MULTIPLE CHOICE QUESTION

30 sec • 1 pt

You are building a pipeline in Lakeflow Spark Declarative Pipelines. You want to make the path to your input data configurable depending on the environment (dev, test, prod). You set a pipeline configuration parameter named input_path. Which SQL snippet correctly references this parameter to define a streaming table that uses that path?

CREATE OR REFRESH STREAMING TABLE raw_events AS SELECT * FROM STREAM read_files(input_path, format => 'json');

CREATE OR REFRESH STREAMING TABLE raw_events AS SELECT * FROM STREAM read_files('${input_path}', format => 'json');

CREATE OR REFRESH STREAMING TABLE raw_events AS SELECT * FROM STREAM read_files({input_path}, format => 'json');
