Google Cloud Certified - Professional Data Engineer

Ace Your Professional Data Engineer Certification with Practice Exams.

Google Cloud Certified – Professional Data Engineer – Practice Exam (50 Questions)


Question 1

You are building a model to predict whether or not it will rain on a given day.
You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy.
What can you do?

  • A. Eliminate features that are highly correlated to the output labels.
  • B. Combine highly co-dependent features into one representative feature.
  • C. Instead of feeding in each feature individually, average their values in batches of 3.
  • D. Remove the features that have null values for more than 50% of the training records.

Correct Answer: B


Question 2

You are building a new application that you need to collect data from in a scalable way.
Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
– Decoupling producer from consumer.
– Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely.
– Near real-time SQL query.
– Maintain at least 2 years of historical data, which will be queried with SQL.
Which pipeline should you use to meet these requirements?

  • A. Create an application that provides an API. Write a tool to poll the API and write data to Google Cloud Storage as gzipped JSON files.
  • B. Create an application that writes to a Google Cloud SQL database to store the data. Set up periodic exports of the database to write to Google Cloud Storage and load into Google BigQuery.
  • C. Create an application that publishes events to Google Cloud Pub/Sub, and create Spark jobs on Google Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
  • D. Create an application that publishes events to Google Cloud Pub/Sub, and create a Google Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Google Cloud Storage and Google BigQuery.

Correct Answer: D


Question 3

You are building a new data pipeline to share data between two different types of applications: job generators and job runners.
Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones.
What should you do?

  • A. Create an API using Google App Engine to receive and send messages to the applications.
  • B. Use a Google Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them.
  • C. Create a table on Google Cloud SQL, and insert and delete rows with the job information.
  • D. Create a table on Google Cloud Spanner, and insert and delete rows with the job information.

Correct Answer: B



Question 4

You are building an application to share financial market data with consumers, who will receive data feeds.
Data is collected from the markets in real time.
Consumers will receive the data in the following ways:
– Real-time event stream.
– ANSI SQL access to real-time stream and historical data.
– Batch historical exports.
Which solution should you use?

  • A. Google Cloud Dataflow, Google Cloud SQL, Google Cloud Spanner
  • B. Google Cloud Pub/Sub, Google Cloud Storage, Google BigQuery
  • C. Google Cloud Dataproc, Google Cloud Dataflow, Google BigQuery
  • D. Google Cloud Pub/Sub, Google Cloud Dataproc, Google Cloud SQL

Correct Answer: B


Question 5

You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts.
There is no guarantee that data will be sent only once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data.
Which query type should you use?

  • A. Include ORDER BY DESC on timestamp column and LIMIT to 1.
  • B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
  • C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
  • D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Correct Answer: D
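
For reference, a minimal sketch of the ROW_NUMBER approach from option D, run through the google-cloud-bigquery Python client. The project, dataset, table, and column names are placeholders, not part of the question:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Keep only the most recent row per unique ID; duplicates get row_num > 1.
    dedup_sql = """
    SELECT * EXCEPT(row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY unique_id
                           ORDER BY event_timestamp DESC) AS row_num
      FROM `my-project.mydataset.events`
    )
    WHERE row_num = 1
    """
    for row in client.query(dedup_sql).result():
        print(row)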


Question 6

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices.
The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required. You need to analyze the data by querying against individual fields.
Which three databases meet your requirements? (Choose three.)

  • A. Redis
  • B. HBase
  • C. MySQL
  • D. MongoDB
  • E. Cassandra
  • F. HDFS with Hive

Correct Answer: B, D, E


Question 7

You are creating a model to predict housing prices.
Due to budget constraints, you must run it on a single resource-constrained virtual machine.
Which learning algorithm should you use?

  • A. Linear regression
  • B. Logistic classification
  • C. Recurrent neural network
  • D. Feedforward neural network

Correct Answer: A


Question 8

You are creating a new pipeline in Google Cloud to stream IoT data from Google Cloud Pub/Sub through Google Cloud Dataflow to Google BigQuery.
While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Google Cloud Dataflow pipeline to filter out this corrupt data.
What should you do?

  • A. Add a SideInput that returns a Boolean if the element is corrupt.
  • B. Add a ParDo transform in Google Cloud Dataflow to discard corrupt elements.
  • C. Add a Partition transform in Google Cloud Dataflow to separate valid data from corrupt data.
  • D. Add a GroupByKey transform in Google Cloud Dataflow to group all of the valid data together and discard the rest.

Correct Answer: B

Reference contents:
ParDo
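
As an illustration of option B, here is a minimal Apache Beam (Python) ParDo that drops corrupt elements. The validity check and field names are assumptions for the sketch, not part of the question:

    import json
    import apache_beam as beam

    class DropCorrupt(beam.DoFn):
        """Yield only elements that parse as JSON and carry the expected fields."""
        def process(self, element):
            try:
                record = json.loads(element)
            except (TypeError, ValueError):
                return  # discard corrupt payloads
            if "device_id" in record and "reading" in record:
                yield record

    # Usage inside the pipeline:
    # clean = messages | "DropCorrupt" >> beam.ParDo(DropCorrupt())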


Question 9

You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally.
You need to process, store and analyze these very large datasets in real time.
What should you do?

  • A. Send the data to Google Cloud Datastore and then export to Google BigQuery.
  • B. Send the data to Google Cloud Pub/Sub, stream Google Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
  • C. Send the data to Google Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
  • D. Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Google Cloud Storage, and run an analysis as needed.

Correct Answer: B


Question 10

You are deploying a new storage system for your mobile application, which is a media streaming service.
You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity ‘Movie’ the property ‘actors’ and the property ‘tags’ have multiple values but the property ‘date released’ does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released.
How should you avoid a combinatorial explosion in the number of indexes?

  • A. Manually configure the index in your index config as follows:
    indexes:
    - kind: Movie
      properties:
      - name: actors
      - name: date_released
    - kind: Movie
      properties:
      - name: tags
      - name: date_released
  • B. Manually configure the index in your index config as follows:
    indexes:
    - kind: Movie
      properties:
      - name: actors
      - name: date_published
  • C. Set the following in your entity options:
    exclude_from_indexes = 'actors, tags'
  • D. Set the following in your entity options:
    exclude_from_indexes = 'date_published'

Correct Answer: A

Reference contents:
Indexes | Cloud Datastore Documentation


Question 11

You are deploying MariaDB SQL databases on Google Compute Engine VM Instances and need to configure monitoring and alerting.
You want to collect metrics including network connections, disk IO and replication status from MariaDB with minimal development effort and use StackDriver for dashboards and alerts.
What should you do?

  • A. Install the OpenCensus Agent and create a custom metric collection application with a StackDriver exporter.
  • B. Place the MariaDB instances in an Instance Group with a Health Check.
  • C. Install the StackDriver Logging Agent and configure fluentd in_tail plugin to read MariaDB logs.
  • D. Install the StackDriver Agent and configure the MySQL plugin.

Correct Answer: D


Question 12

You are designing a basket abandonment system for an ecommerce company.
The system will send a message to a user based on these rules:
– No interaction by the user on the site for 1 hour.
– Has added more than $30 worth of products to the basket.
– Has not completed a transaction.
You use Google Cloud Dataflow to process the data and decide if a message should be sent.
How should you design the pipeline?

  • A. Use a fixed-time window with a duration of 60 minutes.
  • B. Use a sliding time window with a duration of 60 minutes.
  • C. Use a session window with a gap time duration of 60 minutes.
  • D. Use a global window with a time based trigger with a delay of 60 minutes.

Correct Answer: C
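
A minimal Apache Beam (Python) sketch of the session windowing in option C. The in-memory test events and their timestamps are placeholders; in the real pipeline the keyed events would come from the site's activity stream:

    import apache_beam as beam
    from apache_beam.transforms import window

    with beam.Pipeline() as p:
        user_events = p | beam.Create([
            window.TimestampedValue(("user-1", "view_item"), 0),
            window.TimestampedValue(("user-1", "add_to_cart"), 120),
        ])
        sessions = (
            user_events
            | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=60 * 60))
            | "GroupPerUser" >> beam.GroupByKey()
        )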


Question 13

You are designing a cloud-native historical data processing system to meet the following conditions:
– The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Google Cloud Dataproc, Google BigQuery, and Google Compute Engine.
– A streaming data pipeline stores new data daily.
– Performance is not a factor in the solution.
– The solution design should maximize availability.
How should you design data storage for this solution?

  • A. Create a Google Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
  • B. Store the data in Google BigQuery. Access the data using the Google BigQuery Connector on Google Cloud Dataproc and Google Compute Engine.
  • C. Store the data in a regional Google Cloud Storage bucket. Access the bucket directly using Google Cloud Dataproc, Google BigQuery, and Google Compute Engine.
  • D. Store the data in a multi-regional Google Cloud Storage bucket. Access the data directly using Google Cloud Dataproc, Google BigQuery, and Google Compute Engine.

Correct Answer: D


Question 14

You are designing a data processing pipeline.
The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour.
How should you design the solution?

  • A. Use Apache Kafka for message ingestion and use Google Cloud Dataproc for streaming analysis.
  • B. Use Apache Kafka for message ingestion and use Google Cloud Dataflow for streaming analysis.
  • C. Use Google Cloud Pub/Sub for message ingestion and Google Cloud Dataproc for streaming analysis.
  • D. Use Google Cloud Pub/Sub for message ingestion and Google Cloud Dataflow for streaming analysis.

Correct Answer: D


Question 15

You are designing an Apache Beam pipeline to enrich data from Google Cloud Pub/Sub with static reference data from Google BigQuery.
The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to Google BigQuery for analysis.
Which job type and transforms should this pipeline use?

  • A. Batch job, PubSubIO, side-inputs
  • B. Streaming job, PubSubIO, Jdbc IO, side-outputs
  • C. Streaming job, PubSubIO, BigQueryIO, side-inputs
  • D. Streaming job, PubSubIO, BigQueryIO, side-outputs

Correct Answer: C
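
A condensed Apache Beam (Python) sketch of option C, reading the small reference table once and passing it to the streaming branch as a side input. The query, subscription, and field names are placeholders:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        # Small, bounded reference data from BigQuery, used as an in-memory side input.
        ref = (
            p
            | "ReadRef" >> beam.io.ReadFromBigQuery(
                query="SELECT sku, category FROM `my-project.ref.products`",
                use_standard_sql=True)
            | "ToKV" >> beam.Map(lambda row: (row["sku"], row["category"]))
        )

        enriched = (
            p
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")
            | "Parse" >> beam.Map(json.loads)
            | "Enrich" >> beam.Map(
                lambda event, lookup: dict(event, category=lookup.get(event.get("sku"))),
                lookup=beam.pvalue.AsDict(ref))
            # ... then write the enriched PCollection with beam.io.WriteToBigQuery.
        )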


Question 16

You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud.
Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Google Cloud Storage with multiple engines.
Which storage service and schema design should you use?

  • A. Use Google Cloud Bigtable for storage. Install the HBase shell on a Google Compute Engine instance to query the Google Cloud Bigtable data.
  • B. Use Google Cloud Bigtable for storage. Link as permanent tables in Google BigQuery for query.
  • C. Use Google Cloud Storage for storage. Link as permanent tables in Google BigQuery for query.
  • D. Use Google Cloud Storage for storage. Link as temporary tables in Google BigQuery for query.

Correct Answer: C

Reference contents:
Quotas and limits | BigQuery
Cloud Storage as a data lake | Architectures


Question 17

You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud.
You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns.
What should you do?

  • A. Use Google Cloud SQL for storage. Add secondary indexes to support query patterns.
  • B. Use Google Cloud SQL for storage. Use Google Cloud Dataflow to transform data to support query patterns.
  • C. Use Google Cloud Spanner for storage. Add secondary indexes to support query patterns.
  • D. Use Google Cloud Spanner for storage. Use Google Cloud Dataflow to transform data to support query patterns.

Correct Answer: C

Reference contents:
Secondary indexes | Cloud Spanner
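
For option C, adding a secondary index is a DDL operation. A small sketch using the google-cloud-spanner Python client; the instance, database, table, and column names are placeholders:

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-database")

    # Secondary index to support range queries on a non-key column.
    operation = database.update_ddl(
        ["CREATE INDEX OrdersByShipDate ON Orders(ShipDate)"]
    )
    operation.result()  # block until the schema change completes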


Question 18

You are designing storage for very large text files for a data pipeline on Google Cloud.
You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices.
What should you do?

  • A. Transform text files to compressed Avro using Google Cloud Dataflow. Use Google BigQuery for storage and query.
  • B. Transform text files to compressed Avro using Google Cloud Dataflow. Use Google Cloud Storage and Google BigQuery permanent linked tables for query.
  • C. Compress text files to gzip using the Grid Computing Tools. Use Google BigQuery for storage and query.
  • D. Compress text files to gzip using the Grid Computing Tools. Use Google Cloud Storage, and then import into Google Cloud Bigtable for query.

Correct Answer: A

Reference contents:
Loading Avro data from Cloud Storage | BigQuery


Question 19

You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat.
Here is some of the information you need to store:
– The user profile: What the user likes and doesn’t like to eat.
– The user account information: Name, address, preferred meal times.
– The order information: When orders are made, from where, to whom.
The database will be used to store all the transactional data of the product. You want to optimize the data schema.
Which Google Cloud Platform product should you use?

  • A. Google BigQuery
  • B. Google Cloud SQL
  • C. Google Cloud Bigtable
  • D. Google Cloud Datastore

Correct Answer: B


Question 20

You are developing a software application using the Google Cloud Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline.
Which component will be used for the data processing operation?

  • A. PCollection
  • B. Transform
  • C. Pipeline
  • D. Sink API

Correct Answer: B

In Google Cloud, the Dataflow SDK provides the transform component, which is responsible for the data processing operation. You can use conditionals, for loops, and other complex programming structures to create a branching pipeline.

Reference contents:
Cloud Dataflow Programming Model


Question 21

You are developing an application on Google Cloud that will automatically generate subject labels for users’ blog posts.
You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning.
What should you do?

  • A. Call the Google Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
  • B. Call the Google Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
  • C. Build and train a text classification model using TensorFlow. Deploy the model using Google Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
  • D. Build and train a text classification model using TensorFlow. Deploy the model using a Google Kubernetes Engine cluster. Call the model from your application and process the results as labels.

Correct Answer: A


Question 22

You are developing an application that uses a recommendation engine on Google Cloud.
Your solution should display new videos to customers based on past views. Your solution needs to generate labels for the entities in videos that the customer has viewed. Your design must be able to provide very fast filtering suggestions based on data from other customer preferences on several TB of data.
What should you do?

  • A. Build and train a complex classification model with Spark MLlib to generate labels and filter the results. Deploy the models using Google Cloud Dataproc. Call the model from your application.
  • B. Build and train a classification model with Spark MLlib to generate labels. Build and train a second classification model with Spark MLlib to filter results to match customer preferences. Deploy the models using Google Cloud Dataproc. Call the models from your application.
  • C. Build an application that calls the Google Cloud Video Intelligence API to generate labels. Store data in Google Cloud Bigtable, and filter the predicted labels to match the user’s viewing history to generate preferences.
  • D. Build an application that calls the Google Cloud Video Intelligence API to generate labels. Store data in Google Cloud SQL, and join and filter the predicted labels to match the user’s viewing history to generate preferences.

Correct Answer: C


Question 23

You are implementing security best practices on your data pipeline.
Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?

  • A. Restrict the Google Cloud Storage bucket so only you can see the files.
  • B. Grant the Project Owner role to a service account, and run the job with it.
  • C. Use a service account with the ability to read the batch files and to write to Google BigQuery.
  • D. Use a user account with the Project Viewer role on the Google Cloud Dataproc cluster to read the batch files and write to Google BigQuery.

Correct Answer: C

Reference contents:
Dataproc permissions and IAM roles | Dataproc Documentation


Question 24

You are implementing several batch jobs that must be executed on a schedule.
These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in Google BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times.
Which service should you use to manage the execution of these jobs?

  • A. Google Cloud Scheduler
  • B. Google Cloud Dataflow
  • C. Google Cloud Functions
  • D. Google Cloud Composer

Correct Answer: D


Question 25

You are integrating one of your internal IT applications and Google BigQuery, so users can query Google BigQuery from the application’s interface.
You do not want individual users to authenticate to Google BigQuery and you do not want to give them access to the dataset. You need to securely access Google BigQuery from your IT application.
What should you do?

  • A. Create groups for your users and give those groups access to the dataset.
  • B. Integrate with a single sign-on (SSO) platform, and pass each user’s credentials along with the query request.
  • C. Create a service account and grant dataset access to that account. Use the service account’s private key to access the dataset.
  • D. Create a dummy user and grant dataset access to that user. Store the username and password for that user in a file on the files system, and use those credentials to access the Google BigQuery dataset.

Correct Answer: C
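
A minimal sketch of option C using the google-cloud-bigquery and google-auth Python libraries. The key-file path, dataset, and query are placeholders, and in practice the key should live in a secret store rather than on local disk:

    from google.cloud import bigquery
    from google.oauth2 import service_account

    # Service account granted, e.g., BigQuery Data Viewer on the dataset.
    credentials = service_account.Credentials.from_service_account_file(
        "/secrets/bq-reader-key.json")
    client = bigquery.Client(credentials=credentials,
                             project=credentials.project_id)

    rows = client.query(
        "SELECT report_date, revenue FROM `my-project.sales.daily` LIMIT 10"
    ).result()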


Question 26

You are managing a Google Cloud Dataproc cluster.
You need to make a job run faster while minimizing costs, without losing work in progress on your clusters.
What should you do?

  • A. Increase the cluster size with more non-preemptible workers.
  • B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
  • C. Increase the cluster size with preemptible worker nodes, and use Google Stackdriver to trigger a script to preserve work.
  • D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.

Correct Answer: D

Reference contents:
Dataproc Enhanced Flexibility Mode | Dataproc Documentation


Question 27

You are migrating your data warehouse to Google BigQuery.
You have migrated all of your data into tables in a dataset. Multiple users from your organization will be using the data. They should only see certain tables based on their team membership.
How should you set user permissions?

  • A. Assign the users/groups data viewer access at the table level for each table.
  • B. Create SQL views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the SQL views.
  • C. Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views.
  • D. Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside.

Correct Answer: D

Reference contents:
Creating an authorized view | BigQuery
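
For illustration, a sketch of the authorized-view mechanics behind option D using the google-cloud-bigquery Python client: create the view in a per-team dataset, then authorize it against the source dataset. All project, dataset, table names, and the filter are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # 1) Create the team's view in its own dataset.
    view = bigquery.Table("my-project.team_sales_views.orders_sales")
    view.view_query = (
        "SELECT * FROM `my-project.warehouse.orders` WHERE team = 'sales'"
    )
    view = client.create_table(view)

    # 2) Authorize the view to read the source dataset.
    source = client.get_dataset("my-project.warehouse")
    entries = list(source.access_entries)
    entries.append(
        bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
    )
    source.access_entries = entries
    client.update_dataset(source, ["access_entries"])

    # 3) Grant the team's group data viewer access on team_sales_views (via IAM).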


Question 28

You are operating a Google Cloud Dataflow streaming pipeline.
The pipeline aggregates events from a Google Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Google Cloud Storage bucket. The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Google Stackdriver to ensure that it is processing data.
Which Stackdriver alerts should you create?

  • A. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/ used_bytes for the destination.
  • B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/ used_bytes for the destination.
  • C. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/ num_undelivered_messages for the destination.
  • D. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/ num_undelivered_messages for the destination.

Correct Answer: B


Question 29

You are operating a streaming Google Cloud Dataflow pipeline.
Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update.
What should you do?

  • A. Update the Google Cloud Dataflow pipeline inflight by passing the –update option with the –jobName set to the existing job name.
  • B. Update the Google Cloud Dataflow pipeline inflight by passing the –update option with the –jobName set to a new unique job name.
  • C. Stop the Google Cloud Dataflow pipeline with the Cancel option. Create a new Google Cloud Dataflow job with the updated code.
  • D. Stop the Google Cloud Dataflow pipeline with the Drain option. Create a new Google Cloud Dataflow job with the updated code.

Correct Answer: A

Reference contents:
Updating an existing pipeline | Cloud Dataflow


Question 30

You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud.
You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service.
What should you do?

  • A. Deploy a Google Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Google Cloud Storage, and change references in scripts from hdfs:// to gs://
  • B. Deploy a Google Cloud Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Google Cloud Storage, and change references in scripts from hdfs:// to gs://
  • C. Install Hadoop and Spark on a 10-node Google Compute Engine instance group with standard instances. Install the Google Cloud Storage connector, and store the data in Google Cloud Storage. Change references in scripts from hdfs:// to gs://
  • D. Install Hadoop and Spark on a 10-node Google Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://

Correct Answer: A


Question 31

You are planning to use Google Cloud Dataflow SDK to analyze customer data such as displayed below.
Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.
Tom, 555 X street
Tim, 553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?

  • A. ParDo
  • B. Sink API
  • C. Source API
  • D. Data extraction

Correct Answer: A

In the Google Cloud Dataflow SDK, you can use a ParDo to extract only the customer name from each element in your PCollection.

Reference contents:
Parallel Processing with ParDo
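
A minimal Beam (Python) version of that ParDo; the input elements mirror the sample rows above:

    import apache_beam as beam

    class ExtractCustomerName(beam.DoFn):
        def process(self, element):
            # "Tom, 555 X street" -> "Tom"
            yield element.split(",")[0].strip()

    with beam.Pipeline() as p:
        names = (
            p
            | beam.Create(["Tom, 555 X street", "Tim, 553 Y street", "Sam, 111 Z street"])
            | "ExtractName" >> beam.ParDo(ExtractCustomerName())
        )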


Question 32

You are responsible for writing your company’s ETL pipelines to run on an Apache Hadoop cluster.
The pipeline will require some checkpointing and splitting of pipelines.
Which method should you use to write the pipelines?

  • A. PigLatin using Pig
  • B. HiveQL using Hive
  • C. Java using MapReduce
  • D. Python using MapReduce

Correct Answer: A


Question 33

You are running a pipeline in Google Cloud Dataflow that receives messages from a Google Cloud Pub/Sub topic and writes the results to a Google BigQuery dataset in the EU.
Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization.
Which two actions can you take to increase performance of your pipeline? (Choose two.)

  • A. Increase the number of max workers.
  • B. Use a larger instance type for your Google Cloud Dataflow workers.
  • C. Change the zone of your Google Cloud Dataflow pipeline to run in us-central1.
  • D. Create a temporary table in Google Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Google Cloud Bigtable to Google BigQuery.
  • E. Create a temporary table in Google Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Google Cloud Spanner to Google BigQuery.

Correct Answer: A, B

Reference contents:
Common error guidance | Cloud Dataflow
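
Both fixes (options A and B) are Dataflow pipeline options. A small sketch of how they might be set with the Beam Python SDK; the project name and the specific values are placeholders:

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",          # placeholder
        region="europe-west4",
        streaming=True,
        max_num_workers=10,            # raise the autoscaling ceiling (option A)
        machine_type="n1-standard-4",  # larger workers (option B)
    )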


Question 34

You are selecting services to write and transform JSON messages from Google Cloud Pub/Sub to Google BigQuery for a data pipeline on Google Cloud.
You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention.
What should you do?

  • A. Use Google Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
  • B. Use Google Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
  • C. Use Google Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default auto scaling setting for worker instances.
  • D. Use Google Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Google Compute Engine machine types when needed.

Correct Answer: C


Question 35

You are training a spam classifier.
You notice that you are overfitting the training data.
Which three actions can you take to resolve this problem? (Choose three.)

  • A. Get more training examples.
  • B. Reduce the number of training examples.
  • C. Use a smaller set of features.
  • D. Use a larger set of features.
  • E. Increase the regularization parameters.
  • F. Decrease the regularization parameters.

Correct Answer: A, C, E

Reference contents:
Preventing overfitting | BigQuery ML
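
As a concrete illustration of option E (the idea is library-agnostic; scikit-learn is used here only as an example), increasing the regularization on a logistic-regression spam classifier means shrinking C, since C is the inverse of the regularization strength:

    from sklearn.linear_model import LogisticRegression

    # Smaller C => stronger L2 regularization => less overfitting.
    weak_reg = LogisticRegression(C=10.0, penalty="l2", max_iter=1000)
    strong_reg = LogisticRegression(C=0.1, penalty="l2", max_iter=1000)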


Question 36

You are using Google BigQuery as your data warehouse.
Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage 1:
What is the most likely cause of the delay for this query?

(Query plan screenshot not reproduced.)
  • A. Users are running too many concurrent queries in the system.
  • B. The [myproject:mydataset.mytable] table has too many partitions.
  • C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values.
  • D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew.

Correct Answer: D

Reference contents:
Avoiding SQL anti-patterns | BigQuery


Question 37

You are working on a niche product in the image recognition domain.
Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud.
What should you do?

  • A. Use Google Cloud TPUs without any additional adjustment to your code.
  • B. Use Google Cloud TPUs after implementing GPU kernel support for your custom ops.
  • C. Use Google Cloud GPUs after implementing GPU kernel support for your custom ops.
  • D. Stay on CPUs, and increase the size of the cluster you’re training your model on.

Correct Answer: C

Reference contents:
Cloud Tensor Processing Units (TPUs)


Question 38

You are working on a sensitive project involving private user data.
You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project.
How should you maintain users’ privacy?

  • A. Grant the consultant the Viewer role on the project.
  • B. Grant the consultant the Google Cloud Dataflow Developer role on the project.
  • C. Create a service account and allow the consultant to log on with it.
  • D. Create an anonymized sample of the data for the consultant to work with in a different project.

Correct Answer: B


Question 39

You create a new report for your large team in Google Data Studio 360.
The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)

  • A. Ensure all the tables are included in the global dataset.
  • B. Ensure each table is included in a dataset for a region.
  • C. Adjust the settings for each table to allow a related region-based security group view access.
  • D. Adjust the settings for each view to allow a related region-based security group view access.
  • E. Adjust the settings for each dataset to allow a related region-based security group view access.

Correct Answer: B, E


Question 40

You create an important report for your large team in Google Data Studio 360.
The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old.
What should you do?

  • A. Disable caching by editing the report settings.
  • B. Disable caching in Google BigQuery by editing table details.
  • C. Refresh your browser tab showing the visualizations.
  • D. Clear your browser history for the past hour then reload the tab showing the visualizations.

Correct Answer: A

Reference contents:
Manage data freshness – Data Studio Help


Question 41

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally.
Because large parts of the globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your Kafka cluster. This is becoming difficult to manage and prohibitively expensive.
What is the Google-recommended cloud native architecture for this scenario?

  • A. Edge TPUs as sensor devices for storing and transmitting the messages.
  • B. Google Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
  • C. An IoT gateway connected to Google Cloud Pub/Sub, with Google Cloud Dataflow to read and process the messages from Google Cloud Pub/Sub.
  • D. A Kafka cluster virtualized on Google Compute Engine in us-east with Google Cloud Load Balancing to connect to the devices around the world.

Correct Answer: C


Question 42

You decided to use Google Cloud Datastore to ingest vehicle telemetry data in real time.
You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Google Cloud Datastore in a different environment. You want to archive these snapshots for a long time.
Which two methods can accomplish this? (Choose two.)

  • A. Use managed export, and store the data in a Google Cloud Storage bucket using Nearline or Coldline class.
  • B. Use managed export, and then import to Google Cloud Datastore in a separate project under a unique namespace reserved for that export.
  • C. Use managed export, and then import the data into a Google BigQuery table created just for that export, and delete temporary export files.
  • D. Write an application that uses Google Cloud Datastore client libraries to read all the entities. Treat each entity as a Google BigQuery table row via Google BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the Google BigQuery table is partitioned using the export timestamp column.
  • E. Write an application that uses Google Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.

Correct Answer: A, B

Reference contents:
Exporting and Importing Entities | Cloud Datastore Documentation


Question 43

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics.
Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources.
How should you adjust the database design?

  • A. Add capacity (memory and disk space) to the database server by the order of 200.
  • B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
  • C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
  • D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Correct Answer: C


Question 44

You have a data pipeline that writes data to Google Cloud Bigtable using well-designed row keys.
You want to monitor your pipeline to determine when to increase the size of your Google Cloud Bigtable cluster.
Which two actions can you take to accomplish this? (Choose two.)

  • A. Review Key Visualizer metrics. Increase the size of the Google Cloud Bigtable cluster when the Read pressure index is above 100.
  • B. Review Key Visualizer metrics. Increase the size of the Google Cloud Bigtable cluster when the Write pressure index is above 100.
  • C. Monitor the latency of write operations. Increase the size of the Google Cloud Bigtable cluster when there is a sustained increase in write latency.
  • D. Monitor storage utilization. Increase the size of the Google Cloud Bigtable cluster when utilization increases above 70% of max capacity.
  • E. Monitor latency of read operations. Increase the size of the Google Cloud Bigtable cluster if read operations take longer than 100 ms.

Correct Answer: C, D


Question 45

You have a data pipeline with a Google Cloud Dataflow job that aggregates and writes time series metrics to Google Cloud Bigtable.
This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data.
Which two actions should you take? (Choose two.)

  • A. Configure your Google Cloud Dataflow pipeline to use local execution.
  • B. Increase the maximum number of Google Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions.
  • C. Increase the number of nodes in the Google Cloud Bigtable cluster.
  • D. Modify your Google Cloud Dataflow pipeline to use the Flatten transform before writing to Google Cloud Bigtable.
  • E. Modify your Google Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Google Cloud Bigtable.

Correct Answer: B, C

Reference contents:
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
Specifying pipeline execution parameters | Cloud Dataflow
Building production-ready data pipelines using Dataflow: Developing and testing data pipelines


Question 46

You have data stored in Google BigQuery. The data in the Google BigQuery dataset must be highly available.
You need to define a storage, backup, and recovery strategy of this data that minimizes cost.
How should you configure the Google BigQuery table?

  • A. Set the Google BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
  • B. Set the Google BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
  • C. Set the Google BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
  • D. Set the Google BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.

Correct Answer: D

Reference contents:
Geography and regions | Documentation
Availability and durability | BigQuery
BigQuery for data warehouse practitioners | Solutions


Question 47

You have a job that you want to cancel.
It is a streaming pipeline, and you want to ensure that any data that is in-flight is processed and written to the output.
Which of the following commands can you use on the Google Cloud Dataflow monitoring console to stop the pipeline job?

  • A. Cancel
  • B. Drain
  • C. Stop
  • D. Finish

Correct Answer: B

Using the Drain option to stop your job tells the Dataflow service to finish your job in its current state. Your job will immediately stop ingesting new data from input sources, but the Dataflow service will preserve any existing resources (such as worker instances) to finish processing and writing any buffered data in your pipeline.

Reference contents:
Stopping a running pipeline | Cloud Dataflow


Question 48

You have a petabyte of analytics data and need to design a storage and processing platform for it.
You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers.
What should you do?

  • A. Store and process the entire dataset in Google BigQuery.
  • B. Store and process the entire dataset in Google Cloud Bigtable.
  • C. Store the full dataset in Google BigQuery, and store a compressed copy of the data in a Google Cloud Storage bucket.
  • D. Store the warm data as files in Google Cloud Storage, and store the active data in Google BigQuery. Keep this ratio as 80% warm and 20% active.

Correct Answer: C


Question 49

You have a query that filters a Google BigQuery table using a WHERE clause on timestamp and ID columns.
By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID select a tiny fraction of the overall data. You want to reduce the amount of data scanned by Google BigQuery with minimal changes to existing SQL queries.
What should you do?

  • A. Create a separate table for each ID.
  • B. Use the LIMIT keyword to reduce the number of rows returned.
  • C. Recreate the table with a partitioning column and clustering column.
  • D. Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

Correct Answer: C

Reference contents:
Controlling costs in BigQuery
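
Option C can be done with a single DDL statement (a CREATE TABLE ... AS SELECT that copies the existing data into a partitioned, clustered table). A sketch via the Python client, with all names as placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    ddl = """
    CREATE TABLE `my-project.mydataset.events_by_day`
    PARTITION BY DATE(event_timestamp)
    CLUSTER BY id
    AS SELECT * FROM `my-project.mydataset.events`
    """
    client.query(ddl).result()

    # Queries filtering on event_timestamp (and id) now prune partitions
    # instead of scanning the full table.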


Question 50

You have a requirement to insert minute-resolution data from 50,000 sensors into a Google BigQuery table.
You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends.
What should you do?

  • A. Use bq load to load a batch of sensor data every 60 seconds.
  • B. Use a Google Cloud Dataflow pipeline to stream data into the Google BigQuery table.
  • C. Use the INSERT statement to insert a batch of data every 60 seconds.
  • D. Use the MERGE statement to apply updates in batch every 60 seconds.

Correct Answer: B
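
A compact Apache Beam (Python) sketch of option B; the topic and table names are placeholders, and it assumes the destination table already exists with a matching schema:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadSensors" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/sensor-readings")
            | "Parse" >> beam.Map(json.loads)
            | "Stream" >> beam.io.WriteToBigQuery(
                "my-project:telemetry.readings",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )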
