Google Cloud Certified – Professional Data Engineer – Practice Exam (Question 41)
Question 1
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly.
The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit.
Which solution should you choose?
- A. Use BigQuery ML to train the model at scale, so you can analyze the packages in batches.
- B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.
- C. Use the Google Cloud Vision API to detect damage, and raise an alert through Google Cloud Functions. Integrate the package tracking applications with this function.
- D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Google Cloud Datalab that uses this model to analyze packages for damage.
Correct Answer: B — BigQuery ML cannot train on image data, and batch analysis does not meet the real-time requirement, while the pretrained Vision API cannot recognize your company's specific damage patterns. Training an AutoML model on your own image corpus and serving it behind an API enables real-time detection integrated with the package tracking applications.
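The integration that option B describes can be sketched as follows. This is a minimal, hypothetical post-processing step around a damage-classification model: the call to the deployed AutoML prediction endpoint is elided (it requires the `google-cloud-automl` client and credentials), and the predictions are passed in as `(label, score)` pairs. The label `"damaged"` and the 0.7 cutoff are assumptions, not values from the question.

```python
# Hypothetical glue logic between a deployed damage-classification model
# and the package tracking application (option B). In production, the
# (label, score) pairs would come from the AutoML prediction endpoint.

DAMAGE_THRESHOLD = 0.7  # assumed confidence cutoff for human review


def flag_for_review(predictions, threshold=DAMAGE_THRESHOLD):
    """Return True if the model reports 'damaged' at or above the threshold.

    predictions: list of (label, score) tuples, e.g. [("damaged", 0.92)].
    """
    return any(
        label == "damaged" and score >= threshold
        for label, score in predictions
    )


print(flag_for_review([("damaged", 0.92), ("intact", 0.08)]))  # True
print(flag_for_review([("intact", 0.97), ("damaged", 0.03)]))  # False
```

A package flagged `True` would be routed to a human-review queue by the tracking application while still in transit.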
Question 2
You work for a shipping company that uses handheld scanners to read shipping labels.
Your company has strict data privacy standards that prohibit transmitting recipients’ personally identifiable information (PII) to analytics systems, but the scanners currently transmit it, which violates these rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems.
What should you do?
- A. Create an authorized view in Google BigQuery to restrict access to tables with sensitive data.
- B. Install a third-party data validation tool on Google Compute Engine virtual machines to check the incoming data for sensitive information.
- C. Use Stackdriver Logging to analyze the data passing through the entire pipeline to identify transactions that may contain sensitive information.
- D. Build a Cloud Function that reads the messages from the topics and calls the Google Cloud Data Loss Prevention (DLP) API. Use the tagging and confidence levels to either pass the data along or quarantine it in a bucket for review.
Correct Answer: D — the Cloud DLP API is the managed, cloud-native service for detecting PII, and a Cloud Function subscribed to the topics can inspect each message and pass or quarantine it before it reaches the analytics systems. An authorized view (A) only restricts access after the PII has already landed in BigQuery, and options B and C are neither managed nor preventive.
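A minimal sketch of the pass-or-quarantine routing that option D describes. The DLP inspection call itself is elided (it requires the `google-cloud-dlp` client and credentials); here the findings are stood in as `(info_type, likelihood)` pairs using DLP's likelihood scale, and `route_message` is a hypothetical helper, not part of any Google API.

```python
# Sketch of the routing decision inside the Cloud Function (option D).
# In production, `findings` would be extracted from the DLP API's
# InspectContent response; here they are plain tuples (assumption).

# Cloud DLP reports match likelihood on an ordered scale.
LIKELIHOOD_ORDER = [
    "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]


def route_message(findings, threshold="LIKELY"):
    """Return 'quarantine' if any finding meets the threshold, else 'pass'.

    findings: list of (info_type, likelihood) tuples, e.g.
              [("EMAIL_ADDRESS", "VERY_LIKELY")].
    """
    cutoff = LIKELIHOOD_ORDER.index(threshold)
    for _info_type, likelihood in findings:
        if LIKELIHOOD_ORDER.index(likelihood) >= cutoff:
            return "quarantine"
    return "pass"


print(route_message([("PERSON_NAME", "POSSIBLE")]))       # pass
print(route_message([("EMAIL_ADDRESS", "VERY_LIKELY")]))  # quarantine
```

Messages routed to `"quarantine"` would be written to a Cloud Storage bucket for human review; the rest flow on to the analytics systems.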
Question 3
You work for an advertising company, and you’ve developed a Spark ML model to predict click-through rates at advertisement blocks.
You’ve been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you’ve been using will be migrated to Google BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud.
What should you do?
- A. Use Cloud ML Engine for training existing Spark ML models.
- B. Rewrite your models in TensorFlow, and start using Cloud ML Engine.
- C. Use Google Cloud Dataproc for training existing Spark ML models, but start reading data directly from Google BigQuery.
- D. Spin up a Spark cluster on Google Compute Engine, and train Spark ML models on the data exported from Google BigQuery.
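The lift-and-shift that option C describes can be sketched with two `gcloud` commands. The cluster name, region, bucket, and script name below are hypothetical; the connector jar path is the public one Google hosts in `gs://spark-lib`. Inside the unmodified PySpark script, the training data would be read with `spark.read.format("bigquery")` instead of the old on-premises source.

```shell
# Create a Dataproc cluster to host the existing Spark ML training jobs
# (cluster name, region, and worker count are assumptions).
gcloud dataproc clusters create ctr-training-cluster \
    --region=us-central1 \
    --num-workers=4

# Submit the existing Spark ML training job; the spark-bigquery connector
# jar lets it read training data directly from BigQuery.
gcloud dataproc jobs submit pyspark gs://my-gcs-bucket/train_ctr_model.py \
    --cluster=ctr-training-cluster \
    --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
```

This keeps the Spark ML code unchanged (meeting the rapid-migration constraint) while reading directly from BigQuery, which is why managed Dataproc is preferable to rewriting in TensorFlow or self-managing Spark on Compute Engine.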