Google Cloud Certified – Professional Data Engineer – Practice Exam (Question 50)
A data scientist has created a Google BigQuery ML model and asks you to create an ML pipeline to serve predictions.
You have a REST API application that must serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions:
SELECT predicted_label, user_id FROM ML.PREDICT(MODEL `dataset.model`, TABLE user_features)
How should you create the ML pipeline?
- A. Add a WHERE clause to the query, and grant the Google BigQuery Data Viewer role to the application service account.
- B. Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.
- C. Create a Google Cloud Dataflow pipeline using Google BigQueryIO to read results from the query. Grant the dataflow.worker role to the application service account.
- D. Create a Google Cloud Dataflow pipeline using Google BigQueryIO to read predictions for all users from the query. Write the results to Google Cloud Bigtable using BigtableIO. Grant the bigtable.reader role to the application service account so that the application can read predictions for individual users from Google Cloud Bigtable.
Correct Answer: D
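The reason option D meets the latency bound can be sketched locally. In this hypothetical sketch, a plain Python dict stands in for Cloud Bigtable and a batch function stands in for the Dataflow job that runs ML.PREDICT over all users; all names are illustrative, not GCP APIs. The point is that predictions are precomputed in batch, so each API request is a single key lookup rather than a BigQuery query.

```python
# Minimal local sketch of option D's serving pattern (hypothetical names;
# a dict stands in for Cloud Bigtable, and batch_predict stands in for the
# Dataflow pipeline that materializes ML.PREDICT output for all users).

def batch_predict(user_features):
    """Stand-in for the Dataflow job: score every user up front."""
    # In the real pipeline this would be the ML.PREDICT query output,
    # written to Bigtable via BigtableIO with user_id as the row key.
    return {row["user_id"]: row["predicted_label"] for row in user_features}

# "Bigtable": predictions keyed by user_id for constant-time point reads.
prediction_store = batch_predict([
    {"user_id": "u1", "predicted_label": "churn"},
    {"user_id": "u2", "predicted_label": "retain"},
])

def serve_prediction(user_id):
    """Stand-in for the REST handler: one key lookup per request,
    which is what keeps latency well under 100 ms."""
    return prediction_store.get(user_id)
```

Options A and B, by contrast, would run a BigQuery query per request, and BigQuery's per-query overhead alone typically exceeds the 100 ms budget.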
A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time.
This is then loaded into Google BigQuery. Analysts in your company want to query the tracking data in Google BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in Google BigQuery.
What should you do?
- A. Implement clustering in Google BigQuery on the ingest date column.
- B. Implement clustering in Google BigQuery on the package-tracking ID column.
- C. Tier older data onto Google Cloud Storage files, and leverage external tables.
- D. Re-create the table using data partitioning on the package delivery date.
Correct Answer: A
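The mechanism behind clustering can be illustrated with a rough local model. Clustering co-locates rows with similar values for the cluster column, so the engine can skip storage blocks whose value range cannot match the filter. The block layout and names below are hypothetical, not BigQuery internals:

```python
# Rough illustration of block pruning on clustered data (hypothetical
# block metadata; not BigQuery internals). Each block records the
# min/max of its cluster column, so blocks whose range cannot contain
# the filter value are skipped entirely.

blocks = [
    {"min": "PKG-0001", "max": "PKG-0999", "rows": 1000},
    {"min": "PKG-1000", "max": "PKG-1999", "rows": 1000},
    {"min": "PKG-2000", "max": "PKG-2999", "rows": 1000},
]

def rows_scanned(blocks, key):
    """Scan only blocks whose [min, max] range could contain the key."""
    return sum(b["rows"] for b in blocks if b["min"] <= key <= b["max"])
```

With the data clustered this way, a point filter on the cluster column touches one block of 1,000 rows instead of all 3,000, which is why clustering reduces both processing time and bytes scanned as the table grows.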
After migrating ETL jobs to run on Google BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original.
You’ve loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?
- A. Select random samples from the tables using the RAND() function and compare the samples.
- B. Select random samples from the tables using the HASH() function and compare the samples.
- C. Use a Google Cloud Dataproc cluster and the Google BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
- D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.
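The sort-then-hash comparison described in option C can be sketched locally, with plain Python in place of Dataproc and the BigQuery Hadoop connector (column names and sample rows are hypothetical). Sorting first makes the fingerprint independent of row order, and excluding timestamp columns ignores load-time differences between the two jobs:

```python
import hashlib

def table_fingerprint(rows, timestamp_cols=()):
    """Hash the non-timestamp columns of every row after sorting, so two
    tables with identical contents match regardless of row order."""
    cleaned = [
        tuple(v for k, v in sorted(row.items()) if k not in timestamp_cols)
        for row in rows
    ]
    digest = hashlib.sha256()
    for row in sorted(cleaned):  # sort rows so order differences vanish
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

# Same business data, different row order and different load timestamps.
original = [{"id": 1, "val": "a", "loaded_at": "2024-01-01"},
            {"id": 2, "val": "b", "loaded_at": "2024-01-01"}]
migrated = [{"id": 2, "val": "b", "loaded_at": "2024-02-01"},
            {"id": 1, "val": "a", "loaded_at": "2024-02-01"}]
```

Comparing `table_fingerprint(original, ("loaded_at",))` with the migrated table's fingerprint verifies the outputs are identical without needing a join key, which is exactly what the missing primary key rules out in options A, B, and D.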