Google Cloud Certified – Professional Data Engineer – Practice Exam (Question 50)
You are building a model to predict whether or not it will rain on a given day.
You have thousands of input features and want to see whether you can improve training speed by removing some of them with minimal effect on model accuracy.
What can you do?
- A. Eliminate features that are highly correlated to the output labels.
- B. Combine highly co-dependent features into one representative feature.
- C. Instead of feeding in each feature individually, average their values in batches of 3.
- D. Remove the features that have null values for more than 50% of the training records.
Correct Answer: B
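A minimal sketch of what option B describes, assuming features are held as plain Python lists keyed by name. The helper names (`pearson`, `combine_codependent`), the 0.95 threshold, and the choice of a row-wise mean as the "representative feature" are illustrative assumptions, not part of the exam question; in practice a technique such as PCA, or standardizing columns before combining them, may be more appropriate.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def combine_codependent(features, threshold=0.95):
    """Greedily group columns that correlate strongly and positively with a
    group's first column, then replace each group by its row-wise mean.

    features: dict mapping column name -> list of floats (hypothetical layout).
    Returns a dict mapping a joined group name -> combined column.
    """
    groups = []  # each group is a list of column names
    for name, col in features.items():
        for group in groups:
            # Only positive correlation is grouped, so averaging does not
            # cancel out columns that move in opposite directions.
            if pearson(features[group[0]], col) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])

    combined = {}
    for group in groups:
        cols = [features[n] for n in group]
        combined["+".join(group)] = [sum(row) / len(row) for row in zip(*cols)]
    return combined


# Made-up sample data: temperature in two units is perfectly co-dependent.
features = {
    "temp_c": [10.0, 15.0, 20.0, 25.0],
    "temp_f": [50.0, 59.0, 68.0, 77.0],
    "humidity": [80.0, 40.0, 70.0, 30.0],
}
reduced = combine_codependent(features)
```

Here the two temperature columns collapse into a single column while humidity is left alone, so the model trains on two features instead of three without losing an independent signal.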
You are building a new application and need to collect data from it in a scalable way.
Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
– Decoupling producer from consumer.
– Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely.
– Near real-time SQL queries.
– Maintain at least 2 years of historical data, which will be queried with SQL.
Which pipeline should you use to meet these requirements?
- A. Create an application that provides an API. Write a tool to poll the API and write data to Google Cloud Storage as gzipped JSON files.
- B. Create an application that writes to a Google Cloud SQL database to store the data. Set up periodic exports of the database to write to Google Cloud Storage and load into Google BigQuery.
- C. Create an application that publishes events to Google Cloud Pub/Sub, and create Spark jobs on Google Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
- D. Create an application that publishes events to Google Cloud Pub/Sub, and create a Google Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Google Cloud Storage and Google BigQuery.
Correct Answer: D
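The shape of option D, a publisher decoupled from a consumer that writes each event to both an archive and a queryable store, can be sketched in-process as follows. This is only an analogy under stated assumptions: `queue.Queue` stands in for a Pub/Sub topic, and two Python lists stand in for Avro files in Cloud Storage and rows in BigQuery. No Google Cloud API is called, and the function names are hypothetical.

```python
import json
import queue
import threading


def producer(topic, events):
    """Application side: publish JSON payloads, then a sentinel to stop."""
    for event in events:
        topic.put(json.dumps(event))
    topic.put(None)  # sentinel: no more events


def consumer(topic, archive, warehouse):
    """Pipeline side: parse each payload and write it to both sinks."""
    while True:
        payload = topic.get()
        if payload is None:
            break
        record = json.loads(payload)
        archive.append(record)    # stand-in for Avro files in Cloud Storage
        warehouse.append(record)  # stand-in for rows written to BigQuery


def run_pipeline(events):
    """Run producer and consumer independently, joined only by the topic."""
    topic = queue.Queue()  # stand-in for a Pub/Sub topic
    archive, warehouse = [], []
    worker = threading.Thread(target=consumer, args=(topic, archive, warehouse))
    worker.start()
    producer(topic, events)
    worker.join()
    return archive, warehouse


archive, warehouse = run_pipeline([{"id": 1, "rain": True}, {"id": 2, "rain": False}])
```

The producer never references the consumer or either sink; it only knows the topic. That is the decoupling property the requirements ask for, and it is what the polling and database-export options lack.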
You are building a new data pipeline to share data between two different types of applications: job generators and job runners.
Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones.
What should you do?
- A. Create an API using Google App Engine to receive and send messages to the applications.
- B. Use a Google Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them.
- C. Create a table on Google Cloud SQL, and insert and delete rows with the job information.
- D. Create a table on Google Cloud Spanner, and insert and delete rows with the job information.