Google Cloud Certified - Professional Data Engineer

Ace Your Professional Data Engineer Certification with Practice Exams.

Google Cloud Certified – Professional Data Engineer – Practice Exam (50 Questions)


Question 1

What are two of the characteristics of using online prediction rather than batch prediction?

  • A. It is optimized to handle a high volume of data instances in a job and to run more complex models.
  • B. Predictions are returned in the response message.
  • C. Predictions are written to output files in a Google Cloud Storage location that you specify.
  • D. It is optimized to minimize the latency of serving predictions.

Correct Answer: B, D

Online prediction:
– Optimized to minimize the latency of serving predictions.
– Predictions are returned in the response message.

Batch prediction:
– Optimized to handle a high volume of instances in a job and to run more complex models.
– Predictions are written to output files in a Google Cloud Storage location that you specify.

Reference contents:
Prediction overview > Online prediction versus batch prediction
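
For context, here is a minimal Python sketch of an online prediction request against the AI Platform (ML Engine) v1 REST API; the project name, model name, and instance fields are placeholders, not values from the exam. Note that the predictions come back in the response message rather than in Cloud Storage output files.

```python
# Minimal sketch: online prediction against a deployed model.
# Project, model, and feature names below are hypothetical.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")
name = "projects/my-project/models/my_model"  # placeholder model path

response = ml.projects().predict(
    name=name,
    body={"instances": [{"feature_a": 1.0, "feature_b": 2.5}]},
).execute()

# Online prediction returns results directly in the response message;
# batch prediction would instead write output files to Cloud Storage.
print(response["predictions"])
```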


Question 2

What Google Cloud Dataflow concept determines when a Window’s contents should be output based on certain criteria being met?

  • A. Sessions
  • B. OutputCriteria
  • C. Windows
  • D. Triggers

Correct Answer: D

Triggers control when the elements for a specific key and window are output. As elements arrive, they are put into one or more windows by a Window transform and its associated WindowFn, and then passed to the associated Trigger to determine if the window’s contents should be output.

Reference contents:
Trigger (Google Cloud Dataflow SDK 1.9.1 API)


Question 3

What is the general recommendation when designing your row keys for a Google Cloud Bigtable schema?

  • A. Include multiple time series values within the row key.
  • B. Keep the row key as an 8-bit integer.
  • C. Keep your row key reasonably short.
  • D. Keep your row key as long as the field permits.

Correct Answer: C

A general guide is to keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Google Cloud Bigtable server.

Reference contents:
Designing your schema > Row keys | Cloud Bigtable Documentation


Question 4

What is the HBase Shell for Google Cloud Bigtable?

  • A. The HBase shell is a GUI based interface that performs administrative tasks, such as creating and deleting tables.
  • B. The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
  • C. The HBase shell is a hypervisor based shell that performs administrative tasks, such as creating and deleting new virtualized instances.
  • D. The HBase shell is a command-line tool that performs only user account management functions to grant access to Google Cloud Bigtable instances.

Correct Answer: B

The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables. The Google Cloud Bigtable HBase client for Java makes it possible to use the HBase shell to connect to Google Cloud Bigtable.

Reference contents:
Installing the HBase shell for Cloud Bigtable


Question 5

What is the recommended action to switch between SSD and HDD storage for your Google Cloud Bigtable instance?

  • A. Create a third instance and sync the data from the two storage types via batch jobs.
  • B. Export the data from the existing instance and import the data into a new instance.
  • C. Run parallel instances where one is HDD and the other is SSD.
  • D. The selection is final and you must continue using the same storage type.

Correct Answer: B

Reference contents:
Choosing between SSD and HDD storage > Switching between SSD and HDD storage


Question 6

When a Google Cloud Bigtable node fails, ____ is lost.

  • A. all data
  • B. no data
  • C. the last transaction
  • D. the time dimension

Correct Answer: B

A Google Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google’s file system, in SSTable format. Each tablet is associated with a specific Google Cloud Bigtable node.
Data is never stored in Google Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
– Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Google Cloud Bigtable simply updates the pointers for each node.
– Recovery from the failure of a Google Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
When a Google Cloud Bigtable node fails, no data is lost.

Reference contents:
Overview of Cloud Bigtable | Cloud Bigtable Documentation


Question 7

When creating a new Google Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required:
project, region, name, and ____.

  • A. zone
  • B. node
  • C. label
  • D. type

Correct Answer: A

At a minimum, you must specify four values when creating a new cluster with the projects.regions.clusters.create operation:
– The project in which the cluster will be created
– The region to use
– The name of the cluster
– The zone in which the cluster will be created.
You can specify many more details beyond these minimum requirements. For example, you can also specify the number of workers, whether preemptible compute should be used, and the network settings.

Reference contents:
Use the Cloud Client Libraries for Python > Create a Dataproc cluster
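
As an illustration, a minimal sketch using the google-cloud-dataproc Python client (assuming the v2.x library); the project, region, zone, and cluster name are placeholder values.

```python
from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder
region = "us-central1"
zone = "us-central1-a"
cluster_name = "example-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The four required values: project, region, name, and zone.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {"gce_cluster_config": {"zone_uri": zone}},
}
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is ready
```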


Question 8

When running a pipeline with a Google BigQuery source on your local machine, you continue to get permission denied errors.
What could be the reason for that?

  • A. Your gcloud does not have access to the Google BigQuery resources.
  • B. Google BigQuery cannot be accessed from local machines.
  • C. You are missing gcloud on your machine.
  • D. Pipelines cannot be run locally.

Correct Answer: A

When reading from a Google Cloud Dataflow source or writing to a Google Cloud Dataflow sink using DirectPipelineRunner, the Google Cloud Platform account that you configured with the gcloud executable needs access to the corresponding source/sink.

Reference contents:
DirectPipelineRunner (Google Cloud Dataflow SDK 1.9.1 API)


Question 9

When using Google Cloud Dataproc clusters, you can access the YARN web interface by configuring a browser to connect through a ____ proxy.

  • A. HTTPS
  • B. VPN
  • C. SOCKS
  • D. HTTP

Correct Answer: C

When using Google Cloud Dataproc clusters, configure your browser to use the SOCKS proxy. The SOCKS proxy routes data intended for the Google Cloud Dataproc cluster through an SSH tunnel.

Reference contents:
Cluster web interfaces > Available interfaces 


Question 10

When you design a Google Cloud Bigtable schema it is recommended that you _________.

  • A. Avoid schema designs that are based on NoSQL concepts.
  • B. Create schema designs that are based on a relational database design.
  • C. Avoid schema designs that require atomicity across rows.
  • D. Create schema designs that require atomicity across rows.

Correct Answer: C

All operations are atomic at the row level. For example, if you update two rows in a table, it’s possible that one row will be updated successfully and the other update will fail. Avoid schema designs that require atomicity across rows.

Reference contents:
Designing your schema > Row keys | Cloud Bigtable Documentation


Question 11

When you store data in Google Cloud Bigtable, what is the recommended minimum amount of stored data?

  • A. 500 TB
  • B. 1 GB
  • C. 1 TB
  • D. 500 GB

Correct Answer: C

Google Cloud Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions. It is not a good solution for less than 1 TB of data.

Reference contents:
Overview of Cloud Bigtable > Other storage and database options | Cloud Bigtable Documentation


Question 12

Which action can a Google Cloud Dataproc Viewer perform?

  • A. Submit a job.
  • B. Create a cluster.
  • C. Delete a cluster.
  • D. List the jobs.

Correct Answer: D

A Google Cloud Dataproc Viewer is limited in its actions based on its role. A viewer can only list clusters, get cluster details, list jobs, get job details, list operations, and get operation details.

Reference contents:
Dataproc permissions and IAM roles > IAM Roles and Dataproc Operations Summary | Dataproc Documentation


Question 13

Which Google Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?

  • A. An hourly watermark.
  • B. An event time trigger.
  • C. The withAllowedLateness method.
  • D. A processing time trigger.

Correct Answer: D

When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window.
– Processing time triggers: these operate on the processing time, the time when the data element is processed at any given stage in the pipeline.
– Event time triggers: these operate on the event time, as indicated by the timestamp on each data element. Beam’s default trigger is event time-based.

Reference contents:
Apache Beam Programming Guide > 9. Triggers
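
A minimal Apache Beam Python sketch of a processing time trigger that fires every hour; the beam.Create input is a placeholder standing in for an unbounded source.

```python
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    events = p | beam.Create(["a", "b", "a"])  # placeholder for an unbounded source
    hourly_counts = (
        events
        | "HourlyWindow" >> beam.WindowInto(
            window.GlobalWindows(),
            # Fire every hour of processing time, regardless of event timestamps.
            trigger=trigger.Repeatedly(trigger.AfterProcessingTime(60 * 60)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | "Count" >> beam.combiners.Count.PerElement()
    )
```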


Question 14

Which Google Cloud Platform service is an alternative to Hadoop with Hive?

  • A. Google Cloud Dataflow
  • B. Google Cloud Bigtable
  • C. Google BigQuery
  • D. Google Cloud Datastore

Correct Answer: C

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse.

Reference contents:
Apache Hive


Question 15

Which is not a valid reason for poor Google Cloud Bigtable performance?

  • A. The workload isn’t appropriate for Google Cloud Bigtable.
  • B. The table’s schema is not designed correctly.
  • C. The Google Cloud Bigtable cluster has too many nodes.
  • D. There are issues with the network connection.

Correct Answer: C

The Google Cloud Bigtable cluster doesn’t have enough nodes. If your Google Cloud Bigtable cluster is overloaded, adding more nodes can improve performance. Use the monitoring tools to check whether the cluster is overloaded.

Reference contents:
Understanding Cloud Bigtable performance


Question 16

Which is the preferred method to use to avoid hotspotting in time series data in Google Cloud Bigtable?

  • A. Field promotion
  • B. Randomization
  • C. Salting
  • D. Hashing

Correct Answer: A

By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.

Reference contents:
Schema design for time series data > Ensure that your row key avoids hotspotting | Cloud Bigtable Documentation
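
To illustrate field promotion, a short hypothetical sketch for a stock tick time series: the symbol is promoted into the front of the row key and the timestamp follows it.

```python
import datetime

def make_row_key(symbol, event_time):
    # Identifier first, timestamp second, so sequential writes for different
    # symbols spread across the key space instead of hotspotting one node.
    return "{}#{:%Y%m%d%H%M%S}".format(symbol, event_time).encode("utf-8")

print(make_row_key("GOOG", datetime.datetime(2024, 6, 1, 9, 30)))
# b'GOOG#20240601093000'
```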


Question 17

Which Java SDK class can you use to run your Google Cloud Dataflow programs locally?

  • A. LocalRunner
  • B. DirectPipelineRunner
  • C. MachineRunner
  • D. LocalPipelineRunner

Correct Answer: B

DirectPipelineRunner allows you to execute operations in the pipeline directly, without any optimization. Useful for small local execution and tests.

Reference contents:
DirectPipelineRunner (Google Cloud Dataflow SDK 1.9.1 API)
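
The question asks about the Java SDK class; for comparison, the equivalent way to run locally with the Beam / Dataflow Python SDK is the DirectRunner, as in this small sketch.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Run the pipeline locally, without submitting it to the Dataflow service.
options = PipelineOptions(runner="DirectRunner")
with beam.Pipeline(options=options) as p:
    (
        p
        | beam.Create([1, 2, 3])
        | beam.Map(lambda x: x * 2)
        | beam.Map(print)
    )
```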


Question 18

Which methods can be used to reduce the number of rows processed by Google BigQuery?

  • A. Splitting tables into multiple tables; putting data in partitions.
  • B. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause.
  • C. Putting data in partitions; using the LIMIT clause.
  • D. Splitting tables into multiple tables; using the LIMIT clause.

Correct Answer: A

If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day.
If you use the LIMIT clause, Google BigQuery will still process the entire table.

Reference contents:
Introduction to partitioned tables | BigQuery
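
As an illustration, a query against a hypothetical ingestion-time partitioned table using the BigQuery Python client: filtering on the partitioning pseudo-column prunes partitions, whereas adding a LIMIT clause would not reduce the rows scanned.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Only the 2024-06-01 partition is scanned; a LIMIT clause alone would still
# process the whole table. Table and column names are placeholders.
sql = """
    SELECT event_id, payload
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONDATE = DATE "2024-06-01"
"""
for row in client.query(sql).result():
    print(row)
```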


Question 19

Which of the following are examples of hyperparameters? (Select 2 answers.)

  • A. Number of hidden layers
  • B. Number of nodes in each hidden layer
  • C. Biases
  • D. Weights

Correct Answer: A, B

If model parameters are variables that get adjusted by training with existing data, your hyperparameters are the variables about the training process itself. For example, part of setting up a deep neural network is deciding how many “hidden” layers of nodes to use between the input layer and the output layer, as well as how many nodes each layer should use. These variables are not directly related to the training data at all. They are configuration variables. Another difference is that parameters change during a training job, while the hyperparameters are usually constant during a job.
Weights and biases are variables that get adjusted during the training process, so they are not hyperparameters.

Reference contents:
Overview of hyperparameter tuning | AI Platform Training
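
For example, in the small Keras sketch below, the number of hidden layers and the nodes per layer are hyperparameters fixed before training, while the weights and biases are the parameters that training adjusts.

```python
import tensorflow as tf

# Hyperparameters: chosen up front, constant for the whole training job.
n_hidden_layers = 2
n_nodes_per_layer = 64

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(n_nodes_per_layer, activation="relu")
     for _ in range(n_hidden_layers)]
    + [tf.keras.layers.Dense(1)]
)
model.compile(optimizer="adam", loss="mse")

# Parameters: the weights and biases in model.trainable_variables are what
# training adjusts; they are not hyperparameters.
```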


Question 20

Which of the following are feature engineering techniques? (Select 2 answers)

  • A. Hidden feature layers.
  • B. Feature prioritization.
  • C. Crossed feature columns.
  • D. Bucketization of a continuous feature.

Correct Answer: C, D

Selecting and crafting the right set of feature columns is key to learning an effective model.
Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into.
Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.
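
A short sketch of both techniques using the tf.feature_column API; the feature names and bucket boundaries are hypothetical.

```python
import tensorflow as tf

# Bucketization: turn a continuous feature into a categorical bucket ID.
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)

# Crossed feature column: model interactions between base features.
education_x_occupation = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=1000
)
```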


Question 21

Which of the following IAM roles does your Google Compute Engine account require to be able to run pipeline jobs?

  • A. dataflow.worker
  • B. dataflow.compute
  • C. dataflow.developer
  • D. dataflow.viewer

Correct Answer: A

The dataflow.worker role provides the permissions necessary for a Google Compute Engine service account to execute work units for a Google Cloud Dataflow pipeline.

Reference contents:
Cloud Dataflow access control guide


Question 22

Which of the following is NOT a valid use case to select HDD (hard disk drives) as the storage for Google Cloud Bigtable?

  • A. You expect to store at least 10 TB of data.
  • B. You will mostly run batch workloads with scans and writes, rather than frequently executing random reads of a small number of rows.
  • C. You need to integrate with Google BigQuery.
  • D. You will not use the data to back a user-facing or latency-sensitive application.

Correct Answer: C

For example, if you plan to store extensive historical data for a large number of remote-sensing devices and then use the data to generate daily reports, the cost savings for HDD storage may justify the performance tradeoff. On the other hand, if you plan to use the data to display a real-time dashboard, it probably would not make sense to use HDD storage: reads would be much more frequent in this case, and reads are much slower with HDD storage.

Reference contents:
Choosing between SSD and HDD storage | Cloud Bigtable Documentation


Question 23

Which of the following is NOT one of the three main types of triggers that Google Cloud Dataflow supports?

  • A. Trigger based on element size in bytes.
  • B. Trigger that is a combination of other triggers.
  • C. Trigger based on element count.
  • D. Trigger based on time.

Correct Answer: A

There are three major kinds of triggers that Google Cloud Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.

Reference contents:
Dataflow > Triggers
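
For illustration, a Beam Python sketch that combines all three kinds: a time-based watermark trigger whose early firings are a composite of a data-driven count trigger and a processing time trigger. The input data and window size are placeholders.

```python
import time

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    counts = (
        p
        | beam.Create([("user1", 1), ("user2", 1)])  # placeholder input
        | beam.Map(lambda kv: window.TimestampedValue(kv, time.time()))
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=trigger.AfterWatermark(           # time-based
                early=trigger.AfterAny(               # composite of the two below
                    trigger.AfterCount(100),          # data-driven (element count)
                    trigger.AfterProcessingTime(30),  # time-based (processing time)
                )
            ),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | beam.CombinePerKey(sum)
    )
```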


Question 24

Which of the following is not possible using primitive roles?

  • A. Give a user viewer access to Google BigQuery and owner access to Google Compute Engine instances.
  • B. Give UserA owner access and UserB editor access for all datasets in a project.
  • C. Give a user access to view all datasets in a project, but not run queries on them.
  • D. Give GroupA owner access and GroupB editor access for all datasets in a project.

Correct Answer: C

Primitive roles can be used to give owner, editor, or viewer access to a user or group, but they can’t be used to separate data access permissions from job-running permissions.

Reference contents:
Predefined roles and permissions > BigQuery permissions and predefined IAM roles


Question 25

Which of the following is NOT true about Google Cloud Dataflow pipelines?

  • A. Google Cloud Dataflow pipelines are tied to Google Cloud Dataflow, and cannot be run on any other runner.
  • B. Google Cloud Dataflow pipelines can consume data from other Google Cloud services.
  • C. Google Cloud Dataflow pipelines can be programmed in Java.
  • D. Google Cloud Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources.

Correct Answer: A

Google Cloud Dataflow pipelines can also run on alternate runtimes like Spark and Flink, as they are built using the Apache Beam SDKs.

Reference contents:
Dataflow


Question 26

Which of the following is not true about Google Cloud Dataflow pipelines?

  • A. Pipelines are a set of operations.
  • B. Pipelines represent a data processing job.
  • C. Pipelines represent a directed graph of steps.
  • D. Pipelines can share data between instances.

Correct Answer: D

The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.

Reference contents:
Dataflow > Pipelines


Question 27

Which of the following job types are supported by Google Cloud Dataproc (select 3 answers)?

  • A. Hive
  • B. Pig
  • C. YARN
  • D. Spark

Correct Answer: A, B, D

Google Cloud Dataproc provides out-of-the-box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.

Reference contents:
Dataproc FAQ > What type of jobs can I run?


Question 28

Which of the following statements about Legacy SQL and Standard SQL is not true?

  • A. Standard SQL is the preferred query language for Google BigQuery.
  • B. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
  • C. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
  • D. You need to set a query language for each dataset and the default is Standard SQL.

Correct Answer: D

You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since Google BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, instead. Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.

Reference contents:
Migrating to standard SQL | BigQuery
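
To make the table-name difference concrete, a small sketch using the BigQuery Python client and a public sample table: standard SQL uses backticks and periods, while legacy SQL (opted into per query, not per dataset) uses square brackets and a colon.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL: period separators, backtick-quoted table name.
standard = client.query(
    "SELECT corpus FROM `bigquery-public-data.samples.shakespeare` LIMIT 5"
)

# Legacy SQL: colon after the project, square brackets; requested per query.
legacy = client.query(
    "SELECT corpus FROM [bigquery-public-data:samples.shakespeare] LIMIT 5",
    job_config=bigquery.QueryJobConfig(use_legacy_sql=True),
)

print(list(standard.result())[:2])
print(list(legacy.result())[:2])
```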


Question 29

Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)

  • A. The wide model is used for memorization, while the deep model is used for generalization.
  • B. A good use for the wide and deep model is a recommender system.
  • C. The wide model is used for generalization, while the deep model is used for memorization.
  • D. A good use for the wide and deep model is a small-scale linear regression problem.

Correct Answer: A, B

Can we teach computers to learn like humans do, by combining the power of memorization and generalization? It’s not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it Wide & Deep Learning. It’s useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.

Reference contents:
Training using the built-in wide and deep algorithm
Wide & Deep Learning: Better Together with TensorFlow
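
As a rough sketch with hypothetical feature names, the TensorFlow Estimator API exposes this combination directly as DNNLinearCombinedClassifier: sparse crossed columns feed the wide (memorizing) linear part, and embedding columns plus hidden layers form the deep (generalizing) part.

```python
import tensorflow as tf

# Wide part: memorization over sparse crossed features.
wide_columns = [
    tf.feature_column.crossed_column(
        ["query_category", "item_category"], hash_bucket_size=10000
    )
]

# Deep part: generalization through embeddings and hidden layers.
deep_columns = [
    tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket("item_id", 10000),
        dimension=16,
    )
]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[128, 64],
)
```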


Question 30

Which of the following statements is NOT true regarding Google Cloud Bigtable access roles?

  • A. Using IAM roles, you cannot give a user access to only one table in a project, rather than all tables in a project.
  • B. To give a user access to only one table in a project, grant the user the Google Cloud Bigtable Editor role for that table.
  • C. You can configure access control only at the project level.
  • D. To give a user access to only one table in a project, you must configure access through your application.

Correct Answer: B

For Google Cloud Bigtable, you can configure access control at the project level. For example, you can grant the ability to:
– Read from, but not write to, any table within the project.
– Read from and write to any table within the project, but not manage instances.
– Read from and write to any table within the project, and manage instances.

Reference contents:
Access control | Cloud Bigtable Documentation


Question 31

Which of these are examples of a value in a sparse vector? (Select 2 answers.)

  • A. [0, 5, 0, 0, 0, 0]
  • B. [0, 0, 0, 1, 0, 0, 1]
  • C. [0, 1]
  • D. [1, 0, 0, 0, 0, 0, 0]

Correct Answer: C, D

Categorical features in linear models are typically translated into a sparse vector in which each possible value has a corresponding index or id. For example, if there are only three possible eye colors you can represent ‘eye_color’ as a length 3 vector: ‘brown’ would become [1, 0, 0], ‘blue’ would become [0, 1, 0] and ‘green’ would become [0, 0, 1]. These vectors are called “sparse” because they may be very long, with many zeros, when the set of possible values is very large (such as all English words).
[0, 0, 0, 1, 0, 0, 1] is not a valid example because it has two 1s in it; in this one-hot representation, each vector contains a single 1 with all other entries 0.
[0, 5, 0, 0, 0, 0] is not a valid example because it has a 5 in it; these one-hot vectors contain only 0s and 1s.

Reference contents:
Build a linear model with Estimators
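
A tiny hypothetical example of the one-hot encoding described above:

```python
# Hypothetical one-hot encoding of a three-value categorical feature.
EYE_COLORS = ["brown", "blue", "green"]

def one_hot(value):
    return [1 if value == color else 0 for color in EYE_COLORS]

print(one_hot("brown"))  # [1, 0, 0]
print(one_hot("blue"))   # [0, 1, 0]
print(one_hot("green"))  # [0, 0, 1]
```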


Question 32

Which of these is not a supported method of putting data into a partitioned table?

  • A. If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.
  • B. Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format “$YYYYMMDD”.
  • C. Create a partitioned table and stream new records to it every day.
  • D. Use ORDER BY to put a table’s rows into chronological order and then change the table’s type to “Partitioned”.

Correct Answer: D

You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch. Then you can either stream data into it every day and the data will automatically be put in the right partition, or you can load data into a specific partition by using “$YYYYMMDD” at the end of the table name.

Reference contents:
Introduction to partitioned tables | BigQuery
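
A hedged sketch using the BigQuery Python client: the table, bucket, and schema are hypothetical, and the $YYYYMMDD decorator on the destination selects the partition to load into.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A partitioned table must be created as such from the start.
table = bigquery.Table(
    "my-project.my_dataset.events",  # hypothetical table
    schema=[
        bigquery.SchemaField("event_id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table)

# Load one day's file into a specific partition with the $YYYYMMDD decorator.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2024-06-01.json",   # hypothetical source file
    "my-project.my_dataset.events$20240601",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    ),
)
load_job.result()
```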


Question 33

Which of these is NOT a way to customize the software on Google Cloud Dataproc cluster instances?

  • A. Set initialization actions.
  • B. Modify configuration files using cluster properties.
  • C. Configure the cluster using Google Cloud Deployment Manager.
  • D. Log into the master node and make changes from there.

Correct Answer: C

You can access the master node of the cluster by clicking the SSH button next to it in the Google Cloud Console.
You can easily use the --properties option of the gcloud dataproc clusters create command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Google Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Google Cloud Dataproc will run on all nodes in your Google Cloud Dataproc cluster immediately after the cluster is set up. 

Reference contents:
Cluster properties | Dataproc Documentation
Initialization actions | Dataproc Documentation


Question 34

Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?

  • A. Weights
  • B. Biases
  • C. Continuous features
  • D. Input values

Correct Answer: A, B

A neural network is a simple mechanism that’s implemented with basic math. The only difference between the traditional programming model and a neural network is that you let the computer determine the parameters (weights and biases) by learning from training datasets.

Reference contents:
Understanding neural networks with TensorFlow Playground


Question 35

Which of these operations can you perform from the Google BigQuery Web UI?

  • A. Upload a file in SQL format.
  • B. Load data with nested and repeated fields.
  • C. Upload a 20 MB file.
  • D. Upload multiple files using a wildcard.

Correct Answer: B

You can load data with nested and repeated fields using the Web UI.
You cannot use the Web UI to:
– Upload a file greater than 10 MB in size
– Upload multiple files at the same time
– Upload a file in SQL format
All three of the above operations can be performed using the “bq” command.

Reference contents:
Introduction to loading data | BigQuery


Question 36

Which of these rules apply when you add preemptible workers to a Google Cloud Dataproc cluster (select 2 answers)?

  • A. Preemptible workers cannot use persistent disk.
  • B. Preemptible workers cannot store data.
  • C. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
  • D. A Google Cloud Dataproc cluster cannot have only preemptible workers.

Correct Answer: B, D

The following rules will apply when you use preemptible workers with a Google Cloud Dataproc cluster:
– Processing only. Since preemptible workers can be reclaimed at any time, they do not store data. Preemptibles added to a Google Cloud Dataproc cluster only function as processing nodes.
– No preemptible-only clusters. To ensure clusters do not lose all workers, Google Cloud Dataproc cannot create preemptible-only clusters.
– Persistent disk size. By default, all preemptible workers are created with the smaller of 100 GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.

Reference contents:
Secondary workers – preemptible and non-preemptible VMs


Question 37

Which of these sources can you not load data into Google BigQuery from?

  • A. File upload
  • B. Google Drive
  • C. Google Cloud Storage
  • D. Google Cloud SQL

Correct Answer: D

You can load data into Google BigQuery from a file upload, Google Cloud Storage, Google Drive, or Google Cloud Bigtable. It is not possible to load data into Google BigQuery directly from Google Cloud SQL. One way to get data from Google Cloud SQL to Google BigQuery would be to export data from Google Cloud SQL to Google Cloud Storage and then load it from there.

Reference contents:
Introduction to loading data | BigQuery


Question 38

Which of these statements about exporting data from Google BigQuery is false?

  • A. To export more than 1 GB of data, you need to put a wildcard in the destination filename.
  • B. The only supported export destination is Google Cloud Storage.
  • C. Data can only be exported in JSON or Avro format.
  • D. The only compression option available is GZIP.

Correct Answer: C

Data can be exported in CSV, JSON, or Avro format. If you are exporting nested or repeated data, then CSV format is not supported.

Reference contents:
Exporting table data | BigQuery
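
For illustration, a hedged sketch of an export job with the BigQuery Python client; the table and bucket names are placeholders. The wildcard in the destination URI is what allows exports larger than 1 GB, and Cloud Storage is the only supported destination.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",          # hypothetical source table
    "gs://my-bucket/export/shard-*.json.gz",   # wildcard needed above 1 GB
    job_config=job_config,
)
extract_job.result()
```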


Question 39

Which of these statements about Google BigQuery caching is true?

  • A. By default, a query’s results are not cached.
  • B. Google BigQuery caches query results for 48 hours.
  • C. Query results are cached even if you specify a destination table.
  • D. There is no charge for a query that retrieves its results from cache.

Correct Answer: D

When query results are retrieved from a cached results table, you are not charged for the query.
Google BigQuery caches query results for 24 hours, not 48 hours. Query results are not cached if you specify a destination table; otherwise, a query’s results are cached except under certain other conditions.

Reference contents:
Using cached query results > Exceptions to query caching | BigQuery
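
A small sketch using the BigQuery Python client and a public sample table: running the same query twice (with the default caching behavior and no destination table) should let the second job report cache_hit, and cached results are not billed.

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`"

first = client.query(sql)
first.result()
second = client.query(sql)
second.result()

# cache_hit is True when results came from the cached results table;
# such a query incurs no charge.
print(first.cache_hit, second.cache_hit)
```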


Question 40

Which role must be assigned to a service account used by the virtual machines in a Google Cloud Dataproc cluster so they can execute jobs?

  • A. Dataproc Worker
  • B. Dataproc Viewer
  • C. Dataproc Runner
  • D. Dataproc Editor

Correct Answer: A

Service accounts used with Google Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).

Reference contents:
Service accounts | Dataproc Documentation


Question 41

Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Google Cloud Bigtable cluster (select 2 answers)?

  • A. A sequential numeric ID
  • B. A timestamp followed by a stock symbol
  • C. A non-sequential numeric ID
  • D. A stock symbol followed by a timestamp

Correct Answer: A, B

Using a timestamp as the first element of a row key can cause a variety of problems.
In brief, when a row key for a time series includes a timestamp, all of your writes will target a single node, fill that node, and then move on to the next node in the cluster, resulting in hotspotting.
Suppose your system assigns a numeric ID to each of your application’s users. You might be tempted to use the user’s numeric ID as the row key for your table.
However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes.

Reference contents:
Schema design for time series data > Ensure that your row key avoids hotspotting | Cloud Bigtable Documentation
Designing your schema | Cloud Bigtable Documentation


Question 42

Which software libraries are supported by Google Cloud Machine Learning Engine?

  • A. Theano and TensorFlow
  • B. Theano and Torch
  • C. TensorFlow
  • D. TensorFlow and Torch

Correct Answer: C

Google Cloud ML Engine mainly does two things:
– Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud.
– Hosts those trained models for you in the cloud so that you can use them to get predictions about new data.

Reference contents:
Introduction to AI Platform


Question 43

Which SQL keyword can be used to reduce the number of columns processed by Google BigQuery?

  • A. BETWEEN
  • B. WHERE
  • C. SELECT
  • D. LIMIT

Correct Answer: C

SELECT allows you to query specific columns rather than the whole table. LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by Google BigQuery.

Reference contents:
Launch Checklist for Google Cloud Platform > Architecture design and development checklist  | Documentation


Question 44

Which TensorFlow function can you use to configure a categorical column if you don’t know all of the possible values for that column?

  • A. categorical_column_with_vocabulary_list
  • B. categorical_column_with_hash_bucket
  • C. categorical_column_with_unknown_values
  • D. sparse_column_with_keys

Correct Answer: B

If you know the set of all possible feature values of a column and there are only a few of them, you can use categorical_column_with_vocabulary_list. Each key in the list will get assigned an auto-incremental ID starting from 0.
What if we don’t know the set of possible values in advance? Not a problem. We can use categorical_column_with_hash_bucket instead. Each possible value in the feature column (for example, occupation) will be hashed to an integer ID as it is encountered in training.

Reference contents:
tf.feature_column.categorical_column_with_hash_bucket
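
A short sketch of both functions from the tf.feature_column API; the column names and vocabulary are hypothetical.

```python
import tensorflow as tf

# Known, small vocabulary: enumerate the values explicitly.
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", ["Bachelors", "Masters", "Doctorate"]
)

# Unknown vocabulary: hash each value into one of N buckets as it is seen.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000
)
```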


Question 45

Why do you need to split a machine learning dataset into training data and test data?

  • A. So you can try two different sets of features.
  • B. To make sure your model is generalized for more than just the training data.
  • C. To allow you to create unit tests in your code.
  • D. So you can use one dataset for a wide model and one for a deep model.

Correct Answer: B

The flaw with evaluating a predictive model on training data is that it does not inform you on how well the model has generalized to new unseen data. A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting.

Reference contents:
A Simple Intuition for Overfitting, or Why Testing on Training Data is a Bad Idea – MachineLearningMastery.com
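
For example, a minimal scikit-learn sketch with synthetic data that holds out 20% of the rows as an unseen test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% as unseen test data to measure generalization rather than
# memorization of the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```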


Question 46

You architect a system to analyze seismic data.
Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted.
How should you change your ETL process to carry out sensor calibration systematically in the future?

  • A. Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.
  • B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.
  • C. Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.
  • D. Develop an algorithm through simulation to predict variance of data output from the last MapReduce job based on calibration factors, and apply the correction to all data.

Correct Answer: A


Question 47

You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets.
You use on-demand pricing for Google BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don’t get slots to execute their query and you need to correct this. You’d like to avoid introducing new projects to your account.
What should you do?

  • A. Convert your batch BigQuery queries into interactive Google BigQuery queries.
  • B. Create an additional project to overcome the 2K on-demand per-project quota.
  • C. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
  • D. Increase the amount of concurrent slots per project at the Quotas page at the Google Cloud Console.

Correct Answer: C

Reference contents:
Busting 12 myths about BigQuery


Question 48

You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home.
You need to interpret customer voice commands and issue an order to the backend systems.
Which solutions should you choose?

  • A. Google Cloud Speech-to-Text API
  • B. Google Cloud Natural Language API
  • C. Dialogflow Enterprise Edition
  • D. Google Cloud AutoML Natural Language

Correct Answer: D


Question 49

You are building a data pipeline on Google Cloud.
You need to prepare data using a causal method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed.
What should you do?

  • A. Use Google Cloud Dataprep to find null values in sample source data. Convert all nulls to ‘none’ using a Google Cloud Dataproc job.
  • B. Use Google Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Google Cloud Dataprep job.
  • C. Use Google Cloud Dataflow to find null values in sample source data. Convert all nulls to “none” using a Google Cloud Dataprep job.
  • D. Use Google Cloud Dataflow to find null values in sample source data. Convert all nulls to 0 using a custom script.

Correct Answer: C


Question 50

You are building a model to make clothing recommendations.
You know a user’s fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available.
How should you use this data to train the model?

  • A. Continuously retrain the model on just the new data.
  • B. Continuously retrain the model on a combination of existing data and the new data.
  • C. Train on the existing data while using the new data as your test set.
  • D. Train on the new data while using the existing data as your test set.

Correct Answer: B
