Google Cloud Certified - Professional Data Engineer

Ace Your Professional Data Engineer Certification with Practice Exams.

Google Cloud Certified – Professional Data Engineer – Practice Exam (50 Questions)


Question 1

A data scientist has created a Google BigQuery ML model and asks you to create an ML pipeline to serve predictions.
You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions:

SELECT predicted_label, user_id FROM ML.PREDICT(MODEL `dataset.model`, TABLE user_features)

How should you create the ML pipeline?

  • A. Add a WHERE clause to the query, and grant the Google BigQuery Data Viewer role to the application service account.
  • B. Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account.
  • C. Create a Google Cloud Dataflow pipeline using Google BigQueryIO to read results from the query. Grant the dataflow.worker role to the application service account.
  • D. Create a Google Cloud Dataflow pipeline using Google BigQueryIO to read predictions for all users from the query. Write the results to Google Cloud Bigtable using BigtableIO. Grant the bigtable.reader role to the application service account so that the application can read predictions for individual users from Google Cloud Bigtable.

Correct Answer: D


Question 2

A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time.
This is then loaded into Google BigQuery. Analysts in your company want to query the tracking data in Google BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in Google BigQuery.
What should you do?

  • A. Implement clustering in Google BigQuery on the ingest date column.
  • B. Implement clustering in Google BigQuery on the package-tracking ID column.
  • C. Tier older data onto Google Cloud Storage files, and leverage extended tables.
  • D. Re-create the table using data partitioning on the package delivery date.

Correct Answer: A


Question 3

After migrating ETL jobs to run on Google BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original.
You’ve loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?

  • A. Select random samples from the tables using the RAND() function and compare the samples.
  • B. Select random samples from the tables using the HASH() function and compare the samples.
  • C. Use a Google Cloud Dataproc cluster and the Google BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
  • D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.

Correct Answer: C


Question 4

All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Google Cloud Bigtable node.

  • A. before
  • B. after only if
  • C. once

Correct Answer: A

In a Google Cloud Bigtable architecture, all client requests go through a front-end server before they are sent to a Google Cloud Bigtable node.
The nodes are organized into a Google Cloud Bigtable cluster, which belongs to a Google Cloud Bigtable instance, a container for the cluster. Each node in the cluster handles a subset of the requests to the cluster.
Adding nodes to a cluster increases the number of simultaneous requests the cluster can handle, as well as the maximum throughput of the entire cluster.

Reference contents:
Overview of Cloud Bigtable | Cloud Bigtable Documentation


Question 5

An external customer provides you with a daily dump of data from their database.
The data flows into Google Cloud Storage as CSV files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted.
How should you build this pipeline?

  • A. Use federated data sources, and check data in the SQL query.
  • B. Enable Google BigQuery Monitoring in Google Stackdriver and create an alert.
  • C. Import the data into Google BigQuery using the gcloud CLI and set max_bad_records to 0.
  • D. Run a Google Cloud Dataflow batch pipeline to import the data into Google BigQuery, and push errors to another dead-letter table for analysis.

Correct Answer: D


Question 6

An online retailer has built their current application on Google App Engine.
A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose.
Which Google Cloud database should they choose?

  • A. Google BigQuery
  • B. Google Cloud SQL
  • C. Google Cloud BigTable
  • D. Google Cloud Datastore

Correct Answer: B

Reference contents:
Looker Business Intelligence Platform


Question 7

An organization maintains a Google BigQuery dataset that contains tables with user-level data.
They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects.
What should they do?

  • A. Create and share an authorized view that provides the aggregate results.
  • B. Create and share a new dataset and view that provides the aggregate results.
  • C. Create and share a new dataset and table that contains the aggregate results.
  • D. Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.

Correct Answer: A

Reference contents:
Predefined roles and permissions | BigQuery
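
A sketch of the authorized-view approach in option A, with hypothetical dataset, table, and column names. The view exposes only aggregates, is placed in a dataset that is shared with the other projects, and queries against it are billed to the project that runs them:

CREATE VIEW shared_views.daily_signup_counts AS
SELECT
  country,
  DATE(signup_ts) AS signup_date,
  COUNT(*) AS user_count
FROM private_dataset.users
GROUP BY country, signup_date;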


Question 8

As your organization expands its usage of GCP, many teams have started to create their own projects.
Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects. Furthermore, data from Google Cloud Storage buckets and Google BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies.
Which two steps should you take? (Choose two.)

  • A. Use Google Cloud Deployment Manager to automate access provision.
  • B. Introduce resource hierarchy to leverage access control policy inheritance.
  • C. Create distinct groups for various teams, and specify groups in Cloud IAM policies.
  • D. Only use service accounts when sharing data for Google Cloud Storage buckets and Google BigQuery datasets.
  • E. For each Google Cloud Storage bucket or Google BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.

Correct Answer: B, C


Question 9

Business owners at your company have given you a database of bank transactions.
Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data.
Which three machine learning applications can you use? (Choose three.)

  • A. Supervised learning to determine which transactions are most likely to be fraudulent.
  • B. Unsupervised learning to determine which transactions are most likely to be fraudulent.
  • C. Clustering to divide the transactions into N categories based on feature similarity.
  • D. Supervised learning to predict the location of a transaction.
  • E. Reinforcement learning to predict the location of a transaction.
  • F. Unsupervised learning to predict the location of a transaction.

Correct Answer: B, C, D


Question 10

By default, which of the following windowing behaviors does Google Cloud Dataflow apply to unbounded data sets?

  • A. Windows at every 100 MB of data
  • B. Single, Global Window
  • C. Windows at every 1 minute
  • D. Windows at every 10 minutes

Correct Answer: B

Google Cloud Dataflow’s default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.

Reference contents:
PCollection


Question 11

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects.
Your organization requires that all Google BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects.
What should you do?

  • A. Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.
  • B. Export the data access logs via a project-level export sink to a Google Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Google Cloud Storage bucket.
  • C. Export the data access logs via a project-level export sink to a Google Cloud Storage bucket in newly created projects for audit logs. Restrict access to the project with the exported logs.
  • D. Export the data access logs via an aggregated export sink to a Google Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

Correct Answer: D


Question 12

Does Google Cloud Dataflow process batch data pipelines or streaming data pipelines?

  • A. Only Batch Data Pipelines.
  • B. Both Batch and Streaming Data Pipelines.
  • C. Only Streaming Data Pipelines.
  • D. None of the above.

Correct Answer: B

Google Cloud Dataflow provides a unified processing model and can execute both streaming and batch data pipelines.

Reference contents:
Dataflow


Question 13

Each analytics team in your organization is running Google BigQuery jobs in their own projects.
You want to enable each team to monitor slot usage within their projects.
What should you do?

  • A. Create a Stackdriver Monitoring dashboard based on the Google BigQuery metric query/scanned_bytes.
  • B. Create a Stackdriver Monitoring dashboard based on the Google BigQuery metric slots/allocated_for_project.
  • C. Create a log export for each project, capture the Google BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric.
  • D. Create an aggregated log export at the organization level, capture the Google BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric.

Correct Answer: B

Reference contents:
Monitoring resource usage in a cloud data warehouse


Question 14

Flowlogistic is rolling out their real-time inventory tracking system.
The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?

  • A. Attach the timestamp on each message in the Google Cloud Pub/Sub subscriber application as they are received.
  • B. Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Google Cloud Pub/Sub.
  • C. Use the NOW () function in Google BigQuery to record the event’s time.
  • D. Use the automatically generated timestamp from Google Cloud Pub/Sub to order the data.

Correct Answer: B


Question 15

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to Google BigQuery.
Flowlogistic does not know how to store the data that is common to both workloads.
What should they do?

  • A. Store the common data in Google BigQuery as partitioned tables.
  • B. Store the common data in Google BigQuery and expose authorized views.
  • C. Store the common data encoded as Avro in Google Cloud Storage.
  • D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Correct Answer: B


Question 16

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field.
This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of Google BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way.
What should you do?

  • A. Export the data into a Google Sheet for visualization.
  • B. Create an additional table with only the necessary columns.
  • C. Create a view on the table to present to the visualization tool.
  • D. Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Correct Answer: C


Question 17

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real time, and store the data reliably.
Which combination of GCP products should you choose?

  • A. Google Cloud Pub/Sub, Google Cloud Dataflow, and Google Cloud Storage
  • B. Google Cloud Pub/Sub, Google Cloud Dataflow, and Local SSD
  • C. Google Cloud Pub/Sub, Google Cloud SQL, and Google Cloud Storage
  • D. Google Cloud Load Balancing, Google Cloud Dataflow, and Google Cloud Storage
  • E. Google Cloud Dataflow, Google Cloud SQL, and Google Cloud Storage

Correct Answer: A


Question 18

For the best possible performance, what is the recommended zone for your Google Compute Engine instance and Google Cloud Bigtable instance?

  • A. Have the Google Compute Engine instance in the furthest zone from the Google Cloud Bigtable instance.
  • B. Have both the Google Compute Engine instance and the Google Cloud Bigtable instance to be in different zones.
  • C. Have both the Google Compute Engine instance and the Google Cloud Bigtable instance to be in the same zone.
  • D. Have the Google Cloud Bigtable instance to be in the same zone as all of the consumers of your data.

Correct Answer: C

It is recommended to create your Google Compute Engine instance in the same zone as your Google Cloud Bigtable instance for the best possible performance. If it is not possible to create the instance in the same zone, create it in another zone within the same region; for example, if your Google Cloud Bigtable instance is located in us-central1-b, you could create your instance in us-central1-f. This configuration may add several milliseconds of latency to each Google Cloud Bigtable request. Avoid creating your Google Compute Engine instance in a different region from your Google Cloud Bigtable instance, which can add hundreds of milliseconds of latency to each request.

Reference contents:
Cloud Bigtable OAuth scopes | Cloud Bigtable Documentation


Question 19

Your weather app queries a database every 15 minutes to get the current temperature.
The frontend is powered by Google App Engine and serves millions of users.
How should you design the frontend to respond to a database failure?

  • A. Issue a command to restart the database servers.
  • B. Retry the query with exponential backoff, up to a cap of 15 minutes.
  • C. Retry the query every second until it comes back online to minimize staleness of data.
  • D. Reduce the query frequency to once every hour until the database comes back online.

Correct Answer: B


Question 20

Given the volume of records MJTelco expects to ingest each day, they are concerned about their Google BigQuery costs increasing.
MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion.
What should you do?

  • A. Create a table called tracking_table and include a DATE column.
  • B. Create a partitioned table called tracking_table and include a TIMESTAMP column.
  • C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
  • D. Create a table called tracking_table with a TIMESTAMP column to represent the day.

Correct Answer: B
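
A minimal sketch of the design in option B, with hypothetical table and column names. Daily queries then filter on the partitioning column, so BigQuery scans only a single day's partition:

CREATE TABLE mydataset.tracking_table (
  event_ts TIMESTAMP,
  device_id STRING,
  payload STRING
)
PARTITION BY DATE(event_ts);

-- A fine-grained daily analysis query scans one partition:
SELECT device_id, COUNT(*) AS events
FROM mydataset.tracking_table
WHERE DATE(event_ts) = '2021-06-01'
GROUP BY device_id;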


Question 21

Google Cloud Bigtable indexes a single value in each row. This value is called the _______.

  • A. primary key
  • B. unique key
  • C. row key
  • D. master key

Correct Answer: C

Google Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.

Reference contents:
Overview of Cloud Bigtable | Cloud Bigtable Documentation


Question 22

Google Cloud Bigtable is a recommended option for storing very large amounts of ____________________________?

  • A. multi-keyed data with very high latency
  • B. multi-keyed data with very low latency
  • C. single-keyed data with very low latency
  • D. single-keyed data with very high latency

Correct Answer: C

Google Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key. Google Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.

Reference contents:
Overview of Cloud Bigtable | Cloud Bigtable Documentation


Question 23

Google Cloud Bigtable is Google’s ______ Big Data database service.

  • A. Relational
  • B. MySQL
  • C. NoSQL
  • D. SQL Server

Correct Answer: C

Google Cloud Bigtable is Google’s NoSQL Big Data database service. It is the same database that Google uses for services, such as Search, Analytics, Maps, and Gmail. It is used for requirements that are low latency and high throughput including Internet of Things (IoT), user analytics, and financial data analysis.

Reference contents:
Cloud Bigtable: NoSQL database service


Question 24

Google Cloud Dataproc charges you only for what you really use with _____ billing.

  • A. month-by-month
  • B. minute-by-minute
  • C. week-by-week
  • D. hour-by-hour

Correct Answer: B

One of the advantages of Google Cloud Dataproc is its low cost. Google Cloud Dataproc charges for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period.

Reference contents:
What is Dataproc? | Dataproc Documentation


Question 25

Google Cloud Dataproc clusters contain many configuration files.
To update these files, you will need to use the --properties option.
The format for the option is: file_prefix:property=_____.

  • A. details
  • B. value
  • C. null
  • D. id

Correct Answer: B

To make updating files and properties easy, the --properties flag uses a special format to specify the configuration file and the property and value within the file that should be updated. The format is as follows: file_prefix:property=value.

Reference contents:
Cluster properties > Formatting


Question 26

Google Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.

  • A. Blaze
  • B. Spark
  • C. Fire
  • D. Ignite

Correct Answer: B

Google Cloud Dataproc is a managed Apache Spark and Apache Hadoop service that lets you use open source data tools for batch processing, querying, streaming, and machine learning.

Reference contents:
Dataproc documentation | Dataproc Documentation


Question 27

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data.
Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?

  • A. Encrypted on Google Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.
  • B. In a Google BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
  • C. In Google Cloud SQL, with separate database user names to each user. The Google Cloud SQL Admin activity logs will be used to provide the auditability.
  • D. In a bucket on Google Cloud Storage that is accessible only by an Google App Engine service that collects user information and logs the access before providing a link to the bucket.

Correct Answer: B


Question 28

How can you get a neural network to learn about relationships between categories in a categorical feature?

  • A. Create a multi-hot column.
  • B. Create a one-hot column.
  • C. Create a hash bucket.
  • D. Create an embedding column.

Correct Answer: D

There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn’t encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.
Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let's say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case).
You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.

Reference contents:
Introduction to Google AI Platform Course


Question 29

How would you query specific partitions in a Google BigQuery table?

  • A. Use the DAY column in the WHERE clause.
  • B. Use the EXTRACT(DAY) clause.
  • C. Use the _PARTITIONTIME pseudo-column in the WHERE clause.
  • D. Use DATE BETWEEN in the WHERE clause.

Correct Answer: C

Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. To limit a query to particular partitions (such as Jan 1st and 2nd of 2017), use a clause similar to this:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')
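
In context, a full query against a hypothetical ingestion-time-partitioned table would look like this (the pseudo column can be filtered on even though it does not appear in the table schema):

SELECT field1, field2
FROM mydataset.partitioned_table
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')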

Reference contents:
Introduction to partitioned tables | BigQuery


Question 30

If a dataset contains rows with individual people and columns for year of birth, country, and income, how many of the columns are continuous and how many are categorical?

  • A. 1 continuous and 2 categorical
  • B. 3 categorical
  • C. 3 continuous
  • D. 2 continuous and 1 categorical

Correct Answer: D

The columns can be grouped into two types, categorical and continuous columns:
A column is called categorical if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
Year of birth and income are continuous columns. Country is a categorical column. You could use bucketization to turn year of birth and/or income into categorical features, but the raw columns are continuous.


Question 31

If you want to create a machine learning model that predicts the price of a particular stock based on its recent price history, what type of estimator should you use?

  • A. Unsupervised learning
  • B. Regressor
  • C. Classifier
  • D. Clustering estimator

Correct Answer: B

Regression is the supervised learning task for modeling and predicting continuous, numeric variables. Examples include predicting real-estate prices, stock price movements, or student test scores. Classification is the supervised learning task for modeling and predicting categorical variables. Examples include predicting employee churn, email spam, financial fraud, or student letter grades.
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset.
Examples include customer segmentation, grouping similar items in e-commerce, and social network analysis.

Reference contents:
Modern Machine Learning Algorithms: Strengths and Weaknesses


Question 32

If you’re running a performance test that depends upon Google Cloud Bigtable, all the choices except one below are recommended steps.
Which is NOT a recommended step to follow?

  • A. Do not use a production instance.
  • B. Run your test for at least 10 minutes.
  • C. Before you test, run a heavy pre-test for several minutes.
  • D. Use at least 300 GB of data.

Correct Answer: A

If you’re running a performance test that depends upon Google Cloud Bigtable, be sure to follow these steps as you plan and execute your test:
Use a production instance. A development instance will not give you an accurate sense of how a production instance performs under load.
Use at least 300 GB of data. Google Cloud Bigtable performs best with 1 TB or more of data. However, 300 GB of data is enough to provide reasonable results in a performance test on a 3-node cluster. On larger clusters, use 100 GB of data per node.
Before you test, run a heavy pre-test for several minutes. This step gives Google Cloud Bigtable a chance to balance data across your nodes based on the access patterns it observes.
Run your test for at least 10 minutes. This step lets Google Cloud Bigtable further optimize your data, and it helps ensure that you will test reads from disk as well as cached reads from memory.

Reference contents:
Understanding Cloud Bigtable performance


Question 33

In order to securely transfer web traffic data from your computer’s web browser to the Google Cloud Dataproc cluster you should use a(n) _____.

  • A. VPN connection
  • B. Special browser
  • C. SSH tunnel
  • D. FTP connection

Correct Answer: C

To connect to the web interfaces, it is recommended to use an SSH tunnel to create a secure connection to the master node.

Reference contents:
Cluster web interfaces > Connecting to web interfaces


Question 34

MJTelco is building a custom interface to share data. They have these requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan specific time range rows with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?

  • A. Google Cloud Datastore and Google Cloud Bigtable
  • B. Google Cloud Bigtable and Google Cloud SQL
  • C. Google BigQuery and Google Cloud Bigtable
  • D. Google BigQuery and Google Cloud Storage

Correct Answer: C


Question 35

MJTelco needs you to create a schema in Google Cloud Bigtable that will allow for the historical analysis of the last 2 years of records.
Records arrive every 15 minutes, and each contains a unique device identifier and a data record. The most common query is for all the data for a given device for a given day.
Which schema should you use?

  • A. Rowkey: date#device_id Column data: data_point
  • B. Rowkey: date Column data: device_id, data_point
  • C. Rowkey: device_id Column data: date, data_point
  • D. Rowkey: data_point Column data: device_id, date
  • E. Rowkey: date#data_point Column data: device_id

Correct Answer: A

Reference contents:
Designing your schema | Cloud Bigtable Documentation


Question 36

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations.
You want to allow Google Cloud Dataflow to scale its compute power up as required.
Which Google Cloud Dataflow pipeline configuration setting should you update?

  • A. The zone
  • B. The number of workers
  • C. The disk size per worker
  • D. The maximum number of workers

Correct Answer: D


Question 37

Scaling a Google Cloud Dataproc cluster typically involves ____.

  • A. increasing or decreasing the number of worker nodes.
  • B. increasing or decreasing the number of master nodes.
  • C. moving memory to run more applications on a single node.
  • D. deleting applications from unused nodes periodically.

Correct Answer: A

After creating a Google Cloud Dataproc cluster, you can scale the cluster by increasing or decreasing the number of worker nodes in the cluster at any time, even when jobs are running on the cluster. Google Cloud Dataproc clusters are typically scaled to:
1) increase the number of workers to make a job run faster
2) decrease the number of workers to save money
3) increase the number of nodes to expand available Hadoop Distributed File System (HDFS) storage

Reference contents:
Scaling clusters | Dataproc Documentation


Question 38

Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face.
To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?

  • A. Use K-means Clustering to detect faces in the pixels.
  • B. Use feature engineering to add features for eyes, noses, and mouths to the input data.
  • C. Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.
  • D. Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.

Correct Answer: C

Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as “deep” learning. So deep is a strictly defined, technical term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer's output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
A neural network with only one hidden layer would be unable to automatically recognize high-level features of faces, such as eyes, because it wouldn't be able to "build" these features using previous hidden layers that detect low-level features, such as lines.
Feature engineering is difficult to perform on raw image data.
K-means Clustering is an unsupervised learning method used to categorize unlabeled data.


Question 39

Suppose you have a table that includes a nested column called “city” inside a column called “person”, but when you try to submit the following query in Google BigQuery, it gives you an error.

SELECT person FROM `project1.example.table1` WHERE city = "London"

How would you correct the error?

  • A. Add “, UNNEST(person)” before the WHERE clause.
  • B. Change “person” to “person.city”.
  • C. Change “person” to “city.person”.
  • D. Add “, UNNEST(city)” before the WHERE clause.

Correct Answer: A

To access the city field inside person, you need to UNNEST(person) and join it to table1 using a comma, which performs an implicit CROSS JOIN.
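
For example, assuming person is a repeated (ARRAY of STRUCT) field whose elements contain a city subfield, the corrected query would be:

SELECT person
FROM `project1.example.table1`, UNNEST(person)
WHERE city = "London"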

Reference contents:
Migrating to standard SQL > Nested repeated results


Question 40

The _________ for Google Cloud Bigtable makes it possible to use Google Cloud Bigtable in a Google Cloud Dataflow pipeline.

  • A. Google Cloud Dataflow connector
  • B. Google Cloud Dataflow SDK
  • C. Google BigQuery API
  • D. Google BigQuery Data Transfer Service

Correct Answer: A

The Google Cloud Dataflow connector for Google Cloud Bigtable makes it possible to use Google Cloud Bigtable in a Google Cloud Dataflow pipeline. You can use the connector for both batch and streaming operations.

Reference contents:
Dataflow Connector for Cloud Bigtable | Cloud Bigtable Documentation


Question 41

The CUSTOM tier for Google Cloud Machine Learning Engine allows you to specify the number of which types of cluster nodes?

  • A. Workers
  • B. Masters, workers, and parameter servers
  • C. Workers and parameter servers
  • D. Parameter servers

Correct Answer: C

The CUSTOM tier is not a set tier, but rather enables you to use your own cluster specification. When you use this tier, set values to configure your processing cluster according to these guidelines:
You must set TrainingInput.masterType to specify the type of machine to use for your master node.
You may set TrainingInput.workerCount to specify the number of workers to use.
You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use.
You can specify the type of machine for the master node, but you can’t specify more than one master node.

Reference contents:
Training overview | AI Platform Training
Specifying machine types or scale tiers | AI Platform Training


Question 42

The Google Cloud Dataflow SDKs have been recently transitioned into which Apache service?

  • A. Apache Spark
  • B. Apache Hadoop
  • C. Apache Kafka
  • D. Apache Beam

Correct Answer: D

The Google Cloud Dataflow SDKs have been donated to the Apache Software Foundation and are now developed and released as Apache Beam.

Reference contents:
Dataflow documentation


Question 43

The marketing team at your organization provides regular updates of a segment of your customer dataset.
The marketing team has given you a CSV with 1 million records that must be updated in Google BigQuery. When you use the UPDATE statement in Google BigQuery, you receive a quotaExceeded error.
What should you do?

  • A. Reduce the number of records updated each day to stay within the Google BigQuery UPDATE DML statement limit.
  • B. Increase the Google BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.
  • C. Split the source CSV file into smaller CSV files in Google Cloud Storage to reduce the number of Google BigQuery UPDATE DML statements per Google BigQuery job.
  • D. Import the new records from the CSV file into a new Google BigQuery table. Create a Google BigQuery job that merges the new records with the existing records and writes the results to a new Google BigQuery table.

Correct Answer: D
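
A sketch of the approach in option D, using a hypothetical staging table customer_updates loaded from the CSV. A single job joins the new records with the existing table and writes the combined result to a new table, avoiding the per-table UPDATE DML limits:

CREATE OR REPLACE TABLE mydataset.customers_merged AS
SELECT
  customer_id,
  COALESCE(u.segment, c.segment) AS segment
FROM mydataset.customers AS c
FULL OUTER JOIN mydataset.customer_updates AS u
  USING (customer_id);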


Question 44

The YARN ResourceManager and the HDFS NameNode interfaces are available on a Google Cloud Dataproc cluster ____.

  • A. application node
  • B. conditional node
  • C. master node
  • D. worker node

Correct Answer: C

The YARN ResourceManager and the HDFS NameNode interfaces are available on a Google Cloud Dataproc cluster master node. The cluster's master host name is the name of your Google Cloud Dataproc cluster followed by an -m suffix; for example, if your cluster is named "my-cluster", the master host name would be "my-cluster-m".

Reference contents:
Cluster web interfaces > Available interfaces


Question 45

To give a user read permission for only the first three columns of a table, which access control method would you use?

  • A. Primitive role
  • B. Predefined role
  • C. Authorized view
  • D. It’s not possible to give access to only the first three columns of a table.

Correct Answer: C

An authorized view allows you to share query results with particular users and groups without giving them read access to the underlying tables. Authorized views can only be created in a dataset that does not contain the tables queried by the view.
When you create an authorized view, you use the view’s SQL query to restrict access to only the rows and columns you want the users to see.
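
A minimal sketch with hypothetical dataset, table, and column names. The view selects only the first three columns and lives in a separate dataset, which is then shared with the user and added as an authorized view on the source dataset:

CREATE VIEW shared_views.customers_limited AS
SELECT first_name, last_name, email
FROM private_dataset.customers;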

Reference contents:
Creating authorized views | BigQuery


Question 46

To run a TensorFlow training job on your own computer using Google Cloud Machine Learning Engine, what would your command start with?

  • A. gcloud ml-engine local train.
  • B. gcloud ml-engine jobs submit training.
  • C. gcloud ml-engine jobs submit training local.
  • D. You can’t run a TensorFlow program on your own computer using Google Cloud ML Engine.

Correct Answer: A

gcloud ml-engine local train – run a Google Cloud ML Engine training job locally.
This command runs the specified module in an environment similar to that of a live Google Cloud ML Engine Training Job.
This is especially useful in the case of testing distributed models, as it allows you to validate that you are properly interacting with the Google Cloud ML Engine cluster configuration.

Reference contents:
gcloud ml-engine local train | Cloud SDK Documentation


Question 47

What are all of the Google BigQuery operations that Google charges for?

  • A. Storage, queries, and streaming inserts.
  • B. Storage, queries, and loading data from a file.
  • C. Storage, queries, and exporting data.
  • D. Queries and streaming inserts.

Correct Answer: A

Google charges for storage, queries, and streaming inserts. Loading data from a file and exporting data are free operations.

Reference contents:
Pricing | BigQuery


Question 48

What are the minimum permissions needed for a service account used with Google Cloud Dataproc?

  • A. Execute to Google Cloud Storage; write to Google Cloud Logging.
  • B. Write to Google Cloud Storage; read to Google Cloud Logging.
  • C. Execute to Google Cloud Storage; execute to Google Cloud Logging.
  • D. Read and write to Google Cloud Storage; write to Google Cloud Logging.

Correct Answer: D

Service accounts authenticate applications running on your virtual machine instances to other Google Cloud Platform services. For example, if you write an application that reads and writes files on Google Cloud Storage, it must first authenticate to the Google Cloud Storage API. At a minimum, service accounts used with Google Cloud Dataproc need permissions to read and write to Google Cloud Storage, and to write to Google Cloud Logging.

Reference contents:
Service accounts | Dataproc Documentation


Question 49

What are two methods that can be used to denormalize tables in Google BigQuery?

  • A.
    • 1) Split table into multiple tables.
    • 2) Use a partitioned table.
  • B.
    • 1) Join tables into one table.
    • 2) Use nested repeated fields.
  • C.
    • 1) Use a partitioned table.
    • 2) Join tables into one table.
  • D.
    • 1) Use nested repeated fields.
    • 2) Use a partitioned table.

Correct Answer: B

The conventional method of denormalizing data involves simply writing a fact, along with all its dimensions, into a flat table structure. For example, if you are dealing with sales transactions, you would write each individual fact to a record, along with the accompanying dimensions such as order and customer information.
The other method for denormalizing data takes advantage of Google BigQuery native support for nested and repeated structures in JSON or Avro input data. Expressing records using nested and repeated structures can provide a more natural representation of the underlying data. In the case of the sales order, the outer part of a JSON structure would contain the order and customer information, and the inner part of the structure would contain the individual line items of the order, which would be represented as nested, repeated elements.
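
As an illustration of the second method, nested and repeated structures can also be produced directly in SQL with ARRAY_AGG and STRUCT (table and column names here are hypothetical):

SELECT
  o.order_id,
  o.customer_id,
  ARRAY_AGG(STRUCT(li.sku, li.quantity, li.unit_price)) AS line_items
FROM mydataset.orders AS o
JOIN mydataset.line_items AS li
  ON o.order_id = li.order_id
GROUP BY o.order_id, o.customer_id;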

Reference contents:
BigQuery for data warehouse practitioners > Denormalization


Question 50

What are two of the benefits of using denormalized data structures in Google BigQuery?

  • A. Reduces the amount of data processed, reduces the amount of storage required.
  • B. Increases query speed, makes queries simpler.
  • C. Reduces the amount of storage required, increases query speed.
  • D. Reduces the amount of data processed, increases query speed.

Correct Answer: B

Denormalization increases query speed for tables with billions of rows because Google BigQuery’s performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don’t have to use JOINs, since all of the data has been combined into one table. Denormalization also makes queries simpler because you do not have to use JOIN clauses. Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.

Reference contents:
BigQuery for data warehouse practitioners > Denormalization
