Data Engineering on Google Cloud Platform

Duration:

3 Days

Audience:

Employees of federal, state, and local governments, and businesses working with the government.

This class is intended for experienced developers who are responsible for managing big data transformations including:

  • Extracting, loading, transforming, cleaning, and validating data
  • Designing pipelines and architectures for data processing
  • Creating and maintaining machine learning and statistical models
  • Querying datasets, visualizing query results and creating reports

Course Overview:

This instructor-led class provides a hands-on introduction to designing and building data processing systems on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you will learn how to design data processing systems, build end-to-end data pipelines, analyze data, and carry out machine learning. The course covers structured, unstructured, and streaming data.

What You’ll Learn:

  • Design and build data processing systems on Google Cloud Platform
  • Process batch and streaming data by implementing autoscaling data pipelines on Cloud Dataflow
  • Derive business insights from extremely large datasets using Google BigQuery
  • Train, evaluate, and predict with machine learning models using TensorFlow and Cloud ML
  • Leverage unstructured data using Spark and ML APIs on Cloud Dataproc
  • Enable instant insights from streaming data

1. Serverless Data Analysis with BigQuery

  • What is BigQuery
  • Advanced Capabilities
  • Performance and pricing

2. Serverless, Autoscaling Data Pipelines with Dataflow

3. Getting Started with Machine Learning

  • What is machine learning (ML)
  • Effective ML: concepts, types
  • Evaluating ML
  • ML datasets: generalization

4. Building ML Models with TensorFlow

  • Getting started with TensorFlow
  • TensorFlow graphs and loops + lab
  • Monitoring ML training

5. Scaling ML Models with Cloud ML

  • Why Cloud ML?
  • Packaging up a TensorFlow model
  • End-to-end training

6. Feature Engineering

  • Creating good features
  • Transforming inputs
  • Synthetic features
  • Preprocessing with Cloud ML

7. ML Architectures

  • Wide and deep
  • Image analysis
  • Embeddings and sequences
  • Recommendation systems

8. Google Cloud Dataproc Overview

  • Introducing Google Cloud Dataproc
  • Creating and managing clusters
  • Defining master and worker nodes
  • Leveraging custom machine types and preemptible worker nodes
  • Creating clusters with the Web Console
  • Scripting clusters with the CLI
  • Using the Dataproc REST API
  • Dataproc pricing
  • Scaling and deleting clusters

9. Running Dataproc Jobs

  • Controlling application versions
  • Submitting jobs
  • Accessing HDFS and GCS
  • Hadoop
  • Spark and PySpark
  • Pig and Hive
  • Logging and monitoring jobs
  • Accessing master and worker nodes with SSH
  • Working with PySpark REPL (command-line interpreter)

10. Integrating Dataproc with Google Cloud Platform

  • Initialization actions
  • Programming Jupyter/Datalab notebooks
  • Accessing Google Cloud Storage
  • Leveraging relational data with Google Cloud SQL
  • Reading and writing streaming data with Google Bigtable
  • Querying data from Google BigQuery
  • Making Google API Calls from notebooks

11. Making Sense of Unstructured Data with Google’s Machine Learning APIs

  • Google’s Machine Learning APIs
  • Common ML Use Cases
  • Vision API
  • Natural Language API
  • Translate
  • Speech API

12. Need for Real-Time Streaming Analytics

  • What is Streaming Analytics?
  • Use-cases
  • Batch vs. Streaming (Real-time)
  • Related terminologies
  • GCP products for building highly available, resilient, high-throughput, real-time streaming analytics (review of Pub/Sub and Dataflow)

13. Architecture of Streaming Pipelines

  • Streaming architectures and considerations
  • Choosing the right components
  • Windowing
  • Streaming aggregation
  • Events, triggers

14. Stream Data and Events into Pub/Sub

  • Topics and Subscriptions
  • Publishing events into Pub/Sub
  • Subscribing options: Push vs Pull
  • Alerts

15. Build a Stream Processing Pipeline

  • Pipelines, PCollections and Transforms
  • Windows, Events, and Triggers
  • Aggregation statistics
  • Streaming analytics with BigQuery
  • Low-volume alerts

16. High Throughput and Low-Latency with Bigtable

  • Latency considerations
  • What is Bigtable
  • Designing row keys
  • Performance considerations

17. Google Data Studio

  • What is Google Data Studio?
  • From data to decisions

Labs

  • Lab 1: Queries and Functions
  • Lab 2: Load and Export data
  • Lab 3: Data pipeline
  • Lab 4: MapReduce in Dataflow
  • Lab 5: Side inputs
  • Lab 6: Streaming
  • Lab 7: Explore and create ML datasets
  • Lab 8: Using tf.learn
  • Lab 9: Using low-level TensorFlow + early stopping
  • Lab 10: Charts and Graphs of TensorFlow Training
  • Lab 11: Run an ML Model Locally and on Cloud
  • Lab 12: Feature Engineering
  • Labs 13-14: Custom Image Classification with Transfer Learning
  • Lab 15: Creating Hadoop Clusters with Google Cloud Dataproc
  • Lab 16: Running Hadoop and Spark Jobs with Dataproc
  • Lab 17: Big Data Analysis with Dataproc
  • Lab 18: Adding Machine Learning Capabilities to Big Data Analysis
  • Lab 19: Set Up Project, Enable APIs, Set Up Storage
  • Lab 20: Explore the dataset
  • Lab 21: Create Architecture Reference
  • Lab 22: Streaming Data Ingest into Pub/Sub Low-Volume Alerts
  • Lab 23: Alerting Scenario for Anomalies
  • Lab 24: Create Streaming Data Processing Pipelines with Dataflow
  • Lab 25: High-Volume Event Processing
  • Lab 26: Build a Real-Time Dashboard to Visualize Processed Data