  • Overview

    This three-day hands-on training course presents the concepts and architectures of Spark and the underlying data platform, providing students with the conceptual understanding necessary to diagnose and solve performance issues.

    With this understanding of Spark internals and the underlying data platform, the course teaches students how to tune Spark application code and configuration. The course illustrates performance design best practices and pitfalls. Students are prepared to apply these patterns and anti-patterns to their own designs and code.

    The course format emphasizes instructor-led demos of performance issues and techniques to address them, followed by hands-on exercises. Students explore these performance issues and techniques in an interactive notebook environment. Students leave the course with a practical, illustrative body of code.

    Prerequisites

    This course is designed for software developers, engineers, and data scientists who develop Spark applications and need the information and techniques for tuning their code. This is not a beginning course in Spark; students should be comfortable completing the tasks covered in Cloudera Developer Training for Apache Spark and Hadoop. Spark examples and hands-on exercises are presented in Python and Scala. The ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful.

    Course Topics

    Spark Architecture

    • Coverage of all concepts found in the Spark Application UI
    • RDD execution
    • DataFrame execution
    • Catalyst optimizer
    • Partitioning
    • Shuffling

    Optimizing Data

    • Recognizing and dealing with skewed data
    • Handling small files
    • Join optimizations
      Broadcast Joins
      Common Joins
      Skewed Joins
      Bucketed Joins
    • Unbalanced partitions
    • Partitioned and bucketed tables
    • Object serialization
    • Compression
    • File formats
    • Storage options
    • Schema inference

    Optimizing Processing

    • Static vs. dynamic scheduling
    • Dynamic resource pools in YARN
    • Partition processing
    • Broadcast variables
    • Driver and executor memory and CPU core configuration
    • Python overhead
    • UDFs

    Developing High Performance Algorithms

    • Caching data
    • Checkpoints
    • Recovery

    CCA Spark and Hadoop Developer Certification

    This course is excellent preparation for the CCA Spark and Hadoop Developer exam. Although we recommend further training and hands-on experience before attempting the exam, this course covers many of the subjects tested. 

    Certification is a great differentiator. It helps establish you as a leader in the field, providing employers and customers with tangible evidence of your skills and expertise.

    Advance your career

    Big data developers hold some of the world's most in-demand and highly compensated technical roles. Check out some of the currently listed job opportunities that match this professional profile, many of which seek CCA qualifications.

    Private training

    We also provide private training at your site, at your pace, and tailored to your needs.
