Objective
This is a comprehensive guide to the various Cloudera Spark and Hadoop certifications. In this Cloudera certification tutorial we will cover all aspects of each exam: the certifications Cloudera offers, the exam pattern, the number of questions, the passing score, the time limit, the required skills, and the weightage of each topic. We will discuss all the certifications offered by Cloudera: “CCA Spark and Hadoop Developer Exam (CCA175)”, “Cloudera Certified Administrator for Apache Hadoop (CCAH)”, “CCP Data Scientist”, and “CCP Data Engineer”.
1. CCA Spark and Hadoop Developer Exam (CCA175)
For the CCA Spark and Hadoop Developer certification, you prove your skills by writing code in Scala and Python and running it on a cluster. The exam can be taken from any computer, anywhere in the world, at any time.
CCA175 is a hands-on, practical exam using Cloudera technologies. Candidates are given their own CDH5 (currently 5.3.2) cluster pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many of the other tools they may need.
a. CCA Spark and Hadoop Developer Certification Exam (CCA175) Details:
- Number of Questions: 10–12 performance-based (hands-on) tasks on a CDH5 cluster
- Time Limit: 120 minutes
- Passing Score: 70%
- Language: English, Japanese (forthcoming)
- CCA Spark and Hadoop Developer certification Cost: USD $295
b. CCA175 Exam Question Format
In each CCA question, you are required to solve a particular scenario. In some cases, a tool such as Impala or Hive may be used; in other cases, coding is required. For Spark problems, a template (in Scala or Python) is often provided that contains a skeleton of the solution, and the candidate fills in the missing lines with functional code.
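Purely as an illustration (not taken from any real exam), such a template might look like the hypothetical PySpark skeleton below; the application name, HDFS paths, record layout, and TODO markers are all assumptions.

```python
# Hypothetical skeleton of the kind a CCA175 Spark task might provide,
# shown here with the TODO lines already filled in by the candidate.
from pyspark import SparkContext

sc = SparkContext(appName="OrdersPerCustomer")

# Given: raw order records, one CSV line per order: order_id,customer_id,amount
orders = sc.textFile("/user/exam/orders")          # path is illustrative

# TODO (candidate): parse each line into (customer_id, amount)
parsed = orders.map(lambda line: line.split(",")) \
               .map(lambda f: (f[1], float(f[2])))

# TODO (candidate): total the amount per customer
totals = parsed.reduceByKey(lambda a, b: a + b)

# Given: write the result back to HDFS as text
totals.map(lambda kv: ",".join([kv[0], str(kv[1])])) \
      .saveAsTextFile("/user/exam/output/orders_per_customer")
```

On the real exam the given and missing portions vary from task to task; the point is that only the marked lines need to be written by the candidate.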
c. Prerequisites
There are no prerequisites for any Cloudera certification exam.
d. Exam sections and related topics
I. Required Skills
Data Ingest: these are the skills required to transfer data between external systems and your cluster. They include:
- Import data from a MySQL database into HDFS using Sqoop, changing the delimiter and file format of the data during import (a hedged Spark analogue is sketched after this list)
- Export data from HDFS to a MySQL database using Sqoop
- Ingest real-time and near-real-time (NRT) streaming data into HDFS using Flume
- Load data into and out of HDFS using the Hadoop File System (FS) commands
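Sqoop, Flume, and the `hadoop fs` commands are driven from the shell rather than from application code, so they are not shown here verbatim. Purely as a hedged analogue of the Sqoop import pattern, the PySpark sketch below reads a MySQL table over JDBC and writes it to HDFS as tab-delimited text, mirroring Sqoop's delimiter and format options; the JDBC URL, table name, credentials, and output path are placeholders, and on the exam you would normally use Sqoop itself.

```python
# Hypothetical PySpark analogue of a "sqoop import" with a custom delimiter.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-ingest-sketch").getOrCreate()

# Placeholder connection details -- replace with real values.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/retail_db")
             .option("dbtable", "customers")
             .option("user", "exam_user")
             .option("password", "exam_password")
             .load())

# Write to HDFS as tab-delimited text instead of the default comma-delimited CSV,
# similar in spirit to Sqoop's --fields-terminated-by option.
(customers.write
 .option("sep", "\t")
 .csv("/user/exam/customers_tsv"))
```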
II. Transform, Stage, Store:
This covers converting a set of data values in a given format stored in HDFS into new data values and/or a new data format and writing them back into HDFS. It includes writing Spark applications in Scala or Python for the tasks below (a PySpark sketch covering several of these tasks follows the list):
- Load data from HDFS and store results back to HDFS
- Join disparate datasets together
- Calculate aggregate statistics (e.g., average or sum)
- Filter data into a smaller dataset
- Write a query that produces ranked or sorted data
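As a hedged illustration only, the sketch below strings several of these tasks together in PySpark using the RDD API available on a CDH5 cluster; the HDFS paths, record layouts, tab delimiter, and the 1000-unit spend threshold are assumptions, not exam content.

```python
# Minimal PySpark sketch: load from HDFS, join, aggregate, filter, sort, store.
from pyspark import SparkContext

sc = SparkContext(appName="transform-stage-store-sketch")

# orders: order_id \t customer_id \t amount
orders = (sc.textFile("/user/exam/orders")
            .map(lambda line: line.split("\t"))
            .map(lambda f: (f[1], float(f[2]))))            # (customer_id, amount)

# customers: customer_id \t name
customers = (sc.textFile("/user/exam/customers")
               .map(lambda line: line.split("\t"))
               .map(lambda f: (f[0], f[1])))                # (customer_id, name)

# Aggregate: total spend per customer
totals = orders.reduceByKey(lambda a, b: a + b)

# Join the two datasets on customer_id
joined = customers.join(totals)                             # (customer_id, (name, total))

# Filter to a smaller dataset and sort by total spend, descending
top_spenders = (joined.filter(lambda kv: kv[1][1] >= 1000.0)
                      .sortBy(lambda kv: kv[1][1], ascending=False))

# Store the ranked result back to HDFS
(top_spenders.map(lambda kv: "\t".join([kv[0], kv[1][0], str(kv[1][1])]))
             .saveAsTextFile("/user/exam/top_spenders"))
```

An equivalent Scala version follows the same structure; the exam lets you work in either language.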
III. Data Analysis
Use Data Definition Language (DDL) to create tables in the Hive metastore for use by Hive and Impala (a hedged example follows the list):
- Read and/or create a table in the Hive metastore in a given schema
- Extract an Avro schema from a set of data files
- Create a table in the Hive metastore using the Avro file format and an external schema file
- Improve query performance by creating partitioned tables in the Hive metastore
- Evolve an Avro schema by changing JSON files
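On the exam these DDL statements would normally be run in Hive or Impala directly; purely as a hedged sketch, the same statements can be issued from PySpark with Hive support enabled. The table names, HDFS locations, and the .avsc schema URL below are placeholders.

```python
# Hypothetical Hive DDL issued through Spark SQL (requires Hive support).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-ddl-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Create a Hive metastore table backed by Avro files and an external schema file.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders_avro
    STORED AS AVRO
    LOCATION '/user/exam/warehouse/orders_avro'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///user/exam/schemas/order.avsc')
""")

# Create a partitioned table to improve performance of date-filtered queries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_by_day (
        order_id INT,
        customer_id INT,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")
```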
2. Cloudera Certified Administrator for Apache Hadoop (CCAH)
Cloudera Certified Administrator for Apache Hadoop (CCAH) certification shows your technical knowledge, skills, and ability to configure, deploy, monitor, manage, maintain, and secure an Apache Hadoop cluster.
a. Cloudera Certified Administrator for Apache Hadoop (CCA-500) details
- Number of Questions: 60
- Time Limit: 90 minutes
- Passing Score: 70%
- Language: English, Japanese
- Cloudera Certified Administrator for Apache Hadoop (CCAH) certification Price: USD $295
b. Exam sections and related topics
I. HDFS (17%)
- HDFS features and design principles, and the function of the HDFS daemons
- Describe the operation of an Apache Hadoop cluster, both in data storage and in data processing
- Features of current computing systems that motivated a system like Apache Hadoop, and commands to handle files in HDFS
- Given a scenario, identify appropriate use cases for HDFS Federation
- Identify components and daemons of an HDFS HA-Quorum cluster
- HDFS security (Kerberos) and file read-write paths
- Determine the best data serialization choice for a given scenario
- Internals of HDFS read operations and HDFS write operations
II. YARN (17%)
- Understand how to deploy core ecosystem components, including Spark, Impala, and Hive
- Understand YARN and MapReduce v2 (MRv2 / YARN) deployments
- Understand the basic design strategy for YARN and how it handles resource allocation
- Understand the ResourceManager and NodeManager
- Identify the workflow of a job running on YARN
- Determine which files you must change and how in order to migrate a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) running on YARN
III. Hadoop Cluster Planning (16%)
- Principal points to consider while choosing the hardware and operating systems to host an Apache Hadoop cluster
- Understand kernel tuning and disk swapping
- Identify a hardware configuration and the ecosystem components your cluster needs for a given scenario
- Cluster sizing: identify the specifics for the workload, including CPU, memory, storage, disk I/O for a given case
- Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
- Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario
IV. Hadoop Cluster Installation and Administration (25%)
- Understand how to install and configure a Hadoop cluster
- Identify how the cluster will handle disk and machine failures for a given case
- Analyze a logging configuration and logging configuration file format
- Understand the basics of Hadoop metrics and cluster health monitoring
- Install ecosystem components in CDH 5, such as Impala, Flume, Oozie, Hue, Cloudera Manager, Sqoop, Hive, and Pig
- Identify the function and purpose of available tools for managing the Apache Hadoop file system
V. Resource Management (10%)
- Understand the overall design goals of each of the Hadoop schedulers and the resource manager
- Given a scenario, determine how the Fair/FIFO/Capacity Scheduler allocates cluster resources under YARN
VI. Monitoring and Logging (15%)
- Understand the functions and features of Hadoop’s metric collection abilities
- Analyze the NameNode and YARN Web UIs
- Understand how to monitor cluster daemons
- Identify and monitor CPU usage on master nodes
- Describe how to monitor swap and memory allocation on all nodes
- Interpret a log file and identify how to manage Hadoop’s log files
3. CCP Data Scientist
A “Cloudera Certified Professional Data Scientist” is able to perform descriptive and inferential statistics, apply advanced analytical techniques, and build machine learning models using standard tools. Candidates must prove their abilities on a live cluster with large datasets in a variety of formats. Earning the certification requires passing three CCP Data Scientist exams (DS700, DS701, and DS702), in any order, and all three must be passed within 365 days of each other.
a. Common Skills (all exams)
- Extract relevant features from a large dataset containing bad records, partial records, errors, or other forms of “noise”
- Extract features from data in multiple formats, such as JSON, XML, raw text logs, industry-specific encodings, and graph link data
b. Descriptive and Inferential Statistics on Big Data (DS700)
- Determine confidence for a hypothesis using statistical tests
- Calculate common summary statistics, such as mean, variance, and counts (a hedged PySpark sketch of these skills follows this list)
- Fit a distribution to a dataset and use it to predict event likelihoods
- Perform complex statistical calculations on a large dataset
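A minimal, hedged sketch of these skills, assuming a hypothetical dataset with `group` and `value` columns: summary statistics are computed in Spark, and a two-sample t-test is run with SciPy on data collected to the driver (reasonable only when each group fits in memory).

```python
# Hypothetical sketch: summary statistics in Spark plus a two-sample t-test in SciPy.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from scipy import stats

spark = SparkSession.builder.appName("stats-sketch").getOrCreate()

df = spark.read.parquet("/user/exam/measurements")   # columns: group, value

# Common summary statistics: count, mean, variance per group.
summary = df.groupBy("group").agg(
    F.count("value").alias("n"),
    F.mean("value").alias("mean"),
    F.variance("value").alias("variance"))
summary.show()

# Two-sample t-test to judge confidence that group A and group B differ.
# Collecting to the driver is only reasonable if each group is small enough.
a = [r["value"] for r in df.filter(F.col("group") == "A").select("value").collect()]
b = [r["value"] for r in df.filter(F.col("group") == "B").select("value").collect()]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print("t = %.3f, p = %.4f" % (t_stat, p_value))
```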
c. Advanced Analytical Techniques on Big Data (DS701)
- Build a model that contains relevant features from a large dataset
- Define relevant data groupings and assign data records from a large dataset into a defined set of data groupings
- Evaluate goodness of fit for a given set of data groupings and a dataset (a hedged clustering sketch follows this list)
- Apply advanced analytical techniques, such as network graph analysis or outlier detection
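As a hedged sketch of defining groupings and evaluating their fit, the example below uses the DataFrame-based Spark ML API (k-means plus a silhouette score); the input path, feature columns, and the choice of k = 5 are assumptions.

```python
# Hypothetical clustering sketch using the DataFrame-based Spark ML API.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("clustering-sketch").getOrCreate()

raw = spark.read.parquet("/user/exam/sessions")      # numeric behavioural features

# Assemble the relevant features into a single vector column.
assembler = VectorAssembler(
    inputCols=["page_views", "session_minutes", "purchases"],
    outputCol="features")
data = assembler.transform(raw)

# Define data groupings and assign every record to one of them.
kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(data)
assigned = model.transform(data)

# Evaluate goodness of fit of the groupings (silhouette score, higher is better).
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="cluster")
print("silhouette = %.3f" % evaluator.evaluate(assigned))
```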
d. Machine Learning at Scale (DS702)
- Build a model with relevant features from a large dataset and select a classification algorithm for it
- Predict labels for an unlabeled dataset using a labeled dataset for reference
- Tune algorithm metaparameters to maximize algorithm performance
- Determine the success of a given algorithm for the given dataset using validation techniques (a hedged classification sketch follows this list)
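A hedged sketch of these skills using the DataFrame-based Spark ML API: build a feature pipeline, pick logistic regression as the classifier, tune its metaparameters with cross-validation, validate on held-out data, and predict labels for an unlabeled set. The paths, column names, and parameter grid are assumptions.

```python
# Hypothetical classification sketch with the DataFrame-based Spark ML API.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("classification-sketch").getOrCreate()

labeled = spark.read.parquet("/user/exam/labeled")       # features + 0/1 "label"
unlabeled = spark.read.parquet("/user/exam/unlabeled")   # same features, no label

assembler = VectorAssembler(inputCols=["age", "income", "visits"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# Tune metaparameters with cross-validation to maximize AUC.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
evaluator = BinaryClassificationEvaluator(labelCol="label")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

# Hold out data to validate the success of the chosen model.
train, test = labeled.randomSplit([0.8, 0.2], seed=42)
model = cv.fit(train)
print("test AUC = %.3f" % evaluator.evaluate(model.transform(test)))

# Predict labels for the unlabeled dataset using the trained model.
model.transform(unlabeled).select("prediction").show(5)
```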
e. What technologies/languages do you need to know?
You’ll be provided with a cluster pre-loaded with Hadoop technologies, plus standard tools such as Python and R. Which of these standard technologies you use to solve each problem is up to you.
4. CCP Data Engineer
A “Cloudera Certified Data Engineer” has the core competencies required to ingest, transform, store, and analyze data in Cloudera’s CDH environment.
a. What do you need to know?
I. Data Ingestion
These are the skills required to transfer data between external systems and your cluster. They include:
- Import and export data between an external RDBMS and your cluster, including specific subsets, changing the delimiter and file format of imported data during ingest, and altering the data access pattern or privileges.
- Ingest real-time and near-real time (NRT) streaming data into HDFS, including distribution to multiple data sources and converting data on ingest from one format to another.
- Load data into and out of HDFS using the Hadoop File System HDFS commands.
II. Transform, Stage, Store
This means converting a set of data values in a given format stored in HDFS into new data values and/or a new data format and writing them into HDFS or Hive/HCatalog. It includes the following (a hedged PySpark sketch follows the list):
- Convert data from one file format to another and write it with compression
- Convert data from one set of values to another (e.g., Lat/Long to Postal Address using an external library)
- Purge bad records from a data set, e.g., null values
- De-duplicate and merge data
- De-normalize data from multiple disparate data sets
- Evolve an Avro or Parquet schema
- Partition an existing data set according to one or more partition keys
- Tune data for optimal query performance
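A hedged PySpark sketch of several of these tasks together: read tab-delimited text, purge records with null keys, de-duplicate, and write the result as Snappy-compressed Parquet partitioned by date. The paths, column names, and partition key are assumptions.

```python
# Hypothetical PySpark sketch: format conversion with compression, purging
# bad records, de-duplication, and partitioning.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-store-sketch").getOrCreate()

# Read delimited text with a header as the input format.
events = (spark.read
          .option("header", "true")
          .option("sep", "\t")
          .csv("/user/exam/raw_events"))

cleaned = (events
           .dropna(subset=["event_id", "event_date"])  # purge records with null keys
           .dropDuplicates(["event_id"]))              # de-duplicate on the key

# Convert to a columnar format, compress, and partition by date for query performance.
(cleaned.write
 .mode("overwrite")
 .option("compression", "snappy")
 .partitionBy("event_date")
 .parquet("/user/exam/events_parquet"))
```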
III. Data Analysis
This covers filtering, sorting, joining, aggregating, and/or transforming one or more data sets in a given format stored in HDFS to produce a specified result. The queries will include complex data types (e.g., array, map, struct), the implementation of external libraries, partitioned data, and compressed data, and will require the use of metadata from Hive/HCatalog. Tasks include the following (a hedged Spark SQL sketch follows the list):
- Write a query to aggregate multiple rows of data and to filter data
- Write a query that produces ranked or sorted data
- Write a query that joins multiple data sets
- Read and/or create a Hive or an HCatalog table from existing data in HDFS
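As a hedged sketch, the Spark SQL query below joins two hypothetical Hive tables, aggregates and filters the rows, and ranks the result with a window function; the table and column names are assumptions, and on the exam the same query could equally be run in Hive or Impala.

```python
# Hypothetical Spark SQL sketch: join, aggregate, filter, and rank data
# registered in the Hive metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("analysis-sketch")
         .enableHiveSupport()
         .getOrCreate())

result = spark.sql("""
    SELECT c.name,
           SUM(o.amount)                             AS total_spend,
           RANK() OVER (ORDER BY SUM(o.amount) DESC) AS spend_rank
    FROM   orders o
    JOIN   customers c ON o.customer_id = c.customer_id
    WHERE  o.order_status = 'COMPLETE'
    GROUP  BY c.name
""")
result.show(10)
```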
IV. Workflow
This covers the ability to create and execute various jobs and actions that move data towards greater value and use in a system. It includes:
- Create and execute a linear workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.
- Create and execute a branching workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.
- Orchestrate a workflow to execute regularly at predefined times, including workflows that have data dependencies
b. What should you expect?
You are given five to eight customer problems, each with a unique, large data set, a CDH cluster, and four hours. For each problem, you must implement a technical solution that meets all the requirements using any tool or combination of tools on the cluster (see list below); you get to pick the tool(s) that are right for the job.