The Ultimate Hands-On Hadoop – Tame your Big Data!

English | MP4 | AVC 1920×1080 | AAC 44KHz 2ch | 14.5 Hours | 3.53 GB

Hadoop, MapReduce, HDFS, Spark, Pig, Hive, HBase, MongoDB, Cassandra, Flume – the list goes on! Over 25 technologies.

The world of Hadoop and “Big Data” can be intimidating – hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this course, you’ll not only understand what those systems are and how they fit together – but you’ll go hands-on and learn how to use them to solve real business problems!

Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We’ll go way beyond Hadoop itself, and dive into all sorts of distributed systems you may need to integrate with.

  • Install and work with a real Hadoop installation right on your desktop with Hortonworks and the Ambari UI
  • Manage big data on a cluster with HDFS and MapReduce
  • Write programs to analyze data on Hadoop with Pig and Spark
  • Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
  • Design real-world systems using the Hadoop ecosystem
  • Learn how your cluster is managed with YARN, Mesos, ZooKeeper, Oozie, Zeppelin, and Hue
  • Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm

Understanding Hadoop is a highly valuable skill for anyone working at companies with large amounts of data.

Almost every large company you might want to work at uses Hadoop in some way, including Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo! And it’s not just technology companies that need Hadoop; even the New York Times uses Hadoop for processing images.

This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It’s filled with hands-on activities and exercises, so you get some real experience in using Hadoop – it’s not just theory.

You’ll find a range of activities in this course for people at every level. If you’re a project manager who just wants to learn the buzzwords, there are web UIs for many of the activities in the course that require no programming knowledge. If you’re comfortable with command lines, we’ll show you how to work with them too. And if you’re a programmer, I’ll challenge you with writing real scripts on a Hadoop system using Scala, Pig Latin, and Python.

You’ll walk away from this course with a real, deep understanding of Hadoop and its associated distributed systems, and you can apply Hadoop to real-world problems. Plus a valuable completion certificate is waiting for you at the end!

Please note that the focus of this course is application development, not Hadoop administration, although you will pick up some administration skills along the way.

What Will I Learn?

  • Design distributed systems that manage “big data” using Hadoop and related technologies
  • Use HDFS and MapReduce for storing and analyzing data at scale
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways
  • Analyze relational data using Hive and MySQL
  • Analyze non-relational data using HBase, Cassandra, and MongoDB
  • Query data interactively with Drill, Phoenix, and Presto
  • Choose an appropriate data storage technology for your application
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, ZooKeeper, Zeppelin, Hue, and Oozie
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
  • Consume streaming data using Spark Streaming, Flink, and Storm

Table of Contents

Learn all the buzzwords! And install Hadoop.
1 [Activity] Introduction, and install Hadoop on your desktop!
2 Hadoop Overview and History
3 Overview of the Hadoop Ecosystem
4 Tips for Using This Course

Using Hadoop’s Core: HDFS and MapReduce
5 HDFS: What it is, and how it works
6 [Activity] Install the MovieLens dataset into HDFS using the Ambari UI
7 [Activity] Install the MovieLens dataset into HDFS using the command line
8 MapReduce: What it is, and how it works
9 How MapReduce distributes processing
10 MapReduce example: Break down movie ratings by rating score
11 [Activity] Installing Python, MRJob, and nano
12 [Activity] Code up the ratings histogram MapReduce job and run it
13 [Exercise] Rank movies by their popularity
14 [Activity] Check your results against mine!
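
To give a flavor of the ratings histogram from lectures 10 and 12, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps. It is not the course's code: the course uses MRJob on a real cluster, while this version runs locally with only the standard library. The tab-separated layout (user ID, movie ID, rating, timestamp) is an assumption about the MovieLens ratings file, and the names mapper, reducer, run_job, and sample are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit (rating, 1) for one ratings line.

    Assumes tab-separated fields: user ID, movie ID, rating, timestamp.
    """
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def reducer(rating, counts):
    """Reduce step: add up all the 1s emitted for a single rating value."""
    yield rating, sum(counts)


def run_job(lines):
    """Local stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)  # the "shuffle": collect values per key

    result = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            result[out_key] = out_value
    return result


# Illustrative sample rows, not real course output.
sample = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]

histogram = run_job(sample)
print(histogram)  # {'1': 2, '2': 1, '3': 2}
```

On a real cluster the grouping step is done by Hadoop between the mappers and reducers; only the two small functions would need to be written.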

Programming Hadoop with Pig
15 Introducing Ambari
16 Introducing Pig
17 Example: Find the oldest movie with a 5-star rating using Pig
18 [Activity] Find old 5-star movies with Pig
19 More Pig Latin
20 [Exercise] Find the most-rated one-star movie
21 Pig Challenge: Compare Your Results to Mine!

Programming Hadoop with Spark
22 Why Spark?
23 The Resilient Distributed Dataset (RDD)
24 [Activity] Find the movie with the lowest average rating – with RDDs
25 Datasets and Spark 2.0
26 [Activity] Find the movie with the lowest average rating – with DataFrames
27 [Activity] Movie recommendations with MLlib
28 [Exercise] Filter the lowest-rated movies by number of ratings
29 [Activity] Check your results against mine!

Using relational data stores with Hadoop
30 What is Hive?
31 [Activity] Use Hive to find the most popular movie
32 How Hive works
33 [Exercise] Use Hive to find the movie with the highest average rating
34 Compare your solution to mine.
35 Integrating MySQL with Hadoop
36 [Activity] Install MySQL and import our movie data
37 [Activity] Use Sqoop to import data from MySQL to HDFS/Hive
38 [Activity] Use Sqoop to export data from Hadoop to MySQL

Using non-relational data stores with Hadoop
39 Why NoSQL?
40 What is HBase
41 [Activity] Import movie ratings into HBase
42 [Activity] Use HBase with Pig to import data at scale.
43 Cassandra overview
44 [Activity] Installing Cassandra
45 [Activity] Write Spark output into Cassandra
46 MongoDB overview
47 [Activity] Install MongoDB, and integrate Spark with MongoDB
48 [Activity] Using the MongoDB shell
49 Choosing a database technology
50 [Exercise] Choose a database for a given problem

Querying your Data Interactively
51 Overview of Drill
52 [Activity] Setting up Drill
53 [Activity] Querying across multiple databases with Drill
54 Overview of Phoenix
55 [Activity] Install Phoenix and query HBase with it
56 [Activity] Integrate Phoenix with Pig
57 Overview of Presto
58 [Activity] Install Presto, and query Hive with it.
59 [Activity] Query both Cassandra and Hive using Presto.

Managing your Cluster
60 YARN explained
61 Tez explained
62 [Activity] Use Hive on Tez and measure the performance benefit
63 Mesos explained
64 ZooKeeper explained
65 [Activity] Simulating a failing master with ZooKeeper
66 Oozie explained
67 [Activity] Set up a simple Oozie workflow
68 Zeppelin overview
69 [Activity] Use Zeppelin to analyze movie ratings, part 1
70 [Activity] Use Zeppelin to analyze movie ratings, part 2
71 Hue overview
72 Other technologies worth mentioning

Feeding Data to your Cluster
73 Kafka explained
74 [Activity] Setting up Kafka, and publishing some data.
75 [Activity] Publishing web logs with Kafka
76 Flume explained
77 [Activity] Set up Flume and publish logs with it.
78 [Activity] Set up Flume to monitor a directory and store its data in HDFS

Analyzing Streams of Data
79 Spark Streaming: Introduction
80 [Activity] Analyze web logs published with Flume using Spark Streaming
81 [Exercise] Monitor Flume-published logs for errors in real time
82 Exercise solution: Aggregating HTTP access codes with Spark Streaming
83 Apache Storm: Introduction
84 [Activity] Count words with Storm
85 Flink: An Overview
86 [Activity] Counting words with Flink

Designing Real-World Systems
87 The Best of the Rest
88 Review: How the pieces fit together
89 Understanding your requirements
90 Sample application: consume webserver logs and keep track of top-sellers
91 Sample application: serving movie recommendations to a website
92 [Exercise] Design a system to report web sessions per day
93 Exercise solution: Design a system to count daily sessions

Learning More
94 Books and online resources
95 Bonus lecture: Discounts on my other big data / data science courses!