Big Data and Hadoop
Instructor: Venkatesh Vinayakarao
Term: Jan - Apr 2020
TA: Suchitra

Welcome to the Big Data and Hadoop course! The massive increase in the availability of data has made its storage, management, and analysis extremely challenging. Various tools, technologies, and frameworks have emerged to address this challenge. Apache Hadoop is one such framework: it lets us handle big data by making distributed computing easier, abstracting away concerns such as reliability, distributed file management, and distributed processing. In this course, we start with the characteristics of big data and the fundamental concepts of cloud computing, and then explore the Hadoop ecosystem, specifically HDFS, Map-Reduce, Pig, and NoSQL databases. Our objective is to handle big data effectively and to build web applications and RESTful services over the cloud. This is an introductory course focused on the breadth of the big data landscape.

Key Learning Objectives

At the end of this course, you should be able to:
  • Understand how distributed file systems work. Be able to use Hadoop HDFS.
  • Understand distributed processing fundamentals. Code using the Map-Reduce framework and Pig scripts.
  • Understand NoSQL DB concepts. Use any one NoSQL DB such as HBase/Hive/MongoDB.

Lecture Schedule

Lecture # | Topic | Readings | Slides/Material
Part 1: Introduction
1 | Introduction to Big Data | The Complete Beginner's Guide To Big Data Everyone Can Understand; Basics About Cloud Computing | Lecture 1; Lecture 2
2 | Hands-On Tutorial: A Tour of Big Data Stack with Cloudera VM | CDH Overview | Tutorial 1
3 | Distributed File Systems | Files and Directories; File System; The Hadoop Distributed File System | Lecture 3; Lecture 4
4 | Hands-On Tutorial: HDFS | Exploring the File System | Tutorial 2
5 | Distributed Processing with Map-Reduce and Pig | Overview of Map-Reduce | Lecture 5
6 | Hands-On Tutorial: Map-Reduce | | Tutorial 3
7 | Introduction to OOAD and UML | | Lecture 6; Lecture 7
8 | Big Data Design Patterns | Chapter 1 from Thinking in Patterns; Map-Reduce Design Patterns | Lecture 8
9 | Apache Pig | | Lecture 9
10 | Hands-On Tutorial: Map-Reduce and Pig | MapReduce Tutorial; Pig Tutorial | Tutorial 5
11 | NoSQL DB | Chapter 4 from BDA Book; Columnar Storage; NoSQL Explained | Lecture 10
12 | Hands-On Tutorial: HBase/Hive/MongoDB | MongoDB CRUD Operations | Tut6-MongoDB
13 | Web Application Development and Service Oriented Architecture | Web Application Development; Sections 1, 2 and 3 of Web Services | Lecture 11
14 | Hands-On Tutorial: Apache Tomcat, JSON, RESTful Services | Building Web Applications with Tomcat; RESTful Services |
Part 2: Products and Practices - Student Presentations
1 | Apache Spark and Google Cloud | | Spark and Google Cloud
2 | Azure and AWS | | Azure and AWS
3 | Big Data Visualization using Tableau | | Tableau
4 | Crawling with Apache Nutch | | slides
5 | Logging with log4j and Log Aggregation with Apache Flume | | slides
6 | Data Collection with Apache Flume and Sqoop | | slides
7 | Graph Databases and Neo4j | Due to a skype bug, this presentation could not be recorded. | slides
8 | Stream Processing with Apache Storm | | slides
9 | Co-ordination with ZooKeeper | | video
10 | Publish-Subscribe Messaging Frameworks with Apache Kafka and Kinesis | | video
11 | Apache Mahout | | video
Recap | Session 1 | Recap of part-1 of this course | slides
Recap | Session 2 | Recap of part-2 of this course | slides

The exam format and a practice paper are available on Moodle.

Instrument | Max Marks
Mid Exam | 20%
Final Exam | 35%
Hands-on Tutorials (2% * (best of) 5 Quizzes) | 10%
Student Presentations | 15%
Project (optional - can be replaced with a section in exam) | 20%

The project is a group project. Allowed group sizes are 3 and 4. Submit a one-page report before the project demonstration.



There is no prescribed text for this course. Readings will be shared during the lectures.


If you are not having fun, you are not the best student you can be!