Big Data and Hadoop
Instructor: Venkatesh Vinayakarao
Term: Jan - Apr 2020
TA: Suchitra



Welcome to the Big Data and Hadoop course! As data becomes ever more available, storing, managing, and analyzing it has become extremely challenging. Various tools, technologies, and frameworks have emerged to address this challenge. Apache Hadoop is one such framework; it lets us handle big data by making distributed computing easier. Hadoop abstracts away concerns such as reliability, distributed file management, and distributed processing. In this course, we shall start by understanding the characteristics of big data and the fundamental concepts of cloud computing. We will then explore the Hadoop ecosystem, specifically HDFS, Map-Reduce, Pig, and NoSQL DBs. Our objective is to handle big data effectively and to build web applications and RESTful services over the cloud. This is an introductory course focused on the breadth of the big data landscape.

Key Learning Objectives

At the end of this course, you should be able to:
  • Understand how distributed file systems work. Be able to use Hadoop HDFS.
  • Understand distributed processing fundamentals. Write code using the Map-Reduce framework and Pig scripts.
  • Understand NoSQL DB concepts. Use any one NoSQL DB such as HBase/Hive/MongoDB.

Lecture Schedule

Lecture # | Topic | Readings | Slides/Material
Part 1: Introduction
1 | Introduction to Big Data
2 | Hands-On Tutorial: A Tour of Big Data Stack with Cloudera VM
3 | Distributed File Systems
4 | Hands-On Tutorial: HDFS
5 | Distributed Processing with Map-Reduce and Pig
6 | Hands-On Tutorial: Map-Reduce and Pig
7 | Introduction to Design Patterns
8 | Big Data Design Patterns
9 | NoSQL DB
10 | Hands-On Tutorial: HBase/Hive/MongoDB
11 | Web Application Development and Service Oriented Architecture
12 | Hands-On Tutorial: Apache Tomcat, JSON, RESTful Services
13 | Putting It All Together: Building Webscale Applications
14 | Hands-On Tutorial: Solr
Part 2: Products and Practices - Student Presentations
15 | Student Presentation: Azure and AWS (or Apache Spark and Google Cloud)
16 | Student Presentation: Big Data Visualization using Tableau
17 | Student Presentation: Crawling with Apache Nutch
18 | Student Presentation: Data Collection with Apache Flume and Sqoop
19 | Student Presentation: Logging with log4j and Log Aggregation with Apache Flume
20 | Student Presentation: Graph Databases and Neo4j
21 | Student Presentation: Stream Processing with Apache Storm
22 | Student Presentation: Co-ordination with ZooKeeper
23 | Student Presentation: Workflow Management with Oozie
24 | Student Presentation: Publish-Subscribe Messaging Frameworks with Apache Kafka and Kinesis
25 | Student Presentation: Knowledge Discovery (or Apache Mahout)
Part 3: Advanced
26 | Research Trends in Big Data
27-31 | Advanced concepts (if time permits)


Evaluation
Instrument | Max Marks
Mid Exam | 20%
Final Exam | 30%
Hands-on Tutorials (2% * (best of) 5 Quizzes) | 10%
Student Presentations (15% for presentation, 5% (0.5% * 10) for quiz on the presentation topic) | 20%
Project | 20%

Project
This is a group project. Allowed group sizes are 3 and 4. Your project will be evaluated based on the following:
  • The project description report highlighting the features of the project (1%)
  • Demonstration of Map-Reduce or Pig (5%), NoSQL DB (5%), Web Application/Service (5%)
  • Complete project report, which could be an extension of the project description report (4%). A well-formatted PDF file describing the motivation and use cases of your project, with sufficient screenshots, is expected.

Pre-requisites
None.

Resources

Text
There is no prescribed text for this course. Readings will be shared during the lectures.

References
  • Big Data Analytics, A Hands-On Approach. Arshdeep Bahga and Vijay Madisetti.
  • Hadoop: The Definitive Guide. Tom White.


If you are not having fun, you are not the best student you can be!