Distributed Computing and Big Data

Welcome to the Distributed Computing and Big Data course! The massive increase in the availability of data has made its storage, management, and analysis extremely challenging. Various tools, technologies, and frameworks have emerged to address this challenge. Apache Hadoop is one such framework; it lets us handle big data by making distributed computing easier. Hadoop abstracts away concerns such as reliability, distributed file management, and distributed processing. In this course, we will start with the characteristics of big data and the fundamental concepts of cloud computing. We will then explore the Hadoop ecosystem, specifically HDFS, Map-Reduce, Pig, and NoSQL databases. Our objective is to understand how big data can be handled effectively. We will also briefly discuss web application development, with a special focus on RESTful services. This is an introductory course focused on the breadth of the big data landscape.

Key Learning Objectives

At the end of this course, you should be able to:

Understand the fundamentals of distributed storage using Hadoop HDFS as an example.
Understand the fundamentals of distributed processing using the Map-Reduce framework and Pig scripts.
Understand NoSQL database concepts using MongoDB and/or HBase.
Understand the basics of web services, with a focus on RESTful services.

Lecture Schedule

Lecture #	Topic	Readings	Slides/Material
1	Introduction to Big Data	The Complete Beginner's Guide To Big Data Everyone Can Understand; Basics About Cloud Computing	Lecture 1; Lecture 2; Lecture 3
2	Hands-On Tutorial: A Tour of Big Data Stack with Cloudera VM	CDH Overview	Tutorial 1
3	Distributed File Systems	Files and Directories File System The Hadoop Distributed File System	Lecture 4 Lecture 4.1 (DC Model) DFS (Lecture 4 Updated)
4	Hands-On Tutorial: HDFS	Exploring the File System; HDFS Tutorial (old)	Tutorial 2
5	Distributed Processing with Map-Reduce and Pig	Overview of Map-Reduce; Pig-Latin	Lecture 5
6	Hands-On Tutorial: Map-Reduce		Tutorial 3
7	Introduction to OOAD and UML (Not in syllabus)		Lecture 6; Lecture 7; Tut4-JavaCode
8	Big Data Design Patterns	Chapter 1 from Thinking in Patterns; Map-Reduce Design Patterns	Lecture 8
9	Apache Pig		Lecture 9
10	Hands-On Tutorial: Map-Reduce and Pig	MapReduce Tutorial; Pig Tutorial	Tutorial 5: MR [zip], [video], Pig [zip]
11	NoSQL DB	Chapter 4 from BDA Book; Columnar Storage; NoSQL Explained	Lecture 10
12	Hands-On Tutorial: MongoDB, HBase	MongoDB CRUD Operations	Tut6-MongoDB Exercise Tut6-HBase
13	Graph DB with Neo4j		Lecture 11 Neo4j Commands
14	Web Application Development and Service Oriented Architecture	Web Application Development; Sections 1, 2 and 3 of Web Services	Lecture 12; Notes; Video
15	Hands-On Tutorial: Apache Tomcat, JSON, RESTful Services	Building Web Applications with Tomcat; RESTful Services	Hive and Solr (Not in syllabus); Tut7-WS; Tut7-Code; Video
16	System Design Big Data - Products and Practices		Lecture 13 Lecture 14

Evaluation

Instrument	Weight
Mid Exam	25%
Final Exam	35%
Assignments (4 × 10%)	40%

Pre-requisites

None.

Resources

Text
There is no prescribed text for this course.

References

Optional Readings

Anatomy of the Linux file system
Chapters 11 and 12 (on File Systems) of Operating System Concepts, 9th Edition
NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Sadalage and Fowler, 2012: This is a short, 150-page text explaining the key concepts of NoSQL. If you do not have time to read this book, watch this talk by Fowler.
Gang of Four Design Patterns Book
A Conversation with Turing Award Winner Leslie Lamport
The idea of Time
- How is time synchronized?
- Why Britain is the Center of the World
Internet
- How Does the Internet Work?
- The fight over the internet, under the sea