COL 868 / AIL 841 - Special Topics in Data Science / Database Systems (2023-24 Sem. 2)

Organization

Credit Structure (L-T-P): 3-0-0 (3 credits)

Course slot: AB (Mon., Thu. 3:30-5:00)

Lecture location: LH 519

Grading scheme (tentative)

Activity Weight
Mid-term survey 25%
Major 20%
Project 35%
Paper presentations 20%


NEWS / UPDATES

[10 Jan 2024] Exciting news! Prof. Sainyam Galhotra from Cornell Tech will deliver guest lectures on 15th and 18th January 2024.

[31 Dec 2023] The first lecture will be held on 1st January 2024 at 3:30pm in Room LH 519. Please attend even if you have not been able to register.



About the Course

This course covers the concepts of “Data Integration” and “Data Wrangling” and their pivotal role in Data Science. Data wrangling has long been a challenging task for researchers and practitioners, and it is essential to numerous multi-million dollar enterprises. It is a foundational element of Data Science pipelines, enabling the transition from raw data to insightful analysis.

Recent years have witnessed the growing influence of AI/ML techniques in tackling the complexities of data wrangling and data science. This course explores the evolution of data wrangling methodologies, from traditional machine learning models for data cleaning, integration, and discovery of correlated datasets to Large Language Models (LLMs) for large-scale data wrangling.

Throughout the course, students will work through the progression of data wrangling techniques, starting with traditional approaches to data integration and cleaning and culminating in the latest ML and LLM-backed techniques.

Prerequisites

  • Good knowledge of DBMS and Machine Learning / Data Mining / NLP / Information Retrieval
  • Knowledge of LLMs and deep learning is preferred
  • Ability to read code in Java and Python
  • Ability to write code (sometimes quite a lot)

Introductory Reference Books

  1. Principles of Data Integration by AnHai Doan, Alon Halevy and Zachary Ives.
  2. Data Cleaning by Ihab F. Ilyas and Xu Chu.
  3. Principles of Data Wrangling by Rattenbury, Hellerstein, Heer, Kandel and Carreras.

Use of Piazza

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.

Find our class signup link at: https://piazza.com/iitd.ac.in/winter2024/col868ail841


Course Slides

Slides for the course will be shared here at somewhat regular intervals.

Calendar

Date Topic
1-1-24 Course Intro
8-1-24 Quick Intro to DBMS
11-1-24 Intro to Data Integration
15-1-24 Guest Lec. - I by Prof. Galhotra
18-1-24 Guest Lec. - II by Prof. Galhotra
Srikanta Bedathur
DS Chair of Artificial Intelligence