CSCI 453

Large-Scale Data Analytics and Visualization

Coordinator: Jingnan Xie

Credits: 4.0

Description

A practical introduction to data analytics, visualization, and blending theory. Students will learn about and apply various clustering algorithms and techniques for dealing with noisy data, use a distributed data analytics framework, complete laboratory assignments using version control, and enforce reproducibility by making all of their work easily shareable. Students will become familiar with modern data analytics methods and explore real-world data sets. Visualization of results, using both interactive and static frameworks, will be a large component of the course. Offered Periodically.

Prerequisites

CSCI 366 AND (MATH 235 OR MATH 333 OR MATH 335).

Course Outcomes

At the end of this course, a student will:

  1. Create reproducible, explainable data science workflows

  2. Use a modern distributed MapReduce framework, such as Apache Spark, to analyze data

  3. Implement parallel clustering methods

  4. Develop strategies for overcoming common imperfections in real-world datasets

  5. Apply visualization techniques to multi-dimensional data

  6. Apply the skills gained to extract insights from multi-dimensional, real-world datasets

These goals will be accomplished through the content of the lectures and textbook, as well as hands-on experience, which includes writing programs both in the lab and in project assignments. There will also be a significant course project in which you identify an analysis topic, discover data, model the data using data mining techniques, analyze the results, and report outcomes. Achievement of these goals will be measured through your performance on approximately 7 lab assignments, the project, and two exams (midterm and final).

Tentative Semester Schedule

Week 1: Introductory materials on experimental design and data

Week 2: Data operations: filtering, transforming, reducing

Week 3: Distributed computing

Week 4: Distributed regression

Week 5: Visualization of one-dimensional data

Week 6: Visualization of two-dimensional data

Week 7: Exam

Week 8: Case Study: K-Means Clustering

Week 9: Distributed Graph Algorithms

Week 10: Case Study: PageRank

Week 11: Distributed Regression

Week 12: Distributed Machine Learning + Cross Validation

Week 13: Distributed SQL

Week 14: Presentations