## COSC-254 Data Mining (Spring 2019)

### Course info

**Times & Location:** MW 2–3.20pm, Science Center E110

**Website:** http://rionda.to/courses/cosc-254-s19/, Moodle for
assignments and forum

**Prerequisites:** COSC-211 Data Structures

**Instructor:** Matteo Riondato
(he/his, please call me "Matteo")

*Contact:* mriondato@amherst.edu (please put
`[COSC254]` at the start of your subject line; use email only for confidential messages
that cannot go to the forum)

*Office Hours:* T 3.30–5.30pm, Science Center C214. Please
reserve a 15-minute slot by the day
before (Sunday) at 4pm.

**TA:** Alexander Einarsson

*Office Hours:* Th 3–5pm, Science Center E210.

### Description

This course is an *introduction to data mining*, the area of computer science that deals with the development of *efficient algorithms for extracting information from data*. We will:

- talk about the key tasks in the analysis of transactional datasets, time series, and graphs, and the most efficient algorithms to solve them;
- learn about parallel/distributed systems to perform the analysis of massive datasets;
- use *interactive notebooks* and *large-scale systems* to evaluate algorithms and analyze data.

### Syllabus

Most of the information you need is available in the syllabus. For anything else, please ask on
the Moodle forum or, if it is confidential, email Matteo (please put
`[COSC254]` at the start of your subject line).

### Schedule & Diary

For past dates, the listed topics are those covered on those dates.
For future dates, they are the planned topics and are subject to change. In the
readings, *MMD* denotes the *Mining of Massive Datasets* book,
and *DMT* denotes *Data Mining: The Textbook*.

- List of covered topics
- Lecture of 4/18: Ranking. Slides on HITS, Slides on PageRank.

  HW07 is out. Due on 4/24 at 1.59pm.
- Lecture of 4/15: Indexing and Ranking Slides
- Lecture of 4/10: Link prediction. The Web and crawling Slides
- No lecture on 4/8
- Lecture of 4/3: Community detection Slides.
- Lecture of 4/1: Closeness and Betweenness Centrality Slides.
- Lecture of 3/27: Centrality measures Slides. Readings: MMD 10.1, DMT 19.2.1 to 19.2.5.2.

  HW06 is out! Due 4/3 at 1.59pm.
- Lecture of 3/25: Social network analysis Slides.
- Project 02 is out: proj02.pdf, triest.zip. Due on 4/15.
- Lecture of 3/20: Counting triangles on MapReduce Slides. Readings: MMD 2.3.7, 2.5.3, 10.7.4.

  HW05 is out! Due 3/27 at 1.59pm.
- Lecture of 3/18: Counting triangles on static graphs. Homework correction Slides, Homework Slides. Readings: MMD 10.7.1, 10.7.2, 10.7.3.
- Lecture of 3/6: Graphs and TRIÈST Slides.
- Lectures of 2/27 and 3/4: Data Streams: DGIM algorithm Slides. Readings: MMD 4.6.

  HW04 is out! Due 3/6 at 1.59pm.
- Project 01 is out: proj01.pdf
- Lecture of 2/25: Data Streams: Bloom filter, Flajolet-Martin approach Slides. Readings: MMD 4.3, 4.4.
- Lecture of 2/20: Data Streams: Intro, Reservoir sampling Slides. Readings: MMD 4.1, 4.2.
- Lecture of 2/18: Eclat algorithm (Slides), Compressing Patterns (Slides). Readings: N/A.

  **Due to the network outage, both HW02 and HW03 are due on Wed 2/20 at 2pm.**
- Lecture of 2/13: Association Rules, Apriori algorithm Slides. Readings: MMD 6.2.5, DMT 4.4.1, 4.4.2.

  HW03 is out! Due 2/20 at 1.59pm.
- Lecture of 2/11: Intro to Association Rules Slides. Readings: MMD 6.1.3, DMT 4.3.
- Lecture of 2/6: Communication costs, Intro to Pattern Mining Slides. Readings: MMD 2.5, 6.1.1, 6.1.2, 6.2.1, 6.2.3, 6.2.4, DMT 4.1, 4.2.

  HW02 is out! Due 2/13 at 1.59pm.
- Lecture of 2/4: Matrix-by-Vector Multiplication in Hadoop. Readings: MMD 2.3.
- Lecture of 1/30: MapReduce & Hadoop Slides. Readings: MMD 2.1, 2.2.

  HW01 is out! Due 2/6 at 1.59pm.
- Lecture of 1/28: What is Data Mining? Slides. Readings: MMD Ch. 1, DMT Ch. 1.

  HW00 is out! Due 1/30 at 1.59pm.

#### Future classes

- Week of 4/22: PageRank and Review
- Week of 4/29: Review