© Università degli Studi di Cassino e del Lazio Meridionale
Statistical Learning and Data Mining
Prof. Mario Rosario Guarracino
Prof. Alfonso Iodice D’Enza
Contact information: mariorosario.guarracino@unicas.it, iodicede@unicas.it
Term: Second Semester
Credits (ECTS): 6
Prerequisites: Applied statistics
Language of Instruction: English
Class hours: 42
LEARNING OBJECTIVES:
Cognitive / Knowledge skills
- Develop an understanding of the statistical learning framework, with general concepts for model building, selection and evaluation.
- Understand the trade-offs related to the aim of the analysis and to the nature and amount of available data.
- Study the theoretical foundation of the basic (linear) methods for regression and classification.
- Study the computational approaches that support the effective application of the studied methods.
Analytical / Critical Thinking Skills
- Learn the basic programming skills to implement linear methods for regression and classification.
- Interpret the results and identify the most effective way to analyze the available data.
- Learn to present the results rigorously while still allowing a non-technical audience to understand the main findings of an analysis.
COURSE DESCRIPTION:
The first part of the course covers general concepts applicable to both regression and classification problems: the definition of statistical learning, training and test errors, and the trade-offs in choosing the right model. Linear models for regression are then described: from simple to multiple regression, qualitative predictors, interactions and common issues in the application of such models. Afterwards, classification problems are described, from linear methods, e.g. logistic regression and linear discriminant analysis, to nonlinear ones, e.g. quadratic discriminant analysis.
The last part of the course is an introduction to model selection and regularization, and to resampling methods for estimating the test error (cross-validation) and for assessing the accuracy of an estimator (bootstrap). All the methods will be implemented and applied in R.
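As a rough illustration of the two resampling ideas named above, the sketch below estimates the test error of a simple linear regression by k-fold cross-validation and the standard error of its slope by the bootstrap. The course itself uses R; Python with only the standard library is used here to keep the example self-contained. The simulated data, the function names (`fit_line`, `kfold_mse`, `bootstrap_se`) and the choice of 5 folds and 500 bootstrap samples are assumptions made for illustration, not course material.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def kfold_mse(xs, ys, k=5, seed=0):
    """Estimate the test MSE by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Split the shuffled indices into k roughly equal folds.
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then predict the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
    return statistics.fmean(sq_errors)


def bootstrap_se(xs, ys, n_boot=500, seed=0):
    """Bootstrap estimate of the standard error of the slope."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Resample n observations with replacement and refit.
        sample = [rng.randrange(n) for _ in range(n)]
        _, b = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
        slopes.append(b)
    return statistics.stdev(slopes)


# Simulated data (assumption): y = 1 + 2x + Gaussian noise with sd 1.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]

cv_mse = kfold_mse(xs, ys)
slope_se = bootstrap_se(xs, ys)
print(f"5-fold CV estimate of test MSE: {cv_mse:.3f}")
print(f"Bootstrap SE of the slope: {slope_se:.3f}")
```

Because the noise in the simulated data has variance 1, the cross-validated MSE should land near 1, which is the kind of sanity check these methods allow when the true error is unknown.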
INSTRUCTIONAL FORMAT:
The class will meet for 2 hours (including the break between classes), twice a week, for a total of 21 sessions. After an introduction aimed at providing the needed background, participants are required to do both conceptual and applied homework. Classes will consist of lectures by the instructor until a topic is completely covered. After each topic, one class will be devoted to participants presenting the assigned homework.
TENTATIVE COURSE SCHEDULE:
Week 1 (Textbook, chapters 1 and 2)
Introduction to the Course
Presentation of the Available materials
Clear Statement of Expected Mutual Requirements
Regression and classification problems
Trade-offs in statistical learning
Parametric and nonparametric methods
Supervised vs unsupervised learning
Week 2 (Textbook, chapters 1 and 2)
Introduction to the R language
Measuring errors in regression and classification models
Week 3 (Textbook, chapter 3)
Introduction to linear models for regression
Model fit and inference
Variance estimator
Week 4 (Textbook, chapter 3)
Confidence and prediction intervals
Algebraic formalization of multiple regression
Global test and block-based test
Qualitative predictors and interaction effects
Week 5 (Textbook, chapter 3)
Polynomial regression
Violations of model assumptions
Correlated errors
Heteroscedasticity
Multicollinearity
Week 6 (Textbook, chapter 3)
Practical examples of linear regression in R
Implementation and interpretation
Model diagnostics
Week 7 (Textbook, chapter 4)
Classification methods
Logistic regression
Link function and model fit
Linear discriminant analysis
Bayes rule and difference with logistic approach
Week 8 (Textbook, chapter 4)
Multiple LDA and Logit
Class-specific errors
ROC curve
Quadratic discriminant analysis
Comparison of classification methods
Week 9 (Textbook, chapter 4)
Practical examples of classification in R
Implementation and interpretation
Model diagnostics
Week 10 (Textbook, chapters 5 and 6)
Resampling methods
Validation approaches
Bootstrap
Model selection
Week 11 (Textbook, chapter 6)
Shrinkage methods
Ridge regression
Lasso regression
Model selection via regularization
WORKLOAD EXPECTATIONS:
All students are expected to spend at least 2.5 hours on academic studies outside of, and in addition to, each hour of class time.
FORMS OF ASSESSMENT:
The instructor will use several different forms of assessment to calculate the final grade you receive for this course. These are listed and weighted below. The content, criteria and specific requirements for each assessment category will be explained in greater detail in class. Any questions about the requirements should be discussed directly with your instructor well in advance of the due date for each assignment.
FORM OF ASSESSMENT | VALUE
Class Participation | 10%
Homework | 30%
Homework discussion | 15%
Final project | 30%
Final project presentation | 15%
ASSESSMENT OVERVIEW:
Class Participation: This grade will reflect your participation in class discussions and your ability to grasp the rationale of the subjects presented in class.
Homework: Each homework is due one week after it is assigned. Different homework is assigned to different participants, covering both conceptual and applied aspects of the topic. The applied part of the homework will require R implementations of the methods.
Homework discussion: The homework is presented in class; each participant is required to present the problem and the corresponding findings to the instructor and to classmates.
Final project: In the final project, the participant will carry out a thorough analysis of a real data set, from data pre-processing through the analysis to the presentation and interpretation of the results.
Final project presentation: The final project will be presented in a one-day workshop in the presence of other participants, PhDs and other scholars interested in the topic.
CLASS/INSTRUCTOR POLICIES:
Professionalism and communications: As a student, you are expected to maintain a professional, respectful and conscientious manner in the classroom with your instructors and peers.
You are expected to take your academic work seriously and engage actively in your classes. Advance preparation, completing your assignments, and showing a focused and respectful attitude are expected of all students. Simply showing up for class or meeting the minimum outlined criteria will not earn you a good grade in this course. Communicating appropriately, properly addressing your faculty and staff, asking questions and expressing your views respectfully demonstrate your professionalism and cultural sensitivity.
Attendance and Classroom behavior: Although attendance is not compulsory, it is highly recommended. All students must have a respectful attitude towards the professor as well as the classmates.
Arriving late / departing early from Class: Students who decide to attend are expected to be present for the whole class. Arriving late or leaving class early is disruptive and shows a lack of respect for the instructor and fellow students.
Make-up classes: The instructor reserves the right to schedule make-up classes in the event of an unforeseen or unavoidable schedule change. Make-up classes may be scheduled outside of typical class hours, as necessary.
Missing Examinations: Examinations will not be rescheduled. Pre-arranged travel or anticipated absence does not constitute an emergency and requests for missing or rescheduling exams will not be granted.
Use of Cell Phones, Laptops and Other Electronic Devices: Always check with your instructor about acceptable usage of electronic devices in class. Inappropriate usage of your electronic devices will result in a warning and may lead to a deduction in participation grades. Use of a cell phone for phone calls, text messages, emails, or any other purposes during class is impolite, inappropriate and prohibited. The instructor determines whether laptops will be allowed in class.
REQUIRED READINGS:
Listed below are the required course textbooks and additional readings. These are required materials for the course and you are expected to have constant access to them from the very beginning of the course for reading, highlighting and note-taking. It is required that you have unrestricted access to each. Access to additional sources required for certain class sessions may be provided in paper or electronic format consistent with applicable copyright legislation.
Required texts: An Introduction to Statistical Learning, with Applications in R. James, G., Witten, D., Hastie, T. and Tibshirani, R. Springer, 2013.
Online Reference & Research Tools:
CRAN: https://cran.r-project.org, to download the R installer. R is free and available for Windows, macOS and Linux.