# Statistical Learning
This one is the best:
# 1️⃣ The Supervising Learning Problem
# Starting point
Outcome measurement (also called dependent variable, response, target).
Vector of predictor measurements (also called inputs, regressors, covariates, features, independent variables).
In the , takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).
We have training data . These are observations (examples, instances) of these measurements.
# Objective of Supervised Learning
On the basis of the training data we would like to:
- Accurately predict unseen test cases.
- Understand which inputs affect the outcome, and how.
- Assess the quality of our predictions and inferences.
# Philosophy
- It is important to understand the ideas bahind the various techniques, in order to know how and when to use them.
- One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
- It is important to accurately assess the performance of a method, to know how well of how badly it is working.
TIP
Simpler methods often perform as well as fancier ones!
This is an exciting research area, having important applications in science, industry and finance.
Statistical learning is a fundamental ingredient in the training of a modern .
# Unsupervised learning
- No outcome variable, just a set of predictors (features) measured on a set of samples.
- Objective is more fuzzy - find groups of samples that behave similarly, find features that behave * similarly, find linear combinations of features with the most variation.
- difficult to know how well your are doing.
- different from supervised learning, but can be useful as a pre-processing step for supervised learning.
# The Netflix prize
- Competition started in October 2006. Training data is ratings for 18000 movies by 400000 Netflix * customers, each rating between 1 and5.
- training data is very sparse - about 98% missing.
- objective is to predict the rating for a set of 1 million customer-movie pairs that are missing in the * training data.
- Netflix's original algorithm achieved a root MSE of 0.953. The first team to achieve 10% improvement wins one million dollars.
- is this a supervised or unsupervised learning problem?
# Statistical Learning versus Machine Learning
- Machine learning arose as a subfield of Artificial Intelligence.
- Statistical learning arose as a subfield of Statistics.
- - both fields focus on supervised and unsupervised problems:
- Machine learning has a greater emphasis on applications and .
- Statistical learning emphasizes and their intepretability, and and
- But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization".
- Machine learning has the upper hand in Marketing!