# Statistical Learning

Stanford Course

YouTube Channel

This one is the best:

Course Web site

Book

# 1️⃣ The Supervised Learning Problem

# Starting point

  • Outcome measurement $Y$ (also called dependent variable, response, target).

  • Vector of $p$ predictor measurements $X$ (also called inputs, regressors, covariates, features, independent variables).

  • In the regression problem, $Y$ is quantitative (e.g. price, blood pressure).

  • In the classification problem, $Y$ takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).

  • We have training data $(x_1, y_1), \ldots, (x_N, y_N)$. These are observations (examples, instances) of these measurements.
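
As a minimal sketch of the notation above, the snippet below lays out a toy training set in plain Python. All values and variable names (`X`, `y_regression`, `y_classification`) are made up for illustration and do not come from the course.

```python
# Toy training set: N = 4 observations, each with p = 2 predictor measurements.
# Every number here is invented for illustration.
X = [
    [1.0, 20.0],
    [2.0, 35.0],
    [3.0, 50.0],
    [4.0, 65.0],
]

# A quantitative outcome, one value per observation.
y_regression = [10.5, 14.0, 18.2, 21.9]

# An outcome taking values in a finite, unordered set.
y_classification = ["died", "survived", "survived", "died"]

N = len(X)       # number of training observations
p = len(X[0])    # number of predictors per observation

# The training data as pairs (x_1, y_1), ..., (x_N, y_N).
training_data = list(zip(X, y_regression))

# The distinct class labels seen in the training data.
classes = sorted(set(y_classification))
```

Each row of `X` plays the role of one $x_i$, and the matching entry of the outcome list plays the role of $y_i$.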

# Objective of Supervised Learning

On the basis of the training data we would like to:

  • Accurately predict unseen test cases.
  • Understand which inputs affect the outcome, and how.
  • Assess the quality of our predictions and inferences.
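
The first and third objectives can be illustrated with a small sketch: fit a model on training data only, then measure its error on cases it has not seen. The example assumes a single predictor and a least-squares line, which is just one simple choice of method; the data and the helper names (`fit_simple_linear_regression`, `mean_squared_error`) are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def mean_squared_error(ys, predictions):
    """Average squared difference between outcomes and predictions."""
    return sum((y - pred) ** 2 for y, pred in zip(ys, predictions)) / len(ys)


# Invented data, roughly following y = 2x + 1 plus noise.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [3.1, 4.9, 7.2, 8.8, 11.1]

# Held-out test cases that the fit never sees.
x_test = [6.0, 7.0]
y_test = [13.2, 14.8]

intercept, slope = fit_simple_linear_regression(x_train, y_train)
test_predictions = [intercept + slope * x for x in x_test]
test_mse = mean_squared_error(y_test, test_predictions)
```

The fitted `slope` speaks to how the input affects the outcome, while `test_mse` is an assessment of prediction quality on unseen cases.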

# Philosophy

  • It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
  • One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
  • It is important to accurately assess the performance of a method, to know how well or how badly it is working.

TIP

Simpler methods often perform as well as fancier ones!

  • This is an exciting research area, having important applications in science, industry and finance.

  • Statistical learning is a fundamental ingredient in the training of a modern data scientist.

# Unsupervised learning

  • No outcome variable, just a set of predictors (features) measured on a set of samples.
  • The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
  • It is difficult to know how well you are doing.
  • Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
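
To make "linear combinations of features with the most variation" concrete, here is a rough sketch that finds the single direction of greatest variance in a small data set. It uses power iteration on the sample covariance matrix, a technique chosen here for brevity rather than one named in the notes; the data and the function name `first_principal_direction` are hypothetical.

```python
import math


def first_principal_direction(rows, iterations=100):
    """Unit vector along which the centered data varies most.

    Computed by power iteration on the sample covariance matrix.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]

    # Sample covariance matrix (p x p).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

    # Repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Invented samples with two features that rise together; no outcome variable.
samples = [[2.0, 1.9], [4.0, 4.1], [6.0, 6.0], [8.0, 8.1], [10.0, 9.9]]
direction = first_principal_direction(samples)
```

Because the two features move almost in lockstep, `direction` weights them nearly equally. Note that there is no outcome to check the result against, which is the sense in which it is hard to know how well you are doing.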

# The Netflix prize

  • Competition started in October 2006. Training data is ratings for 18,000 movies by 400,000 Netflix customers, each rating between 1 and 5.
  • Training data is very sparse: about 98% missing.
  • Objective is to predict the rating for a set of 1 million customer-movie pairs that are missing in the training data.
  • Netflix's original algorithm achieved a root MSE of 0.953. The first team to achieve a 10% improvement wins one million dollars.
  • Is this a supervised or unsupervised learning problem?
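
The scoring rule can be sketched in a few lines. The snippet assumes that "10% improvement" means a 10% reduction in root MSE relative to the 0.953 baseline; the ratings and predictions below are invented, and the function name `root_mean_squared_error` is just an illustrative choice.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


baseline_rmse = 0.953        # Netflix's original algorithm, from the notes
required_improvement = 0.10  # assumed to mean a 10% reduction in root MSE

# Root MSE a team would need to reach under that assumption.
target_rmse = baseline_rmse * (1 - required_improvement)

# Invented ratings on the 1-to-5 scale and some hypothetical predictions.
actual_ratings = [5, 3, 4, 1, 2]
predicted_ratings = [4.5, 3.5, 4.0, 2.0, 2.5]
example_rmse = root_mean_squared_error(actual_ratings, predicted_ratings)
```

Under this reading, the winning threshold works out to 0.953 × 0.9 = 0.8577.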

# Statistical Learning versus Machine Learning

  • Machine learning arose as a subfield of Artificial Intelligence.
  • Statistical learning arose as a subfield of Statistics.
  • There is much overlap - both fields focus on supervised and unsupervised problems:
    • Machine learning has a greater emphasis on large scale applications and prediction accuracy.
    • Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
  • But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization".
  • Machine learning has the upper hand in Marketing!