Machine Learning

Author: Hilary Mason (@hmason)

ML History

ENIAC

Turing Test

ELIZA

AI Winter

jmseigler? SexBot (except not)

Statistics added in the 1990s (revitalized AI)

Clustering

Start with K-means
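A minimal K-means sketch in plain Python, added here for illustration and not taken from the talk. It assumes 2-D points and a fixed k; the function names (squared_distance, assign, kmeans) and the seed parameter are made up for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centroids):
    """Group each point under the index of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                xs = [x for x, _ in cluster]
                ys = [y for _, y in cluster]
                updated.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                updated.append(old)  # keep a centroid that lost all its points
        if updated == centroids:  # nothing moved, so we have converged
            break
        centroids = updated
    return centroids, assign(points, centroids)
```

The result depends on the starting centroids, so real use usually means several runs with different seeds and keeping the tightest clustering.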

Entity disambiguation

Topic Model

R has a topic module

Hilary has Python code

Recommendations

Based on existing data of users with similar interests

Amazon

Netflix
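A rough sketch of "users with similar interests" as user-based collaborative filtering. This is an illustration only, not how Amazon or Netflix actually do it; the cosine similarity choice and the names cosine and recommend are assumptions made for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Suggest items the user has not rated, weighted by similar users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

Usage: pass a dict of {user: {item: rating}} and a user name; the return value is a list of item names.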

Classification

Train the classifiers

Bayesian

Spam Filter

Facial Recognition
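A small naive Bayes text classifier, to show what "train the classifiers" can look like for a spam filter. This is a sketch under assumptions (whitespace tokenizing, add-one smoothing, a made-up NaiveBayes class), not the code from the talk.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Word-count naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Work in log space so many small probabilities do not underflow.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage: call train(text, label) once per labeled example, then classify(text) on new messages.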

Dirty Hacks

Good sources of training data

Wikipedia

NY Times

lynx --dump <url>
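If lynx is not installed, something roughly similar can be done with the Python standard library. A sketch only: the TextExtractor class and the html_to_text and fetch_text helpers are names invented here, and the output is much cruder than a real lynx dump (no link list, no layout).

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_text(url):
    """Download a page and strip its markup (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return html_to_text(response.read().decode("utf-8", errors="replace"))
```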

How to approach

Obtain

Scrub

Explore

Model

iNterpret

Build a Model

Probability Theory

Area is 1

P(A or B) = P(A) + P(B) - P(A and B)
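The two rules above can be checked by brute force on a fair six-sided die. The die and the two events (A = even roll, B = roll greater than 3) are chosen here purely for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)  # faces of a fair die


def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def is_even(o):
    return o % 2 == 0


def is_big(o):
    return o > 3


# "Area is 1": the probabilities of all single outcomes sum to 1.
total = sum(prob(lambda o, k=k: o == k) for k in outcomes)

p_a = prob(is_even)
p_b = prob(is_big)
p_and = prob(lambda o: is_even(o) and is_big(o))
p_or = prob(lambda o: is_even(o) or is_big(o))

# Inclusion-exclusion: subtract the overlap so it is not counted twice.
assert total == 1
assert p_or == p_a + p_b - p_and
```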

Bayes' Law
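A worked example of Bayes' law, P(A|B) = P(B|A) P(A) / P(B), in a spam setting. All three input numbers are invented for illustration and do not come from the talk.

```python
from fractions import Fraction

# Hypothetical inputs, made up for this example.
p_spam = Fraction(1, 5)              # P(spam): prior share of spam
p_word_given_spam = Fraction(1, 2)   # P(word | spam)
p_word_given_ham = Fraction(1, 20)   # P(word | not spam)

# P(word) by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' law: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
```

With these numbers, seeing the word raises the probability of spam from 1/5 to 5/7.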

Twitter

Sports down, Math up

Python using NLTK

On GitHub

Feature Selection

Easy for humans, but not statistically feasible

Think about what's interesting about the data.

(Twitter) N-grams, people, presence of link, etc.
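One possible way to turn a tweet into the features listed above (n-grams, people, presence of a link). A sketch with invented names (extract_features, the feature key formats); the regular expressions are deliberately simple. The dict-of-features shape is the kind of featureset NLTK classifiers accept, but NLTK itself is not used here.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@(\w+)")


def extract_features(tweet, n=2):
    """Return a {feature_name: True/False} dict for one tweet."""
    features = {"has_link": bool(URL_PATTERN.search(tweet))}

    text = URL_PATTERN.sub("", tweet)
    for name in MENTION_PATTERN.findall(text):
        features["mention(%s)" % name.lower()] = True

    # Drop mentions before tokenizing so user names do not become n-grams.
    words = re.findall(r"\w+", MENTION_PATTERN.sub("", text).lower())
    for i in range(len(words) - n + 1):
        features["ngram(%s)" % " ".join(words[i:i + n])] = True
    return features
```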

Bit.ly

Actually a hard problem

Size indicators

Billions or trillions of data points

In memory DB of everything within the last hour

Velocity, half-life, prediction
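A toy version of the "last hour in memory" idea with velocity and half-life on top. The definitions are assumptions made for this sketch, not Bit.ly's: velocity is clicks per second over the window, and half-life is the time from the first click until half of the observed clicks have arrived. The RecentClicks class and half_life function are hypothetical names.

```python
from collections import deque


class RecentClicks:
    """Keep only click timestamps from the last `window` seconds."""

    def __init__(self, window=3600):
        self.window = window
        self.times = deque()

    def add(self, timestamp):
        # Assumes timestamps arrive in non-decreasing order.
        self.times.append(timestamp)
        self._expire(timestamp)

    def _expire(self, now):
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()

    def velocity(self, now):
        """Clicks per second over the current window."""
        self._expire(now)
        return len(self.times) / self.window


def half_life(click_times):
    """Seconds from the first click until half of all clicks have arrived."""
    if not click_times:
        return None
    ordered = sorted(click_times)
    half_index = (len(ordered) + 1) // 2 - 1  # click that completes the half
    return ordered[half_index] - ordered[0]
```

At real scale (billions of data points) the same idea would need sharding or approximate counters; the deque only shows the windowing logic.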

Location mining

Cultural analysis based on when & where people are clicking

Collaborative Filtering

Tom Mitchell
Data Mining (Purple Cover)
Email for resources
WordNet
Research benefits of combining models