FeatureHashing
==============
[![Travis-ci Status](https://travis-ci.org/wush978/FeatureHashing.svg)](https://travis-ci.org/wush978/FeatureHashing)
[![Coverage Status](https://img.shields.io/coveralls/wush978/FeatureHashing.svg)](https://coveralls.io/r/wush978/FeatureHashing?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/FeatureHashing)](http://cran.r-project.org/web/packages/FeatureHashing)
[![rstudio mirror downloads](http://cranlogs.r-pkg.org/badges/FeatureHashing)](https://github.com/metacran/cranlogs.app)
Implements feature hashing in R.
```{r setup, include=FALSE}
library(knitcitations)
bib <- read.bibtex("README.bib")
# citep(bib[[1]])
```
## Introduction
[Feature hashing](http://en.wikipedia.org/wiki/Feature_hashing), also called the hashing trick, is a method for
transforming features into a vector. Instead of looking the indices up in an
associative array, it applies a hash function to the features and uses their
hash values directly as indices.
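
As a rough illustration of the idea, the following base R sketch maps feature strings to indices with a toy polynomial hash. The names `toy_hash` and `hash_features` are hypothetical, and the hash is chosen only for readability; it is not the hash function used by the package.

```r
# Toy polynomial string hash into the range 1..n (illustration only).
toy_hash <- function(s, n) {
  h <- 0
  for (code in utf8ToInt(s)) {
    h <- (h * 31 + code) %% n
  }
  h + 1  # R indices start at 1
}

# Turn a character vector of features into a count vector of length n
# without first building a feature -> index dictionary.
hash_features <- function(features, n = 2^4) {
  v <- numeric(n)
  for (f in features) {
    i <- toy_hash(f, n)
    v[i] <- v[i] + 1
  }
  v
}

x <- hash_features(c("City=Taipei", "Domain=example.com", "City=Taipei"))
length(x)  # always n, however many distinct features exist
sum(x)     # one increment per input feature
```

Distinct features can collide on the same index; a larger `n` makes collisions less likely at the cost of a wider vector.
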
The package FeatureHashing implements the method in `r citep(bib[["DBLP:conf/icml/WeinbergerDLSA09"]])` to transform
a `data.frame` into a sparse matrix. The package provides a formula interface similar to `model.matrix`
in R and `Matrix::sparse.model.matrix` in the package Matrix. It also supports splitting concatenated data
during the construction of the model matrix; see the help of `test.tag` for an explanation of concatenated data.
## Installation
To install the stable version from CRAN, run this command:
```r
install.packages("FeatureHashing")
```
For the most up-to-date version, install from GitHub. Windows users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
```r
devtools::install_github('wush978/FeatureHashing')
```
### When should we use Feature Hashing?
Feature hashing is useful when the dimension of the feature vector is not easy to determine in advance.
For example, the bag-of-words representation in a document classification problem requires scanning the entire dataset to know how many distinct words there are, i.e. the dimension of the feature vector.
In general, feature hashing is useful in the following environments, where it is expensive or impossible to know the real dimension of the feature vector:
- Streaming environment
- Distributed environment
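
The contrast with a dictionary-based encoding can be sketched in base R. The helper names and the toy hash are illustrative assumptions and not part of the package. A dictionary needs to see every word before the dimension is known, whereas hashing lets each chunk be encoded on its own with a dimension fixed up front.

```r
# Toy string hash into 1..n (illustration only, not the package's hash).
toy_hash <- function(s, n) {
  h <- 0
  for (code in utf8ToInt(s)) h <- (h * 31 + code) %% n
  h + 1
}

# Encode one document (a vector of words) as a length-n count vector.
encode_doc <- function(words, n) {
  idx <- vapply(words, toy_hash, numeric(1), n = n)
  tabulate(idx, nbins = n)
}

n <- 2^5
chunk1 <- list(c("cheap", "flights"), c("cheap", "hotel"))
chunk2 <- list(c("hotel", "deals", "today"))  # contains unseen words

# Dictionary approach: the dimension is known only after scanning everything.
vocab <- unique(unlist(c(chunk1, chunk2)))
length(vocab)

# Hashing approach: each chunk is encoded independently, e.g. as it arrives
# in a stream or on a different worker, and the columns still line up.
m1 <- t(sapply(chunk1, encode_doc, n = n))
m2 <- t(sapply(chunk2, encode_doc, n = n))
dim(m1)
dim(m2)
```

The word `"hotel"` lands in the same column of `m1` and `m2` even though the two chunks were encoded separately.
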
## Getting Started
The following scripts show how to use `FeatureHashing` to construct a `Matrix::dgCMatrix` and train a model with other packages that support `Matrix::dgCMatrix` as input.
The dataset is a sample from the iPinYou dataset, which is described in `r citep(bib[["zhang2014real"]])`.
### Logistic Regression with [`glmnet`](http://cran.r-project.org/web/packages/glmnet/index.html)
```{r lr}
# The following script assumes that the data.frame
# of the training dataset and testing dataset are
# assigned to variable `ipinyou.train` and `ipinyou.test`
# respectively
library(FeatureHashing)
# Checking version.
stopifnot(packageVersion("FeatureHashing") >= package_version("0.9"))
data(ipinyou)
f <- ~ IP + Region + City + AdExchange + Domain +
  URL + AdSlotId + AdSlotWidth + AdSlotHeight +
  AdSlotVisibility + AdSlotFormat + CreativeID +
  Adid + split(UserTag, delim = ",")
# if the version of FeatureHashing is 0.8, please use the following command:
# m.train <- as(hashed.model.matrix(f, ipinyou.train, 2^16, transpose = FALSE), "dgCMatrix")
m.train <- hashed.model.matrix(f, ipinyou.train, 2^16)
m.test <- hashed.model.matrix(f, ipinyou.test, 2^16)
# logistic regression with glmnet
library(glmnet)
cv.g.lr <- cv.glmnet(m.train, ipinyou.train$IsClick,
                     family = "binomial")#, type.measure = "auc")
p.lr <- predict(cv.g.lr, m.test, s="lambda.min")
auc(ipinyou.test$IsClick, p.lr)
```
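
The scripts in this section evaluate predictions with `auc`. For readers who want to see what that number measures, here is a minimal base R sketch that computes the AUC from ranks (the Mann-Whitney statistic). `auc_rank` is a hypothetical helper name, not a function from `glmnet` or from this package.

```r
# AUC from ranks; tied scores receive average ranks.
auc_rank <- function(label, score) {
  label <- as.numeric(label)
  score <- as.numeric(score)  # also flattens a one-column prediction matrix
  stopifnot(all(label %in% c(0, 1)), length(label) == length(score))
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
```

The value is the fraction of (positive, negative) pairs in which the positive example receives the higher score, so 0.5 corresponds to random ranking and 1 to a perfect ranking.
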
### Gradient Boosted Decision Tree with [`xgboost`](http://cran.r-project.org/web/packages/xgboost/index.html)
Following the script above,
```{r xgboost}
# GBDT with xgboost
library(xgboost)
cv.g.gdbt <- xgboost(m.train, ipinyou.train$IsClick, max.depth=7, eta=0.1,
                     nround = 100, objective = "binary:logistic", verbose = ifelse(interactive(), 1, 0))
p.lm <- predict(cv.g.gdbt, m.test)
glmnet::auc(ipinyou.test$IsClick, p.lm)
```
### Per-Coordinate FTRL-Proximal with $L_1$ and $L_2$ Regularization for Logistic Regression
The following script uses an implementation of FTRL-Proximal for logistic regression, published in `r citep(bib[["DBLP:conf/kdd/McMahanHSYEGNPDGCLWHBK13"]])`, to predict the probability (1-step prediction) and update the model simultaneously.
```{r ftprl}
source(system.file("ftprl.R", package = "FeatureHashing"))
m.train <- hashed.model.matrix(f, ipinyou.train, 2^16, transpose = TRUE)
ftprl <- initialize.ftprl(0.1, 1, 0.1, 0.1, 2^16)
ftprl <- update.ftprl(ftprl, m.train, ipinyou.train$IsClick, predict = TRUE)
auc(ipinyou.train$IsClick, attr(ftprl, "predict"))
```
If we use the same algorithm to predict the click-through rate on the 3rd season of iPinYou, the overall AUC is 0.77, which is comparable to the overall AUC of 0.76 for the 3rd season reported in `r citep(bib[["zhang2014real"]])`.
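
For intuition about what "predict and update simultaneously" means, the per-coordinate FTRL-Proximal update for logistic regression can be sketched in base R. This is a minimal sketch written from the update rule in the cited paper, not the package's `ftprl.R`. The function name `ftrl_sketch`, its arguments, and the toy data are assumptions made for illustration, and features are assumed to be binary with unique indices per example.

```r
# One pass of logistic FTRL-Proximal over binary features (sketch only).
# rows: list of integer vectors holding the active feature indices per example.
ftrl_sketch <- function(rows, y, dim, alpha = 0.1, beta = 1,
                        lambda1 = 0.1, lambda2 = 0.1) {
  z <- numeric(dim)  # accumulated adjusted gradients
  n <- numeric(dim)  # accumulated squared gradients
  p <- numeric(length(rows))
  weights <- function(i) {
    # Closed-form per-coordinate weight; L1 zeroes out coordinates with small |z|.
    ifelse(abs(z[i]) <= lambda1, 0,
           -(z[i] - sign(z[i]) * lambda1) /
             ((beta + sqrt(n[i])) / alpha + lambda2))
  }
  for (t in seq_along(rows)) {
    i <- rows[[t]]
    w <- weights(i)
    p[t] <- 1 / (1 + exp(-sum(w)))    # predict before using y[t]
    g <- rep(p[t] - y[t], length(i))  # log-loss gradient when x_i = 1
    sigma <- (sqrt(n[i] + g^2) - sqrt(n[i])) / alpha
    z[i] <- z[i] + g - sigma * w
    n[i] <- n[i] + g^2
  }
  list(w = weights(seq_len(dim)), predict = p)
}

# Toy data: feature 1 appears with clicks, feature 2 with non-clicks,
# feature 3 appears in every example, feature 4 never appears.
rows <- rep(list(c(1L, 3L), c(2L, 3L)), 50)
y <- rep(c(1, 0), 50)
fit <- ftrl_sketch(rows, y, dim = 4)
fit$w
```

Each prediction in `fit$predict` is made before the corresponding label is used for the update, which is why the AUC of these predictions is a fair 1-step-ahead measure.
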
## Supported Data Structure
- character and factor
- numeric and integer
- array, i.e. concatenated strings such as `c("a,b", "a,b,c", "a,c", "")`
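
To make the last item concrete, here is a small base R sketch of what splitting concatenated strings means, using the example vector above. It enumerates the tags explicitly for readability; with feature hashing, the tags would instead be mapped to indices by the hash function. The variable names are illustrative only.

```r
x <- c("a,b", "a,b,c", "a,c", "")

# Split each element on the delimiter; the empty string yields no tags.
tags <- strsplit(x, ",", fixed = TRUE)

# One indicator column per distinct tag, one row per element of x.
tag_levels <- sort(unique(unlist(tags)))
m <- t(vapply(tags, function(tg) as.numeric(tag_levels %in% tg),
              numeric(length(tag_levels))))
colnames(m) <- tag_levels
m
```

Each row marks which tags are present in the corresponding element, and the row for `""` is all zeros.
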
## Reference
```{r bibliograph, results='asis', echo=FALSE}
bibliography()
```