Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature: Multi Dimensional Scaling #18

Open
kanishkamisra opened this issue Mar 21, 2018 · 2 comments
Open

New feature: Multi Dimensional Scaling #18

kanishkamisra opened this issue Mar 21, 2018 · 2 comments

Comments

@kanishkamisra
Copy link
Contributor

kanishkamisra commented Mar 21, 2018

Since widyr has two distance based functions, namely pairwise_delta and pairwise_dist, I was wondering if we can take it a step further and add functionality for the data analyst/scientist to scale these pairwise distances to k dimensions. This is done in base R using cmdscale that takes a distance matrix and returns a matrix with each item as a row and each of the k dimensions as columns. The points are calculated such that the distance is maintained.

Example:

euro_mat <- as.matrix(eurodist)[1:5, 1:5]

euro_mat
#>           Athens Barcelona Brussels Calais Cherbourg
#> Athens         0      3313     2963   3175      3339
#> Barcelona   3313         0     1318   1326      1294
#> Brussels    2963      1318        0    204       583
#> Calais      3175      1326      204      0       460
#> Cherbourg   3339      1294      583    460         0

cmdscale(euro_mat)
#>                [,1]       [,2]
#> Athens    2515.1944   14.68566
#> Barcelona -659.3761  961.30504
#> Brussels  -422.4595 -328.35478
#> Calais    -633.6209 -359.57342
#> Cherbourg -799.7379 -288.06249

In the tidy, widy way, I figured the following way to do this when we have a distance based result from a widyr method:

library(tidytext)
library(janeaustenr)
library(widyr)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

austen_dist <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  pairwise_dist(book, word, n)

## From utils.R in widyr package (private method), didnt want to use :::
col_name <- function(x, default = stop("Please supply column name", call. = FALSE))
{
  if (is.character(x))
    return(x)
  if (identical(x, quote(expr = )))
    return(default)
  if (is.name(x))
    return(as.character(x))
  if (is.null(x))
    return(x)
  stop("Invalid column specification", call. = FALSE)
}

multi_scale <- function(tbl, item1, item2, value, k = 2, ...) {
  multi_scale_(tbl,
               col_name(substitute(item1)),
               col_name(substitute(item2)),
               col_name(substitute(value)),
               k = 2, ...)
}


multi_scale_ <- function(tbl, item1, item2, value, k = 2, ...) {
  tbl_matrix <- tbl %>%
    spread(item2, col_name(value), fill = 0) %>%
    remove_rownames() %>%
    column_to_rownames("item1") %>%
    as.matrix()
  
  cmdscale(tbl_matrix, k = k) %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    as.tibble()
}

austen_dist %>%
  multi_scale(item1, item2, distance)
#> Warning: Setting row names on a tibble is deprecated.
#> # A tibble: 6 x 3
#>   item                     V1      V2
#>   <chr>                 <dbl>   <dbl>
#> 1 Sense & Sensibility   211.    302. 
#> 2 Pride & Prejudice      36.7   -64.8
#> 3 Mansfield Park      -3942.   1325. 
#> 4 Emma                -3513.  -1525. 
#> 5 Northanger Abbey     3924.   -115. 
#> 6 Persuasion           3284.     77.5

This can be used for something like this:

austen_dist %>%
  multi_scale(item1, item2, distance) %>%
  ggplot(aes(V1, V2, color = item)) +
  geom_point(size = 3, alpha = 0.8) + 
  scale_y_continuous(limits = c(-4000, 4000)) + 
  theme_bw()

multi_scale

Now, this method returns a warning because of column_to_rownames() that I would like to get rid of. I'm searching for one but wanted to know your thoughts before I submit a PR.

@kanishkamisra
Copy link
Contributor Author

Adding a as.data.frame() before converting column to row names solves the warning problem!

@kanishkamisra
Copy link
Contributor Author

Another problem is that if the distance tbl is formed using upper = FALSE, this will return wrong results..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant