Skip to content
This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

Rework API to refer to current row with it #122

Open
LeandroC89 opened this issue May 26, 2021 · 5 comments
Open

Rework API to refer to current row with it #122

LeandroC89 opened this issue May 26, 2021 · 5 comments
Assignees

Comments

@LeandroC89
Copy link
Contributor

LeandroC89 commented May 26, 2021

Hello,
maybe I'm just missing something but is there a way to reference current row?

For example during addColumn, is there a way to know which row is being populated at the time?

This would be useful for running sum like operations, and maybe to another problem I'm not being able to solve, which would be "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?", I know this is an entirely new question by itself but knowing what row I'm populating would allow me to solve it via workaround.

Thank you in advance!

@holgerbrandl
Copy link
Owner

holgerbrandl commented May 28, 2021 via email

@LeandroC89
Copy link
Contributor Author

Thank you, I knew the addRowNumber method but never thought about the variable.

It's not a direct solution (I may not be looking at this the right way) but since it is still an array and not a value per se, it can't easily be used.

Example would be
data.addColumn("Row"){ rowNumber.toString() }
would result in something like [1, 2, ...] (Just an example and not an useful test case)

If I wanted to retrieve one value from the previous row, I can't directly use rowNumber as it is an array.

The way I managed to do it was using the map() function

data.addColumn("Previous Value"){ rowNumber.map{entry: Int -> if (entry < 2 ) 0 else data["Value"][entry - 2] } }
(It would also have worked with any column with unique values (and using filter) now that I think of it)

I'm leaving this open in case you'd like to add something to discussion (as like I said I may not be looking at this the right way). but please feel free to mark it solved as I was now able to solve my problem (using map() on the rowNumber).

Thank you for your help.

@holgerbrandl
Copy link
Owner

Regarding "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?":
It's a pity that we can't implement vector valued operators in Kotlin. This would make the workflow so much more streamined (similar to R/python). Anyway, I guess you could solve this with

data.addColumn("AorB"){ it["A"].values().zip(it["B"].values()).map{ (a, b) -> a ?: b} }

Regarding your example, what about lead/lag (see https://github.com/holgerbrandl/krangl/blob/master/CHANGES.md#v013)

In general, I struggle myself with krangle's syntax very often, and don't find it intuitive in many situations. In particular when adding columns (which is the most frequent op imho). One way to overcome this would be to refer to just the row with it by default in addColumn and provide access to the entire data-frame via the lambda context. Similar to how filterByRow works

 df.filterByRow { (it["age"] as Int).rem(10) == 0 } // round birthdays :-)

// mock
// A or B if null
irisData.addColumn("equipment"){ it["A"] ?: it["B"]}
// mean subtraction
irisData.addColumn("equipment"){ it["A"] - df["A"].sum()!! }

Clearly, this would be a major breaking change, but I wouldn't mind anything that makes it more fun/convenient to use.

It would also come along with some new inconsistency in summarize where it must refer to the group/dataframe (or could be missing in favour of a df that comes via the extension-scope.

Any thoughts?

@LeandroC89
Copy link
Contributor Author

Regarding "How do we create a column that will copy the value from column A if it is not NA but copy it from column B if it is?" I had done it with a similar approach once I realized I could use map.

df = df.addColumn("NewColumn"){
    rowNumber.map{ entryNr: Int ->
            (df["A"][entryNr - 1]?: df["B"][entryNr - 1])
    }
}

Regarding the lead/lag, I was unaware of these functions and they would have been very useful in most of my use cases, I really need to spend more time reading the API. Thank you for pointing these out.

Regarding the it referring to the row, I feel like that would be more intuitive as I struggled a lot with simple string operations and ended up with the whole column contents on each cell of the new column, when I first started to explore krangl ( as I initially thought that it would be the current row being "processed" and not the whole dataframe ). I have used map ever since though, I just never thought of accessing previous row value using it before .

As much as I agree that it would make it more intuitive, I have no idea how that would affect what already works, like adding a column based on a formula or array. I will try to play with this a little once I have some time too.

@holgerbrandl
Copy link
Owner

thought that it would be the current row being "processed" and not the whole dataframe

I guess that would be the common intuition in kotlin. The reason why I designed the API to refer to the entire table is that this the way dplyr works in R.

@holgerbrandl holgerbrandl self-assigned this Jun 23, 2021
@holgerbrandl holgerbrandl changed the title Reference to current row Rework API to refer to current row with it Jun 23, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants