Skip to content
This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

Support replaceNA? #118

Open
alphaho opened this issue Mar 16, 2021 · 4 comments
Open

Support replaceNA? #118

alphaho opened this issue Mar 16, 2021 · 4 comments

Comments

@alphaho
Copy link

alphaho commented Mar 16, 2021

I've found it a bit counter-intuitive to replace NA with the value I want, as there's no API to call directly.

For now, I've used the below logic to do the job. May I know is this the proper way to do so? Do we need to add such API to krangl? (I'm happy to raise a PR if the below snippet is the officital way)

val myColumnName = "whatever-column-I-want-to-replace-NA"
df.addColumn(myColumnName to { it[myColumnName].values().map { it ?: 0 } }) 
@alphaho
Copy link
Author

alphaho commented Mar 16, 2021

Related to #110

@holgerbrandl
Copy link
Owner

Thanks for the suggestion. Indeed I find the current approach clumsy as well.

I was thinking about


fun DataFrame.replaceNA(value: Any,  columnSelect: ColumnSelector) =
    colSelectAsNames(columnSelect).fold(this) { df, column, ->
        df.addColumn(column) {
            it[column].values().map {
                it ?: value
            }
        }
    }

// note: colSelectAsNames is an internal method as of now

fun main() {
    val sales = dataFrameOf("product", "sales", "city")(
        "A", 32, "Dresden",
        null, 12.3, "New York",
        "A", 24, "Boston",
        "B", null, "Boston",
        "B", 44, null
    )

    sales
        .replaceNA("AAA") { listOf("city", "product") }
        .apply {
            print()
            schema()
        }
    sales
        .replaceNA("AAA") { listOf("city") }
        .apply {
            print()
            schema()
        }

    sales
        .replaceNA(0) { listOf("sales") }
        .apply {
            print()
            schema()
        }
}

But maybe that's too much as well, and a per column extension would be the most intuitive solution. For sure a PR would be great here. I'm happy to help as well, but would value your input first regarding an intuitive API design.

@alphaho
Copy link
Author

alphaho commented Apr 10, 2021

You are right. Calling sales.replaceNA("AAA") { listOf("city", "product") } may not be as intuitive as it would be in Pandas, which will be df['DataFrame Column'] = df['DataFrame Column'].fillna(0) .

I think at least there are two problems here:

  1. one is the ColumnSelector (e.g. { listOf("city", "product")}) seems a bit verbose to use. The first thought that comes to my mind is whether we can get rid of calling listOf, but only require vararg column names.

  2. the other is immutability puts some constraints in defining such APIs. That's what I believe to be the reason that the extension function is defined on DataFrame instead of a column. How about we have some API like:

fun DataFrame.replaceColumn(columnName: String, update: (DataCol) -> DataCol) {
    val df = this
    df.addColumn(columnName) {
        update(it[columnName])
    }
}

// would need some help to make this work:
fun DataCol.replaceNA(defaultValue: Any): DataCol = values().map { it?: defaultValue}

sales.replaceColumn("product") { 
  it.replaceNA("AAA")
}.replaceColumn("sales") {
  it.replaceNA(0)
}

Another kind of API may be like the following. But it seems a bit harder to set up, as I think we would need some kind of mutable column to collect all the mutations:

sales.replaceColumns("product", "sales") {  productColumn, salesColumn ->
  productColumn.replaceNA("AAA")
  salesColumn.replaceNA(0)
}

This is to mimic the style of buildList API or buildString which take a function to mutate MutableList / StringBuilder to build up the new immutable value (DataCol in the case of above API). I think this can be more friendly to new users, as it provides some familiarity to the users (using pandas) while keeping the DataFrame and DataCol immutable, even though it may have some performance drawbacks.

How do you think about these two APIs?

@holgerbrandl
Copy link
Owner

Sorry for the delay & and thanks for the proposal.

Rearding 1. It's not that simple to replace listOf with varargs, because there are multiple implementations of ColumnSelect and listOf is just one of them. See

fun ColNames.listOf(vararg someNames: String): List<Boolean?> = names.map { someNames.contains(it) }.falseAsNull()

Regarding 2. addColumn will already replace an existing column. However, I agree that the signature of replaceColumn may make it more intuitive. This does not affect immutability (which even in your example is still in place)

I don't think that we can express replaceColumns in kotlin unless we provide different implementations (one for each number of arguments).

With some weeks since my last comment, I actually like the idea of sales.replaceNA("AAA") { listOf("city", "product") } except for the method name which might be better replaceNaWith or replaceNaWithIn.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants