Consider adding option to specify custom NA values while reading a CSV file by passing a list/array of strings #120

harshit3610 · 2021-04-12T19:26:11Z

As per the API docs, the CSVFormat class from Apache Commons is used to specify a custom null value when reading files. I am new to krangl, so I may not know of any simple workarounds. Is it possible to add a parameter similar to pandas' na_values to make the read operation a little simpler?

holgerbrandl · 2021-04-14T18:27:14Z

What about

DataFrame.readCSV(File("foo.csv"), CSVFormat.DEFAULT.withNullString("MISSING"))

?

Technically a dedicated argument could be added, but I'm not sure if this would bloat the method signatures in the long run.

A known limitation of the underlying apache commons API is that you can only provide a single null string and not a collection.

harshit3610 · 2021-04-15T09:45:54Z

Is there a way to remove Apache Commons as a requirement? Can we provide a mechanism to replace all occurrences of a custom null string with an "NA" value while reading the file? Accepting only a single null string will be a serious limitation in the long run and may affect the adoption of the library by enthusiasts. Are there any technical reasons why Apache Commons must be used?

holgerbrandl · 2021-04-21T20:45:17Z

Why would we want to replace apache-commons-csv? What would be a better alternative?

I initially chose apache-commons-csv here because I could not find any better alternatives.

I see the point that having just a single NA string is limiting, but I don't think it's a major problem.

harshit3610 · 2021-04-22T14:00:13Z

In the pandas API, the read_csv function accepts multiple custom NA values like this:

pd.read_csv("data.txt", na_values=["na", "Not available", "", "-"])

Many data sets are messy and use several different strings to mark missing data. I was hoping there would be a way to add such an argument (na_values) to krangl's read functions, accepting an array or list of strings, similar to how pandas does it. Pulling in Apache Commons as a dependency only to get a single NA string option seems like too much effort, in my opinion.
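
To make the request concrete, here is a rough sketch of the behaviour I have in mind, written in plain Kotlin with no krangl or Apache Commons calls. The function name normalizeNa and its parameters are hypothetical and not part of any existing API. It assumes the rows have already been parsed into string cells, and it simply maps every cell that exactly matches one of the NA strings to null:

```kotlin
// Hypothetical helper, NOT part of krangl or apache-commons-csv.
// Assumes each row has already been split into string cells by a CSV parser.
fun normalizeNa(rows: List<List<String>>, naValues: Set<String>): List<List<String?>> =
    rows.map { row ->
        // Exact, case-sensitive match against the NA strings; everything else is kept as-is.
        row.map { cell -> if (cell in naValues) null else cell }
    }

fun main() {
    // Same NA strings as in the pandas example above.
    val naValues = setOf("na", "Not available", "", "-")

    // Made-up sample rows, for illustration only.
    val rows = listOf(
        listOf("alice", "42", "-"),
        listOf("bob", "na", "Not available"),
    )

    normalizeNa(rows, naValues).forEach(::println)
}
```

Matching here is exact and case-sensitive, so "NA" and " na" would not be treated as missing unless they are added to the set. Whether something like this belongs in the reader itself or in a post-processing step is a design question I can't judge.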
