This repository has been archived by the owner on Jan 28, 2023. It is now read-only.

Consider adding option to specify custom NA values while reading a CSV file by passing a list/array of strings #120

Open
harshit3610 opened this issue Apr 12, 2021 · 4 comments

Comments

@harshit3610

According to the API docs, the CSVFormat class from Apache Commons CSV is used to set a custom null value when reading files. I am new to krangl, so I may not know of any simple workarounds. Would it be possible to add an argument similar to na_values in pandas to make the read operation a little simpler?

@holgerbrandl
Owner

What about

DataFrame.readCSV(File("foo.csv"), CSVFormat.DEFAULT.withNullString("MISSING"))

?

Technically a dedicated argument could be added, but I'm not sure if this would bloat the method signatures in the long run.

A known limitation of the underlying apache commons API is that you can only provide a single null string and not a collection.

@harshit3610
Author

Is there a way to remove Apache Commons CSV as a requirement? Could we provide a mechanism that replaces all occurrences of a custom null string with an "NA" value while reading the file? Accepting only a single null string will be a serious limitation in the long run and may hurt adoption of the library by enthusiasts. Are there any technical reasons why Apache Commons CSV must be used?

@holgerbrandl
Owner

Why would we want to replace apache-commons-csv? What would be a better alternative?

I've chosen apache-commons-csv initially here because I could not find any better alternatives.

I see the point that having just a single NA string is limiting, but I don't think it's a major problem.

@harshit3610
Author

In the pandas API, read_csv accepts multiple custom NA values like this:

import pandas as pd
pd.read_csv("data.txt", na_values=["na", "Not available", "", "-"])

Many data sets are messy and use several different strings to mark missing data. I was hoping there would be a way to add such an argument (na_values) to krangl's read functions, accepting an array or a list of strings, similar to how pandas does it. Adding Apache Commons CSV as a dependency only to get a single NA string option is too much effort, in my opinion.
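
A minimal sketch of the requested behavior, using only the Kotlin standard library. The helper name replaceNaValues and its parameters are hypothetical and are not part of krangl or Apache Commons CSV. It assumes the cells have already been parsed into strings and says nothing about how this would be wired into krangl's readers:

```kotlin
// Hypothetical helper, not part of krangl or Apache Commons CSV.
// Every cell that exactly matches one of the NA markers becomes null;
// all other cells are passed through unchanged.
fun replaceNaValues(
    rows: List<List<String>>,
    naValues: Set<String>
): List<List<String?>> =
    rows.map { row ->
        row.map { cell -> if (cell in naValues) null else cell }
    }

fun main() {
    // Already-parsed rows, e.g. the output of any CSV parser.
    val rows = listOf(
        listOf("alice", "42", "na"),
        listOf("bob", "-", "Not available")
    )
    val cleaned = replaceNaValues(rows, setOf("na", "Not available", "", "-"))
    println(cleaned) // prints [[alice, 42, null], [bob, null, null]]
}
```

Matching is exact and case-sensitive here, so "NA" would not be caught by "na". Whether markers should be trimmed or compared case-insensitively is a design choice left open.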
