As a User, I can stream reports in a .csv format #68

Open
gabrielleberanger opened this issue Dec 16, 2020 · 1 comment
Assignees
Labels
new feature Creating a new feature P2 2nd priority

Comments

@gabrielleberanger
Contributor

gabrielleberanger commented Dec 16, 2020

WHY
Today, the only output stream format available is .njson (i.e. a file with n lines, each line being a dictionary).
This format has two downsides:

  • It does not allow us to easily conduct preliminary analysis on the output data: .njson files cannot be directly forwarded to non-tech users, and cannot be put into a pandas DataFrame without undergoing preliminary transformations.
  • Some APIs natively return data in a .csv format: in these cases, we have to convert each line to a dictionary, which can cause parsing errors.
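To illustrate the second point, here is a minimal sketch of how converting a CSV line to a dictionary can go wrong. The payload is made up for the example and does not come from a real API; it only shows that naive comma splitting breaks on quoted fields, while the standard library's csv.DictReader handles them.

```python
import csv
import io

# Hypothetical CSV payload, as an API might return it (invented for this example).
raw = 'campaign,impressions\n"Summer, 2020",1200\n'

header, line = raw.splitlines()

# Naive splitting breaks on the comma inside the quoted field:
# the values no longer line up with the header.
naive = dict(zip(header.split(","), line.split(",")))

# csv.DictReader honours the quoting and yields one dictionary per row.
parsed = list(csv.DictReader(io.StringIO(raw)))

print(naive)
print(parsed)
```

A native .csv streamer would avoid this round trip through dictionaries altogether.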

HOW
Create a .csv streamer.
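
As a rough idea of what such a streamer could look like, here is a minimal sketch using only the Python standard library. The function name stream_csv is hypothetical and not part of the existing code base, and the sketch assumes every record has the same keys as the first one.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def stream_csv(records: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Yield CSV text chunk by chunk (hypothetical helper, not the project's API).

    The first chunk holds the header and the first row; each later chunk
    holds one row. Assumes all records share the keys of the first record.
    """
    buffer = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            # Column order is taken from the first record.
            writer = csv.DictWriter(
                buffer, fieldnames=list(record), lineterminator="\n"
            )
            writer.writeheader()
        writer.writerow(record)
        yield buffer.getvalue()
        # Reset the buffer so each chunk contains only new lines.
        buffer.seek(0)
        buffer.truncate(0)


rows = [{"a": 1, "b": "x,y"}, {"a": 2, "b": "z"}]
output = "".join(stream_csv(rows))
print(output)
```

Because the rows are produced lazily, the whole report never has to sit in memory, which matches how the .njson output is streamed line by line.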

@gabrielleberanger gabrielleberanger added the new feature Creating a new feature label Dec 16, 2020
@gabrielleberanger gabrielleberanger added the P1 1st priority label Dec 16, 2020
@benoitgoujon benoitgoujon self-assigned this Jan 8, 2021
@benoitgoujon
Contributor

Hi there,

I've started working on this issue and noticed that we may run into a problem with the current software architecture.

Currently, the format of the destination file is enforced: we get a .njson file by default. There is a Pickle option, but it is never used in the code. If we want to introduce a new format like CSV, we must let users decide which format they prefer. The intuitive place for this would be an option in the writer command, something like write_gcs --gcs-file-format csv.

BUT, to do so, we need to change the stream we use (CSVStream vs JSONStream), and this choice must be made in the read() function of the reader. That would force us to add the file format as an option of the reader, something like read_dv360 --dv360-file-format csv, which is less intuitive than a writer option because it mixes up the reader and writer options.
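
To make the coupling concrete, here is a simplified sketch. The class bodies, the STREAMS mapping and the file_format parameter are assumptions for illustration only; they are not the actual implementation of JSONStream, CSVStream or read() in the code base.

```python
import csv
import io
import json


class JSONStream:
    """Simplified stand-in: encodes one record per .njson line."""

    extension = "njson"

    @staticmethod
    def encode(record):
        return json.dumps(record) + "\n"


class CSVStream:
    """Simplified stand-in: encodes one record per .csv line (no header)."""

    extension = "csv"

    @staticmethod
    def encode(record):
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerow(record.values())
        return buffer.getvalue()


# Hypothetical mapping from a format option to a stream class.
STREAMS = {"njson": JSONStream, "csv": CSVStream}


def read(records, file_format="njson"):
    """The reader has to know the format, because it picks the stream."""
    stream_cls = STREAMS[file_format]
    for record in records:
        yield stream_cls.encode(record)


csv_lines = list(read([{"a": 1, "b": "x"}], file_format="csv"))
json_lines = list(read([{"a": 1}]))
print(csv_lines, json_lines)
```

In this sketch the format is decided before the writer ever sees the data, which is why the option would end up on the reader side.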

Is it acceptable though?

What is your opinion regarding this issue?

@gabrielleberanger gabrielleberanger added P2 2nd priority and removed P1 1st priority labels Jan 25, 2021