
parquet outputstream rw #3

Open · wants to merge 3 commits into master

Conversation

vdat-coatue commented:

Updates the library with the ability to write Parquet files to an OutputStream via the new writeS method. Previously only file writes were available.
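A minimal usage sketch of the new API (illustrative only; Record is a hypothetical case class, and StreamOutputFile is assumed to wrap a java.io.OutputStream as it does in the tests further down):

import java.io.ByteArrayOutputStream
import com.github.mjakubowski84.parquet4s.{ParquetWriter, StreamOutputFile}

case class Record(id: Int, name: String) // hypothetical data type for illustration

val data = Seq(Record(1, "a"), Record(2, "b"))
val outputStream = new ByteArrayOutputStream()
val outputFile = new StreamOutputFile(outputStream) // adapts the stream to Parquet's OutputFile
ParquetWriter.writeS(outputFile, data)              // the new stream-based write
outputStream.close()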

@@ -0,0 +1,36 @@
package com.github.mjakubowski84.parquet4s;
Collaborator:

Why do these need to be in Java?

Author:

they don't... will move to pure Scala.

@@ -108,6 +155,18 @@ object ParquetWriter {
writer.close()
}
}

override def writeS(outputFile: OutputFile, data: Iterable[T], options: Options): Unit = {
Collaborator:

instead of writeS I would create a type class for Sinks, e.g.:

def write[T: ParquetWriter, S: Sink](output: S, data: Iterable[T], options: ParquetWriter.Options = ParquetWriter.Options())

Then I would provide Sink implementations for both File and OutputStream
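A rough sketch of what that type class could look like (an assumption, not code from this PR; the Sink trait, its instances, and the HadoopOutputFile adaptation are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.util.HadoopOutputFile
import org.apache.parquet.io.OutputFile

// hypothetical type class: anything that can be turned into a Parquet OutputFile
trait Sink[S] {
  def toOutputFile(sink: S): OutputFile
}

object Sink {
  // a java.io.File is adapted via parquet-mr's HadoopOutputFile
  implicit val fileSink: Sink[java.io.File] =
    (file: java.io.File) => HadoopOutputFile.fromPath(new Path(file.getPath), new Configuration())

  // a java.io.OutputStream is wrapped by the StreamOutputFile added in this PR
  implicit val streamSink: Sink[java.io.OutputStream] =
    (stream: java.io.OutputStream) => new StreamOutputFile(stream)
}

def write[T: ParquetWriter, S: Sink](output: S, data: Iterable[T],
                                     options: ParquetWriter.Options = ParquetWriter.Options()): Unit =
  implicitly[ParquetWriter[T]].writeS(implicitly[Sink[S]].toOutputFile(output), data, options)

With something like that in place, File and OutputStream targets would share a single write entry point.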

* @param data data to write
* @param options configuration of how Parquet files should be created and written
*/
def writeS(outputFile: OutputFile, data: Iterable[T], options: ParquetWriter.Options): Unit
Collaborator:

There should be a util to write to just java.io.File. OutputFile interface should be private.
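One way such a util might look (a hypothetical sketch, not part of this PR): a File-based overload that builds the OutputFile internally, so the interface never leaks to callers.

// hypothetical convenience method; callers pass a java.io.File, never an OutputFile
def writeFile[T: ParquetWriter](file: java.io.File, data: Iterable[T],
                                options: ParquetWriter.Options = ParquetWriter.Options()): Unit = {
  val outputFile = org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(
    new org.apache.hadoop.fs.Path(file.getPath), new org.apache.hadoop.conf.Configuration())
  implicitly[ParquetWriter[T]].writeS(outputFile, data, options)
}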

private val logger = LoggerFactory.getLogger(ParquetWriter.this.getClass)

if (logger.isDebugEnabled) {
logger.debug(s"""Resolved following schema to write Parquet to STream":\n$schema""")
Collaborator:

Remove the if check - this is macro-ed away at runtime

Reply:

This is the base slf4j-api Logger, so unfortunately it isn't
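For context, a sketch of the distinction being discussed (scala-logging is mentioned only for comparison; it is not a dependency here):

import org.slf4j.LoggerFactory

val logger = LoggerFactory.getLogger(getClass)
val schema = "..." // placeholder for the resolved Parquet schema

// plain slf4j-api: the interpolated message is built eagerly unless guarded explicitly
if (logger.isDebugEnabled) {
  logger.debug(s"Resolved the following schema to write Parquet to stream:\n$schema")
}

// scala-logging's Logger expands logger.debug(...) through a macro into an
// equivalent isDebugEnabled guard, which is why the explicit if would be redundant there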

@rtoomey-coatue left a comment:

Great first cut - could use a more detailed unit test but otherwise nice addition.

ParquetWriter.writeS(outfile, data)
outputStream.close()
println(outputStream)
outputStream.size() > 1 should equal(true)


improve the test by:

  1. creating a seq of generated content (there are some utils in the other tests)
  2. writing the generated content to a parquet file using the existing writer
  3. writing the generated content to a parquet stream using the new writer
  4. comparing that the content is the same (turn the stream into a file, read the file into a stream, etc.); see the sketch below
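A rough sketch of such a round-trip test (ScalaTest matchers as in the existing specs; TestRow, the generated data, and the String-path write/read calls are assumptions standing in for the repo's real test utilities):

import java.io.ByteArrayOutputStream
import java.nio.file.Files

case class TestRow(i: Int, text: String) // stand-in for generated test content

val generated = (1 to 100).map(i => TestRow(i, s"record-$i"))

// 1 & 2: write the generated content to a file with the existing writer
val fileTarget = Files.createTempDirectory("parquet4s").resolve("file.parquet").toString
ParquetWriter.write(fileTarget, generated)

// 3: write the same content to a stream with the new writer
val outputStream = new ByteArrayOutputStream()
ParquetWriter.writeS(new StreamOutputFile(outputStream), generated)
outputStream.close()

// 4: turn the stream into a file, read both back, and compare the content
val streamTarget = Files.createTempDirectory("parquet4s").resolve("stream.parquet")
Files.write(streamTarget, outputStream.toByteArray)
val fromFile   = ParquetReader.read[TestRow](fileTarget).toSeq
val fromStream = ParquetReader.read[TestRow](streamTarget.toString).toSeq
fromStream should contain theSameElementsInOrderAs fromFile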

val outfile = new StreamOutputFile(outputStream)
ParquetWriter.writeS(outfile, testCase.data)(testCase.writer)
outputStream.close()
outputStream.size() > 1 should equal(true)


would like to see a test here more rigorous than "stream is not empty"

* @param options configuration of how Parquet files should be read
* @return iterable collection of data read from path
*/
def readS(inputStream: InputFile, options: ParquetReader.Options): ParquetIterable[T]


would prefer readStream as less ambiguous

* @param data data to write
* @param options configuration of how Parquet files should be created and written
*/
def writeS(outputFile: OutputFile, data: Iterable[T], options: ParquetWriter.Options): Unit


similarly, writeStream
