-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet outputstream rw #3
base: master
Are you sure you want to change the base?
Conversation
@@ -0,0 +1,36 @@ | |||
package com.github.mjakubowski84.parquet4s; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do these need to be in Java?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they don't... will move to pure scala.
@@ -108,6 +155,18 @@ object ParquetWriter { | |||
writer.close() | |||
} | |||
} | |||
|
|||
override def writeS(outputFile: OutputFile, data: Iterable[T], options: Options): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of writeS
I would create a type class for Sinks e.g.:
def write[T: ParquetWriter, S: Sink](output: S, data: Iterable[T], options = options: ParquetWriter.Options = ParquetWriter.Options())
Then I would provide Sink
implementations for both File
and OutputStream
* @param data data to write | ||
* @param options configuration of how Parquet files should be created and written | ||
*/ | ||
def writeS(outputFile: OutputFile, data: Iterable[T], options: ParquetWriter.Options): Unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a util to write to just java.io.File
. OutputFile
interface should be private.
private val logger = LoggerFactory.getLogger(ParquetWriter.this.getClass) | ||
|
||
if (logger.isDebugEnabled) { | ||
logger.debug(s"""Resolved following schema to write Parquet to STream":\n$schema""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the if check - this is macro-ed away at runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is base slf4j-api Logger
so unfortunately, it isn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great first cut - could use a more detailed unit test but otherwise nice addition.
ParquetWriter.writeS(outfile,data) | ||
outputStream.close() | ||
println(outputStream) | ||
outputStream.size() > 1 should equal(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improve test by
- creating seq of generated content (there are some utils in the other tests)
- writing generated content to parquet file using existing writer
- writing generated content to parquet stream using new writer
- compare content is the same (turn stream into file, read file into stream, etc)
val outfile = new StreamOutputFile(outputStream) | ||
ParquetWriter.writeS(outfile, testCase.data)(testCase.writer) | ||
outputStream.close() | ||
outputStream.size() > 1 should equal(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to see a test here more rigorous than, stream is not empty
* @param options configuration of how Parquet files should be read | ||
* @return iterable collection of data read from path | ||
*/ | ||
def readS(inputStream: InputFile, options: ParquetReader.Options): ParquetIterable[T] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would prefer readStream
as less ambiguous
* @param data data to write | ||
* @param options configuration of how Parquet files should be created and written | ||
*/ | ||
def writeS(outputFile: OutputFile, data: Iterable[T], options: ParquetWriter.Options): Unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly, writeStream
Updates library with the ability to write parquet files to an outputstream vis
writeS
method. Previously only file writes were available.