Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple keys in bigdiffy #138

Merged

Conversation

martinbomio
Copy link
Contributor

Addresses #137.
There were two options that I thought about:

  1. Join all keys and to get a single String in *KeyFn functions.
  2. Use the list of string representation of the keys as key to the join.

Decided to go with (1) since we will need to join the keys anyways to write the results.

@martinbomio martinbomio force-pushed the support-multiple-keys-biggdiffy branch 3 times, most recently from 401dab3 to 5669ac1 Compare November 6, 2018 18:07
@tailrec
def get(xs: Array[String], i: Int, r: GenericRecord): String =
if (i == xs.length - 1) {
r.get(xs(i)).toString
} else {
get(xs, i + 1, r.get(xs(i)).asInstanceOf[GenericRecord])
}
val xs = key.split('.')
(r: GenericRecord) => get(xs, 0, r)
(r: GenericRecord) => keys.map { k => get(k.split('.'), 0, r) }.mkString("_")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what if we change this to a Set[String] instead? My concern is this can produce some weird edge cases. They will probably be rare but it's not really transparent to end user.

e.g.
key1_something, key2 vs key1, something_key2 produces the same aggregate key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only suggesting Set since we use that in sampler but open to other ideas

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be more confusing is the results don't respect the user ordering of the keys?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some dataset that preserves order instead. Up to you. Doesn't block this PR imo but would be nice to have fixed :)

@codecov
Copy link

codecov bot commented Nov 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@9fca065). Click here to learn what that means.
The diff coverage is 61.53%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master     #138   +/-   ##
=========================================
  Coverage          ?   70.93%           
=========================================
  Files             ?       37           
  Lines             ?     1421           
  Branches          ?      170           
=========================================
  Hits              ?     1008           
  Misses            ?      413           
  Partials          ?        0
Flag Coverage Δ
#ratatoolCli 4.48% <0%> (?)
#ratatoolDiffy 28.48% <54.54%> (?)
#ratatoolExamples 19.1% <30.76%> (?)
#ratatoolSampling 63.06% <ø> (?)
#ratatoolScalacheck 83.48% <ø> (?)
#ratatoolShapeless 5.42% <0%> (?)
Impacted Files Coverage Δ
...om/spotify/ratatool/shapeless/CaseClassDiffy.scala 70.45% <ø> (ø)
...atool/examples/diffy/ProtobufBigDiffyExample.scala 0% <0%> (ø)
...y/ratatool/examples/diffy/PreProcessBigDiffy.scala 100% <100%> (ø)
...in/scala/com/spotify/ratatool/diffy/BigDiffy.scala 53.33% <63.63%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9fca065...875e038. Read the comment docs.

@idreeskhan
Copy link
Contributor

Can you also add a test that runs with multiple keys?
Also would be good to update the docs with the new format.

@idreeskhan
Copy link
Contributor

Sorry, was travelling for a bit. Can you resolve conflicts? and then we can try to get this merged

@martinbomio martinbomio force-pushed the support-multiple-keys-biggdiffy branch from c0752b6 to 875e038 Compare December 12, 2018 20:11
@martinbomio
Copy link
Contributor Author

@idreeskhan PTAL, rebased master and added multikey case class

@idreeskhan
Copy link
Contributor

LGTM, if you have a chance could you do some quick and dirty benchmarking to see the performance impact of boxing? If not we can probably just create an open issue for it and revisit if it crops up.

@martinbomio
Copy link
Contributor Author

@idreeskhan I created #141

@idreeskhan
Copy link
Contributor

👍

@idreeskhan idreeskhan merged commit 3c54cfa into spotify:master Jan 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants