Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize writes in genomix data using nulls #57

Conversation

jakebiesinger
Copy link
Contributor

@anbangx @JavierJia @Nan-Zhang

This commit is based on and takes #56 further into Node-- only those values that are non-null and collections of length > 0 are saved into the output stream.

Outside of Node, the only visible difference is that there are no default values for averageCoverage (it's null by default and trying to use it without setting it first will give a NPE).

Inside of Node, anywhere a get() is called, we initialize the requested value if it's not already existing. There's also a fair bit of extra code involved in checking for null (curse you, Java) especially in comparison with raw types.

The first commit is kind of unrelated-- it removes EnumSet.allOf() where it can be replaced with much less overhead with enum.values().

@jakebiesinger
Copy link
Contributor Author

Any comments on this one? It's been open for 4 days now o_O

if (n.internalKmer != null && n.internalKmer.getKmerLetterLength() > 0) {
n.internalKmer.write(out);
}
if (n.averageCoverage != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this write() block consist with the setByCopy(), in terms of n.getActiveFields & NODE_FIELDS.XXXX ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they should be consistent. n.getActiveFields writes a single byte whose bits are defined in NODE_FIELDS and are 0 if the particular values are null or size() == 0. The setByCopy() function might be a little more permissive in that if size() == 0, it will create a new list of length 0 rather than being null. Shouldn't affect behavior but I could change it to be more consistent if we want.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to make it more consistent. Because there are several fields to write and read, then the order became very critical. If we want to change or add some fields in the future, it would be easier to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe I should be more clear:

  • setAsCopy(byte[] data, int offset) reads the very first byte from the byte[] just as write() writes it and readFields() reads it. There is no difference there and they should have the exact same effect. They should be completely consistent with NODE_FIELDS and getActiveFields().
  • setAsReference(byte[] data, int offset) is almost an exact copy of that function with the exception that VKmer instances (in the EdgeMaps and the internalKmer will be references rather than copies-- everything else must be a copy.
  • setAsCopy(Node other) is what I meant when I said they're not consistent-- for that function, when other has a list whose size() == 0, this.<list> could either be null or of size() == 0 as well. Right now, it's size() == 0, just like the Node that's being copied. But this really shouldn't affect anything practically.

@JavierJia
Copy link
Collaborator

LGTM 👍

@jakebiesinger
Copy link
Contributor Author

@anbangx the test cases in travis are currently failing because the pregelix input files are out of date. Sure would be nice to have that automatic generation tool you've been working on 😄.

@jakebiesinger
Copy link
Contributor Author

BTW we should have better performace after this is merged in, thanks to dc90e19. According to #20, the total time spent in creating and iterating over those EnumSet's was 10% of the total runtime (:exclamation:)

@Nan-Zhang
Copy link
Collaborator

members alway be null until we need it~ I think I can modify and add more 'null' cases in nodeTest to test some functions.
Why enum.values() has much less overhead comparing to EnumSet.allOf()?

@jakebiesinger
Copy link
Contributor Author

@Nan-Zhang yes, we could return a null in some cases instead of creating the object on the fly. The problem with that approach is when you want to iterate, you have to check for null values:

if (node.getEdgeMap(FF) != null)
    for (EdgeMap m : node.getEdgeMap(FF))
         // do something with m

which makes the interface ugly. So without changing the interface at all (we never return null, as we never did before), we can just generate the values on the fly when requested. Thus, algorithms that don't use certain values won't have them written.

@jakebiesinger
Copy link
Contributor Author

Why enum.values() has much less overhead comparing to EnumSet.allOf()?

.values() is just returning a static final array. EnumSet.allof has to create a new object and inspect the values of the enum.

@Nan-Zhang
Copy link
Collaborator

LGTM

On Tue, Nov 12, 2013 at 3:01 PM, Jake Biesinger [email protected]:

Why enum.values() has much less overhead comparing to EnumSet.allOf()?

.values() is just returning a static final array. EnumSet.allof has to
create a new object and inspect the values of the enum.

?
Reply to this email directly or view it on GitHubhttps://github.com//pull/57#issuecomment-28342829
.

jakebiesinger added a commit that referenced this pull request Nov 14, 2013
…mix-data-using-nulls

optimize writes in genomix data using nulls
@jakebiesinger jakebiesinger merged commit 4b32544 into wbiesing/pregelix-messages-have-null-values Nov 14, 2013
@jakebiesinger jakebiesinger deleted the wbiesing/optimize-writes-in-genomix-data-using-nulls branch November 14, 2013 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants