
dfs-datastores should use Hadoop's serialization mechanism #3

Open
sritchie opened this issue Dec 29, 2011 · 4 comments

@sritchie
Collaborator

Rather than requiring custom serialization methods in the PailStructure, dfs-datastores should provide a way to hook into Hadoop's io.serializations system. It's much easier to write pail structures that implement Serializable if they don't have to carry around a serializer. This also allows for mixed-type pails.
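For reference, serializing through Hadoop's pluggable mechanism looks roughly like the sketch below -- this is plain Hadoop API, not dfs-datastores code, and the helper class and method names are made up for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class IoSerializationsSketch {
    // Serialize obj using whatever Serialization registered in io.serializations
    // accepts its class, instead of a serialize() hard-wired into PailStructure.
    public static <T> byte[] toBytes(Configuration conf, Class<T> klass, T obj)
            throws IOException {
        SerializationFactory factory = new SerializationFactory(conf);
        Serializer<T> serializer = factory.getSerializer(klass);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.open(out);
        serializer.serialize(obj);
        serializer.close();
        return out.toByteArray();
    }
}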

@nathanmarz
Owner

I don't agree with this. The advantage of tying serialization to Pails is that they enforce the types within them. When I read data from a pail, I'm confident in exactly what I'll be getting out (this is also why mixed-type pails are a non-goal).

Additionally, pail should be usable outside of a M/R context, which means it must know its own serializers.

The problem we have now is that to use a pail within a M/R job, we need to implement the serializer for the pail as well as for io.serializations. Rather than have pail use io.serializations, I think it makes more sense to have M/R jobs that use pail automatically add pail's serializers into io.serializations.
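A rough sketch of that last suggestion -- a job-setup hook that appends the pail's serializer to io.serializations before the job is submitted (the helper class and method here are hypothetical, not existing dfs-datastores API):

import org.apache.hadoop.mapred.JobConf;

public class PailJobSetup {
    // Append serializationClass to io.serializations if it isn't already listed,
    // so the M/R framework can (de)serialize the pail's records.
    public static void addSerialization(JobConf conf, String serializationClass) {
        String existing = conf.get("io.serializations",
                "org.apache.hadoop.io.serializer.WritableSerialization");
        if (!existing.contains(serializationClass)) {
            conf.set("io.serializations", existing + "," + serializationClass);
        }
    }
}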

@sritchie
Collaborator Author

I think that makes a lot of sense. What I meant is that it would be advantageous if Pail could serialize objects using Hadoop's Serialization interface, rather than specifying serialize and deserialize in PailStructure. PailStructure could replace serialize and deserialize with getSerializations; the pail would then enforce the order of the supplied serializations. We could also include a method to prepare the JobConf by appending these to io.serializations, as you suggested.

I agree that a huge benefit of Pail is the ability to enforce the internal structure of its data, but I'm not convinced that type is the way to do this. For example, if we replace hfs-seqfile with a Clojure data pail, we'll probably specify the type as clojure.lang.PersistentVector, or even just java.util.List, so that we can store tuples with multiple fields. This ends up enforcing nothing about the internal data.

Additionally, Pail doesn't use the type information to actually enforce anything. (You can dump strings into an Integer pail, for example.) The type's just a check between two pails before an absorb or copy. This isn't good enough for safety, since two PailStructures with the same name and type but different getTarget implementations will corrupt each other upon combination.

What do you think of:

public interface PailStructure<T> extends Serializable {
    public boolean isValidTarget(String... dirs);
    public String getSerialization();
    public List<String> getTarget(T object);
    public String structureID();
}

structureID would return a user-specified identifier used to compare two pails. This would allow pails to be dynamically generated while still enforcing their structure on copy or absorb calls.

pail.meta would then contain:

--- 
format: SequenceFile
serialization: cascading.kryo.KryoSerialization
structure_id: forma.hadoop.pail.DataChunkPailStructure
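To make that concrete, here is a hypothetical implementation of the proposed interface that would produce the pail.meta above -- DataChunk stands in for whatever record type the pail holds, and none of this is existing dfs-datastores code:

import java.util.Collections;
import java.util.List;

public class DataChunkPailStructure implements PailStructure<DataChunk> {
    public boolean isValidTarget(String... dirs) {
        // A real structure would check dirs against its partitioning scheme.
        return true;
    }

    public String getSerialization() {
        return "cascading.kryo.KryoSerialization";
    }

    public List<String> getTarget(DataChunk object) {
        // No vertical partitioning in this sketch.
        return Collections.emptyList();
    }

    public String structureID() {
        return "forma.hadoop.pail.DataChunkPailStructure";
    }
}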

@sritchie
Collaborator Author

sritchie commented Jan 7, 2012

Update -- we're thinking:

public interface PailStructure<T> extends Serializable {
    public boolean isValidTarget(String... dirs);
    public String getSerializations();
    public List<String> getTarget(T object);
    public String getType();
    public Map<String, String> getMetadata();
}
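Relative to the earlier DataChunkPailStructure sketch, the new methods would be getType and getMetadata, roughly like this (the return values are purely illustrative):

    public String getType() {
        // Fully-qualified class name of the record type, recorded in pail.meta.
        return DataChunk.class.getName();
    }

    public Map<String, String> getMetadata() {
        // Structure-specific key/value pairs written into pail.meta alongside
        // format and serialization.
        return Collections.emptyMap();
    }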

@sritchie
Collaborator Author

Pretty clear that Bijection would be a good thing here.

jschwellach added a commit to jschwellach/dfs-datastores that referenced this issue Jan 9, 2017
new version 2.0.0 with Spark support and improved file handling for AWS S3 file systems