dfs-datastores should use Hadoop's serialization mechanism #3
Rather than requiring custom serialization methods in the PailStructure, dfs-datastores should provide a way to hook into Hadoop's `io.serializations` system. It's much easier to write pail structures that implement Serializable if they don't have to carry around a serializer. This also allows for mixed-type pails.
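To make "hook into `io.serializations`" concrete: Hadoop's pluggable serialization lives behind the `org.apache.hadoop.io.serializer.Serialization` interface. A toy plugin, assuming nothing about dfs-datastores itself, might delegate to plain `java.io` object streams for any Serializable class (Hadoop ships a similar `JavaSerialization`):

```java
import java.io.*;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.Serialization;
import org.apache.hadoop.io.serializer.Serializer;

// Toy Serialization plugin: accepts any Serializable class and delegates
// to java.io object streams. Illustrative only, not dfs-datastores code.
public class ToyJavaSerialization implements Serialization<Serializable> {
    public boolean accept(Class<?> c) {
        return Serializable.class.isAssignableFrom(c);
    }

    public Serializer<Serializable> getSerializer(Class<Serializable> c) {
        return new Serializer<Serializable>() {
            private ObjectOutputStream out;
            public void open(OutputStream os) throws IOException {
                out = new ObjectOutputStream(os);
            }
            public void serialize(Serializable t) throws IOException {
                out.writeObject(t);
            }
            public void close() throws IOException {
                out.close();
            }
        };
    }

    public Deserializer<Serializable> getDeserializer(Class<Serializable> c) {
        return new Deserializer<Serializable>() {
            private ObjectInputStream in;
            public void open(InputStream is) throws IOException {
                in = new ObjectInputStream(is);
            }
            public Serializable deserialize(Serializable t) throws IOException {
                try {
                    return (Serializable) in.readObject();
                } catch (ClassNotFoundException e) {
                    throw new IOException(e);
                }
            }
            public void close() throws IOException {
                in.close();
            }
        };
    }
}
```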
I don't agree with this. The advantage of tying serialization to Pails is that they enforce the types within them. When I read data from a pail, I'm confident in exactly what I'll be getting out (this is also why mixed-type pails are a non-goal). Additionally, pail should be usable outside of an M/R context, which means it must know its own serializers. The problem we have now is that to use a pail within an M/R job, we need to implement the serializer for the pail as well as for `io.serializations`. Rather than have pail use `io.serializations`, I think it makes more sense to have M/R jobs that use pail automatically add pail's serializers into `io.serializations`.
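As a rough sketch of that last suggestion, a job-setup helper could prepend the pail's serializers to the `io.serializations` property before submission. The helper name is invented; the property is Hadoop's real serialization registry:

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical setup hook: M/R jobs that use a pail would call this once
// on their Configuration so Hadoop's SerializationFactory can find the
// pail's serializers.
public class PailJobSetup {
    public static void addPailSerializations(Configuration conf, Class<?>... serializations) {
        StringBuilder sb = new StringBuilder();
        for (Class<?> c : serializations) {
            if (sb.length() > 0) sb.append(",");
            sb.append(c.getName());
        }
        String existing = conf.get("io.serializations", "");
        if (!existing.isEmpty()) {
            sb.append(",").append(existing);
        }
        conf.set("io.serializations", sb.toString());
    }
}
```

A job would call, say, `PailJobSetup.addPailSerializations(conf, ToyJavaSerialization.class)` before submitting.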
I think that makes a lot of sense. What I meant is that it would be advantageous if Pail could serialize objects using Hadoop's Serialization interface, rather than specifying custom serialization methods on the PailStructure itself.

I agree that a huge benefit of Pail is the ability to enforce the internal structure of its data, but I'm not convinced that type is the way to do this. For example, if we swap out the serialization a pail uses, its type stays the same even though the bytes on disk are no longer compatible. Additionally, Pail doesn't use the type information to actually enforce anything. (You can dump strings into an Integer pail, for example.) The type's just a check between two pails before an absorb or copy. This isn't good enough for safety, since two PailStructures with the same name and type but different serialization logic would pass that check.

What do you think of:

```java
public interface PailStructure<T> extends Serializable {
    public boolean isValidTarget(String... dirs);
    public String getSerialization();
    public List<String> getTarget(T object);
    public String structureID();
}
```
The pail's metadata file might then look like:

```yaml
---
format: SequenceFile
serialization: cascading.kryo.KryoSerialization
structure_id: forma.hadoop.pail.DataChunkPailStructure
```
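For a sense of how that interface would feel in practice, here is a hypothetical structure (the class and its one-character partitioning scheme are invented for illustration) that pairs with the metadata example above:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical PailStructure under the proposed interface: partitions
// string records by first character, and names both its serialization and
// a stable structure id for pre-absorb/copy compatibility checks.
public class StringPailStructure implements PailStructure<String> {
    public boolean isValidTarget(String... dirs) {
        // expect exactly one directory level holding a single character
        return dirs.length == 1 && dirs[0].length() == 1;
    }
    public String getSerialization() {
        return "cascading.kryo.KryoSerialization";
    }
    public List<String> getTarget(String object) {
        return Arrays.asList(object.isEmpty() ? "_" : object.substring(0, 1));
    }
    public String structureID() {
        return getClass().getName();
    }
}
```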
Update -- we're thinking:

```java
public interface PailStructure<T> extends Serializable {
    public boolean isValidTarget(String... dirs);
    public String getSerializations();
    public List<String> getTarget(T object);
    public String getType();
    public Map<String, String> getMetadata();
}
```
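Presumably `getMetadata()` is what would feed the pail's metadata file; a sketch of one plausible implementation, with key names mirroring the earlier example rather than any confirmed format:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: metadata mirroring the earlier pail.meta example. Keys and
// values are illustrative, not a confirmed dfs-datastores format.
public abstract class MetadataSketch implements PailStructure<String> {
    public Map<String, String> getMetadata() {
        Map<String, String> meta = new HashMap<String, String>();
        meta.put("format", "SequenceFile");
        meta.put("serialization", "cascading.kryo.KryoSerialization");
        meta.put("structure_id", getClass().getName());
        return meta;
    }
}
```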
Pretty clear that Bijection would be a good thing here.
New version 2.0.0 with Spark support and improved file handling for AWS S3 file systems.