
Possible Future Work for CRADA Year Two or Otherwise


SimpleByteRange Index Entry Compression

The start time could be 6 bytes if we store days since a recent epoch year plus microseconds since the start of that day; that gives us a range of hundreds of years.

The end time could be 4 bytes if we store it as microseconds elapsed since the start time, assuming elapsed times are capped at about an hour; 5 bytes covers roughly 300 hours of elapsed time in microseconds.

Make sure we don't load the end time when a file is opened for read.
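As a rough illustration, a packed entry using those encodings might look something like this (field names and widths are hypothetical; the real SimpleByteRange layout differs):

```cpp
#include <stdint.h>

// Hypothetical packed index entry; names and widths are illustrative only.
#pragma pack(push, 1)
struct PackedByteRangeEntry {
    uint64_t logical_offset;   // offset in the logical file
    uint64_t physical_offset;  // offset in the data dropping
    uint64_t length;           // bytes written
    uint8_t  start_time[6];    // days since epoch year + microseconds into that day
    uint8_t  elapsed_us[4];    // end time as microseconds elapsed from start (~1 hour max)
};
#pragma pack(pop)
```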

Data Buffering

Try fopen/fwrite with buffer tuning. Why not add this as an option to fstest and run a write-size sweep on Cray Lustre to see if it fixes the shallow slope? Try it on Panasas on the Cray to see if it helps DVS, try it on Panasas on a cluster, and we could try it on GPFS as well to be complete. That way we know whether this is worth doing and whether we get write buffering nearly for free. It seems easy to add to fstest, and Alfred could run the write-size sweep with both write() and fwrite() to compare; a sketch of the buffering setup follows.
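As a rough sketch of what the fstest option could do, assuming we simply hand stdio a large user-supplied buffer via setvbuf (the buffer size would be the sweep parameter; the function and parameter names here are made up):

```cpp
#include <cstdio>
#include <vector>

// Buffered-write path: let stdio coalesce small application writes into
// physical writes of roughly buf_size bytes each.
void buffered_write_sweep(const char *path, size_t write_size,
                          size_t total_bytes, size_t buf_size) {
    FILE *fp = std::fopen(path, "w");
    if (!fp) { std::perror("fopen"); return; }

    std::vector<char> iobuf(buf_size);
    // _IOFBF = fully buffered; must be set before the first I/O on fp.
    std::setvbuf(fp, iobuf.data(), _IOFBF, iobuf.size());

    std::vector<char> payload(write_size, 'x');
    for (size_t written = 0; written < total_bytes; written += write_size)
        std::fwrite(payload.data(), 1, write_size, fp);

    std::fclose(fp);  // flushes the stdio buffer
}
```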

If this pans out, we can consider adding an IOStore::Write() method in the parent class that wraps IOHandle::write, making IOHandle::write private with IOStore as a friend so that only IOStore can call it. Then we can add buffering in IOStore::Write() and all IOStores benefit from it. We'd need each IOHandle to provide an optional buffer which the IOStore::Write() wrapper could use when available; a sketch follows.
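A minimal sketch of that layering, with hypothetical class and member names (the real IOStore/IOHandle interfaces differ in detail):

```cpp
#include <cstddef>
#include <vector>
#include <sys/types.h>

class IOStore;  // forward declaration for the friend relationship

class IOHandle {
    friend class IOStore;                 // only IOStore may call write()
public:
    virtual ~IOHandle() {}
    // Backends that want buffering return a reserved buffer here; NULL means unbuffered.
    virtual std::vector<char> *writeBuffer() { return NULL; }
protected:
    virtual ssize_t write(const void *buf, size_t count) = 0;  // raw backend write
};

class IOStore {
public:
    // All backends share this buffering logic; they only implement IOHandle::write.
    ssize_t Write(IOHandle *h, const void *buf, size_t count) {
        std::vector<char> *b = h->writeBuffer();
        if (!b) return h->write(buf, count);              // no buffer: pass through
        if (b->size() + count > b->capacity()) Flush(h);  // buffer full: drain it first
        b->insert(b->end(), (const char *)buf, (const char *)buf + count);
        return (ssize_t)count;
    }
    ssize_t Flush(IOHandle *h) {
        std::vector<char> *b = h->writeBuffer();
        if (!b || b->empty()) return 0;
        ssize_t rc = h->write(b->data(), b->size());
        b->clear();
        return rc;
    }
};
```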

Data Compression

Compression test: why don't we take your program that reads the plfs map serially and turn it into a program that reads each map entry, reads the data associated with that record, runs compress on the data if it's bigger than X bytes, and reports the average, min, max, and total compression for the file? We'd run it against RAGE and Silverton files to see whether either wins on compression, which tells us whether adding compression to plfs is worth the trouble. Also measure the time spent compressing, add it up, and report it so we know the overhead. Seems like a simple thing to do? A sketch of the per-record measurement is below.
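A rough sketch of the per-record measurement, assuming zlib's compress()/compressBound(); the map-reading part is elided since that program already exists, and the threshold is the "bigger than X bytes" cutoff:

```cpp
#include <vector>
#include <zlib.h>

// Returns the compressed size of one record's data; the caller accumulates
// ratio and timing statistics across all records in the file.
static uLongf compress_record(const std::vector<char> &data, size_t threshold) {
    if (data.size() <= threshold)
        return data.size();                      // too small to bother compressing
    uLongf out_len = compressBound(data.size());
    std::vector<Bytef> out(out_len);
    int rc = compress(out.data(), &out_len,
                      (const Bytef *)data.data(), data.size());
    return (rc == Z_OK) ? out_len : data.size(); // on failure, count as incompressible
}
```

Timing the compress call (e.g. with gettimeofday around it) and summing per file would give the overhead number.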

Again, this could be in the IOStore::Write() parent class as described above in the Data Buffering section.

Fix O_RDWR when the user really should have opened O_RDONLY

Currently in O_RDWR mode, we destroy the index and recreate it for every read. This is probably the most correct behavior, since it helps ensure that writes by one process are visible to reads by others. However, it kills performance, so we should sacrifice that correctness in favor of performance. We just need to document that in N-1 O_RDWR, reads following writes by other processes are undefined; this is probably true of most file systems anyway, and users shouldn't be doing N-1 in O_RDWR in the first place.

Move collectives into the library

TLDR version: Pass a table of general message-exchange function pointers. We'd have a table like the one in MPI-IO where we specify things like ad_plfs_open, but this table would have entries like plfs_broadcast; the ad_plfs layer would set that function pointer to MPI_Bcast and the UPC layer would do something similar. This would make the ad_plfs layer trivially small: ad_plfs_open would just pass its args directly to plfs_open along with the table of function pointers. In the library, if the table isn't passed, plfs_open does what it currently does. If the table is passed, plfs_open does the optimizations that currently live in the ad_plfs_open code, using the function pointers instead of the MPI_* calls. This would allow PLFS to build these optimizations without linking against MPI, and the thin ad_plfs layer could then be patched into official MPI distributions without us having to update them whenever we change the optimizations.

Long version: We have a bunch of optimizations in the ADIO layer which are used when we have inter-process communication available via an MPI communicator and MPI communication routines. This is great. But what if we are running under a different parallel programming model like UPC (http://upc.lbl.gov/)? We'd have to rewrite all those optimizations using a UPC communicator and UPC communication routines.

OR . . . we could create an abstraction for a parallel communicator and for parallel communication routines. Then we take the optimizations out of ad_plfs, and rewrite them just once using the new abstracted functions and the new abstracted communicator. Then we edit ad_plfs to work with the abstracted layer. All it will do is create a function pointer table using the MPI_* communication routines and pass those as options to the plfs library functions. The library functions will check whether they are passed and use them to do those optimizations. If they aren't passed, then the library code works exactly like it does today.

Once we have that abstraction in place, we can easily add a UPC layer which benefits from the same optimizations. This has the added benefit that the ad_plfs code will be tiny and very unlikely to change in the future, so we can try to get plfs patched into official MPI distributions. A sketch of what the function-pointer table might look like follows.
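A minimal sketch of the table; plfs_comm_ops and its members are illustrative names, not an existing PLFS API. The ad_plfs layer would fill the table in with thin MPI wrappers, and a UPC layer would supply its own:

```cpp
#include <cstddef>
#include <mpi.h>

// Hypothetical abstraction: the plfs library sees only these function pointers.
struct plfs_comm_ops {
    void *state;                                             // e.g. an MPI_Comm
    int (*broadcast)(void *buf, size_t len, int root, void *state);
    int (*rank)(void *state);
    int (*size)(void *state);
};

// Thin adapters the ad_plfs layer would install.
static int mpi_broadcast(void *buf, size_t len, int root, void *state) {
    return MPI_Bcast(buf, (int)len, MPI_BYTE, root, *(MPI_Comm *)state);
}
static int mpi_rank(void *state) {
    int r; MPI_Comm_rank(*(MPI_Comm *)state, &r); return r;
}
static int mpi_size(void *state) {
    int s; MPI_Comm_size(*(MPI_Comm *)state, &s); return s;
}

// ad_plfs_open would build one of these and hand it to plfs_open();
// if the table pointer is NULL, plfs_open() behaves exactly as it does today.
plfs_comm_ops make_mpi_ops(MPI_Comm *comm) {
    plfs_comm_ops ops = { comm, mpi_broadcast, mpi_rank, mpi_size };
    return ops;
}
```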

Get Doxygen working on the code

Jingwang started doing this on some of his code; I think Chuck did as well. Then we need to figure out how to make the Doxygen web pages publicly available.

Get Gerrit working with the existing github

The Gerrit code-review interface that John has been using with EMC is fantastic; github.com's review interface is not nearly as nice.

Redesign metalinks

We're not sure the current design of metalinks is as good as it can and should be. Unique metalinks are nice but difficult because we limit the number of subdirs, so multiple threads might want to create the same subdir pointing to different shadows. We need to limit the number of subdirs for exascale, but at Trinity scale we should be OK with up to 50K entries in a directory. So for metalinks created by an ADIO pattern, the index can be named index.patternid and the metalinks can be named metalink.patternid.thread. We'll still have to figure out a new solution for exascale, but that's still a long time away.

Merge Jun's index compression and Zhenhua's.

Both, though, are still waiting for at-scale testing to see whether they deliver the performance improvements we expect.

Going through trac and cleaning up small lingering bugs

These are listed in trac and marked with "bug".

Start porting the new SmallFile and new Index Compression branches to the new IOStore branch

Chuck already ported the existing trunk and it wasn't especially difficult. One hard part might be deciding what to do about the fd cache in the small file code.

Not creating unnecessary empty files in container mode.

This could also be done in an IOStore::Open wrapper that supports deferred opens: an option to the wrapper would specify whether to defer the open until the first write or to open immediately. If there are never any writes, the file would never be created. This would also rely on the IOStore::Write wrapper; a sketch follows.
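A small sketch of the deferred-open idea, shown with raw POSIX calls and hypothetical names; in PLFS this logic would live behind the IOStore wrappers:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <sys/types.h>

// Hypothetical lazy handle: the backend open() happens on the first write,
// so a dropping that is never written to is never created.
class LazyHandle {
public:
    LazyHandle(const std::string &path, int flags, mode_t mode)
        : path_(path), flags_(flags), mode_(mode), fd_(-1) {}
    ssize_t Write(const void *buf, size_t count) {
        if (fd_ < 0) fd_ = ::open(path_.c_str(), flags_, mode_);  // deferred open
        if (fd_ < 0) return -1;
        return ::write(fd_, buf, count);
    }
    int Close() { return (fd_ >= 0) ? ::close(fd_) : 0; }         // nothing to close if never written
private:
    std::string path_;
    int flags_;
    mode_t mode_;
    int fd_;
};
```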

Creating an IOHandle pool of recently opened IOHandle's.

We don't necessarily have to close every handle immediately. If a caller closes one and then reopens one, existing handles in the pool could be reused, reducing the number of actual system calls going from the PLFS library to the backends. A sketch is below.
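A rough sketch of such a pool, keyed by backend path. The names are hypothetical, and a real version would also check that the open flags match and reset the file offset before reusing a descriptor:

```cpp
#include <map>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical fd pool: Close() parks the descriptor instead of closing it,
// and a later Open() of the same path reuses it, avoiding a system call.
class HandlePool {
public:
    int Open(const std::string &path, int flags, mode_t mode) {
        std::multimap<std::string, int>::iterator it = idle_.find(path);
        if (it != idle_.end()) {
            int fd = it->second;
            idle_.erase(it);
            return fd;                                           // reuse a parked descriptor
        }
        return ::open(path.c_str(), flags, mode);
    }
    void Close(const std::string &path, int fd) {
        if (idle_.size() >= kMaxIdle) { ::close(fd); return; }   // pool full: really close
        idle_.insert(std::make_pair(path, fd));
    }
private:
    static const size_t kMaxIdle = 128;                          // cap on parked descriptors
    std::multimap<std::string, int> idle_;
};
```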

Adding a checksum to the SimpleByteRange index entry

This would give better data integrity by detecting data corruption on the read.

Large scale performance regression test suite up and running

This is something that only LANL can deliver.

plfs collectives

TLDR version: This is what was described in the Fast Forward proposal that we delivered with Cray. It is similar to MPI-IO collective IO, except that MPI-IO transfers a lot of data in order to get filesystem-aligned blocks, whereas within PLFS we only need to transfer a much smaller amount of data so that everyone writes the same amount. This will enable patterned index entries instead of SimpleByteRange entries.

Long version: If an MPI job is doing parallel IO where each proc writes a different amount of data, then we can't find a pattern and the index gets very big. Imagine that the respective procs each write this many bytes during a collective write:

9,11,12,8

What we can do in this case is move one byte from proc 1 to proc 0, and two bytes from proc 2 to proc 3 (numbering procs from zero). Then the writes will look like:

10,10,10,10

and we can do index compression with Zhenhua's patterned index code.
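A sketch of the rebalancing arithmetic, assuming for simplicity that the total divides evenly across procs; real code would also handle the remainder and actually ship the bytes between neighbors:

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Given how much each proc wants to write, compute how many bytes each proc
// should donate (negative) or receive (positive) so that everyone writes the
// same amount and the index can use a single patterned entry.
std::vector<long> rebalance(const std::vector<long> &counts) {
    long total = std::accumulate(counts.begin(), counts.end(), 0L);
    long target = total / (long)counts.size();       // e.g. (9+11+12+8)/4 = 10
    std::vector<long> delta(counts.size());
    for (size_t i = 0; i < counts.size(); ++i)
        delta[i] = target - counts[i];               // proc 0: +1, 1: -1, 2: -2, 3: +2
    return delta;
}

int main() {
    long in[] = { 9, 11, 12, 8 };
    std::vector<long> d = rebalance(std::vector<long>(in, in + 4));
    for (size_t i = 0; i < d.size(); ++i)
        std::printf("proc %zu: %+ld bytes\n", i, d[i]);
    return 0;
}
```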

Better deletion of containers

Currently when we unlink a container, we just recurse and try to delete everything. Sometimes, however, there may be a dropping that we can't unlink for some reason. If we have already removed the access file by then, the unlink fails, the container then appears to be a regular directory, and that can cause all sorts of problems. So we should delay deleting the access file until the very last operation (just before removing the top-level directory).

Better error handling

We often return an int which can be zero for success, a positive value for things like fds, byte counts, or offsets, or a negated errno to signal an error. Phrased like that it's not so bad, but we're not consistent enough. We should either have every function return a PlfsError type or have every function take a PlfsError output parameter. A sketch of the first option is below.
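A minimal sketch of the first option, with hypothetical names; the point is just that the status and the payload travel separately:

```cpp
#include <sys/types.h>

// Hypothetical error type: every plfs_* call returns one of these, and any
// payload (fd, byte count, offset) comes back through an output parameter.
enum PlfsError {
    PLFS_SUCCESS = 0,
    PLFS_ENOENT,
    PLFS_EIO,
    PLFS_ENOMEM
    // ... one value per errno we actually care about
};

// Example: instead of "returns bytes written or -errno", the count is an out-param.
PlfsError plfs_write_example(int fd, const void *buf, size_t count, ssize_t *bytes_written);
```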

Global append

It'd be really cool if multiple writers could all just append simultaneously to a shared file without having to coordinate offsets. Well, this should really be very easy in PLFS. Just have a plfs_append call that puts a -1 into the logical offset field of the PLFS index entry. At read time, compute offsets for every index entry that has a -1 using the timestamps in the index entries. Just make sure to have a deterministic ordering algorithm so that multiple readers across time and space always see the same file. [Now this might be tricky if writers mix appends with explicit offset writes. Imagine that one writer writes 100 bytes at offset 0 and another writer then does a simple append of 100 bytes. A reader shows up and reads the file; the appended bytes are found at offset 100. Now another writer shows up and does an explicit offset write at offset 100, and if we naively place appends at the current end of the explicitly written data, a later reader will find the appended bytes at offset 200. Actually we can deal with this: when we construct the index, place the appended writes where they belonged at the time they arrived. If subsequent writes overwrite them, then that's what the user intended.] A sketch of the read-time offset assignment is below.
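A sketch of the read-time offset assignment, with a hypothetical simplified entry type; the key property is that the ordering depends only on what is in the index, so every reader computes the same offsets:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical in-memory index entry (simplified).
struct Entry {
    double timestamp;        // when the write happened
    long   logical_offset;   // -1 means "appended, offset unknown"
    long   length;
    int    writer_id;        // tie-breaker so ordering is fully deterministic
};

static bool by_time(const Entry &a, const Entry &b) {
    if (a.timestamp != b.timestamp) return a.timestamp < b.timestamp;
    return a.writer_id < b.writer_id;            // deterministic tie-break
}

// Walk entries in deterministic time order; each appended entry lands at the
// end of file as it existed when that entry arrived.  Every reader that runs
// this gets the same answer, regardless of when it reads.
void assign_append_offsets(std::vector<Entry> &entries) {
    std::sort(entries.begin(), entries.end(), by_time);
    long eof = 0;
    for (size_t i = 0; i < entries.size(); ++i) {
        if (entries[i].logical_offset == -1)
            entries[i].logical_offset = eof;     // append: place at current EOF
        eof = std::max(eof, entries[i].logical_offset + entries[i].length);
    }
}
```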

Debug overhead

It looks like there is overhead from debugging even when mlog is turned off. Part of this is handled by the mlog_oss class, but I think the LogMessage class also adds overhead. Let's just rip all that code out and trust mlog to do all our debugging for us. Also, make plfs-collect-timings disabled by default.