New PR to enhance compatibility with fsspec and initial implementation of ADL Gen2 #6
Conversation
…ce from AbstractFileSystem and AzureDLFilesystem
Use ms class inheritance
Use ms class inheritance
… between files and directories
…e.py. This handles the case when the remote file is small enough to be read at once, rather than caching the file. This also replicates the pattern used in the existing implementation of the HTTPFile.
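(For context, a minimal sketch of that small-file pattern follows; the helper names are hypothetical stand-ins for whatever the file class actually uses, not adlfs APIs.)

def read_range(fetch_all, fetch_range, size, blocksize, start, end):
    """Return bytes [start, end) of a remote file.

    If the whole object is smaller than one block, fetch it in a single
    request and slice locally (the HTTPFile-style behaviour described above);
    otherwise fall back to a ranged request.
    """
    if size is not None and size <= blocksize:
        data = fetch_all()          # one round trip for the entire object
        return data[start:end]
    return fetch_range(start, end)  # ranged request for larger files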
Gave an initial pass and things look reasonable. I'm curious about a few things:
- Where does this live long-term? Is dask/adlfs the right repo? We can easily migrate things
- How do we move forward? I think it'd be valuable to get an initial release of this out there and rapidly make new releases as we iterate.
setup.py
Outdated
setup(name='adlfs',
      version='0.1.0',
      description='Access Azure Datalake Gen1 with fsspec and dask',
      url='https://github.com/hayesgb/adlfs/',
@martindurant Where's a good long-term home for this? gcsfs & s3fs are both in the Dask org.
This repo is dask/dask-adlfs, which would be fine with me, but also happy to drop the "dask-" part. I should note that MS's library is not called adlfs, even though that's how we referred to it when it was being written, and some places within the code still mention that.
For the sake of consistency, it should probably live in the dask org, and I'm fine with that. As far as naming goes, there has been mention of a possible conflict with the Azure library, but as martin mentions above, the MS library isn't called adlfs anywhere I've seen, and that name is consistent with the other cloud libraries.
@TomAugspurger -- Saw work has started on azure-pipeline. I've added you as a maintainer on PyPi.
adlfs/core.py
Outdated


class AzureDatalakeFileSystem(AbstractFileSystem):
Style: remove these blank lines.
Recommend flake8 in the CI, but only after this PR goes in. Could also add black, since it's taken over elsewhere.
adlfs/core.py
Outdated
Examples
_________
>>> adl = AzureDatalakeFileSystem(tenant_id="xxxx", client_id="xxxx",
client_secret="xxxx"
Nitpick: these continuation lines should start with ...
Added.
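For reference, the corrected doctest style (with placeholder credentials) looks like:

>>> adl = AzureDatalakeFileSystem(tenant_id="xxxx", client_id="xxxx",
...                               client_secret="xxxx")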
adlfs/core.py
Outdated
'client_secret': CLIENT_SECRET
})

Parameters
Typically Parameters comes first.
Moved parameters above examples.
adlfs/core.py
Outdated
    The username or service principal id
client_secret: string
    The access key
store_name: string (None)
Replace None with "optional".
Done.
def size(self, path):
    return self.info(path)['length']

def __getstate__(self):
Is this (still) necessary? I know there have been some upstream changes around serialization.
Probably can replace with simpler reduce, or just use superclass. I'd bet the latter works fine.
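A minimal sketch of the "simpler reduce" option, assuming the constructor signature from the docstring above (illustrative only, not the adlfs code):

import pickle

class ExampleDatalakeFS:
    def __init__(self, tenant_id, client_id, client_secret, store_name=None):
        self.tenant_id = tenant_id
        self.client_id = client_id
        self.client_secret = client_secret
        self.store_name = store_name
        # a real implementation would connect to the service here

    def __reduce__(self):
        # On unpickling, re-run __init__ with the stored credentials instead of
        # trying to serialize a live connection object.
        return (
            ExampleDatalakeFS,
            (self.tenant_id, self.client_id, self.client_secret, self.store_name),
        )

fs = ExampleDatalakeFS("tenant", "client", "secret")
fs2 = pickle.loads(pickle.dumps(fs))   # reconstructed via __init__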
adlfs/core.py
Outdated


class AzureDatalakeFileSystem(AbstractFileSystem):
Does this need a protocol=?
Yes please, "adl". To some extent, this is implied by including the class in the fsspec registry/ known implementations, but good to be explicit.
Not sure what to call the blob version, though!
Azure Datalake Gen2 calls out abfs[s] in the HDFS specification, and Blob shares multiprotocol access with the Gen2 datalake. OK with protocol = "abfs"?
Perhaps abs for "azure blob storage"?
Agree. Many other file systems name themselves with "fs", but it doesn't appear in the protocol name.
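To make the outcome of this thread concrete, a rough sketch of the protocol declarations and registration might look like the following (class bodies elided; the registry entries mirror the known_implementations usage shown later in this thread):

from fsspec import AbstractFileSystem
from fsspec.registry import known_implementations

class AzureDatalakeFileSystem(AbstractFileSystem):
    protocol = "adl"    # routes adl:// URLs to the Gen1 implementation

class AzureBlobFileSystem(AbstractFileSystem):
    protocol = "abfs"   # or "abs", per the discussion above

# Explicit registration; normally fsspec would ship these entries itself.
known_implementations["adl"] = {"class": "adlfs.AzureDatalakeFileSystem"}
known_implementations["abfs"] = {"class": "adlfs.AzureBlobFileSystem"}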
logging.debug(f'kwargs are: {out}')
return out

@classmethod
I wonder how much of this can be placed in a base class that's shared by the two?
Is your question about the instantiation or the staticmethods?
Azure Datalake Gen1 doesn't implement access through an access key, as is available for Blob storage.
Sorry, this was a general comment, not directed at this particular line.
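As a purely illustrative sketch of the shared-base-class idea (attribute and method names are assumptions, not the actual adlfs layout), the common service-principal handling could live in one place while each backend adds its own client:

from fsspec import AbstractFileSystem

class AzureBaseFileSystem(AbstractFileSystem):
    """Hypothetical shared base: holds service principal credentials."""

    def __init__(self, tenant_id=None, client_id=None, client_secret=None, **kwargs):
        super().__init__(**kwargs)
        self.tenant_id = tenant_id
        self.client_id = client_id
        self.client_secret = client_secret
        self.do_connect()

    def do_connect(self):
        raise NotImplementedError  # each subclass builds its own client

class Gen1FileSystem(AzureBaseFileSystem):
    def do_connect(self):
        pass  # would create the azure-datalake-store client here

class BlobFileSystem(AzureBaseFileSystem):
    def do_connect(self):
        pass  # would create the BlockBlobService client here (Blob also allows account keys)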
… as fix to bug reading on Datalake Gen1
Resolve suggestions
@hayesgb just let us know when you're ready here, and I think we'll merge this in.
Not to throw a wrench into things, but the BlockBlobService function has an argument to utilize the Azurite emulator (https://github.com/hayesgb/adlfs/blob/master/adlfs/core.py#L264). Also, if the Azurite emulator is running on a remote host, I've gotten this invocation to work. Again, it might be nice to support some of these arguments:

bbs = BlockBlobService(account_name='devstoreaccount1',
                       account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
                       custom_domain='http://<hostname>:<port>/devstoreaccount1')

I might just make a PR later after you merge this one.
Are you otherwise having a pleasant experience with this code, @AlbertDeFusco?
I haven't tested it personally, but I intend to do so soon. I spent today figuring out how BlockBlobService connects to Azurite (locally and remotely). I may build this from source and give it a try soon.
Edit: I intend to fork this and try it with my Azurite docker container
@hayesgb in the issue martin linked it's broken again with the latest fsspec, but for the record, I got intake.upload (and the corresponding intake catalog.entry.read) working using:
If I can provide more useful info, just let me know.
With a change to support

import dask.dataframe as dd
from fsspec.registry import known_implementations

known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}

STORAGE_OPTIONS = {
    'custom_domain': 'http://<hostname>:<port>/devstoreaccount1',
    'account_name': 'devstoreaccount1',
    'account_key': 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=='
}

ddf = dd.read_csv('abfs://data/autompg.csv', storage_options=STORAGE_OPTIONS)
print(ddf.groupby('origin')['mpg'].mean().compute())

origin
Asia      30.450633
Europe    27.602941
US        20.033469
Name: mpg, dtype: float64

And fs operations work (including put):

import fsspec

fs = fsspec.filesystem('abfs', container_name='data', **STORAGE_OPTIONS)
print(fs.ls(''))
['autompg.csv']

with fs.open('autompg.csv', 'rt') as f:
    d = f.readlines()
print(d[:3])
['mpg,cyl,displ,hp,weight,accel,yr,origin,name\n',
 '18.0,8,307.0,130,3504,12.0,70,US,chevrolet chevelle malibu\n',
 '15.0,8,350.0,165,3693,11.5,70,US,buick skylark 320\n']
Note: to add "abs" (better than abfs, I think we decided) and "adl" protocols to the known_implementations in fsspec as soon as this is merged. A release of this repo on PyPI and conda-forge ought to follow that shortly.
There are some issues with the Quickstart guide in the readme. It should use
I'm going to push on this for the next few hours, so that we can resolve fsspec/filesystem_spec#205. Hope that's OK @hayesgb.
@@ -0,0 +1,4 @@
azure-datalake-store>=0.0.46,<0.1
Is pinning to <0.1 necessary? I'm not familiar with azure's version policy.
Probably not. I don't think they will do much additional development on the azure-datalake-store library, since it's for the Gen1 datalake. However, be aware that they're actively working on the storage library, so it should be pinned to <3.0.
Do we have a testing strategy in mind? https://github.com/Azure/azure-data-lake-store-python/tree/master/tests uses vcr :/
^ see @AlbertDeFusco's comments on trying with Azurite, which is a Docker-based Azure storage emulator
Thanks!
My fork has support for
@AlbertDeFusco, please PR to here after this PR is merged, and I would appreciate it if you'd have a go at starting the docker image as a fixture and running some basic tests.
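If it helps, one possible shape for that fixture is sketched below: start the Azurite container once per test session, point an abfs filesystem at it using the well-known devstoreaccount1 credentials quoted earlier in this thread, and tear the container down afterwards. The image name, port, and sleep-based wait are assumptions, not a tested setup.

import subprocess
import time

import fsspec
import pytest

@pytest.fixture(scope="session")
def azurite_url():
    # Assumed image name and blob port for the Azurite emulator.
    cid = subprocess.check_output(
        ["docker", "run", "-d", "-p", "10000:10000",
         "mcr.microsoft.com/azure-storage/azurite"]
    ).decode().strip()
    time.sleep(5)  # crude wait for the emulator to accept connections
    yield "http://127.0.0.1:10000/devstoreaccount1"
    subprocess.run(["docker", "rm", "-f", cid], check=True)

def test_ls(azurite_url):
    # Assumes 'abfs' is registered with fsspec, as shown earlier in the thread,
    # and that a 'data' container already exists in the emulator.
    fs = fsspec.filesystem(
        "abfs",
        container_name="data",
        account_name="devstoreaccount1",
        account_key="Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",
        custom_domain=azurite_url,
    )
    assert isinstance(fs.ls(""), list)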
Implementation currently works as follows:

ADL Gen1 - Uses ServicePrincipalCredentials with Azure. Requires passing credentials and store_name via the storage_options parameter in a dask read call. Do not include the store_name in the filepath. When writing to the datalake, though, store_name is currently required to be included in the filepath. Just found this bug and intend to fix it by this weekend, if not before.

ADL Gen2 - Started work on an AzureBlobFileSystem and corresponding AzureBlobFile class. These also authenticate using service principal credentials. Reading from the datalake seems to be working fine, but the _upload_chunk() method needs additional work.
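For the Gen1 path described above, usage would look roughly like this. The credential values and store name are placeholders, and the path layout follows the description in this comment rather than any confirmed API:

import dask.dataframe as dd

# Placeholder credentials and store name -- illustrative only.
STORAGE_OPTIONS = {
    'tenant_id': 'xxxx',
    'client_id': 'xxxx',
    'client_secret': 'xxxx',
    'store_name': 'mystore',   # passed via storage_options, not in the path
}

# Per the note above, the store name is not repeated in the read path.
ddf = dd.read_csv('adl://folder/*.csv', storage_options=STORAGE_OPTIONS)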