
New PR to enhance compatibility with fsspec and initial implementation of ADL Gen2 #6

Merged · 116 commits · Nov 15, 2019

Conversation

hayesgb (Collaborator) commented Sep 11, 2019

Implementation currently works as follows:

  • ADL Gen1 authenticates with service principal credentials against Azure. It requires passing the credentials and store_name via the storage_options parameter in a dask read call; do not include the store_name in the filepath when reading. When writing to the datalake, though, the store_name currently has to be included in the filepath. I just found this bug and intend to fix it by this weekend, if not before. (See the usage sketch after this list.)

  • ADL Gen2 - Started work on an AzureBlobFileSystem and a corresponding AzureBlobFile class. These also authenticate with service principal credentials. Reading from the datalake seems to be working fine, but the _upload_chunk() method needs additional work.
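A minimal sketch of the Gen1 read path described above (the adl:// prefix assumes the protocol registered later in this PR; credentials, store name, and path are placeholders):

import dask.dataframe as dd

STORAGE_OPTIONS = {
    'tenant_id': 'xxxx',
    'client_id': 'xxxx',
    'client_secret': 'xxxx',
    'store_name': 'mydatalakestore',  # placeholder; passed here, not in the path
}

# store_name goes in storage_options when reading, not in the filepath
ddf = dd.read_csv('adl://folder/data.csv', storage_options=STORAGE_OPTIONS)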

hayesgb and others added 30 commits February 18, 2019 16:03
…ce from AbstractFileSystem and AzureDLFilesystem
…e.py. This handles the case when the remote file is small enough to be read at once, rather than caching the file. This also replicates the pattern used in the existing implementation of the HTTPFile.
@TomAugspurger (Contributor) left a comment

Gave an initial pass and things look reasonable. I'm curious about a few things:

  1. Where does this live long-term? Is dask/adlfs the right repo? We can easily migrate things.
  2. How do we move forward? I think it'd be valuable to get an initial release of this out there and rapidly make new releases as we iterate.

.gitmodules Outdated (resolved)
setup.py Outdated (resolved)
setup.py Outdated
setup(name='adlfs',
version='0.1.0',
description='Access Azure Datalake Gen1 with fsspec and dask',
url='https://github.com/hayesgb/adlfs/',
Contributor

@martindurant Where's a good long-term home for this? gcsfs & s3fs are both in the Dask org.

Member

This repo is dask/dask-adlfs, which would be fine with me, but also happy to drop the "dask-" part. I should note that MS's library is not called adlfs, even though that's how we referred to it when it was being written and some places within the code still mention that.

Collaborator Author

For the sake of consistency, it should probably live in the dask org, and I'm fine with that. As far as naming goes, there has been mention of a possible conflict with the Azure library, but as martin mentions above, the MS library isn't called adlfs anywhere I've seen, and that name is consistent with the other cloud libraries.

Collaborator Author

@TomAugspurger -- Saw that work has started on Azure Pipelines. I've added you as a maintainer on PyPI.

adlfs/core.py Outdated


class AzureDatalakeFileSystem(AbstractFileSystem):

Contributor

Style: remove these blank lines.

Member

Recommend flake8 in the CI, but only after this PR goes in. Could also add black, since it's taken over elsewhere.

adlfs/core.py Outdated
Examples
_________
>>> adl = AzureDatalakeFileSystem(tenant_id="xxxx", client_id="xxxx",
client_secret="xxxx"
Contributor

Nitpick: these continuation lines should start with ...
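For reference, the doctest-style continuation for that example would look like:

>>> adl = AzureDatalakeFileSystem(tenant_id="xxxx", client_id="xxxx",
...                               client_secret="xxxx")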

Collaborator Author

Added.

adlfs/core.py Outdated
'client_secret': CLIENT_SECRET
})

Parameters
Contributor

Typically, Parameters comes first.

Collaborator Author

Moved parameters above examples.

adlfs/core.py Outdated
The username or serviceprincipal id
client_secret: string
The access key
store_name: string (None)
Contributor

Replace None with "optional".

Collaborator Author

Done.

def size(self, path):
return self.info(path)['length']

def __getstate__(self):
Contributor

Is this (still) necessary? I know there have been some upstream changes around serialization.

Member

Probably can replace with a simpler __reduce__, or just use the superclass implementation. I'd bet the latter works fine.
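A minimal sketch of the simpler __reduce__ option (assumes the constructor kwargs are stashed on the instance as self.kwargs; not the actual code in this PR):

from fsspec import AbstractFileSystem

def _rebuild(cls, kwargs):
    return cls(**kwargs)

class AzureDatalakeFileSystem(AbstractFileSystem):
    def __init__(self, **kwargs):
        super().__init__()
        self.kwargs = kwargs  # keep the constructor arguments for pickling

    def __reduce__(self):
        # Recreate the filesystem from its constructor kwargs on unpickle,
        # instead of pickling the live Azure client via __getstate__/__setstate__
        return _rebuild, (type(self), self.kwargs)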

adlfs/core.py Outdated


class AzureDatalakeFileSystem(AbstractFileSystem):

Contributor

Does this need a protocol=?

Member

Yes please, "adl". To some extent, this is implied by including the class in the fsspec registry/ known implementations, but good to be explicit.

Member

Not sure what to call the blob version, though!

Collaborator Author

Azure Datalake Gen2 calls out abfs[s] in the HDFS specification, and blob storage shares multiprotocol access with the Gen2 datalake. OK with protocol = "abfs"?

Contributor

Perhaps abs for "azure blob storage"?

Member

Agree. Many other file systems name themselves with "fs", but it doesn't appear in the protocol name.
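As a sketch, the explicit protocol attributes would look like this (class names from this PR, protocol names per the discussion above; fsspec matches URL prefixes against this attribute):

from fsspec import AbstractFileSystem

class AzureDatalakeFileSystem(AbstractFileSystem):
    protocol = "adl"   # adl://... URLs, Gen1 datalake

class AzureBlobFileSystem(AbstractFileSystem):
    protocol = "abs"   # abs://... URLs, blob storage / Gen2 datalake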

logging.debug(f'kwargs are: {out}')
return out

@classmethod
Contributor

I wonder how much of this can be placed in a base class that's shared by the two?

Collaborator Author

Is your question about the instantiation or the staticmethods?

Azure Datalake Gen1 doesn't implement access through an access key as is available for Blob storage.

Contributor

Sorry, this was a general comment, not directed at this particular line.

@TomAugspurger (Contributor)

@hayesgb just let us know when you're ready here, and I think we'll merge this in.

@AlbertDeFusco (Contributor)

Not to throw a wrench into things, but the BlockBlobService constructor has an argument to use the Azurite emulator, called is_emulated. When set to True, it connects to 127.0.0.1:10000. It might be useful to support that.

https://github.com/hayesgb/adlfs/blob/master/adlfs/core.py#L264

Also, if the Azurite emulator is running on a remote host I've gotten this invocation to work. Again, it might be nice to support some of these arguments in the storage_options.

bbs = BlockBlobService(account_name='devstoreaccount1', account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==', custom_domain='http://<hostname>:<port>/devstoreaccount1')

I might just make a PR later after you merge this one.
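For comparison, the emulator shortcut would be roughly this (a sketch against the legacy azure-storage-blob 2.x BlockBlobService API; the key is Azurite's well-known default for devstoreaccount1):

from azure.storage.blob import BlockBlobService

# With is_emulated=True, BlockBlobService targets the local emulator at 127.0.0.1:10000
bbs = BlockBlobService(
    account_name='devstoreaccount1',
    account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
    is_emulated=True,
)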

@martindurant (Member)

Are you otherwise having a pleasant experience with this code, @AlbertDeFusco ?

@AlbertDeFusco (Contributor)

I haven't tested it personally, but I intend to do so soon. I spent today figuring out how BlockBlobService connects to Azurite (locally and remotely). I may build this from source and give it a try soon.

@AlbertDeFusco (Contributor)

Edit: I intend to fork this and try it with my Azurite docker container

@martindurant (Member)

@JoranDox

@hayesgb in the issue martin linked it's broken again with the latest fsspec, but for the record, I got intake.upload (and the corresponding intake catalog.entry.read) working using:

  • adlfs (data lake gen 1, regular text file)
  • fsspec
  • intake
  • (pip freeze in linked issue)

If I can provide more useful info, just let me know.

@AlbertDeFusco (Contributor) commented Nov 15, 2019

With a change to support custom_domain I have confirmed that this works with the Azurite docker container. I earlier used the azure-storage-blob package to upload the file.

import dask.dataframe as dd
from fsspec.registry import known_implementations
known_implementations['abfs'] = {'class': 'adlfs.AzureBlobFileSystem'}

STORAGE_OPTIONS = {
    'custom_domain':'http://<hostname>:<port>/devstoreaccount1',
    'account_name':'devstoreaccount1',
    'account_key':'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=='
}

ddf = dd.read_csv('abfs://data/autompg.csv', storage_options=STORAGE_OPTIONS)
print(ddf.groupby('origin')['mpg'].mean().compute())
    origin
    Asia      30.450633
    Europe    27.602941
    US        20.033469
    Name: mpg, dtype: float64

And fs operations work (including put)

import fsspec

fs = fsspec.filesystem('abfs', container_name='data', **STORAGE_OPTIONS)
print(fs.ls(''))
    ['autompg.csv']

with fs.open('autompg.csv', 'rt') as f:
    d = f.readlines()
print(d[:3])
    ['mpg,cyl,displ,hp,weight,accel,yr,origin,name\n',
     '18.0,8,307.0,130,3504,12.0,70,US,chevrolet chevelle malibu\n',
     '15.0,8,350.0,165,3693,11.5,70,US,buick skylark 320\n']

@martindurant (Member)

Note: add "abs" (better than abfs, I think we decided) and "adl" protocols to the known_implementations in fsspec as soon as this is merged. A release of this repo on PyPI and conda-forge ought to follow shortly after that.
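A sketch of that registration (protocol and class names as discussed in this thread):

from fsspec.registry import known_implementations

known_implementations["adl"] = {"class": "adlfs.AzureDatalakeFileSystem"}
known_implementations["abs"] = {"class": "adlfs.AzureBlobFileSystem"}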

@AlbertDeFusco (Contributor)

There are some issues with the Quickstart guide in the readme. It should use AzureBlobFileSystem for abs (née abfs), not AzureDatalakeFileSystem.

@TomAugspurger (Contributor)

I'm going to push on this for the next few hours, so that we can resolve fsspec/filesystem_spec#205. Hope that's OK @hayesgb.

@@ -0,0 +1,4 @@
azure-datalake-store>=0.0.46,<0.1
Contributor

Is pinning to <0.1 necessary? I'm not familiar with Azure's version policy.

Collaborator Author

Probably not. I don't think they will do much additional development on the azure-datalake-store library, since it's for the Gen1 datalake. However, be aware that they're actively working on the storage library, so it should be pinned to <3.0.
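The pins would then look roughly like this (the azure-storage-blob line is illustrative of the <3.0 advice, not taken from this PR):

azure-datalake-store>=0.0.46,<0.1
azure-storage-blob<3.0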

@TomAugspurger (Contributor)

Do we have a testing strategy in mind? https://github.com/Azure/azure-data-lake-store-python/tree/master/tests uses vcr :/

@martindurant (Member)

^ See @AlbertDeFusco's comments on trying with Azurite, which is a Docker-based Azure storage emulator.

@martindurant (Member)

@TomAugspurger (Contributor)

Thanks!

@AlbertDeFusco (Contributor)

My fork has support for is_emulated; just set that to True in storage_options after running the Azurite docker image.

https://github.com/albertdefusco/adlfs
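A sketch of how that might look (assuming the fork forwards is_emulated through to BlockBlobService; the path is a placeholder):

import dask.dataframe as dd

STORAGE_OPTIONS = {
    'account_name': 'devstoreaccount1',
    'account_key': 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
    'is_emulated': True,
}

ddf = dd.read_csv('abfs://data/autompg.csv', storage_options=STORAGE_OPTIONS)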

@martindurant (Member)

@AlbertDeFusco, please PR to here after this PR is merged, and I'd appreciate it if you'd have a go at starting the docker image as a fixture and running some basic tests.
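A rough sketch of such a fixture (image name, port, and wait strategy are assumptions, not an agreed-on setup):

import subprocess
import time

import pytest


@pytest.fixture(scope="session")
def azurite():
    # Run the Azurite blob emulator in Docker for the whole test session
    proc = subprocess.Popen([
        "docker", "run", "--rm", "-p", "10000:10000",
        "mcr.microsoft.com/azure-storage/azurite",
        "azurite-blob", "--blobHost", "0.0.0.0",
    ])
    time.sleep(5)  # crude wait for the emulator to start accepting connections
    yield {"account_name": "devstoreaccount1", "is_emulated": True}
    proc.terminate()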

@TomAugspurger merged commit 065b5ba into fsspec:master on Nov 15, 2019