Support for wasb:// protocol on Azure HDInsight #35

Open
aviks opened this issue Feb 7, 2018 · 5 comments
aviks commented Feb 7, 2018

I can see the files using hadoop fs -ls but not using readdir. Creating a file reference with HDFSFile for a file I know exists, and then calling stat on it, throws Elly.HDFSException("Path not found").

sshuser@hn0-myclust:~$ hadoop fs -ls /
Found 15 items
drwxr-xr-x   - root   supergroup          0 2018-02-07 14:25 /HdiSamples
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /ams
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /amshbase
drwxrwxrwx   - yarn   hadoop              0 2018-02-07 14:15 /app-logs
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /apps
drwxr-xr-x   - yarn   hadoop              0 2018-02-07 14:15 /atshistory
drwxr-xr-x   - root   supergroup          0 2018-02-07 14:24 /custom-scriptaction-logs
drwxr-xr-x   - root   supergroup          0 2018-02-07 14:25 /example
drwxr-xr-x   - hbase  supergroup          0 2018-02-07 14:15 /hbase
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /hdp
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /hive
drwxr-xr-x   - mapred supergroup          0 2018-02-07 14:15 /mapred
drwxrwxrwx   - mapred hadoop              0 2018-02-07 14:15 /mr-history
drwxrwxrwx   - hdfs   supergroup          0 2018-02-07 14:15 /tmp
drwxr-xr-x   - hdfs   supergroup          0 2018-02-07 14:15 /user
sshuser@hn0-myclust:~$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using Elly

julia> dfs = HDFSClient("hn0-myclust.3p0iyjauoc2e3faws152r5tm0e.cx.internal.cloudapp.net", 8020)
HDFSClient: sshuser@hn0-myclust.3p0iyjauoc2e3faws152r5tm0e.cx.internal.cloudapp.net:8020/
    id: 76ba6c80-1ac9-45
    connected: false
    pwd: /


julia> readdir(dfs)
1-element Array{AbstractString,1}:
 "tmp"
aviks commented Feb 8, 2018

[Renamed the issue]

This happens because Azure uses a separate wasb:// protocol layered over hdfs://, with Azure Blob Storage as the underlying store. Elly will probably need to support it explicitly.

Some background: https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/
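
For reference, a wasb location has the form wasb[s]://<container>@<account>.blob.core.windows.net/<path>, so the container and storage account are carried in the URI rather than in a namenode address. A minimal sketch of splitting such a URI, written for Julia 1.x; WasbLocation and parse_wasb are hypothetical names and not part of Elly, and the container and account in the example are made up:

```julia
# Sketch only: WasbLocation and parse_wasb are hypothetical, not Elly API.
struct WasbLocation
    secure::Bool        # true for wasbs:// (TLS), false for wasb://
    container::String   # blob container name
    account::String     # storage account host
    path::String        # path inside the container, always starting with "/"
end

function parse_wasb(uri::AbstractString)
    # scheme://container@account-host/optional/path
    m = match(r"^(wasbs?)://([^@/]+)@([^/]+)(/.*)?$", uri)
    m === nothing && throw(ArgumentError("not a wasb:// URI: $uri"))
    # A URI without a path component refers to the container root.
    path = m.captures[4] === nothing ? "/" : String(m.captures[4])
    return WasbLocation(m.captures[1] == "wasbs",
                        String(m.captures[2]),
                        String(m.captures[3]),
                        path)
end

# Made-up container and account, for illustration only.
loc = parse_wasb("wasb://mycontainer@myaccount.blob.core.windows.net/example/data.csv")
println(loc.container, " ", loc.account, " ", loc.path)
```

None of these parts map onto the host and port that HDFSClient takes, which is consistent with readdir not seeing the same tree as hadoop fs -ls.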

aviks changed the title from "Unable to read files on HDInsight" to "Support for wasb:// protocol on Azure HDInsight" on Feb 8, 2018
aviks commented Feb 8, 2018

Similarly, HDInsight supports the adl:// protocol, which uses Azure Data Lake Store as the underlying storage engine for Hadoop. It would be good to support that as well.

tanmaykm commented

It looks like wasb support came in with Hadoop v2.9: https://hadoop.apache.org/docs/r2.9.0/hadoop-azure/index.html#Introduction

What is not yet clear to me is whether the server transparently wraps wasb and presents an hdfs interface. If it does, we should be able to access wasb just by upgrading Elly to the v2.9 protobuf APIs. I am still unsure how or why that would work, so I will dig a bit deeper.

tanmaykm commented Dec 13, 2019

This appears to be implemented entirely as a client library; see the org/apache/hadoop/fs/azure/NativeAzureFileSystem.html source.

It seems to read the hdfs config, but it interacts with Azure services directly. The hdfs namenode and datanodes do not seem to be aware of it at all.

So the HDFSFile implementation in Elly.jl can serve only the hdfs:// filesystem. We would probably need to use the Azure APIs to implement a NativeAzureFile along similar lines in Julia. There also does not seem to be a direct Azure API for the wasb filesystem protocol, only APIs for the blob store, so the filesystem metadata management would have to be implemented in Julia as well.
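
To give a rough idea of what that metadata management involves: a blob store exposes a flat namespace of blob names, so a readdir-style call has to derive directory entries from name prefixes. A minimal sketch for Julia 1.x, assuming directories are implied by "/" separators in blob names; list_dir is a hypothetical helper, the blob names are made up, and how the real wasb driver records directories is not covered here:

```julia
# Sketch only: list_dir is hypothetical. `blobs` stands in for the flat list
# of names that a blob-store listing call would return.
function list_dir(blobs::Vector{String}, dir::AbstractString)
    # Normalize the directory to a name prefix: "" for the root, "example/" otherwise.
    trimmed = strip(dir, '/')
    prefix = isempty(trimmed) ? "" : string(trimmed, "/")
    entries = Set{String}()
    for name in blobs
        startswith(name, prefix) || continue
        # Drop the prefix and keep only the first remaining path component.
        rest = SubString(name, ncodeunits(prefix) + 1)
        isempty(rest) && continue
        component = first(split(rest, '/'))
        isempty(component) || push!(entries, String(component))
    end
    return sort!(collect(entries))
end

# Made-up blob names, for illustration only.
blobs = ["tmp/a.txt", "example/data/x.csv", "example/jars/y.jar", "HdiSamples/readme"]
println(list_dir(blobs, "/"))
println(list_dir(blobs, "/example"))
```

A full implementation would also need stat, rename and delete semantics on top of the flat names, which is where most of the work would be.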
