A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.
- Awesome Hadoop
- Hadoop
- YARN
- NoSQL
- SQL on Hadoop
- Data Management
- Workflow, Lifecycle and Governance
- Data Ingestion and Integration
- DSL
- Libraries and Tools
- Realtime Data Processing
- Distributed Computing and Programming
- Packaging, Provisioning and Monitoring
- Monitoring
- Search
- Security
- Benchmark
- Machine learning and Big Data analytics
- Misc.
- Resources
- Other Awesome Lists
- Apache Hadoop - Apache Hadoop
- Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop
- SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
- GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
- Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
- dumbo - Python module that allows you to easily write and run Hadoop programs.
- hadoopy - Python MapReduce library written in Cython.
- mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
- pydoop - Pydoop is a package that provides a Python API for Hadoop.
- hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
- White Elephant - Hadoop log aggregator and dashboard
- Kiji Project
- Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
- Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
- Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
- Apache Ignite - Distributed in-memory platform
- Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
- Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
- mpich2-yarn - Running MPICH2 on Yarn
Next-generation databases that mostly address some of these points: being non-relational, distributed, open source and horizontally scalable.
- Apache HBase - Apache HBase
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- happybase - A developer-friendly Python library to interact with Apache HBase.
- Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
- Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
- hindex - Secondary Index for HBase
- Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
- OpenTSDB - The Scalable Time Series Database
- Apache Cassandra
SQL on Hadoop
- Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
- Apache Phoenix - A SQL skin over HBase supporting secondary indices
- Apache HAWQ (incubating) - Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of MPP database with the scalability and convenience of Hadoop
- Lingual - SQL interface for Cascading (MR/Tez job generator)
- Cloudera Impala
- Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
- Apache Tajo - Data warehouse system for Apache Hadoop
- Apache Drill - Schema-free SQL Query Engine
- Apache Trafodion
- Apache Calcite - A Dynamic Data Management Framework
- Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies
- Apache Oozie - Apache Oozie
- Azkaban
- Apache Falcon - Data management and processing platform
- Apache NiFi - A dataflow system
- Apache Airflow - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines
- Luigi - Python package that helps you build complex pipelines of batch jobs
- Apache Flume - Apache Flume
- Suro - Netflix's distributed Data Pipeline
- Apache Sqoop - Apache Sqoop
- Apache Kafka - Apache Kafka
- Gobblin from LinkedIn - Universal data ingestion framework for Hadoop
- Apache Pig - Apache Pig
- Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
- varaha - Machine learning and natural language processing with Apache Pig
- packetpig - Open Source Big Data Security Analytics
- akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
- seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
- Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
- PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
- Kite Software Development Kit - A set of libraries, tools, examples, and documentation
- gohadoop - Native Go clients for Apache Hadoop YARN.
- Hue - A Web interface for analyzing data with Apache Hadoop.
- Apache Zeppelin - A web-based notebook that enables interactive data analytics
- Jumbune - Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs.
- Apache Thrift
- Apache Avro - Apache Avro is a data serialization system.
- Elephant Bird - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
- Spring for Apache Hadoop
- hdfs - A native Go client for HDFS
- Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.
- snakebite
- Apache Storm
- Apache Samza
- Apache Spark
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
- Apache Spark
- Spark Packages - A community index of packages for Apache Spark
- SparkHub - A community site for Apache Spark
- Apache Crunch
- Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
- Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
- Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.
- Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
- Apache Ambari - Apache Ambari
- Ganglia Monitoring System
- ankush - A big data cluster management tool that creates and manages clusters of different technologies.
- Apache ZooKeeper - Apache ZooKeeper
- Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
- Buildoop - Hadoop Ecosystem Builder
- Deploop - The Hadoop Deploy System
- Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
- inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
- Elasticsearch
- Apache Solr
- SenseiDB - Open-source, distributed, realtime, semi-structured database
- Banana - Kibana port for Apache Solr
- Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.
- Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
- Apache Sentry - An authorization module for Hadoop
- Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
- Big Data Benchmark
- HiBench
- Big-Bench
- hive-benchmarks
- hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
- YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.
- Apache Mahout
- Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
- MLlib - MLlib is Apache Spark's scalable machine learning library.
- R - R is a free software environment for statistical computing and graphics.
- RHadoop - Includes RHDFS, RHBase, RMR2 and plyrmr
- RHive - For launching Hive queries from R
- Apache Lens
- Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
- Hive Plugins
- UDF
- http://nexr.github.io/hive-udf/
- https://github.com/edwardcapriolo/hive_cassandra_udfs
- https://github.com/livingsocial/HiveSwarm
- https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
- https://github.com/karthkk/udfs
- https://github.com/twitter/elephant-bird - Twitter
- https://github.com/lovelysystems/ls-hive
- https://github.com/stewi2/hive-udfs
- https://github.com/klout/brickhouse
- https://github.com/markgrover/hive-translate (PostgreSQL translate())
- https://github.com/deanwampler/HiveUDFs
- https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
- https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
- https://github.com/Netflix/Surus
- Storage Handler
- https://github.com/dvasilen/Hive-Cassandra
- https://github.com/yc-huang/Hive-mongo
- https://github.com/balshor/gdata-storagehandler
- https://github.com/karthkk/hive-hbase-json
- https://github.com/sunsuk7tp/hive-hbase-integration
- https://bitbucket.org/rodrigopr/redisstoragehandler
- https://github.com/zhuguangbin/HiveJDBCStorageHanlder
- https://github.com/chimpler/hive-solr
- https://github.com/bfemiano/accumulo-hive-storage-manager
- SerDe
- Libraries and tools
- https://github.com/forward3d/rbhive
- https://github.com/synctree/activerecord-hive-adapter
- https://github.com/hrp/sequel-hive-adapter
- https://github.com/forward/node-hive
- https://github.com/recruitcojp/WebHive
- shib - WebUI for query engines: Hive and Presto
- clive - Clojure library for interacting with Hive via Thrift
- https://github.com/anjuke/hwi
- https://code.google.com/a/apache-extras.org/p/hipy/
- https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
- PyHive - Python interface to Hive and Presto
- https://github.com/recruitcojp/OdbcHive
- Hive-Sharp
- HiveRunner - An open source unit test framework for Hadoop Hive queries based on JUnit4
- Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test - Unit test framework for Hive and hive-service
- Flume Plugins
- Flume MongoDB Sink
- Flume HornetQ Channel
- Flume MessagePack Source
- Flume RabbitMQ source and sink
- Flume UDP Source
- Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC
- Flume Custom Serializers
- Real-time analytics in Apache Flume
- .Net FlumeNG Clients
Various resources, such as books, websites and articles.
Useful websites and articles
- Hadoop Weekly
- The Hadoop Ecosystem Table
- Hadoop 1.x vs 2
- Apache Hadoop YARN: Yet Another Resource Negotiator
- Introducing Apache Hadoop YARN
- Apache Hadoop YARN - Background and an Overview
- Apache Hadoop YARN - Concepts and Applications
- Apache Hadoop YARN - ResourceManager
- Apache Hadoop YARN - NodeManager
- Migrating to MapReduce 2 on YARN (For Users)
- Migrating to MapReduce 2 on YARN (For Operators)
- Hadoop and Big Data: Use Cases at Salesforce.com
- All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
- What is Bigtop, and Why Should You Care?
- Hadoop - Distributions and Commercial Support
- Ganglia configuration for a small Hadoop cluster and some troubleshooting
- Hadoop illuminated - Open Source Hadoop Book
- NoSQL Database
- 10 Best Practices for Apache Hive
- Hadoop Operations at Scale
- AWS BigData Blog
- Hadoop360
- How to monitor Hadoop metrics
- Hadoop Summit Presentations - Slide decks from Hadoop Summit
- Hadoop 24/7
- An example Apache Hadoop Yarn upgrade
- Apache Hadoop In Theory And Practice
- Hadoop Operations at LinkedIn
- Hadoop Performance at LinkedIn
- Docker based Hadoop provisioning
- Hadoop: The Definitive Guide
- Hadoop Operations
- Apache Hadoop Yarn
- HBase: The Definitive Guide
- Programming Pig
- Programming Hive
- Hadoop in Practice, Second Edition
- Hadoop in Action, Second Edition
Other amazingly awesome lists can be found in the awesome-awesomeness and awesome lists.