
Google Summer of Code 2014

Pierre-Antoine Ganaye edited this page May 23, 2014 · 77 revisions


At Twitter, we love open source, working with students, and Google Summer of Code (GSoC)! What is GSoC? Every year, Google invites students to come up with interesting problems for their favorite open-source projects and work on them over the summer. Participants get support from the community, plus a mentor who makes sure you don't get lost and that you meet your goals. Aside from the satisfaction of solving challenging problems and contributing to the open-source community, students get paid and get some sweet swag for their work! In our opinion, this is a great opportunity to get involved with open source, improve your skills, and help out the community!

If you're interested in Outreach Program for Women as an option, please see that wiki: https://github.com/twitter/twitter.github.com/wiki/Outreach-Program-for-Women-2014

Information for Students

These ideas were contributed by our developers and our community; they are only meant to be a starting point. If you wish to submit a proposal based on these ideas, you may want to contact the developers and find out more about the particular suggestion you're looking at.

Being accepted as a Google Summer of Code student is quite competitive. Accepted students typically have thoroughly researched the technologies of their proposed project and have been in frequent contact with potential mentors. Simply copying and pasting an idea here will not work. On the other hand, creating a completely new idea without first consulting potential mentors is unlikely to work out.

If there is no specific contact given you can ask questions via @TwitterOSS or via the twitter-gsoc mailing list.

Accepted Projects

For 2014, @TwitterOSS accepted 9 students to work on 7 different open source projects:

The project details are listed below:

Use zero-copy read path in Parquet

  • Brief explanation: Read performance can be improved by putting less pressure on memory bandwidth (https://github.com/Parquet/parquet-mr/issues/287)
  • Expected results:
    • Use of the new ByteBuffer-based APIs in Hadoop to improve read performance:
      • HDFS-2834 ByteBuffer-based read API for DFSInputStream
      • HADOOP-8148 ByteBuffer based codec API
    • Backward compatibility with Hadoop 1.0
  • Mentor: (@gerashegalov) and Julien Le Dem (@J_)
  • Student: Sunyu Duan [email protected]

Netty: Pluggable algorithm to choose next EventLoop

  • Brief explanation: Currently, when a new EventLoop is needed to register a Channel, our EventLoopGroup implementations just use a round-robin-like algorithm to choose the next EventLoop. Unfortunately this is not good enough, as some EventLoops may become busier than others over time. This is especially true when your application handles different kinds of connections (long-lived and short-lived). We should allow plugging in different algorithms for choosing the next EventLoop that take various feedback into account. See #1230
  • Expected results:
    • A user can plug different implementations into the EventLoopGroup constructors.
    • Incorporate changes that gather feedback from the EventLoops about how busy they are, so that custom implementations can make use of this information.
  • Knowledge Prerequisite:
    • Java
    • Concurrency and multithreaded programming
    • Experience with building a network application atop Netty
  • Mentor: Norman Maurer (@normanmaurer)
  • Student: Jakob Buchgraber [email protected]

Various compression codecs for Netty

  • Brief explanation: Netty is an asynchronous event-driven network application framework for rapid development of high-performance protocol servers and clients. Compression codecs cut traffic and let applications transfer large amounts of data over the network faster and more effectively, while a variety of codecs allows picking the optimal one for a specific problem. Documentation and examples will help users adopt any of the compression codecs in their projects.
  • Expected results:
  • An abstraction layer for uniform and convenient use of compression algorithms.
  • Adapt the already implemented compression codecs (Zlib and Snappy) to the new abstraction layer.
  • Implement compression codecs such as LZF, LZ4, QuickLZ, LZMA and Bzip2.
  • Compare the implemented compression codecs (performance / compression ratio).
  • Examples that demonstrate how to use the compression codecs in users' projects.
  • Documentation of new developments and changes.
  • Knowledge Prerequisite:
  • Java
  • Concurrency and multithreaded programming
  • Experience with building a network application atop Netty
  • Mentor: Trustin Lee (@trustin)
  • Student: Idel Pivnitskiy (@Pivnitskiy) [email protected]
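
The codec-comparison goal above can be sketched with Python's standard-library codecs as stand-ins (zlib, bz2 and lzma here; the actual project targets LZF, LZ4, QuickLZ, LZMA and Bzip2 inside Netty):

```python
# Sketch only: stdlib codecs stand in for the Netty codec survey.
import bz2
import lzma
import time
import zlib

def compare_codecs(payload: bytes):
    """Compare compression ratio and wall-clock time for each codec."""
    codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        compressed = compress(payload)
        results[name] = {
            "ratio": len(compressed) / len(payload),   # smaller is better
            "seconds": time.perf_counter() - start,
        }
    return results

sample = b"the quick brown fox jumps over the lazy dog " * 200
for name, stats in compare_codecs(sample).items():
    print(f"{name}: ratio={stats['ratio']:.3f} time={stats['seconds']:.4f}s")
```

A real benchmark would, of course, use representative network payloads rather than repetitive text, since compression ratios vary wildly with the data.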

Aurora: building a logging and analytics framework

  • Brief explanation: Add code to gather centralized statistics about how people use Aurora/Mesos, and build tools to analyze it.
  • Expected results:
  • First, we identify the requirements on the logging system for Aurora. We study existing solutions such as Flume, Scribe, Chukwa and Kafka. The results shall be put in a report similar to the Wikimedia Foundation's report on their choice of a logging solution. The report should be ready for submission by the end of the bonding period. We will also identify any missing functionality in the chosen system.
  • Second, we start working on implementing any missing functionality and integrating the chosen system with Aurora. Any functionality added to the logging system shall be pushed to the respective open-source project.
  • Finally, once we have the logging system in place, we will design and build the analytics module. This system will support both simple queries, such as the example given by Mark, "Show all of the update commands that resulted in a rollback between 12:00 and 2pm.", and more complex ones like "Show the correlation between failures and number of jobs", "Detect anomalies in the logged data for the past 10 days" or "What is the distribution of job execution times?". The analytics tool(s) will be written mainly in Python and R, bridged via RPy2, while leveraging the power of MapReduce when needed. The tool will be built to be modular to allow for future extensions and updates. The analysis reports will be in both textual and visual formats, e.g., histograms, box plots, CDFs and so on, to aid Aurora users and cluster managers in making informed decisions.
  • This proposal is thus a merger between AURORA-256 and AURORA-257 with some added functionalities.
  • Knowledge Prerequisite: Java/Python
  • JIRA Issue: AURORA-217 and AURORA-256
  • Mentor: Mark Chu-Carroll (@MarkCC)
  • Student: Ahmed Ali-Eldin (@ahmedaley) [email protected]
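
The simple-query tier of the analytics module might look like the following sketch; the record fields (ts, command, outcome) are invented for illustration, since the real schema depends on the logging system chosen in phase one:

```python
# Sketch of the "rollbacks between 12:00 and 2pm" style of query.
from datetime import datetime

def rollbacks_between(records, start, end):
    """Return update commands that ended in a rollback inside [start, end]."""
    return [
        r for r in records
        if r["command"] == "update"
        and r["outcome"] == "rollback"
        and start <= r["ts"] <= end
    ]

log = [
    {"ts": datetime(2014, 5, 1, 12, 30), "command": "update", "outcome": "rollback"},
    {"ts": datetime(2014, 5, 1, 15, 0),  "command": "update", "outcome": "rollback"},
    {"ts": datetime(2014, 5, 1, 13, 0),  "command": "update", "outcome": "success"},
]
hits = rollbacks_between(log, datetime(2014, 5, 1, 12), datetime(2014, 5, 1, 14))
print(len(hits))  # 1
```

The more complex correlation and anomaly-detection queries would sit on top of the same record stream, in R via RPy2 as the proposal describes.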

Android Support For Pants

  • Brief explanation: Add Android support to the Pants build system.
  • Expected results: Pants can compile Android based applications.
  • Knowledge Prerequisite: Python, Java, Android
  • Mentor: Travis Crawford (@tc) and John Sirois (@johnsirois)
  • Student: Mateo Rodriguez (@mateornaut) [email protected]

Finagle: pure zookeeper client

  • Brief explanation: ZooKeeper is the open-source cluster-membership library that we use at Twitter; right now the integration is done through the standard ZooKeeper client library. We would like to implement a ZooKeeper client purely in Finagle.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Evan Meagher (@evanm), Marius Eriksen (@marius) and Steve Gury (@stevegury)
  • Student: Pierre-Antoine Ganaye (@trypag) [email protected]

Finagle: SMTP client

  • Brief Explanation: Implement the SMTP protocol in finagle. Finagle supports many protocols and it is currently possible to build your app by writing fully-async, efficient clients and servers so long as you're using these protocols. But for sending emails, we still need to Await.result(...) using blocking email libraries like javamail or Apache commons-email. Let's make it possible to build fully async, efficient apps that also send email!
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Steve Gury (@stevegury) and Selvin George (@selvin)
  • Student: Lera Dymbitska (@suncelesta) [email protected]

Summingbird: Addition of Tez backend for offline batch compute

  • Brief explanation: Tez (http://tez.incubator.apache.org) is a new Apache incubator project that generalizes and expands the map/reduce model of computation. Summingbird should be able to automatically take advantage of map-reduce-reduce plans and other optimizations that Tez enables. This should perform better than the existing Hadoop-via-Cascading-via-Scalding backend that is currently available.
  • Knowledge Prerequisite: Need to know scala (or strong knowledge of Java with some functional programming background).
  • Need to be somewhat familiar with Hadoop, Yarn.
  • Mentor: Alex Levenson ([email protected]) and Jonathan Coveney (@jco)
  • Student: Camelia Ciolac (@CC_Camelia) [email protected]

Analyze Wikipedia using Cassovary

  • Brief explanation: Study the Wikipedia graph formed by the links between Wikipedia articles. Load this graph into Cassovary. Analyze the graph to make observations about Wikipedia pages such as linkage, similarity, clustering, etc. Use this graph to do entity resolution on arbitrary text.
  • Expected results: Cassovary modified as appropriate to do this analysis. Hopefully new findings about the Wikipedia graph structure. A working prototype of entity extraction.
  • Knowledge Prerequisite: Scala, Graph algorithms
  • Mentor: Pankaj Gupta (@pankaj) and Ajeet Grewal (@ajeet)
  • Student: Szymon Matejczyk (@szymonmatejczyk) [email protected]

Adding a Proposal

Please follow this template:

  • Brief explanation:
  • Expected results:
  • Knowledge Prerequisite:
  • Mentor:

When adding an idea to this section, please try to include the following data.

If you are not a developer but have a good idea for a proposal, get in contact with relevant developers first or @TwitterOSS.

Project Ideas

A good starting point for Finagle is the Quickstart: http://twitter.github.io/finagle/guide/Quickstart.html

You could also start digging in the code here: https://github.com/twitter/finagle/

Check out the Finagle mailing list if you have any questions.

Distributed debugging (DTrace-like instrumentation for distributed systems)

  • Brief explanation: DTrace is a very powerful and versatile tool for debugging local applications. We would like to employ similar types of instrumentation on a cluster of machines that form a distributed system, tracing requests based on specific conditions like the state of the server.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

System profiler

  • Brief explanation: Being able to analyze the performance characteristics of a server based on the requests that pass through it (where the latency comes from, ...)
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Kerberos authentication in Mux

  • Brief explanation: Mux is a new RPC session protocol in use at Twitter. We would like to add kerberos authentication.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

Pure finagle zookeeper client

  • Brief explanation: ZooKeeper is the open-source cluster-membership library that we use at Twitter; right now the integration is done through the standard ZooKeeper client library. We would like to implement a ZooKeeper client purely in Finagle.
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Marius Eriksen (@marius) and Steve Gury (@stevegury)

finagle-smtp

  • Brief Explanation: Implement the SMTP protocol in finagle.
    • Finagle supports many protocols and it is currently possible to build your app by writing fully-async, efficient clients and servers so long as you're using these protocols.
    • But for sending emails, we still need to Await.result(...) using blocking email libraries like javamail or Apache commons-email.
    • Let's make it possible to build fully async, efficient apps that also send email!
  • Knowledge Prerequisite: Scala, Distributed systems
  • Mentor: Steve Gury (@stevegury) and Selvin George (@selvin)

Aurora CLI Improvements

  • Add new functionality to the Aurora client CLI to make programmers' lives easier.
  • Knowledge prereq: Python
  • Mentor: Mark Chu-Carroll (@MarkCC)
  • JIRA Issue: AURORA-217

Aurora Configuration Documentation

  • Add documentation for the Pystachio framework used for managing configurations in Aurora.
  • Knowledge prereq: Python (weak)
  • Mentor: Mark Chu-Carroll (@MarkCC)

Aurora Analytics

  • Add code to gather centralized statistics about how people use Aurora/Mesos, and build tools to analyze it.
  • Knowledge prereq: Java/Python.
  • Mentor: Mark Chu-Carroll (@MarkCC)
  • JIRA Issue: AURORA-217

Aurora/Mesos Client Generalization

  • Aurora provides an interface to scheduling, running, and monitoring commands on a Mesos cluster. Other frameworks besides Aurora provide similar capabilities for some key functions. A general command client could abstract details of specific frameworks, and provide these functions for all frameworks.
  • Knowledge prereq: Java/C++
  • Mentor: Mark Chu-Carroll (@MarkCC)
  • JIRA Issue: AURORA-256

Libprocess Benchmark Suite

  • Brief explanation: Implement a benchmark suite for libprocess to identify potential performance improvements and test for performance regressions.
  • Knowledge Prerequisite: C++
  • Mentor: Ben Mahler (@bmahler) Jie Yu (@jie_yu)
  • JIRA Issue: MESOS-1018

Slave Unregistration

Mesos CLI improvements

  • Brief explanation: Add new functionality to the Mesos CLI to make developers' lives easier.
  • Knowledge Prerequisite: C++
  • Mentor: Benjamin Hindman (@benh)
  • JIRA Issue: MESOS-1016

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

Addition of Akka backend for streaming compute

  • Brief explanation: Akka (http://akka.io) is a popular open-source distributed actor system. Integrating it into Summingbird would increase the range of potential compute platforms for users, making the system more accessible and suitable for more varied tasks.
  • Knowledge Prerequisite: Need to know scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop.
  • Mentor: Jonathan Coveney (@jco)

Addition of Samza backend for streaming compute

  • Brief explanation: Samza (http://samza.incubator.apache.org/) is a new Apache incubator project that allows compute to be placed between two Kafka streams. Integrating it into Summingbird would increase the range of potential compute platforms for users, making the system more accessible and suitable for more varied tasks.
  • Knowledge Prerequisite: Need to know scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop, Yarn.
  • Mentor: Ian O'Connell (@0x138)

Addition of Tez backend for offline batch compute

  • Brief explanation: Tez (http://tez.incubator.apache.org) is a new Apache incubator project that generalizes and expands the map/reduce model of computation. Summingbird should be able to automatically take advantage of map-reduce-reduce plans and other optimizations that Tez enables. This should perform better than the existing Hadoop-via-Cascading-via-Scalding backend that is currently available.
  • Knowledge Prerequisite: Need to know scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop, Yarn.
  • Mentor: Ian O'Connell (@0x138)

Addition of batch key/value store on Mesos or Yarn

  • Brief explanation: Something that is sorely missing from the open source release of scalding is a good batch-writable read-only key-value store to use for batch jobs. This could be something like ElephantDB (https://github.com/nathanmarz/elephantdb) or HBase. Having such a project set up with Summingbird would be a huge coup for the open-source community.
  • Knowledge Prerequisite: Need to know Scala (or strong knowledge of Java with some functional programming background). Ideally familiar with Mesos or YARN, and low-latency key-value stores like HBase or ElephantDB.
  • Mentor: Jonathan Coveney (@jco)

Scalding is Twitter's library for programming in Scala on Hadoop. It is approachable by newcomers with a fields/data-frame-like API as well as a type-safe API. There is also a linear algebra API to support working with giant matrices and vectors on Hadoop.

Auto-tuning for Scalding

  • Brief explanation: Scalding is a mature scala API for programming Hadoop, but it does no optimization automatically. There are several places where we could improve tuning: 1) we could look at the input sizes for all the sources and use that to estimate reducers 2) we can use some history service that gives expected input/output sizes of all the nodes in the plan 3) we can use the previous two to automatically switch to cascading local (in-memory) mode when the input and output is small enough, thus avoiding Hadoop. A stretch goal could be using the previous history to make a plan: for instance, when should we push mappings across nodes in cases other than filters (which are obvious).
  • Expected results: A version of scalding that can be merged into the develop branch that automatically tunes reducer settings.
  • Knowledge Prerequisite: Need to know scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop.
  • Mentor: Oscar Boykin @posco
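
Idea (1) above, estimating reducers from total input size, might be sketched like this; the 1 GB-per-reducer target and the clamping bounds are invented defaults, not values from Scalding:

```python
# Sketch: pick a reducer count from the summed sizes of all input sources.
import math

def estimate_reducers(input_bytes, bytes_per_reducer=1 << 30,
                      min_reducers=1, max_reducers=5000):
    """Target roughly bytes_per_reducer of input per reducer, clamped."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(min_reducers, min(wanted, max_reducers))

print(estimate_reducers(10 * (1 << 30)))  # 10 reducers for 10 GB of input
```

Ideas (2) and (3) would refine the same decision with historical input/output sizes per node, falling back to local mode when the estimate is small enough.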

Productionize the scalding REPL

  • Brief explanation: scalding has a repl, where users can enter commands and see them run. There are a few issues: 1) it does not currently detect what pipes need to be reevaluated 2) it cannot load files and jars and execute user scripts 3) the design of scalding usually assumes units (jobs) that don't interact well with more immutable functional style (because the plan is mutated by the job).
  • Expected results: We want a scalding executable that does the standard imports, can interact with a cluster or local mode, supports EMR, and does not inefficiently repeatedly compute data.
  • Knowledge Prerequisite: Need to know scala (or strong knowledge of Java with some functional programming background). Need to be somewhat familiar with Hadoop. Must be familiar with graphs for modeling flows of computation.
  • Mentor: Oscar Boykin @posco

Cassovary

Join the cassovary mailing list and ask questions there: https://groups.google.com/forum/#!forum/twitter-cassovary

Analyze Wikipedia using Cassovary

  • Brief explanation: Study the Wikipedia graph formed by the links between Wikipedia articles. Load this graph into Cassovary. Analyze the graph to make observations about Wikipedia pages such as linkage, similarity, clustering, etc. Use this graph to do entity resolution on arbitrary text.
  • Expected results: Cassovary modified as appropriate to do this analysis. Hopefully new findings about the Wikipedia graph structure. A working prototype of entity extraction.
  • Knowledge Prerequisite: Scala, Graph algorithms
  • Mentor: Pankaj Gupta (@pankaj) and Ajeet Grewal (@ajeet)
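
One candidate similarity measure for the link graph is Jaccard similarity over outgoing-link sets; the toy adjacency map below is a stand-in for the Wikipedia graph that would be loaded into Cassovary:

```python
# Sketch: pages are "similar" when their outgoing-link sets overlap.
def jaccard(a, b):
    """Jaccard similarity of two sets: |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b) if a | b else 0.0

links = {
    "Cat": {"Mammal", "Pet", "Felidae"},
    "Dog": {"Mammal", "Pet", "Canidae"},
    "Rust": {"Programming language", "Mozilla"},
}
print(jaccard(links["Cat"], links["Dog"]))  # 0.5
```

At Wikipedia scale the same idea would run over Cassovary's in-memory adjacency lists rather than Python dictionaries.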

New Example: Convert Twisitor to use Flight

  • Brief explanation: Flight can benefit from more examples. This project will convert Twisitor to use Flight and make it a featured, more complex Flight example.
  • Expected results: 1) Convert Twisitor to use Flight and document it thoroughly. 2) Improve the Flight documentation with the new example. 3) Create unit tests for Twisitor. 4) Optional: add more features to Twisitor.
  • Knowledge Prerequisite: Javascript, HTML
  • Mentor: Chris Aniszczyk (@cra) and Angus Croll (@angustweets)

TODO

https://github.com/Parquet/parquet-mr/issues?labels=GSoC-2014&state=open

Parquet compatibility across tools

  • Brief explanation: Develop cross tools compatibility tests for parquet (https://github.com/Parquet/parquet-mr/issues/300)
  • Expected results:
    • Compatibility of nested data types across tools - pig, hive, avro, thrift etc.
    • Automated compatibility check between java implementation and impala (across release versions)
  • Knowledge Prerequisite: Java, Hadoop, Test frameworks
  • Mentor: Aniket Mokashi (@aniket486) and Julien Le Dem (@J_)

Parquet optimizations from stats support

  • Brief explanation: Develop optimizations for parquet using page statistics (https://github.com/Parquet/parquet-mr/issues/301)
  • Expected results:
    • In parquet-format 2.0, data page header in parquet-format supports Statistics. We need to add optimizations to make use of these page statistics.
    • Index support for parquet
    • Explore use of probabilistic data structures in parquet (CountMinSketch etc.)
  • Knowledge Prerequisite: Java
  • Mentor: Aniket Mokashi (@aniket486) and Julien Le Dem (@J_)
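
The core of the page-statistics optimization is pruning pages whose [min, max] range cannot satisfy a predicate. A minimal sketch, with an invented page layout (the real structure lives in the parquet-format Statistics):

```python
# Sketch: skip pages whose min/max statistics rule out an equality predicate.
def pages_to_read(pages, value):
    """pages: list of (min_value, max_value, payload) tuples."""
    return [p for p in pages if p[0] <= value <= p[1]]

pages = [(0, 99, "page0"), (100, 199, "page1"), (200, 299, "page2")]
print([p[2] for p in pages_to_read(pages, 150)])  # ['page1']
```

Range predicates, index support, and probabilistic structures such as CountMinSketch extend the same skip-or-read decision with richer per-page metadata.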

Use zero-copy read path in Parquet

  • Brief explanation: Read performance can be improved by putting less pressure on memory bandwidth (https://github.com/Parquet/parquet-mr/issues/287)
  • Expected results:
    • Use of the new ByteBuffer-based APIs in Hadoop to improve read performance:
      • HDFS-2834 ByteBuffer-based read API for DFSInputStream
      • HADOOP-8148 ByteBuffer based codec API
    • Backward compatibility with Hadoop 1.0
  • Mentor: (@gerashegalov) and Julien Le Dem (@J_)
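
The zero-copy idea can be illustrated in miniature with mmap and memoryview, which expose file bytes without an intermediate copy; this is only an analogy for the ByteBuffer read path of HDFS-2834, not parquet-mr code:

```python
# Sketch: memoryview over an mmap reads file bytes without copying them
# into a separate buffer first, loosely analogous to ByteBuffer reads.
import mmap
import os
import tempfile

def checksum_zero_copy(path):
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        view = memoryview(m)   # no copy of the underlying bytes
        total = sum(view) % 256
        view.release()         # release the view before the mmap closes
        return total

fd, path = tempfile.mkstemp()
os.write(fd, bytes(range(10)))
os.close(fd)
print(checksum_zero_copy(path))  # 45
os.remove(path)
```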

POJO conversion

Decouple Parquet from the Hadoop API

Study state of the art floating point compression algorithms

(https://github.com/Parquet/parquet-mr/issues/306)

  • Brief explanation: Study existing lossless floating point compression papers and implement benchmarks.
  • Expected results: Provide a reference implementation and a benchmark comparison, with integration into the Parquet library.
  • Mentor: Julien Le Dem (@J_)

You can learn more about getting involved with the Netty Project here: http://netty.io/community.html

Android testsuite

  • Brief explanation:
    • The Netty project team wants to support Android 4.0 Ice Cream Sandwich officially, and we need an automated testsuite to achieve that goal.
  • Expected results:
    • During the build process, an Android emulator is automatically started and stopped to run all (or applicable) JUnit tests inside the Android emulator.
    • The result of the JUnit tests inside the emulator affects the build result so that we can run the Android compatibility test in our CI machine.
    • All Android compatibility issues found during the test are fixed.
  • Knowledge Prerequisite:
    • Java and Android programming
    • Custom JUnit runners
    • Experience with building a network application atop Netty
  • Mentor: Trustin Lee (@trustin)

OpenSSL-based SSL handler

  • Brief explanation:
    • There were some attempts to implement faster SSL connection in Netty using OpenSSL, but they were all limited to wrapping OpenSSL with JDK's SSLEngine API. Instead, we want to see a Netty ChannelHandler that calls OpenSSL via JNI directly for maximum performance.
  • Expected results:
    • A ChannelHandler implementation that calls OpenSSL via JNI directly.
    • No dependencies on libapr or libtcnative
    • Better performance than the legacy approach (OpenSSLEngine from libtcnative + SslHandler)
    • Primary build target must be Linux.
  • Knowledge Prerequisite:
    • Java and JNI
    • C programming on Linux
    • SSL in general and OpenSSL library
    • Experience with building a network application atop Netty
  • Mentor: Trustin Lee (@trustin)

HTTP/2 codec

  • Brief explanation:
    • Netty currently lacks an official HTTP/2 codec, and we want one.
  • Expected results:
    • An HTTP/2 codec that enables a user to run an HTTP/2 client and server.
    • Examples that demonstrate how to write an HTTP/2 client and server
    • The HTTP/2 codec shares the message types with the existing HTTP/1 codec. You'll end up making necessary changes in the existing message types or introducing new ones derived from them.
  • Knowledge Prerequisite:
    • Java
    • Socket programming
    • HTTP/1.1 and HTTP/2.0 protocol
    • Asynchronous programming - Futures and Promises
    • Experience with building a network application atop Netty
  • Mentor: Trustin Lee (@trustin)

Dynamic channel migration between different event loops

  • Brief explanation:
    • When a new channel is created, Netty assigns a thread to it, and it never changes during the life cycle of the channel. This can be a problem when a certain event loop is too crowded while others are not, because there's no way to migrate the channels in the busy event loop to the idle ones.
  • Expected results:
    • A user can migrate an arbitrary channel from one event loop to the other event loop.
    • The event delivery order is maintained during the migration.
    • The implementation is scalable enough that migrating many channels at once does not impact performance.
  • Knowledge Prerequisite:
    • Java
    • Concurrency and multithreaded programming
    • Reactor pattern and event-driven programming
    • Experience with building a network application atop Netty
  • Mentor: Trustin Lee (@trustin)

Pluggable algorithm to choose next EventLoop

  • Brief explanation:
    • Currently, when a new EventLoop is needed to register a Channel, our EventLoopGroup implementations just use a round-robin-like algorithm to choose the next EventLoop. Unfortunately this is not good enough, as some EventLoops may become busier than others over time. This is especially true when your application handles different kinds of connections (long-lived and short-lived). We should allow plugging in different algorithms for choosing the next EventLoop that take various feedback into account. See #1230
  • Expected results:
    • A user can plug different implementations into the EventLoopGroup constructors.
    • Incorporate changes that gather feedback from the EventLoops about how busy they are, so that custom implementations can make use of this information.
  • Knowledge Prerequisite:
    • Java
    • Concurrency and multithreaded programming
    • Experience with building a network application atop Netty
  • Mentor: Norman Maurer (@normanmaurer)
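
The pluggable-chooser idea can be sketched as strategy functions; the names EventLoop, round_robin and least_busy below are invented for illustration and are not Netty API:

```python
# Sketch: two interchangeable "next EventLoop" strategies. A least-busy
# strategy uses feedback (here, a pending-work count) that round-robin ignores.
import itertools

class EventLoop:
    def __init__(self, name):
        self.name = name
        self.pending = 0   # stand-in for the busyness feedback signal

def round_robin(loops):
    """Today's behavior: cycle through loops regardless of load."""
    it = itertools.cycle(loops)
    return lambda: next(it)

def least_busy(loops):
    """A feedback-aware alternative: pick the loop with the least work."""
    return lambda: min(loops, key=lambda l: l.pending)

loops = [EventLoop("a"), EventLoop("b")]
loops[0].pending = 10          # "a" is stuck with long-lived channels
choose = least_busy(loops)
print(choose().name)  # b
```

In Netty terms, the constructor-injected strategy would replace the hard-wired round-robin in the EventLoopGroup implementations.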

Expose Metrics hooks

  • Brief explanation:
    • Netty currently exposes no metrics at all. While it is possible to gather some metrics by implementing a custom ChannelHandler, some of the interesting metrics (like metrics on the EventLoops) are not accessible from there. As Netty itself should not depend on any other metrics library, we should add hooks that allow plugging in different implementations, and also ship with optional implementations (a good start would be [Yammer Metrics](http://metrics.codahale.com))
  • Expected results:
    • Identify useful metrics
    • Incorporate changes to allow hooking in a metrics API where needed
    • Implement some optional implementations using [Yammer Metrics](http://metrics.codahale.com)
  • Knowledge Prerequisite:
    • Java
    • Concurrency and multithreaded programming
    • Metrics
    • Experience with building a network application atop Netty
  • Mentor: Norman Maurer (@normanmaurer)
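
The hook pattern being asked for, a no-op default so the core never depends on a metrics library, might look like this sketch (the names are illustrative, not Netty API):

```python
# Sketch: the core code calls a metrics hook; the default implementation
# does nothing, so no metrics library is ever a required dependency.
class NullMetrics:
    def mark(self, name, value=1):
        pass  # default: no-op

class CountingMetrics(NullMetrics):
    """A pluggable implementation; a real one might wrap Yammer Metrics."""
    def __init__(self):
        self.counts = {}

    def mark(self, name, value=1):
        self.counts[name] = self.counts.get(name, 0) + value

def handle_events(events, metrics=NullMetrics()):
    for _ in events:
        metrics.mark("events.handled")   # hook point inside the core loop

m = CountingMetrics()
handle_events(range(3), m)
print(m.counts["events.handled"])  # 3
```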

For more information about Pants, check these out:

Android support for Pants

  • Brief explanation: Add Android support to the Pants build system.
  • Expected results: Pants can compile Android based applications.
  • Knowledge Prerequisite: Python, Java, Android
  • Mentor: Travis Crawford (@tc) and John Sirois (@johnsirois)

C/C++ support for Pants

  • Brief explanation: Add C/C++ support to the Pants build system.
  • Expected results: Pants can compile C/C++ based applications.
  • Knowledge Prerequisite: Python, C/C++
  • Mentor: Travis Crawford (@tc) and John Sirois (@johnsirois)

Eclipse Integration

  • Brief explanation: Add Eclipse integration to Pants
  • Expected results: Create a classpath container based on integrating with Pants and a launcher.
  • Knowledge Prerequisite: Python, Java, Eclipse
  • Mentor: Travis Crawford (@tc) and Chris Aniszczyk (@cra)

Zipkin in a box

  • Brief explanation: Getting an initial instance of Zipkin running is tedious. This should be quick, simple, and straightforward. Anyone wanting to try Zipkin should be able to spin up a test instance with very little fuss. It should be capable of receiving, storing, and indexing spans and displaying them via the web UI. It should be capable of receiving Thrift-serialized spans via Scribe or directly, and JSON-encoded spans via HTTP. Possible packaging solutions are a Vagrant instance, a Docker image, and/or a single zip that can be downloaded, unpacked and run.
  • Expected results: A Zipkin instance that can be setup in seconds to receive and display data.
  • Knowledge Prerequisite: Scala, SQL, some Ruby
  • Mentor: Jeff Smick (@sprsquish)

Convert to Arel

  • Brief explanation: The activerecord-reputation-system gem helps you build a reputation system on top of activerecord and rails. Currently this gem does not return ActiveRecord::Relation objects, which means calls to reputation system cannot be chained or composed. This makes it difficult to compose another framework (e.g. an ACL system) with the reputation system. For example, queries like "what is the karma of all the users who have access to project 'foo'" are not readily possible.
  • Expected results: Should be able to chain/compose reputation system calls with other ActiveRecord::Relation objects
  • Knowledge Prerequisite: Ruby, Arel, Activerecord
  • Mentor: Sumit Shah (@bigloser) and Cameron Dutro (@camertron)

Improve string collation implementation

  • The Twitter CLDR Ruby gem provides a basic implementation of string collation (locale-aware sorting), but there are a number of ways it can be improved:
  • Add support for script reordering, which allows sorting characters from a native script before characters from other scripts (e.g., sorting Cyrillic characters before Latin ones in the Russian locale). More info.
  • Switch from deprecated XML syntax for collation rules to the basic one. More info.
  • Address issues with ignoring denormalized code points in the Collation Elements Table. More info.
  • Expected results: Fixing all or some of the issues listed above and achieving better parity with Unicode Collation Algorithm implementation from ICU library.
  • Knowledge Prerequisite: Ruby.
  • Mentor: Kiryl Lashuk (@KL7) and Cameron Dutro (@camertron).
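
Script reordering can be sketched by keying each word on its leading script; the code-point range below is the Unicode Cyrillic block, while a real collator would use full UCA collation elements rather than raw code points:

```python
# Sketch: sort Cyrillic words before Latin ones by ranking the first
# character's script, then falling back to plain code-point order.
def script_rank(word):
    cp = ord(word[0]) if word else 0
    return 0 if 0x0400 <= cp <= 0x04FF else 1  # Cyrillic first, then the rest

def sort_cyrillic_first(words):
    return sorted(words, key=lambda w: (script_rank(w), w))

print(sort_cyrillic_first(["banana", "апельсин", "apple", "яблоко"]))
# ['апельсин', 'яблоко', 'apple', 'banana']
```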

Port missing features from Ruby gem to JavaScript library

  • The Twitter CLDR JavaScript library is still missing a lot of features that are available in the Ruby gem. Among them:
  • Text segmentation
  • Rule-based numbers formatting
  • Localization of language codes
  • String collation (though this feature might be a bit too heavy for a JavaScript library)
  • etc.
  • Expected results: Having a wider range of Twitter CLDR features available in the JavaScript version of the library.
  • Knowledge Prerequisite: JavaScript, CoffeeScript, Ruby.
  • Mentor: Kiryl Lashuk (@KL7) and Cameron Dutro (@camertron).

Content Security Policy 1.1 Support

  • Content Security Policy (CSP) is a way to whitelist capabilities (such as running inline JavaScript, using eval, loading content from specific hosts, etc.) in a browser.
  • CSP 1.1 provides mechanisms for whitelisting specific pieces of inline JavaScript without allowing inline JavaScript globally. This creates a good balance between security and development speed.
  • Leveraging this programmatically inside the Rails framework would be very powerful. This would be a continuation of the original proof of concept
    • Automatic calculation, generation, and application of nonces/hashes to inline script elements
    • Environment-specific modes of operation (i.e. apply dynamically in development, but pre-calculate and apply statically in test/staging/production/etc).
    • Create an easy-to-use API that supports at least ERB and Mustache templates (Haml, Slim, etc. for bonus points)
  • Expected results: A strategy that allows developers to safely and transparently apply a CSP that disallows inline JavaScript, except for the desired whitelisted elements.
  • Knowledge Prerequisite: Knowledge of the Rails framework and experience with Content Security Policy would be helpful, but not required.
  • Mentor: Neil Matatall (https://twitter.com/ndm), Justin Collins (https://twitter.com/presidentbeef), Matthew Finifter (https://twitter.com/mfinifter)
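
The hash half of the nonce/hash step computes a CSP hash-source token from the inline script's text (base64 of its SHA-256 digest, per the CSP 1.1 hash-source draft). A minimal sketch:

```python
# Sketch: compute the 'sha256-...' token that whitelists one inline script.
import base64
import hashlib

def csp_sha256_token(script_text: str) -> str:
    digest = hashlib.sha256(script_text.encode("utf-8")).digest()
    return "'sha256-%s'" % base64.b64encode(digest).decode("ascii")

token = csp_sha256_token("alert('hi');")
print(token)  # goes into the script-src directive of the policy header
```

The Rails integration would run this (or generate a nonce) over every inline script element at render time in development, and statically at deploy time elsewhere.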

Project

Project URL

Project Idea (e.g., New Feature)

  • Brief explanation:
  • Expected results:
  • Knowledge Prerequisite:
  • Mentor:

General Proposal Requirements

Proposals will be submitted via http://www.google-melange.com/gsoc/homepage/google/gsoc2014, therefore plain text is the best way to go. We expect your application to be in the range of 1000 words. Anything less than that will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, but feel free to include anything that you think is relevant:

  • Please include your name and twitter handle!
  • Title of your proposal
  • Abstract of your proposal
  • A link to your github id (if you have one)
  • Detailed description of your idea, including an explanation of why it is innovative
  • Description of previous work and existing solutions (links to prototypes and a bibliography are more than welcome)
  • Mention the details of your academic studies, any previous work, internships
  • Any relevant skills that will help you to achieve the goal (programming languages, frameworks)?
  • Any previous open-source projects (or even previous GSoC) you have contributed to?
  • Do you plan to have any other commitments during SoC that may affect your work? Any vacations/holidays planned?
  • Contact details

Good luck!
