High Level Architecture Design

Welcome to the goffish_v3 wiki!

Broad classification of required components:

  1. Programming Model
    Specifies the programming abstraction provided by the framework, against which applications are written. Popular choices of programming model include vertex-centric and subgraph-centric.
  • Synchronous vs. asynchronous: BSP, BAP
  • General-purpose, vertex-centric, graph-centric
  2. Runtime Model
    Responsible for dynamic operations performed at run time: how the job executes and, given an underlying storage model, where workers are spawned.
  3. Storage Model
    Deals with efficient storage and retrieval of various types of graphs, including their persistent storage.
  4. Admin Model
    Responsible for admin tasks including logging, setup, UI, etc. Necessary parameters to log include time (for each phase), messages, synchronization time, etc.

Different types of input, application and architecture [1][2]

  • For applications to leverage the three dimensions: input, application, and architecture.

Specific requirements:

Data Model

What types of applications/algorithms will benefit from these models? Motivation.

  • Support for time series graphs and dynamic graphs [1][2]

    • Large static graphs should be supported.
      • Does large mean the entire graph fits in distributed memory? Or do we also support "out of core" loading and execution? For out of core, do we persist only the delta state and not the original topology/attributes already in storage? What happens to messages intended for offline parts?
      • TODO: Check if other frameworks support offloading parts of the graph to disk and incrementally loading "partitions" that are active. [DK]
    • Property graphs should be supported.
      • <name, type, value> set for each vertex and edge (a data model sketch appears after this list).
      • Fixed schema for all vertices and all edges? Or a finite set of vertex and edge types, with a fixed schema per type? Or no fixed pre-defined schema, with each vertex/edge carrying its own set of <N,T,V>?
      • TODO: Understand what property graph data models exist in the literature. Consider knowledge graphs, RDF graphs. [AS]
    • Time series graphs should be natively supported by the framework
      • Timestamp for each graph instance, with monotonically increasing timestamps across successive graph instances.
      • Can the timestamp for each vertex and edge of a graph instance differ from the graph timestamp? Or have a start and end time for an instance, with all vertex/edge timestamps falling between them? Non-overlapping start/end times across graph instances may not always be possible.
      • TODO: Look at kineograph on how TS graphs are supported by data model. [RD]
    • Dynamic graphs (slow changing structure, but related graphs) should be natively supported by the framework.
      • There is continuity in the structure of the graph, with topology deltas between "instances". E.g., a social network graph or the NELL graph whose structure changes over time. Property values may change too. How about property types?
      • TODO: Look at kineograph on how dynamic graphs are supported by data model. Look at requirements for dynamic knowledge graphs. [RD, AS]
    • Collections of graphs (unrelated graphs) should be natively supported by the framework.
      • Different graphs with nothing necessarily common between them. Perform join, union, etc. across graph collections for queries, etc.
      • TODO: Look at collections of property graphs. What do graph DBs support? [AS]
  • Dynamic arrival of graphs [2]

    • Timeseries graphs
      • Vertex or edge property updates arrive as a stream. Only a subset of vertices/edges, and a subset of their properties, may change; the rest retain their values from the last updated timestamp. Example applications include maintaining the shortest path between two vertices; cf. Charith's work at USC on integrating stream processing with graph processing.
    • Dynamic graphs
      • Vertex or edge topo updates come as a stream.
  • Modify topology, property values and schema at runtime [1][2]

    • Insertion and deletion of vertices and edges, and changing edge directions in the graph structure. This includes large deltas.
      • TODO: Look into types of graph mutations supported on graphs [DP]
    • Changing the values for vertices and edges at runtime.
      • TODO: Look into the types of property graph updates and vertex/edge value updates supported by graph DBs [AS, DK]
    • Changing the schema for vertices and edges at runtime.
      • TODO: Look into types of property graph updates supported on graph DBs [AS]
  • Flexible graph API [1]

    • A flexible graph API to support various graph data structures as per the application's requirements.
    • Custom classes for graph data structures: e.g., edge list, adjacency list, user-defined
      • TODO: Examples from Giraph [DK]
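
A minimal sketch (in Java) of the schema-less <name, type, value> option and a timestamped graph instance for time-series graphs, as discussed above. All class and field names are illustrative assumptions, not part of any existing GoFFish API; edges would mirror the vertex structure.

```java
import java.util.HashMap;
import java.util.Map;

// One entry of a vertex's or edge's <name, type, value> property set.
class Property {
    final String name;
    final Class<?> type;   // e.g. Long.class, String.class
    final Object value;

    Property(String name, Class<?> type, Object value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }
}

// A vertex in a schema-less property graph: each vertex carries its own property set.
class Vertex {
    final long id;
    final Map<String, Property> properties = new HashMap<>();

    Vertex(long id) { this.id = id; }

    void setProperty(String name, Class<?> type, Object value) {
        properties.put(name, new Property(name, type, value));
    }
}

// One instance of a time-series graph, tagged with a timestamp that increases
// monotonically across successive instances.
class GraphInstance {
    final long timestamp;
    final Map<Long, Vertex> vertices = new HashMap<>();

    GraphInstance(long timestamp) { this.timestamp = timestamp; }
}
```

The per-type schema option would instead hold the <name, type> pairs once per vertex/edge type and store only values on each element, trading flexibility for memory.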

Programming Model

The programming model that should be exposed to programmers for using the framework.

  • Synchronous component-centric models

    • Vertex-, edge-, subgraph/block- (WCC, SCC), and partition-centric, or a mix of these. E.g., Blogel does vertex-centric then block-centric in each superstep, Giraph++ is partition-centric, and the Gather-Apply-Scatter (GAS) model is edge-centric.
    • Use a "filter" to define components dynamically, e.g., ARABESQUE
    • Allow sampling of vertices from a graph to decide the active ones
    • Local supersteps within a global superstep, as in Giraph Unchained (BAP)
  • Asynchronous

    • e.g., GraphLab
  • Composition of phases, tasks etc [1][2]

    • Applications can be designed as a collection of phases executed in a specific order.
    • There may be a fixed number of phases, or a "while(condition)" where a phase is executed till some condition is met (e.g., tested by a master; for k-means, a master tests edge cuts)
    • Have init, compute, and conclude parts; each phase can have different logic for these parts (see the sketch after this list).
    • Each phase may execute on the same or a different TS graph. What about dynamic graphs and collections?
    • Each part may execute for a fixed or a variable number (VTH) of supersteps, or a master may decide. E.g., PageRank has 30 iterations; other algorithms are VTH.
    • Master-compute model, global variables (aggregators)
    • Init and conclude phases before/after each graph? Before/after the application?
    • What messaging APIs are exposed between components, phases, graphs, etc.?
  • Transaction jobs and long running jobs [1][2]

    • Support for both types of processing: batch and real-time.
    • Support for collaborating with other types of processing units to accomplish any complex/multi-stage job
      • Ref: epiCG: A GraphUnit Based Graph Processing Engine on epiC
  • Operations on meta-graph [1][2]

    • Processing the meta-graph instead of the whole graph.
    • Ability to run analytics on the meta-graph before running the actual algorithm, the result of which can be used for resource allocation.
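
A minimal sketch (in Java) of how the phase-based, component-centric ideas above could fit together: phases with init/compute/conclude parts, BSP supersteps with vote-to-halt, component-to-component messaging, and a master-compute hook that can end a phase. The interfaces and method names are assumptions for discussion, not the GoFFish or Giraph API.

```java
import java.util.List;

// Messaging between components within a superstep.
interface Messenger<M> {
    void sendTo(long componentId, M message);
}

// One phase of an application, executed as BSP supersteps over a component
// (a vertex, subgraph, or partition, depending on the chosen model).
interface Phase<C, M> {
    void init(C component);      // before the first superstep of this phase

    // Returns true to vote to halt; the phase ends when all components have
    // halted and no messages are in flight, or when the master ends it.
    boolean compute(C component, Iterable<M> messages, Messenger<M> messenger, long superstep);

    void conclude(C component);  // after the last superstep of this phase
}

// Master-compute hook: runs once per superstep, sees global aggregators, and
// can end the phase (e.g., after 30 PageRank iterations or a convergence test).
interface Master {
    boolean haltPhase(long superstep);
}

// An application is an ordered composition of phases, possibly run over
// successive instances of a time-series graph.
interface Application<C, M> {
    List<Phase<C, M>> phases();
}
```

A vertex-centric instantiation binds the component type C to a single vertex, while a subgraph- or partition-centric one binds it to a whole subgraph or partition.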

Runtime Model

  • Repartitioning on the fly [2]
    Repartitioning at run time to support efficient processing of the graph. When dealing with dynamic graphs with topological mutations, this feature is necessary so that the workload stays evenly distributed among all the workers. It may also be helpful if we want to switch to a different partitioning method (overhead unknown?).

  • Worker task allocation and scheduling [2]
    Allocation of tasks to worker nodes and scheduling of those tasks. Ability to specify the subgraphID, workerID, and hostID for a vertex in the input format itself (see the sketch after this list).

  • Dynamic mapping of workers to VMs [2]
    Dynamic mapping to efficiently use resources and reduce execution time. Ability to specify threshold values for parameters like outgoing messages and compute time, beyond which the mapping will be recomputed.

  • Pre-computation stats [1][2]
    Stats to gain insights about the graph and to drive dynamic operations.
    Ability to compute and store general statistics of a graph, such as degree distribution and attribute value distribution.

  • Checkpoints [2][3]
    Checkpoints to restart and migrate graph data and worker tasks, to help recovery in case of failures.
    Fault tolerance for the underlying storage model (replication?).

  • Support for scale in/out for VMs [2]
    The same deployment can be used with a subset of the available VMs.
    Dynamic scale-in/out support for VMs, as in Storm.
    Supports elastic scalability in a distributed environment.

  • Efficient partitioning technique [1][2][3]
    To reduce communication overhead and optimize locality.

    • METIS, Graph Voronoi Diagram Partitioner, 2D Partitioner
    • Custom partitioner (WCC, SCC)
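
A hypothetical sketch (in Java) of an adjacency-list input line carrying the placement hints named under worker task allocation above (subgraphID, workerID, hostID per vertex). The tab-separated layout, field order, and the VertexInputLine class are assumptions, not an existing GoFFish input format.

```java
// Assumed input layout, one vertex per line:
//   vertexID <TAB> subgraphID <TAB> workerID <TAB> hostID <TAB> neighbour1,neighbour2,...
class VertexInputLine {
    long vertexId;
    long subgraphId;   // hint: which subgraph this vertex belongs to
    int workerId;      // hint: which worker should own that subgraph
    String hostId;     // hint: which host the worker should be placed on
    long[] neighbours;

    static VertexInputLine parse(String line) {
        String[] f = line.split("\t");
        VertexInputLine v = new VertexInputLine();
        v.vertexId = Long.parseLong(f[0]);
        v.subgraphId = Long.parseLong(f[1]);
        v.workerId = Integer.parseInt(f[2]);
        v.hostId = f[3];
        String[] nbrs = f[4].isEmpty() ? new String[0] : f[4].split(",");
        v.neighbours = new long[nbrs.length];
        for (int i = 0; i < nbrs.length; i++) {
            v.neighbours[i] = Long.parseLong(nbrs[i]);
        }
        return v;
    }
}
```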

Storage Model

  • Updates to storage [3]
    Writing changes to a graph back to the file system.

  • Scale out and in the storage [3]
    Native support for horizontal scaling of storage.

  • Message packing and reliability [2]
    Efficient packing of messages and their reliable delivery.

  • Write out non-graph data to file system [3]
    Write results to the file system. Ability to store any object to the file system (indexes, statistics, etc.); see the sketch after this list.

  • Efficient partitioning technique [1][2][3]
    To reduce communication overhead and optimize locality.

    • METIS, Graph Voronoi Diagram Partitioner, 2D Partitioner
    • Custom partitioner (WCC, SCC)
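
A sketch of the "write out non-graph data" requirement above, assuming an HDFS-backed deployment: a helper that persists a precomputed degree histogram through the Hadoop FileSystem API. The StatsWriter class and the output layout are illustrative assumptions, not an existing GoFFish component.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Persists a non-graph object (here, a degree histogram) to the shared file
// system so that later jobs can reuse it, e.g., for resource allocation.
public class StatsWriter {
    public static void writeDegreeHistogram(long[] histogram, String outputPath) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // HDFS or local FS, per the configuration
        try (FSDataOutputStream out = fs.create(new Path(outputPath), true /* overwrite */)) {
            for (int degree = 0; degree < histogram.length; degree++) {
                out.writeBytes(degree + "\t" + histogram[degree] + "\n");
            }
        }
    }
}
```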

Admin Features

  • Logging, setup, UI, reliability [4]
    Admin tasks to ease development of applications.
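
As an illustration of the logging parameters listed at the top of this page (time per phase, messages, synchronization time), a per-superstep log record could look like the sketch below (in Java); the field set and CSV layout are assumptions, not an existing logging format.

```java
// One log record emitted by a worker per phase and superstep.
class SuperstepLog {
    String jobId;
    int phase;
    long superstep;
    long computeTimeMs;     // time spent in user compute logic
    long syncTimeMs;        // time spent waiting at the superstep barrier
    long messagesSent;
    long messagesReceived;

    @Override
    public String toString() {
        return String.join(",", jobId, String.valueOf(phase), String.valueOf(superstep),
                String.valueOf(computeTimeMs), String.valueOf(syncTimeMs),
                String.valueOf(messagesSent), String.valueOf(messagesReceived));
    }
}
```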