Feature_1_0_Syslog_Event_Documentation
{{toc}}
The intent of this document is to describe at a high level the overall goal of this feature, and to describe in more detail an initial implementation of a few specific events.
Eventually, we want to capture lots of events from the Aeolus system (and possibly other cloud-related systems, like Katello). These events may range from ‘Instance Start’ to ‘Role Assigned’ to ‘Image Build Started’ to ‘System Message’. This is quite a broad gamut, and far too much to attempt to implement all at one time. However, as detailed in aeolus-umbrella:Event_Format, we plan to start off using syslog for this, and eventually (if there is demand) add an API around it to request or be notified about certain categories of events. If and when we get to that point, there would likely be a separate ‘Event API’ type of subsystem, since the reporters are highly dispersed. Using syslog for the small subset of events to follow allows us to:
- Offload the parsing of these events to a system that specializes in such things (Splunk, for instance).
- Test out use of syslog from a single component (conductor) to evaluate any issues (performance, poor support in a given language, etc) without compromising the goal or system in question.
- Determine the feasibility of having all cloud components send their events to a central location/file/etc.
As stated previously, the idea is to capture a small number of events to start. In this way, we can provide some useful metrics for our users and test the event logging direction at the same time. The current hope is that if there are performance/integration issues, they can be remedied before we get to critical mass with regard to the number of events we are capturing. If this is not the case, we will not have invested too much effort on a dead end. That said, syslog is a robust and much-used system, so hopefully we do not find it to be a dead end.
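To make this concrete, here is a minimal sketch of what emitting one event from conductor via Ruby’s stdlib Syslog bindings might look like. The ‘aeolus-conductor’ tag, the LOG_LOCAL0 facility, and the message layout are all assumptions for illustration, not a finalized design:
<code class="ruby">
require 'syslog'

# Minimal sketch only: the tag, facility, and message layout here are
# assumptions, not the agreed-upon format.
Syslog.open('aeolus-conductor', Syslog::LOG_PID, Syslog::LOG_LOCAL0) do |log|
  # Messages are printf-style format strings, so values go in as args
  log.info('020001 instance_id=%s owner=%s', 'some-uuid', 'jsmith')
end
</code>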
As a first pass at capturing some events to syslog, we intend to capture Events 020001 and 020002 listed below. After this list is a brief description of what each chunk of information is.
{{include (aeolus-umbrella:EventDefinitionResource)}}
In addition to the keywords listed above, each event will have a ‘tag’ identifier to specify which component it comes from (conductor, factory, etc). The exact names for the tags are still TBD. Some of these fields are fairly self-explanatory, but each has a description anyway, just to be clear.
- Event Id - Unique id for this type of event (in these examples, it would be 020001 or 020002)
- Instance Id - UUID from the conductor system identifying a specific instance
- Deployment Id - UUID from the conductor system identifying the specific deployment this instance is part of
- Image Id - Identifier that denotes the Image that this particular Instance was launched from.
- Owner - User name of the person who owns this instance or deployment
- Pool - Name of the Pool where this Instance/Deployment was launched
- Provider - Provider where this Deployment was launched (example: “ec2-us-east-1”)
- Provider Type - Type of cloud this provider is (‘EC2’, ‘RHEV-M’, etc)
- Provider Account - Account used on the specified provider, denoted by username on provider
- Hardware Profile - Name of Hardware Profile a particular Instance is running on (example: “m1.large”)
- Start Reason - Initially this will be something simple like ‘User Initiated’, but may eventually contain other values, like ‘Scale’
- Start Time - Time this Instance began running
- Terminate Reason - Initially this will be something simple like ‘User Requested’, but may eventually contain other values, like ‘System Crash’ or ‘Quota Exceeded’
- Terminate Time - Time this Instance was terminated.
- Deployable - UUID of the deployable (what a deployment is launched from)
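Putting these together, a purely hypothetical 020001 (instance started) message might be assembled from the fields above like this. The key=value layout, field names, and ordering are illustrative only, not a finalized wire format:
<code class="ruby">
# Hypothetical sketch: key=value layout and the leading event id are
# assumptions for illustration, not a finalized wire format.
fields = {
  :instance_id      => 'some-instance-uuid',
  :deployment_id    => 'some-deployment-uuid',
  :owner            => 'jsmith',
  :pool             => 'default',
  :provider         => 'ec2-us-east-1',
  :hardware_profile => 'm1.large'
}
message = '020001 ' + fields.map { |k, v| "#{k}=#{v}" }.join(' ')
</code>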
In further discussion of these events with Scott Seago, he was of the mind that we are really exposing a subset of the data already modelled by the Conductor. This led to a decision to add any further data needed for these events to the conductor model, rather than to some other, new place.
<code class="ruby">
module Aeolus
  module Event
    class Base
      attr_accessor :event_id

      # Returns a comma-separated list of the fields that changed since the
      # last event, e.g. 'start_time,provider,reason'
      def changed_fields
        @changed_fields ||= calculate_changed
      end

      # Calls any required transformation methods to output properly to the
      # given target
      def process(output_target='syslog', source='conductor', uuid=nil)
      end

      protected

      # Compares attrs with an 'old_' prefix against their current values and
      # updates @changed_fields as needed. (This could also live inside the
      # changed_fields getter if that made more sense.)
      def calculate_changed
      end
    end
  end
end
</code>
The net result is a call from conductor, in an after_save hook, something like:
<code class="ruby">
cur_cidr = Aeolus::Event::Cidr.new("list of attr values for this event type")
cur_cidr.process # ignore the other params for now; they default to what we need anyway
</code>
where cur_cidr is an instance of a class extending Aeolus::Event::Base, adding the attributes for that specific event, as listed above.
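A minimal sketch of what such a subclass might look like follows; the attribute names simply mirror the field list above and are assumptions, not a final class definition:
<code class="ruby">
# Minimal sketch only: attribute names mirror the field list above and are
# assumptions, not the final class definition.
module Aeolus
  module Event
    class Cidr < Base
      attr_accessor :instance_id, :deployment_id, :image_id, :owner, :pool,
                    :provider, :provider_type, :provider_account,
                    :hardware_profile, :start_reason, :start_time
    end
  end
end
</code>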
For the CIDR event, we need to add:
* start, stop, and reason fields to the conductor instance model
** There is a wrinkle here in that once conductor supports reboot, we will need to add that to the model as well. This work could happen at the same time as we add the start/stop fields, and requires more thought.
** Start and stop time may already be captured by dbomatic. If so, reuse that and add it to what is saved on instance state update. If not, we may need to see if it can be added; dbomatic seems the most sensible place to do so.
* an after_save hook that populates the CIDR event object and calls Aeolus::Event.send on instance start or stop (see the sketch after this list)
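As a hypothetical sketch of that hook, assuming conductor’s Instance is an ActiveRecord model; the method, attribute, and state names here are illustrative, not the actual conductor code:
<code class="ruby">
# Hypothetical sketch, assuming an ActiveRecord Instance model; method,
# attribute, and state names are illustrative, not the actual conductor code.
class Instance < ActiveRecord::Base
  after_save :emit_cidr_event

  private

  # Fire a CIDR event when the instance transitions to running or stopped
  def emit_cidr_event
    return unless state_changed? && %w(running stopped).include?(state)
    event = Aeolus::Event::Cidr.new(self) # populate attrs from this instance
    event.process # defaults to syslog output from conductor
  end
end
</code>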
CDDR should be almost the same, but are start/stop times something we can get here? Scott? I don’t see them in the model; we will need to see if dbomatic captures this, or really even what it would mean (I think it means the entire deployment has finished starting/stopping).
NOTE: One alternative implementation would be to add audit tables for both instance and deployment (and later, other models that throw events). This feels more scalable to me longer term, but it is unclear whether we should pursue it right now due to time constraints.
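For illustration only, a hypothetical migration for the audit-table alternative might look like the following; the table and column names are assumptions, not an agreed design:
<code class="ruby">
# Hypothetical sketch of the audit-table alternative; table and column names
# are assumptions, not an agreed design.
class CreateInstanceEvents < ActiveRecord::Migration
  def self.up
    create_table :instance_events do |t|
      t.references :instance
      t.string     :event_type # e.g. 'start', 'stop'
      t.string     :reason
      t.datetime   :event_time
    end
  end

  def self.down
    drop_table :instance_events
  end
end
</code>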