Using ade to analyse Spark logs

Ayush Shridhar edited this page Aug 27, 2020 · 3 revisions

ADE on Spark logs

Along with RFC3164/RFC5424-format Linux syslogs, ADE can also be run on Spark logs. However, we need to tell ade explicitly that we're using Spark logs, since it assumes syslog input otherwise. As a starting point, we need to edit the setup file to add this parameter.

Usage

Here's what a relevant section of the setup.props file looks like in an ordinary case:

# --------------------------------------------------------------------
# AdeExt properties
# --------------------------------------------------------------------
adeext.msgRateReportFreq=5
adeext.msgRateMsgToKeep=1000
adeext.parseErrorToKeep=100
adeext.parseErrorDaysTolerate=2
adeext.parseErrorTrackNullComponent=false
adeext.runtimeModelDataStoreAtSource=true
adeext.useSparkLogs=true

adeext.msgRate10MinSlotsToKeep=24
adeext.msgRate10MinSubIntervalList=1,2,3,6,12,24
adeext.msgRateMergeSource=true

# --------------------------------------------------------------------
# Paths
# (ade.flowLayoutFileSpark and ade.analysisGroupToFlowNameMapperClassSpark
#  are only used when ade.useSparkLogs=true)
# --------------------------------------------------------------------

ade.useSparkLogs=true
ade.flowLayoutFile=conf/xml/FlowLayout.xml
ade.flowLayoutFileSpark=conf/xml/FlowLayoutSpark.xml
ade.outputPath=output/
ade.analysisOutputPath=output/continuous
ade.xml.xsltDir=conf/xml
ade.criticalWords.file=conf/criticalWords.txt
ade.analysisGroupToFlowNameMapperClass=org.openmainframe.ade.ext.os.LinuxAnalysisGroupToFlowNameConstantMapper
ade.analysisGroupToFlowNameMapperClassSpark=org.openmainframe.ade.ext.os.SparkAnalysisGroupToFlowNameConstantMapper
ade.outputFilenameGenerator=org.openmainframe.ade.ext.output.ExtOutputFilenameGenerator
ade.inputTimeZone=GMT+00:00
ade.outputTimeZone=GMT

The ade.useSparkLogs parameter can be toggled to indicate whether we're analyzing Spark logs (true) or syslogs (false).
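One quick way to flip the flag is with sed. This is just a sketch: it operates on a demo file in /tmp so it can be run safely; point PROPS at your actual conf/setup.props instead (and keep a backup).

```shell
# Work on a demo copy; substitute the path to your real conf/setup.props.
PROPS=/tmp/setup.props.demo
printf 'adeext.useSparkLogs=false\nade.useSparkLogs=false\n' > "$PROPS"   # demo contents

# Flip both Spark flags to true (GNU sed in-place edit).
sed -i 's/^adeext\.useSparkLogs=.*/adeext.useSparkLogs=true/' "$PROPS"
sed -i 's/^ade\.useSparkLogs=.*/ade.useSparkLogs=true/' "$PROPS"

grep useSparkLogs "$PROPS"   # both lines should now end in =true
```

Setting the value back to false (or removing the property) returns ade to its default syslog behavior.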

Internals

Internally, Spark log analysis is very similar to syslog analysis, with a few subtle changes. At the heart of it is SparklogLineParser, which parses a single Spark message from the log file, using regular-expression matching to extract the timestamp, text, source, component, and other relevant fields. This information is used by SparkLogParser to send data to SparklogMessageReader, which processes it and sends it directly to the output stream.
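The regex-matching idea can be sketched in a few lines of standalone Java. This is not ADE's actual SparklogLineParser, just an illustration of extracting fields from one line in Spark's default log4j console layout ("yy/MM/dd HH:mm:ss LEVEL Component: message"); the class and pattern here are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: shows how a single Spark log line can be split
// into timestamp, severity, component, and message text with one regex.
public class SparkLineSketch {
    // Matches Spark's default log4j layout: "yy/MM/dd HH:mm:ss LEVEL Component: message"
    static final Pattern LINE = Pattern.compile(
        "^(\\d{2}/\\d{2}/\\d{2} \\d{2}:\\d{2}:\\d{2}) (\\w+) ([^:]+): (.*)$");

    public static void main(String[] args) {
        String line = "16/06/20 10:45:45 INFO SparkContext: Running Spark version 1.6.1";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            System.out.println("timestamp = " + m.group(1));  // 16/06/20 10:45:45
            System.out.println("severity  = " + m.group(2));  // INFO
            System.out.println("component = " + m.group(3));  // SparkContext
            System.out.println("text      = " + m.group(4));  // Running Spark version 1.6.1
        }
    }
}
```

ADE's real parser does more than this (source extraction, error tracking, multi-line messages), but the component and timestamp fields pulled out here are what drive the downstream grouping and interval bookkeeping.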

(Figure: Spark parsing pipeline)

Running the model

The easiest way to run ade on Spark data is to run the spark_analysis_comp_test.sh script (similar to running ade on syslogs). Prerequisites: Java 8 and Apache Derby. Suppose we have Derby and ade-1.0.4 installed in the home (~) directory. To run the test, execute the following statements:

>>> cd ~
>>> ./db-derby-10.11.1.1-bin/bin/startNetworkServer       # start the derby database server
>>> cd ade-1.0.4
>>> ./bin/test/spark_analysis_comp_test.sh

The training data and analysis data are stored in ade-1.0.4/baseline/spark/upload/ and ade-1.0.4/baseline/spark/analyze/, respectively. The script performs the following steps:

  1. Create a temporary database
  2. Upload the training data to the database
  3. Train the model groups
  4. Use the trained model groups to analyze the analysis data

If you're interested in invoking the individual steps yourself rather than calling the script, you'd need to use the controldb create, controldb upload, train all, and analyze commands. You can read more about them in the ade command summary page of this wiki.
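The four steps above map onto those commands roughly as follows. This is a sketch, not a verified transcript: the bin/ prefix is an assumption about how the wrappers are laid out in an ade installation, and the required arguments (log file paths, date ranges) are deliberately left out; consult the ade command summary for the real signatures.

```shell
# Hedged sketch of the manual equivalent of spark_analysis_comp_test.sh.
cd ~/ade-1.0.4
bin/controldb create      # step 1: create the database
bin/controldb upload      # step 2: upload the training data (pass the log files as arguments)
bin/train all             # step 3: train the model groups
bin/analyze               # step 4: analyze new data against the trained models
```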