-
Notifications
You must be signed in to change notification settings - Fork 553
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
14 changed files
with
559 additions
and
5 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# Logram | ||
|
||
Logram is an automated log parsing technique, which leverages n-gram dictionaries to achieve efficient log parsing. | ||
|
||
Read more information about Logram from the following paper: | ||
|
||
+ Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun (Peter) Chen. [Logram: Efficient Log Parsing Using n-Gram | ||
Dictionaries](https://arxiv.org/pdf/2001.03038.pdf), *IEEE Transactions on Software Engineering (TSE)*, 2020. | ||
|
||
### Running | ||
|
||
The code has been tested in the following environment: | ||
+ python 3.7.6 | ||
+ regex 2022.3.2 | ||
+ pandas 1.0.1 | ||
+ numpy 1.18.1 | ||
+ scipy 1.4.1 | ||
|
||
Run the following scripts to start the demo: | ||
|
||
``` | ||
python demo.py | ||
``` | ||
|
||
Run the following scripts to execute the benchmark: | ||
|
||
``` | ||
python benchmark.py | ||
``` | ||
|
||
### Benchmark | ||
|
||
Running the benchmark script on the Loghub_2k datasets, you can obtain the following results. | ||
|
||
| Dataset | F1_measure | Accuracy | | ||
|:-----------:|:----------|:--------| | ||
| HDFS | 0.990518 | 0.93 | | ||
| Hadoop | 0.78249 | 0.451 | | ||
| Spark | 0.479691 | 0.282 | | ||
| Zookeeper | 0.923936 | 0.7235 | | ||
| BGL | 0.956032 | 0.587 | | ||
| HPC | 0.993748 | 0.9105 | | ||
| Thunderbird | 0.993876 | 0.554 | | ||
| Windows | 0.913735 | 0.694 | | ||
| Linux | 0.541378 | 0.361 | | ||
| Android | 0.975017 | 0.7945 | | ||
| HealthApp | 0.587935 | 0.2665 | | ||
| Apache | 0.637665 | 0.3125 | | ||
| Proxifier | 0.750476 | 0.5035 | | ||
| OpenSSH | 0.979348 | 0.6115 | | ||
| OpenStack | 0.742866 | 0.3255 | | ||
| Mac | 0.892896 | 0.568 | | ||
|
||
|
||
### Citation | ||
|
||
:telescope: If you use our logparser tools or benchmarking results in your publication, please kindly cite the following papers. | ||
|
||
+ [**ICSE'19**] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. [Tools and Benchmarks for Automated Log Parsing](https://arxiv.org/pdf/1811.03509.pdf). *International Conference on Software Engineering (ICSE)*, 2019. | ||
+ [**DSN'16**] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](https://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf). *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)*, 2016. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from .src.Logram import * |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,183 @@ | ||
# ========================================================================= | ||
# Copyright (C) 2016-2023 LOGPAI (https://github.com/logpai). | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# ========================================================================= | ||
|
||
|
||
import sys | ||
sys.path.append("../../") | ||
from logparser.Logram import LogParser | ||
from logparser.utils import evaluator | ||
import os | ||
import pandas as pd | ||
|
||
|
||
input_dir = "../../data/loghub_2k/" # The input directory of log file | ||
output_dir = "Logram_result/" # The output directory of parsing results | ||
|
||
# Per-dataset configuration for the Loghub_2k benchmark. Each entry holds:
#   log_file:        path of the 2k-line sample log, relative to input_dir
#   log_format:      header template; each <Field> becomes a named capture
#                    group, and <Content> is the free-text message to parse
#   regex:           preprocessing patterns whose matches are masked as <*>
#   doubleThreshold / triThreshold:
#                    NOTE(review): presumably the 2-gram / 3-gram dictionary
#                    frequency thresholds from the Logram paper — confirm
#                    against the LogParser implementation.
benchmark_settings = {
    "HDFS": {
        "log_file": "HDFS/HDFS_2k.log",
        "log_format": "<Date> <Time> <Pid> <Level> <Component>: <Content>",
        "regex": [
            r"blk_(|-)[0-9]+",  # block id
            r"(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)",  # IP
            r"(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$",
        ],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Hadoop": {
        "log_file": "Hadoop/Hadoop_2k.log",
        "log_format": "<Date> <Time> <Level> \[<Process>\] <Component>: <Content>",
        "regex": [r"(\d+\.){3}\d+"],
        "doubleThreshold": 9,
        "triThreshold": 10,
    },
    "Spark": {
        "log_file": "Spark/Spark_2k.log",
        "log_format": "<Date> <Time> <Level> <Component>: <Content>",
        "regex": [r"(\d+\.){3}\d+", r"\b[KGTM]?B\b", r"([\w-]+\.){2,}[\w-]+"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Zookeeper": {
        "log_file": "Zookeeper/Zookeeper_2k.log",
        "log_format": "<Date> <Time> - <Level> \[<Node>:<Component>@<Id>\] - <Content>",
        "regex": [r"(/|)(\d+\.){3}\d+(:\d+)?"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "BGL": {
        "log_file": "BGL/BGL_2k.log",
        "log_format": "<Label> <Timestamp> <Date> <Node> <Time> <NodeRepeat> <Type> <Component> <Level> <Content>",
        "regex": [r"core\.\d+"],
        "doubleThreshold": 92,
        "triThreshold": 4,
    },
    "HPC": {
        "log_file": "HPC/HPC_2k.log",
        "log_format": "<LogId> <Node> <Component> <State> <Time> <Flag> <Content>",
        "regex": [r"=\d+"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Thunderbird": {
        "log_file": "Thunderbird/Thunderbird_2k.log",
        "log_format": "<Label> <Timestamp> <Date> <User> <Month> <Day> <Time> <Location> <Component>(\[<PID>\])?: <Content>",
        "regex": [r"(\d+\.){3}\d+"],
        "doubleThreshold": 35,
        "triThreshold": 32,
    },
    "Windows": {
        "log_file": "Windows/Windows_2k.log",
        "log_format": "<Date> <Time>, <Level> <Component> <Content>",
        "regex": [r"0x.*?\s"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Linux": {
        "log_file": "Linux/Linux_2k.log",
        "log_format": "<Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>",
        "regex": [r"(\d+\.){3}\d+", r"\d{2}:\d{2}:\d{2}"],
        "doubleThreshold": 120,
        "triThreshold": 100,
    },
    "Android": {
        "log_file": "Android/Android_2k.log",
        "log_format": "<Date> <Time> <Pid> <Tid> <Level> <Component>: <Content>",
        "regex": [
            r"(/[\w-]+)+",
            r"([\w-]+\.){2,}[\w-]+",
            r"\b(\-?\+?\d+)\b|\b0[Xx][a-fA-F\d]+\b|\b[a-fA-F\d]{4,}\b",
        ],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "HealthApp": {
        "log_file": "HealthApp/HealthApp_2k.log",
        "log_format": "<Time>\|<Component>\|<Pid>\|<Content>",
        "regex": [],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Apache": {
        "log_file": "Apache/Apache_2k.log",
        "log_format": "\[<Time>\] \[<Level>\] <Content>",
        "regex": [r"(\d+\.){3}\d+"],
        "doubleThreshold": 15,
        "triThreshold": 10,
    },
    "Proxifier": {
        "log_file": "Proxifier/Proxifier_2k.log",
        "log_format": "\[<Time>\] <Program> - <Content>",
        "regex": [
            r"<\d+\ssec",
            r"([\w-]+\.)+[\w-]+(:\d+)?",
            r"\d{2}:\d{2}(:\d{2})*",
            r"[KGTM]B",
        ],
        "doubleThreshold": 500,
        "triThreshold": 470,
    },
    "OpenSSH": {
        "log_file": "OpenSSH/OpenSSH_2k.log",
        "log_format": "<Date> <Day> <Time> <Component> sshd\[<Pid>\]: <Content>",
        "regex": [r"(\d+\.){3}\d+", r"([\w-]+\.){2,}[\w-]+"],
        "doubleThreshold": 88,
        "triThreshold": 81,
    },
    "OpenStack": {
        "log_file": "OpenStack/OpenStack_2k.log",
        "log_format": "<Logrecord> <Date> <Time> <Pid> <Level> <Component> \[<ADDR>\] <Content>",
        "regex": [r"((\d+\.){3}\d+,?)+", r"/.+?\s", r"\d+"],
        "doubleThreshold": 30,
        "triThreshold": 25,
    },
    "Mac": {
        "log_file": "Mac/Mac_2k.log",
        "log_format": "<Month> <Date> <Time> <User> <Component>\[<PID>\]( \(<Address>\))?: <Content>",
        "regex": [r"([\w-]+\.){2,}[\w-]+"],
        "doubleThreshold": 2,
        "triThreshold": 2,
    },
}
|
||
# Run the Logram parser on every dataset and evaluate against ground truth.
# Fix: the accumulator and the output CSV were misspelled "bechmark".
benchmark_result = []  # one [dataset, F1_measure, accuracy] row per dataset
for dataset, setting in benchmark_settings.items():
    print("\n=== Evaluation on %s ===" % dataset)
    indir = os.path.join(input_dir, os.path.dirname(setting["log_file"]))
    log_file = os.path.basename(setting["log_file"])

    parser = LogParser(
        log_format=setting["log_format"],
        indir=indir,
        outdir=output_dir,
        rex=setting["regex"],
        doubleThreshold=setting["doubleThreshold"],
        triThreshold=setting["triThreshold"],
    )
    parser.parse(log_file)

    # Compare the parser's structured output with the labeled ground truth.
    F1_measure, accuracy = evaluator.evaluate(
        groundtruth=os.path.join(indir, log_file + "_structured.csv"),
        parsedresult=os.path.join(output_dir, log_file + "_structured.csv"),
    )
    benchmark_result.append([dataset, F1_measure, accuracy])

print("\n=== Overall evaluation results ===")
df_result = pd.DataFrame(benchmark_result, columns=["Dataset", "F1_measure", "Accuracy"])
df_result.set_index("Dataset", inplace=True)
print(df_result)
# NOTE(review): filename typo fixed ("bechmark" -> "benchmark"); update any
# downstream consumer that expected the old misspelled name.
df_result.to_csv("Logram_benchmark_result.csv", float_format="%.6f")
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
#!/usr/bin/env python | ||
|
||
import sys | ||
sys.path.append('../../') | ||
from logparser.Logram import LogParser | ||
|
||
# Demo: parse the HDFS 2k sample log with the Logram parser.
input_dir = "../../data/loghub_2k/HDFS/"  # directory holding the input log
output_dir = "demo_result/"  # directory for the structured parsing results
log_file = "HDFS_2k.log"  # the input log file name
log_format = "<Date> <Time> <Pid> <Level> <Component>: <Content>"  # HDFS log format

# Optional preprocessing patterns (default: []); matches are masked as <*>.
regex = [
    r"blk_(|-)[0-9]+",  # block id
    r"(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)",  # IP
    r"(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$",  # Numbers
]
doubleThreshold = 15
triThreshold = 10

parser = LogParser(
    log_format,
    indir=input_dir,
    outdir=output_dir,
    rex=regex,
    doubleThreshold=doubleThreshold,
    triThreshold=triThreshold,
)
parser.parse(log_file)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
pandas | ||
regex==2022.3.2 | ||
numpy | ||
scipy |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
""" | ||
This file is modified from: | ||
https://github.com/BlueLionLogram/Logram/tree/master/Evaluation | ||
""" | ||
|
||
import regex as re | ||
|
||
# Default masking patterns (HDFS-oriented). Each match is replaced with "<*>"
# during preprocessing; order matters because patterns are applied in turn.
MyRegex = [
    r"blk_(|-)[0-9]+",  # block id
    r"(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)",  # IP
    r"(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$",  # Numbers
]
|
||
|
||
def preprocess(logLine, specialRegex):
    """Mask the variable parts of *logLine*, replacing each match with "<*>".

    A single space is prepended so look-behind patterns such as
    ``(?<=[^A-Za-z0-9])`` can match at the very start of the line; callers
    strip the result before tokenizing.
    """
    # Bug fix: apply every pattern cumulatively. The original re-applied each
    # pattern to the raw logLine, so only the LAST pattern's substitutions
    # survived the loop (and the loop variable shadowed the `re`-like module
    # name `regex`).
    line = " " + logLine
    for pattern in specialRegex:
        line = re.sub(pattern, "<*>", line)
    return line
|
||
|
||
def tokenSpliter(logLine, regex, specialRegex):
    """Split a raw log line into the tokens of its <Content> field.

    Parameters
    ----------
    logLine : str
        Raw log line.
    regex : compiled pattern
        Header pattern (e.g. from ``regexGenerator``) that must define a
        named group ``Content``.
    specialRegex : list[str]
        Masking patterns forwarded to ``preprocess``.

    Returns
    -------
    tuple
        ``(tokens, message)`` — the token list and the raw <Content> string,
        or ``(None, None)`` when the line does not match the header pattern.
    """
    match = regex.search(logLine.strip())
    if match is None:
        # Bug fix: `message` was previously left unbound on this path, so
        # `return tokens, message` raised UnboundLocalError for any line
        # that did not match the header pattern.
        return None, None
    message = match.group("Content")
    line = preprocess(message, specialRegex)
    tokens = line.strip().split()
    return tokens, message
|
||
|
||
def regexGenerator(logformat):
    """Compile a header-matching regex from a log format template.

    Each ``<Field>`` placeholder becomes a non-greedy named group
    ``(?P<Field>.*?)`` and each run of literal spaces becomes ``\\s+``;
    the resulting pattern is anchored to the whole line.
    """
    parts = []
    # re.split with a capturing group alternates literal text (even indices)
    # and "<Field>" placeholders (odd indices).
    for idx, piece in enumerate(re.split(r"(<[^<>]+>)", logformat)):
        if idx % 2 == 1:
            # Placeholder: "<Date>" -> named capture group.
            parts.append("(?P<%s>.*?)" % piece.strip("<>"))
        else:
            # Literal text: collapse space runs into a whitespace matcher.
            parts.append(re.sub(" +", r"\\s+", piece))
    return re.compile("^" + "".join(parts) + "$")
Oops, something went wrong.