
SoK: LLM-based Log Parsing

VIKTOR BECK, AIT Austrian Institute of Technology, Austria


MAX LANDAUER, AIT Austrian Institute of Technology, Austria
MARKUS WURZENBERGER, AIT Austrian Institute of Technology, Austria

FLORIAN SKOPIK, AIT Austrian Institute of Technology, Austria


ANDREAS RAUBER, Vienna University of Technology, Austria
Log data, generated by software systems, provides crucial insights for tasks like monitoring, root cause analysis,
and anomaly detection. Due to the vast volume of logs, automated log parsing is essential to transform semi-
structured log messages into structured representations. Traditional log parsing techniques often require
manual configurations, such as defining log formats or labeling data, which limits scalability and usability.
Recent advances in large language models (LLMs) have introduced the new research field of LLM-based log
parsing, offering potential improvements in automation and adaptability. Despite promising results, there is no
structured overview of these approaches since this is a relatively new research field with the earliest advances
published in late 2023. This paper systematically reviews 29 LLM-based log parsing methods, comparing
their capabilities, limitations, and reliance on manual effort. We analyze the learning and prompt-engineering
paradigms employed, efficiency- and effectiveness-enhancing techniques, and the role of LLMs in the parsing
process. We aggregate the results of the survey in a large table comprising the characterizing features of
LLM-based log parsing approaches and derive the general process of LLM-based log parsing, incorporating all
reviewed approaches in a single flow chart. Additionally, we benchmark seven open-source LLM-based log
parsers on public datasets and critically assess their reproducibility. Our findings summarize the advances of
this new research field and provide insights for researchers and practitioners seeking efficient and user-friendly
log parsing solutions, with all code and results made publicly available for transparency.
Additional Key Words and Phrases: parsing, log data, large language models

1 Introduction
Log data consists of recorded information generated by logging statements in software applications,
systems, devices, or networks during operation. It provides a chronological record of events and
contains valuable insights for various tasks, such as monitoring [6, 60], root cause analysis [19, 46],
and anomaly detection [20, 29]. Given that modern computer systems can generate millions of log
lines per hour, manual processing is infeasible, necessitating fast and scalable automated approaches.
For instance, in 2013 the Alibaba Cloud system was reported to produce approximately 100 to 200
million log lines per hour [43]. However, extracting meaningful information from raw log data
requires transforming it into a structured format. Since logs are typically semi-structured, following
general patterns but containing variable log messages, direct analysis is challenging. To enable log
analytics tasks like anomaly detection, logs must first be parsed into a structured representation,
making log parsing a critical step in log analysis [63].
Log parsing refers to the extraction of structured data from semi-structured logs. Parsing the
logs into the structured format can simply be done by regular expression matching if the correct
log templates are given. Most methods therefore formulate log parsing as a log template extraction
problem [18, 23]. Ground-truth templates originate from logging statements in the source
code, which is often not accessible, so the correct templates are usually unavailable. A wide range of log parsing techniques [12, 25, 33, 51, 52]
have therefore been proposed to overcome this issue.
Authors’ Contact Information: Viktor Beck, AIT Austrian Institute of Technology, Vienna, Austria, [email protected];
Max Landauer, AIT Austrian Institute of Technology, Vienna, Austria, [email protected]; Markus Wurzenberger, AIT
Austrian Institute of Technology, Vienna, Austria, [email protected]; Florian Skopik, AIT Austrian Institute of
Technology, Vienna, Austria, [email protected]; Andreas Rauber, Vienna University of Technology, Vienna, Austria,
[email protected].

Many state-of-the-art log parsing methods require manual configuration or other manual actions
at some point in the parsing process. Approaches may involve labeling of some of the logs from
the training set [23, 33, 66], definition of log formats [35] or regular expressions for preprocessing
[18, 51, 67], or other parameters [67] that fit the parsing algorithm to the data. From a user’s
perspective, the configuration of parsing algorithms might seem overwhelming since it requires
expertise or an analysis of the data at hand. The reason log parsing methods require manual actions
is that humans have both semantic and syntactic understanding [23], but more importantly, a
natural generalization capability and potentially expert knowledge about logs and log parsing. Many
log parsers have already achieved semantic [23, 33, 37] and syntactic abilities [9, 18, 67]. While
syntax-based parsers focus on characteristics like occurrence frequencies, or word or log length,
to determine the constant part of logs, semantic-based parsers leverage the semantic differences
between static and variable parts of log messages [23, 64]. Knowledge about logs and how to parse
them can be learned by language models, such as BERT [11] or its descendants, by fine-tuning
them with logs and their templates [33, 57]. However, these methods often lack generalization and
have to be trained or fine-tuned with labeled logs, preferably with logs of the exact same log type
as the target logs [38]. Generalization arises from pre-training on a large corpus of diverse datasets,
allowing models to perform well on novel tasks; this can be further enhanced through fine-tuning
(FT) [16] or in-context learning (ICL) [5]. This is where large language models (LLMs) come into
play.
With the emergence of LLMs, the new research field of LLM-based log parsing arose. In late
2023, Le and Zhang [32] reported notable performance of ChatGPT1 in a naive setting, where the
LLM was presented with individual logs and simply asked for their templates. Since then, other
approaches adopted LLMs for this purpose as well, building sophisticated frameworks that utilize
LLMs in various ways to either parse logs directly or enhance log parsing, by adopting learning
paradigms for LLMs such as ICL [5] or fine-tuning [16]. Given the fact that the number of unique
templates is significantly lower than the number of logs and the average rate of newly discovered
templates decreases as more logs are processed [23], some LLM-based parsers [23] are even able to
achieve runtime performance comparable with the fast and well-known parser Drain [18].

1 https://openai.com/index/chatgpt/; accessed 10-March-2025

To the best of our knowledge, there is an absence of a structured overview and an objective
comparison of papers and methods concerning LLM-based log parsing, as this is a rather novel
research field which only recently emerged from the popular research field of generative LLMs. So
far there is merely a survey paper about LLM-based log analysis techniques by Akhtar et al. [1],
which provides a rudimentary overview of using LLMs for log parsing and the techniques employed.
Consequently, this work aims to undertake a systematic review of the literature to identify the
commonalities and variabilities of the existing approaches and to present our gained insights. Our
main focus is an overview of LLM-based log parsing with emphasis on the users’ perspective — thus,
how the LLM is applied and how much manual effort the approaches require. The aim is therefore
to provide a comprehensive overview of the existing approaches and techniques, enabling anyone
interested in (LLM-based) log parsing to quickly and easily identify the most suitable solution for
their specific use case.
The aforementioned goals are formulated as the following research questions:
• RQ1: What are the main advantages and disadvantages of LLM-based log parsing approaches
and non-LLM-based approaches?
• RQ2: To what extent do LLM-based log parsing approaches rely on labeled data and manual
configuration?
• RQ3: Which techniques can enhance the efficiency or effectiveness of LLM-based log
parsing?
• RQ4: What experiment design choices influence the comparability of the evaluation results
between LLM-based log parsing approaches?
• RQ5: To what extent are the presented LLM-based parsing methods accessible and repro-
ducible?
We summarize the main contributions of the present work as follows:
• We provide a systematic overview of the existing methods, based on a set of feature defini-
tions that characterize LLM-based log parsing approaches, or the corresponding papers,
respectively.
• We derive the general process of LLM-based log parsing, encompassing each parsing ap-
proach of each reviewed paper in a single flow chart.
• We provide a comprehensive benchmark of seven open-source LLM-based log parsing
approaches on open-source datasets.
• Based on the literature review and our evaluation results, we answer 5 research questions
and derive further findings that may benefit the future research and design of log parsers.
• Finally, we make the source code and all the results of our evaluation publicly available in
our GitHub repository2 to ensure transparency and reproducibility.

2 https://github.com/ait-aecid/LLM-log-parsing.git

The remainder is structured as follows: Section 2 describes the background and the core concepts
and how we understand them. Section 3 outlines the survey method, including the search strategy
and the definition of the reviewed features. Section 4 presents the results of the literature review in
a large table and describes the findings of each feature. Section 5 explains the evaluation setup,
while Sec. 6 presents the evaluation results. We discuss the findings from literature and evaluation
in Sec. 7 and conclude our paper in Sec. 8.

2 Background and Related Work


In this section, we describe the background by means of the core concepts relevant to
LLM-based log parsing.

2.1 Log Parsing


Log parsing is the foremost step in log analytics and denotes the process of extracting structured
data from logs. Logs are given as single- or multi-line events and are available in textual form [29].
The diverse and often un- or semi-structured nature of logs makes log parsing a difficult task. Parsing
is the precondition for many log analytics tasks: the effectiveness of downstream tasks is inherently
dependent on the effectiveness of the parser, which also determines the runtime
efficiency of the whole process. This is especially important for online (real-time) or incremental
approaches, such as live monitoring for cybersecurity [23].
handle continuous data streams, updating the parsing model dynamically as new log data arrives.
This allows for real-time analysis and quicker responses to emerging issues.
In this work, we specifically focus on template extraction, since the extraction of the variables
from the log messages is trivial once the (correct) templates are identified, for example, with regular
expression matching. The template extraction process is therefore the key component in log parsing.

2.2 Log Templates


A log template is a predefined format used to log system events in software applications. It typically
includes placeholders for essential details like timestamps, log levels (e.g., INFO, ERROR), message
content, contextual metadata, and more [35]. Previous research distinguishes between the unstructured
log message part of a log and the structured log headers [17]. The structured part is thereby
straightforward to extract with regular expressions and it has been shown that parsing only the log
message part improves the parsing accuracy [17]. Many methods [12, 18, 66] therefore require
the log format of a dataset as input parameters. An example of a log template from the Apache
dataset [77] is given in Fig. 1. The log format parameter for this example would be “[<Time>]
[<Level>] <Content>”, where “Time” and “Level” refer to log headers and “Content” refers
to the log message. In the remainder, we understand “log format” as this parameter defining the
positions of the log headers and of the content within different log types, or different datasets,
respectively.

Source Code
logging.info(f"workerEnv.init() ok {file_path}")
logging.error(f"mod_jk child workerEnv in error state {error_state}")
logging.info(f"jk2_init() Found child {child} in scoreboard slot {slot}")

Logs
[Sun Dec 04 04:51:52 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Sun Dec 04 04:51:55 2005] [error] mod_jk child workerEnv in error state 6
[Sun Dec 04 04:52:04 2005] [notice] jk2_init() Found child 6738 in scoreboard slot 6

Parsed Logs
Time Level Template Parameters
Sun Dec 04 04:51:52 2005 notice workerEnv.init() ok <*> [/etc/httpd/conf/workers2.properties]

Sun Dec 04 04:51:55 2005 error mod_jk child workerEnv in error state <*> [6]

Sun Dec 04 04:52:04 2005 notice jk2_init() Found child <*> in scoreboard slot <*> [6738, 6]

Fig. 1. A simple example of log parsing, from logging statement to log file to parsed log.
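
To make the parsing step concrete, the following minimal Python sketch (our own illustration, not code from any reviewed approach) turns the log format “[<Time>] [<Level>] <Content>” and the known templates from Fig. 1 into regular expressions and parses one of the example logs; all helper names are chosen for illustration only.

import re

# The log format "[<Time>] [<Level>] <Content>" expressed as a regex with named groups.
LOG_FORMAT = re.compile(r"\[(?P<Time>[^\]]+)\] \[(?P<Level>[^\]]+)\] (?P<Content>.*)")

# Known (ground-truth) templates; "<*>" marks a variable part.
TEMPLATES = [
    "workerEnv.init() ok <*>",
    "mod_jk child workerEnv in error state <*>",
    "jk2_init() Found child <*> in scoreboard slot <*>",
]

def template_to_regex(template):
    # Escape the static text and turn each "<*>" placeholder into a capture group.
    pattern = re.escape(template).replace(re.escape("<*>"), "(.+?)")
    return re.compile("^" + pattern + "$")

def parse(line):
    headers = LOG_FORMAT.match(line).groupdict()
    for template in TEMPLATES:
        match = template_to_regex(template).match(headers["Content"])
        if match:
            return {**headers, "Template": template, "Parameters": list(match.groups())}
    return {**headers, "Template": None, "Parameters": []}

print(parse("[Sun Dec 04 04:52:04 2005] [notice] jk2_init() Found child 6738 in scoreboard slot 6"))
# Template: "jk2_init() Found child <*> in scoreboard slot <*>", Parameters: ['6738', '6']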

2.3 Large Language Models (LLMs)


Large Language Models (LLMs) have evolved significantly, transitioning from early statistical and
rule-based approaches to deep learning architectures powered by Generative Pretrained Transform-
ers (GPTs). These models leverage self-attention mechanisms and large-scale datasets to generate
human-like text, enabling advancements in natural language processing (NLP) tasks such as sum-
marization, translation, content generation, and complex problem-solving. A key enhancement in
LLMs is instruction tuning, which specializes pretrained LLMs to follow instructions provided
in prompts. Instruction tuning enables effective ICL: instead of modifying model parameters, ICL
allows LLMs to learn from prompts that include instructions, examples, and queries, making them
more adaptable to specialized applications [69].
The term LLM is only loosely defined, hence, for this work, we define LLMs as generative
pretrained transformers (GPTs) such as ChatGPT, Mistral or Qwen. Models like BERT [11] (encoder-
only) are thereby excluded, as we are only interested in typically larger, generative models that
excel in both understanding and generating text, making them suitable for a broader range of tasks.
Many works included BERT in their definition of LLMs, which complicated the manual filtering
process after the keyword search.

2.4 Log Parsing with LLMs


Large-scale model series like ChatGPT and others have demonstrated potential in this domain.
Some approaches fine-tune smaller LLMs [42] or BERT [33] for this specific task while reporting
superior parsing performance compared to state-of-the-art approaches. However, fine-tuning LLMs for log
parsing is resource-intensive in terms of computational costs, runtime, as well as labeled examples,
making ICL a compelling alternative [64]. Research shows that ICL enhances LLMs’ performance
in logical reasoning and fact retrieval, suggesting its viability for structured log analysis [5].
Despite these advantages, integrating LLMs into log parsing presents challenges. These models
are not inherently designed for log parsing, leading to inconsistencies in token classification and
errors in log templates. Such inaccuracies can negatively impact downstream tasks like anomaly
detection [78]. Additionally, given the vast volume of real-world log data, efficient processing is
crucial. The computational demands of LLMs contribute to inference overhead and network latency,
necessitating optimizations for practical deployment [23].

3 Survey Method
3.1 Search Strategy
This section describes our search strategy and the inclusion and exclusion criteria,
derived from the core concepts of Sec. 2 and early insights from LLM-based log parsing approaches.
For the keyword search, we determined the following keywords:
(1) log(s) (in title);
(2) LLM(s), large-language-model(s), or large language model(s);
(3) parsing, parser(s), or parse.
The term "log" or "logs" must be included in the publication title, while the remaining keywords
are entered into the default search engine of the respective database. This approach should ensure
that the search results are focused on publications where log data is the central aspect. The search
statistics are given in Table 1.
The numbers of search results per database sum up to 176, excluding duplicates found on multiple
databases. The results were then filtered by exclusion criteria. A publication is excluded if,
• It is not written in English.
• It is not accessible in electronic format.
• It is a book, technical report, lecture note, presentation, patent, thesis, or dissertation.
• A more recent publication is available that presents the same study (by the same authors),
ensuring that this SoK focuses on the latest versions of specific approaches.
• It only applies an existing approach without introducing any novelties to LLM-based log
parsing.
• It does not apply log parsing or LLMs by definitions of Sec. 2.
• It is not accessible with our institutions’ licenses.
By applying these exclusion criteria on both abstract and full text, we omitted 147 of the initial
publications found: 1 publication was omitted because it was not written in English. 8 publications
were omitted due to being completely unrelated to the topic. 110 are related but excluded, as they do
not directly cover LLM-supported log data parsing by the definitions of Sec. 2 or are about other log
analytics tasks, but employ parsing-free methods or existing conventional parsers, such as Drain
[18] or SPELL [12]. 21 were omitted because they were books, theses, dissertations, or technical
reports. 2 were excluded for not being available freely with any license from our institutions. 5
were outdated versions of newer papers.
The final selection consists of 29 papers. One publication is from the year 2023, one from 2025,
while the remaining are all from 2024. Since there is considerable recent interest in using LLMs for
log-related tasks, we include preprint papers to capture the newest research. There are 10 preprint
papers in the final selection (∼ 34%), all from 2024.

Table 1. Keyword search results (status 29-01-2025). “#R” and “#I” stand for the number of results and the
number of papers that were still included after all exclusion criteria were applied.

IEEE Xplore (#R: 23, #I: 11): ("Document Title": "log" OR "Document Title": "logs") AND (LLM OR LLMs OR "large language model" OR "large language models") AND (parser OR parsers OR parsing OR parse)

Science-Direct (#R: 6, #I: 0): title: ("log?") AND title, abstract, keywords: ("LLM" OR "LLMs" OR "large language model" OR "large language models" OR "large-language-model" OR "large-language-models") AND ("parser" OR "parsers" OR "parsing" OR "parse")

ACM Digital Library (#R: 40, #I: 9): [[Title: log] OR [Title: logs]] AND [[All: "llm"] OR [All: "llms"] OR [All: "large language model"] OR [All: "large language models"] OR [All: "large-language-model"] OR [All: "large-language-models"]] AND [[All: "parser"] OR [All: "parsers"] OR [All: "parsing"] OR [All: "parse"]]

Google Scholar (#R: 166, #I: 29): intitle:"log" OR intitle:"logs" "LLM" OR "LLMs" OR "large-language-model" OR "large-language-models" OR "large language model" OR "large language models" "parsing" OR "parser" OR "parsers" OR "parse"

Web of Science (#R: 2, #I: 2): TI=("log" OR "logs") AND TS=("LLM" OR "LLMs" OR "large language model" OR "large language models" OR "large-language-model" OR "large-language-models") AND TS=("parser" OR "parsers" OR "parsing" OR "parse")

Springer Link (#R: 6, #I: 0): Title:(log OR logs) AND (LLM OR LLMs OR "large language model" OR "large language models" OR "large-language-model" OR "large-language-models") AND (parser OR parsers OR parsing OR parse)

3.2 Feature Definition


This section describes each feature of Table 2 and its classes, which we report for the analyzed
methods. We chose this specific set of features because they map to our research questions
and appear frequently in the analyzed works. Additionally, we consider these features
the most relevant for anyone who wants to apply LLM-based log parsing. The features either
describe general properties (GP), processing steps (PS), or concern the reproducibility (R)
of the approaches. For reproducibility, we take inspiration from the work of Olszewski et al. [47]
where they analyzed machine learning papers of Tier 1 security conferences for reproducibility.
The researchers defined a set of questions to determine the reproducibility of research findings.
In Table 2 features are reported with “✓” (true), an empty cell (false) or a descriptive name or
abbreviation. If a paper provides an unclear answer to the feature in question, we write “?”. For
instance, one paper provides a link to its code repository but the repository is empty, thus we
write “?” for the corresponding feature. If a paper does not include the feature, or if the feature is
not applicable, we leave it blank.

GP-1 — Supervision. In log parsing a labeled log is one for which a template, acting as the ground
truth for that log, is available. We differentiate approaches based on the requirement for labels:
• Supervised parsing (sup) requires at least some log instances of the training set to be labeled.
• Unsupervised parsing (un) does not require labels.

GP-2 — Parsing Mode. Many methods are described as online approaches, but their interpretations
vary. Some consider online processing to be incremental processing or streaming, while others
apply batch-wise processing yet still label their method as online. Additionally, some works do
not specify whether their approach is online or offline. Our initial incentive was to classify these
methods accordingly, but this is often not feasible due to a lack of context or sufficient explanations.
As a consequence, we devised the following categories that describe how many logs are processed
at once:
• Stream (S): The logs are processed one-by-one.
• Batch (B): Multiple logs are processed at once, whereby it is possible to apply local (within
each batch) optimizations. A batch is significantly smaller than the whole dataset.
• Total (T): The entire dataset is processed at once, whereby it is possible to apply global
optimizations to the process.
GP-3 — Learning / Prompting. We report four different types of learning or prompt engineering
techniques. The type of learning can strongly influence how the prompt is designed which is why
we report this in a single feature:
• In-context learning (ZS/FS/S): ICL [5] leverages the context provided within a single LLM
call to generate responses, adapting to specific needs without any updates to the model’s
parameters. Thereby, we differentiate between a zero-shot setting (ZS), where the model
performs a task based solely on an instruction in the prompt, and few-shot ICL (FS), where
a small number of demonstrations is provided in the prompt to guide the model’s behavior.
Demonstrations can be retrieved randomly from the dataset or with sophisticated strategies.
They can also be static (S) which we report separately.
• Chain of Thought (CoT): CoT [61] refers to a series of subsequent LLM calls, where the
(complex) task is broken down into (easier) subtasks or “thoughts” by the LLM itself and
answered in a step-by-step manner, rather than jumping to the solution directly. The process
also enhances transparency since intermediate steps can be monitored by users.
• Fine-tuning (FT): In addition to ICL and CoT, which are solely modifications of the prompt,
fine-tuning [16] modifies the parameters of the LLM, but mostly affects only the last few
layers of the neural network, which are usually the most decisive ones for the outcome.
• Pre-training (PT): Pre-training refers to the initial phase of training the model on a large
corpus of text data to learn the structure and patterns of language.
PS-1 — Manual Configuration. This is true if at least one manual configuration step is required
(e.g., when a parser is applied to an unseen log type). This includes manual definition of input
parameters such as regular expressions, log formats or other essential parameters for different
datasets, which is often required for many conventional log parsers such as Drain [18], Brain [67],
or Spell [12], and others featured in the LogPAI log parser repository3 [78]. This feature does not
include optional parameters that are generic enough to be left unchanged for a new log type.

3 https://github.com/logpai/logparser; accessed 9-December-2024

PS-2 — Retrieval-Augmented Generation. Retrieval-augmented generation, or RAG, is a paradigm
where the LLM is provided with information retrieval capabilities. In case of log parsing, most
approaches utilize a sampling strategy to include either logs or logs and their templates in the
prompt. The LLM should then use the provided context to learn the variability and commonality of
logs [35] or learn parsing directly from log-template pairs. Many approaches create clusters, trees,
buckets or other kinds of aggregations from which they sample their demonstrations for ICL and
retrieve them based on some kind of similarity measure. We differentiate two cases:
• Random retrieval (R): Demonstrations are retrieved randomly from the training data.
• Strategic retrieval (S): There is a refined strategy for retrieving demonstrations from specific
data storages (e.g., clusters, lists, etc.).
PS-3 — Caching. Some approaches increase their efficiency by storing parsed logs’ templates in
some kind of data structure and only call the LLM if a subsequent log does not fit any of the existing
templates. These data structures are mostly tree-like structures or similarity-based clusters.
PS-4 — LLM Usage. The employed LLM is used in at least one of three different ways:
• Direct parsing (dir): The LLM receives one or more logs directly in the prompt and is asked
for the corresponding template.
• Post-processor (post): The LLM is used to post-process the log lines, mostly by merging
similar templates or correcting templates based on new information obtained by subsequent
log lines in an online approach.
• Preprocessor (pre): The LLM functions as a helper in preprocessing, for example, for identi-
fying the timestamp, the log format, or other relevant features.
PS-5 — Template Revision. This feature describes whether the templates are revised in a post-
processing step, such as merging similar templates or correcting templates, based on the information
obtained from new logs (in an incremental approach). This step can be done with or without the
help of LLMs.
R-1 — Evaluation Data. Since the performance of parsers is strongly dependent on the data, we
report the different types of datasets or dataset collections used in the papers:
• LogHub (L): The well-known LogHub [77, 78] repository is widely used for evaluating log
analysis methods. It features 16 annotated datasets from different domains with 2000 logs
each.
• Corrected LogHub (CL): This is a version of LogHub with corrected templates. The original
templates have been observed to have inconsistencies due to inconsistent labeling styles
[27].
• LogHub-2.0 (L2): LogHub-2.0 [24] is based on Loghub and features 14 large-scale datasets
from various domains with numbers of logs in the order of 10^4 to 10^7.
• Custom (CA/CU): If the parser was evaluated with custom data we report whether it is
publicly available (CA) or unavailable (CU).
R-2 — Evaluation Metrics. We report the various evaluation metrics employed to assess the effective-
ness of log parsing techniques. Each metric provides a unique perspective on the parsing process,
focusing on different aspects of effectiveness (a small computation sketch is given after the list):
• Group Accuracy (GA): A log message is considered correctly parsed if and only if its event
template corresponds to the same group of log messages as the ground truth does. GA
is then the number of correctly parsed log messages divided by the total number of log
messages [78]. GA is also known as RandIndex [50].
• F1-score of Group Accuracy (FGA): 𝑁𝑝 is the number of templates that are generated by a
log parser, and 𝑁𝑐 the number of templates that are correctly parsed by the log parser. 𝑁𝑔 is
the actual correct number of templates in the ground truth. The precision of the GA (PGA)
is then 𝑁𝑐/𝑁𝑝 and the recall of GA (RGA) is 𝑁𝑐/𝑁𝑔. Then, FGA is the harmonic mean of
PGA and RGA [24].
• Parsing Accuracy (PA): A log is considered correctly parsed if and only if all its static text
and dynamic variables are correctly identified. PA is then the number of correctly parsed log
messages divided by the total number of log messages [9]. PA is the same as message-level
accuracy (MLA) [37, 66].
• Edit Distance (ED): ED is the minimum number of operations, such as insertions, deletions,
or substitutions, needed to convert one string into the other [45].
• Precision Template Accuracy (PTA): A template is correctly parsed from log messages if and
only if it is identical (token-by-token) to the ground-truth template(s) of the log messages.
PTA is the ratio of correctly identified templates to the total number of identified templates
[27].
• Recall Template Accuracy (RTA): Complementary to PTA, RTA is the ratio of correctly
identified templates to the total number of ground-truth templates [27].
• F1-score of Template Accuracy (FTA): FTA is the harmonic mean of PTA and RTA [24].
• Other: If metrics other than the above are used we write “other”.
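
To illustrate how the grouping- and template-level metrics relate to each other, the following Python sketch computes GA, FGA, PA, and FTA for a list of parsed templates against the ground truth. It is a simplified reading of the definitions above, not the official evaluation scripts of the cited benchmarks.

from collections import defaultdict

def parsing_metrics(parsed, truth):
    # parsed, truth: lists with one template string per log message.
    n = len(parsed)
    groups_parsed, groups_truth = defaultdict(set), defaultdict(set)
    for i, (p, t) in enumerate(zip(parsed, truth)):
        groups_parsed[p].add(i)
        groups_truth[t].add(i)

    # GA: a log counts as correct if its parsed group equals its ground-truth group.
    ga = sum(groups_parsed[parsed[i]] == groups_truth[truth[i]] for i in range(n)) / n
    # PA: a log counts as correct if its parsed template is identical to the ground truth.
    pa = sum(p == t for p, t in zip(parsed, truth)) / n

    # FGA: F1-score over template groups (N_c = parsed groups that coincide with a truth group).
    n_c = sum(g in groups_truth.values() for g in groups_parsed.values())
    pga, rga = n_c / len(groups_parsed), n_c / len(groups_truth)
    fga = 2 * pga * rga / (pga + rga) if pga + rga else 0.0

    # FTA: F1-score over templates, additionally requiring an identical template string.
    n_t = sum(g in groups_truth.values() and truth[next(iter(g))] == p
              for p, g in groups_parsed.items())
    pta, rta = n_t / len(groups_parsed), n_t / len(groups_truth)
    fta = 2 * pta * rta / (pta + rta) if pta + rta else 0.0
    return {"GA": ga, "FGA": fga, "PA": pa, "FTA": fta}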
R-3 — Used Models. A variety of different LLMs are used in the analyzed works. We report only the
base models, or the name of the model series, of the used LLMs and only the ones used for the task
of log parsing.
R-4 — Code availability. For a potential user, it might be essential that a parser is already imple-
mented, thus we report the availability of code repositories. This feature is true if a link was provided
to the implementation and if the link does not lead to an empty repository or a non-existent page,
and false otherwise.
R-5 — Preprint. This is true if the reviewed paper is a preprint paper. It is possible that the corre-
sponding code repository to the paper, the results, findings, or interpretations are only preliminary,
which is why we report this feature.

4 Literature Review Results


The results of our literature review are aggregated in Table 2. The following subsections discuss
the results and findings of the literature review in the context of this table.

4.1 Process Pipeline


After reviewing the paper selection, we were able to develop a process pipeline that summarizes
all the approaches in a single flow chart, given in Fig. 2. The dashed arrows and boxes represent
optional components while the continuous ones represent essential components. The user and the
LLM represent key actors that can act on other components. The arrows pointing away from the user
indicate steps that can require manual effort. The arrow from the user to the logs represents the manual
labeling effort for supervised parsers, while the arrow pointing to the preprocessing component
represents manual configuration. We provide detailed explanations about the components in the
following sections.

4.2 Supervision (GP-1)


From the reviewed papers, 6 are fully supervised and 9 describe both a supervised and an
unsupervised setting. Supervised approaches require labeled logs for ICL or fine-tuning, but the
number of templates varies from a few guiding seed examples [75] to significant proportions of a
dataset’s templates [23, 66]. The works that fine-tune LLMs [26, 38, 42, 48, 75] all require logs and
their templates for the process, which is why we classify these approaches as supervised.
Similar to earlier work [33], LILAC [23] and DivLog [66] utilize specialized algorithms to sample
from the training data. The objective of the algorithms is to maximize the sample diversity of the
labeled logs required for ICL. They call this process candidate sampling. LILAC has a sampling

Table 2. Categorization of the selected papers by features.

Column headers: GP-1 Supervision, GP-2 Processing mode, GP-3 Learning / Prompting (General Properties);
PS-1 Manual config., PS-2 RAG, PS-3 Caching, PS-4 LLM usage, PS-5 Template corr. (Processing Steps);
R-1 Datasets, R-2 Metrics, R-3 Models, R-4 Code availability, R-5 Preprint (Reproducibility).

Approach GP-1 GP-2 GP-3 PS-1 PS-2 PS-3 PS-4 PS-5 R-1 R-2 R-3 R-4 R-5
Astekin et al. [2] un, sup S ZS, S dir CL GA, ED, other GPT, Claude, Llama
Astekin et al. [3] un, sup S S dir CL other GPT, Claude, Llama,
Mistral
Cui et al. [8] un S ZS ? dir L PA, ED GPT, Claude, Llama, ✓ ✓
(LogEval) Gemini, InternLM,
Qwen, AquilaChat,
Mistral, Baichuan,
DevOps, ChatGLM
Fariha et al. [14] un ? FS? pre L other GPT
Huang et al. [21] un B ZS S ✓ dir L2 GA, FGA, PA, GPT ? ✓
(LUNAR) FTA
Ji et al. [22] sup ? PT, ZS? dir L GA, other Llama, GPT, Qwen, ✓ ✓
(SuperLog) OWL
Jiang et al. [23] sup S FS ✓ S ✓ dir ✓ L2 GA, FGA, PA, GPT ✓
(LILAC) FTA
Karanjai et al. [26] un, sup S FT; FS S ✓ dir ✓ L2 GA, FGA, PTA, GPT ✓
(LogBabylon) PA, RTA, FTA
Le et al. [32] un, sup S ZS, FS R dir CL GA, PA, ED GPT
Liu et al. [36] un B ZS ✓ dir L other GPT, Vicuna ✓
(LogPrompt)
Ma et al. [38] sup S FT; ZS, FS ✓ ? dir CL GA, PA Llama, ChatGLM ✓
(LLMParser)
Ma et al. [39] un B FS ✓ S ✓ dir ✓ L2 GA, PA Llama, Mistral, ✓ ✓
(OpenLogParser) Gemma, ChatGLM,
Mehrabi et al. [42] sup S FT; ZS, S dir CA PA, ED, FGA Mistral, GPT ✓
Pang et al. [48] un, sup S FT; ZS dir CU other Llama
(ONLA-LLM)
Pei et al. [49] un, sup B FS ✓ S ✓ dir ✓ L GA, PA, PTA, GPT ✓
(SelfLog) RTA
Sun et al. [54] (Loki) un S ZS ✓ dir
Sun et al. [55] un, sup S FS, CoT ? dir L PA GPT ✓ ✓
(Semirald)
Sun et al. [56] sup S FS, CoT ? dir L GA, PA GPT ? ✓
(SemiSMAC-<T>)
Vaarandi et al. [59] un B S ✓ dir ✓ CA other OpenChat, Mistral, ✓ ✓
(LLM-TD) WizardLM
Wu et al. [62] un S FS ✓ S ✓ dir, post ✓ L, L2 GA, FGA, PA, GPT, Gemini, ✓
(AdaParser) FTA Claude, DeepSeek,
Qwen
Xiao et al. [64] un B ZS ✓ S ✓ dir ✓ L2, CL GA, PA, ED GPT ✓
(LogBatcher)
Xu et al. [66] sup S FS ✓ S dir L PA, PTA, RTA GPT ✓
(DivLog)
Xu et al. [65] un, sup B ZS, S; CoT ✓ dir ✓ L2 GA, FGA, PA, Claude ✓
(HELP) FTA
Yu et al. [68] un T? ZS ✓ pre L, CA PA GPT, Gemma ✓
(LogGenius)
Zhang et al. [71] un S ZS ✓ ? ✓ pre ✓ L, CU GA, FGA GPT ✓
(ECLIPSE)
Zhang et al. [72] un T FS, CoT ✓ ✓ post ✓ L GA, FGA GPT ✓ ✓
(Lemur)
Zhi et al. [74] un S ZS ✓ ✓ dir ✓ CL GA, PA, ED GPT
(YALP)
Zhou et al. [76] un S ZS, FS/S? ? dir L GA, PA, ED GPT
Zhong et al. [75] un, sup S FT; ZS, FS S ✓ dir, post ✓ L, L2 GA, PA, FGA, GPT, Llama
(LogParser-LLM) FTA, other

Fig. 2. LLM-based parsing process pipeline (flow chart; components: User, LLM, Fine-tuning, Preprocessing,
RAG-storage / Cache, Parsing (Direct Parser), Post-processing, Logs, Templates). The dashed arrows and
boxes represent optional components while the continuous ones represent essential components. The arrows
pointing from the LLM to certain process elements describe where the LLM can be applied.

algorithm based on hierarchical clustering, while DivLog uses the Determinantal Point Process
(DPP) [28]. It is noteworthy that the unsupervised parser LogBatcher [64] also uses DPP, but to
maximize the sample diversity within the batches of logs that are prompted to the LLM for direct
parsing.
Huang et al. (LUNAR) [21] empirically study the average FTA of LILAC [23] and the BERT-based
parser LogPPT [33] based on the proportion of labeled logs they receive as input. It was found that
FTA exhibited a substantial decline in performance when the proportion of labels was reduced to
5%, which, in general, remains a relatively high figure, particularly in light of the typically abundant
nature of logs. The finding that the parsers’ performance improves as more templates become available
is hardly surprising [21, 23]. However, labeled logs are scarce and require substantial manual effort
and expertise [64]. All supervised parsers of the selection can technically be operated unsupervised
at the cost of performance. This may require modifying the prompt so that it corresponds to a
zero-shot setting to not confuse the LLM with announced but absent examples.
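
To illustrate the idea behind candidate sampling, the following sketch greedily selects a diverse set of logs using a token-level Jaccard distance. It is a simplified stand-in for the hierarchical-clustering and DPP-based strategies of LILAC and DivLog, not a reimplementation of either algorithm.

def token_set(log):
    return set(log.split())

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def diverse_candidates(logs, k):
    # Greedy max-min selection: repeatedly add the log that is farthest from
    # everything selected so far, maximizing the diversity of the sample.
    selected = [logs[0]]
    while len(selected) < min(k, len(logs)):
        remaining = [log for log in logs if log not in selected]
        best = max(remaining, key=lambda log: min(
            jaccard_distance(token_set(log), token_set(s)) for s in selected))
        selected.append(best)
    return selected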

4.3 Processing Mode (GP-2)


Log parsers can be categorized based on their processing mode, which determines whether they
operate online or offline. The 18 parsers classified under the “stream” category process logs in
real-time as they arrive, making them suitable for online processing. These parsers handle data
incrementally, allowing for immediate analysis and response which may be vital for downstream
tasks using live monitoring. In contrast, “batch” parsers [36, 39, 49, 59, 64, 65] process logs in
fixed-sized chunks, meaning they can function in both online and offline settings, depending on the
frequency of batch execution. When processing in batches, it is possible to optimize the process for
this batch [64]. Lastly, parsers under the category “total” [21, 68, 72] analyze an entire log dataset
at once, which is inherently an offline approach, as it requires the complete dataset to be available
before processing begins. For instance, LUNAR [21] and Lemur [72] group the logs based on certain
features (e.g. length or most frequent shared oken) into buckets. They then create clusters within
that buckets that maximize the sample diversity. The concept of maximizing the sample diversity
within certain similarity clusters is also used by LogBatcher [64] but for data batches. Similarly,
DivLog [66] and LILAC [23] use this idea for their sampling algorithm.
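
A rough sketch of this kind of grouping, using the token count and the dataset-wide most frequent token of a log as bucket key; this illustrates the general idea rather than the exact LUNAR or Lemur procedure.

from collections import Counter, defaultdict

def bucketize(logs):
    vocabulary = Counter(token for log in logs for token in log.split())
    buckets = defaultdict(list)
    for log in logs:
        tokens = log.split()
        # Bucket key: number of tokens and the log's most frequent token in the dataset.
        key = (len(tokens), max(tokens, key=lambda t: vocabulary[t]))
        buckets[key].append(log)
    return buckets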

4.4 Learning / Prompting (GP-3)


4.4.1 In-Context Learning. ICL [5] is a prompt-based learning paradigm. It includes learning from
instructions but also from examples [5]. This makes it a cheap way of learning. All approaches
therefore apply ICL in some way. Among others, 16 works use zero-shot ICL, 14 employ few-shot
ICL, and [2, 3, 42, 59, 65] use static examples. We report static examples separately from the few-shot
setting in order to illustrate which approaches use dynamic examples. This in turn implies RAG
(PS-2).

4.4.2 Chain of Thought. Lemur [72] uses CoT [61] to revise similar templates that may belong to
the same event in three dialogue rounds. These steps include revising the structure, the semantics
and finally, a decision if the templates should be merged. The other works applying CoT [55, 56, 65]
use it for direct parsing but do not explain in detail how.

4.4.3 Fine-Tuning and Pre-Training. Compared to ICL, fine-tuning is considered a rather computa-
tionally costly task [64] since it modifies the parameters of the model itself. Previous research [44]
found that fine-tuning generally outperforms ICL in effectiveness. From our selection, the works
of Ma et al. (LLMParser) [38] and Mehrabi et al. [42] support this finding. However, fine-tuning
also requires labeled logs. As the default setting for the training, [42] used 80% of the available
templates and 15 logs per template, ONLA-LLM [48] used 80% of the labeled logs of their custom
dataset, LLMParser [38] used 50 labeled logs per dataset, and [75] used 32 per dataset. Furthermore,
Mehrabi et al. [42] and Ma et al. [38] find that their fine-tuned LLMs struggle with new and unseen
log types in terms of effectiveness and robustness. This raises the question of whether this is due to
overfitting, but this is a task for future research. It remains up to the users to decide whether a costly
training phase, requiring considerable quantities of labeled logs, but with improved performance
on seen logs, is feasible for their use case.
From our selection only Ji et al. (SuperLog) [22] pre-trained a model. Pre-training requires vast
amounts of data. To this end, Ji et al. created NLPLog, a comprehensive dataset with a quarter
million question-answer pairs on log analysis tasks. The researchers report superior performance
not only in parsing but also anomaly detection, failure diagnosis, and log interpretation. They also
conducted an evaluation on unseen logs, but they do not report any of the common parsing metrics
from Sec. 3.2. This hinders the determination of the practical benefits in settings involving unseen
logs. Future research could be conducted on this aspect.

4.5 Manual Configuration (PS-1)


The datasets from LogHub and its descendants [27, 77, 78] have been used for the evaluation of log
parsing for many years now and it is known from existing literature which regular expressions,
log formats, or other parameters have to be used to achieve a certain baseline of performance.
Some of the reviewed methods require multiple parameters that are tuned specifically for each
dataset and claim to achieve superior performance on the evaluation metrics. This may
lead to unrealistic expectations for real-life applications, where prior knowledge of the versatile
format of logs is sometimes not available [10]. For instance, Dai et al. have shown that the parsing
results of (non-LLM-based) parsers like Drain [18], IPLoM [40], LenMA [52], AEL [25] and Logram
[9] are strongly influenced by their parameter settings. Of the reviewed studies, 9 require certain
configuration parameters. All of these 9 require the log format of the logs to extract the log content
from the raw logs and [39, 72] additionally require regular expressions for preprocessing.
The requirement for manual configuration complicates a parser’s generic application for users
and industry. Compared to the above-mentioned list of non-LLM-based parsers, the reviewed
LLM-based parsers exhibit fewer (essential) configuration parameters. This indicates that involving
the LLM in the parsing process makes parsers more generic, thus reducing the need for human
effort.

4.6 RAG (PS-2)


Ten approaches describe a sophisticated retrieval process, one describes random retrieval, and for six
approaches it is not clear how they apply RAG. RAG is used for demonstration retrieval
but also for retrieval of multiple target logs where a set of similar logs is chosen. The latter is only
relevant for approaches parsing in batches or the whole dataset at once (GP-2). LogBatcher [64],
OpenLogParser [39], LUNAR [21] prompt batches of logs that are in the same similarity cluster
but maximize the dissimilarity within that cluster to highlight the variabilities and commonalities
in logs for the LLM. Demonstrations can either be only logs [26] or logs and their templates
[2, 23, 39, 49, 62, 64, 66, 75]. The templates are either available from having been parsed before, yet
with no guarantee of correctness, or they originate from candidate sets of labeled logs in supervised
settings [23, 49, 56, 66, 75] (GP-1).
As mentioned in [39], the quality of demonstrations is of high importance for the parsing
effectiveness, since poor demonstrations can introduce noise and confuse the LLM. The works [23, 64–66, 71]
support this finding in their ablation studies by comparing the performance of their RAG sampling
techniques to random retrieval or static examples, which generally achieve lower scores than
strategic retrieval.
Another noteworthy phenomenon observed in generative language models is recency bias which
refers to the tendency of the model to give disproportionate weight or attention to information
that appears closer to the end of the prompt [73]. Consistently, Xu et al. (DivLog) [66] observed
a significant impact on parsing performance based on the order of the retrieved examples in the
prompt, whereby an ascending order of examples by similarity achieved the best results. The
ordering of demonstrations in ascending order is adopted by multiple works [23, 62, 64, 66].
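
A minimal sketch of strategic demonstration retrieval that also applies the ascending-by-similarity ordering described above; the token-level Jaccard similarity is only a placeholder for the more refined similarity measures and data structures used by the reviewed approaches.

def jaccard_similarity(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_demonstrations(query_log, labeled_pairs, k=3):
    # labeled_pairs: (log, template) tuples from the candidate set.
    # Sort ascending by similarity so that the most similar demonstration is
    # placed last, i.e., closest to the query in the prompt (recency bias).
    ranked = sorted(labeled_pairs, key=lambda pair: jaccard_similarity(query_log, pair[0]))
    return ranked[-k:]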

4.7 Caching (PS-3)


As LLM calls are costly compared to traditional algorithms, many approaches create
certain structures like clusters [21, 23, 26, 49, 56, 64, 65, 72], (template) trees [23, 26, 39, 62, 75] or
other structures containing logs and their already parsed templates. In case the target log (or logs)
matches an existing template (from an already parsed log) or is at least very similar, the parsing
algorithm skips the LLM call and returns the corresponding template retrieved from the structure
for this log. This self-evolutionary process (as Pei et al. [49] call it) is visualized in Fig. 2 by the
arrow pointing from the templates to the cache component. This approach can greatly increase
efficiency, allowing the usage of parsing with LLMs in settings where fast and cost-efficient parsing
is crucial, with reported runtimes comparable to runtime-efficient parsers like Drain [18]. A parsing
cache can also enhance effectiveness by mitigating the instability of results that often occurs with
generative language models [23]. Astekin et al. [3] find that even with temperature 0 some models
do not answer deterministically. This emphasizes the usefulness of caches even more. However,
it is possible that the first retrieved and stored template for an event type is incorrect, even if it
matches some of the incoming logs. Such an event can be counteracted by a template revision
step (PS-5). All but one [54] of the papers that describe some form of caching also
describe a post-processing step for correcting templates.
Since the total number of templates is drastically less than the number of logs, LLM parsing
becomes a scalable approach with caching [75]. For example, the LogHub-2.0 datasets [24] contain
more than 50 million logs but less than 3500 templates. 14 approaches apply some form of caching,
yet some do not explicitly name it “caching”. The structures used for caching can be efficiently and
effectively used for sampling similar logs or relevant parsing demonstrations for ICL with RAG.
Caching is not bound to the usage with LLMs and can be built into any other parsing approach.
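
The following sketch shows the basic caching loop in isolation: previously extracted templates are compiled to regular expressions, and the LLM is only called on a cache miss. The tree- and cluster-based structures of the reviewed parsers are reduced to a flat list here for brevity, and the LLM call is an injected placeholder.

import re

class TemplateCache:
    def __init__(self, llm_parse):
        self.llm_parse = llm_parse            # callable: log content -> template string
        self.templates = []                   # list of (template, compiled regex) pairs

    def _compile(self, template):
        pattern = re.escape(template).replace(re.escape("<*>"), "(.+?)")
        return re.compile("^" + pattern + "$")

    def parse(self, content):
        for template, regex in self.templates:
            if regex.match(content):
                return template               # cache hit: no LLM call needed
        template = self.llm_parse(content)    # cache miss: ask the LLM
        self.templates.append((template, self._compile(template)))
        return template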

4.8 LLM Usage (PS-4)


We observed three different ways an LLM can be applied: preprocessing, post-processing and direct
parsing, from which we derived the main components of the flow chart in Fig. 2.
Three approaches employ the LLM for preprocessing. ECLIPSE [71] applies the LLM for the
extraction of semantic information from the logs that is then used by an LCS- (longest common
subsequence) and entropy-based parsing algorithm. Fariha et al. [14] used an LLM for extracting
the regular expressions necessary for subsequent parsing with SPELL [12]. LogGenius [68] uses
an LLM to augment log groups with low diversity. The actual parsing task is then left to existing
unsupervised parsers, like Drain [18] or SPELL [12], for which their parsing effectiveness should
be enhanced by the increased diversity of the logs.
Five approaches perform post-processing with the LLM. In all cases, this is done to correct the
already parsed logs to yield more accurate templates. How this is performed is explained in Sec. 4.9.
Direct parsing is the most straightforward way of parsing with LLMs and is utilized in all but
6 approaches. There are differences in how the approaches prompt the LLM, yet they all have a
similar structure. The prompts these approaches use consist of the following components (a minimal
sketch assembling them is given after the list); note that the components are not strictly separated
from each other and can be interlaced:

• Instruction: Instructs the LLM to parse the log(s) and how. The instruction may also
contain rules, such as replacement rules for the timestamp or IP address [56] or special
treatment rules for logs concerning exceptions, errors or similar.
• Context explanations (optional): This part contains explanations about logs or log
parsing. For instance, Zhong et al. [75] state that they improve the LLM’s capabilities of variable
identification by including explanations about variable categories as outlined in [35]. Other
approaches [2, 21, 55, 56, 65, 75] also provide information about variable types in the prompt.
The categorization of the variables can also be beneficial for downstream tasks and the
interpretability of the results [35, 75].
• Context examples (optional): This part either contains multiple logs to illustrate their
variabilities and commonalities or logs and their templates as parsing demonstrations. This
part can be static, or dynamic using RAG. Providing examples standardizes the LLM’s output
format, leading to more stable results [39].
• Output constraints (optional): This part instructs the LLM how to format the output.
Some approaches enforce JSON format, but most approaches determine one or two markers
(e.g., often backticks or something like /START and /END) so that in a post-processing step
the generated template can be extracted easily with regex after the marker or between
markers. This improves output quality, because generative language models sometimes
generate unwanted output.
• Target log(s): Contains the log(s) to be parsed. The approaches [21, 39, 49, 59, 64, 65] provide
multiple target logs at once in the query, while the rest provide single logs. Providing multiple
logs at once can help the LLM in abstracting variable and static parts from the logs [49].
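
The following sketch assembles such a prompt and extracts the generated template from between backtick markers in the model output. The instruction wording and the marker choice are illustrative and not taken from a specific reviewed approach.

import re

INSTRUCTION = ("You are a log parsing assistant. Abstract the variable parts of the "
               "log with <*> and return only the template between backticks.")

def build_prompt(target_log, demonstrations=()):
    # demonstrations: optional (log, template) pairs used as context examples.
    parts = [INSTRUCTION]
    for log, template in demonstrations:
        parts.append("Log: " + log + "\nTemplate: `" + template + "`")
    parts.append("Log: " + target_log + "\nTemplate:")
    return "\n\n".join(parts)

def extract_template(llm_output):
    # Output constraint: take the text between the first pair of backticks.
    match = re.search(r"`([^`]+)`", llm_output)
    return match.group(1).strip() if match else None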

Le et al. [32] find that with a simple prompt, where no context is provided nor a detailed
instruction is given, GPT-3.5 is hardly able to understand the concept of log parsing. Sun et al. [55]
achieve a significantly higher PA (20% to 60%) with a prompt that contains extraction rules for
direct parsing (on HDFS and BGL with GPT-3.5 and GPT-4) than without extraction rules.

4.9 Template Revision (PS-5)


In total, 14 papers revise templates in a post-processing step. All works that correct their templates
in a postprocessing step also use a cache (PS-3). The works [36, 62, 72, 75] employ the LLM to revise
templates per prompt: Lemur [72] uses CoT in three dialogue rounds for revising semantically
similar templates for potential merge. First, the structure is revised, then the semantics, and in a
final round the LLM is asked for a solution based on the first two rounds. LogParser-LLM [75] and
LogBabylon [26] generate a parsing tree from the templates during parsing. In case a loose match
is identified, the LLM decides whether the template can be merged or not, leading to an update of
the template tree or the creation of a new tree root node. The works [36, 39, 62] re-prompt the
logs to the LLM with the same prompt they were initially parsed with, if the resulting template does
not match all the related logs, until all logs are matched [36] or a certain number of re-prompts is
exceeded [39]. AdaParser [62] also revises templates if their variables contain certain keywords linked to
exceptions, failures or interrupts, which they state should not be abstracted with a wildcard [62].
Updating the previously parsed templates can adapt the parser to changes in the computer
system and help correct faulty templates. Templates are typically faulty because static
parts of the log were identified as variables, or variable parts were identified as static. Faulty templates
can be identified by high similarity with an existing template [23, 71, 75], if not all logs within
a similarity cluster can be matched [64], or if the centroids of log clusters move closer together
through newer incoming logs added to clusters [65]. Templates that partially match an existing
template can be merged by traversing a prefix tree [23, 26, 75], by LCS [71, 74],
implicitly by re-entering the logs into the parsing queue [64], or by monitoring whether multiple
paths of a prefix tree join together in a subsequent node. A previously parsed template can also be
replaced with a new and more permissive template [59, 74]. More permissive in this context means
that more variables were identified for the newer template.
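
As a minimal illustration of such a merge, the sketch below compares two templates token by token and abstracts differing tokens with a wildcard, yielding a more permissive template. The reviewed approaches realize this with prefix trees, LCS, or LLM prompts rather than this simplification.

def merge_templates(a, b):
    # Merge two templates of equal token length: identical tokens stay static,
    # differing tokens are abstracted with the "<*>" wildcard.
    ta, tb = a.split(), b.split()
    if len(ta) != len(tb):
        return None  # different lengths: leave the merge to a more refined strategy
    return " ".join(x if x == y else "<*>" for x, y in zip(ta, tb))

# merge_templates("Found child 6738 in scoreboard slot <*>",
#                 "Found child <*> in scoreboard slot 6")
# returns "Found child <*> in scoreboard slot <*>"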

4.10 Datasets (R-1)


All of the papers in our selection use one of the LogHub versions. The most popular dataset is
LogHub [77], used in 14 of them. LogHub-2.0 [24] is used 8 times, and corrected LogHub [27] is
used 6 times. The works [62, 64, 68, 71, 75] evaluate on two or more datasets, or dataset collections.
Zhang et al. (ECLIPSE) [71] and Pang et al. [48] use datasets, other than any of the LogHub ones,
that are not publicly available, while [42, 59, 68] use custom open-source datasets. The usage of
custom datasets along with the datasets from the LogHub versions provides a more comprehensive
view on the performance of the parsers. While it is positive to use a variety of datasets for the
evaluation, it can also hinder the comparability of the parsing results. For instance, Khan et al.
reported performance differences of the conventional parsers on LogHub [77] versus corrected
LogHub [27]. Fu et al. [15] and Zhang et al. (ECLIPSE) [71] showed that many state-of-the-art log
parsers perform poorly on their custom datasets while performing well on the LogHub datasets
[77].

4.11 Metrics (R-2)


Since the correctness evaluation of templates is not straightforward, there is a strong variety of dif-
ferent evaluation metrics. The set of all used metrics is {ED, FGA, FTA, GA, PA, PTA, RTA, other}.
Except for “other”, these are the traditional metrics that have been widely used [24, 27]. If we
compute the Jaccard Index of the used metrics and all available metrics and average the results
we get roughly 0.31, which means that on average only 31% of the full range of metrics is used
(including “other” as a metric). More specifically, 5 publications use only PA, or PA and GA, 7
use ED, 9 publications use at least one metric that is insensitive to class imbalance (FGA, FTA), 8
publications use metrics other than the traditional ones.
The metrics cover different characteristics of correctness. It is therefore important for evaluating
parsers to cover the relevant evaluation aspects with these metrics. For instance, Jiang et al. [24]
proposed using FGA and FTA since they are insensitive to class imbalance, while Khan et al. [27]
recommend using GA, PA, RTA and PTA to cover all aspects (FTA is the harmonic mean of RTA
and PTA). For example, the anomaly detection tool, DeepLog [13], detects anomalies solely based
on the event ID. Each event ID corresponds to a unique template. Consequently, it is irrelevant whether
the parsed templates are correct at the template level as long as they are grouped together correctly,
which is indicated by a high score for GA. Since detection algorithms can focus on a variety of data
characteristics [4], it is important to report metrics that cover these characteristics.

4.12 Model (R-3)


An overview of the LLMs used is given in Table 3. We list only the base models or the name of the
model series, since there is an overwhelming variety of different versions and sizes of LLMs. The
most widely used model series is GPT. The works [2, 3, 8, 22, 36, 38, 39, 42, 59, 62, 68, 75] evaluate
multiple LLMs and compare their performance. Since the papers use different evaluation settings
(metrics, datasets) or configurations, it is not straightforward to say which LLM performs best. To
this end, we refer to the aforementioned works that compare the results of multiple LLMs and
specifically the comprehensive benchmark by Cui et al. (LogEval) [8] which evaluates 18 different
LLMs (partly from the same series, but with different sizes).

Table 3. Overview of used LLMs.

Base model / Model series Creator Availability Times used


GPT OpenAI API 23
Llama Meta Open-source 8
Claude Anthropic API 5
Mistral Mistral AI Open-source 5
ChatGLM Tsinghua University Open-source 2
Gemini Google API 2
Gemma Google Open-source 2
Qwen Alibaba Cloud Open-source 2
AquilaChat BAAI Open-source 1
Baichuan Baichuan AI Open-source 1
DeepSeek DeepSeek AI Open-source 1
DevOps-Model CodeFuse Open-source 1
InternLM Shanghai AI Lab Open-source 1
OpenChat OpenChat Team Open-source 1
OWL Camel AI Open-source 1
Vicuna LMSYS Open-source 1
WizardLM Microsoft Open-source 1

4.13 Code (R-4)


From our selection, 16 papers make their implementation available and provide links to their code
repositories. At the time of writing, two of them [21, 56] lead to an empty or non-existent repository;
note, however, that both are preprint versions. The remaining papers did not make their code open-source. A study
by Olzewski et al. [47] analyzed the reproducibility and replicability of 750 machine learning papers
and their codebases and datasets from Tier 1 security conferences between the years 2013 and 2022.
They found that about 59% of papers did not provide any reproducible artifacts. In our study, 45%
of papers do not provide any reproducible artifacts.
4.13.1 Code Quality. A large proportion of the approaches do not provide code and therefore
cannot be used for the evaluation in Sec. 6. Even with available code, some do not provide the
comprehensive functionality to replicate the processes described in their papers, and comprehensively
correcting the code of others is out of scope for this work. At the time of review (February
14th, 2025) the code of Lemur4 [72] lacks a script for selecting the previously parsed templates for
the subsequent CoT template merging process. The code of LogGenius5 [68] runs into multiple
errors due to missing folders and files, while there is no README file providing instructions. The
LogEval repository6 [8] does not provide the scripts for parsing nor instructions. LogPrompt7
[36] encounters multiple errors due to simple typos. Furthermore, the authors describe three different
prompt strategies, but it is not clear which one they used for their results, nor do they explain it in the
instruction file. Moreover, they do not provide the functionality to run the parsing process except
for the self-prompt setting. SelfLog [49] proposes a self-evolutionary approach, in which they
“retrieve the most similar historical logs and their templates from the data through an Approximate
Nearest Neighbors (ANN) search, serving as the corpus for In-Context Learning (ICL)” [49]. One might
think that this means they update the database incrementally after each newly parsed log. However,
in the code8 the script provided for the effectiveness evaluation does not contain the self-evolution
functionality. They provide a second file for online parsing that does offer this functionality, but
it does not work out of the box due to missing files and missing function parameters. Also, by default they
do not use ANN but cosine similarity for retrieval, which is hidden somewhere in the code (not an
adjustable parameter). The code repository of SuperLog9 [22] only provides code for LLM training,
but not the described framework for parsing. Moreover, we found that many repositories provide
sparse instructions, which further impedes reproducibility and application, especially when the
execution scripts do not work out of the box.
These issues appear to be widespread rather than isolated cases. Similar findings were reported in
a large-scale study by Trisovic et al. [58], which examined thousands of replication code repositories
containing R files published between 2010 and 2020. Their analysis revealed that a significant
majority of these files failed to execute properly, even after code cleaning. Likewise, research by
Olzewski et al. [47] showed that more than half of the artifacts from nearly 300 reviewed papers
could not be run at all. Even among the repositories that did execute, only a fraction produced the
expected results, while others either generated different outcomes or lacked crucial components
such as arguments or outputs.
4.13.2 Licenses. As we focus on an application perspective, we also report the code licenses for
the approaches for which the code is available. They are given in Table 4. Keep in mind that code
repositories of preprint papers may be preliminary versions or unfinished.

4.14 Notable Mentions and Promising Techniques


Besides the features we defined in Sec. 3.2, we identified other promising techniques and ideas in
the reviewed papers that are worth mentioning. For instance, some approaches specifically focus on
highlighting the commonalities and variabilities of logs. Lemur [72], LUNAR [21], OpenLogParser
4 https://github.com/zwpride/lemur; accessed 14-February-2025.
5 https://github.com/huashengyihao/LogGenius; accessed 14-February-2025.
6 https://github.com/LinDuoming/LogEval; accessed 14-February-2025.
7 https://github.com/lunyiliu/LogPrompt; accessed 14-February-2025.
8 https://github.com/CSTCloudOps/SelfLog; accessed 14-February-2025.
9 https://github.com/J-York/SuperLog; accessed 14-February-2025.

Table 4. Licenses of the approaches. Most approaches do not provide a license. Please note that, especially
for preprint papers, the license might change or, if absent, be added in the future (status 7-March-2025).

Approach License Preprint?


Cui et al. [8] (LogEval) N/A ✗
Ji et al. [22] (SuperLog) Apache License Version 2.0, January 2004 ✗
Jiang et al. [23] (LILAC) N/A
Liu et al. [36] (LogPrompt) N/A
Ma et al. [38] (LLMParser) N/A
Ma et al. [39] (OpenLogParser) N/A ✗
Mehrabi et al. [42] N/A
Pei et al. [49] (SelfLog) N/A
Sun et al. [55] (Semirald) N/A ✗
Vaarandi et al. [59] (LLM-TD) GNU General Public License version 2 ✗
Xiao et al. [64] (LogBatcher) MIT License 2024
Xu et al. [66] (DivLog) Apache License Version 2.0, January 2004
Yu et al. [68] (LogGenius) N/A
Zhang et al. [72] (Lemur) Apache License Version 2.0, January 2004 ✗

[39], and LogBatcher [64] cluster the logs into multilevel clusters for parsing, while the supervised
parsers DivLog [66] and LILAC [23] use this concept for sampling labeled logs. The coarse-grained
clusters thereby capture the commonality of the logs, which, in the best case, all belong to the same
log group (i.e., the same event), and the fine-grained clusters should highlight the variabilities within
that group. This idea is also used by LogShrink [34] for compressing large log files.
Xiao et al. (LogBatcher) [64] report using the heuristic rules proposed by Khan et al. [27] to
correct templates. This includes measures such as combining consecutive wildcards, like <*><*>,
into a single one <*>. This technique is also used by [23, 65, 66] and other work outside of our
selection [33, 37]. By manually inspecting some of the templates generated by the parsers used for
the evaluation, we found that a significant proportion of templates would have been considered
correct by the evaluation functions if some of these steps had been applied.
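The following is a minimal sketch of such a wildcard-merging step, assuming <*> placeholders; it is our own illustration and not taken from any of the cited implementations.

import re

def merge_consecutive_wildcards(template):
    # Collapse runs of <*> placeholders (optionally separated by whitespace) into a single <*>.
    return re.sub(r"<\*>(\s*<\*>)+", "<*>", template)

print(merge_consecutive_wildcards("Received block <*> <*> of size <*> from <*>"))
# -> "Received block <*> of size <*> from <*>"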

5 Evaluation Setup
In terms of evaluation, the existing literature on LLM-based log parsing lacks a common
ground. While there are some commonalities regarding single features like the datasets, evaluation
metrics, or models used, the combination of these varies strongly between the approaches
- see features R-1, R-2, and R-3 in Table 2. As a consequence, we create a benchmark featuring a
subset of 7 out of the 29 approaches. To counteract randomness from sampling, LLM output, and
runtime fluctuations, we run each parser 3 times and take the average.

5.1 Selection for the Benchmark


For the benchmark we only select parsers which made their code publicly available — see feature
R-4 in Table 2. We distinguish between approaches that solely focus on methods changing model
parameters, like fine-tuning and pre-training, and approaches that apply the LLM as a tool within a
framework. Prior studies showed that fine-tuned [42] or pre-trained [22] models can achieve SotA
accuracy scores without complex pre- and post-processing modules for RAG (PS-2), caching (PS-3),
and template revision (PS-5). However, it is clear that improved quality of the LLM’s output leads
to overall improved output quality of the whole framework. It is therefore meaningful to separate
model-centric and model-wrapping approaches. For our evaluation, we focus on the latter and
therefore do not include LLMParser [38], SuperLog [22] and the work of Mehrabi et al. [42] since
they do not focus on frameworks but rather on the LLM itself, even though they provide their code or
model, respectively. We also exclude Semirald [55] from the evaluation since its parsing approach
consists only of naive queries to an LLM without other processing steps or model training.
In Sec. 4.13.1 we described that some approaches provide code, but do not provide functioning
parsers. They are consequently not included in the benchmark (except for LogPrompt).
Given the above-stated matters, the remaining approaches featured in the benchmark are LILAC
[23], OpenLogParser [39], LogBatcher [64], SelfLog [49], DivLog [66], LLM-TD [59], Log-
Prompt [36]. To a large extent, this selection represents the state of the art of LLM-based log
parsing frameworks, as their approaches cover most of the aspects of this field. Unfortunately, due
to the aforementioned selection criteria, this benchmark could not include any work using LLMs
as pre- or post-processors.

5.2 Datasets and Baseline


Following previous work [2, 3, 32, 33, 38, 64, 74] we take the corrected version of LogHub from
Khan et al. [27] as the default dataset for our evaluation. It consists of 16 datasets with ground-truth
templates from different domains such as distributed and supercomputer systems and server
applications. Note that we found minor errors in the “Content” column of the corrected version,
mostly additional whitespace characters, which led to errors during evaluation. We exchanged this column
(during runtime) with the “Content” column from the original LogHub version [77] (only the
templates are different but the content is the same). In Sec. 6.1, we also use the original LogHub
version [77] which contains the non-corrected templates. For the runtime evaluation in Sec. 6.2
we use the large-scale datasets from LogHub-2.0 [24].
Since the LogHub datasets are so widely used for the evaluation of log parsers (see Sec. 4.5),
there is a tendency that parsers work especially well on these datasets but not on others.
To simulate a use case closer to a real-life scenario, we extend the 16 datasets with a custom
dataset based on the AIT Log Dataset V2 [30, 31]. The dataset contains 2000 audit logs (from
russellmitchell/gather/intranet_server/logs/audit) with 8 unique templates. We manually
annotated the logs ourselves, adhering to the style of the templates from the corrected
LogHub version [27]. Audit logs track system activities, user actions, and changes for security,
compliance, and troubleshooting. The corresponding files can be found in our code repository.
For the baseline of the evaluation, we select five non-LLM-based parsers from the LogPAI
logparser repository10, namely AEL [25], SPELL [12], Drain [18], ULP [51] and Brain [67]. AEL,
SPELL and Drain have been widely used in the log parsing research and their performance has been
thoroughly investigated by [24, 27, 70]. Their papers were published in 2008 (AEL), 2016 (SPELL)
and 2017 (Drain). ULP and Brain are newer advancements from the years 2022 and 2023. For the
baseline parsers, the default configuration was used; note that this configuration contains the log
format, regular expressions (which are empty for some datasets), and two or more other parameters
that are specific to each of the LogHub datasets. For the Audit dataset [30, 31] the log format
“type=<Type> msg=audit(<Time>): <Content>” was used and the regular expression parameter
was left empty. The remaining parameters were set to the values used most often for the other
datasets and were not specifically tuned.
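For illustration, the Audit log format string can be translated into a regular expression along the lines of how the logparser toolkit handles format strings; the regular expression and the sample line below are our own sketch, not the toolkit's code.

import re

# Log format for the Audit dataset: "type=<Type> msg=audit(<Time>): <Content>"
AUDIT_PATTERN = re.compile(r"^type=(?P<Type>\S+) msg=audit\((?P<Time>[^)]+)\): (?P<Content>.*)$")

line = "type=SYSCALL msg=audit(1642424242.123:456): arch=c000003e syscall=59 success=yes"  # illustrative line
match = AUDIT_PATTERN.match(line)
if match:
    print(match.group("Type"))     # SYSCALL
    print(match.group("Time"))     # 1642424242.123:456
    print(match.group("Content"))  # arch=c000003e syscall=59 success=yes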

5.3 Evaluation Metrics


Following previous work [24] we select Group Accuracy (GA), Parsing Accuracy (PA), and F1-
score of Template Accuracy (FTA) which are described in Sec. 3.2. There, we also described the
10 https://github.com/logpai/logparser; accessed 21-March-2025.
Edit Distance (ED) from which we compute the Normalized Edit Distance (NED) [41] following
[64].
GA and PA are sensitive to the total number of log messages, which is problematic since most
log datasets contain a large number of logs but a much smaller number of unique templates [24].
Therefore, Jiang et al. [24] proposed to use F-metrics since they are insensitive to class imbalance,
which is why we use FTA. Furthermore, we use NED since it constitutes a similarity metric between
the ground-truth and the predicted template. The normalization also makes NED a more intuitive
indicator of effectiveness than ED.
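A minimal sketch of this computation is given below; it assumes the common normalization of the character-level edit distance by the length of the longer template, yielding a similarity score.

def edit_distance(a, b):
    # Character-level Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(ground_truth, predicted):
    # Similarity in [0, 1]; 1 means the predicted template equals the ground truth.
    longest = max(len(ground_truth), len(predicted), 1)
    return 1.0 - edit_distance(ground_truth, predicted) / longest

print(normalized_edit_distance("Deleting block <*>", "Deleting block <*> file <*>"))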

5.4 Implementation and Settings


This section describes the implementation details and the settings used.
5.4.1 Settings. The evaluation was performed using three different LLMs of different sizes (small,
medium, large):
(1) CodeLlama11 (codellama:7b-instruct) is built on top of Llama2 from Meta and has been
designed for code-related tasks. It offers strong performance while requiring significantly
fewer parameters than other models. This model is employed to illustrate the performance
of a relatively small and efficient LLM (7 billion parameters). Astekin et al. [2] report
that CodeLlama outperforms other LLMs (such as GPT-3.5) in parsing accuracy (PA). In
a separate study, Astekin et al. [3] find that both CodeLlama and GPT-3.5 deliver stable
results for the log parsing task. The model was run on an Ubuntu 24.04 LTS server with a
15-core Intel Xeon Gold 6226R CPU and a Tesla V100S (32 GB) GPU via Ollama12.
(2) GPT-3.513 (gpt-3.5-turbo-0125) is a widely known closed-source model that reached
outstanding performance in the earlier stages of the LLM hype. It was chosen because it is the
most frequently used model across all reviewed papers and it is considered to be medium-sized
(the size is actually undisclosed, but a paper by researchers from Microsoft [53] states it is 20 billion
parameters). The model is proprietary and was accessed via the OpenAI API.
(3) DeepSeek R114 utilizes a novel reinforcement learning approach for model training. It has
recently received wide public attention due to its outstanding performance and open-source
release, while being relatively small (but still large) compared to similarly performing
models like OpenAI’s o1. This model is used to assess the potential gains in accuracy that
can be achieved by utilizing one of the most effective models (671 billion parameters).
DeepSeek R1 was accessed via the TogetherAI15 API.
The runtime evaluation with the LogHub-2.0 dataset [24] was run with GPT-3.5 on an Ubuntu
24.04 LTS server with a 32-core AMD EPYC-Milan CPU and 64 GB RAM.
5.4.2 Configuration and Code Changes. For all parsers we set the initial temperature of the LLMs
to 0 for a more deterministic output. Note that some parsers increase the temperature for some of
their LLM calls, e.g., in case of a template mismatch.
As mentioned in Sec. 4.13.1, there are some issues with some of the parsers’ code which had
to be resolved for the benchmark. To ensure the comparability of the parsers’ output, other
changes to the code of each parser were also necessary. These changes were kept as small as possible
so as not to disturb the original design or interfere with performance more than necessary.
11 https://ollama.com/library/codellama; accessed 7-March-2025
12 https://ollama.com; accessed 14-March-2025
13 https://platform.openai.com/docs/models/gpt-3.5-turbo; accessed 7-March-2025
14 https://github.com/deepseek-ai/DeepSeek-R1; accessed 25-February-2025
15 https://www.together.ai/; accessed 14-March-2025

To ensure model or API compatibility, we updated deprecated versions of the OpenAI (Python)
package to newer working ones, and we added the functionality to call the OpenAI, Ollama,
and TogetherAI APIs where it was missing. To obtain a consistent output format, we ensure that the
template placeholder symbols are always <*> and that the output is always a CSV file with at least the
columns “Content” and “EventTemplate”. “Content” refers to the log filtered by the log format
(i.e., the log message). Since some parsers do not only parse the log content but the entire log, we
modify the input to these parsers so that they only parse the content as well. As found by He et al. [17], this
typically improves the parsing accuracy.
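To illustrate this unified setup, the sketch below sends a single log content line to GPT-3.5 with temperature 0 via the current OpenAI Python client and writes the result in the common CSV format; the prompt wording, file name, and example log are our own illustration and not taken from any of the benchmarked parsers.

import csv
from openai import OpenAI  # requires openai>=1.0; reads OPENAI_API_KEY from the environment

client = OpenAI()

def extract_template(content):
    # Ask the LLM for a template of a single log message (illustrative prompt).
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        temperature=0,  # more deterministic output
        messages=[{"role": "user",
                   "content": "Abstract the variable parts of the following log message "
                              "with <*> and return only the template:\n" + content}],
    )
    return response.choices[0].message.content.strip()

contents = ["Connection from 10.0.0.1 port 22 closed"]
with open("parsed.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Content", "EventTemplate"])
    writer.writeheader()
    for content in contents:
        writer.writerow({"Content": content, "EventTemplate": extract_template(content)})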
Specific changes were made to DivLog [66], which is one of the earliest works on LLM-based log
parsing. It does not feature a caching mechanism and therefore calls the LLM for each log line.
Since this is a significant burden in terms of runtime (and cost) and therefore also unrealistic for
real-life adoption, we added a simple cache consisting only of the set of already parsed templates.
The cache returns the corresponding template in the event of a match instead of calling the LLM.
This slightly affects the accuracy metrics due to improved output consistency. Additionally, we
implemented a minor post-processing step where multiple consecutive wildcards were replaced
by a single wildcard <*> for efficient matching, to resolve an issue with the generation of infinitely many
consecutive wildcards for some of the logs. This measure slightly increases DivLog’s accuracy.
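A minimal sketch of the added cache is shown below; parse_with_llm stands in for DivLog's actual LLM query, and the matching is a simplification that treats each <*> as one non-whitespace token.

import re

class TemplateCache:
    # Set of already parsed templates; returns a cached template on match instead of calling the LLM.
    def __init__(self):
        self._entries = []  # list of (template, compiled regex) pairs
        self._seen = set()

    def add(self, template):
        if template not in self._seen:
            pattern = re.compile(re.escape(template).replace(re.escape("<*>"), r"\S+"))
            self._entries.append((template, pattern))
            self._seen.add(template)

    def lookup(self, content):
        for template, pattern in self._entries:
            if pattern.fullmatch(content):
                return template
        return None

def parse(content, cache, parse_with_llm):
    cached = cache.lookup(content)
    if cached is not None:
        return cached                    # cache hit: no LLM call
    template = parse_with_llm(content)   # placeholder for the parser's LLM query
    cache.add(template)
    return template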
LogPrompt [36] accepts a maximum prompt size parameter which is 3000 by default, corresponding
to roughly 25 logs parsed at once (as the authors state). Early experiments revealed that this performs
extremely poorly, hence we set the maximum prompt size to 1000. This corresponds to roughly 5 to
10 logs per prompt, which is within the range of simultaneously parsed logs recommended by Xiao
et al. [64]. If a template cannot be matched with a log, the log is sent to the LLM again. We limit the
maximum number of re-prompts to 3 to avoid infinite LLM calls. LogPrompt also required some
code corrections beyond fixing typos. For some datasets, LogPrompt runs into an infinite loop due
to a faulty template extraction process from the LLM responses. This was fixed by computing
the enumeration of the logs instead of extracting it from the prompt again. From the paper it is
not clear whether the authors used ICL, CoT, or the self-prompt paradigm for their parsing performance
evaluation, yet the parsing functionality is only provided for the self-prompt case.
SelfLog [49] requires the user to set up an SQL database, yet the main script (run.py) does not
contain the self-evolution functionality described in the paper (updating the database with new
templates). Therefore, we removed the part of the prompt containing example logs with their
ground-truth templates and operated the parser unsupervised and without RAG (no database).

5.4.3 Sampling for Supervised Parsers. Among the selected parsers, LILAC [23] and DivLog [66]
are supervised and require labeled logs to achieve competitive accuracy. In this study,
we emphasize minimal human effort and therefore keep the number of available templates 𝑛
small, namely 𝑛 = 2 and 𝑛 = 4. Note that the Apache dataset from LogHub [77] contains the
smallest number of templates, which is only 6. For LILAC and DivLog it has been shown that
the performance increases with an increasing number of labeled logs available [23, 66]. This is
expected, as the similarity search then retrieves the exact template the target log belongs to with a higher
likelihood. Multiple works from our selection [23, 66] as well as [33] state that having a diverse
candidate set is crucial to reduce the risk of overfitting to a specific log template, which is why
they create specific sampling algorithms to sample from labeled logs. However, having a large
proportion of templates at hand is an unrealistic setting in real-life application and requires users
to label logs manually [21].
To ensure comparability, we exchanged the sampling processes of the supervised parsers DivLog
[66] and LILAC [23] with a simple custom sampling approach: from all labeled logs of a dataset we
randomly select 𝑛 (unique) templates. For each of those templates we randomly select a matching
log and thus obtain a set of log-template pairs which constitutes the candidate set for the supervised
parsers. We sample a new set before each run. In the remainder of this paper, we indicate the
two supervised parsers by the suffix “-𝑛”, thus LILAC-𝑛 and DivLog-𝑛, with 𝑛 ∈ {2, 4}.
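A minimal sketch of this sampling procedure is given below; it assumes the labeled dataset is available as a list of (content, template) pairs and is our own illustration of the approach described above.

import random

def sample_candidates(labeled_logs, n, seed=None):
    # Randomly select n unique templates and one matching log per template.
    rng = random.Random(seed)
    by_template = {}
    for content, template in labeled_logs:
        by_template.setdefault(template, []).append(content)
    chosen_templates = rng.sample(sorted(by_template), k=n)
    return [(rng.choice(by_template[t]), t) for t in chosen_templates]

labeled = [
    ("Connection from 10.0.0.1 port 22 closed", "Connection from <*> port <*> closed"),
    ("Connection from 10.0.0.2 port 22 closed", "Connection from <*> port <*> closed"),
    ("Received block blk_1 of size 512", "Received block <*> of size <*>"),
    ("Failed password for root", "Failed password for <*>"),
]
print(sample_candidates(labeled, n=2, seed=0))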

6 Evaluation Results
This section contains the evaluation results with a focus on effectiveness and efficiency.

6.1 Effectiveness
6.1.1 Baseline Performance. The performance of the baseline parsers is visualized in Fig. 3. For
some reason, the implementation of ULP [51] outputs the templates in a special format, which is
why the matching process of the evaluation function does not recognize correctly parsed logs; thus
PA and FTA are 0 (GA is not affected).

(Boxplots of GA, FTA, PA, and NED for each baseline parser: AEL, Brain, Drain, SPELL, and ULP.)

Fig. 3. Performance of the baseline parsers on the corrected LogHub dataset including Audit.

6.1.2 Performance of LLM-based Parsers. The evaluation results of the parsers with the three LLMs
on LogHub [77] are given in Fig. 4a, Fig. 4b, and Fig. 4c. Note that we excluded LogPrompt from the
evaluation with DeepSeek R1 because the LLM was barely able to extract any templates, resulting
in scores of roughly 0 in experimental runs. The LLM did not provide the output in the structured
format that was requested. Therefore, the template extraction process, which is based on markers
positioned directly before the extracted template, was not able to properly extract the template. This
led to long inner monologues of DeepSeek R1 and the maximum number of re-prompts due to log-template
mismatches, which would have resulted in an unreasonable financial expense given the cost of the
LLM and the poor performance.
In the boxplots of Fig. 4a, Fig. 4b, and Fig. 4c we can see that the best performance on each
metric and for each LLM is achieved by LogBatcher [64], followed by LILAC [23] for 2 and 4 shots.
LogBatcher and LILAC clearly outperform the conventional parsers, while the conventional parsers
outperform the rest of the LLM-based parsers. In general, the performance of each LLM parser is
rather stable across the different models. CodeLlama visibly performs worst, but it is also by far the
smallest of the models with 7 billion parameters. Given the size difference between GPT-3.5 and DeepSeek
R1, it is surprising that the latter does not outperform GPT-3.5. Furthermore, we report the
exact numbers for the evaluation with GPT-3.5 in Table 5 to show how the parsers performed on
each individual dataset.
6.1.3 LLM General Performance. Figure 5 shows the performance for CodeLlama, GPT-3.5, and
DeepSeek R1 averaged over all parsers and datasets. One can see that CodeLlama performs worst,
while GPT-3.5 slightly outperforms DeepSeek R1 except for PA. Consequently, the higher cost of
employing DeepSeek R1 is not justified. For instance, the OpenAI API for
GPT-3.5 (gpt-3.5-0125) costs $0.50 per 1 million tokens (input and output), while DeepSeek R1 from
the TogetherAI API costs $3 per 1 million input tokens and $7 per 1 million output tokens. We found that the
reasoning steps of DeepSeek R1 are not always helpful and, especially when a structured output is
required, may confuse the LLM. A possible reason could be that the process of log parsing, not being
complex enough, leads to overthinking [7]. However, in [7] the researchers state that DeepSeek R1
is robust against overthinking. The usefulness of reasoning models for log parsing should therefore
be investigated in future research.

(Figure 4 shows boxplots of GA, FTA, PA, and NED for LLM_TD, LogBatcher, LogPrompt, OpenLogParser, SelfLog, DivLog-2, DivLog-4, LILAC-4, and LILAC-2, with (a) CodeLlama, (b) GPT-3.5, and (c) DeepSeek R1.)

Fig. 4. Performance of the selected parsers on the corrected LogHub datasets including Audit.

6.1.4 Performance Difference on LogHub and corrected LogHub. As in the study by Khan et al. [27],
we report the performance difference between the evaluation on the corrected LogHub datasets [27]
and the original LogHub datasets [77] (performance of corrected minus performance of original)
with GPT-3.5. In Fig. 6 we can see that the largest differences occur for LogBatcher [64],
followed by LILAC [23] for both sample numbers. Naturally, LILAC's and DivLog's [66] samples
were collected and evaluated with the same dataset. That the difference is mostly on the positive
side indicates that they achieve better performance on the corrected LogHub dataset. Interestingly,
LILAC was originally evaluated with LogHub-2.0, whose templates correspond to the original
LogHub templates and not to the corrected ones [23]. In general, we can see a slight tendency that
the parsers output templates that correspond more to the corrected datasets’ templates, which
suggests that LLMs “intuitively” prefer the corrected templates’ format.

Table 5. Performance of the selected parsers with GPT-3.5 on the corrected LogHub datasets including Audit

Columns: Thunderbird, HealthApp, OpenStack, Zookeeper, OpenSSH, Windows, Proxifier, Android, Average, Hadoop, Apache, HDFS, Linux, Spark, Audit, HPC, BGL, Mac
Baseline
AEL GA 0.77 1.00 0.96 1.00 0.90 0.87 0.57 0.40 0.76 0.54 0.25 0.97 0.90 0.94 0.69 0.92 0.14 0.74
FTA 0.44 0.50 0.22 0.53 0.46 0.24 0.10 0.27 0.17 0.23 0.08 0.44 0.28 0.20 0.27 0.45 0.00 0.29
PA 0.39 0.69 0.34 0.62 0.68 0.51 0.16 0.17 0.17 0.25 0.02 0.67 0.38 0.04 0.15 0.75 0.00 0.35
NED 0.90 0.92 0.84 0.93 0.97 0.70 0.56 0.74 0.71 0.92 0.60 0.97 0.71 0.71 0.90 0.92 0.38 0.79
Brain GA 0.86 1.00 0.99 1.00 0.94 0.95 1.00 0.35 0.94 1.00 0.49 0.53 1.00 0.97 1.00 0.99 0.21 0.84
FTA 0.37 0.67 0.24 0.80 0.46 0.20 0.40 0.40 0.33 0.23 0.21 0.74 0.44 0.34 0.45 0.57 0.00 0.40
PA 0.24 0.70 0.41 0.96 0.66 0.16 0.25 0.17 0.34 0.28 0.11 0.70 0.38 0.04 0.46 0.78 0.00 0.39
NED 0.84 0.92 0.84 1.00 0.88 0.64 0.88 0.82 0.86 0.95 0.96 0.97 0.96 0.82 0.92 0.94 0.41 0.86
Drain GA 0.83 1.00 0.96 1.00 0.89 0.95 0.78 0.42 0.79 0.79 0.22 0.76 0.92 0.96 1.00 0.97 0.14 0.79
FTA 0.52 0.50 0.22 0.53 0.39 0.30 0.11 0.40 0.20 0.44 0.02 0.41 0.40 0.25 0.41 0.52 0.00 0.33
PA 0.55 0.69 0.34 0.63 0.65 0.51 0.23 0.18 0.22 0.51 0.02 0.68 0.38 0.05 0.46 0.79 0.00 0.41
NED 0.90 0.92 0.84 0.93 0.96 0.79 0.61 0.78 0.76 0.97 0.51 0.96 0.96 0.81 0.86 0.94 0.41 0.82
SPELL GA 0.86 1.00 0.79 1.00 0.65 0.78 0.64 0.15 0.76 0.56 0.26 0.53 0.90 0.84 0.99 0.96 0.14 0.69
FTA 0.25 0.50 0.06 0.43 0.39 0.15 0.10 0.14 0.05 0.23 0.00 0.04 0.19 0.15 0.10 0.37 0.00 0.18
PA 0.15 0.69 0.20 0.30 0.53 0.11 0.15 0.09 0.03 0.19 0.00 0.48 0.32 0.03 0.00 0.75 0.00 0.24
NED 0.81 0.92 0.67 0.94 0.84 0.49 0.52 0.75 0.66 0.86 0.50 0.93 0.67 0.71 0.77 0.94 0.38 0.73
ULP GA 0.74 1.00 0.93 1.00 0.95 0.99 0.90 0.18 0.81 0.43 0.47 0.02 0.92 0.68 0.41 0.99 0.34 0.69
FTA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
PA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
NED 0.63 0.68 0.65 0.58 0.66 0.68 0.74 0.58 0.66 0.75 0.64 0.76 0.67 0.70 0.43 0.76 0.50 0.65
Unsupervised
LLM_TD GA 0.00 1.00 0.50 0.98 0.22 0.47 0.24 0.39 0.00 0.25 0.82 0.27 0.29 0.26 0.57 0.35 0.33 0.41
FTA 0.00 0.50 0.20 0.00 0.19 0.17 0.10 0.14 0.00 0.40 0.57 0.00 0.12 0.04 0.10 0.16 0.00 0.16
PA 0.00 0.69 0.45 0.00 0.20 0.12 0.24 0.28 0.00 0.36 0.25 0.00 0.23 0.10 0.01 0.11 0.00 0.18
NED 0.05 0.95 0.51 0.43 0.26 0.47 0.28 0.42 0.06 0.68 0.46 0.22 0.42 0.29 0.48 0.18 0.09 0.37
LogBatcher GA 0.97 1.00 0.99 1.00 0.95 0.99 1.00 0.84 0.92 1.00 0.98 1.00 1.00 0.90 1.00 0.99 0.27 0.93
FTA 0.78 1.00 0.83 1.00 0.75 0.74 0.95 0.75 0.50 0.82 0.79 1.00 0.76 0.67 0.62 0.81 0.13 0.76
PA 0.83 1.00 0.94 1.00 0.94 0.89 0.99 0.84 0.48 0.97 0.77 1.00 0.97 0.84 0.61 0.99 0.00 0.83
NED 0.92 1.00 0.99 1.00 1.00 0.97 1.00 0.94 0.86 0.99 0.94 1.00 0.98 0.96 0.85 1.00 0.77 0.95
LogPrompt GA 0.13 0.19 0.14 0.00 0.17 0.07 0.06 0.12 0.20 0.00 0.01 0.00 0.01 0.07 0.12 0.45 0.00 0.10
FTA 0.12 0.00 0.03 0.00 0.07 0.08 0.06 0.15 0.09 0.00 0.00 0.00 0.03 0.18 0.05 0.07 0.00 0.06
PA 0.24 0.16 0.38 0.00 0.55 0.15 0.27 0.13 0.23 0.13 0.10 0.00 0.25 0.09 0.33 0.50 0.00 0.21
NED 0.78 0.85 0.83 0.49 0.85 0.70 0.71 0.61 0.77 0.73 0.43 0.54 0.82 0.69 0.78 0.86 0.54 0.70
OpenLogParser GA 0.85 1.00 0.96 0.85 0.93 0.96 0.82 0.42 0.81 0.86 0.57 0.76 0.87 0.94 0.99 0.96 0.14 0.81
FTA 0.32 0.50 0.42 0.00 0.59 0.28 0.46 0.45 0.27 0.65 0.05 0.00 0.41 0.36 0.15 0.41 0.00 0.31
PA 0.24 0.69 0.72 0.00 0.89 0.09 0.66 0.24 0.26 0.69 0.06 0.00 0.36 0.06 0.01 0.48 0.00 0.32
NED 0.17 0.46 0.37 0.27 0.30 0.21 0.26 0.27 0.16 0.39 0.24 0.37 0.21 0.19 0.44 0.27 0.48 0.30
SelfLog GA 0.89 1.00 0.97 1.00 0.89 0.99 0.92 0.29 0.79 0.44 0.44 0.53 0.94 0.70 0.72 0.99 0.14 0.74
FTA 0.39 0.33 0.29 0.00 0.16 0.27 0.48 0.31 0.24 0.17 0.54 0.00 0.38 0.35 0.31 0.24 0.00 0.26
PA 0.46 0.28 0.66 0.00 0.22 0.14 0.55 0.09 0.29 0.15 0.31 0.12 0.76 0.08 0.59 0.24 0.00 0.29
NED 0.76 0.85 0.94 0.86 0.65 0.82 0.89 0.79 0.80 0.85 0.73 0.79 0.93 0.82 0.81 0.80 0.51 0.80
Supervised
DivLog-2 GA 0.56 0.91 0.60 0.49 0.52 0.84 0.68 0.16 0.38 0.47 0.30 0.72 0.46 0.36 0.52 0.93 0.42 0.55
FTA 0.39 0.48 0.32 0.06 0.23 0.31 0.57 0.32 0.19 0.33 0.11 0.39 0.23 0.31 0.19 0.59 0.48 0.32
PA 0.52 0.70 0.72 0.57 0.77 0.32 0.76 0.33 0.26 0.64 0.24 0.76 0.67 0.35 0.31 0.80 0.49 0.54
NED 0.78 0.92 0.93 0.76 0.90 0.80 0.92 0.79 0.52 0.95 0.52 0.95 0.87 0.82 0.70 0.92 0.94 0.82
DivLog-4 GA 0.67 0.53 0.55 0.65 0.36 0.70 0.86 0.20 0.59 0.59 0.27 0.32 0.67 0.40 0.52 0.87 0.36 0.53
FTA 0.42 0.43 0.22 0.25 0.18 0.25 0.56 0.34 0.30 0.55 0.10 0.29 0.48 0.30 0.22 0.53 0.45 0.34
PA 0.50 0.70 0.67 0.71 0.53 0.32 0.76 0.38 0.36 0.65 0.25 0.85 0.77 0.27 0.63 0.79 0.56 0.57
NED 0.78 0.91 0.93 0.87 0.63 0.69 0.91 0.81 0.78 0.95 0.48 0.91 0.93 0.64 0.80 0.93 0.77 0.81
LILAC-4 GA 0.95 1.00 0.98 0.94 0.95 0.97 0.91 0.30 0.81 0.75 0.49 0.95 0.97 0.94 0.79 0.99 1.00 0.86
FTA 0.66 1.00 0.89 0.46 0.75 0.59 0.80 0.70 0.47 0.84 0.83 0.92 0.67 0.56 0.57 0.69 0.88 0.72
PA 0.57 1.00 0.96 0.53 0.92 0.83 0.89 0.42 0.49 0.78 0.43 0.99 0.93 0.73 0.70 0.64 0.78 0.74
NED 0.85 1.00 0.99 0.82 0.99 0.95 0.96 0.92 0.86 0.97 0.90 1.00 0.97 0.93 0.85 0.82 0.99 0.93
LILAC-2 GA 0.96 1.00 0.96 0.94 0.88 0.89 0.91 0.29 0.79 0.63 0.49 0.85 0.92 0.96 0.69 0.99 1.00 0.83
FTA 0.70 1.00 0.80 0.50 0.70 0.57 0.80 0.61 0.48 0.75 0.74 0.77 0.66 0.61 0.57 0.70 0.67 0.68
PA 0.57 1.00 0.94 0.48 0.87 0.70 0.89 0.37 0.48 0.64 0.37 0.92 0.88 0.78 0.52 0.66 0.49 0.68
NED 0.87 1.00 0.98 0.91 0.98 0.90 0.96 0.87 0.87 0.94 0.90 0.99 0.99 0.95 0.77 0.87 0.98 0.93

(Figure 5 shows boxplots of GA, FTA, PA, and NED for CodeLlama, GPT-3.5, and DeepSeek R1, averaged over all parsers and datasets.)

Fig. 5. Averaged performance of all parsers for CodeLlama, GPT-3.5 and DeepSeek R1.

(Figure 6 shows the per-parser differences in GA, FTA, PA, and NED, ranging from -1.0 to 1.0, for LLM_TD, LogBatcher, LogPrompt, OpenLogParser, SelfLog, DivLog-2, DivLog-4, LILAC-4, and LILAC-2.)

Fig. 6. Performance of the parsers with GPT-3.5 on corrected LogHub minus the performance on the original
LogHub.

6.1.5 Performance on the Audit Dataset. An analysis of the performance of all parsers on the Audit
dataset reveals that only the supervised parser LILAC [23], and to some extent also DivLog [66], are
able to parse the logs effectively.
for Audit was not specifically tuned. It appears that the intended templates’ format of the logs is
not straightforward for GPT-3.5 without supervised demonstrations. This finding underscores the
efficacy of supervised parsers, demonstrating that a mere two templates are sufficient to attain a
reasonable performance level, potentially even a single one demonstrating the preferred template
style. Nevertheless, it should be noted that this is merely a solitary example. Future research could
establish a more extensive benchmark by incorporating a range of novel and diverse datasets.

6.2 Efficiency
For the efficiency evaluation we select the well-known datasets HDFS and BGL from
LogHub-2.0 [24]. HDFS contains 11.2 million logs with 46 unique templates, while BGL contains 4.6
million logs with 320 unique templates. We excluded DivLog [66] and LogPrompt [36] from this
evaluation since they do not utilize caching; their computation would therefore scale linearly with
the number of logs and require several million LLM calls.
Figure 8 visualizes the computation time and LLM invocation time of the parsers, while Fig. 9 shows
the number of LLM calls made. The invocation time for conventional parsers is, of course, zero.
Outstanding runtime efficiency is achieved by the conventional parser ULP [51] on both datasets.
While LLM-TD is the second fastest on HDFS and also makes the fewest LLM calls there, the opposite
holds on the BGL dataset. As mentioned in Sec. 4.13.1 and 5.4.2, there were some implementation problems
with SelfLog [49]. The authors provided a faulty script for the efficiency evaluation, which is why we
ran the evaluation with the script they designed for the evaluation of the 2000 log lines version
of LogHub [77]. However, for both HDFS and BGL the processes were killed and are therefore not
featured in the plots.

(Figure 7 shows bar charts of GA, FTA, PA, and NED with GPT-3.5 on the Audit data for LILAC-4, LILAC-2, DivLog-2, DivLog-4, ULP, LLM_TD, LogBatcher, Brain, AEL, Drain, SPELL, SelfLog, OpenLogParser, and LogPrompt.)

Fig. 7. Performance of the selected parsers with GPT-3.5 only on the Audit data.

HDFS (computation + LLM invocation time, seconds): ULP 78; LLM_TD 274+35; LogBatcher 289+38; LILAC-4 573+34; LILAC-2 578+30; Brain 1058; AEL 1142; Drain 1222; SPELL 1328; OpenLogParser 5207+59.
BGL (computation + LLM invocation time, seconds): ULP 47; LogBatcher 63+284; Brain 410; Drain 450; LILAC-4 806+249; LLM_TD 1610+1360; SPELL 1846; LILAC-2 1931+249; OpenLogParser 2318+441; AEL 3360.

Fig. 8. Computation and invocation time of parsers on HDFS and BGL with GPT-3.5.

HDFS (number of LLM calls): LLM_TD 21.0; LILAC-2 41.3; LILAC-4 42.0; LogBatcher 49.3; OpenLogParser 73.0.
BGL (number of LLM calls): LILAC-2 325.0; LILAC-4 326.3; LogBatcher 380.0; OpenLogParser 416.0; LLM_TD 1534.0.

Fig. 9. Number of LLM calls of parsers on HDFS and BGL with GPT-3.5.

7 Discussion
This section summarizes the findings of the literature review from Sec. 4 and of the evaluation
from Sec. 6 by answering the research questions from Sec. 1.

7.1 Answers to Research Questions


RQ1: What are the main advantages and disadvantages of LLM-based log parsing approaches and
non-LLM-based approaches? LLM-based log parsing approaches offer key advantages such as
adaptability to diverse log formats, the ability to generalize across different system logs, and
robustness to unseen log templates as demonstrated by the evaluation on the Audit dataset [30, 31]
and other work [15]. Approaches like the unsupervised parser LogBatcher [64] and the supervised
parser LILAC [23] can handle complex and unstructured logs more effectively than traditional
methods. However, they also come with drawbacks, including higher computational costs, potential
hallucinations and overthinking for reasoning models [7], and difficulty in interpreting results due
to their black-box nature. In contrast, non-LLM-based approaches, such as Brain [67] and ULP [51],
tend to be more interpretable, computationally efficient, and easier to deploy but lack the flexibility
of LLMs in handling evolving log templates.

RQ2: To what extent do LLM-based log parsing approaches rely on labeled data and manual config-
uration? LLM-based log parsing approaches typically require labeled datasets for fine-tuning or
evaluation, though some methods leverage self-evolutionary approaches that use previously parsed
templates to obviate the need for a priori labels. However, with only 2 and 4 shots of labeled logs,
LILAC [23] achieves superior performance on the corrected LogHub dataset [27] and Audit [30, 31]
compared to the conventional parsers and other LLM-based parsers, excluding LogBatcher [64].
Furthermore, LogBatcher demonstrates impressive performance in both runtime and effectiveness
while being unsupervised and only requiring the log format as an input parameter. Fine-tuning
approaches also report outstanding performance, yet their stronger reliance on labeled logs and
computational power, together with the robust and high performance of LogBatcher and LILAC,
demonstrates that ICL is a cheaper yet performant alternative.
In comparison to the baseline parsers AEL [25], SPELL [12], Drain [18], ULP [51] and Brain
[67], the LLM-based parsers of our evaluation selection require on average fewer configuration
parameters. This finding is particularly noteworthy in light of the satisfactory performance achieved
by LogBatcher [64]. It suggests that the deployment of LLMs can effectively substitute a
substantial proportion of manual effort and thus enhance usability.

RQ3: Which techniques can enhance the efficiency or effectiveness of LLM-based log parsing? It is
evident that ICL and fine-tuning present powerful learning paradigms for LLM-based log parsing.
They are not mutually exclusive to each other and can be employed simultaneously to improve
parsing accuracy. ICL can be enhanced with RAG or with previously parsed or labeled logs as
demonstrations. The ablation studies of the reviewed papers [23, 64–66, 71] unanimously conclude
that dynamic demonstrations and sophisticated demonstration selection algorithms improve the
accuracy scores. Smart caching solutions have been shown to lead to efficiency improvements, in
terms of runtime and monetary expenses, since LLM calls are usually costly. Template revision
methods that correct, delete or merge related templates retrospectively can further improve cached
templates. This can also help adapt the parser to evolving logs due to updates or other behavioral
changes of the computer systems without reconfiguration or retraining. Evaluations show that it is
also possible to improve the effectiveness of parsers with more powerful LLMs, yet the effective
application of reasoning models like DeepSeek R1 should be investigated more closely. It has been
shown that variable-aware prompts improve the effectiveness of direct parsing with LLMs. Naturally,
more explanations in the prompt mean more tokens per parsed log which negatively affects the
inference time of LLMs. Short and concise instructions and explanations could thereby alleviate
this issue. Furthermore, creating coarse-grained clusters to capture the commonalities of logs and
fine-grained clusters to capture the variabilities of logs to support the template extraction process
constitutes another promising technique, that can be applied by conventional and LLM-based
parsers alike.
RQ4: What experiment design choices influence the comparability of the evaluation results between
LLM-based log parsing approaches? Key experiment design choices include the selection of bench-
mark datasets, evaluation metrics, training methodologies, and hardware specifications. Consistency
in log datasets and pre-processing methods is crucial for fair comparisons. For instance, not all
parsers require the log format as an input parameter, such as LLM-TD [59] and LUNAR [21]. LUNAR
explicitly includes log headers, such as IP addresses, timestamps, or levels, to grasp the full context
of the log. For instance, partitioning logs into groups based on their timestamps is also possible with
time intervals [76]. Previous research [17] found that parsing only the extracted log message part,
according to the log format, improves effectiveness, but especially LLMs with their generalization
capabilities could utilize this extended context.
A set of metrics such as GA, PA, FTA, but also NED (or ED), should be standardized across studies
to cover all relevant characteristics of templates, since each metric maps to specific requirements
of specific downstream tasks. The results of our evaluation show high variance between these
metrics for different parsers, indicating that the selection covers a variety of aspects. The choice of
model hyperparameters also significantly impacts comparability. For example, the researchers of
Lemur [72] report near-perfect scores of 0.999 and 0.996 for FGA and GA on LogHub [77], yet they
require log format, regular expressions, and four other (hyperparameter-tuned) parameters for
each dataset. This demonstrates the potential performance levels that can be achieved. However,
the number of parameters required is impractical for real-world applications and a comparison to
the performance of other parsers is not feasible.
The aforementioned experiment design choices concern all types of parsers. Specific to LLM-
based parsers, the employed LLM is also an important factor to consider regarding comparability of
evaluation results. As demonstrated, CodeLlama, GPT-3.5, and DeepSeek R1 perform significantly
differently on the effectiveness scores. Additionally, variations in API versions, underlying model
updates, and differences in temperature settings further contribute to comparability issues in
performance across studies. This highlights the need for standardized benchmarking practices to
ensure fair and reliable comparisons between different LLM-based parsing approaches.
RQ5: To what extent are the presented LLM-based parsing methods accessible and reproducible? The
accessibility and reproducibility of the presented LLM-based parsing methods are significantly
hindered by issues related to code availability, documentation, and execution reliability. Many
approaches do not provide public code, and even when code is available, it often lacks essential
components, such as necessary scripts or instructions, making replication difficult. Several reposito-
ries suffer from missing files, incorrect implementations, or incomplete functionality that does not
align with the descriptions in their respective papers. Furthermore, discrepancies in the code, such
as missing scripts for crucial steps or incorrect default settings, complicate reproducibility. Even
when modifications are made to ensure compatibility — such as fixing typos, updating libraries, or
implementing minor adjustments to improve consistency — these changes highlight the necessity
of external intervention to make the methods functional. The need for such corrections aligns with
broader findings from Trisovic et al. [58] and Olzewski et al. [47], who observed that a majority
of replication code repositories contain errors that prevent immediate execution. Olzewski et al.
further found that only a small proportion of the running codebases also produce the claimed
results of the related papers [47]. Additionally, some parsers require substantial configuration
efforts, further limiting accessibility. Whilst minor improvements, such as adding caching mech-
anisms to DivLog [66] or modifying LogPrompt’s [36] prompt size, enhance usability, they also
introduce deviations from the original implementations, raising concerns about reproducibility. The
discrepancy between our own evaluation results and the reported results raises further concerns
about the validity of the publications’ evaluations but naturally also about our own evaluation.
Future benchmarks are expected to address these concerns. Finally, while LLM-based log parsers
show great potential, their current state significantly hinders accessibility and reliable reproduction
without substantial external effort.

7.2 Threats to Validity


We identified two major threats to the validity of the results of our evaluation concerning the
adoption of the code and randomness.
7.2.1 Mistakes in Code Adoption. In Sec. 4.13.1 we describe issues with the code of some of the
parsers, and in Sec. 5.4.2 we describe, for transparency, the changes we made due to these issues.
We found that a significant number of parsers do not attain the performance levels
claimed in their respective papers. This raises the question of whether the observed discrepancies can
be attributed to our alterations of their code or to inaccuracies in their evaluations. Alternatively, it
is possible that other factors, such as the utilization of different language models, are responsible
for the observed variations. However, given that the alterations were predominantly minor, we
anticipate that the correctness was not negatively influenced. The code has been made available on
GitHub for replication. Should any flaws in the adoption of the parsers be identified, we ask that an
issue be reported.
7.2.2 Randomness. The validity of results is threatened by the randomness in LLM output and
the randomness introduced by the random sampling for the supervised parsers. To mitigate the
randomness in LLMs, we set the LLMs’ initial temperature to 0. The temperature determines the
creativity of an LLM. However, even a temperature of 0 does not guarantee a deterministic output
[3]. Therefore, we repeated the evaluation three times and computed the average of the measured
values.

8 Conclusion
In this work, we conducted a systematic review of LLM-based log parsing approaches, analyzed
their strengths and weaknesses in comparison to non-LLM-based methods, and performed an
extensive benchmark evaluation to assess their effectiveness, efficiency, and usability. Our findings
demonstrate that LLM-based log parsers provide notable advantages, particularly in their adapt-
ability to diverse log formats and their capability to generalize across unseen templates. However,
they also come with challenges such as high computational costs, susceptibility to hallucinations,
and issues with interpretability.
A key observation is that LLM-based log parsers often reduce the need for manual configuration
and labeling, making them more accessible to users with limited domain expertise. Techniques
such as in-context learning (ICL), fine-tuning, retrieval-augmented generation (RAG), and template
revision can significantly enhance the efficiency and accuracy of these methods. Sophisticated
dynamic demonstration selection and caching strategies have been shown to significantly impact
performance, reducing both LLM inference time and cost. However, despite these improvements,
only two out of seven LLM-based parsers, namely LILAC [23] and LogBatcher [64], are able to
outperform the non-LLM-based parsers that constitute our baseline. Furthermore, we identified a
lack of reproducibility and comparability, as many implementations lack publicly available code or
datasets, comprehensive documentation, or consistency with reported results. Our study highlights
the necessity of standardized benchmarking practices for LLM-based log parsing. Differences in
experimental setups, dataset preprocessing, and metric selection complicate direct comparisons
between methods and impede a meaningful selection for user-specific downstream tasks.

Overall, LLM-based log parsing represents a promising direction for automated log analysis.
Techniques like caching, RAG, template revision, especially when used in combination, have clearly
proven their value in this field, but further research is required to fully realize the potential of LLM-
based log parsing. We hope that our systematic review, benchmark study, and insights contribute
to the development of more effective, transparent, and user-friendly log parsing solutions. To
facilitate future research and reproducibility, we make our evaluation results and source code
publicly available on our GitHub repository.

Acknowledgments
Funded by the European Union under the European Defence Fund (GA no. 101121403 - NEWSROOM)
and under the Horizon Europe Research and Innovation programme (GA no. 101168144 - MIRANDA).
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect
those of the European Union or the European Commission. Neither the European Union nor the
granting authority can be held responsible for them. This work is co-funded by the Austrian FFG
Kiras project ASOC (GA no. FO999905301).

References
[1] Siraaj Akhtar, Saad Khan, and Simon Parkinson. 2025. LLM-based event log analysis techniques: A survey. https://doi.org/10.48550/arXiv.2502.00677 arXiv:2502.00677 [cs].
[2] Merve Astekin, Max Hort, and Leon Moonen. 2024. A Comparative Study on Large Language Models for Log Parsing. In
Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM
’24). Association for Computing Machinery, New York, NY, USA, 234–244. https://doi.org/10.1145/3674805.3686684
[3] Merve Astekin, Max Hort, and Leon Moonen. 2024. An Exploratory Study on How Non-Determinism in Large
Language Models Affects Log Parsing. In Proceedings of the ACM/IEEE 2nd International Workshop on Interpretability,
Robustness, and Benchmarking in Neural Software Engineering (InteNSE ’24). Association for Computing Machinery,
New York, NY, USA, 13–18. https://doi.org/10.1145/3643661.3643952
[4] Viktor Beck, Max Landauer, Markus Wurzenberger, Florian Skopik, and Andreas Rauber. 2024. Semi-supervised
Configuration and Optimization of Anomaly Detection Algorithms on Log Data. In 2024 IEEE International Conference
on Big Data (BigData). 2575–2585. https://doi.org/10.1109/BigData62323.2024.10825202 ISSN: 2573-2978.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. https://doi.org/10.48550/arXiv.2005.14165
arXiv:2005.14165 [cs].
[6] Jinfu Chen, Weiyi Shang, Ahmed E. Hassan, Yong Wang, and Jiangbin Lin. 2019. An Experience Report of Generating
Load Tests Using Log-Recovered Workloads at Varying Granularities of User Behaviour. In 2019 34th IEEE/ACM
International Conference on Automated Software Engineering (ASE). 669–681. https://doi.org/10.1109/ASE.2019.00068
ISSN: 2643-1572.
[7] Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar
Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and
Joseph E. Gonzalez. 2025. The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks.
https://doi.org/10.48550/arXiv.2502.08235 arXiv:2502.08235 [cs].
[8] Tianyu Cui, Shiyu Ma, Ziang Chen, Tong Xiao, Shimin Tao, Yilun Liu, Shenglin Zhang, Duoming Lin, Changchang
Liu, Yuzhe Cai, Weibin Meng, Yongqian Sun, and Dan Pei. 2024. LogEval: A Comprehensive Benchmark Suite for
Large Language Models In Log Analysis. http://arxiv.org/abs/2407.01896 arXiv:2407.01896.
[9] Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun Chen. 2022. Logram: Efficient Log Parsing Using
n-Gram Dictionaries. IEEE Transactions on Software Engineering 48, 3 (March 2022), 879–892. https://doi.org/10.1109/TSE.2020.3007554 Conference Name: IEEE Transactions on Software Engineering.
[10] Hetong Dai, Yiming Tang, Heng Li, and Weiyi Shang. 2023. PILAR: Studying and Mitigating the Influence of
Configurations on Log Parsing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).
818–829. https://doi.org/10.1109/ICSE48619.2023.00077 ISSN: 1558-1225.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis,
Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[12] Min Du and Feifei Li. 2016. Spell: Streaming Parsing of System Event Logs. In 2016 IEEE 16th International Conference
on Data Mining (ICDM). 859–864. https://doi.org/10.1109/ICDM.2016.0103 ISSN: 2374-8486.
[13] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System
Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security (CCS ’17). Association for Computing Machinery, New York, NY, USA, 1285–1298. https://doi.org/10.1145/3133956.3134015
[14] Asma Fariha, Vida Gharavian, Masoud Makrehchi, Shahryar Rahnamayan, Sanaa Alwidian, and Akramul Azim. 2024.
Log Anomaly Detection by Leveraging LLM-Based Parsing and Embedding with Attention Mechanism. In 2024 IEEE
Canadian Conference on Electrical and Computer Engineering (CCECE). 859–863. https://doi.org/10.1109/CCECE59415.2024.10667308 ISSN: 2576-7046.
[15] Ying Fu, Meng Yan, Jian Xu, Jianguo Li, Zhongxin Liu, Xiaohong Zhang, and Dan Yang. 2022. Investigating and
improving log parsing in practice. In Proceedings of the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, New
York, NY, USA, 1566–1577. https://doi.org/10.1145/3540250.3558947
[16] Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto
Navigli (Eds.). Association for Computational Linguistics, Online, 3816–3830. https://doi.org/10.18653/v1/2021.acl-
long.295
[17] Pinjia He, Jieming Zhu, Shilin He, Jian Li, and Michael R. Lyu. 2016. An Evaluation Study on Log Parsing and Its Use
in Log Mining. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
654–661. https://doi.org/10.1109/DSN.2016.66 ISSN: 2158-3927.
[18] Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. 2017. Drain: An Online Log Parsing Approach with Fixed
Depth Tree. In 2017 IEEE International Conference on Web Services (ICWS). 33–40. https://doi.org/10.1109/ICWS.2017.13
[19] Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R. Lyu. 2021. A Survey on Automated Log
Analysis for Reliability Engineering. ACM Comput. Surv. 54, 6 (July 2021), 130:1–130:37. https://doi.org/10.1145/3460345
[20] Shilin He, Jieming Zhu, Pinjia He, and Michael R. Lyu. 2016. Experience Report: System Log Analysis for Anomaly
Detection. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE). 207–218. https:
//doi.org/10.1109/ISSRE.2016.21 ISSN: 2332-6549.
[21] Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael R. Lyu. 2024. LUNAR: Unsupervised LLM-based Log
Parsing. http://arxiv.org/abs/2406.07174 arXiv:2406.07174.
[22] Yuhe Ji, Yilun Liu, Feiyu Yao, Minggui He, Shimin Tao, Xiaofeng Zhao, Su Chang, Xinhua Yang, Weibin Meng, Yuming
Xie, Boxing Chen, and Hao Yang. 2024. Adapting Large Language Models to Log Analysis with Interpretable Domain
Knowledge. https://doi.org/10.48550/arXiv.2412.01377 arXiv:2412.01377 [cs].
[23] Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and
Michael R. Lyu. 2024. LILAC: Log Parsing using LLMs with Adaptive Parsing Cache. Proc. ACM Softw. Eng. 1, FSE
(July 2024), 7:137–7:160. https://doi.org/10.1145/3643733
[24] Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Jieming Zhu, and
Michael R. Lyu. 2024. A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?. In Proceedings of the
33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024). Association for Computing
Machinery, New York, NY, USA, 223–234. https://doi.org/10.1145/3650212.3652123
[25] Zhen Ming Jiang, Ahmed E. Hassan, Parminder Flora, and Gilbert Hamann. 2008. Abstracting Execution Logs to
Execution Events for Enterprise Applications (Short Paper). In 2008 The Eighth International Conference on Quality
Software. 181–186. https://doi.org/10.1109/QSIC.2008.50 ISSN: 2332-662X.
[26] Rabimba Karanjai, Yang Lu, Dana Alsagheer, Keshav Kasichainula, Lei Xu, Weidong Shi, and Shou-Hsuan Stephen
Huang. 2024. LogBabylon: A Unified Framework for Cross-Log File Integration and Analysis. https://doi.org/10.1145/
3672608.3707883 arXiv:2412.12364 [cs].
[27] Zanis Ali Khan, Donghwan Shin, Domenico Bianculli, and Lionel Briand. 2022. Guidelines for assessing the accuracy
of log message template identification techniques. In Proceedings of the 44th International Conference on Software
Engineering (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1095–1106. https://doi.org/10.
1145/3510003.3510101
[28] Alex Kulesza and Ben Taskar. 2012. Determinantal Point Processes for Machine Learning. Foundations and Trends® in
Machine Learning 5, 2–3 (Dec. 2012), 123–286. https://doi.org/10.1561/2200000044 Publisher: Now Publishers, Inc.
[29] Max Landauer, Sebastian Onder, Florian Skopik, and Markus Wurzenberger. 2023. Deep learning for anomaly detection
in log data: A survey. Machine Learning with Applications 12 (June 2023), 100470. https://doi.org/10.1016/j.mlwa.2023.
100470
[30] Max Landauer, Florian Skopik, Maximilian Frank, Wolfgang Hotwagner, Markus Wurzenberger, and Andreas Rauber.
2023. Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and
Secure Computing 20, 4 (July 2023), 3466–3482. https://doi.org/10.1109/TDSC.2022.3201582 Conference Name: IEEE
Transactions on Dependable and Secure Computing.
[31] Max Landauer, Florian Skopik, Markus Wurzenberger, Wolfgang Hotwagner, and Andreas Rauber. 2021. Have it
Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed. IEEE Transactions on
Reliability 70, 1 (March 2021), 402–415. https://doi.org/10.1109/TR.2020.3031317 Conference Name: IEEE Transactions
on Reliability.
[32] Van-Hoang Le and Hongyu Zhang. 2023. Log Parsing: How Far Can ChatGPT Go?. In 2023 38th IEEE/ACM International
Conference on Automated Software Engineering (ASE). 1699–1704. https://doi.org/10.1109/ASE56229.2023.00206 ISSN:
2643-1572.
[33] Van-Hoang Le and Hongyu Zhang. 2023. Log Parsing with Prompt-Based Few-Shot Learning. In Proceedings of the
45th International Conference on Software Engineering (ICSE ’23). IEEE Press, Melbourne, Victoria, Australia, 2438–2449.
https://doi.org/10.1109/ICSE48619.2023.00204
[34] Xiaoyun Li, Hongyu Zhang, Van-Hoang Le, and Pengfei Chen. 2024. LogShrink: Effective Log Compression by
Leveraging Commonality and Variability of Log Data. In Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering (ICSE ’24). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.
1145/3597503.3608129
[35] Zhenhao Li, Chuan Luo, Tse-Hsun Chen, Weiyi Shang, Shilin He, Qingwei Lin, and Dongmei Zhang. 2023. Did We Miss
Something Important? Studying and Exploring Variable-Aware Log Abstraction. In 2023 IEEE/ACM 45th International
Conference on Software Engineering (ICSE). 830–842. https://doi.org/10.1109/ICSE48619.2023.00078 ISSN: 1558-1225.
[36] Yilun Liu, Shimin Tao, Weibin Meng, Jingyu Wang, Wenbing Ma, Yuhang Chen, Yanqing Zhao, Hao Yang, and
Yanfei Jiang. 2024. Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies. In
2024 IEEE/ACM 32nd International Conference on Program Comprehension (ICPC). 35–46. https://ieeexplore.ieee.org/
document/10556497/ ISSN: 2643-7171.
[37] Yudong Liu, Xu Zhang, Shilin He, Hongyu Zhang, Liqun Li, Yu Kang, Yong Xu, Minghua Ma, Qingwei Lin, Yingnong
Dang, Saravan Rajmohan, and Dongmei Zhang. 2022. UniParser: A Unified Log Parser for Heterogeneous Log Data.
In Proceedings of the ACM Web Conference 2022 (WWW ’22). Association for Computing Machinery, New York, NY,
USA, 1893–1901. https://doi.org/10.1145/3485447.3511993
[38] Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Peter Chen, and Shaowei Wang. 2024. LLMParser: An Exploratory
Study on Using Large Language Models for Log Parsing. In 2024 IEEE/ACM 46th International Conference on Software
Engineering (ICSE). 1209–1221. https://doi.org/10.1145/3597503.3639150 ISSN: 1558-1225.
[39] Zeyang Ma, Dong Jae Kim, and Tse-Hsun Chen. 2024. LibreLog: Accurate and Efficient Unsupervised Log Parsing
Using Open-Source Large Language Models. https://doi.org/10.48550/arXiv.2408.01585 arXiv:2408.01585 [cs].
[40] Adetokunbo A.O. Makanju, A. Nur Zincir-Heywood, and Evangelos E. Milios. 2009. Clustering event logs using
iterative partitioning. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining (KDD ’09). Association for Computing Machinery, New York, NY, USA, 1255–1264. https://doi.org/10.
1145/1557019.1557154
[41] A. Marzal and E. Vidal. 1993. Computation of normalized edit distance and applications. IEEE Transactions on Pattern
Analysis and Machine Intelligence 15, 9 (Sept. 1993), 926–932. https://doi.org/10.1109/34.232078 Conference Name:
IEEE Transactions on Pattern Analysis and Machine Intelligence.
[42] Maryam Mehrabi, Abdelwahab Hamou-Lhadj, and Hossein Moosavi. 2024. The Effectiveness of Compact Fine-Tuned
LLMs in Log Parsing. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE,
438–448. https://ieeexplore.ieee.org/abstract/document/10795057/
[43] Haibo Mi, Huaimin Wang, Yangfan Zhou, Michael Rung-Tsong Lyu, and Hua Cai. 2013. Toward Fine-Grained,
Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems. IEEE Transactions on Parallel
and Distributed Systems 24, 6 (June 2013), 1245–1255. https://doi.org/10.1109/TPDS.2013.21 Conference Name: IEEE
Transactions on Parallel and Distributed Systems.
[44] Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. Few-shot Fine-tuning vs.
In-context Learning: A Fair Comparison and Evaluation. https://doi.org/10.48550/arXiv.2305.16938 arXiv:2305.16938
[cs].
[45] Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, and Odej Kao. 2021. Self-supervised Log
Parsing. In Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track, Yuxiao Dong, Dunja
Mladenić, and Craig Saunders (Eds.). Springer International Publishing, Cham, 122–138. https://doi.org/10.1007/978-
3-030-67667-4_8
[46] Paolo Notaro, Soroush Haeri, Jorge Cardoso, and Michael Gerndt. 2023. LogRule: Efficient Structured Log Mining
for Root Cause Analysis. IEEE Transactions on Network and Service Management 20, 4 (Dec. 2023), 4231–4243. https:
//doi.org/10.1109/TNSM.2023.3282270 Conference Name: IEEE Transactions on Network and Service Management.
[47] Daniel Olszewski, Allison Lu, Carson Stillman, Kevin Warren, Cole Kitroser, Alejandro Pascual, Divyajyoti Ukirde,
Kevin Butler, and Patrick Traynor. 2023. "Get in Researchers; We’re Measuring Reproducibility": A Reproducibility
Study of Machine Learning Papers in Tier 1 Security Conferences. In Proceedings of the 2023 ACM SIGSAC Conference
on Computer and Communications Security (CCS ’23). Association for Computing Machinery, New York, NY, USA,
3433–3459. https://doi.org/10.1145/3576915.3623130
[48] Yue Pang, Min Zhang, Yanli Liu, Xiangbin Li, Yidi Wang, Yahang Huan, Zhuo Liu, Jin Li, and Danshi Wang. 2024.
Large language model-based optical network log analysis using LLaMA2 with instruction tuning. Journal of Optical
Communications and Networking 16, 11 (Nov. 2024), 1116–1132. https://doi.org/10.1364/JOCN.527874 Conference
Name: Journal of Optical Communications and Networking.
[49] Changhua Pei, Zihan Liu, Jianhui Li, Erhan Zhang, Le Zhang, Haiming Zhang, Wei Chen, Dan Pei, and Gaogang Xie.
2024. Self-Evolutionary Group-wise Log Parsing Based on Large Language Model. In 2024 IEEE 35th International
Symposium on Software Reliability Engineering (ISSRE). IEEE, 49–60. https://ieeexplore.ieee.org/abstract/document/
10771304/
[50] William M. Rand. 1971. Objective Criteria for the Evaluation of Clustering Methods. J. Amer. Statist. Assoc.
66, 336 (Dec. 1971), 846–850. https://doi.org/10.1080/01621459.1971.10482356 Publisher: ASA Website _eprint:
https://www.tandfonline.com/doi/pdf/10.1080/01621459.1971.10482356.
[51] Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, and Mohammed A. Shehab. 2022. An Effective
Approach for Parsing Large Log Files. In 2022 IEEE International Conference on Software Maintenance and Evolution
(ICSME). 1–12. https://doi.org/10.1109/ICSME55016.2022.00009 ISSN: 2576-3148.
[52] Keiichi Shima. 2016. Length Matters: Clustering System Log Messages using Length of Words. https://doi.org/10.
48550/arXiv.1611.03213 arXiv:1611.03213 [cs].
[53] Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, and Gust Verbruggen. 2023. CodeFusion: A
Pre-trained Diffusion Model for Code Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural
Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics,
Singapore, 11697–11708. https://doi.org/10.18653/v1/2023.emnlp-main.716
[54] Yuchen Sun, Yanpiao Chen, Haotian Zhao, and Shan Peng. 2023. Design and Development of a Log Management
System Based on Cloud Native Architecture. In 2023 9th International Conference on Systems and Informatics (ICSAI).
1–6. https://doi.org/10.1109/ICSAI61474.2023.10423328
[55] Yicheng Sun, Jacky Keung, Zhen Yang, Shuo Liu, and Hi Kuen Yu. 2024. Semirald: A Semi-Supervised Hybrid Language
Model for Robust Anomalous Log Detection. https://papers.ssrn.com/abstract=4927951
[56] Yicheng Sun, Jacky Keung, Zhen Yang, Shuo Liu, and Jingyu Zhang. 2024. Advancing Semi-Supervised Anomaly
Detection in Software Logs with Deep Grouping and Auto-Optimization. https://doi.org/10.2139/ssrn.4918203
[57] Shimin Tao, Weibin Meng, Yimeng Cheng, Yichen Zhu, Ying Liu, Chunning Du, Tao Han, Yongpeng Zhao, Xiangguang
Wang, and Hao Yang. 2022. LogStamp: Automatic Online Log Parsing Based on Sequence Labelling. SIGMETRICS
Perform. Eval. Rev. 49, 4 (June 2022), 93–98. https://doi.org/10.1145/3543146.3543168
[58] Ana Trisovic, Matthew K. Lau, Thomas Pasquier, and Mercè Crosas. 2022. A large-scale study on research code quality
and execution. Scientific Data 9, 1 (Feb. 2022), 60. https://doi.org/10.1038/s41597-022-01143-6 Publisher: Nature
Publishing Group.
[59] Risto Vaarandi and Hayretdin Bahsi. 2024. Using Large Language Models for Template Detection from Security Event
Logs. http://arxiv.org/abs/2409.05045 arXiv:2409.05045.
[60] Zehao Wang, Haoxiang Zhang, Tse-Hsun (Peter) Chen, and Shaowei Wang. 2021. Would you like a quick peek?
providing logging support to monitor data processing in big data applications. In Proceedings of the 29th ACM Joint
Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
(ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 516–526. https://doi.org/10.1145/3468264.
3468613
[61] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny
Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Informa-
tion Processing Systems 35 (Dec. 2022), 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/hash/
9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
[62] Yifan Wu, Siyu Yu, and Ying Li. 2024. Log Parsing with Self-Generated In-Context Learning and Self-Correction.
http://arxiv.org/abs/2406.03376 arXiv:2406.03376.
[63] Markus Wurzenberger, Max Landauer, Florian Skopik, and Wolfgang Kastner. 2019. AECID-PG: A Tree-Based Log
Parser Generator To Enable Log Analysis. In 2019 IFIP/IEEE Symposium on Integrated Network and Service Management
(IM). 7–12. https://ieeexplore.ieee.org/abstract/document/8717887 ISSN: 1573-0077.
[64] Yi Xiao, Van-Hoang Le, and Hongyu Zhang. 2024. Demonstration-Free: Towards More Practical Log Parsing with Large
Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE
’24). Association for Computing Machinery, New York, NY, USA, 153–165. https://doi.org/10.1145/3691620.3694994
[65] Andy Xu and Arno Gau. 2024. HELP: Hierarchical Embeddings-based Log Parsing. http://arxiv.org/abs/2408.08300
arXiv:2408.08300.
[66] Junjielong Xu, Ruichun Yang, Yintong Huo, Chengyu Zhang, and Pinjia He. 2024. DivLog: Log Parsing with Prompt
Enhanced In-Context Learning. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
(ICSE ’24). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3597503.3639155
[67] Siyu Yu, Pinjia He, Ningjiang Chen, and Yifan Wu. 2023. Brain: Log Parsing With Bidirectional Parallel Tree. IEEE
Transactions on Services Computing 16, 5 (Sept. 2023), 3224–3237. https://doi.org/10.1109/TSC.2023.3270566 Conference
Name: IEEE Transactions on Services Computing.
[68] Xian Yu, Shengxi Nong, Dongbiao He, Weijie Zheng, Teng Ma, Ning Liu, Jianhui Li, and Gaogang Xie. 2024. LogGenius:
An Unsupervised Log Parsing Framework with Zero-shot Prompt Engineering. In 2024 IEEE International Conference
on Web Services (ICWS). 1321–1328. https://doi.org/10.1109/ICWS62655.2024.00159 ISSN: 2836-3868.
[69] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang,
Fei Wu, and Guoyin Wang. 2024. Instruction Tuning for Large Language Models: A Survey. https://doi.org/10.48550/
arXiv.2308.10792 arXiv:2308.10792 [cs].
[70] Tianzhu Zhang, Han Qiu, Gabriele Castellano, Myriana Rifai, Chung Shue Chen, and Fabio Pianese. 2023. System
Log Parsing: A Survey. IEEE Transactions on Knowledge and Data Engineering 35, 8 (Aug. 2023), 8596–8614. https:
//doi.org/10.1109/TKDE.2022.3222417 Conference Name: IEEE Transactions on Knowledge and Data Engineering.
[71] Wei Zhang, Xianfu Cheng, Yi Zhang, Jian Yang, Hongcheng Guo, Zhoujun Li, Xiaolin Yin, Xiangyuan Guan, Xu Shi,
Liangfan Zheng, and Bo Zhang. 2024. ECLIPSE: Semantic Entropy-LCS for Cross-Lingual Industrial Log Parsing.
http://arxiv.org/abs/2405.13548 arXiv:2405.13548.
[72] Wei Zhang, Hongcheng Guo, Anjie Le, Jian Yang, Jiaheng Liu, and Zhoujun Li. 2025. Lemur: Log Parsing with Entropy
Sampling and Chain-of-Thought Merging. https://doi.org/10.48550/arXiv.2402.18205 arXiv:2402.18205 [cs].
[73] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-shot
Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning. PMLR,
12697–12706. https://proceedings.mlr.press/v139/zhao21c.html ISSN: 2640-3498.
[74] Chen Zhi, Liye Cheng, Meilin Liu, Xinkui Zhao, Yueshen Xu, and Shuiguang Deng. 2024. LLM-powered Zero-shot
Online Log Parsing. In 2024 IEEE International Conference on Web Services (ICWS). 877–887. https://doi.org/10.1109/
ICWS62655.2024.00106 ISSN: 2836-3868.
[75] Aoxiao Zhong, Dengyao Mo, Guiyang Liu, Jinbu Liu, Qingda Lu, Qi Zhou, Jiesheng Wu, Quanzheng Li, and Qingsong
Wen. 2024. LogParser-LLM: Advancing Efficient Log Parsing with Large Language Models. In Proceedings of the 30th
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24). Association for Computing Machinery,
New York, NY, USA, 4559–4570. https://doi.org/10.1145/3637528.3671810
[76] Yihan Zhou, Yan Chen, Xuanming Rao, Yukang Zhou, Yuxin Li, and Chao Hu. 2024. Leveraging Large Language
Models and BERT for Log Parsing and Anomaly Detection. Mathematics 12, 17 (Sept. 2024), 2758. https://doi.org/
10.3390/math12172758 Num Pages: 20 Place: Basel Publisher: MDPI Web of Science ID: WOS:001310957300001.
[77] Jieming Zhu, Shilin He, Pinjia He, Jinyang Liu, and Michael R. Lyu. 2023. Loghub: A Large Collection of System Log
Datasets for AI-driven Log Analytics. In 2023 IEEE 34th International Symposium on Software Reliability Engineering
(ISSRE). 355–366. https://doi.org/10.1109/ISSRE59848.2023.00071 ISSN: 2332-6549.
[78] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, and Michael R. Lyu. 2019. Tools and Benchmarks for
Automated Log Parsing. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering
in Practice (ICSE-SEIP). 121–130. https://doi.org/10.1109/ICSE-SEIP.2019.00021