SoK: LLM-Based Log Parsing
Viktor Beck, Max Landauer, Markus Wurzenberger, Florian Skopik, Andreas Rauber
1 Introduction
Log data consists of recorded information generated by logging statements in software applications,
systems, devices, or networks during operation. It provides a chronological record of events and
contains valuable insights for various tasks, such as monitoring [6, 60], root cause analysis [19, 46],
and anomaly detection [20, 29]. Given that modern computer systems can generate millions of log
lines per hour, manual processing is infeasible, necessitating fast and scalable automated approaches.
For instance, in 2013 the Alibaba Cloud system was reported to produce approximately 100 to 200
million log lines per hour [43]. However, extracting meaningful information from raw log data
requires transforming it into a structured format. Since logs are typically semi-structured, following
general patterns but containing variable log messages, direct analysis is challenging. To enable log
analytics tasks like anomaly detection, logs must first be parsed into a structured representation,
making log parsing a critical step in log analysis [63].
Log parsing refers to the extraction of structured data from semi-structured logs. Parsing logs
into a structured format can simply be done by regular expression matching if the correct
log templates are given. Most methods therefore formulate log parsing as a log template extraction
problem [18, 23]. Ground-truth templates are defined in logging statements in the source
code, which is typically not accessible, so the templates are often not available. A wide range of
log parsing techniques [12, 25, 33, 51, 52] have therefore been proposed to overcome this issue.
Authors’ Contact Information: Viktor Beck, AIT Austrian Institute of Technology, Vienna, Austria, [email protected];
Max Landauer, AIT Austrian Institute of Technology, Vienna, Austria, [email protected]; Markus Wurzenberger, AIT
Austrian Institute of Technology, Vienna, Austria, [email protected]; Florian Skopik, AIT Austrian Institute of
Technology, Vienna, Austria, [email protected]; Andreas Rauber, Vienna University of Technology, Vienna, Austria,
[email protected].
Many state-of-the-art log parsing methods require manual configuration or other manual actions
at some point in the parsing process. Approaches may involve labeling of some of the logs from
the training set [23, 33, 66], definition of log formats [35] or regular expressions for preprocessing
[18, 51, 67], or other parameters [67] that fit the parsing algorithm to the data. From a user’s
perspective the configuration of parsing algorithms might seem overwhelming since it requires
expertise or an analysis of the data at hand. The reason log parsing methods require manual actions
is that humans have both semantic and syntactic understanding [23], but more importantly, a
natural generalization capability and potentially expert knowledge about logs and log parsing. Many
log parsers have already achieved semantic [23, 33, 37] and syntactic abilities [9, 18, 67]. While
syntax-based parsers focus on characteristics like occurrence frequencies, or word or log length,
to determine the constant part of logs, semantic-based parsers leverage the semantic differences
between static and variable parts of log messages [23, 64]. Knowledge about logs and how to parse
them can be learned by language models, such as BERT [11] or its descendants, by fine-tuning
them with logs and their templates [33, 57]. However, these methods often lack generalization and
have to be trained or fine-tuned with labeled logs, preferably with logs of the exact same log type
as the target logs [38]. Generalization arises from pre-training on a large corpus of diverse datasets,
allowing models to perform well on novel tasks; this capability can be further enhanced through fine-tuning
(FT) [16] or in-context learning (ICL) [5]. This is where large language models (LLMs) come into
play.
With the emergence of LLMs, the new research field of LLM-based log parsing arose. In late
2023, Le and Zhang [32] reported notable performance of ChatGPT1 in a naive setting, where the
LLM was presented with individual logs and simply asked for their templates. In the meantime, other
approaches have adopted LLMs for this purpose as well, building sophisticated frameworks that utilize
LLMs in various ways to either parse logs directly or enhance log parsing by adopting learning
paradigms for LLMs such as ICL [5] or fine-tuning [16]. Given that the number of unique templates
is significantly lower than the number of logs and that the average rate of newly discovered
templates decreases as more logs are processed [23], some LLM-based parsers [23] are even able to
achieve runtime performance comparable to that of the fast and well-known parser Drain [18].
To the best of our knowledge, there is an absence of a structured overview and an objective
comparison of papers and methods concerning LLM-based log parsing, as this is a rather novel
research field which only recently emerged from the popular research field of generative LLMs. So
far there is merely a survey paper about LLM-based log analysis techniques by Akhtar et al. [1],
which provides a rudimentary overview on using LLMs for log parsing and the techniques employed.
Consequently, this work aims to undertake a systematic review of the literature to identify the
commonalities and variabilities of the existing approaches and to present our gained insights. Our
main focus is an overview of LLM-based log parsing with emphasis on the users’ perspective — thus,
how the LLM is applied and how much manual effort the approaches require. The aim is therefore
to provide a comprehensive overview of the existing approaches and techniques, enabling anyone
interested in (LLM-based) log parsing to quickly and easily identify the most suitable solution for
their specific use case.
The aforementioned goals are formulated as the following research questions:
• RQ1: What are the main advantages and disadvantages of LLM-based log parsing approaches
and non-LLM-based approaches?
• RQ2: To what extent do LLM-based log parsing approaches rely on labeled data and manual
configuration?
• RQ3: Which techniques can enhance the efficiency or effectiveness of LLM-based log
parsing?
• RQ4: What experiment design choices influence the comparability of the evaluation results
between LLM-based log parsing approaches?
• RQ5: To what extent are the presented LLM-based parsing methods accessible and repro-
ducible?
We summarize the main contributions of the present work as follows:
• We provide a systematic overview of the existing methods, based on a set of feature defini-
tions that characterize LLM-based log parsing approaches, or the corresponding papers,
respectively.
• We derive the general process of LLM-based log parsing, encompassing each parsing ap-
proach of each reviewed paper in a single flow chart.
• We provide a comprehensive benchmark of seven open-source LLM-based log parsing
approaches on open-source datasets.
• Based on the literature review and our evaluation results, we answer five research questions
and derive further findings that may benefit future research and the design of log parsers.
• Finally, we make the source code and all the results of our evaluation publicly available in
our GitHub repository2 to ensure transparency and reproducibility.
The remainder is structured as follows: Section 2 describes the background and the core concepts
and how we understand them. Section 3 outlines the survey method, including the search strategy
and the definition of the reviewed features. Section 4 presents the results of the literature review in
a large table and describes the findings of each feature. Section 5 explains the evaluation setup,
while Sec. 6 presents the evaluation results. We discuss the findings from literature and evaluation
in Sec. 7 and conclude our paper in Sec. 8.
content, contextual metadata and more [35]. Previous research distinguishes between the unstructured
log message part of a log and the structured log headers [17]. The structured part is
straightforward to extract with regular expressions, and it has been shown that parsing only the log
message part improves the parsing accuracy [17]. Many methods [12, 18, 66] therefore require
the log format of a dataset as an input parameter. An example of a log template from the Apache
dataset [77] is given in Fig. 1. The log format parameter for this example would be “[<Time>]
[<Level>] <Content>”, where “Time” and “Level” refer to log headers and “Content” refers
to the log message. In the remainder, we understand “log format” as this parameter defining the
positions of the log headers and of the content within different log types, or different datasets,
respectively.
Source Code
logging.info(f"workerEnv.init() ok {file_path}")
logging.error(f"mod_jk child workerEnv in error state {error_state}")
logging.info(f"jk2_init() Found child {child} in scoreboard slot {slot}")
Logs
[Sun Dec 04 04:51:52 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Sun Dec 04 04:51:55 2005] [error] mod_jk child workerEnv in error state 6
[Sun Dec 04 04:52:04 2005] [notice] jk2_init() Found child 6738 in scoreboard slot 6
Parsed Logs
Time | Level | Template | Parameters
Sun Dec 04 04:51:52 2005 | notice | workerEnv.init() ok <*> | [/etc/httpd/conf/workers2.properties]
Sun Dec 04 04:51:55 2005 | error | mod_jk child workerEnv in error state <*> | [6]
Sun Dec 04 04:52:04 2005 | notice | jk2_init() Found child <*> in scoreboard slot <*> | [6738, 6]
Fig. 1. A simple example of log parsing, from logging statement to log file to parsed log.
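To make the role of the log format parameter concrete, the following minimal sketch (our own illustration in Python; the helper name format_to_regex is hypothetical) converts such a format string into a regular expression that splits a raw log line into its headers and the content:

import re

def format_to_regex(log_format: str) -> re.Pattern:
    # Turn a format string such as "[<Time>] [<Level>] <Content>" into a regex
    # with one named capture group per field; separators are matched literally.
    parts = re.split(r"(<[^<>]+>)", log_format)
    pattern = ""
    for part in parts:
        if re.fullmatch(r"<[^<>]+>", part):
            pattern += f"(?P<{part[1:-1]}>.*?)"          # header or content field
        else:
            pattern += re.escape(part).replace("\\ ", " ").replace(" ", r"\s+")
    return re.compile("^" + pattern + "$")

regex = format_to_regex("[<Time>] [<Level>] <Content>")
line = "[Sun Dec 04 04:51:52 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties"
fields = regex.match(line).groupdict()
print(fields["Time"], fields["Level"], fields["Content"], sep=" | ")

Parsers in the LogPAI toolkit apply a similar header-splitting step before template extraction.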
superior parsing performance to state-of-the-art approaches. However, fine-tuning LLMs for log
parsing is resource-intensive in terms of computational costs, runtime, as well as labeled examples,
making ICL a compelling alternative [64]. Research shows that ICL enhances LLMs’ performance
in logical reasoning and fact retrieval, suggesting its viability for structured log analysis [5].
Despite these advantages, integrating LLMs into log parsing presents challenges. These models
are not inherently designed for log parsing, leading to inconsistencies in token classification and
errors in log templates. Such inaccuracies can negatively impact downstream tasks like anomaly
detection [78]. Additionally, given the vast volume of real-world log data, efficient processing is
crucial. The computational demands of LLMs contribute to inference overhead and network latency,
necessitating optimizations for practical deployment [23].
3 Survey Method
3.1 Search Strategy
This section describes our search strategy and the inclusion and exclusion criteria
derived from the core concepts of Sec. 2 and early insights from LLM-based log parsing approaches.
For the keyword search, we determined the following keywords:
(1) log(s) (in title);
(2) LLM(s), large-language-model(s), or large language model(s);
(3) parsing, parser(s), or parse.
The term "log" or "logs" must be included in the publication title, while the remaining keywords
are entered into the default search engine of the respective database. This approach should ensure
that the search results are focused on publications where log data is the central aspect. The search
statistics are given in Table 1.
The numbers of search results per database sum up to 176, excluding duplicates found on multiple
databases. The results were then filtered by exclusion criteria. A publication is excluded if,
• It is not written in English.
• It is not accessible in electronic format.
• It is a book, technical report, lecture note, presentation, patent, thesis, or dissertation.
• A more recent publication is available that presents the same study (by the same authors),
ensuring that this SoK focuses on the latest versions of specific approaches.
• It only applies an existing approach without introducing any novelties to LLM-based log
parsing.
• It does not apply log parsing or LLMs by definitions of Sec. 2.
• It is not accessible with our institutions’ licenses.
By applying these exclusion criteria on both abstract and full text, we omitted 147 of the initial
publications found: 1 publication was omitted because it was not written in English. 8 publications
were omitted due to being completely unrelated to the topic. 110 are related but excluded, as they do
not directly cover LLM-supported log data parsing by the definitions of Sec. 2 or are about other log
analytics tasks, but employ parsing-free methods or existing conventional parsers, such as Drain
[18] or SPELL [12]. 21 were omitted because they were books, theses, dissertations, or technical
reports. 2 were excluded for not being available freely with any license from our institutions. 5
were outdated versions of newer papers.
The final selection consists of 29 papers. One publication is from the year 2023, one from 2025,
while the remaining are all from 2024. Since there is a recent hype around the usage of LLMs for
log-related issues, we include pre-print papers to capture the newest research. There are 10 preprint
papers in the final selection (∼ 34%), all from 2024.
Table 1. Keyword search results (status 29-01-2025). “#R” and “#I” stand for the number of results and the
number of papers that were still included after all exclusion criteria were applied.
GP-1 — Supervision. In log parsing a labeled log is one for which a template, acting as the ground
truth for that log, is available. We differentiate approaches based on the requirement for labels:
• Supervised parsing (sup) requires at least some log instances of the training set to be labeled.
• Unsupervised parsing (un) does not require labels.
GP-2 — Parsing Mode. Many methods are described as online approaches, but their interpretations
vary. Some consider online processing to be incremental processing or streaming, while others
apply batch-wise processing yet still label their method as online. Additionally, some works do
not specify whether their approach is online or offline. Our initial intention was to classify these
methods accordingly, but this is often not feasible due to a lack of context or sufficient explanations.
As a consequence, we devised the following categories that describe how many logs are processed
at once:
• Stream (S): The logs are processed one-by-one.
• Batch (B): Multiple logs are processed at once, whereby it is possible to apply local (within
each batch) optimizations. A batch is significantly smaller than the whole dataset.
• Total (T): The entire dataset is processed at once, whereby it is possible to apply global
optimizations to the process.
GP-3 — Learning / Prompting. We report four different types of learning or prompt engineering
techniques. The type of learning can strongly influence how the prompt is designed which is why
we report this in a single feature:
• In-context learning (ZS/FS/S): ICL [5] leverages the context provided within a single LLM
call to generate responses, adapting to specific needs without any updates to the model’s
parameters. Thereby, we differentiate between a zero-shot setting (ZS), where the model
performs a task based solely on an instruction in the prompt, and few-shot ICL (FS), where
a small number of demonstrations is provided in the prompt to guide the model’s behavior.
Demonstrations can be retrieved randomly from the dataset or with sophisticated strategies.
They can also be static (S) which we report separately.
• Chain of Thought (CoT): CoT [61] refers to a series of subsequent LLM calls, where the
(complex) task is broken down into (easier) subtasks or “thoughts” by the LLM itself and
answered in a step-by-step manner, rather than jumping to the solution directly. The process
also enhances transparency since intermediate steps can be monitored by users.
• Fine-tuning (FT): In addition to ICL and CoT, which are solely modifications of the prompt,
fine-tuning [16] modifies the parameters of the LLM, but mostly affects only the last few
layers of the neural network, which are usually the most decisive ones for the outcome.
• Pre-training (PT): Pre-training refers to the initial phase of training the model on a large
corpus of text data to learn the structure and patterns of language.
PS-1 — Manual Configuration. This is true if at least one manual configuration step is required
(e.g., when a parser is applied to an unseen log type). This includes manual definition of input
parameters such as regular expressions, log formats or other essential parameters for different
datasets, which is often required for many conventional log parsers such as Drain [18], Brain [67],
or Spell [12], and others featured in the LogPAI log parser repository3 [78]. This feature does not
include optional parameters that are generic enough to be left unchanged for a new log type.
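For illustration, the per-dataset configuration expected by conventional parsers often looks like the following sketch, modeled loosely on the benchmark settings shipped with the LogPAI repository; the parameter values shown here are indicative only:

# Illustrative per-dataset settings in the style of the LogPAI benchmark for Drain;
# names and values are only indicative and must be adapted to each log type.
benchmark_settings = {
    "Apache": {
        "log_format": "[<Time>] [<Level>] <Content>",      # header/content positions
        "regex": [r"(\d+\.){3}\d+"],                        # preprocessing: mask IP addresses
        "st": 0.5,                                          # similarity threshold
        "depth": 4,                                         # parse-tree depth
    },
    "HDFS": {
        "log_format": "<Date> <Time> <Pid> <Level> <Component>: <Content>",
        "regex": [r"blk_-?\d+", r"(\d+\.){3}\d+(:\d+)?"],   # block IDs, IP:port
        "st": 0.5,
        "depth": 4,
    },
}

Exactly this kind of dataset-specific knowledge is what many LLM-based approaches aim to reduce.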
PS-2 — Retrieval-Augmented Generation. Retrieval-augmented generation, or RAG, is a paradigm
where the LLM is provided with information retrieval capabilities. In case of log parsing, most
approaches utilize a sampling strategy to include either logs or logs and their templates in the
prompt. The LLM should then use the provided context to learn the variability and commonality of
logs [35] or learn parsing directly from log-template pairs. Many approaches create clusters, trees,
buckets or other kinds of aggregations from which they sample their demonstrations for ICL and
retrieve them based on some kind of similarity measure. We differentiate two cases:
3 https://github.com/logpai/logparser; accessed 9-December-2024
• Random retrieval (R): The process is a random retrieval of demonstrations from training
data.
• Strategic retrieval (S): There is a refined strategy for retrieving demonstrations from specific
data storages (e.g., clusters, lists, etc.).
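As a minimal sketch of strategic retrieval, assuming a simple token-overlap similarity instead of the clustering- or tree-based strategies of the reviewed approaches, demonstrations could be selected as follows:

def retrieve_demonstrations(target_log, labeled_pool, k=3):
    # Return the k (log, template) pairs most similar to the target log.
    # Jaccard similarity over whitespace tokens is a stand-in for the
    # more refined retrieval strategies used by the reviewed parsers.
    target_tokens = set(target_log.split())
    def similarity(log):
        tokens = set(log.split())
        return len(target_tokens & tokens) / max(len(target_tokens | tokens), 1)
    return sorted(labeled_pool, key=lambda pair: similarity(pair[0]), reverse=True)[:k]

pool = [
    ("workerEnv.init() ok /etc/a.properties", "workerEnv.init() ok <*>"),
    ("mod_jk child workerEnv in error state 6", "mod_jk child workerEnv in error state <*>"),
    ("jk2_init() Found child 6738 in scoreboard slot 6", "jk2_init() Found child <*> in scoreboard slot <*>"),
]
print(retrieve_demonstrations("mod_jk child workerEnv in error state 7", pool, k=2))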
PS-3 — Caching. Some approaches increase their efficiency by storing parsed logs’ templates in
some kind of data structure and only call the LLM if a subsequent log does not fit any of the existing
templates. These data structures are mostly tree-like structures or similarity-based clusters.
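A minimal caching sketch, which is our own illustration rather than any specific parser's implementation: cached templates are compiled to regular expressions, and the LLM is queried only when no cached template matches the incoming log.

import re

class TemplateCache:
    def __init__(self, llm_parse):
        self.llm_parse = llm_parse      # callable log -> template, e.g. a wrapped LLM query
        self.templates = []             # list of (template, compiled regex)

    def _to_regex(self, template):
        # Escape literal parts and turn the <*> placeholder into a wildcard.
        return re.compile("^" + re.escape(template).replace(re.escape("<*>"), ".*?") + "$")

    def parse(self, log):
        for template, regex in self.templates:
            if regex.match(log):        # cache hit: no LLM call needed
                return template
        template = self.llm_parse(log)  # cache miss: ask the LLM once
        self.templates.append((template, self._to_regex(template)))
        return template

cache = TemplateCache(llm_parse=lambda log: "jk2_init() Found child <*> in scoreboard slot <*>")
print(cache.parse("jk2_init() Found child 6738 in scoreboard slot 6"))  # triggers the LLM
print(cache.parse("jk2_init() Found child 6739 in scoreboard slot 7"))  # served from the cache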
PS-4 — LLM Usage. The employed LLM is used in at least one of three different ways:
• Direct parsing (dir): The LLM receives one or more logs directly in the prompt and is asked
for the corresponding template.
• Post-processor (post): The LLM is used to post-process the log lines, which mostly involves merging
similar templates or correcting templates based on new information obtained from subsequent
log lines in an online approach.
• Preprocessor (pre): The LLM functions as a helper in preprocessing, for example, for identi-
fying the timestamp, the log format, or other relevant features.
PS-5 — Template Revision. This feature describes whether the templates are revised in a post-
processing step, such as merging similar templates or correcting templates, based on the information
obtained from new logs (in an incremental approach). This step can be done with or without the
help of LLMs.
R-1 — Evaluation Data. Since the performance of parsers is strongly dependent on the data, we
report the different types of datasets or dataset collections used in the papers:
• LogHub (L): The well-known LogHub [77, 78] repository is widely used for evaluating log
analysis methods. It features 16 annotated datasets from different domains with 2000 logs
each.
• Corrected LogHub (CL): This is a version of LogHub with corrected templates. The original
templates have been observed to have inconsistencies due to inconsistent labeling styles
[27].
• LogHub-2.0 (L2): LogHub-2.0 [24] is based on LogHub and features 14 large-scale datasets
from various domains with numbers of logs in the order of 10^4 to 10^7.
• Custom (CA/CU): If the parser was evaluated with custom data we report whether it is
publicly available (CA) or unavailable (CU).
R-2 — Evaluation Metrics. We report the various evaluation metrics employed to assess the effective-
ness of log parsing techniques. Each metric provides a unique perspective on the parsing process,
focusing on different aspects of effectiveness:
• Group Accuracy (GA): A log message is considered correctly parsed if and only if its event
template corresponds to the same group of log messages as the ground truth does. GA
is then the number of correctly parsed log messages divided by the total number of log
messages [78]. GA is also known as RandIndex [50].
• F1-score of Group Accuracy (FGA): N_p is the number of templates that are generated by a
log parser, and N_c the number of templates that are correctly parsed by the log parser. N_g is
the actual correct number of templates in the ground truth. The precision of GA (PGA)
is then N_c / N_p and the recall of GA (RGA) is N_c / N_g. Then, FGA is the harmonic mean of
PGA and RGA [24].
• Parsing Accuracy (PA): A log is considered correctly parsed if and only if all its static text
and dynamic variables are correctly identified. PA is then the number of correctly parsed log
messages divided by the total number of log messages [9]. PA is the same as message-level
accuracy (MLA) [37, 66].
• Edit Distance (ED): ED is the minimum number of operations, such as insertions, deletions,
or substitutions, needed to convert one string into the other [45].
• Precision Template Accuracy (PTA): A template is correctly parsed from log messages if and
only if it is identical (token-by-token) to the ground-truth template(s) of the log messages.
PTA is the ratio of correctly identified templates to the total number of identified templates
[27].
• Recall Template Accuracy (RTA): Complementary to PTA, RTA is the ratio of correctly
identified templates to the total number of ground-truth templates [27].
• F1-score of Template Accuracy (FTA): FTA is the harmonic mean of PTA and RTA [24].
• Other: If metrics other than the above are used we write “other”.
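To make the grouping-based metrics concrete, the following sketch computes GA and FGA from per-log ground-truth and predicted templates; it reflects our simplified reading of the definitions above, and published benchmark implementations may differ in edge cases.

from collections import defaultdict

def _groups(labels):
    groups = defaultdict(set)
    for i, label in enumerate(labels):
        groups[label].add(i)
    return list(groups.values())

def group_accuracy(gt, pred):
    # GA: a log counts as correct iff its predicted group covers exactly the
    # same set of log indices as its ground-truth group.
    gt_groups, pred_groups = _groups(gt), _groups(pred)
    correct = sum(len(g) for g in pred_groups if g in gt_groups)
    return correct / len(gt)

def f1_group_accuracy(gt, pred):
    # FGA: harmonic mean of PGA (= N_c / N_p) and RGA (= N_c / N_g).
    gt_groups, pred_groups = _groups(gt), _groups(pred)
    n_c = sum(1 for g in pred_groups if g in gt_groups)
    if n_c == 0:
        return 0.0
    pga, rga = n_c / len(pred_groups), n_c / len(gt_groups)
    return 2 * pga * rga / (pga + rga)

gt = ["A", "A", "B", "B", "C"]
pred = ["t1", "t1", "t2", "t3", "t4"]        # group B is wrongly split into t2 and t3
print(group_accuracy(gt, pred), f1_group_accuracy(gt, pred))   # 0.6 and ~0.57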
R-3 — Used Models. A variety of different LLMs are used in the analyzed works. We report only the
base models, or the name of the model series, of the used LLMs and only the ones used for the task
of log parsing.
R-4 — Code Availability. For a potential user, it might be essential that a parser is already imple-
mented, thus we report the availability of code repositories. This feature is true if a link was provided
to the implementation and if the link does not lead to an empty repository or a non-existent page,
and false otherwise.
R-5 — Preprint. This is true if the reviewed paper is a preprint paper. It is possible that the corre-
sponding code repository to the paper, the results, findings, or interpretations are only preliminary,
which is why we report this feature.
Table 2. Overview of the reviewed approaches along the features defined in Sec. 3, grouped into General Properties (GP-1 Supervision, GP-2 Processing mode, GP-3 Learning / Prompting), Processing Steps (PS-1 Manual config., PS-2 RAG, PS-3 Caching, PS-4 LLM usage, PS-5 Template corr.), and Reproducibility (R-1 Datasets, R-2 Metrics, R-3 Models, R-4 Code availability, R-5 Preprint).
Approach GP-1 GP-2 GP-3 PS-1 PS-2 PS-3 PS-4 PS-5 R-1 R-2 R-3 R-4 R-5
Astekin et al. [2] un, sup S ZS, S dir CL GA, ED, other GPT, Claude, Llama
Astekin et al. [3] un, sup S S dir CL other GPT, Claude, Llama,
Mistral
Cui et al. [8] un S ZS ? dir L PA, ED GPT, Claude, Llama, ✓ ✓
(LogEval) Gemini, InternLM,
Qwen, AquilaChat,
Mistral, Baichuan,
DevOps, ChatGLM
Fariha et al. [14] un ? FS? pre L other GPT
Huang et al. [21] un B ZS S ✓ dir L2 GA, FGA, PA, GPT ? ✓
(LUNAR) FTA
Ji et al. [22] sup ? PT, ZS? dir L GA, other Llama, GPT, Qwen, ✓ ✓
(SuperLog) OWL
Jiang et al. [23] sup S FS ✓ S ✓ dir ✓ L2 GA, FGA, PA, GPT ✓
(LILAC) FTA
Karanjai et al. [26] un, sup S FT; FS S ✓ dir ✓ L2 GA, FGA, PTA, GPT ✓
(LogBabylon) PA, RTA, FTA
Le et al. [32] un, sup S ZS, FS R dir CL GA, PA, ED GPT
Liu et al. [36] un B ZS ✓ dir L other GPT, Vicuna ✓
(LogPrompt)
Ma et al. [38] sup S FT; ZS, FS ✓ ? dir CL GA, PA Llama, ChatGLM ✓
(LLMParser)
Ma et al. [39] un B FS ✓ S ✓ dir ✓ L2 GA, PA Llama, Mistral, ✓ ✓
(OpenLogParser) Gemma, ChatGLM,
Mehrabi et al. [42] sup S FT; ZS, S dir CA PA, ED, FGA Mistral, GPT ✓
Pang et al. [48] un, sup S FT; ZS dir CU other Llama
(ONLA-LLM)
Pei et al. [49] un, sup B FS ✓ S ✓ dir ✓ L GA, PA, PTA, GPT ✓
(SelfLog) RTA
Sun et al. [54] (Loki) un S ZS ✓ dir
Sun et al. [55] un, sup S FS, CoT ? dir L PA GPT ✓ ✓
(Semirald)
Sun et al. [56] sup S FS, CoT ? dir L GA, PA GPT ? ✓
(SemiSMAC-<T>)
Vaarandi et al. [59] un B S ✓ dir ✓ CA other OpenChat, Mistral, ✓ ✓
(LLM-TD) WizardLM
Wu et al. [62] un S FS ✓ S ✓ dir, post ✓ L, L2 GA, FGA, PA, GPT, Gemini, ✓
(AdaParser) FTA Claude, DeepSeek,
Qwen
Xiao et al. [64] un B ZS ✓ S ✓ dir ✓ L2, CL GA, PA, ED GPT ✓
(LogBatcher)
Xu et al. [66] sup S FS ✓ S dir L PA, PTA, RTA GPT ✓
(DivLog)
Xu et al. [65] un, sup B ZS, S; CoT ✓ dir ✓ L2 GA, FGA, PA, Claude ✓
(HELP) FTA
Yu et al. [68] un T? ZS ✓ pre L, CA PA GPT, Gemma ✓
(LogGenius)
Zhang et al. [71] un S ZS ✓ ? ✓ pre ✓ L, CU GA, FGA GPT ✓
(ECLIPSE)
Zhang et al. [72] un T FS, CoT ✓ ✓ post ✓ L GA, FGA GPT ✓ ✓
(Lemur)
Zhi et al. [74] un S ZS ✓ ✓ dir ✓ CL GA, PA, ED GPT
(YALP)
Zhou et al. [76] un S ZS, FS/S? ? dir L GA, PA, ED GPT
Zhong et al. [75] un, sup S FT; ZS, FS S ✓ dir, post ✓ L, L2 GA, PA, FGA, GPT, Llama
(LogParser-LLM) FTA, other
algorithm based on hierarchical clustering while DivLog uses the Determinantal Point Process
(DPP) [28]. It is noteworthy that the unsupervised parser LogBatcher [64] also uses DPP, but to
maximize the sample diversity within the batches of logs that are prompted to the LLM for direct
parsing.

Fig. 2. LLM-based parsing process pipeline. The dashed arrows and boxes represent optional components
while the continuous ones represent essential components. The arrows pointing from the LLM to certain
process elements describe where the LLM can be applied.
Huang et al. (LUNAR) [21] empirically study the average FTA of LILAC [23] and the BERT-based
parser LogPPT [33] based on the proportion of labeled logs they receive as input. It was found that
FTA exhibited a substantial decline in performance when the proportion of labels was reduced to
5%, which, in general, remains a relatively high figure, particularly in light of the typically abundant
nature of logs. The finding that the more templates are available, the better the parsers' performance,
is hardly surprising [21, 23]. However, labeled logs are scarce and require substantial manual effort
and expertise [64]. All supervised parsers of the selection can technically be operated unsupervised
at the cost of performance. This may require modifying the prompt so that it corresponds to a
zero-shot setting, so as not to confuse the LLM with announced but absent examples.
4.4.2 Chain of Thought. Lemur [72] uses CoT [61] to revise similar templates that may belong to
the same event in three dialogue rounds. These steps include revising the structure, the semantics
and finally, a decision if the templates should be merged. The other works applying CoT [55, 56, 65]
use it for direct parsing but do not explain in detail how.
4.4.3 Fine-Tuning and Pre-Training. Compared to ICL, fine-tuning is considered a rather computationally
costly task [64] since it modifies the parameters of the model itself. Previous research [44]
found that fine-tuning generally outperforms ICL in effectiveness. From our selection, the works
of Ma et al. (LLMParser) [38] and Mehrabi et al. [42] support this finding. However, fine-tuning
also requires labeled logs. As the default setting for the training, [42] used 80% of the available
templates and 15 logs per template, ONLA-LLM [48] used 80% of the labeled logs of their custom
dataset, LLMParser [38] used 50 labeled logs per dataset, and [75] used 32 per dataset. Furthermore,
Mehrabi et al. [42] and Ma et al. [38] find that their fine-tuned LLMs struggle with new and unseen
log types in terms of effectiveness and robustness. This raises the question of whether this is due to
overfitting, but this is a task for future research. It remains up to the users to decide whether a costly
training phase, requiring considerable quantities of labeled logs, but with improved performance
on seen logs, is feasible for their use case.
From our selection only Ji et al. (SuperLog) [22] pre-trained a model. Pre-training requires vast
amounts of data. To this end, Ji et al. created NLPLog, a comprehensive dataset with a quarter
million question-answer pairs on log analysis tasks. The researchers report superior performance
not only in parsing but also anomaly detection, failure diagnosis, and log interpretation. They also
conducted an evaluation on unseen logs, but they do not report any of the common parsing metrics
from Sec. 3.2. This hinders the determination of the practical benefits in settings involving unseen
logs. Future research could be conducted on this aspect.
the LLM in the parsing process makes parsers more generic, thus reducing the need for human
effort.
effectively used for sampling similar logs or relevant parsing demonstrations for ICL with RAG.
Caching is not bound to the use of LLMs and can be built into any other parsing approach.
• Instruction: Instructs the LLM to parse the log(s) and how. The instruction may also
contain rules, such as replacement rules for the timestamp or IP address [56] or special
treatment rules for logs concerning exceptions, errors or similar.
• Context explanations (optional): This part contains explanations about logs or log
parsing. For instance, Zhong et al. [75] state that they improve the LLM's capabilities in variable
identification by including explanations about variable categories as outlined in [35]. Other
approaches [2, 21, 55, 56, 65, 75] also provide information about variable types in the prompt.
The categorization of the variables can also be beneficial for downstream tasks and the
interpretability of the results [35, 75].
• Context examples (optional): This part either contains multiple logs to illustrate their
variabilities and commonalities or logs and their templates as parsing demonstrations. This
part can be static, or dynamic using RAG. Providing examples standardizes the LLM’s output
format, leading to more stable results [39].
• Output constraints (optional): This part instructs the LLM how to format the output.
Some approaches enforce JSON format, but most approaches determine one or two markers
(e.g., often backticks or something like /START and /END) so that in a post-processing step
the generated template can be extracted easily with regex after the marker or between
markers. This improves output quality, because generative language models sometimes
generate unwanted output.
• Target log(s): Contains the log(s) to be parsed. The approaches [21, 39, 49, 59, 64, 65] provide
multiple target logs at once in the query, while the rest provide single logs. Providing multiple
logs at once can help the LLM in abstracting variable and static parts from the logs [49].
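A minimal sketch of how these parts are typically assembled into a single prompt; the wording, markers, and helper name are our own and not taken from any specific approach:

def build_parsing_prompt(target_logs, demonstrations=(), variable_hints=""):
    # Assemble an ICL prompt from the parts described above (illustrative only).
    parts = [
        # Instruction
        "You are a log parsing assistant. Abstract the dynamic variables in each "
        "log with the placeholder <*> and return the static template.",
    ]
    if variable_hints:                        # optional context explanations
        parts.append("Variable hints: " + variable_hints)
    for log, template in demonstrations:      # optional context examples (static or via RAG)
        parts.append(f"Log: {log}\nTemplate: `{template}`")
    # Output constraints: markers make the template easy to extract afterwards
    parts.append("Return each template enclosed in backticks.")
    parts.append("Logs to parse:\n" + "\n".join(target_logs))   # target log(s)
    return "\n\n".join(parts)

print(build_parsing_prompt(
    ["mod_jk child workerEnv in error state 7"],
    demonstrations=[("mod_jk child workerEnv in error state 6",
                     "mod_jk child workerEnv in error state <*>")],
    variable_hints="timestamps, IP addresses, file paths, and numeric IDs are variables",
))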
Le et al. [32] find that with a simple prompt, where no context is provided nor a detailed
instruction is given, GPT-3.5 is hardly able to understand the concept of log parsing. Sun et al. [55]
achieve a significantly higher PA (20% to 60%) with a prompt that contains extraction rules for
direct parsing (on HDFS and BGL with GPT-3.5 and GPT-4) than without extraction rules.
use ED, 9 publications use at least one metric that is insensitive to class imbalance (FGA, FTA), 8
publications use metrics other than the traditional ones.
The metrics cover different characteristics of correctness. It is therefore important for evaluating
parsers to cover the relevant evaluation aspects with these metrics. For instance, Jiang et al. [24]
proposed using FGA and FTA since they are insensitive to class imbalance, while Khan et al. [27]
recommend using GA, PA, RTA and PTA to cover all aspects (FTA is the harmonic mean of RTA
and PTA). For example, the anomaly detection tool DeepLog [13] detects anomalies solely based
on the event ID. Each event ID corresponds to a unique template. Consequently, it is irrelevant whether
the parsed templates are correct at the template level as long as they are grouped together correctly,
which is indicated by a high score for GA. Since detection algorithms can focus on a variety of data
characteristics [4], it is important to report metrics that cover these characteristics.
They found that about 59% of papers did not provide any reproducible artifacts. In our study, 45%
of papers do not provide any reproducible artifacts.
4.13.1 Code Quality. A large proportion of the approaches do not provide code and therefore
cannot be used for the evaluation in Sec. 6. Even with available code, some do not provide the
full functionality to replicate the processes described in their papers, and comprehensively
correcting the code of others is out of scope for this work. At the time of review (February
14th, 2025) the code of Lemur4 [72] lacks a script for selecting the previously parsed templates for
the subsequent CoT template merging process. The code of LogGenius5 [68] runs into multiple
errors due to missing folders and files, while there is no README file providing instructions. The
LogEval repository6 [8] provides neither the scripts for parsing nor instructions. LogPrompt7
[36] encounters multiple errors due to simple typos. Furthermore, they describe three different
prompt strategies, but it is not clear which they used for their result nor do they explain it in the
instruction file. Moreover, they do not provide the functionality to run the parsing process except
for the self-prompt setting. SelfLog [49] proposes a self-evolutionary approach, in which they
“retrieve the most similar historical logs and their templates from the data through an Approximate
Nearest Neighbors (ANN) search, serving as the corpus for In-Context Learning (ICL)” [49]. One might
think that this means they update the database incrementally after each newly parsed log. However,
in the code8 the script provided for the effectiveness evaluation does not contain the self-evolution
functionality. They provide a second file for online parsing that does provide this functionality, but
it does not work ad hoc due to missing files and missing function parameters. Also, by default they
do not use ANN but cosine similarity for retrieval, which is hidden somewhere in the code (not an
adjustable parameter). The code repository of SuperLog9 [22] only provides code for LLM training,
but not the described framework for parsing. Moreover, we found that many repositories provide
sparse instructions, which further impedes reproducibility and application, especially when the
execution scripts do not work ad hoc.
These issues appear to be widespread rather than isolated cases. Similar findings were reported in
a large-scale study by Trisovic et al. [58], which examined thousands of replication code repositories
containing R files published between 2010 and 2020. Their analysis revealed that a significant
majority of these files failed to execute properly, even after code cleaning. Likewise, research by
Olzewski et al. [47] showed that more than half of the artifacts from nearly 300 reviewed papers
could not be run at all. Even among the repositories that did execute, only a fraction produced the
expected results, while others either generated different outcomes or lacked crucial components
such as arguments or outputs.
4.13.2 Licenses. As we focus on an application perspective, we also report the code licenses for
the approaches for which the code is available. They are given in Table 4. Keep in mind that it is
possible that code repositories of preprint papers might be preliminary versions or unfinished.
Table 4. Licenses of the approaches. Most approaches do not provide a license. Please note that, especially
for preprint papers, the license might change or, if absent, be added in the future (status 7-March-2025).
[39], and LogBatcher [64] cluster the logs into multilevel clusters for parsing while the supervised
parsers DivLog [66] and LILAC [23] use this concept for sampling labeled logs. The coarse-grained
clusters thereby capture the commonality of the logs, that, in the best case, all belong to the same
log group (i.e., the same event), and the fine-grained clusters should highlight the variabilities within
that group. This idea is also used by LogShrink [34] for compressing large log files.
Xiao et al. (LogBatcher) [64] report using the heuristic rules proposed by Khan et al. [27] to
correct templates. This includes measures such as combining subsequent wildcards, like <*><*>,
into a single one <*>. This technique is also used by [23, 65, 66] and other work outside of our
selection [33, 37]. By manually inspecting some of the templates generated by the parsers used for
the evaluation, we found that a significant proportion of templates would have been considered
correct by the evaluation functions had some of these steps been applied.
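For illustration, the wildcard-merging rule mentioned above can be implemented with a single regular expression; this is a simplified rendering of one of the heuristics in [27], not their full rule set:

import re

def merge_consecutive_wildcards(template):
    # Collapse runs of placeholders such as "<*><*>" or "<*> <*>" into a single "<*>".
    return re.sub(r"<\*>(?:\s*<\*>)+", "<*>", template)

print(merge_consecutive_wildcards("Found child <*> <*> in scoreboard slot <*><*>"))
# prints: Found child <*> in scoreboard slot <*>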
5 Evaluation Setup
In terms of evaluation, the existing literature regarding LLM-based log parsing lacks a common
ground. While there are some commonalities regarding single features like used datasets, used evaluation
metrics, or used models, the combination of these varies considerably between the approaches
(see features R-1, R-2, and R-3 in Table 2). As a consequence, we create a benchmark featuring a
subset of 7 out of the 29 approaches. To counteract randomness from sampling, LLM output, and
runtime fluctuations, we run each parser 3 times and take the average.
they do not focus on frameworks but rather on the LLM itself, even though they provide their code or
model, respectively. We also exclude Semirald [55] from the evaluation since its parsing approach
consists only of naive queries to an LLM without other processing steps or model training.
In Sec. 4.13.1 we described that some approaches provide code, but do not provide functioning
parsers. They are consequently not included in the benchmark (except for LogPrompt).
Given the above-stated matters, the remaining approaches featured in the benchmark are LILAC
[23], OpenLogParser [39], LogBatcher [64], SelfLog [49], DivLog [66], LLM-TD [59], and
LogPrompt [36]. To a large extent, this selection represents the state of the art of LLM-based log
parsing frameworks, as their approaches cover most of the aspects of this field. Unfortunately, due
to the aforementioned selection criteria, this benchmark could not include any work using LLMs
as pre- or post-processors.
Edit Distance (ED) from which we compute the Normalized Edit Distance (NED) [41] following
[64].
GA and PA are sensitive to the total number of log messages, which is problematic since most
log datasets contain a large number of logs but a much smaller number of unique templates [24].
Therefore, Jiang et al. [24] proposed using F-metrics since they are insensitive to class imbalance,
which is why we use FTA. Furthermore, we use NED since it constitutes a similarity metric between
ground-truth and predicted template. The normalization also makes NED a more intuitive indicator
of effectiveness than ED.
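As [41] and the implementation followed by [64] leave some freedom in how the normalization is done, the following sketch assumes the common variant that divides the edit distance by the length of the longer string, yielding a similarity in [0, 1]:

def edit_distance(a, b):
    # Levenshtein distance via dynamic programming (row-by-row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(gt, pred):
    # Assumed normalization: 1 - ED / max(len(gt), len(pred)); 1.0 means identical templates.
    if not gt and not pred:
        return 1.0
    return 1.0 - edit_distance(gt, pred) / max(len(gt), len(pred))

print(normalized_edit_distance("Found child <*> in scoreboard slot <*>",
                               "Found child <*> in scoreboard slot 6"))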
To ensure model or API compatibility, we updated deprecated versions of the OpenAI (Python)
package to newer working ones and added the functionality to call the OpenAI, Ollama,
and TogetherAI APIs where it was not given. To ensure a consistent output format, we make sure that the template
placeholder symbols are always <*> and that the output is always a CSV file with at least the
columns “Content”, and “EventTemplate”. “Content” refers to the log filtered by the log format
(i.e. the log message). Since some parsers do not only parse the log content but the entire log, we
modify the input to these parsers to only parse the content as well. As found by He et al. [17], this
typically improves the parsing accuracy.
Specific changes were made to DivLog [66], which is one of the earliest works on LLM-based log
parsing. It does not feature a caching mechanism and therefore calls the LLM for each log line.
Since this is a significant burden in terms of runtime (and cost) and therefore also unrealistic for
real-life adoption, we added a simple cache consisting only of the set of already parsed templates.
The cache returns the corresponding template in the event of a match instead of calling the LLM.
This slightly affects the accuracy metrics due to improved output consistency. Additionally, we
implemented a minor post-processing step where multiple consecutive wildcards were replaced
by a single wildcard <*> for efficient matching, to resolve an issue with the generation of infinite
consecutive wildcards for some of the logs. This measure slightly increases DivLog’s accuracy.
LogPrompt [36] accepts a maximum prompt size parameter which is by default 3000, correspond-
ing to roughly 25 logs parsed at once (as they state). Early experiments revealed that this performs
extremely poorly, hence we set the maximum prompt size to 1000. This corresponds to roughly 5 to
10 logs per prompt, which is in the recommended range of simultaneously parsed logs by Xiao
et al. [64]. If a template cannot be matched with a log it is again sent to the LLM. We limit the
maximum number of re-prompts to 3 to avoid infinite LLM calls. LogPrompt also required some
code corrections, other than typos. For some datasets, LogPrompt runs into an infinite loop due
to a faulty template extraction process from the LLM responses. This was fixed by computing
the enumeration of the logs instead of extracting it from the prompt again. From the paper it is
not clear whether they used ICL, CoT or the self-prompt paradigm for their parsing performance
evaluation, yet the parsing functionality is only given for the self-prompt case.
SelfLog [49] requires the user to set up an SQL database, yet the main script (run.py) does not
contain the self-evolution functionality described in the paper (updating the database with new
templates). Therefore, we removed the log-groundtruth template examples part from the prompt
and operated the parser as unsupervised and without RAG (no database).
5.4.3 Sampling for Supervised Parsers. From the selected parsers LILAC [23] and DivLog [66]
are supervised parsers and require labeled logs to achieve competitive accuracy. In this study,
we emphasize minimal human effort and therefore keep the number of available templates n
small, namely n = 2 and n = 4. Note that the Apache dataset from LogHub [77] contains the
smallest number of templates, which is only 6. For LILAC and DivLog it has been shown that
the performance increases with an increasing number of labeled logs available [23, 66]. This is
obvious as the similarity search retrieves the exact template the target log belongs to with a higher
likelihood. Multiple works from our selection [23, 66] as well as [33] state that having a diverse
candidate set is crucial to reduce the risk of overfitting to a specific log template, which is why
they create specific sampling algorithms to sample from labeled logs. However, having a large
proportion of templates at hand is an unrealistic setting in real-life application and requires users
to label logs manually [21].
To ensure comparability, we exchanged the sampling processes of the supervised parsers DivLog
[66] and LILAC [23] with a simple custom sampling approach: from all labeled logs of a dataset we
randomly select 𝑛 (unique) templates. For each of those templates we randomly select a matching
log and thus yield a set of log-template pairs which constitutes the candidate set for the supervised
parsers. We sample a new set before each run. In the remainder of this paper, we indicate the
two supervised parsers by the suffix “-n”, thus LILAC-n and DivLog-n, with n ∈ {2, 4}.
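A minimal rendering of this sampling procedure (our own sketch; the function name and data layout are illustrative):

import random

def sample_candidate_set(labeled_logs, n=2, seed=None):
    # labeled_logs: list of (log, template) pairs.
    # Randomly pick n unique templates and one matching log per template.
    rng = random.Random(seed)
    by_template = {}
    for log, template in labeled_logs:
        by_template.setdefault(template, []).append(log)
    chosen = rng.sample(sorted(by_template), k=n)
    return [(rng.choice(by_template[t]), t) for t in chosen]

labeled = [
    ("workerEnv.init() ok /etc/a.properties", "workerEnv.init() ok <*>"),
    ("workerEnv.init() ok /etc/b.properties", "workerEnv.init() ok <*>"),
    ("mod_jk child workerEnv in error state 6", "mod_jk child workerEnv in error state <*>"),
    ("jk2_init() Found child 6738 in scoreboard slot 6", "jk2_init() Found child <*> in scoreboard slot <*>"),
]
print(sample_candidate_set(labeled, n=2, seed=0))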
6 Evaluation Results
This section contains the evaluation results with focus on effectiveness and efficiency.
6.1 Effectiveness
6.1.1 Baseline Performance. The performance of the baseline parsers is visualized in Fig. 3. The
implementation of ULP [51] outputs the templates in a special format, which is why
the matching process of the evaluation function does not recognize correctly parsed logs; thus
PA and FTA are 0 (GA is not affected).
Fig. 3. Performance of the baseline parsers on the corrected LogHub dataset including Audit.
6.1.2 Performance of LLM-based Parsers. The evaluation results of the parsers with the three LLMs
on LogHub [77] are given in Fig. 4a, Fig. 4b, and Fig. 4c. Note that we excluded LogPrompt from the
evaluation with DeepSeek R1 because the LLM was barely able to extract any templates, resulting
in scores of roughly 0 in experimental runs. The LLM did not provide the output in the structured
format that was requested. Therefore, the template extraction process, which is based on markers
positioned directly before the extracted template, was not able to properly extract the template. This
resulted in long inner monologues of DeepSeek R1 and maximum reprompts due to log-template
mismatches which would have resulted in an unreasonable financial expense given the cost of the
LLM and the poor performance.
In the boxplots of Fig. 4a, Fig. 4b, and Fig. 4c we can see that the best performance on each
metric and for each LLM is achieved by LogBatcher [64], followed by LILAC [23] for 2 and 4 shots.
LogBatcher and LILAC clearly outperform the conventional parsers, while the conventional parsers
outperform the rest of the LLM-based parsers. In general, the performance of each LLM parser is
rather stable across the different models. CodeLlama visibly performs worst, but it is also by far the
smallest of the models with 7 billion parameters. Given the size difference between GPT-3.5 and DeepSeek
R1, it is surprising that the latter does not outperform GPT-3.5. Furthermore, we report the
exact numbers for the evaluation with GPT-3.5 in Table 5 to show how the parsers performed on
each individual dataset.
6.1.3 LLM General Performance. Figure 5 shows the performance for CodeLlama, GPT-3.5 and
DeepSeek R1 averaged over all parsers and datasets. One can see that CodeLlama performs worst
while GPT-3.5 slightly outperforms DeepSeek R1 except for PA. Consequently, the performance
of DeepSeek R1 does not justify its higher cost. For instance, the OpenAI API for
GPT-3.5 (gpt-3.5-0125) costs 0.5$ per 1 million tokens (input and output) while DeepSeek R1 from
the TogetherAI API costs 3$ for input and 7$ for the output per 1 million tokens.

Fig. 4. Performance of the selected parsers on the corrected LogHub datasets including Audit.

We found that the
reasoning steps of DeepSeek R1 are not always helpful and, especially when a structured output is
required, may confuse the LLM. A possible reason could be that the process of log parsing, not being
complex enough, leads to overthinking [7]. However, in [7] the researchers state that DeepSeek R1
is robust against overthinking. The usefulness of reasoning models for log parsing should therefore
be investigated in future research.
6.1.4 Performance Difference on LogHub and corrected LogHub. As in the study by Khan et al. [27],
we report the performance difference between the evaluation on the corrected LogHub datasets [27]
and the original LogHub datasets [77] (performance of corrected minus performance of original)
with GPT-3.5. In Fig. 6 we can see that the highest differences are given for LogBatcher [64],
followed by LILAC [23] for both sample numbers. Naturally, LILAC’s and DivLog’s [66] samples
were collected and evaluated with the same dataset. That the difference is rather on the positive
side indicates that they achieve better performance on the corrected LogHub dataset. Interestingly,
LILAC was originally evaluated with LogHub-2.0, whose templates correspond to the original
LogHub templates and not to the corrected ones [23]. In general, we can see a slight tendency that
the parsers output templates that correspond more to the corrected datasets’ templates, which
suggests that LLMs “intuitively” prefer the corrected templates’ format.
Table 5. Performance of the selected parsers with GPT-3.5 on the corrected LogHub datasets including Audit
(Columns, one per dataset plus the average: Android, Apache, Audit, BGL, HDFS, HPC, Hadoop, HealthApp, Linux, Mac, OpenSSH, OpenStack, Proxifier, Spark, Thunderbird, Windows, Zookeeper, and Average.)
Baseline
AEL GA 0.77 1.00 0.96 1.00 0.90 0.87 0.57 0.40 0.76 0.54 0.25 0.97 0.90 0.94 0.69 0.92 0.14 0.74
FTA 0.44 0.50 0.22 0.53 0.46 0.24 0.10 0.27 0.17 0.23 0.08 0.44 0.28 0.20 0.27 0.45 0.00 0.29
PA 0.39 0.69 0.34 0.62 0.68 0.51 0.16 0.17 0.17 0.25 0.02 0.67 0.38 0.04 0.15 0.75 0.00 0.35
NED 0.90 0.92 0.84 0.93 0.97 0.70 0.56 0.74 0.71 0.92 0.60 0.97 0.71 0.71 0.90 0.92 0.38 0.79
Brain GA 0.86 1.00 0.99 1.00 0.94 0.95 1.00 0.35 0.94 1.00 0.49 0.53 1.00 0.97 1.00 0.99 0.21 0.84
FTA 0.37 0.67 0.24 0.80 0.46 0.20 0.40 0.40 0.33 0.23 0.21 0.74 0.44 0.34 0.45 0.57 0.00 0.40
PA 0.24 0.70 0.41 0.96 0.66 0.16 0.25 0.17 0.34 0.28 0.11 0.70 0.38 0.04 0.46 0.78 0.00 0.39
NED 0.84 0.92 0.84 1.00 0.88 0.64 0.88 0.82 0.86 0.95 0.96 0.97 0.96 0.82 0.92 0.94 0.41 0.86
Drain GA 0.83 1.00 0.96 1.00 0.89 0.95 0.78 0.42 0.79 0.79 0.22 0.76 0.92 0.96 1.00 0.97 0.14 0.79
FTA 0.52 0.50 0.22 0.53 0.39 0.30 0.11 0.40 0.20 0.44 0.02 0.41 0.40 0.25 0.41 0.52 0.00 0.33
PA 0.55 0.69 0.34 0.63 0.65 0.51 0.23 0.18 0.22 0.51 0.02 0.68 0.38 0.05 0.46 0.79 0.00 0.41
NED 0.90 0.92 0.84 0.93 0.96 0.79 0.61 0.78 0.76 0.97 0.51 0.96 0.96 0.81 0.86 0.94 0.41 0.82
SPELL GA 0.86 1.00 0.79 1.00 0.65 0.78 0.64 0.15 0.76 0.56 0.26 0.53 0.90 0.84 0.99 0.96 0.14 0.69
FTA 0.25 0.50 0.06 0.43 0.39 0.15 0.10 0.14 0.05 0.23 0.00 0.04 0.19 0.15 0.10 0.37 0.00 0.18
PA 0.15 0.69 0.20 0.30 0.53 0.11 0.15 0.09 0.03 0.19 0.00 0.48 0.32 0.03 0.00 0.75 0.00 0.24
NED 0.81 0.92 0.67 0.94 0.84 0.49 0.52 0.75 0.66 0.86 0.50 0.93 0.67 0.71 0.77 0.94 0.38 0.73
ULP GA 0.74 1.00 0.93 1.00 0.95 0.99 0.90 0.18 0.81 0.43 0.47 0.02 0.92 0.68 0.41 0.99 0.34 0.69
FTA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
PA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
NED 0.63 0.68 0.65 0.58 0.66 0.68 0.74 0.58 0.66 0.75 0.64 0.76 0.67 0.70 0.43 0.76 0.50 0.65
Unsupervised
LLM_TD GA 0.00 1.00 0.50 0.98 0.22 0.47 0.24 0.39 0.00 0.25 0.82 0.27 0.29 0.26 0.57 0.35 0.33 0.41
FTA 0.00 0.50 0.20 0.00 0.19 0.17 0.10 0.14 0.00 0.40 0.57 0.00 0.12 0.04 0.10 0.16 0.00 0.16
PA 0.00 0.69 0.45 0.00 0.20 0.12 0.24 0.28 0.00 0.36 0.25 0.00 0.23 0.10 0.01 0.11 0.00 0.18
NED 0.05 0.95 0.51 0.43 0.26 0.47 0.28 0.42 0.06 0.68 0.46 0.22 0.42 0.29 0.48 0.18 0.09 0.37
LogBatcher GA 0.97 1.00 0.99 1.00 0.95 0.99 1.00 0.84 0.92 1.00 0.98 1.00 1.00 0.90 1.00 0.99 0.27 0.93
FTA 0.78 1.00 0.83 1.00 0.75 0.74 0.95 0.75 0.50 0.82 0.79 1.00 0.76 0.67 0.62 0.81 0.13 0.76
PA 0.83 1.00 0.94 1.00 0.94 0.89 0.99 0.84 0.48 0.97 0.77 1.00 0.97 0.84 0.61 0.99 0.00 0.83
NED 0.92 1.00 0.99 1.00 1.00 0.97 1.00 0.94 0.86 0.99 0.94 1.00 0.98 0.96 0.85 1.00 0.77 0.95
LogPrompt GA 0.13 0.19 0.14 0.00 0.17 0.07 0.06 0.12 0.20 0.00 0.01 0.00 0.01 0.07 0.12 0.45 0.00 0.10
FTA 0.12 0.00 0.03 0.00 0.07 0.08 0.06 0.15 0.09 0.00 0.00 0.00 0.03 0.18 0.05 0.07 0.00 0.06
PA 0.24 0.16 0.38 0.00 0.55 0.15 0.27 0.13 0.23 0.13 0.10 0.00 0.25 0.09 0.33 0.50 0.00 0.21
NED 0.78 0.85 0.83 0.49 0.85 0.70 0.71 0.61 0.77 0.73 0.43 0.54 0.82 0.69 0.78 0.86 0.54 0.70
OpenLogParser GA 0.85 1.00 0.96 0.85 0.93 0.96 0.82 0.42 0.81 0.86 0.57 0.76 0.87 0.94 0.99 0.96 0.14 0.81
FTA 0.32 0.50 0.42 0.00 0.59 0.28 0.46 0.45 0.27 0.65 0.05 0.00 0.41 0.36 0.15 0.41 0.00 0.31
PA 0.24 0.69 0.72 0.00 0.89 0.09 0.66 0.24 0.26 0.69 0.06 0.00 0.36 0.06 0.01 0.48 0.00 0.32
NED 0.17 0.46 0.37 0.27 0.30 0.21 0.26 0.27 0.16 0.39 0.24 0.37 0.21 0.19 0.44 0.27 0.48 0.30
SelfLog GA 0.89 1.00 0.97 1.00 0.89 0.99 0.92 0.29 0.79 0.44 0.44 0.53 0.94 0.70 0.72 0.99 0.14 0.74
FTA 0.39 0.33 0.29 0.00 0.16 0.27 0.48 0.31 0.24 0.17 0.54 0.00 0.38 0.35 0.31 0.24 0.00 0.26
PA 0.46 0.28 0.66 0.00 0.22 0.14 0.55 0.09 0.29 0.15 0.31 0.12 0.76 0.08 0.59 0.24 0.00 0.29
NED 0.76 0.85 0.94 0.86 0.65 0.82 0.89 0.79 0.80 0.85 0.73 0.79 0.93 0.82 0.81 0.80 0.51 0.80
Supervised
DivLog-2 GA 0.56 0.91 0.60 0.49 0.52 0.84 0.68 0.16 0.38 0.47 0.30 0.72 0.46 0.36 0.52 0.93 0.42 0.55
FTA 0.39 0.48 0.32 0.06 0.23 0.31 0.57 0.32 0.19 0.33 0.11 0.39 0.23 0.31 0.19 0.59 0.48 0.32
PA 0.52 0.70 0.72 0.57 0.77 0.32 0.76 0.33 0.26 0.64 0.24 0.76 0.67 0.35 0.31 0.80 0.49 0.54
NED 0.78 0.92 0.93 0.76 0.90 0.80 0.92 0.79 0.52 0.95 0.52 0.95 0.87 0.82 0.70 0.92 0.94 0.82
DivLog-4 GA 0.67 0.53 0.55 0.65 0.36 0.70 0.86 0.20 0.59 0.59 0.27 0.32 0.67 0.40 0.52 0.87 0.36 0.53
FTA 0.42 0.43 0.22 0.25 0.18 0.25 0.56 0.34 0.30 0.55 0.10 0.29 0.48 0.30 0.22 0.53 0.45 0.34
PA 0.50 0.70 0.67 0.71 0.53 0.32 0.76 0.38 0.36 0.65 0.25 0.85 0.77 0.27 0.63 0.79 0.56 0.57
NED 0.78 0.91 0.93 0.87 0.63 0.69 0.91 0.81 0.78 0.95 0.48 0.91 0.93 0.64 0.80 0.93 0.77 0.81
LILAC-4 GA 0.95 1.00 0.98 0.94 0.95 0.97 0.91 0.30 0.81 0.75 0.49 0.95 0.97 0.94 0.79 0.99 1.00 0.86
FTA 0.66 1.00 0.89 0.46 0.75 0.59 0.80 0.70 0.47 0.84 0.83 0.92 0.67 0.56 0.57 0.69 0.88 0.72
PA 0.57 1.00 0.96 0.53 0.92 0.83 0.89 0.42 0.49 0.78 0.43 0.99 0.93 0.73 0.70 0.64 0.78 0.74
NED 0.85 1.00 0.99 0.82 0.99 0.95 0.96 0.92 0.86 0.97 0.90 1.00 0.97 0.93 0.85 0.82 0.99 0.93
LILAC-2 GA 0.96 1.00 0.96 0.94 0.88 0.89 0.91 0.29 0.79 0.63 0.49 0.85 0.92 0.96 0.69 0.99 1.00 0.83
FTA 0.70 1.00 0.80 0.50 0.70 0.57 0.80 0.61 0.48 0.75 0.74 0.77 0.66 0.61 0.57 0.70 0.67 0.68
PA 0.57 1.00 0.94 0.48 0.87 0.70 0.89 0.37 0.48 0.64 0.37 0.92 0.88 0.78 0.52 0.66 0.49 0.68
NED 0.87 1.00 0.98 0.91 0.98 0.90 0.96 0.87 0.87 0.94 0.90 0.99 0.99 0.95 0.77 0.87 0.98 0.93
6.1.5 Performance on the Audit Dataset. An analysis of the performance of all parsers on the Audit
dataset reveals that only the supervised parser, LILAC [23], and to some extent also DivLog [66], are
Fig. 5. Averaged performance of all parsers for CodeLlama, GPT-3.5 and DeepSeek R1.
Fig. 6. Performance of the parsers with GPT-3.5 on corrected LogHub minus the performance on the original
LogHub.
able to parse the logs effectively. It should be noted that the configuration of the baseline parsers
for Audit was not specifically tuned. It appears that the intended templates’ format of the logs is
not straightforward for GPT-3.5 without supervised demonstrations. This finding underscores the
efficacy of supervised parsers, demonstrating that a mere two templates are sufficient to attain a
reasonable performance level; potentially, even a single template demonstrating the preferred
style would suffice. Nevertheless, it should be noted that this is merely a single example. Future research could
establish a more extensive benchmark by incorporating a range of novel and diverse datasets.
6.2 Efficiency
For the evaluation regarding efficiency we select the well-known datasets HDFS and BGL from
LogHub-2.0 [24]. HDFS contains 11.2 million logs with 46 unique templates while BGL contains 4.6
million logs with 320 unique templates. We excluded DivLog [66] and LogPrompt [36] from this
evaluation since they do not utilize caching. The computation would therefore scale linearly with
the number of logs and require multiple million LLM calls.
Figure 8 visualizes the computation time and LLM invocation time of the parsers, while Fig. 9 shows
the number of LLM calls made. The invocation time for conventional parsers is, of course, zero.
Outstanding runtime efficiency is achieved by the conventional parser ULP [51] on both datasets.
While LLM-TD is the second fastest on HDFS and also makes the fewest LLM calls there, the opposite
holds on the BGL dataset. As mentioned in Sec. 4.13.1 and 5.4.2, there were some implementation problems
with SelfLog [49]. They provided a faulty script for the efficiency evaluation, which is why we
ran the evaluation with the script they designed for the evaluation of the 2000 log lines version
of LogHub [77]. However, for both HDFS and BGL the processes got killed and are therefore not
featured in the plots.

Fig. 7. Performance of the selected parsers with GPT-3.5 only on the Audit data.
Fig. 8. Computation and invocation time of parsers on HDFS and BGL with GPT-3.5 (computation time + LLM invocation time, in seconds).
HDFS: ULP 78; LLM_TD 274+35; LogBatcher 289+38; LILAC-4 573+34; LILAC-2 578+30; Brain 1058; AEL 1142; Drain 1222; SPELL 1328; OpenLogParser 5207+59.
BGL: ULP 47; LogBatcher 63+284; Brain 410; Drain 450; LILAC-4 806+249; LLM_TD 1610+1360; SPELL 1846; LILAC-2 1931+249; OpenLogParser 2318+441; AEL 3360.
Fig. 9. Number of LLM calls of parsers on HDFS and BGL with GPT-3.5. HDFS: LLM_TD 21.0; LILAC-2 41.3;
LILAC-4 42.0; LogBatcher 49.3; OpenLogParser 73.0. BGL: LILAC-2 325.0; LILAC-4 326.3; LogBatcher 380.0;
OpenLogParser 416.0; LLM_TD 1534.0.
7 Discussion
This section summarizes the findings of the literature review from Sec. 4 and of the evaluation
from Sec. 6 by answering the research questions from Sec. 1.
RQ2: To what extent do LLM-based log parsing approaches rely on labeled data and manual config-
uration? LLM-based log parsing approaches typically require labeled datasets for fine-tuning or
evaluation, though some methods leverage self-evolutionary approaches that use previously parsed
templates to substitute for a priori labels. However, with only 2 and 4 shots of labeled logs,
LILAC [23] achieves superior performance on the corrected LogHub dataset [27] and Audit [30, 31]
compared to the conventional parsers and the other LLM-based parsers, excluding LogBatcher [64].
Furthermore, LogBatcher demonstrates impressive performance in terms of both runtime and
effectiveness while being unsupervised and requiring only the log format as input parameter.
Fine-tuning approaches also report outstanding performance, yet their stronger reliance on labeled
logs and computational resources, together with the robust and high performance of LogBatcher
and LILAC, demonstrates that ICL is a cheaper yet performant alternative.
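To make this concrete, the following minimal sketch (in Python) shows how a 2-shot ICL prompt for
template extraction could be assembled; the demonstrations, the prompt wording, and the build_prompt
helper are hypothetical illustrations, not the exact prompts used by LILAC or LogBatcher.

# Minimal sketch: 2-shot in-context learning prompt for log template extraction.
# Demonstrations and wording are illustrative only.
DEMOS = [
    ("Connection from 10.0.0.5 closed", "Connection from <*> closed"),
    ("Received block blk_3886 of size 67108864", "Received block <*> of size <*>"),
]

def build_prompt(log_message: str) -> str:
    """Assemble a few-shot prompt asking the LLM to mask variable parts with <*>."""
    lines = ["Replace the variable parts of the log with <*> and return the template.", ""]
    for raw, template in DEMOS:
        lines += [f"Log: {raw}", f"Template: {template}", ""]
    lines += [f"Log: {log_message}", "Template:"]
    return "\n".join(lines)

# The resulting string is sent to the chosen LLM; its completion is taken as the template.
print(build_prompt("Connection from 192.168.1.7 closed"))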
In comparison to the baseline parsers AEL [25], SPELL [12], Drain [18], ULP [51] and Brain
[67], the LLM-based parsers in our evaluation selection require fewer configuration parameters
on average. This is particularly noteworthy in light of the satisfactory performance achieved
by LogBatcher [64] and suggests that the deployment of LLMs can effectively substitute a
substantial proportion of manual effort, thus enhancing usability.
RQ3: Which techniques can enhance the efficiency or effectiveness of LLM-based log parsing? It is
evident that ICL and fine-tuning present powerful learning paradigms for LLM-based log parsing.
They are not mutually exclusive and can be employed simultaneously to improve parsing accuracy.
ICL can be enhanced with RAG or with previously parsed or labeled logs as demonstrations. The
ablation studies of the reviewed papers [23, 64–66, 71] unanimously conclude that dynamic
demonstrations and sophisticated demonstration selection algorithms improve the accuracy scores.
Smart caching solutions have been shown to yield efficiency improvements, in terms of runtime and
monetary expenses, since LLM calls are usually costly. Template revision methods that correct, delete,
or merge related templates retrospectively can further improve cached templates. This can also help
adapt the parser to evolving logs caused by updates or other behavioral changes of the computer
systems, without reconfiguration or retraining. Evaluations show that it is also possible to improve
the effectiveness of parsers with more powerful LLMs, yet the effective application of reasoning
models like DeepSeek R1 should be investigated more closely. It has been shown that variable-aware
prompts improve the effectiveness of direct parsing with LLMs. Naturally, more explanations in the
prompt mean more tokens per parsed log, which negatively affects the inference time of LLMs. Short
and concise instructions and explanations could alleviate this issue. Furthermore, creating
coarse-grained clusters to capture the commonalities of logs and fine-grained clusters to capture
their variabilities in support of template extraction constitutes another promising technique that can
be applied by conventional and LLM-based parsers alike.
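As an illustration of the caching idea, the sketch below matches incoming logs against already
extracted templates and only falls back to the LLM on a cache miss; this is a simplified example
under our own assumptions (a flat list of regexes and a hypothetical llm_extract callback), not
LILAC's adaptive tree cache.

import re

class TemplateCache:
    """Simplified parsing cache: a flat list of (template, regex) pairs."""
    def __init__(self):
        self._entries = []

    def _compile(self, template: str) -> re.Pattern:
        # Turn "Received block <*> of size <*>" into "^Received block (\S+) of size (\S+)$".
        parts = [re.escape(p) for p in template.split("<*>")]
        return re.compile("^" + r"(\S+)".join(parts) + "$")

    def add(self, template: str) -> None:
        self._entries.append((template, self._compile(template)))

    def match(self, log: str):
        for template, pattern in self._entries:
            if pattern.match(log):
                return template
        return None

def parse(log: str, cache: TemplateCache, llm_extract) -> str:
    """Return the template for `log`, invoking the (costly) LLM only on cache misses."""
    template = cache.match(log)
    if template is None:
        template = llm_extract(log)  # expensive LLM call
        cache.add(template)
    return template

Template revision would then operate on the cached entries, for example by merging templates whose
patterns match each other's example logs.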
RQ4: What experiment design choices influence the comparability of the evaluation results between
LLM-based log parsing approaches? Key experiment design choices include the selection of benchmark
datasets, evaluation metrics, training methodologies, and hardware specifications. Consistency
in log datasets and pre-processing methods is crucial for fair comparisons. For instance, not all
parsers require the log format as an input parameter; examples are LLM-TD [59] and LUNAR [21].
LUNAR explicitly includes log headers, such as IP addresses, timestamps, or levels, to grasp the full
context of the log, and logs can also be partitioned into groups based on their timestamps using
time intervals [76]. Previous research [17] found that parsing only the extracted log message part,
according to the log format, improves effectiveness, but especially LLMs, with their generalization
capabilities, could utilize this extended context.
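For illustration, a LogHub-style log format string can be compiled into a regular expression that
separates headers from the message content handed to the parser; the sketch below is a simplified
assumption (header fields matched as single tokens) rather than the exact preprocessing script used
by the benchmarks.

import re

def format_to_regex(log_format: str) -> re.Pattern:
    """Compile e.g. "<Date> <Time> <Level> <Component>: <Content>" into a named-group regex."""
    pattern = ""
    for token in re.split(r"(<[^<>]+>)", log_format):
        if token.startswith("<") and token.endswith(">"):
            name = token[1:-1]
            # <Content> captures the remaining message; header fields are single tokens.
            pattern += f"(?P<{name}>.*)" if name == "Content" else f"(?P<{name}>\\S+)"
        else:
            pattern += re.escape(token)
    return re.compile("^" + pattern + "$")

fmt = format_to_regex("<Date> <Time> <Level> <Component>: <Content>")
m = fmt.match("081109 203518 INFO dfs.DataNode: Received block blk_1 of size 67108864")
assert m.group("Content") == "Received block blk_1 of size 67108864"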
A set of metrics such as GA, PA, FTA, but also NED (or ED), should be standardized across studies
to cover all relevant characteristics of templates, since each metric maps to specific requirements
of specific downstream tasks. The results of our evaluation show high variance between these
metrics for different parsers, indicating that the selection covers a variety of aspects. The choice of
model hyperparameters also significantly impacts comparability. For example, the researchers of
Lemur [72] report near-perfect scores of 0.999 and 0.996 for FGA and GA on LogHub [77], yet they
require the log format, regular expressions, and four other (hyperparameter-tuned) parameters for
each dataset. This demonstrates the performance levels that can potentially be achieved. However,
such a number of parameters is impractical for real-world applications, and a fair comparison to
the performance of other parsers is not feasible.
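As a rough illustration of two of these message-level metrics, the following sketch computes PA as
the share of exact template matches and an edit-distance-based similarity normalized by the longer
string; this is a minimal reading of the metrics, and the definitions in the benchmark tooling may
differ in detail.

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def parsing_accuracy(predicted, truth):
    """PA: share of messages whose predicted template equals the ground truth exactly."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def mean_edit_similarity(predicted, truth):
    """Per-message 1 - ED / max(length), averaged; an NED-style similarity score."""
    return sum(1 - edit_distance(p, t) / max(len(p), len(t), 1)
               for p, t in zip(predicted, truth)) / len(truth)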
The aforementioned experiment design choices concern all types of parsers. Specific to LLM-
based parsers, the employed LLM is also an important factor to consider regarding comparability of
evaluation results. As demonstrated, CodeLlama, GPT-3.5, and DeepSeek R1 perform significantly
differently in terms of effectiveness scores. Additionally, variations in API versions, underlying model
updates, and differences in temperature settings further contribute to comparability issues in
performance across studies. This highlights the need for standardized benchmarking practices to
ensure fair and reliable comparisons between different LLM-based parsing approaches.
RQ5: To what extent are the presented LLM-based parsing methods accessible and reproducible? The
accessibility and reproducibility of the presented LLM-based parsing methods are significantly
hindered by issues related to code availability, documentation, and execution reliability. Many
approaches do not provide public code, and even when code is available, it often lacks essential
components, such as necessary scripts or instructions, making replication difficult. Several reposito-
ries suffer from missing files, incorrect implementations, or incomplete functionality that does not
align with the descriptions in their respective papers. Furthermore, discrepancies in the code, such
as missing scripts for crucial steps or incorrect default settings, complicate reproducibility. Even
when modifications are made to ensure compatibility — such as fixing typos, updating libraries, or
implementing minor adjustments to improve consistency — these changes highlight the necessity
of external intervention to make the methods functional. The need for such corrections aligns with
broader findings from Trisovic et al. [58] and Olszewski et al. [47], who observed that a majority
of replication code repositories contain errors that prevent immediate execution. Olszewski et al.
further found that only a small proportion of the running codebases also produce the claimed
results of the related papers [47]. Additionally, some parsers require substantial configuration
efforts, further limiting accessibility. Whilst minor improvements, such as adding caching mech-
anisms to DivLog [66] or modifying LogPrompt’s [36] prompt size, enhance usability, they also
introduce deviations from the original implementations, raising concerns about reproducibility. The
discrepancy between our own evaluation results and the reported results raises further concerns
about the validity of the publications’ evaluations but naturally also about our own evaluation.
Future benchmarks are expected to address these concerns. Finally, while LLM-based log parsers
show great potential, their current state significantly hinders accessibility and reliable reproduction
without substantial external effort.
8 Conclusion
In this work, we conducted a systematic review of LLM-based log parsing approaches, analyzed
their strengths and weaknesses in comparison to non-LLM-based methods, and performed an
extensive benchmark evaluation to assess their effectiveness, efficiency, and usability. Our findings
demonstrate that LLM-based log parsers provide notable advantages, particularly in their adapt-
ability to diverse log formats and their capability to generalize across unseen templates. However,
they also come with challenges such as high computational costs, susceptibility to hallucinations,
and issues with interpretability.
A key observation is that LLM-based log parsers often reduce the need for manual configuration
and labeling, making them more accessible to users with limited domain expertise. Techniques
such as in-context learning (ICL), fine-tuning, retrieval-augmented generation (RAG), and template
revision can significantly enhance the efficiency and accuracy of these methods. Sophisticated
dynamic demonstration selection and caching strategies have been shown to significantly impact
performance, reducing both LLM inference time and cost. However, despite these improvements,
only two out of seven LLM-based parsers, namely LILAC [23] and LogBatcher [64], are able to
outperform the non-LLM-based parsers that constitute our baseline. Furthermore, we identified a
lack of reproducibility and comparability, as many implementations lack publicly available code or
datasets, comprehensive documentation, or consistency with reported results. Our study highlights
the necessity of standardized benchmarking practices for LLM-based log parsing. Differences in
experimental setups, dataset preprocessing, and metric selection complicate direct comparisons
between methods and impede a meaningful selection for user-specific downstream tasks.
Overall, LLM-based log parsing represents a promising direction for automated log analysis.
Techniques like caching, RAG, and template revision, especially when used in combination, have clearly
proven their value in this field, but further research is required to fully realize the potential of LLM-
based log parsing. We hope that our systematic review, benchmark study, and insights contribute
to the development of more effective, transparent, and user-friendly log parsing solutions. To
facilitate future research and reproducibility, we make our evaluation results and source code
publicly available on our GitHub repository.
Acknowledgments
Funded by the European Union under the European Defence Fund (GA no. 101121403 - NEWSROOM)
and under the Horizon Europe Research and Innovation programme (GA no. 101168144 - MIRANDA).
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect
those of the European Union or the European Commission. Neither the European Union nor the
granting authority can be held responsible for them. This work is co-funded by the Austrian FFG
Kiras project ASOC (GA no. FO999905301).
References
[1] Siraaj Akhtar, Saad Khan, and Simon Parkinson. 2025. LLM-based event log analysis techniques: A survey. https:
//doi.org/10.48550/arXiv.2502.00677 arXiv:2502.00677 [cs].
[2] Merve Astekin, Max Hort, and Leon Moonen. 2024. A Comparative Study on Large Language Models for Log Parsing. In
Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM
’24). Association for Computing Machinery, New York, NY, USA, 234–244. https://doi.org/10.1145/3674805.3686684
[3] Merve Astekin, Max Hort, and Leon Moonen. 2024. An Exploratory Study on How Non-Determinism in Large
Language Models Affects Log Parsing. In Proceedings of the ACM/IEEE 2nd International Workshop on Interpretability,
Robustness, and Benchmarking in Neural Software Engineering (InteNSE ’24). Association for Computing Machinery,
New York, NY, USA, 13–18. https://doi.org/10.1145/3643661.3643952
[4] Viktor Beck, Max Landauer, Markus Wurzenberger, Florian Skopik, and Andreas Rauber. 2024. Semi-supervised
Configuration and Optimization of Anomaly Detection Algorithms on Log Data. In 2024 IEEE International Conference
on Big Data (BigData). 2575–2585. https://doi.org/10.1109/BigData62323.2024.10825202 ISSN: 2573-2978.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. https://doi.org/10.48550/arXiv.2005.14165
arXiv:2005.14165 [cs].
[6] Jinfu Chen, Weiyi Shang, Ahmed E. Hassan, Yong Wang, and Jiangbin Lin. 2019. An Experience Report of Generating
Load Tests Using Log-Recovered Workloads at Varying Granularities of User Behaviour. In 2019 34th IEEE/ACM
International Conference on Automated Software Engineering (ASE). 669–681. https://doi.org/10.1109/ASE.2019.00068
ISSN: 2643-1572.
[7] Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar
Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and
Joseph E. Gonzalez. 2025. The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks.
https://doi.org/10.48550/arXiv.2502.08235 arXiv:2502.08235 [cs].
[8] Tianyu Cui, Shiyu Ma, Ziang Chen, Tong Xiao, Shimin Tao, Yilun Liu, Shenglin Zhang, Duoming Lin, Changchang
Liu, Yuzhe Cai, Weibin Meng, Yongqian Sun, and Dan Pei. 2024. LogEval: A Comprehensive Benchmark Suite for
Large Language Models In Log Analysis. http://arxiv.org/abs/2407.01896 arXiv:2407.01896.
[9] Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun Chen. 2022. Logram: Efficient Log Parsing Using
n-Gram Dictionaries. IEEE Transactions on Software Engineering 48, 3 (March 2022), 879–892. https://doi.org/10.
1109/TSE.2020.3007554 Conference Name: IEEE Transactions on Software Engineering.
[10] Hetong Dai, Yiming Tang, Heng Li, and Weiyi Shang. 2023. PILAR: Studying and Mitigating the Influence of
Configurations on Log Parsing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).
818–829. https://doi.org/10.1109/ICSE48619.2023.00077 ISSN: 1558-1225.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis,
Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[12] Min Du and Feifei Li. 2016. Spell: Streaming Parsing of System Event Logs. In 2016 IEEE 16th International Conference
on Data Mining (ICDM). 859–864. https://doi.org/10.1109/ICDM.2016.0103 ISSN: 2374-8486.
[13] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly Detection and Diagnosis from System
Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security (CCS ’17). Association for Computing Machinery, New York, NY, USA, 1285–1298. https://doi.org/10.1145/
3133956.3134015
[14] Asma Fariha, Vida Gharavian, Masoud Makrehchi, Shahryar Rahnamayan, Sanaa Alwidian, and Akramul Azim. 2024.
Log Anomaly Detection by Leveraging LLM-Based Parsing and Embedding with Attention Mechanism. In 2024 IEEE
Canadian Conference on Electrical and Computer Engineering (CCECE). 859–863. https://doi.org/10.1109/CCECE59415.
2024.10667308 ISSN: 2576-7046.
[15] Ying Fu, Meng Yan, Jian Xu, Jianguo Li, Zhongxin Liu, Xiaohong Zhang, and Dan Yang. 2022. Investigating and
improving log parsing in practice. In Proceedings of the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, New
York, NY, USA, 1566–1577. https://doi.org/10.1145/3540250.3558947
[16] Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto
Navigli (Eds.). Association for Computational Linguistics, Online, 3816–3830. https://doi.org/10.18653/v1/2021.acl-
long.295
[17] Pinjia He, Jieming Zhu, Shilin He, Jian Li, and Michael R. Lyu. 2016. An Evaluation Study on Log Parsing and Its Use
in Log Mining. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
654–661. https://doi.org/10.1109/DSN.2016.66 ISSN: 2158-3927.
[18] Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. 2017. Drain: An Online Log Parsing Approach with Fixed
Depth Tree. In 2017 IEEE International Conference on Web Services (ICWS). 33–40. https://doi.org/10.1109/ICWS.2017.13
[19] Shilin He, Pinjia He, Zhuangbin Chen, Tianyi Yang, Yuxin Su, and Michael R. Lyu. 2021. A Survey on Automated Log
Analysis for Reliability Engineering. ACM Comput. Surv. 54, 6 (July 2021), 130:1–130:37. https://doi.org/10.1145/3460345
[20] Shilin He, Jieming Zhu, Pinjia He, and Michael R. Lyu. 2016. Experience Report: System Log Analysis for Anomaly
Detection. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE). 207–218. https:
//doi.org/10.1109/ISSRE.2016.21 ISSN: 2332-6549.
[21] Junjie Huang, Zhihan Jiang, Zhuangbin Chen, and Michael R. Lyu. 2024. LUNAR: Unsupervised LLM-based Log
Parsing. http://arxiv.org/abs/2406.07174 arXiv:2406.07174.
[22] Yuhe Ji, Yilun Liu, Feiyu Yao, Minggui He, Shimin Tao, Xiaofeng Zhao, Su Chang, Xinhua Yang, Weibin Meng, Yuming
Xie, Boxing Chen, and Hao Yang. 2024. Adapting Large Language Models to Log Analysis with Interpretable Domain
Knowledge. https://doi.org/10.48550/arXiv.2412.01377 arXiv:2412.01377 [cs].
[23] Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and
Michael R. Lyu. 2024. LILAC: Log Parsing using LLMs with Adaptive Parsing Cache. Proc. ACM Softw. Eng. 1, FSE
(July 2024), 7:137–7:160. https://doi.org/10.1145/3643733
[24] Zhihan Jiang, Jinyang Liu, Junjie Huang, Yichen Li, Yintong Huo, Jiazhen Gu, Zhuangbin Chen, Jieming Zhu, and
Michael R. Lyu. 2024. A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We?. In Proceedings of the
33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024). Association for Computing
Machinery, New York, NY, USA, 223–234. https://doi.org/10.1145/3650212.3652123
[25] Zhen Ming Jiang, Ahmed E. Hassan, Parminder Flora, and Gilbert Hamann. 2008. Abstracting Execution Logs to
Execution Events for Enterprise Applications (Short Paper). In 2008 The Eighth International Conference on Quality
Software. 181–186. https://doi.org/10.1109/QSIC.2008.50 ISSN: 2332-662X.
[26] Rabimba Karanjai, Yang Lu, Dana Alsagheer, Keshav Kasichainula, Lei Xu, Weidong Shi, and Shou-Hsuan Stephen
Huang. 2024. LogBabylon: A Unified Framework for Cross-Log File Integration and Analysis. https://doi.org/10.1145/
3672608.3707883 arXiv:2412.12364 [cs].
[27] Zanis Ali Khan, Donghwan Shin, Domenico Bianculli, and Lionel Briand. 2022. Guidelines for assessing the accuracy
of log message template identification techniques. In Proceedings of the 44th International Conference on Software
Engineering (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1095–1106. https://doi.org/10.
1145/3510003.3510101
[28] Alex Kulesza and Ben Taskar. 2012. Determinantal Point Processes for Machine Learning. Foundations and Trends® in
Machine Learning 5, 2–3 (Dec. 2012), 123–286. https://doi.org/10.1561/2200000044 Publisher: Now Publishers, Inc.
[29] Max Landauer, Sebastian Onder, Florian Skopik, and Markus Wurzenberger. 2023. Deep learning for anomaly detection
in log data: A survey. Machine Learning with Applications 12 (June 2023), 100470. https://doi.org/10.1016/j.mlwa.2023.
100470
[30] Max Landauer, Florian Skopik, Maximilian Frank, Wolfgang Hotwagner, Markus Wurzenberger, and Andreas Rauber.
2023. Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and
Secure Computing 20, 4 (July 2023), 3466–3482. https://doi.org/10.1109/TDSC.2022.3201582 Conference Name: IEEE
Transactions on Dependable and Secure Computing.
[31] Max Landauer, Florian Skopik, Markus Wurzenberger, Wolfgang Hotwagner, and Andreas Rauber. 2021. Have it
Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed. IEEE Transactions on
Reliability 70, 1 (March 2021), 402–415. https://doi.org/10.1109/TR.2020.3031317 Conference Name: IEEE Transactions
on Reliability.
[32] Van-Hoang Le and Hongyu Zhang. 2023. Log Parsing: How Far Can ChatGPT Go?. In 2023 38th IEEE/ACM International
Conference on Automated Software Engineering (ASE). 1699–1704. https://doi.org/10.1109/ASE56229.2023.00206 ISSN:
2643-1572.
[33] Van-Hoang Le and Hongyu Zhang. 2023. Log Parsing with Prompt-Based Few-Shot Learning. In Proceedings of the
45th International Conference on Software Engineering (ICSE ’23). IEEE Press, Melbourne, Victoria, Australia, 2438–2449.
https://doi.org/10.1109/ICSE48619.2023.00204
[34] Xiaoyun Li, Hongyu Zhang, Van-Hoang Le, and Pengfei Chen. 2024. LogShrink: Effective Log Compression by
Leveraging Commonality and Variability of Log Data. In Proceedings of the IEEE/ACM 46th International Conference on
Software Engineering (ICSE ’24). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.
1145/3597503.3608129
[35] Zhenhao Li, Chuan Luo, Tse-Hsun Chen, Weiyi Shang, Shilin He, Qingwei Lin, and Dongmei Zhang. 2023. Did We Miss
Something Important? Studying and Exploring Variable-Aware Log Abstraction. In 2023 IEEE/ACM 45th International
Conference on Software Engineering (ICSE). 830–842. https://doi.org/10.1109/ICSE48619.2023.00078 ISSN: 1558-1225.
[36] Yilun Liu, Shimin Tao, Weibin Meng, Jingyu Wang, Wenbing Ma, Yuhang Chen, Yanqing Zhao, Hao Yang, and
Yanfei Jiang. 2024. Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies. In
2024 IEEE/ACM 32nd International Conference on Program Comprehension (ICPC). 35–46. https://ieeexplore.ieee.org/
document/10556497/ ISSN: 2643-7171.
[37] Yudong Liu, Xu Zhang, Shilin He, Hongyu Zhang, Liqun Li, Yu Kang, Yong Xu, Minghua Ma, Qingwei Lin, Yingnong
Dang, Saravan Rajmohan, and Dongmei Zhang. 2022. UniParser: A Unified Log Parser for Heterogeneous Log Data.
In Proceedings of the ACM Web Conference 2022 (WWW ’22). Association for Computing Machinery, New York, NY,
USA, 1893–1901. https://doi.org/10.1145/3485447.3511993
[38] Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Peter Chen, and Shaowei Wang. 2024. LLMParser: An Exploratory
Study on Using Large Language Models for Log Parsing. In 2024 IEEE/ACM 46th International Conference on Software
Engineering (ICSE). 1209–1221. https://doi.org/10.1145/3597503.3639150 ISSN: 1558-1225.
[39] Zeyang Ma, Dong Jae Kim, and Tse-Hsun Chen. 2024. LibreLog: Accurate and Efficient Unsupervised Log Parsing
Using Open-Source Large Language Models. https://doi.org/10.48550/arXiv.2408.01585 arXiv:2408.01585 [cs].
[40] Adetokunbo A.O. Makanju, A. Nur Zincir-Heywood, and Evangelos E. Milios. 2009. Clustering event logs using
iterative partitioning. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and
data mining (KDD ’09). Association for Computing Machinery, New York, NY, USA, 1255–1264. https://doi.org/10.
1145/1557019.1557154
[41] A. Marzal and E. Vidal. 1993. Computation of normalized edit distance and applications. IEEE Transactions on Pattern
Analysis and Machine Intelligence 15, 9 (Sept. 1993), 926–932. https://doi.org/10.1109/34.232078 Conference Name:
IEEE Transactions on Pattern Analysis and Machine Intelligence.
[42] Maryam Mehrabi, Abdelwahab Hamou-Lhadj, and Hossein Moosavi. 2024. The Effectiveness of Compact Fine-Tuned
LLMs in Log Parsing. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE,
438–448. https://ieeexplore.ieee.org/abstract/document/10795057/
[43] Haibo Mi, Huaimin Wang, Yangfan Zhou, Michael Rung-Tsong Lyu, and Hua Cai. 2013. Toward Fine-Grained,
Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems. IEEE Transactions on Parallel
and Distributed Systems 24, 6 (June 2013), 1245–1255. https://doi.org/10.1109/TPDS.2013.21 Conference Name: IEEE
Transactions on Parallel and Distributed Systems.
[44] Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. Few-shot Fine-tuning vs.
In-context Learning: A Fair Comparison and Evaluation. https://doi.org/10.48550/arXiv.2305.16938 arXiv:2305.16938
[cs].
[45] Sasho Nedelkoski, Jasmin Bogatinovski, Alexander Acker, Jorge Cardoso, and Odej Kao. 2021. Self-supervised Log
Parsing. In Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track, Yuxiao Dong, Dunja
Mladenić, and Craig Saunders (Eds.). Springer International Publishing, Cham, 122–138. https://doi.org/10.1007/978-
3-030-67667-4_8
[46] Paolo Notaro, Soroush Haeri, Jorge Cardoso, and Michael Gerndt. 2023. LogRule: Efficient Structured Log Mining
for Root Cause Analysis. IEEE Transactions on Network and Service Management 20, 4 (Dec. 2023), 4231–4243. https:
//doi.org/10.1109/TNSM.2023.3282270 Conference Name: IEEE Transactions on Network and Service Management.
[47] Daniel Olszewski, Allison Lu, Carson Stillman, Kevin Warren, Cole Kitroser, Alejandro Pascual, Divyajyoti Ukirde,
Kevin Butler, and Patrick Traynor. 2023. "Get in Researchers; We’re Measuring Reproducibility": A Reproducibility
Study of Machine Learning Papers in Tier 1 Security Conferences. In Proceedings of the 2023 ACM SIGSAC Conference
on Computer and Communications Security (CCS ’23). Association for Computing Machinery, New York, NY, USA,
3433–3459. https://doi.org/10.1145/3576915.3623130
[48] Yue Pang, Min Zhang, Yanli Liu, Xiangbin Li, Yidi Wang, Yahang Huan, Zhuo Liu, Jin Li, and Danshi Wang. 2024.
Large language model-based optical network log analysis using LLaMA2 with instruction tuning. Journal of Optical
Communications and Networking 16, 11 (Nov. 2024), 1116–1132. https://doi.org/10.1364/JOCN.527874 Conference
Name: Journal of Optical Communications and Networking.
[49] Changhua Pei, Zihan Liu, Jianhui Li, Erhan Zhang, Le Zhang, Haiming Zhang, Wei Chen, Dan Pei, and Gaogang Xie.
2024. Self-Evolutionary Group-wise Log Parsing Based on Large Language Model. In 2024 IEEE 35th International
Symposium on Software Reliability Engineering (ISSRE). IEEE, 49–60. https://ieeexplore.ieee.org/abstract/document/
10771304/
[50] William M. Rand. 1971. Objective Criteria for the Evaluation of Clustering Methods. J. Amer. Statist. Assoc.
66, 336 (Dec. 1971), 846–850. https://doi.org/10.1080/01621459.1971.10482356 Publisher: ASA Website _eprint:
https://www.tandfonline.com/doi/pdf/10.1080/01621459.1971.10482356.
[51] Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, and Mohammed A. Shehab. 2022. An Effective
Approach for Parsing Large Log Files. In 2022 IEEE International Conference on Software Maintenance and Evolution
(ICSME). 1–12. https://doi.org/10.1109/ICSME55016.2022.00009 ISSN: 2576-3148.
[52] Keiichi Shima. 2016. Length Matters: Clustering System Log Messages using Length of Words. https://doi.org/10.
48550/arXiv.1611.03213 arXiv:1611.03213 [cs].
[53] Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, and Gust Verbruggen. 2023. CodeFusion: A
Pre-trained Diffusion Model for Code Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural
Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics,
Singapore, 11697–11708. https://doi.org/10.18653/v1/2023.emnlp-main.716
[54] Yuchen Sun, Yanpiao Chen, Haotian Zhao, and Shan Peng. 2023. Design and Development of a Log Management
System Based on Cloud Native Architecture. In 2023 9th International Conference on Systems and Informatics (ICSAI).
1–6. https://doi.org/10.1109/ICSAI61474.2023.10423328
[55] Yicheng Sun, Jacky Keung, Zhen Yang, Shuo Liu, and Hi Kuen Yu. 2024. Semirald: A Semi-Supervised Hybrid Language
Model for Robust Anomalous Log Detection. https://papers.ssrn.com/abstract=4927951
[56] Yicheng Sun, Jacky Keung, Zhen Yang, Shuo Liu, and Jingyu Zhang. 2024. Advancing Semi-Supervised Anomaly
Detection in Software Logs with Deep Grouping and Auto-Optimization. https://doi.org/10.2139/ssrn.4918203
[57] Shimin Tao, Weibin Meng, Yimeng Cheng, Yichen Zhu, Ying Liu, Chunning Du, Tao Han, Yongpeng Zhao, Xiangguang
Wang, and Hao Yang. 2022. LogStamp: Automatic Online Log Parsing Based on Sequence Labelling. SIGMETRICS
Perform. Eval. Rev. 49, 4 (June 2022), 93–98. https://doi.org/10.1145/3543146.3543168
[58] Ana Trisovic, Matthew K. Lau, Thomas Pasquier, and Mercè Crosas. 2022. A large-scale study on research code quality
and execution. Scientific Data 9, 1 (Feb. 2022), 60. https://doi.org/10.1038/s41597-022-01143-6 Publisher: Nature
Publishing Group.
[59] Risto Vaarandi and Hayretdin Bahsi. 2024. Using Large Language Models for Template Detection from Security Event
Logs. http://arxiv.org/abs/2409.05045 arXiv:2409.05045.
[60] Zehao Wang, Haoxiang Zhang, Tse-Hsun (Peter) Chen, and Shaowei Wang. 2021. Would you like a quick peek?
providing logging support to monitor data processing in big data applications. In Proceedings of the 29th ACM Joint
Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
(ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 516–526. https://doi.org/10.1145/3468264.
3468613
[61] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny
Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Informa-
tion Processing Systems 35 (Dec. 2022), 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/hash/
9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
[62] Yifan Wu, Siyu Yu, and Ying Li. 2024. Log Parsing with Self-Generated In-Context Learning and Self-Correction.
http://arxiv.org/abs/2406.03376 arXiv:2406.03376.
[63] Markus Wurzenberger, Max Landauer, Florian Skopik, and Wolfgang Kastner. 2019. AECID-PG: A Tree-Based Log
Parser Generator To Enable Log Analysis. In 2019 IFIP/IEEE Symposium on Integrated Network and Service Management