
Efficient Name Generation Using the Boyer-Moore Algorithm for Meaningful Combinations


Ellijah Darrellshane Suryanegara - 13522097
Program Studi Teknik Informatika
Sekolah Teknik Elektro dan Informatika
Institut Teknologi Bandung, Jalan Ganesha 10 Bandung
E-mail (gmail): [email protected]

Abstract: This paper explores the application of the Boyer-Moore string matching algorithm to generating names with specific meanings. The Boyer-Moore algorithm, known for its efficiency in string searching, uses two heuristics, the Bad Character Heuristic and the Good Suffix Heuristic, to minimize the number of comparisons required during the matching process. We implement this algorithm to match user-defined keywords against name descriptions stored in a database, aiming to generate names that closely align with desired attributes such as "Strength," "Love," "Hope," "Beauty," and "Wisdom." Our results demonstrate that the Boyer-Moore algorithm's ability to handle large datasets efficiently and to perform quick, accurate pattern matching makes it well suited to applications requiring high performance in text searching. The generated names provide a meaningful and personalized experience for users, showcasing the practical utility of advanced string matching techniques in creative and data-driven applications.

Keywords: Boyer-Moore Algorithm, String Matching, Pattern Matching, Name Generation

I. INTRODUCTION

In the realm of computational linguistics and natural language processing, the challenge of generating names with specific meanings represents a fascinating intersection of creativity and algorithmic precision. This paper examines the application of string matching techniques to generate names that embody predetermined semantic attributes. The ability to generate meaningful names is not only an intriguing theoretical problem but also has practical implications in various domains, including brand creation, character naming in storytelling, and personalized content generation.

String matching, a fundamental concept in computer science, involves the identification and comparison of substrings within a larger string. Traditionally used for tasks such as text searching and pattern recognition, string matching algorithms have evolved to address more complex linguistic challenges. By leveraging these advanced algorithms, it is possible to systematically create names that align with specific phonetic and semantic criteria. This paper explores several methodologies, ranging from simple pattern matching to more sophisticated techniques that incorporate elements of machine learning and natural language understanding.

The significance of generating names with specific meanings extends beyond mere nomenclature. In branding, for instance, a name that resonates with the intended message or evokes particular emotions can greatly enhance a product's marketability. Similarly, in literature and media, a well-chosen name can add depth to a character, making them more memorable and relatable to the audience. This paper aims to demonstrate how string matching can be employed to automate the creation of such meaningful names, blending computational efficiency with the nuances of human creativity. Through detailed analysis and practical examples, we aim to showcase the potential of these techniques in various applications, ultimately contributing to the broader field of computational linguistics and creative automation.

II. THEORETICAL BASIS

A. String Matching

String matching is a fundamental concept in computer science that involves finding occurrences of a substring (often referred to as a "pattern") within a larger string (the "text"). This task is pivotal in various applications, including text editing, search engines, DNA sequencing, and network security. The primary goal of string matching is to efficiently locate all instances where the pattern appears in the text.

String matching can be approached through several algorithms, each varying in complexity and efficiency. The most straightforward method is the naive algorithm, which checks every possible position in the text to see whether the pattern matches there. While easy to understand and implement, the naive approach can be inefficient for long texts and patterns: for a text of length n and a pattern of length m, it may perform on the order of n × m character comparisons in the worst case.

To address the inefficiencies of the naive algorithm, more sophisticated techniques have been developed. One such method is the Knuth-Morris-Pratt (KMP) algorithm, which preprocesses the pattern to create a partial match table (also

Paper for IF2211 Strategi Algoritma (Algorithm Strategies), Semester II, Academic Year 2023/2024


known as the "prefix function"). This table is used to skip unnecessary comparisons, thereby improving search efficiency. KMP runs in O(n + m) time for a text of length n and a pattern of length m, making it significantly faster than the naive approach for larger inputs.

Another prominent algorithm is the Boyer-Moore algorithm, which preprocesses the pattern to generate two heuristic tables: the "bad character" and "good suffix" tables. These heuristics allow the algorithm to skip sections of the text, jumping over alignments that cannot produce a match. Boyer-Moore is particularly efficient for longer patterns and texts because it minimizes the number of comparisons needed.

For more advanced and specific applications, such as genomic sequencing, suffix trees and suffix arrays are used. These data structures provide a compact representation of all suffixes, and therefore all substrings, of a text, enabling extremely fast substring searches. They are especially useful when multiple queries need to be performed on the same text.

B. Boyer Moore

The Boyer-Moore algorithm combines two powerful techniques: the Bad Character Heuristic and the Good Suffix Heuristic. Each heuristic can be used independently to search for a pattern within a text, but combined they form a highly efficient algorithm. To understand how these two independent methods work together, it is helpful to compare Boyer-Moore with other string matching algorithms.

The naive algorithm slides the pattern over the text one character at a time, while the KMP algorithm preprocesses the pattern to allow shifts greater than one. The Boyer-Moore algorithm also preprocesses the pattern, creating a separate array for each of the two heuristics. During the search, each heuristic suggests a shift at every step, and the pattern is moved by the larger of the two suggested shifts.

A distinctive feature of the Boyer-Moore algorithm is that it starts matching the pattern from its last character rather than its first. The Bad Character Heuristic is described first, followed by the Good Suffix Heuristic.

Bad Character Heuristic

The Bad Character Heuristic is based on a straightforward idea. The character in the text that does not match the current character of the pattern is referred to as the bad character. When a mismatch occurs, the algorithm shifts the pattern according to one of two cases:

a. Case 1 – Mismatch becomes a match
When a mismatch occurs, the algorithm looks up the position of the last occurrence of the mismatched character within the pattern. If the mismatched character is present in the pattern, the pattern is shifted so that this character in the text aligns with its last occurrence in the pattern.

Image 1. Case 1 for Bad Character Heuristic in BM

In the above example, a mismatch occurs at position 3, where the mismatching character is "A". We search for the last occurrence of "A" in the pattern and find it at position 1 (displayed in blue). The pattern is therefore shifted by 2 positions so that the "A" in the pattern is aligned with the "A" in the text.

b. Case 2 – Pattern moves past the mismatched character

Image 2. Case 2 for Bad Character Heuristic in BM

Here the mismatch occurs at position 7. The mismatching character "C" does not occur in the pattern before position 7, so the pattern is shifted past position 7; in the above example this eventually yields a complete match of the pattern (displayed in green). The shift is safe because "C" does not occur in the pattern, so every alignment that still covers position 7 would also produce a mismatch and the search there would be fruitless.

Good Suffix Heuristic

The Good Suffix Heuristic is the other component of the Boyer-Moore algorithm. Let t be a substring of the text T that matches a suffix of the pattern P. When a mismatch occurs after this match, the pattern is shifted until one of the following holds:

1. Another occurrence of t in P is aligned with t in T.
2. A prefix of P is aligned with a suffix of t.
3. P moves past t.
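The good suffix criteria can be illustrated with a small brute-force sketch. It is not taken from the paper's code: the function name `good_suffix_shift`, the index convention, and the example patterns are assumptions chosen for illustration. It searches for the smallest shift at which every character of the matched suffix t is either aligned with an equal pattern character or falls off the left end of the shifted pattern, which covers all three criteria (weak rule only).

```python
def good_suffix_shift(pattern: str, j: int) -> int:
    """Shift suggested by the weak good suffix rule after a mismatch at pattern[j].

    The matched suffix t is pattern[j + 1:]. The result is the smallest shift
    s >= 1 such that each character of t either lines up with an equal
    character of the shifted pattern or has no pattern character under it.
    """
    m = len(pattern)
    for s in range(1, m + 1):
        if all(k - s < 0 or pattern[k - s] == pattern[k] for k in range(j + 1, m)):
            return s
    return m  # only reached for an empty pattern


if __name__ == "__main__":
    # Hypothetical patterns, one per criterion.
    print(good_suffix_shift("ABAAB", 2))  # t = "AB" occurs again at index 0
    print(good_suffix_shift("ABBAB", 1))  # prefix "AB" matches a suffix of t = "BAB"
    print(good_suffix_shift("CBAAB", 2))  # neither applies, so P moves past t
```

A production implementation would precompute these shifts for every j in linear time; the brute-force form is used here only because it mirrors the three criteria directly.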

a. Case 1 – Another occurrence of t in P matched with t in T
The pattern P might contain more than one occurrence of t. In that case, the algorithm shifts the pattern to align the next occurrence of t in P with t in T. For example:

Image 3. Case 1 for Good Suffix Heuristic in BM

In the above example, a substring t of the text T has matched the pattern P (in green) before the mismatch at index 2. We then search for another occurrence of t ("AB") in P. An occurrence is found starting at position 1 (yellow background), so the pattern is shifted right by 2 positions to align t in P with t in T. This is the weak rule of the original Boyer-Moore algorithm and is not very effective.

b. Case 2 – A prefix of P, which matches with suffix of t in T
An occurrence of t in P cannot always be found. Sometimes there is no other occurrence at all; in such cases we can search for a suffix of t that matches a prefix of P and align them by shifting P. For example:

Image 4. Case 2 for Good Suffix Heuristic in BM

In the above example, t ("BAB") has matched P (in green) at indices 2 to 4 before the mismatch. Because there is no other occurrence of t in P, we search for a prefix of P that matches a suffix of t. The prefix "AB" (yellow background), starting at index 0, matches not the whole of t but its suffix "AB", which starts at index 3. The pattern is therefore shifted by 3 positions to align the prefix with the suffix.

c. Case 3 – P moves past t
If neither of the two cases above applies, the pattern is shifted past t. For example:

Image 5. Case 3 for Good Suffix Heuristic in BM

In the above example, there is no other occurrence of t ("AB") in P, and no prefix of P matches a suffix of t. No complete match can therefore begin before index 4, so P is shifted past t, i.e. to index 5. We look up the position of the last occurrence of the mismatching character in the pattern, and if the character does not exist there, the pattern is shifted past the mismatching character.

III. ANALYSIS AND IMPLEMENTATION

The name generator application implemented in the accompanying code uses the Boyer-Moore algorithm to match meaningful keywords within name descriptions stored in a database. The primary goal is to find and generate names that closely align with the user-specified meanings.

Steps Involved:

1. Input Handling: The user inputs a desired meaning or set of attributes (e.g., "Strength Love Hope Beauty Wisdom") and specifies the gender of the names (male, female, or unisex).

2. Tokenization: The input string is tokenized into individual words, which are then used as search patterns.

3. Database Query: The application connects to a MySQL database containing names and their corresponding meanings. Depending on the specified gender, an appropriate query is executed to fetch relevant name-meaning pairs.

4. Pattern Matching: For each name-meaning pair retrieved from the database, the Boyer-Moore algorithm is used to count how many of the input words (patterns) are present in the meaning.

5. Filtering and Combination: The results are filtered to identify the names with the highest number of matching patterns. The best three-word name combinations are then generated based on these filtered results.

6. Output: The application outputs the names and their meanings, highlighting those that most closely match the desired attributes.
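Steps 1 to 3 above can be sketched as follows. This is an illustrative sketch only, not the paper's code: Python's built-in `sqlite3` stands in for the MySQL connection so the example runs without a server, and the table name `names`, its columns, the single parameterized query, and the placeholder rows are all assumptions.

```python
import sqlite3

VALID_GENDERS = {"male", "female", "unisex"}


def tokenize(user_input: str) -> list[str]:
    """Step 2: split the requested meanings into individual search patterns."""
    return user_input.split()


def fetch_name_meanings(conn: sqlite3.Connection, gender: str) -> list[tuple[str, str]]:
    """Step 3: fetch (name, meaning) pairs for one gender from the hypothetical table."""
    if gender not in VALID_GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    rows = conn.execute(
        "SELECT name, meaning FROM names WHERE gender = ? ORDER BY name",
        (gender,),
    )
    return rows.fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with placeholder rows (not the paper's data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (name TEXT, meaning TEXT, gender TEXT)")
    conn.executemany(
        "INSERT INTO names VALUES (?, ?, ?)",
        [
            ("NameA", "strength and hope", "female"),
            ("NameB", "wisdom", "male"),
            ("NameC", "love and beauty", "female"),
        ],
    )
    return conn


if __name__ == "__main__":
    patterns = tokenize("Strength Love Hope Beauty Wisdom")  # step 1 input
    pairs = fetch_name_meanings(make_demo_db(), "female")
    print(patterns)
    print(pairs)
```

Passing the gender as a query parameter rather than formatting it into the SQL string avoids injection problems; the original application may instead use a separate query per gender.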

Breakdown

A. Bad Character Heuristic

Image 6. Bad Char Heuristic function

This function initializes a table with 256 entries (one per 8-bit character code), setting each entry to -1. It then iterates through the pattern, updating the table with the index of the last occurrence of each character. This table is used to determine how far to shift the pattern when a mismatch occurs.

B. Boyer-Moore Search

Image 7. BM Search function

This function performs the Boyer-Moore search. It initializes the length variables for the pattern and the text, generates the bad character table, and sets the initial shift (s) to 0. It then enters a loop to slide the pattern over the text:

• Matching from the End: The algorithm starts comparing characters from the end of the pattern towards the beginning.
• Mismatch Handling: If a mismatch is found, the shift is determined by the bad character heuristic. The pattern is moved to align the mismatched character in the text with its last occurrence in the pattern.
• Pattern Found: If the entire pattern matches the text, the function returns True, indicating that the pattern is found at the current shift position.

C. Count Matching Words

Image 8. Count Matching Words function

This function iterates over each input pattern and uses the Boyer-Moore search function to count how many of the patterns are found in the text (the name meaning). The total count of matches is then returned.

D. Filter Words

Image 9. Filter Words function

This function connects to the MySQL database and retrieves names and their meanings based on the specified gender. It then uses the Boyer-Moore algorithm to count the number of matching input words in each meaning. The names with the highest number of matches are stored in a results list.
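The functions described in this breakdown can be sketched as follows. This is a reconstruction from the prose rather than the original listing: the function names, the choice to compare strings as UTF-8 bytes (so that every symbol fits the 256-entry table), the case-sensitive comparison, and the in-memory `pairs` argument in place of the database query are all assumptions.

```python
TABLE_SIZE = 256  # one entry per byte value, as in the 256-entry table described above


def bad_char_heuristic(pattern: bytes) -> list[int]:
    """Return the index of the last occurrence of each byte in the pattern (-1 if absent)."""
    table = [-1] * TABLE_SIZE
    for index, byte in enumerate(pattern):
        table[byte] = index
    return table


def boyer_moore_search(text: str, pattern: str) -> bool:
    """Return True if pattern occurs in text, using only the bad character rule."""
    # Assumption: compare UTF-8 bytes so that every symbol indexes the table safely.
    t, p = text.encode("utf-8"), pattern.encode("utf-8")
    n, m = len(t), len(p)
    if m == 0:
        return True  # assumption: an empty pattern trivially matches
    table = bad_char_heuristic(p)
    s = 0  # current shift of the pattern relative to the text
    while s <= n - m:
        j = m - 1
        # Compare from the end of the pattern towards its beginning.
        while j >= 0 and p[j] == t[s + j]:
            j -= 1
        if j < 0:
            return True  # the whole pattern matched at shift s
        # Align the bad character with its last occurrence in the pattern;
        # max(1, ...) keeps the shift positive when that occurrence lies right of j.
        s += max(1, j - table[t[s + j]])
    return False


def count_matching_words(patterns: list[str], text: str) -> int:
    """Count how many of the patterns occur in the text (a name meaning)."""
    return sum(1 for pattern in patterns if boyer_moore_search(text, pattern))


def filter_words(pairs: list[tuple[str, str]], patterns: list[str]) -> list[tuple[str, str]]:
    """Keep the (name, meaning) pairs whose meaning matches the most patterns."""
    scored = [(name, meaning, count_matching_words(patterns, meaning)) for name, meaning in pairs]
    best = max((score for _, _, score in scored), default=0)
    return [(name, meaning) for name, meaning, score in scored if score == best]


if __name__ == "__main__":
    demo_pairs = [("NameA", "strength and hope"), ("NameB", "wisdom"), ("NameC", "love")]
    print(filter_words(demo_pairs, ["strength", "hope"]))
```

Because the comparison is case-sensitive as written, an input such as "Strength" would not match a meaning that contains "strength"; lowercasing both sides before the search would be one way to handle that, if the application requires it.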

E. Generate Full Names

Image 10. Generate Full Names function

This function generates the final full names based on the best-matching combinations. It filters the names using the Boyer-Moore algorithm and then generates three-word name combinations. It calculates the number of matches for each combination's combined meaning and selects the names with the highest number of matching attributes.
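The combination step can be sketched as below. The sketch is illustrative and rests on several assumptions: the function names are hypothetical, `itertools.combinations` is used so that the order of names within a combination is ignored (the original may treat order differently), and Python's `in` operator stands in for the Boyer-Moore search to keep the example self-contained, since both answer the same yes/no question.

```python
from itertools import combinations


def count_matching_words(patterns: list[str], text: str) -> int:
    """Stand-in counter: `in` replaces the Boyer-Moore search for brevity."""
    return sum(1 for pattern in patterns if pattern in text)


def generate_full_names(
    candidates: list[tuple[str, str]], patterns: list[str], size: int = 3
) -> tuple[int, list[str]]:
    """Return the best score and every `size`-word full name that reaches it.

    `candidates` holds (name, meaning) pairs that survived filtering.
    """
    best_score = -1
    best_names: list[str] = []
    for combo in combinations(candidates, size):
        full_name = " ".join(name for name, _ in combo)
        combined_meaning = " ".join(meaning for _, meaning in combo)
        score = count_matching_words(patterns, combined_meaning)
        if score > best_score:
            best_score, best_names = score, [full_name]
        elif score == best_score:
            best_names.append(full_name)
    return best_score, best_names


if __name__ == "__main__":
    # Placeholder candidates, not taken from the paper's data.
    demo = [("A", "strength"), ("B", "love"), ("C", "hope"), ("D", "calm")]
    print(generate_full_names(demo, ["strength", "love", "hope"]))
```

The number of three-name combinations grows cubically with the number of candidates, which is one reason to filter the names down to the best matches before combining them.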

Test Cases

• Female
a. TC1
b. TC2
c. TC3

• Male
a. TC1
b. TC2
c. TC3

• Unisex
a. TC1
b. TC2
c. TC3

IV. CONCLUSION

The Boyer-Moore algorithm's integration into the name generator application demonstrates its capability to efficiently match patterns within large texts. By leveraging the bad character and good suffix heuristics, the algorithm minimizes unnecessary comparisons and accelerates the search process. In this context, the algorithm is used to identify names that closely align with user-defined meanings, resulting in a more personalized and meaningful name generation process. This approach highlights the versatility and efficiency of the Boyer-Moore algorithm in real-world applications, particularly in tasks involving large datasets and the need for rapid, accurate pattern matching.

V. ACKNOWLEDGMENT

I am profoundly grateful to God Almighty, creator of the universe, for His guidance throughout the journey of crafting this paper. I would also like to express my sincere gratitude to all the following individuals who have played pivotal roles in the completion of this paper:

1. Dr. Ir. Rinaldi Munir, M.T., my dedicated class professor and course coordinator, whose guidance and syllabus have been instrumental in providing invaluable insights that helped keep this research endeavor on course.
2. My esteemed colleagues of IF'22, whose collaborative spirit and shared enthusiasm fostered an enriching academic environment, stimulating meaningful discussions and enhancing the overall research experience.
3. Last but not least, my heartfelt appreciation goes to my parents for their enduring support, encouragement, and understanding. Their unwavering belief in my academic pursuits has been a constant source of inspiration, and I am truly grateful for their love throughout this academic journey.

VI. REFERENCES

[1] GeeksForGeeks, Boyer Moore Algorithm | Good Suffix heuristic. October 2023. Accessed through https://www.geeksforgeeks.org/boyer-moore-algorithm-good-suffix-heuristic/ on June 11th, 2024.
[2] GeeksForGeeks, Boyer Moore Algorithm for Pattern Searching. March 2024. Accessed through https://www.geeksforgeeks.org/boyer-moore-algorithm-for-pattern-searching/ on June 11th, 2024.
[3] Munir, Rinaldi, Pencocokan String (String/Pattern Matching). Institut Teknologi Bandung, 2021. Accessed through https://informatika.stei.itb.ac.id/~rinaldi.munir/Stmik/2020-2021/Pencocokan-string-2021.pdf on June 11th, 2024.
[4] https://www.momjunction.com/baby-names as data source. Accessed on June 12th, 2024.

LINK OF GITHUB AND YOUTUBE

https://github.com/HenryofSkalitz1202/NameGenerator
https://youtu.be/77bVz15UyM4

STATEMENT OF ORIGINALITY

I hereby declare that this paper is an original composition of my own, not an adaptation or translation of the authored works of others, and free from plagiarism.

Bandung, 12 June 2024

Ellijah Darrellshane Suryanegara


13522097
