Open In App

How to Find Chinese And Japanese Character in a String in Python

Last Updated : 15 Sep, 2024
Comments
Improve
Suggest changes
Like Article
Like
Report

Detecting Chinese or Japanese characters in a string can be useful for a variety of applications, such as text preprocessing, language detection, and character classification. In this article, we’ll explore simple yet effective ways to identify Chinese or Japanese characters in a string using Python.

Understanding Unicode Character Ranges

Chinese, Japanese, and Korean (CJK) characters are part of the Unicode standard. These characters belong to specific Unicode ranges, and we can leverage these ranges to detect the presence of CJK characters in a string.

Here are the relevant Unicode ranges for Chinese and Japanese characters:

Chinese characters (CJK Unified Ideographs):

  • \u4E00 to \u9FFF (Common and uncommon Chinese characters)
  • \u3400 to \u4DBF (More rare Chinese characters)

Japanese characters:

  • Hiragana: \u3040 to \u309F
  • Katakana: \u30A0 to \u30FF
  • Kanji: Same as Chinese (CJK Unified Ideographs)

Using Regular Expressions to Find Chinese or Japanese Characters

The most efficient way to find Chinese or Japanese characters in a string is by using regular expressions (regex) to match Unicode character ranges. By specifying the Unicode ranges of Chinese or Japanese characters in a regular expression, we can search for them in any given string.

1. Finding Chinese Characters

Chinese characters, including Simplified and Traditional characters, fall within the CJK Unified Ideographs block. In this example, we provide the range of Chinese characters to the compile() method to convert the expression pattern into a regex object. The findall() method is used to search the entire string (text) for all matches that correspond to Chinese characters.

Python
import re

def find_chinese(text):
    # Regex pattern for Chinese characters (CJK Unified Ideographs)
    chinese_pattern = re.compile(r'[\u4E00-\u9FFF]')
    return chinese_pattern.findall(text)

# Example usage
text = "This is an example string 包含中文字符."
chinese_chars = find_chinese(text)
print("Chinese characters found:", chinese_chars)

Output
Chinese characters found: ['包', '含', '中', '文', '字', '符']

2. Finding Japanese Characters

To detect Japanese characters, we need to include the Unicode ranges for Hiragana, Katakana, and Kanji (CJK Ideographs). In this example, we provide the range of Japanese characters to the compile() method to convert the expression pattern into a regex object. The findall() method is used to search the entire string (text) for all matches that correspond to Japanese characters.

Python
# import regular expression module
import re

def find_japanese(text):
    # Regex pattern for Hiragana, Katakana, and Kanji (CJK Ideographs)
    japanese_pattern = re.compile(r'[\u3040-\u30FF\u4E00-\u9FFF]')
    return japanese_pattern.findall(text)

# Example usage
text = "これは日本語の文字列です。"
japanese_chars = find_japanese(text)
print("Japanese characters found:", japanese_chars)

Output
Japanese characters found: ['こ', 'れ', 'は', '日', '本', '語', 'の', '文', '字', '列', 'で', 'す']

3. Detecting Both Chinese and Japanese Characters

Since Chinese and Japanese characters overlap in certain Unicode blocks (such as CJK Ideographs), we can write a single function to detect both languages.

Python
import re

def find_cjk(text):
    # Regex pattern for CJK characters (Chinese, Japanese)
    cjk_pattern = re.compile(r'[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]')
    return cjk_pattern.findall(text)

# Example usage
text = "This string contains 漢字 and カタカナ."
cjk_chars = find_cjk(text)
print("CJK characters found:", cjk_chars)

Output
CJK characters found: ['漢', '字', 'カ', 'タ', 'カ', 'ナ']

Iterating Over the String to Check for CJK Characters

Another method involves manually iterating over the string and checking each character to see if it belongs to a specific Unicode range.

1. Checking for Chinese Characters

In this example, we will use a simple for loop to iterate over each character in the string and then using if condition, check if it belongs to the Chinese characters or not.

Python
def contains_chinese(text):
    for char in text:
        if '\u4E00' <= char <= '\u9FFF' or '\u3400' <= char <= '\u4DBF':
            return True
    return False

# Example usage
text = "This sentence contains 中文."
if contains_chinese(text):
    print("Chinese characters detected.")
else:
    print("No Chinese characters found.")

Output
Chinese characters detected.

Checking for Japanese Characters

In this example, we will use a simple for loop to iterate over each character in the string and then using if condition, check if it belongs to the Japanese characters or not.

Python
def contains_japanese(text):
    for char in text:
        if '\u3040' <= char <= '\u30FF' or '\u4E00' <= char <= '\u9FFF':
            return True
    return False

# Example usage
text = "これはテストです。"
if contains_japanese(text):
    print("Japanese characters detected.")
else:
    print("No Japanese characters found.")

Output
Japanese characters detected.

Detecting Multiple CJK Characters Simultaneously

If we want to check whether a string contains both Chinese and Japanese characters, we can combine the approaches.

Python
def detect_cjk_languages(text):
    has_chinese = any('\u4E00' <= char <= '\u9FFF' or 
                      '\u3400' <= char <= '\u4DBF' for char in text)
    has_japanese = any('\u3040' <= char <= '\u30FF' or 
                       '\u4E00' <= char <= '\u9FFF' for char in text)
    
    if has_chinese:
        print("Chinese characters detected.")
    if has_japanese:
        print("Japanese characters detected.")
    if not has_chinese and not has_japanese:
        print("No CJK characters detected.")

# Example usage
text = "これはテストです and this contains 漢字."
detect_cjk_languages(text)

Output
Chinese characters detected.
Japanese characters detected.

Real-World Use Cases

  • Text Preprocessing for NLP: Detecting and filtering out Chinese or Japanese characters can be useful in multilingual text processing and machine learning tasks.
  • Language Detection: Before analyzing a text, we might want to check for the presence of different languages and take actions accordingly (e.g., translating the text, applying specific language models).
  • Character Categorization: This method can be used to categorize and filter strings based on character types.

Conclusion

Detecting Chinese or Japanese characters in Python can be easily achieved using regular expressions or iterating over characters in the string. Whether we’re looking to preprocess multilingual text, detect specific languages, or filter certain character types, these methods provide a straightforward approach to handling CJK characters in our Python applications.


Next Article
Practice Tags :

Similar Reads