forked from takaki/python-kconv
git-show/python-kconv
This is release 1.0- of kconv.
NAME
kconv - Kanji code Conversion library
DESCRIPTION
Kconv is a library for kanji code conversion in Python. A string in any supported
encoding can be converted into any other supported encoding. Line-break characters can
also be converted. In addition, hankaku katakana characters in the input are
converted into zenkaku katakana by default.
The C++ version is faster than the pure Python version,
but the pure Python version works on any platform where Python works.
The APIs of the C++ version and the pure Python version are identical,
so one can switch between the C++ and Python implementations without
rewriting any source code.
NOTE: This file contains examples of Japanese text in the EUC encoding.
Please make sure that your viewer can display EUC.
INSTALLATION
*C++ version
> tar xzvf kconv-0.5.5c.tar.gz
> cd kconv-0.5.5c
> ./configure [--enable-optimize]
> make
> su
# make install
If you want to enable optimization at build time, use
> ./configure --enable-optimize
instead of
> ./configure
You can also use kconv without installing it as root. To do this, simply copy
the kconv files into a directory of your choice
(and make sure that directory is in your PYTHONPATH).
*Python version
After byte-compiling, copy the "kconv" directory into the Python library directory
or a directory of your own.
ex.)
> tar xzvf kconv-0.5.5p.tar.gz
> cd kconv-0.5.5p
> python compile.py [optimize]
> tar cvf - kconv | (cd /usr/local/lib/python1.5/site-packages/; tar xvpf -)
NOTE: The preceding command must be entered on one line.
If you don't have tar, use the following command to copy the directory structure:
> cp -R kconv /usr/local/lib/python1.5/site-packages/
INTRODUCTION TO THE ROUTINES
First, import the "kconv" module:
>>> import kconv
The kconv constructor can be called in the following forms:
1. Positional arguments
>>> kc = kconv.Kconv([outcode[,incode[,hankanaconvert[,checkmode
[,mode[,blcode]]]]]])
2. Keyword arguments
>>> kc = kconv.Kconv([outcode = kconv.???][,incode = kconv.???]....)
Allowed Arguments
* out-code
Specifies the output encoding. The default is DEFAULT_OUTPUT_CODING in defaults.py.
kconv.EUC - EUC
kconv.SJIS - Shift JIS
kconv.JIS - JIS
kconv.UNICODE - Unicode (UCS2)
kconv.UTF8 - UTF-8
* in-code
If you know the input encoding, specify it for higher performance.
If no input encoding is specified, auto-detection is used.
kconv.EUC - EUC
kconv.SJIS - Shift JIS
kconv.JIS - JIS
kconv.UNICODE - Unicode (UCS2)
kconv.UTF8 - UTF-8
kconv.AUTO - Auto-detection (DEFAULT)
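For comparison, the same kind of in-code / out-code conversion can be sketched with the codecs built into modern Python 3, without kconv. This is not kconv's implementation. The helper name convert_bytes is hypothetical, and pairing the standard-library codec names (euc_jp, shift_jis, utf_8) with kconv.EUC, kconv.SJIS and kconv.UTF8 is an assumption made only for illustration.

```python
# Rough stdlib analogue of an in-code / out-code conversion (not kconv itself).

def convert_bytes(data, incode, outcode):
    """Decode bytes from `incode`, then re-encode the text as `outcode`."""
    return data.decode(incode).encode(outcode)


# EUC-JP input converted to Shift JIS output.
euc_bytes = "日本語".encode("euc_jp")
sjis_bytes = convert_bytes(euc_bytes, "euc_jp", "shift_jis")
print(sjis_bytes.decode("shift_jis"))
```

Unlike kconv, this sketch always needs the input encoding to be named explicitly.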
* hankanaconvert
Specifies whether kconv converts hankaku katakana to zenkaku katakana.
If the output encoding is EUC and you do not want to convert hankaku katakana to
zenkaku katakana, kconv outputs 2-byte hankaku katakana in EUC.
kconv.ZENKAKU - Convert hankaku katakana to zenkaku katakana (DEFAULT)
kconv.HANKAKU - Output hankaku katakana if the input string includes hankaku katakana
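To make the hankaku-to-zenkaku katakana conversion concrete, here is a minimal sketch in modern Python 3 that uses Unicode NFKC normalization on runs of halfwidth katakana only. It is a stand-in technique, not how kconv does it; the function name hankaku_kana_to_zenkaku is hypothetical, and the code works on already-decoded text rather than on EUC, SJIS or JIS bytes.

```python
import re
import unicodedata

# Halfwidth katakana and related punctuation occupy U+FF61..U+FF9F.
_HANKAKU_KANA_RUN = re.compile("[\uff61-\uff9f]+")


def hankaku_kana_to_zenkaku(text):
    """Fold halfwidth katakana to fullwidth, leaving everything else untouched.

    NFKC is applied to each halfwidth run as a whole, so a kana followed by
    a halfwidth voiced sound mark is composed into one fullwidth character.
    """
    return _HANKAKU_KANA_RUN.sub(
        lambda match: unicodedata.normalize("NFKC", match.group()), text
    )


print(hankaku_kana_to_zenkaku("ｶﾞｲﾄﾞ abc"))
```

Restricting normalization to the halfwidth katakana range keeps NFKC from also narrowing zenkaku roman letters, which would be a different conversion.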
* checkmode
Chooses the auto-detection algorithm.
If you specify the input encoding, checkmode is ignored.
kconv.FAST - Discriminate using the first definitive character.
(Fast but unreliable.)
kconv.FULL - Discriminate using the whole input string. (Weighted average)
kconv.TABLE - Discriminate with a 1-byte frequency table. (DEFAULT)
kconv.TABLE2 - Discriminate with a 2-byte frequency table.
The accuracy of TABLE2 is higher than that of TABLE, but
TABLE2 is slightly slower.
* mode
Specifies whether to convert the whole input string at once or line by line.
kconv.LINE - Convert line by line. Use this when the input encoding may change from
line to line or when you would like to
use less buffer memory.
kconv.WHOLE - Convert the whole input string. (DEFAULT)
* blcode
Specifies the output line-break character.
The default is DEFAULT_BREAKLINE_CODE in defaults.py.
kconv.LF - LF (0x0A) Unix
kconv.CR - CR (0x0D) Macintosh
kconv.CL - CR + LF (0x0D + 0x0A) Windows / DOS
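The effect of the line-break setting can be sketched in modern Python 3 as follows. This is an illustration of the idea rather than kconv's code; the function name convert_linebreaks is hypothetical, and it operates on decoded text.

```python
import re

# Match CR+LF first so that it is treated as one line break, not two.
_ANY_LINEBREAK = re.compile(r"\r\n|\r|\n")


def convert_linebreaks(text, newline="\n"):
    """Replace every LF, CR, or CR+LF in `text` with `newline`."""
    return _ANY_LINEBREAK.sub(newline, text)


mixed = "unix\nmac\rdos\r\n"
print(repr(convert_linebreaks(mixed, "\r\n")))
```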
EXAMPLES
ex1) The simplest usage.
1. Import kconv
>>> import kconv
2. Create an instance of Kconv
>>> toEuc = kconv.Kconv()
3. Call convert() on the instance
>>> print toEuc.convert( input_string )
The convert() method returns the converted string.
ex2) Converting by specifying the input(JIS) and output(SJIS) encodings.
>>> kc = kconv.Kconv(kconv.SJIS,kconv.JIS,kconv.HANKAKU)
>>> print kc.convert("JISコードの文字列だもん")
ex3) Converting to EUC, using auto-detection with the full-string detection algorithm
and converting hankaku katakana to zenkaku katakana.
>>> kc = kconv.Kconv(kconv.EUC,kconv.AUTO,kconv.ZENKAKU,
kconv.FULL)
>>> print kc.convert("文字コードは任意だよ。")
CUSTOMIZATION
You can customize kconv by editing defaults.py.
DEFAULT_INPUT_CODING This will be used when auto-detection fails.
DEFAULT_OUTPUT_CODING The default output encoding.
DEFAULT_BREAKLINE_CODE The default line-break character.
OTHER FEATURES
Kconv contains several other functions for processing Japanese text.
All functions accept a second argument that specifies the input encoding, in the same way as
the in-code argument of the constructor. The default is kconv.AUTO.
The output encoding is EUC.
*kconv.ChkHiragana(input_string[,icode])
*kconv.ChkKatakana(input_string[,icode])
Returns 1 if the input string consists only of hiragana (or katakana) characters, whitespace,
and line-break characters; otherwise returns 0.
ex)
>>> print kconv.ChkHiragana("ひらがなだけなんだもん")
1
>>> print kconv.ChkKatakana("カタカナだけじゃないみたい")
0
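A check of this kind can be sketched in modern Python 3 by testing each character against the Unicode hiragana and katakana blocks. This is only an illustration on decoded text, not kconv's implementation. The function names are hypothetical, and the exact character ranges kconv accepts are not stated here, so the block boundaries below are an assumption.

```python
# Script checks based on Unicode blocks (an assumption, not kconv's tables).

def is_all_hiragana(text):
    """True if every character is whitespace or in the hiragana block."""
    return all(ch.isspace() or "\u3041" <= ch <= "\u309f" for ch in text)


def is_all_katakana(text):
    """True if every character is whitespace or in the katakana block."""
    return all(ch.isspace() or "\u30a0" <= ch <= "\u30ff" for ch in text)


print(is_all_hiragana("ひらがなだけなんだもん"))
print(is_all_katakana("カタカナだけじゃないみたい"))
```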
*kconv.NumberConvert(input_string[,icode])
Converts zenkaku numeric characters into hankaku numeric characters.
ex)
>>> print kconv.NumberConvert("３４２２３６３２３４３３６８０２４")
34223632343368024
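The digit conversion can be sketched in modern Python 3 with a translation table, because the zenkaku digits U+FF10 to U+FF19 line up one-to-one with ASCII 0 to 9. The names ZENKAKU_DIGITS and number_convert are hypothetical, and the sketch works on decoded text rather than on encoded bytes as kconv does.

```python
# Map each zenkaku (fullwidth) digit to its hankaku (ASCII) counterpart.
ZENKAKU_DIGITS = {0xFF10 + i: 0x30 + i for i in range(10)}


def number_convert(text):
    """Replace zenkaku digits with ASCII digits; leave other characters alone."""
    return text.translate(ZENKAKU_DIGITS)


print(number_convert("２０２４年"))
```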
*kconv.Han2Zen(input_string[,icode])
*kconv.Zen2Han(input_string[,icode])
Converts hankaku (or zenkaku) characters to zenkaku (or hankaku) characters.
(This covers both roman and non-roman characters.)
Characters that cannot be converted are left unchanged.
ex)
>>> print kconv.Han2Zen("ぜんかくとHankaku(^^;")
ぜんかくとＨａｎｋａｋｕ（＾＾；
>>> print kconv.Zen2Han("ぜんかくとＨａｎｋａｋｕ（＾＾；")
ぜんかくとHankaku(^^;
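For the roman-character part of this conversion, a minimal sketch in modern Python 3 can rely on the fixed offset of 0xFEE0 between printable ASCII (U+0021 to U+007E) and the fullwidth forms (U+FF01 to U+FF5E). The function names han2zen and zen2han are hypothetical, the space mapping to U+3000 is an assumption, and katakana is deliberately not handled, so this covers less than the kconv functions do.

```python
# Width conversion for ASCII only (a partial sketch, not kconv's behaviour).
_WIDTH_OFFSET = 0xFEE0

_HAN2ZEN = {cp: cp + _WIDTH_OFFSET for cp in range(0x21, 0x7F)}
_HAN2ZEN[0x20] = 0x3000  # assumption: ASCII space becomes ideographic space
_ZEN2HAN = {zen: han for han, zen in _HAN2ZEN.items()}


def han2zen(text):
    """Widen ASCII characters; anything without a mapping is left unchanged."""
    return text.translate(_HAN2ZEN)


def zen2han(text):
    """Narrow fullwidth ASCII forms; anything else is left unchanged."""
    return text.translate(_ZEN2HAN)


print(han2zen("ぜんかくとHankaku(^^;"))
print(zen2han(han2zen("ぜんかくとHankaku(^^;")))
```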
*kconv.Hira2Kata(input_string[,icode])
*kconv.Kata2Hira(input_string[,icode])
Converts hiragana (or katakana) to katakana (or hiragana).
Roman letters and numbers are not affected.
ex)
>>> print kconv.Hira2Kata("東方は赤く燃えているようです")
東方ハ赤ク燃エテイルヨウデス
>>> print kconv.Kata2Hira("デフォルトのコード認識ルーチン")
でふぉるとのこーど認識るーちん
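The same mapping can be sketched in modern Python 3, since hiragana U+3041 to U+3096 and katakana U+30A1 to U+30F6 sit exactly 0x60 code points apart in Unicode. The function names below are hypothetical and the code works on decoded text; it is an illustration, not kconv's implementation.

```python
# Hiragana <-> katakana by code point offset (a sketch on decoded text).
_KANA_SHIFT = 0x60

_HIRA2KATA = {cp: cp + _KANA_SHIFT for cp in range(0x3041, 0x3097)}
_KATA2HIRA = {kata: hira for hira, kata in _HIRA2KATA.items()}


def hira2kata(text):
    """Convert hiragana to katakana; kanji, roman letters, etc. are kept."""
    return text.translate(_HIRA2KATA)


def kata2hira(text):
    """Convert katakana to hiragana; the long vowel mark is left as it is."""
    return text.translate(_KATA2HIRA)


print(hira2kata("東方は赤く燃えているようです"))
print(kata2hira("デフォルトのコード認識ルーチン"))
```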
*kconv.Upper(input_string[,icode])
*kconv.Lower(input_string[,icode])
Converts lower-case (or upper-case) roman letters to upper-case (or lower-case),
for both hankaku and zenkaku letters.
Other characters are not affected.
ex)
>>> print kconv.Upper("Captain AMERICA")
CAPTAIN AMERICA
>>> print kconv.Lower("Hoe Ukyu")
hoe ukyu
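On decoded text in modern Python 3, the built-in str.upper() and str.lower() already apply case mapping to zenkaku roman letters as well as hankaku ones, which gives a quick way to see the behaviour described above without kconv. The variable name mixed is just an example value.

```python
# Built-in case mapping covers both hankaku and zenkaku roman letters.
mixed = "Captain ａｍｅｒｉｃａ"

print(mixed.upper())
print(mixed.lower())
```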
NOTES
*Kconv is distributed under the most recent GPL.
*Kconv works with the EUC, JIS, SJIS, Unicode (UCS2), and UTF-8 encodings.
*The values of the kconv constants differ between the Python implementation and
the C++ version.
*kconv.FAST is not that fast and sometimes fails to detect the encoding correctly.
*Once kconv detects the Unicode encoding in line-by-line mode, it assumes
that Unicode is the input encoding for all subsequent lines.
This is because kconv detects Unicode by using the bytes 0x00 and 0xFF,
which are not used in the other encodings. Non-Unicode data must not
include the NULL character 0x00. If it does, kconv
assumes that it is a Unicode string.
*The Python implementation will raise the exception kconv.KconvError when an error occurs.
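The note above about detecting Unicode from the bytes 0x00 and 0xFF can be illustrated with a small check in modern Python 3. This is a sketch of the idea as described in the note, not kconv's actual code, and the function name looks_like_ucs2 is hypothetical.

```python
# Heuristic from the note above: 0x00 and 0xFF do not occur in EUC, SJIS,
# JIS or UTF-8 text, so their presence suggests UCS2 input.

def looks_like_ucs2(data):
    """Return True if `data` contains a 0x00 or 0xFF byte."""
    return b"\x00" in data or b"\xff" in data


print(looks_like_ucs2("abc".encode("utf_16")))     # BOM plus NUL high bytes
print(looks_like_ucs2("日本語".encode("euc_jp")))
```

A heuristic like this also shows why NUL bytes in non-Unicode data cause misdetection, and it can miss UCS2 text that has no byte order mark and happens to contain neither byte.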