Python实现全角半角字符互转的方法
前言
相信对于每一个编程人员来说,在文本处理的时候,经常会遇到全角半角不一致的问题。于是需要程序能够快速的在两者之间互转。由于全角半角本身存在着映射关系,所以处理起来并不复杂。
具体规则为:
全角字符unicode编码从65281~65374(十六进制0xFF01~0xFF5E)
半角字符unicode编码从33~126(十六进制0x21~0x7E)
空格比较特殊,全角为12288(0x3000),半角为32(0x20)
而且除空格外,全角/半角按unicode编码排序在顺序上是对应的(半角+65248=全角)
所以可以直接通过用+-法来处理非空格数据,对空格单独处理。
用到的一些函数
chr()函数用一个范围在range(256)内的(就是0~255)整数作参数,返回一个对应的字符。
unichr()跟它一样,只不过返回的是Unicode字符。
ord()函数是chr()函数或unichr()函数的配对函数,它以一个字符(长度为1的字符串)作为参数,返回对应的ASCII数值,或者Unicode数值。
先来打印下映射关系:
foriinxrange(33,127): printi,chr(i),i+65248,unichr(i+65248)
返回结果
33!65281!
34"65282"
35#65283#
36$65284$
37%65285%
38&65286&
39'65287'
40(65288(
41)65289)
42*65290*
43+65291+
44,65292,
45-65293-
46.65294.
47/65295/
480652960
491652971
502652982
513652993
524653004
535653015
546653026
557653037
568653048
579653059
58:65306:
59;65307;
60<65308<
61=65309=
62>65310>
63?65311?
64@65312@
65A65313A
66B65314B
67C65315C
68D65316D
69E65317E
70F65318F
71G65319G
72H65320H
73I65321I
74J65322J
75K65323K
76L65324L
77M65325M
78N65326N
79O65327O
80P65328P
81Q65329Q
82R65330R
83S65331S
84T65332T
85U65333U
86V65334V
87W65335W
88X65336X
89Y65337Y
90Z65338Z
91[65339[
92\65340\
93]65341]
94^65342^
95_65343_
96`65344`
97a65345a
98b65346b
99c65347c
100d65348d
101e65349e
102f65350f
103g65351g
104h65352h
105i65353i
106j65354j
107k65355k
108l65356l
109m65357m
110n65358n
111o65359o
112p65360p
113q65361q
114r65362r
115s65363s
116t65364t
117u65365u
118v65366v
119w65367w
120x65368x
121y65369y
122z65370z
123{65371{
124|65372|
125}65373}
126~65374~
把全角转成半角:
deffull2half(s):
n=[]
s=s.decode('utf-8')
forcharins:
num=ord(char)
ifnum==0x3000:
num=32
elif0xFF01<=num<=0xFF5E:
num-=0xfee0
num=unichr(num)
n.append(num)
return''.join(n)
把半角转成全角:
defhalf2full(s):
n=[]
s=s.decode('utf-8')
forcharins:
num=char(char)
ifnum==320:
num=0x3000
elif0x21<=num<=0x7E:
num+=0xfee0
num=unichr(num)
n.append(num)
return''.join(n)
上面的实现方式非常的简单,但是现实情况下可能并不会把所以的字符统一进行转换,比如中文文章中我们期望将所有出现的字母和数字全部转化成半角,而常见标点符号统一使用全角,上面的转化就不适合了。
解决方案,是自定义词典。
#!/usr/bin/envpython
#-*-coding:utf-8-*-
FH_SPACE=FHS=((u"",u""),)
FH_NUM=FHN=(
(u"0",u"0"),(u"1",u"1"),(u"2",u"2"),(u"3",u"3"),(u"4",u"4"),
(u"5",u"5"),(u"6",u"6"),(u"7",u"7"),(u"8",u"8"),(u"9",u"9"),
)
FH_ALPHA=FHA=(
(u"a",u"a"),(u"b",u"b"),(u"c",u"c"),(u"d",u"d"),(u"e",u"e"),
(u"f",u"f"),(u"g",u"g"),(u"h",u"h"),(u"i",u"i"),(u"j",u"j"),
(u"k",u"k"),(u"l",u"l"),(u"m",u"m"),(u"n",u"n"),(u"o",u"o"),
(u"p",u"p"),(u"q",u"q"),(u"r",u"r"),(u"s",u"s"),(u"t",u"t"),
(u"u",u"u"),(u"v",u"v"),(u"w",u"w"),(u"x",u"x"),(u"y",u"y"),(u"z",u"z"),
(u"A",u"A"),(u"B",u"B"),(u"C",u"C"),(u"D",u"D"),(u"E",u"E"),
(u"F",u"F"),(u"G",u"G"),(u"H",u"H"),(u"I",u"I"),(u"J",u"J"),
(u"K",u"K"),(u"L",u"L"),(u"M",u"M"),(u"N",u"N"),(u"O",u"O"),
(u"P",u"P"),(u"Q",u"Q"),(u"R",u"R"),(u"S",u"S"),(u"T",u"T"),
(u"U",u"U"),(u"V",u"V"),(u"W",u"W"),(u"X",u"X"),(u"Y",u"Y"),(u"Z",u"Z"),
)
FH_PUNCTUATION=FHP=(
(u".",u"."),(u",",u","),(u"!",u"!"),(u"?",u"?"),(u"”",u'"'),
(u"'",u"'"),(u"‘",u"`"),(u"@",u"@"),(u"_",u"_"),(u":",u":"),
(u";",u";"),(u"#",u"#"),(u"$",u"$"),(u"%",u"%"),(u"&",u"&"),
(u"(",u"("),(u")",u")"),(u"‐",u"-"),(u"=",u"="),(u"*",u"*"),
(u"+",u"+"),(u"-",u"-"),(u"/",u"/"),(u"<",u"<"),(u">",u">"),
(u"[",u"["),(u"¥",u"\\"),(u"]",u"]"),(u"^",u"^"),(u"{",u"{"),
(u"|",u"|"),(u"}",u"}"),(u"~",u"~"),
)
FH_ASCII=HAC=lambda:((fr,to)formin(FH_ALPHA,FH_NUM,FH_PUNCTUATION)forfr,toinm)
HF_SPACE=HFS=((u"",u""),)
HF_NUM=HFN=lambda:((h,z)forz,hinFH_NUM)
HF_ALPHA=HFA=lambda:((h,z)forz,hinFH_ALPHA)
HF_PUNCTUATION=HFP=lambda:((h,z)forz,hinFH_PUNCTUATION)
HF_ASCII=ZAC=lambda:((h,z)forz,hinFH_ASCII())
defconvert(text,*maps,**ops):
"""全角/半角转换
args:
text:unicodestringneedtoconvert
maps:conversionmaps
skip:skipoutofcharacter.Inatupleorstring
return:convertedunicodestring
"""
if"skip"inops:
skip=ops["skip"]
ifisinstance(skip,basestring):
skip=tuple(skip)
defreplace(text,fr,to):
returntextiffrinskipelsetext.replace(fr,to)
else:
defreplace(text,fr,to):
returntext.replace(fr,to)
forminmaps:
ifcallable(m):
m=m()
elifisinstance(m,dict):
m=m.items()
forfr,toinm:
text=replace(text,fr,to)
returntext
if__name__=='__main__':
text=u"成田空港—【JR特急成田エクスプレス号・横浜行,2站】—東京—【JR新幹線はやぶさ号・新青森行,6站】—新青森—【JR特急スーパー白鳥号・函館行,4站】—函館"
printconvert(text,FH_ASCII,{u"【":u"[",u"】":u"]",u",":u",",u".":u"。",u"?":u"?",u"!":u"!"},spit=",。?!“”")
特别注意:引号在英语体系中引号是不区分前引号和后引号。
总结
以上就是关于Python实现全角半角字符互转的方法,希望本文的内容对大家的学习或者工作能带来一定的帮助,如果有疑问大家可以留言交流。