Python实现全角半角字符互转的方法
前言
相信对于每一个编程人员来说,在文本处理的时候,经常会遇到全角半角不一致的问题。于是需要程序能够快速的在两者之间互转。由于全角半角本身存在着映射关系,所以处理起来并不复杂。
具体规则为:
全角字符unicode编码从65281~65374(十六进制0xFF01~0xFF5E)
半角字符unicode编码从33~126(十六进制0x21~0x7E)
空格比较特殊,全角为12288(0x3000),半角为32(0x20)
而且除空格外,全角/半角按unicode编码排序在顺序上是对应的(半角+65248=全角)
所以可以直接通过用+-法来处理非空格数据,对空格单独处理。
用到的一些函数
chr()函数用一个范围在range(256)内的(就是0~255)整数作参数,返回一个对应的字符。
unichr()跟它一样,只不过返回的是Unicode字符。
ord()函数是chr()函数或unichr()函数的配对函数,它以一个字符(长度为1的字符串)作为参数,返回对应的ASCII数值,或者Unicode数值。
先来打印下映射关系:
foriinxrange(33,127): printi,chr(i),i+65248,unichr(i+65248)
返回结果
33!65281! 34"65282" 35#65283# 36$65284$ 37%65285% 38&65286& 39'65287' 40(65288( 41)65289) 42*65290* 43+65291+ 44,65292, 45-65293- 46.65294. 47/65295/ 480652960 491652971 502652982 513652993 524653004 535653015 546653026 557653037 568653048 579653059 58:65306: 59;65307; 60<65308< 61=65309= 62>65310> 63?65311? 64@65312@ 65A65313A 66B65314B 67C65315C 68D65316D 69E65317E 70F65318F 71G65319G 72H65320H 73I65321I 74J65322J 75K65323K 76L65324L 77M65325M 78N65326N 79O65327O 80P65328P 81Q65329Q 82R65330R 83S65331S 84T65332T 85U65333U 86V65334V 87W65335W 88X65336X 89Y65337Y 90Z65338Z 91[65339[ 92\65340\ 93]65341] 94^65342^ 95_65343_ 96`65344` 97a65345a 98b65346b 99c65347c 100d65348d 101e65349e 102f65350f 103g65351g 104h65352h 105i65353i 106j65354j 107k65355k 108l65356l 109m65357m 110n65358n 111o65359o 112p65360p 113q65361q 114r65362r 115s65363s 116t65364t 117u65365u 118v65366v 119w65367w 120x65368x 121y65369y 122z65370z 123{65371{ 124|65372| 125}65373} 126~65374~
把全角转成半角:
deffull2half(s): n=[] s=s.decode('utf-8') forcharins: num=ord(char) ifnum==0x3000: num=32 elif0xFF01<=num<=0xFF5E: num-=0xfee0 num=unichr(num) n.append(num) return''.join(n)
把半角转成全角:
defhalf2full(s): n=[] s=s.decode('utf-8') forcharins: num=char(char) ifnum==320: num=0x3000 elif0x21<=num<=0x7E: num+=0xfee0 num=unichr(num) n.append(num) return''.join(n)
上面的实现方式非常的简单,但是现实情况下可能并不会把所以的字符统一进行转换,比如中文文章中我们期望将所有出现的字母和数字全部转化成半角,而常见标点符号统一使用全角,上面的转化就不适合了。
解决方案,是自定义词典。
#!/usr/bin/envpython #-*-coding:utf-8-*- FH_SPACE=FHS=((u"",u""),) FH_NUM=FHN=( (u"0",u"0"),(u"1",u"1"),(u"2",u"2"),(u"3",u"3"),(u"4",u"4"), (u"5",u"5"),(u"6",u"6"),(u"7",u"7"),(u"8",u"8"),(u"9",u"9"), ) FH_ALPHA=FHA=( (u"a",u"a"),(u"b",u"b"),(u"c",u"c"),(u"d",u"d"),(u"e",u"e"), (u"f",u"f"),(u"g",u"g"),(u"h",u"h"),(u"i",u"i"),(u"j",u"j"), (u"k",u"k"),(u"l",u"l"),(u"m",u"m"),(u"n",u"n"),(u"o",u"o"), (u"p",u"p"),(u"q",u"q"),(u"r",u"r"),(u"s",u"s"),(u"t",u"t"), (u"u",u"u"),(u"v",u"v"),(u"w",u"w"),(u"x",u"x"),(u"y",u"y"),(u"z",u"z"), (u"A",u"A"),(u"B",u"B"),(u"C",u"C"),(u"D",u"D"),(u"E",u"E"), (u"F",u"F"),(u"G",u"G"),(u"H",u"H"),(u"I",u"I"),(u"J",u"J"), (u"K",u"K"),(u"L",u"L"),(u"M",u"M"),(u"N",u"N"),(u"O",u"O"), (u"P",u"P"),(u"Q",u"Q"),(u"R",u"R"),(u"S",u"S"),(u"T",u"T"), (u"U",u"U"),(u"V",u"V"),(u"W",u"W"),(u"X",u"X"),(u"Y",u"Y"),(u"Z",u"Z"), ) FH_PUNCTUATION=FHP=( (u".",u"."),(u",",u","),(u"!",u"!"),(u"?",u"?"),(u"”",u'"'), (u"'",u"'"),(u"‘",u"`"),(u"@",u"@"),(u"_",u"_"),(u":",u":"), (u";",u";"),(u"#",u"#"),(u"$",u"$"),(u"%",u"%"),(u"&",u"&"), (u"(",u"("),(u")",u")"),(u"‐",u"-"),(u"=",u"="),(u"*",u"*"), (u"+",u"+"),(u"-",u"-"),(u"/",u"/"),(u"<",u"<"),(u">",u">"), (u"[",u"["),(u"¥",u"\\"),(u"]",u"]"),(u"^",u"^"),(u"{",u"{"), (u"|",u"|"),(u"}",u"}"),(u"~",u"~"), ) FH_ASCII=HAC=lambda:((fr,to)formin(FH_ALPHA,FH_NUM,FH_PUNCTUATION)forfr,toinm) HF_SPACE=HFS=((u"",u""),) HF_NUM=HFN=lambda:((h,z)forz,hinFH_NUM) HF_ALPHA=HFA=lambda:((h,z)forz,hinFH_ALPHA) HF_PUNCTUATION=HFP=lambda:((h,z)forz,hinFH_PUNCTUATION) HF_ASCII=ZAC=lambda:((h,z)forz,hinFH_ASCII()) defconvert(text,*maps,**ops): """全角/半角转换 args: text:unicodestringneedtoconvert maps:conversionmaps skip:skipoutofcharacter.Inatupleorstring return:convertedunicodestring """ if"skip"inops: skip=ops["skip"] ifisinstance(skip,basestring): skip=tuple(skip) defreplace(text,fr,to): returntextiffrinskipelsetext.replace(fr,to) else: defreplace(text,fr,to): returntext.replace(fr,to) forminmaps: ifcallable(m): m=m() elifisinstance(m,dict): m=m.items() forfr,toinm: text=replace(text,fr,to) returntext if__name__=='__main__': text=u"成田空港—【JR特急成田エクスプレス号・横浜行,2站】—東京—【JR新幹線はやぶさ号・新青森行,6站】—新青森—【JR特急スーパー白鳥号・函館行,4站】—函館" printconvert(text,FH_ASCII,{u"【":u"[",u"】":u"]",u",":u",",u".":u"。",u"?":u"?",u"!":u"!"},spit=",。?!“”")
特别注意:引号在英语体系中引号是不区分前引号和后引号。
总结
以上就是关于Python实现全角半角字符互转的方法,希望本文的内容对大家的学习或者工作能带来一定的帮助,如果有疑问大家可以留言交流。