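Introduction
Anyone who processes text programmatically has run into inconsistent use of full-width and half-width characters, and needs a quick way to convert between the two. Because full-width and half-width characters map onto each other in a regular way, the conversion is not complicated.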
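The rules are:
Full-width characters occupy Unicode code points 65281 to 65374 (hex 0xFF01 to 0xFF5E)
Half-width characters occupy Unicode code points 33 to 126 (hex 0x21 to 0x7E)
The space is a special case: the full-width space is 12288 (0x3000) and the half-width space is 32 (0x20)
Apart from the space, full-width and half-width characters correspond one-to-one in code point order (half-width + 65248 = full-width, where 65248 is 0xFEE0)
So non-space characters can be converted simply by adding or subtracting this offset, while the space is handled separately.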
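The offset rule can be checked with a few lines of arithmetic. This is a minimal sketch in Python 3 (an assumption; the article's own code targets Python 2), and the names `HALF_START`, `HALF_END`, `OFFSET`, and `pairs` are illustrative only.

```python
# Python 3 sketch; HALF_START, HALF_END and OFFSET are illustrative names.
HALF_START, HALF_END = 0x21, 0x7E   # half-width printable ASCII range
OFFSET = 0xFF01 - HALF_START        # distance to the full-width block

# Pair each half-width code point with its full-width counterpart.
pairs = [(code, code + OFFSET) for code in range(HALF_START, HALF_END + 1)]

print(OFFSET, hex(OFFSET))   # the offset in decimal and hex
print(len(pairs))            # how many characters follow the rule
print(pairs[0], pairs[-1])   # first and last (half, full) pair
```

The space is deliberately left out of `pairs`, since 0x3000 - 0x20 does not equal the offset.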
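Functions used
chr()
In Python 2, takes an integer in range(256) (that is, 0 to 255) and returns the corresponding character.
unichr()
Works the same way, except that it returns a Unicode character.
ord()
The inverse of chr() and unichr(): it takes a single character (a string of length 1) and returns its ASCII value or Unicode code point.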
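For readers on Python 3, where `unichr()` no longer exists and `chr()` accepts any Unicode code point, the same pairing of functions can be sketched as follows. The variable names are illustrative, not from the original article.

```python
# Python 3 sketch: chr() and ord() are inverses over all of Unicode.
half = "A"
full = chr(ord(half) + 0xFEE0)   # shift into the full-width block

print(ord(half), ord(full))      # code points of the two forms
print(chr(ord(full)) == full)    # round trip through ord() and chr()

# The space does not follow the offset rule, so it needs its own case.
fullwidth_space = chr(0x3000)
print(ord(fullwidth_space) - ord(" "))
```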
First, print the mapping (Python 2):

    for i in xrange(33, 127):
        print i, chr(i), i + 65248, unichr(i + 65248)
Output:

    33 ! 65281 !
    34 " 65282 "
    35 # 65283 #
    36 $ 65284 $
    37 % 65285 %
    38 & 65286 &
    39 ' 65287 '
    40 ( 65288 (
    41 ) 65289 )
    42 * 65290 *
    43 + 65291 +
    44 , 65292 ,
    45 - 65293 -
    46 . 65294 .
    47 / 65295 /
    48 0 65296 0
    49 1 65297 1
    50 2 65298 2
    51 3 65299 3
    52 4 65300 4
    53 5 65301 5
    54 6 65302 6
    55 7 65303 7
    56 8 65304 8
    57 9 65305 9
    58 : 65306 :
    59 ; 65307 ;
    60 < 65308 <
    61 = 65309 =
    62 > 65310 >
    63 ? 65311 ?
    64 @ 65312 @
    65 A 65313 A
    66 B 65314 B
    67 C 65315 C
    68 D 65316 D
    69 E 65317 E
    70 F 65318 F
    71 G 65319 G
    72 H 65320 H
    73 I 65321 I
    74 J 65322 J
    75 K 65323 K
    76 L 65324 L
    77 M 65325 M
    78 N 65326 N
    79 O 65327 O
    80 P 65328 P
    81 Q 65329 Q
    82 R 65330 R
    83 S 65331 S
    84 T 65332 T
    85 U 65333 U
    86 V 65334 V
    87 W 65335 W
    88 X 65336 X
    89 Y 65337 Y
    90 Z 65338 Z
    91 [ 65339 [
    92 \ 65340 \
    93 ] 65341 ]
    94 ^ 65342 ^
    95 _ 65343 _
    96 ` 65344 `
    97 a 65345 a
    98 b 65346 b
    99 c 65347 c
    100 d 65348 d
    101 e 65349 e
    102 f 65350 f
    103 g 65351 g
    104 h 65352 h
    105 i 65353 i
    106 j 65354 j
    107 k 65355 k
    108 l 65356 l
    109 m 65357 m
    110 n 65358 n
    111 o 65359 o
    112 p 65360 p
    113 q 65361 q
    114 r 65362 r
    115 s 65363 s
    116 t 65364 t
    117 u 65365 u
    118 v 65366 v
    119 w 65367 w
    120 x 65368 x
    121 y 65369 y
    122 z 65370 z
    123 { 65371 {
    124 | 65372 |
    125 } 65373 }
    126 ~ 65374 ~
Converting full-width to half-width:

    def full2half(s):
        n = []
        s = s.decode('utf-8')  # Python 2: decode the UTF-8 byte string to unicode
        for char in s:
            num = ord(char)
            if num == 0x3000:                # full-width space
                num = 32
            elif 0xFF01 <= num <= 0xFF5E:    # other full-width characters
                num -= 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)
Converting half-width to full-width:

    def half2full(s):
        n = []
        s = s.decode('utf-8')  # Python 2: decode the UTF-8 byte string to unicode
        for char in s:
            num = ord(char)
            if num == 32:                    # half-width space
                num = 0x3000
            elif 0x21 <= num <= 0x7E:        # other half-width characters
                num += 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)
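As a rough Python 3 counterpart to the two functions above, the same rules can be expressed as translation tables for `str.translate`. This is a sketch under the assumption that the input is already a `str`; the table names `FULL_TO_HALF` and `HALF_TO_FULL` are made up for illustration.

```python
# Python 3 sketch: build both directions once, then reuse them.
OFFSET = 0xFEE0

# Full-width code point -> half-width code point, plus the special-case space.
FULL_TO_HALF = {code: code - OFFSET for code in range(0xFF01, 0xFF5E + 1)}
FULL_TO_HALF[0x3000] = 0x20

# Invert the table for the opposite direction.
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def full2half(text):
    """Return text with full-width characters replaced by half-width ones."""
    return text.translate(FULL_TO_HALF)


def half2full(text):
    """Return text with half-width characters replaced by full-width ones."""
    return text.translate(HALF_TO_FULL)


sample = "\uff21\uff22\uff23\u3000\uff11\uff12\uff13"  # full-width "ABC 123"
print(full2half(sample))
print(half2full(full2half(sample)) == sample)
```

Building the tables up front avoids the per-character branching in the loop version, and characters outside the tables pass through unchanged.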
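The implementations above are very simple, but in practice you may not want to convert every character uniformly. In a Chinese article, for example, we may want all letters and digits converted to half-width while common punctuation stays full-width, and the functions above are not suited to that.
The solution is a custom mapping dictionary.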

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    # Each FH_* table holds (full-width, half-width) pairs.
    FH_SPACE = FHS = ((u"\u3000", u" "),)

    FH_NUM = FHN = (
        (u"0", u"0"), (u"1", u"1"), (u"2", u"2"), (u"3", u"3"), (u"4", u"4"),
        (u"5", u"5"), (u"6", u"6"), (u"7", u"7"), (u"8", u"8"), (u"9", u"9"),
    )

    FH_ALPHA = FHA = (
        (u"a", u"a"), (u"b", u"b"), (u"c", u"c"), (u"d", u"d"), (u"e", u"e"),
        (u"f", u"f"), (u"g", u"g"), (u"h", u"h"), (u"i", u"i"), (u"j", u"j"),
        (u"k", u"k"), (u"l", u"l"), (u"m", u"m"), (u"n", u"n"), (u"o", u"o"),
        (u"p", u"p"), (u"q", u"q"), (u"r", u"r"), (u"s", u"s"), (u"t", u"t"),
        (u"u", u"u"), (u"v", u"v"), (u"w", u"w"), (u"x", u"x"), (u"y", u"y"),
        (u"z", u"z"),
        (u"A", u"A"), (u"B", u"B"), (u"C", u"C"), (u"D", u"D"), (u"E", u"E"),
        (u"F", u"F"), (u"G", u"G"), (u"H", u"H"), (u"I", u"I"), (u"J", u"J"),
        (u"K", u"K"), (u"L", u"L"), (u"M", u"M"), (u"N", u"N"), (u"O", u"O"),
        (u"P", u"P"), (u"Q", u"Q"), (u"R", u"R"), (u"S", u"S"), (u"T", u"T"),
        (u"U", u"U"), (u"V", u"V"), (u"W", u"W"), (u"X", u"X"), (u"Y", u"Y"),
        (u"Z", u"Z"),
    )

    FH_PUNCTUATION = FHP = (
        (u".", u"."), (u",", u","), (u"!", u"!"), (u"?", u"?"), (u"”", u'"'),
        (u"’", u"'"), (u"‘", u"`"), (u"@", u"@"), (u"_", u"_"), (u":", u":"),
        (u";", u";"), (u"#", u"#"), (u"$", u"$"), (u"%", u"%"), (u"&", u"&"),
        (u"(", u"("), (u")", u")"), (u"‐", u"-"), (u"=", u"="), (u"*", u"*"),
        (u"+", u"+"), (u"-", u"-"), (u"/", u"/"), (u"<", u"<"), (u">", u">"),
        (u"[", u"["), (u"¥", u"\\"), (u"]", u"]"), (u"^", u"^"), (u"{", u"{"),
        (u"|", u"|"), (u"}", u"}"), (u"~", u"~"),
    )

    # Lazily chain the letter, digit, and punctuation tables.
    FH_ASCII = HAC = lambda: ((fr, to) for m in (FH_ALPHA, FH_NUM, FH_PUNCTUATION) for fr, to in m)

    # The HF_* tables are the FH_* tables reversed: (half-width, full-width).
    HF_SPACE = HFS = ((u" ", u"\u3000"),)
    HF_NUM = HFN = lambda: ((h, z) for z, h in FH_NUM)
    HF_ALPHA = HFA = lambda: ((h, z) for z, h in FH_ALPHA)
    HF_PUNCTUATION = HFP = lambda: ((h, z) for z, h in FH_PUNCTUATION)
    HF_ASCII = ZAC = lambda: ((h, z) for z, h in FH_ASCII())


    def convert(text, *maps, **ops):
        """ Convert between full-width and half-width characters.
        args:
            text: unicode string need to convert
            maps: conversion maps
            skip: skip out of character. In a tuple or string
        return: converted unicode string
        """
        if "skip" in ops:
            skip = ops["skip"]
            if isinstance(skip, basestring):
                skip = tuple(skip)

            def replace(text, fr, to):
                # Leave characters listed in skip untouched.
                return text if fr in skip else text.replace(fr, to)
        else:
            def replace(text, fr, to):
                return text.replace(fr, to)

        for m in maps:
            if callable(m):
                m = m()
            elif isinstance(m, dict):
                m = m.items()
            for fr, to in m:
                text = replace(text, fr, to)
        return text


    if __name__ == '__main__':
        text = u"成田空港—【JR特急成田エクスプレス号・横浜行,2站】—東京—【JR新幹線はやぶさ号・新青森行,6站 】—新青森—【JR特急スーパー白鳥号・函館行,4站 】—函館"
        print convert(text, FH_ASCII,
                      {u"【": u"[", u"】": u"]", u",": u",", u".": u"。", u"?": u"?", u"!": u"!"},
                      skip=u",。?!“”")
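Note in particular: quotation marks in English do not distinguish between opening and closing quotes, so a half-width quote cannot be mapped back to the correct full-width quote without context.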
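The selective idea (letters and digits to half-width, punctuation left alone) can also be sketched compactly in Python 3 using code point ranges instead of hand-written pairs. This is an illustrative alternative, not the author's method; `ALNUM_RANGES`, `ALNUM_TABLE`, and `normalize_alnum` are hypothetical names.

```python
# Python 3 sketch: convert only full-width digits and Latin letters.
OFFSET = 0xFEE0

# Inclusive full-width code point ranges: digits, uppercase, lowercase.
ALNUM_RANGES = ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A))

ALNUM_TABLE = {
    code: code - OFFSET
    for low, high in ALNUM_RANGES
    for code in range(low, high + 1)
}


def normalize_alnum(text):
    """Make letters and digits half-width; leave punctuation as it is."""
    return text.translate(ALNUM_TABLE)


# Full-width "JR", full-width comma, full-width "2", then a CJK character.
sample = "\uff2a\uff32\uff0c\uff12站"
print(normalize_alnum(sample))
```

Because punctuation is never in the table, no `skip` list is needed here, at the cost of less flexibility than the pair-based tables.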
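Summary
That covers how to convert between full-width and half-width characters in Python. I hope this article is of some help in your study or work; if you have questions, feel free to leave a comment.
Original link: http://www.biaodianfu.com/python-convert-between-unicode-fullwidth-halfwidth-characters.html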