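A Brief Look at the Pitfalls of Emoji
Preface
What is emoji? Emoji (Japanese: 絵文字/えもじ, emoji) are pictographic symbols that originated in Japanese mobile messaging. 絵 (e) means "picture" and 文字 (moji) means "character"; together they can stand for all sorts of expressions and objects, such as a smiley for laughter or a cake for food. After Apple added emoji to the iOS 5 keyboard, these symbols swept the world. Emoji have since been adopted into Unicode, which most modern computer systems support, and they are now used everywhere in text messages and on social networks.
So whether or not you have heard of emoji, the moment your program has to handle them, remember one thing: Unicode!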
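The bumps and potholes of emoji
Handling emoji in WeChat
As we all know, when integrating with a WeChat official account, the text exchanged with users is more than plain words; it may also contain all kinds of emoticon characters, such as emoji.
The WeChat API passes emoji as raw UTF-8 byte strings without decoding them. As a result, an emoji sent by a WeChat user shows up as a box or an undisplayable character, and it has to be transcoded. Likewise, when sending a text message containing emoji to the WeChat server, the emoji characters must be encoded in this same format. (WeChat used to let you send the Unicode escape directly and have the emoji displayed, but that is no longer supported. A pitfall.)
First, parsing a received message: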
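// json_encode turns every non-ASCII character into a \uXXXX escape
$tmpStr = json_encode($data['Content']);
// Double the backslash of \ueXXX emoji escapes so they survive json_decode as literal text (a callback replaces the /e modifier, which was removed in PHP 7)
$tmpStr = preg_replace_callback('#\\\\ue[0-9a-f]{3}#i', function ($m) { return addslashes($m[0]); }, $tmpStr);
$text = json_decode($tmpStr);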
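(The JSON encoding here is done precisely to obtain the characters' Unicode escapes, so do not pass the JSON_UNESCAPED_UNICODE option to json_encode.)
For example, "你好 [emoji] hello 123" is encoded as "\u4f60\u597d \ue415 hello 123".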
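To make the effect of that JSON step concrete, here is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" string syntax). The sample message is made up for illustration.

```php
<?php
// Sample message: two Chinese characters, a private-use emoji code (U+E415) and ASCII text.
$content = "\u{4F60}\u{597D} \u{E415} hello 123";

// Default behaviour: every non-ASCII character becomes a \uXXXX escape.
$escaped = json_encode($content);
echo $escaped, "\n"; // "\u4f60\u597d \ue415 hello 123"

// With JSON_UNESCAPED_UNICODE the characters stay as raw UTF-8,
// so there would be no \uXXXX escapes left for the regular expression to match.
$raw = json_encode($content, JSON_UNESCAPED_UNICODE);
var_dump(strpos($raw, '\u') === false); // bool(true)
```

The quotes around the first output are part of the JSON string that json_encode returns.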
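After that the message can be stored; when it is read back for display on a page, you can do the character replacement and template rendering.
Next is the sending part, which is even simpler:
Use a regular expression to pick out the emoji Unicode escapes in the text, pack each one into binary, convert it to UTF-8, and put it back into the original text. (Do this as the very last step before sending: prepare the complete text message first, then transcode it.) The code is as follows: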
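$text = '你好 \ue415 hello 123'; // outgoing WeChat message; the emoji is still a literal \uXXXX escape and must become UTF-8 bytes
// Pack the four hex digits into two big-endian bytes and convert them to UTF-8 (a callback replaces the /e modifier, which was removed in PHP 7)
$text = preg_replace_callback('#\\\\u([0-9a-f]{4})#i', function ($m) {
    return iconv('UCS-2BE', 'UTF-8', pack('H4', $m[1]));
}, $text);
echo $text; // 你好 [emoji] hello 123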
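Emoji encoding breaks generated Excel files (PHPExcel)
Solution: use php-emoji (see the appendix). When generating the Excel file, pass strings through its emoji_unified_to_softbank($str), which gets rid of the unrecognized code sequences that come from iPhone emoji.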
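If pulling in php-emoji is not an option, a cruder alternative is to drop the offending characters before writing the cell. The following is only a sketch of that different technique, not what php-emoji does: the helper name strip_supplementary_chars is hypothetical, and it assumes that simply discarding every character above U+FFFF is acceptable for your spreadsheet.

```php
<?php
// Hypothetical helper: remove every character outside the Basic Multilingual Plane
// (U+10000 and above, i.e. 4-byte UTF-8 sequences), which covers most modern emoji.
function strip_supplementary_chars(string $str): string
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
    // preg_replace returns null when the subject is not valid UTF-8; keep the input in that case.
    return $clean === null ? $str : $clean;
}

$cell = "Order shipped \u{1F604} thanks";
echo strip_supplementary_chars($cell), "\n";
```

Emoji that live inside the BMP, such as \u263a, are left alone by this pattern.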
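A few side notes
With the WeChat receiving code above, you may find that the regular expression sometimes does not match (⊙﹏⊙). That is because the incoming encoding differs between the 4.0 and 5.0 versions:
The biggest difference is that in the 4.0 encoding one emoji is a single character (one \uXXXX unit), whereas in the 5.0 encoding one emoji is usually two (a UTF-16 surrogate pair such as \ud83d\ude04).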
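The difference is easy to see from json_encode, and it suggests a pair-aware variant of the sending-side replacement. This is a sketch under assumptions rather than a drop-in fix: the helper name unescape_utf16 is hypothetical, PHP 7 or later is assumed, and the iconv extension must be available.

```php
<?php
// One UTF-16 unit versus a surrogate pair.
echo json_encode("\u{E415}"), "\n";  // "\ue415"
echo json_encode("\u{1F604}"), "\n"; // "\ud83d\ude04"

// Hypothetical helper: replace \uXXXX escapes with UTF-8, treating a high+low
// surrogate pair as one character instead of two separate units.
function unescape_utf16(string $text): string
{
    // First alternative: high surrogate followed by low surrogate. Second: any single unit.
    $pattern = '#\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})#i';
    return preg_replace_callback($pattern, function ($m) {
        // A pair yields 4 bytes of UTF-16BE, a single unit yields 2 bytes.
        $hex = $m[1] !== '' ? $m[1] . $m[2] : $m[3];
        $utf8 = @iconv('UTF-16BE', 'UTF-8', pack('H*', $hex));
        // A lone surrogate cannot be converted; leave the escape as it was.
        return $utf8 === false ? $m[0] : $utf8;
    }, $text);
}

var_dump(unescape_utf16('\ud83d\ude04') === "\u{1F604}"); // bool(true)
```

Matching the pair before the single unit is what keeps the two halves of a surrogate pair from being converted separately.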
Here are the emoji codes from 5.0 onward, for reference:
\ud83d\ude04
\ud83d\ude0a
\ud83d\ude03
\u263a
\ud83d\ude09
\ud83d\ude0d
\ud83d\ude18
\ud83d\ude1a
\ud83d\ude33
\ud83d\ude0c
\ud83d\ude01
\ud83d\ude1c
\ud83d\ude1d
\ud83d\ude12
\ud83d\ude0f
\ud83d\ude13
\ud83d\ude14
\ud83d\ude1e
\ud83d\ude16
\ud83d\ude25
\ud83d\ude30
\ud83d\ude28
\ud83d\ude23
\ud83d\ude22
\ud83d\ude2d
\ud83d\ude02
\ud83d\ude32
\ud83d\ude31
\ud83d\ude20
\ud83d\ude21
\ud83d\ude2a
\ud83d\ude37
\ud83d\udc7f
\ud83d\udc7d
\ud83d\udc9b
\ud83d\udc99
\ud83d\udc9c
\ud83d\udc97
\ud83d\udc9a
\u2764
\ud83d\udc94
\ud83d\udc93
\ud83d\udc98
\u2728
\ud83c\udf1f
\ud83d\udca2
\u2755
\u2754
\ud83d\udca4
\ud83d\udca8
\ud83d\udca6
\ud83c\udfb6
\ud83c\udfb5
\ud83d\udd25
\ud83d\udca9
\ud83d\udc4d
\ud83d\udc4e
\ud83d\udc4c
\ud83d\udc4a
\u270a
\u270c
\ud83d\udc4b
\u270b
\ud83d\udc50
\ud83d\udc46
\ud83d\udc47
\ud83d\udc49
\ud83d\udc48
\ud83d\ude4c
\ud83d\ude4f
\u261d
\ud83d\udc4f
\ud83d\udcaa
\ud83d\udeb6
\ud83c\udfc3
\ud83d\udc6b
\ud83d\udc83
\ud83d\udc6f
\ud83d\ude46
\ud83d\ude45
\ud83d\udc81
\ud83d\ude47
\ud83d\udc8f
\ud83d\udc91
\ud83d\udc86
\ud83d\udc87
\ud83d\udc85
\ud83d\udc66
\ud83d\udc67
\ud83d\udc69
\ud83d\udc68
\ud83d\udc76
\ud83d\udc75
\ud83d\udc74
\ud83d\udc71
\ud83d\udc72
\ud83d\udc73
\ud83d\udc77
\ud83d\udc6e
\ud83d\udc7c
\ud83d\udc78
\ud83d\udc82
\ud83d\udc80
\ud83d\udc63
\ud83d\udc8b
\ud83d\udc44
\ud83d\udc42
\ud83d\udc40
\ud83d\udc43
\u2600
\u2614
\u2601
\u26c4
\ud83c\udf19
\u26a1
\ud83c\udf00
\ud83c\udf0a
\ud83d\udc31
\ud83d\udc36
\ud83d\udc2d
\ud83d\udc39
\ud83d\udc30
\ud83d\udc3a
\ud83d\udc38
\ud83d\udc2f
\ud83d\udc28
\ud83d\udc3b
\ud83d\udc37
\ud83d\udc2e
\ud83d\udc17
\ud83d\udc35
\ud83d\udc12
\ud83d\udc34
\ud83d\udc0e
\ud83d\udc2b
\ud83d\udc11
\ud83d\udc18
\ud83d\udc0d
\ud83d\udc26
\ud83d\udc24
\ud83d\udc14
\ud83d\udc27
\ud83d\udc1b
\ud83d\udc19
\ud83d\udc20
\ud83d\udc1f
\ud83d\udc33
\ud83d\udc2c
\ud83d\udc90
\ud83c\udf38
\ud83c\udf37
\ud83c\udf40
\ud83c\udf39
\ud83c\udf3b
\ud83c\udf3a
\ud83c\udf41
\ud83c\udf43
\ud83c\udf42
\ud83c\udf34
\ud83c\udf35
\ud83c\udf3e
\ud83d\udc1a
\ud83c\udf8d
\ud83d\udc9d
\ud83c\udf8e
\ud83c\udf92
\ud83c\udf93
\ud83c\udf8f
\ud83c\udf86
\ud83c\udf87
\ud83c\udf90
\ud83c\udf91
\ud83c\udf83
\ud83d\udc7b
\ud83c\udf85
\ud83c\udf84
\ud83c\udf81
\ud83d\udd14
\ud83c\udf89
\ud83c\udf88
\ud83d\udcbf
\ud83d\udcc0
\ud83d\udcf7
\ud83c\udfa5
\ud83d\udcbb
\ud83d\udcfa
\ud83d\udcf1
\ud83d\udce0
\u260e
\ud83d\udcbd
\ud83d\udcfc
\ud83d\udd0a
\ud83d\udce2
\ud83d\udce3
\ud83d\udcfb
\ud83d\udce1
\u27bf
\ud83d\udd0d
\ud83d\udd13
\ud83d\udd12
\ud83d\udd11
\u2702
\ud83d\udd28
\ud83d\udca1
\ud83d\udcf2
\ud83d\udce9
\ud83d\udceb
\ud83d\udcee
\ud83d\udec0
\ud83d\udebd
\ud83d\udcba
\ud83d\udcb0
\ud83d\udd31
\ud83d\udeac
\ud83d\udca3
\ud83d\udd2b
\ud83d\udc8a
\ud83d\udc89
\ud83c\udfc8
\ud83c\udfc0
\u26bd
\u26be
\ud83c\udfbe
\u26f3
\ud83c\udfb1
\ud83c\udfca
\ud83c\udfc4
\ud83c\udfbf
\u2660
\u2665
\u2663
\u2666
\ud83c\udfc6
\ud83d\udc7e
\ud83c\udfaf
\ud83c\udc04
\ud83c\udfac
\ud83d\udcdd
\ud83d\udcd6
\ud83c\udfa8
\ud83c\udfa4
\ud83c\udfa7
\ud83c\udfba
\ud83c\udfb7
\ud83c\udfb8
\u303d
\ud83d\udc5f
\ud83d\udc61
\ud83d\udc60
\ud83d\udc62
\ud83d\udc55
\ud83d\udc54
\ud83d\udc57
\ud83d\udc58
\ud83d\udc59
\ud83c\udf80
\ud83c\udfa9
\ud83d\udc51
\ud83d\udc52
\ud83c\udf02
\ud83d\udcbc
\ud83d\udc5c
\ud83d\udc84
\ud83d\udc8d
\ud83d\udc8e
\u2615
\ud83c\udf75
\ud83c\udf7a
\ud83c\udf7b
\ud83c\udf78
\ud83c\udf76
\ud83c\udf74
\ud83c\udf54
\ud83c\udf5f
\ud83c\udf5d
\ud83c\udf5b
\ud83c\udf71
\ud83c\udf63
\ud83c\udf59
\ud83c\udf58
\ud83c\udf5a
\ud83c\udf5c
\ud83c\udf72
\ud83c\udf5e
\ud83c\udf73
\ud83c\udf62
\ud83c\udf61
\ud83c\udf66
\ud83c\udf67
\ud83c\udf82
\ud83c\udf70
\ud83c\udf4e
\ud83c\udf4a
\ud83c\udf49
\ud83c\udf53
\ud83c\udf46
\ud83c\udf45
\ud83c\udfe0
\ud83c\udfeb
\ud83c\udfe2
\ud83c\udfe3
\ud83c\udfe5
\ud83c\udfe6
\ud83c\udfea
\ud83c\udfe9
\ud83c\udfe8
\ud83d\udc92
\u26ea
\ud83c\udfec
\ud83c\udf07
\ud83c\udf06
\ud83c\udfe7
\ud83c\udfef
\ud83c\udff0
\u26fa
\ud83c\udfed
\ud83d\uddfc
\ud83d\uddfb
\ud83c\udf04
\ud83c\udf05
\ud83c\udf03
\ud83d\uddfd
\ud83c\udf08
\ud83c\udfa1
\u26f2
\ud83c\udfa2
\ud83d\udea2
\ud83d\udea4
\u26f5
\u2708
\ud83d\ude80
\ud83d\udeb2
\ud83d\ude99
\ud83d\ude97
\ud83d\ude95
\ud83d\ude8c
\ud83d\ude93
\ud83d\ude92
\ud83d\ude91
\ud83d\ude9a
\ud83d\ude83
\ud83d\ude89
\ud83d\ude84
\ud83d\ude85
\ud83c\udfab
\u26fd
\ud83d\udea5
\u26a0
\ud83d\udea7
\ud83d\udd30
\ud83c\udfb0
\ud83d\ude8f
\ud83d\udc88
\u2668
\ud83c\udfc1
\ud83c\udf8c
\ud83c\uddef
\ud83c\uddf5
\ud83c\uddf0
\ud83c\uddf7
\ud83c\udde8
\ud83c\uddf3
\ud83c\uddfa
\ud83c\uddf8
\ud83c\uddeb
\ud83c\uddf7
\ud83c\uddea
\ud83c\uddf8
\ud83c\uddee
\ud83c\uddf9
\ud83c\uddf7
\ud83c\uddfa
\ud83c\uddec
\ud83c\udde7
\ud83c\udde9
\ud83c\uddea
1\u20e3
2\u20e3
3\u20e3
4\u20e3
5\u20e3
6\u20e3
7\u20e3
8\u20e3
9\u20e3
0\u20e3
#\u20e3
\u2b06
\u2b07
\u2b05
\u27a1
\u2197
\u2196
\u2198
\u2199
\u25c0
\u25b6
\u23ea
\u23e9
\ud83c\udd97
\ud83c\udd95
\ud83d\udd1d
\ud83c\udd99
\ud83c\udd92
\ud83c\udfa6
\ud83c\ude01
\ud83d\udcf6
\ud83c\ude35
\ud83c\ude33
\ud83c\ude50
\ud83c\ude39
\ud83c\ude2f
\ud83c\ude3a
\ud83c\ude36
\ud83c\ude1a
\ud83c\ude37
\ud83c\ude38
\ud83c\ude02
\ud83d\udebb
\ud83d\udeb9
\ud83d\udeba
\ud83d\udebc
\ud83d\udead
\ud83c\udd7f
\u267f
\ud83d\ude87
\ud83d\udebe
\u3299
\u3297
\ud83d\udd1e
\ud83c\udd94
\u2733
\u2734
\ud83d\udc9f
\ud83c\udd9a
\ud83d\udcf3
\ud83d\udcf4
\ud83d\udcb9
\ud83d\udcb1
\u2648
\u2649
\u264a
\u264b
\u264c
\u264d
\u264e
\u264f
\u2650
\u2651
\u2652
\u2653
\u26ce
\ud83d\udd2f
\ud83c\udd70
\ud83c\udd71
\ud83c\udd8e
\ud83c\udd7e
\ud83d\udd32
\ud83d\udd34
\ud83d\udd33
\ud83d\udd5b
\ud83d\udd50
\ud83d\udd51
\ud83d\udd52
\ud83d\udd53
\ud83d\udd54
\ud83d\udd55
\ud83d\udd56
\ud83d\udd57
\ud83d\udd58
\ud83d\udd59
\ud83d\udd5a
\u2b55
\u274c
\u00a9
\u00ae
\u2122
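These codes are not very usable as they stand: not only does Xcode report errors on them, but what we actually need is the Unicode code point!
Yet you will find that the emoji produced by Sogou Pinyin and some other input methods all arrive in the form above (what a pit (ToT)/~~~).
So we need to convert them into standard emoji code points, after which we can convert to any encoding we like O(∩_∩)O~
An example:
Pick the smiley in the Sogou input method and submit it to the back end; json_encode gives \ud83d\ude04.
The key is a formula that turns the pair 0xd83d 0xde04 into 1f604, and U+1F604 is the Unified code for the smiley.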
$h = 0xd83d; // high surrogate
$l = 0xde04; // low surrogate
$code = ($h - 0xD800) * 0x400 + 0x10000 + $l - 0xDC00; // conversion formula
echo "U+" . strtoupper(dechex($code));
// prints U+1F604
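The same formula can be applied to a whole escaped string, such as an entry from the list above. The sketch below is one way to do it; the function name escapes_to_code_points is hypothetical, and the input is assumed to be text containing json_encode-style \uXXXX escapes.

```php
<?php
// Hypothetical helper: list the code points found in a string of \uXXXX escapes,
// combining each surrogate pair with the formula shown above.
function escapes_to_code_points(string $escaped): array
{
    // First alternative: high surrogate followed by low surrogate. Second: any single unit.
    $pattern = '#\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})#i';
    preg_match_all($pattern, $escaped, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        if ($m[1] !== '') {
            // Surrogate pair: apply the conversion formula.
            $code = (hexdec($m[1]) - 0xD800) * 0x400 + 0x10000 + (hexdec($m[2]) - 0xDC00);
        } else {
            // Single unit: the escape already is the code point.
            $code = hexdec($m[3]);
        }
        $result[] = 'U+' . strtoupper(dechex($code));
    }
    return $result;
}

// Smiley, white smiling face, and the two regional indicators of the Japanese flag.
print_r(escapes_to_code_points('\ud83d\ude04\u263a\ud83c\uddef\ud83c\uddf5'));
```

For that input the list is U+1F604, U+263A, U+1F1EF, U+1F1F5.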
Once we have the unified code, we can convert it to the other encoding formats.
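As an illustration of going from a unified code point to other encodings, here is a small sketch. The helper name code_point_to_utf8 is hypothetical, and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical helper: encode a Unicode code point as a UTF-8 string.
function code_point_to_utf8(int $codePoint): string
{
    // pack('N') writes the value as 4 big-endian bytes, which is the UTF-32BE form of the code point.
    return iconv('UTF-32BE', 'UTF-8', pack('N', $codePoint));
}

$smile = code_point_to_utf8(0x1F604);
echo bin2hex($smile), "\n";     // f09f9884 (the UTF-8 bytes)
echo json_encode($smile), "\n"; // "\ud83d\ude04" (back to the UTF-16 escapes)
```

Going through UTF-32BE avoids writing the surrogate arithmetic by hand in this direction.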
Appendix
- php-emoji: https://github.com/iamcal/php-emoji
- http://cn.v2ex.com/t/228273
- Complete list of emoji codes (web version)
- Emoji Unicode code table: https://github.com/mc-zone/emoji-code