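Convert a String of Unicode Scalar Values to a Unicode String - C#

This article presents C# code that converts text containing Unicode scalar values written in "U+xxxx" notation into the characters they represent.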

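What is a Unicode scalar value?

Unicode assigns every character a numeric value. To write that value using only ASCII characters, the usual notation is "U+" followed by the value in hexadecimal, for example U+3042 for "あ". Some notations write "#U" followed by the hexadecimal value instead. Strictly speaking, "Unicode scalar value" means any code point other than the surrogate code points (U+D800 to U+DFFF); this article uses the term for the "U+xxxx" text form.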
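
To make the notation concrete, here is a minimal sketch that converts between a numeric value and its "U+xxxx" text form. The class and method names (ScalarNotationDemo, ToNotation, FromNotation) are made up for illustration and are not part of the program described in this article.

```csharp
using System;
using System.Globalization;

public static class ScalarNotationDemo
{
    // Formats a numeric value as "U+" plus at least four uppercase hex digits.
    public static string ToNotation(int codePoint)
    {
        return "U+" + codePoint.ToString("X4", CultureInfo.InvariantCulture);
    }

    // Parses the "U+xxxx" text form back into its numeric value.
    public static int FromNotation(string notation)
    {
        if (notation == null || !notation.StartsWith("U+", StringComparison.Ordinal))
        {
            throw new FormatException("Expected a value of the form U+xxxx.");
        }
        return int.Parse(notation.Substring(2),
                         NumberStyles.AllowHexSpecifier,
                         CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(ToNotation(0x3042));                      // U+3042
        Console.WriteLine(FromNotation("U+3042"));                  // 12354
        Console.WriteLine(ToNotation(char.ConvertToUtf32("A", 0))); // U+0041
    }
}
```

char.ConvertToUtf32 returns the numeric value of the character at the given index, so the last line goes from a character to its notation.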

Program

Create an ASP.NET Web application.

UI

Create the Web form shown in the figure below, with two TextBox controls and one Button.

The code of the aspx file is as follows.

UnicodeScalarToUnicode.aspx

<%@ Page Language="C#" AutoEventWireup="true" 
  CodeBehind="UnicodeScalarToUnicode.aspx.cs" Inherits="UnicodeConverter.UnicodeScalarToUnicode" %>

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title></title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
      <div><asp:TextBox ID="TextBox1" runat="server" Width="400px"></asp:TextBox> </div>
      <div><asp:Button ID="Button1" runat="server" Text="Convert" OnClick="Button1_Click" /></div>
      <div><asp:TextBox ID="TextBox2" runat="server" Width="400px"></asp:TextBox></div>
    </div>
    </form>
</body>
</html>

Code

Write the following code.

UnicodeScalarToUnicode.aspx.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Text;
using System.Text.RegularExpressions;

namespace UnicodeConverter
{
  public partial class UnicodeScalarToUnicode : System.Web.UI.Page
  {
    protected void Page_Load(object sender, EventArgs e)
    {

    }

    protected void Button1_Click(object sender, EventArgs e)
    {
      string input = TextBox1.Text;
      // "U+" followed by exactly four hex digits, captured as two 2-digit groups.
      string regExpr = "U\\+(?<head>[0-9A-Fa-f]{2})(?<tail>[0-9A-Fa-f]{2})";

      Regex reg = new Regex(regExpr);
      Match match = reg.Match(input);

      // Repeat while a "U+xxxx" pattern remains in the input.
      while (match.Success) {
        // Low byte first: UnicodeEncoding decodes little-endian UTF-16 by default.
        byte[] b = new byte[] {
          Convert.ToByte(match.Groups["tail"].Value, 16), 
          Convert.ToByte(match.Groups["head"].Value, 16)
        };

        UnicodeEncoding unicode = new UnicodeEncoding();
        string unicodeStr = unicode.GetString(b);

        // Replace every occurrence of the matched text, then search again.
        input = input.Replace(match.Value, unicodeStr);
        match = reg.Match(input);
      }

      TextBox2.Text = input;
    }
  }
}

Explanation

The code below finds the "U+xxxx" pattern in the string entered in the text box. The while loop keeps running as long as the pattern is present in the input string.

  string input = TextBox1.Text;
  string regExpr = "U\\+(?<head>[0-9A-Fa-f]{2})(?<tail>[0-9A-Fa-f]{2})";

  Regex reg = new Regex(regExpr);
  Match match = reg.Match(input);

  while (match.Success) {
    // ... convert the matched pattern and replace it in input ...
    match = reg.Match(input);
  }
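
To show what the pattern actually captures, the following sketch lists each match together with its "head" and "tail" groups. It uses Regex.Matches rather than the loop above, and the class and method names (PatternDemo, Describe) are assumptions introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Same pattern as in the article: "U+" followed by exactly four hex digits,
    // captured as two two-digit groups named head and tail.
    private static readonly Regex Pattern =
        new Regex(@"U\+(?<head>[0-9A-Fa-f]{2})(?<tail>[0-9A-Fa-f]{2})");

    // Returns "match:head:tail" for every occurrence found in the input.
    public static List<string> Describe(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(m.Value + ":" + m.Groups["head"].Value + ":" + m.Groups["tail"].Value);
        }
        return result;
    }

    public static void Main()
    {
        foreach (string line in Describe("abc U+3042 U+3044 xyz"))
        {
            Console.WriteLine(line);
        }
        // Prints:
        // U+3042:30:42
        // U+3044:30:44
    }
}
```

Because the pattern accepts exactly four hex digits, a longer value such as "U+1F600" is matched only as "U+1F60", leaving the final "0" behind.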

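The code inside the while loop, shown below, processes each "U+xxxx" pattern found in the input string. Treating the pattern as "U+hhtt", it takes the "hh" part and the "tt" part, converts each from a hexadecimal string to a byte, and stores them in the array b[]. The "tt" byte goes first because UnicodeEncoding decodes little-endian UTF-16 by default.
It then calls the GetString method of the UnicodeEncoding class to convert the byte array b[] to a string, and replaces the matched "U+xxxx" pattern in the input string with the converted string. Repeating this until no "U+xxxx" pattern remains converts the Unicode scalar value notation into the corresponding Unicode characters.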

    // Low byte ("tt") first, then high byte ("hh"): little-endian UTF-16.
    byte[] b = new byte[] {
      Convert.ToByte(match.Groups["tail"].Value, 16), 
      Convert.ToByte(match.Groups["head"].Value, 16)
    };

    UnicodeEncoding unicode = new UnicodeEncoding();
    string unicodeStr = unicode.GetString(b);

    input = input.Replace(match.Value, unicodeStr);
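
The loop above handles only four-digit values, and because it searches the modified string again after each replacement, an input such as "U+0055+0041" is first turned into "U+0041" and then into "A". As an alternative, not the method used in this article, the following sketch does the conversion in a single pass with Regex.Replace and char.ConvertFromUtf32. The class name ScalarConverter, the method name ToText, and the choice to accept four to six hex digits are assumptions made for this example.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ScalarConverter
{
    // "U+" followed by four to six hex digits (an assumption; the article uses exactly four).
    private static readonly Regex Pattern = new Regex(@"U\+(?<hex>[0-9A-Fa-f]{4,6})");

    // Replaces each "U+xxxx" occurrence with its character in one pass.
    // Text produced by a replacement is never scanned again.
    public static string ToText(string input)
    {
        return Pattern.Replace(input, m =>
        {
            int value = int.Parse(m.Groups["hex"].Value,
                                  NumberStyles.AllowHexSpecifier,
                                  CultureInfo.InvariantCulture);

            // Surrogates and values above U+10FFFF are not scalar values:
            // leave the original text untouched.
            bool isSurrogate = value >= 0xD800 && value <= 0xDFFF;
            if (value > 0x10FFFF || isSurrogate)
            {
                return m.Value;
            }

            // Returns one char for the BMP, a surrogate pair otherwise.
            return char.ConvertFromUtf32(value);
        });
    }

    public static void Main()
    {
        Console.WriteLine(ToText("U+0041U+0042"));  // AB
        Console.WriteLine(ToText("U+0055+0041"));   // U+0041
        Console.WriteLine(ToText("U+D83D"));        // U+D83D (surrogate, left as-is)
        Console.WriteLine(ToText("U+1F600").Length); // 2 (one surrogate pair)
    }
}
```

One trade-off of accepting up to six digits is that hex digits directly following a four-digit value are read as part of it, so "U+0041BC" is treated as U+41BC rather than "A" followed by "BC". Separating values with a non-hex character avoids the ambiguity.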