Last updated: Thu, 19 May 2005

utf8_encode

(PHP 3 >= 3.0.6, PHP 4, PHP 5)

utf8_encode -- Encodes an ISO-8859-1 string to UTF-8

Description

string utf8_encode ( string data )

This function encodes the string data to UTF-8, and returns the encoded version. UTF-8 is a standard mechanism used by Unicode for encoding wide character values into a byte stream. UTF-8 is transparent to plain ASCII characters, is self-synchronized (meaning it is possible for a program to figure out where in the bytestream characters start) and can be used with normal string comparison functions for sorting and such. PHP encodes UTF-8 characters in up to four bytes, like this:

Table 1. UTF-8 encoding

bytes  bits  representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
Each b represents a bit that can be used to store character data.
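
As a rough illustration of the table, the sketch below encodes ISO-8859-1 bytes by hand. The function name latin1_to_utf8_sketch is made up for this example and is not part of PHP. Since every ISO-8859-1 byte is a code point below 256, only the one-byte and two-byte rows are ever needed.

```php
<?php
// Hypothetical helper for illustration only; not a PHP built-in.
function latin1_to_utf8_sketch($data)
{
    $out = '';
    $len = strlen($data);
    for ($i = 0; $i < $len; $i++) {
        $cp = ord($data[$i]);                   // ISO-8859-1 byte value == Unicode code point
        if ($cp < 0x80) {
            $out .= chr($cp);                   // 1 byte:  0bbbbbbb
        } else {
            $out .= chr(0xC0 | ($cp >> 6))      // 2 bytes: 110bbbbb
                  . chr(0x80 | ($cp & 0x3F));   //          10bbbbbb
        }
    }
    return $out;
}

// 0xE9 is "é" in ISO-8859-1
echo bin2hex(latin1_to_utf8_sketch("\xE9")), "\n"; // prints c3a9
```

Plain ASCII input comes back unchanged, which is the "transparent to ASCII" property described above.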



User Contributed Notes
utf8_encode
JF Sebastian
09-Apr-2005 05:54
The following Perl regular expression tests if a string is well-formed Unicode UTF-8 (Broken up after each | since long lines are not permitted here. Please join as a single line, no spaces, before use.):

^([\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
\xe0[\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
\xed[\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
\xf0[\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
\xf4[\x80-\x8f][\x80-\xbf]{2})*$

NOTE: This strictly follows the Unicode standard 4.0, as described in chapter 3.9, table 3-6, "Well-formed UTF-8 byte sequences" ( http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703 ).

ISO-10646, a super-set of Unicode, uses UTF-8 (there called "UCS", see http://www.unicode.org/faq/utf_bom.html#1 ) in a relaxed variant that supports a 31-bit space encoded into up to six bytes instead of Unicode's 21 bits in up to four bytes. To check for ISO-10646 UTF-8, use the following Perl regular expression (again, broken up, see above):

^([\x00-\x7f]|
[\xc0-\xdf][\x80-\xbf]|
[\xe0-\xef][\x80-\xbf]{2}|
[\xf0-\xf7][\x80-\xbf]{3}|
[\xf8-\xfb][\x80-\xbf]{4}|
[\xfc-\xfd][\x80-\xbf]{5})*$

The following function may be used with above expressions for a quick UTF-8 test, e.g. to distinguish ISO-8859-1-data from UTF-8-data if submitted from a <form accept-charset="utf-8,iso-8859-1" method=..>.

function is_utf8($string) {
   return (preg_match('/[insert regular expression here]/', $string) === 1);
}
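
For convenience, here is a sketch with the strict (Unicode 4.0) expression already joined into a single pattern. The name is_utf8_strict is only a suggestion, and the non-capturing group (?: ) is my own small change. For very long strings the repeated group may run into PCRE's backtracking or stack limits, in which case preg_match() returns false and the function reports "not UTF-8".

```php
<?php
// Sketch: the strict well-formedness pattern, joined into one expression.
function is_utf8_strict($string)
{
    $pattern = '/^(?:[\x00-\x7f]'
             . '|[\xc2-\xdf][\x80-\xbf]'
             . '|\xe0[\xa0-\xbf][\x80-\xbf]'
             . '|[\xe1-\xec][\x80-\xbf]{2}'
             . '|\xed[\x80-\x9f][\x80-\xbf]'
             . '|[\xee-\xef][\x80-\xbf]{2}'
             . '|\xf0[\x90-\xbf][\x80-\xbf]{2}'
             . '|[\xf1-\xf3][\x80-\xbf]{3}'
             . '|\xf4[\x80-\x8f][\x80-\xbf]{2})*$/';
    return preg_match($pattern, $string) === 1;
}

var_dump(is_utf8_strict("caf\xC3\xA9")); // "café" in UTF-8:      bool(true)
var_dump(is_utf8_strict("caf\xE9"));     // "café" in ISO-8859-1: bool(false)
```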
http://iubito.free.fr
10-Mar-2005 01:57
Here's a function I wrote to check whether a string or text file is already encoded in UTF-8:

<?php
/**
 * Returns <kbd>true</kbd> if the string or array of strings is encoded in UTF-8.
 *
 * Example of use. If you want to know if a file is saved in UTF-8 format:
 * <code> $array = file('one file.txt');
 * $isUTF8 = isUTF8($array);
 * if (!$isUTF8) --> we need to apply utf8_encode() to be in UTF-8
 * else --> we are in UTF-8 :)
 * </code>
 * @param mixed A string, or an array from a file() function.
 * @return boolean
 */
function isUTF8($string)
{
   if (is_array($string))
   {
       // File contents: look for the UTF-8 byte order mark (EF BB BF).
       // All three bytes must match; a file saved without a BOM is reported as not UTF-8.
       $enc = implode('', $string);
       return (substr($enc, 0, 3) === "\xEF\xBB\xBF");
   }
   else
   {
       // Plain string: heuristic check that a decode/encode round trip is lossless.
       return (utf8_encode(utf8_decode($string)) == $string);
   }
}
?>
Denis G.
24-Feb-2005 07:32
Snippet to convert ISO-8859-1 encoded text to UTF-8 (plain ASCII bytes are left untouched):

$text = preg_replace_callback('/[\x80-\xff]/', function ($m) {
    $byte = ord($m[0]);
    return chr(0xc0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3f));
}, $text);
anonymous at anonymous dot com
24-Jan-2005 04:49
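The code2utf version in the note below still has wrong range limits: the two-byte and three-byte ranges end at 2048 and 65536, not at 1024 and 32768. Corrected version: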

 function code2utf($num){
  if($num<128)return chr($num);
  if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
  if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
  if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
  return '';
 }
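
A quick way to convince yourself that the limits are right is to encode the first and last code point of each range and look at the bytes. The snippet below repeats the function so that it runs on its own; the expected byte sequences are worked out from the UTF-8 table at the top of this page.

```php
<?php
// Same function as above, repeated so the snippet is self-contained.
function code2utf($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

// Boundary code points with the bytes the table predicts.
$cases = array(
    0x7F    => '7f',
    0x80    => 'c280',
    0x7FF   => 'dfbf',
    0x800   => 'e0a080',
    0xFFFF  => 'efbfbf',
    0x10000 => 'f0908080',
);
foreach ($cases as $codepoint => $expected) {
    $got = bin2hex(code2utf($codepoint));
    printf("U+%04X => %s %s\n", $codepoint, $got, $got === $expected ? 'ok' : 'MISMATCH');
}
```

With the limits 1024 and 32768, code points such as U+07FF would be written with one byte more than necessary, which strict validators reject as overlong.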
schofei at yahoo dot de
11-Jan-2005 05:23
regarding the above code2utf function...

> romans at void dot lv
> 02-Oct-2002 09:59
> Here is optimized function which converts
> binary UTF symbol code into unicoded string....

Thanks for providing your nice conversion code. However, due to some missing parentheses, 4-byte UTF-8 characters are not converted properly.

Here is a corrected version of the code2utf function:

 function code2utf($num){
  if($num<128)return chr($num);
  if($num<1024)return chr(($num>>6)+192).chr(($num&63)+128);
  if($num<32768)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
  if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
  return '';
 }
 
regards
Scho Fei
hrpeters (at) gmx (dot) net
14-Dec-2004 12:46
// Validate Unicode UTF-8 Version 4
// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
// It also flags overlong sequences as errors

function is_validUTF8($str)
{
   // values of -1 represent disallowed values for the first byte in current UTF-8
   static $trailing_bytes = array (
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
       -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
       -1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
   );

   $ups = unpack('C*', $str);
   if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8
   for ($i = 1; $i <= $aCnt;)
   {
       if (!($tbytes = $trailing_bytes[($b1 = $ups[$i++])])) continue;
       if ($tbytes == -1) return false;
      
       $first = true;
       while ($tbytes > 0 && $i <= $aCnt)
       {
           $cbyte = $ups[$i++];
           if (($cbyte & 0xC0) != 0x80) return false;
          
           if ($first)
           {
               switch ($b1)
               {
                   case 0xE0:
                       if ($cbyte < 0xA0) return false;
                       break;
                   case 0xED:
                       if ($cbyte > 0x9F) return false;
                       break;
                   case 0xF0:
                       if ($cbyte < 0x90) return false;
                       break;
                   case 0xF4:
                       if ($cbyte > 0x8F) return false;
                       break;
                   default:
                       break;
               }
               $first = false;
           }
           $tbytes--;
       }
       if ($tbytes) return false; // incomplete sequence at EOS
   }       
   return true;
}
Mark AT modernbill DOT com
09-Nov-2004 01:56
If you haven't guessed already: when decoding with utf8_decode(), any UTF-8 character that has no representation in the ISO-8859-1 codepage is returned as a ?. You might want to wrap a function around it to make sure you aren't saving a bunch of ???? into your database.
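
One possible wrapper, as a sketch: refuse to convert when the input contains a code point above U+00FF. The name utf8_to_latin1_or_false is invented here; the check relies on the PCRE "u" modifier, and the conversion uses iconv() rather than utf8_decode(), on the assumption that the iconv extension is available.

```php
<?php
// Hypothetical wrapper: returns the ISO-8859-1 string, or false when the input
// is not valid UTF-8 or contains characters that ISO-8859-1 cannot represent.
function utf8_to_latin1_or_false($utf8)
{
    // With the "u" modifier, preg_match() returns false on malformed UTF-8
    // and 1 when a code point above U+00FF is found.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $utf8) !== 0) {
        return false;
    }
    return iconv('UTF-8', 'ISO-8859-1', $utf8);
}

var_dump(utf8_to_latin1_or_false("caf\xC3\xA9") === "caf\xE9"); // é fits into ISO-8859-1
var_dump(utf8_to_latin1_or_false("\xE2\x82\xAC"));              // euro sign does not: false
```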
Aidan Kehoe <php-manual at parhasard dot net>
30-Aug-2004 09:05
Here's some code that addresses the issue that Steven describes in the previous comment:

<?php

/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
   as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
   the UTF-8 encoding of the non-control characters that Windows-1252 places
   at the equivalent code points. */

$cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
    "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
    "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
    "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
    "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
    "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
    "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
    "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
    "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
    "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
    "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
    "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
    "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
    "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
    "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
    "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
    "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
    "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
    "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
    "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
    "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
    "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
    "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
    "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
    "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
    "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
    "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
);

function cp1252_to_utf8($str) {
    global $cp1252_map;
    return strtr(utf8_encode($str), $cp1252_map);
}

?>
steven -at- acko -dot- net
17-Aug-2004 04:45
Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft's Windows-1252 codepage matches ISO-8859-1 except in the range 0x80-0x9F, where it has several printable characters whose Unicode code points do not match the byte's value (in Unicode, U+0080 - U+009F are C1 control characters).

utf8_encode() simply assumes that the byte's integer value is the code point number in Unicode.

E.g. in 1252, byte 0x80 is the euro sign, which is U+20AC. The same goes for curly quotes, em dashes, etc.

utf8_encode() will convert 0x80 into U+0080 (a control character) rather than U+20AC.

To convert 1252 to UTF-8, use iconv, recode or mbstring.
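
For example, with iconv the conversion might look like the sketch below. It assumes the iconv extension is loaded and that the underlying iconv library knows the name "Windows-1252" (some builds may want "CP1252" instead); the function name is made up for this example.

```php
<?php
// Hypothetical helper: convert Windows-1252 bytes to UTF-8 via iconv.
function cp1252_to_utf8_iconv($str)
{
    return iconv('Windows-1252', 'UTF-8', $str);
}

// 0x93 / 0x94 are curly double quotes in Windows-1252, 0x80 is the euro sign.
$cp1252 = "\x93Price: 5 \x80\x94";
echo bin2hex(cp1252_to_utf8_iconv("\x80")), "\n"; // prints e282ac, i.e. U+20AC
echo cp1252_to_utf8_iconv($cp1252), "\n";
```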
Net Raven
24-Jun-2004 02:58
I often need to convert multi-language text sent to me for use in websites and other apps into UTF-8 so I can insert it into source code and databases.

I knocked up a small web page with its charset set to UTF-8, then set it up so I can paste from the original document (e.g. Word or Excel) and have the page return the UTF-8 encoded version.

Of course the browser will convert the Unicode to UTF-8 for you as part of the submit (I use IE5 or better for this); all you then have to do in PHP is encode the UTF-8 bytes so the browser will show them in their raw form.

It's a bit bulky, but I just convert ALL characters to numbered HTML entities (brute force and ignorance does it again).

I've used this to encode everything from Hebrew to Japanese without problems.

<?php
header("Content-Type: text/html; charset=utf-8");
// Read the submitted text; undo magic quotes where that (old) setting still exists.
$code = isset($_POST['code']) ? $_POST['code'] : '';
if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()) {
    $code = stripslashes($code);
}
?>
<html>
<head>
   <title>UTF8 ENCODER PAGE</title>
   <meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<form method="post" action="?seed=<?php echo time(); ?>">
   Original Unicode<br />
   <textarea name="code" cols="80" rows="10"><?php echo htmlspecialchars($code); ?></textarea><br />
   Encoded UTF8<br />
   <textarea name="encd" cols="80" rows="10"><?php
       // Emit every byte of the UTF-8 string as a numeric entity.
       for ($i = 0; $i < strlen($code); $i++) {
           echo '&#' . ord($code[$i]) . ';';
       }
   ?></textarea><br />
   <input type="submit" value="encode">
</form>
</body>
</html>
lorro at lorro dot wigner dot bme dot hu
05-Apr-2004 09:12
The good news is that utf8_encode (like UTF-8 itself) passes '<', '>', '/', '\'', '"', etc. through unchanged, so you are free to utf8_encode complete blocks of HTML text, tags included.
The bad news is that utf8_encode is not idempotent: for any string containing non-ASCII characters, utf8_encode(utf8_encode($str)) != utf8_encode($str), so encoding the same text twice corrupts it. What you can do is write a utf8_ensure function (seems_utf8 is given in the notes below):

function utf8_ensure($str) {
   return seems_utf8($str)? $str: utf8_encode($str);
}

This comes in handy when your view library tries to encode the same text multiple times.
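
To see why the guard matters, the following sketch encodes the same ISO-8859-1 byte twice. It uses iconv() as a stand-in for utf8_encode() and a preg_match('//u', ...) validity check in place of seems_utf8(); both substitutions, and the function names, are assumptions made for this example.

```php
<?php
// Stand-in for utf8_encode(): treat the input as ISO-8859-1 and convert it.
function latin1_to_utf8_iconv($s)
{
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

// Hypothetical guard: convert only when the input is not already valid UTF-8.
function utf8_ensure_sketch($str)
{
    // An empty pattern with the "u" modifier matches only if $str is valid UTF-8.
    return preg_match('//u', $str) === 1 ? $str : latin1_to_utf8_iconv($str);
}

$latin1 = "\xE9";                          // "é" in ISO-8859-1
$once   = latin1_to_utf8_iconv($latin1);
$twice  = latin1_to_utf8_iconv($once);

echo bin2hex($once), "\n";                 // c3a9
echo bin2hex($twice), "\n";                // c383c2a9 (double-encoded)
echo bin2hex(utf8_ensure_sketch($once)), "\n"; // c3a9 (left alone)
```

Like any such check this is a heuristic: an ISO-8859-1 string whose bytes happen to form valid UTF-8 will be left unconverted.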
bmorel at ssi dot fr
17-Feb-2004 03:22
Here is an improved version of that function, compatible with 31-bit encoding scheme of Unicode 3.x :

<?php
function seems_utf8($Str) {
 for ($i=0; $i<strlen($Str); $i++) {
  if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}
?>
bmorel at ssi dot fr
16-Feb-2004 02:28
Here is a simple function that can help, if you want to know if a string could be UTF-8 or not :

<?php
function seems_utf8($Str) {
 for ($i=0; $i<strlen($Str); $i++) {
  if (ord($Str[$i]) < 0x80) $n=0; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xF0) $n=3; # 1111bbbb
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n octets that match 10bbbbbb follow ?
   if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80)) return false;
  }
 }
 return true;
}
?>
Karen
01-Oct-2003 02:33
Re the previous post about converting GB2312 code to Unicode code which displayed the following function:

<?php
// Program by sadly (www.phpx.com)

function gb2unicode($gb)
{
   if (!trim($gb))
       return $gb;
   $filename = "gb2312.txt";
   $tmp = file($filename);
   $codetable = array();
   foreach ($tmp as $value)
       $codetable[hexdec(substr($value,0,6))] = substr($value,9,4);
   $utf = "";
   while (strlen($gb) > 0)
   {
       if (ord(substr($gb,0,1)) > 127)
       {
           // $pair holds one two-byte GB2312 character
           // (the original used $this, which cannot be assigned to in PHP 5)
           $pair = substr($gb,0,2);
           $gb = substr($gb,2,strlen($gb));
           $utf .= "&#x".$codetable[hexdec(bin2hex($pair))-0x8080].";";
       }
       else
       {
           $gb = substr($gb,1,strlen($gb));
           $utf .= substr($gb,0,1);
       }
   }
   return $utf;
}
?>

I found that a small change was needed in the code to properly handle Latin characters embedded in the middle of GB2312 text, as when the text includes a URL or email address. Just reverse the two lines in the else branch, the part that handles bytes with ord values of 127 or less.

Change:

$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);

to:

$utf.=substr($gb,0,1);
$gb=substr($gb,1,strlen($gb));

In the original function, the first Latin character was dropped and the first non-Latin character after the Latin text was not converted (everything was shifted one character too far to the right). Reversing those two lines makes it work correctly in every example I have tried.

Also, the source of the gb2312.txt file needed for this to work has changed. You can find it in a couple of places:

http://tcl.apache.org/sources/tcl/tools/encoding/gb2312.txt
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
artem at w510 dot tm dot odessa dot ua
02-Jun-2003 09:10
Loading variables in Flash

You can lose a lot of hours if your charset is not ISO-8859-1 and you can't see your characters in Flash.

You must use iconv instead, with your own codepage
(for example windows-1251 for Ukrainian and Russian):

$fw = fopen("flash_input.txt", "w");
if( $fw )
{
   $utf = iconv("windows-1251","UTF-8",$variable_value);
   $out = 'variable_name='.$utf;
   fputs($fw, $out);
   fclose($fw);
}

No urlencode is needed if you save the data in a file!
mualem_i at hotmail dot com
21-May-2003 08:12
Hebrew!! What a language. I had some trouble placing the Hebrew in a javascript based drop down menu, the text appeared as UTF8 so I made this function to overcome the problem (Not talking about efficiency)

function rtf_heb($string)
   {
   $send_VAR = "";
   $array = explode(" ", $string);
   foreach ($array as $VAL)
       {
       $VAL = str_replace("&#1488","à",$VAL);
       $VAL = str_replace("&#1489","á",$VAL);
       $VAL = str_replace("&#1490","â",$VAL);
       $VAL = str_replace("&#1491","ã",$VAL);
       $VAL = str_replace("&#1492","ä",$VAL);
       $VAL = str_replace("&#1493","å",$VAL);
       $VAL = str_replace("&#1494","æ",$VAL);
       $VAL = str_replace("&#1495","ç",$VAL);
       $VAL = str_replace("&#1496","è",$VAL);
       $VAL = str_replace("&#1497","é",$VAL);
       $VAL = str_replace("&#1499","ë",$VAL);
       $VAL = str_replace("&#1500","ì",$VAL);
       $VAL = str_replace("&#1502","î",$VAL);
       $VAL = str_replace("&#1504","ð",$VAL);
       $VAL = str_replace("&#1505","ñ",$VAL);
       $VAL = str_replace("&#1506","ò",$VAL);
       $VAL = str_replace("&#1508","ô",$VAL);
       $VAL = str_replace("&#1510","ö",$VAL);
       $VAL = str_replace("&#1511","÷",$VAL);
       $VAL = str_replace("&#1512","ø",$VAL);
       $VAL = str_replace("&#1513","ù",$VAL);
       $VAL = str_replace("&#1514","ú",$VAL);
       $VAL = str_replace("&#1498","ê",$VAL);
       $VAL = str_replace("&#1507","ó",$VAL);
       $VAL = str_replace("&#1503","ï",$VAL);
       $VAL = str_replace("&#1501","í",$VAL);
       $VAL = str_replace("&#1509","õ",$VAL);
       $VAL = str_replace(";","",$VAL);
       $send_VAR .= $VAL." ";
      
       }
       return $send_VAR;
   }
RoyLaw at 263 dot Net
19-May-2003 06:16
Here is a function for converting GB2312 text to Unicode. It may be useful for programming with XML/WML in non-English languages.

<?php
// Program by sadly (www.phpx.com)

function gb2unicode($gb)
{
   if (!trim($gb))
       return $gb;
   $filename = "gb2312.txt";
   $tmp = file($filename);
   $codetable = array();
   foreach ($tmp as $value)
       $codetable[hexdec(substr($value,0,6))] = substr($value,9,4);
   $utf = "";
   while (strlen($gb) > 0)
   {
       if (ord(substr($gb,0,1)) > 127)
       {
           // $pair holds one two-byte GB2312 character
           // (the original used $this, which cannot be assigned to in PHP 5)
           $pair = substr($gb,0,2);
           $gb = substr($gb,2,strlen($gb));
           $utf .= "&#x".$codetable[hexdec(bin2hex($pair))-0x8080].";";
       }
       else
       {
           // Copy the single-byte character before removing it from the input
           $utf .= substr($gb,0,1);
           $gb = substr($gb,1,strlen($gb));
       }
   }
   return $utf;
}
?>

This function requires the GB2312 code table, which you can download from
ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/GB/GB2312.TXT
sunish_mv at rediffmail dot com
04-Apr-2003 12:50
Here is a class that converts an ISCII (Indian Standard Code for Information Interchange) Devanagari (Hindi) string to a UTF-8 string.

<?php

class iscii2utf8 {

    var $map;

    function iscii2utf8() {
        // ISCII byte (hex) => Unicode code point (decimal); '63' is "?" for unmapped bytes
        $this->map = array(
            "a0" => '63',   "a1" => '2305', "a2" => '2306', "a3" => '2307',
            "a4" => '2309', "a5" => '2310', "a6" => '2311', "a7" => '2312',
            "a8" => '2313', "a9" => '2314', "aa" => '2315', "ab" => '2318',
            "ac" => '2319', "ad" => '2320', "ae" => '2317', "af" => '2322',
            "b0" => '2323', "b1" => '2324', "b2" => '2321', "b3" => '2325',
            "b4" => '2326', "b5" => '2327', "b6" => '2328', "b7" => '2329',
            "b8" => '2330', "b9" => '2331', "ba" => '2332', "bb" => '2333',
            "bc" => '2334', "bd" => '2335', "be" => '2336', "bf" => '2337',
            "c0" => '2338', "c1" => '2339', "c2" => '2340', "c3" => '2341',
            "c4" => '2342', "c5" => '2343', "c6" => '2344', "c7" => '2345',
            "c8" => '2346', "c9" => '2347', "ca" => '2348', "cb" => '2349',
            "cc" => '2350', "cd" => '2351', "ce" => '2399', "cf" => '2352',
            "d0" => '2353', "d1" => '2354', "d2" => '2355', "d3" => '2356',
            "d4" => '2357', "d5" => '2358', "d6" => '2359', "d7" => '2360',
            "d8" => '2361', "d9" => '63',   "da" => '2366', "db" => '2367',
            "dc" => '2368', "dd" => '2369', "de" => '2370', "df" => '2371',
            "e0" => '2374', "e1" => '2375', "e2" => '2376', "e3" => '2373',
            "e4" => '2378', "e5" => '2379', "e6" => '2380', "e7" => '2377',
            "e8" => '2381', "e9" => '63',   "ea" => '2404', "eb" => '63',
            "ec" => '63',   "ed" => '63',   "ee" => '63',   "ef" => '63',
            "f0" => '63',   "f1" => '2406', "f2" => '2407', "f3" => '2408',
            "f4" => '2409', "f5" => '2410', "f6" => '2411', "f7" => '2412',
            "f8" => '2413', "f9" => '2414', "fa" => '2415', "fb" => '63',
            "fc" => '63',   "fd" => '63',   "fe" => '63',   "ff" => '63',
        );
    }

       function code2utf($num) {
             // Returns the UTF-8 string corresponding to the Unicode value
             // courtesy - romans@void.lv
             if ($num < 128) return chr($num);
             if ($num < 2048) return chr(($num>>6)+192).chr(($num&63)+128);
             if ($num < 65536) return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
             if ($num < 2097152) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
             return '';
       }

       function convertstring($iscii) {
           // Returns the UTF-8 string equivalent of the given ISCII string
           $str = "";
           for ($i = 0; $i < strlen($iscii); $i++) {
               $c = dechex(ord(substr($iscii,$i,1)));
               if (isset($this->map[$c])) {
                   $s = $this->code2utf($this->map[$c]);
                   $str .= ($s == "?") ? "" : $s;
               }
               else {
                   $str .= substr($iscii,$i,1);
               }
           }
           return $str;
       }

   }

?>
rbotzer at yahoo dot com
01-Apr-2003 03:25
BTW, the 21-bit range is pretty old news.  Unicode 3.x uses a 31bit encoding scheme that allows for 2 billion characters.

I'll post an enhanced encoder soon.  In the meanwhile here's the current encoding scheme: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Ronen
webmaster at swisswebgroup dot com
31-Mar-2003 07:54
If you try to pass data to a Flash movie with the ActionScript functions
loadVars or sendAndLoad and have problems with special characters
like ä, ö, ..., give this a try:

echo "&data1=".urlencode(utf8_encode("äöü"))
   ."&data2=".urlencode(utf8_encode("ÄÖÜ"));

greets

js
romans at void dot lv
02-Oct-2002 08:59
Here is an optimized function that converts a Unicode code point number into a UTF-8 encoded string.

 function code2utf($num){
  if($num<128)return chr($num);
  if($num<1024)return chr(($num>>6)+192).chr(($num&63)+128);
  if($num<32768)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
  if($num<2097152)return chr($num>>18+240).chr((($num>>12)&63)+128).chr(($num>>6)&63+128). chr($num&63+128);
  return '';
 }
dimitrisATccfDOTauthDOTgr
28-Aug-2002 01:04
To make utf8_encode and utf8_decode support other than iso-8859-1 encodings, you can easily define your encoding in the PHP source.
In the file PHP_SOURCE/ext/xml/xml.c add the following code, for e.g. greek iso-8859-7:

DEFINE TWO NEW FUNCTIONS UP TOP:
inline static unsigned short xml_encode_iso_8859_7(unsigned char);
inline static char xml_decode_iso_8859_7(unsigned short);

AND THEN IMPLEMENT THEM BELOW:
/* {{{ xml_encode_iso_8859_7() - Dimitris Daskopoulos 28/8/02 */
/* map iso-8859-7 chars to Unicode chars */
inline static unsigned short xml_encode_iso_8859_7(unsigned char c)
{
       if (c < 0x80) { /* low-ASCII, leave as is */
               return (unsigned short)c;
       } else { /* Greek character in high-ASCII */
               /* map to UCS greek range (U+0310..03ff) */
               /* assume that c < 0xff */
               return (unsigned short)(c + 720);
       }
}
/* }}} */

/* {{{ xml_decode_iso_8859_7() - Dimitris Daskopoulos 28/8/02 */
/* map Unicode chars to iso-8859-7 chars */
inline static char xml_decode_iso_8859_7(unsigned short c)
{
       if (c < 0x100) { /* char in latin chart, leave as is */
               return (char)c;
       } else if (c > 0x030f && c < 0x0400) { /* char in greek chart */
               /* map back to ISO-8859-7 greek (high-ASCII) */
               return (char)(c - 720);
       } else { /* char not in latin or greek Unicode charts */
               /* return question mark character */
               return (char)('?');
       }
}
/* }}} */

These two work fine for greek iso-8859-7, but studying http://www.unicode.org/charts you
can implement mappings between unicode and other iso-8859-x quite easily.

In both functions (utf8_encode and utf8_decode), change the requested encoding to the one you prefer, e.g.

encoded = xml_utf8_encode(Z_STRVAL_PP(arg), Z_STRLEN_PP(arg), &len, "ISO-8859-7");

decoded = xml_utf8_decode(Z_STRVAL_PP(arg), Z_STRLEN_PP(arg), &len, "ISO-8859-7");

Make sure you add the new encoding to the xml_encodings structure by entering a new row with the official name (ISO-8859-7) and the names of the two functions you have just defined:
xml_encoding xml_encodings[] = {
       { "ISO-8859-1", xml_decode_iso_8859_1, xml_encode_iso_8859_1 },
       { "US-ASCII",  xml_decode_us_ascii,  xml_encode_us_ascii  },
       { "UTF-8",      NULL,                  NULL                  },
       { "ISO-8859-7", xml_decode_iso_8859_7, xml_encode_iso_8859_7 },
       { NULL,        NULL,                  NULL                  }
};

Finally, and this is probably not necessary, I changed the default encoding (found in 2 spots in this file) to the encoding I prefer in my pages, e.g.:
XML(default_encoding) = "ISO-8859-7";

This solution is a little messy, since the utf8_encode function does not accept an argument for choosing the encoding but hardwires it in the source code. Maybe the PHP developers will provide this option in future releases. Until then, this is a quick and dirty solution that will work for localized PHP pages.

Dimitris Daskopoulos
27-Aug-2002 01:30
For XML generation, if you want non-ASCII ISO-8859-1 characters within text and attributes, you don't absolutely need UTF-8 encoding:

The optional XML declaration can change the default encoding for characters from UTF-8 to ISO-8859-1:

<?xml version="1.0" encoding="iso-8859-1" ?>

This can save a lot of PHP code if you just want to generate ISO-8859-1 text and attribute values...

The XML specification requires all parsers to support the UTF-8 and UTF-16 encodings, and UTF-8 is assumed when no encoding is declared. ISO-8859-1 is not mandatory, but it is very widely supported. Other character sets may also be used by naming them in the encoding attribute of the leading XML declaration, but the target parser must support that character set to be able to convert the source text into Unicode characters.
dutoit at NOSPAM dot abonder dot com
01-Aug-2002 01:50
To write an XML element $title containing "exotic" characters (e.g. non-ASCII é & à ñ...), I found 2 solutions.
Fastest:
$xml .= "<title><![CDATA[" . $title . "]]></title>\n";

or cleanest:
$xml .= "<title>" . utf8_encode(htmlspecialchars($title)) . "</title>\n";

After that, your xml can be parsed without errors.
sts at netempire dot de at nospam dot remove at this dot com
12-Apr-2002 10:18
if you want to encode/decode arrays, use these recursive functions

function utf8_encode_array (&$array, $key) {
   if(is_array($array)) {
     array_walk ($array, 'utf8_encode_array');
   } else {
     $array = utf8_encode($array);
   }
}

function utf8_decode_array (&$array, $key) {
   if(is_array($array)) {
     array_walk ($array, 'utf8_decode_array');
   } else {
     $array = utf8_decode($array);
   }
}

and call them with array_walk, e.g.
array_walk ($array_unencoded, 'utf8_decode_array');
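
On PHP 5 and later, array_walk_recursive() can do the recursion for you. Here is a sketch, using iconv() in place of utf8_encode() and a closure (PHP 5.3+); the function name is made up, and only string values are touched, not array keys.

```php
<?php
// Hypothetical helper: returns a copy of $data with every string value
// converted from ISO-8859-1 to UTF-8, at any nesting depth.
function latin1_array_to_utf8(array $data)
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = iconv('ISO-8859-1', 'UTF-8', $value);
        }
    });
    return $data;
}

$in  = array('name' => "Andr\xE9", 'tags' => array("caf\xE9", 42));
$out = latin1_array_to_utf8($in);
echo bin2hex($out['tags'][0]), "\n"; // prints 636166c3a9
```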
lars(at)ioflux(dot)net
13-Mar-2002 10:29
This will also do the job for those who're interested:

<?php

function utf8toiso8859($string)
{
  $returns = "";
  // Sequence length indexed by the top six bits of the lead byte
  // (0 marks a continuation byte)
  $UTF8len = array(
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6);
  $pos = 0;
  $antal = strlen($string);

  do
  {
    $c = ord($string[$pos]);
    $len = $UTF8len[($c >> 2) & 0x3F];
    switch ($len)
    {
      case 6:
        $u = $c & 0x01;
        break;
      case 5:
        $u = $c & 0x03;
        break;
      case 4:
        $u = $c & 0x07;
        break;
      case 3:
        $u = $c & 0x0F;
        break;
      case 2:
        $u = $c & 0x1F;
        break;
      case 1:
        $u = $c & 0x7F;
        break;
      case 0: /* unexpected start of a new character */
        $u = $c & 0x3F;
        $len = 5;
        break;
    }
    while (--$len && (++$pos < $antal && $c = ord($string[$pos])))
    {
      if (($c & 0xC0) == 0x80)
        $u = ($u << 6) | ($c & 0x3F);
      else
      { /* unexpected start of a new character */
        $pos--;
        break;
      }
    }
    if ($u <= 0xFF)
      $returns .= chr($u);
    else
      $returns .= '?';
  } while (++$pos < $antal);
  return $returns;
}

?>
ronen at greyzone dot com
07-Mar-2002 02:01
The following function will utf-8 encode unicode entities &#nnn(nn); with n={0..9}

/**
* takes a string of unicode entities and converts it to a utf-8 encoded string
* each unicode entity has the form &#nnn(nn); n={0..9} and can be displayed by utf-8 supporting
* browsers.  Ascii will not be modified.
* @param $source string of unicode entities [STRING]
* @return a utf-8 encoded string [STRING]
* @access public
*/
function utf8Encode ($source) {
   $utf8Str = '';
   $entityArray = explode ("&#", $source);
   $size = count ($entityArray);
   for ($i = 0; $i < $size; $i++) {
       $subStr = $entityArray[$i];
       $nonEntity = strstr ($subStr, ';');
       if ($nonEntity !== false) {
           $unicode = intval (substr ($subStr, 0, (strpos ($subStr, ';') + 1)));
           // determine how many chars are needed to represent this unicode char
           if ($unicode < 128) {
               $utf8Substring = chr ($unicode);
           }
           else if ($unicode >= 128 && $unicode < 2048) {
               $binVal = str_pad (decbin ($unicode), 11, "0", STR_PAD_LEFT);
               $binPart1 = substr ($binVal, 0, 5);
               $binPart2 = substr ($binVal, 5);
          
               $char1 = chr (192 + bindec ($binPart1));
               $char2 = chr (128 + bindec ($binPart2));
               $utf8Substring = $char1 . $char2;
           }
           else if ($unicode >= 2048 && $unicode < 65536) {
               $binVal = str_pad (decbin ($unicode), 16, "0", STR_PAD_LEFT);
               $binPart1 = substr ($binVal, 0, 4);
               $binPart2 = substr ($binVal, 4, 6);
               $binPart3 = substr ($binVal, 10);
          
               $char1 = chr (224 + bindec ($binPart1));
               $char2 = chr (128 + bindec ($binPart2));
               $char3 = chr (128 + bindec ($binPart3));
               $utf8Substring = $char1 . $char2 . $char3;
           }
           else {
               $binVal = str_pad (decbin ($unicode), 21, "0", STR_PAD_LEFT);
               $binPart1 = substr ($binVal, 0, 3);
               $binPart2 = substr ($binVal, 3, 6);
               $binPart3 = substr ($binVal, 9, 6);
               $binPart4 = substr ($binVal, 15);
      
               $char1 = chr (240 + bindec ($binPart1));
               $char2 = chr (128 + bindec ($binPart2));
               $char3 = chr (128 + bindec ($binPart3));
               $char4 = chr (128 + bindec ($binPart4));
               $utf8Substring = $char1 . $char2 . $char3 . $char4;
           }
          
           if (strlen ($nonEntity) > 1)
               $nonEntity = substr ($nonEntity, 1); // chop the first char (';')
           else
               $nonEntity = '';

           $utf8Str .= $utf8Substring . $nonEntity;
       }
       else {
           $utf8Str .= $subStr;
       }
   }

   return $utf8Str;
}
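
If the built-in html_entity_decode() is acceptable for your input, it can do the same conversion when asked for UTF-8 output. Note that, unlike the function above, it also decodes named entities such as &amp; and hexadecimal ones, so treat this as an alternative sketch rather than a drop-in replacement. The wrapper name is invented for this example.

```php
<?php
// Hypothetical wrapper around the built-in decoder, producing UTF-8 output.
function entities_to_utf8($source)
{
    return html_entity_decode($source, ENT_NOQUOTES, 'UTF-8');
}

$source = 'Shalom: &#1513;&#1500;&#1493;&#1501;';
$utf8   = entities_to_utf8($source);

// "Shalom: " is 8 bytes long; the next two bytes encode U+05E9 (&#1513;).
echo bin2hex(substr($utf8, 8, 2)), "\n"; // prints d7a9
```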
      
Ronen.
mued at muetdhiver dot org
14-Apr-2001 01:49
I had some trouble with UTF-8 on a Linux system. The 2.4 kernel has a UTF-8 option, which I loaded as a module. With that done, it works well.

mued

Copyright © 2001-2005 The PHP Group
All rights reserved.
This unofficial mirror is operated at: The Server Pages
Last updated: Thu May 19 17:35:34 2005 CDT