
mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)

mb_detect_encoding - Detect character encoding

Description

mb_detect_encoding(string $string, array|string|null $encodings = null, bool $strict = false): string|false

Detects the most likely character encoding for the string string from an ordered list of candidates.

Automatic detection of the intended character encoding is never entirely reliable; without additional information, it is similar to decoding an encrypted string without the key. It is always preferable to use an indication of the character encoding stored or transmitted with the data, such as the "Content-Type" HTTP header.
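
As a minimal sketch of preferring a declared encoding over guessing, the snippet below pulls the charset parameter out of a Content-Type header value. The helper name charsetFromContentType() and the sample header are assumptions made for illustration; they are not part of mbstring.

```php
<?php
// Hypothetical helper (not part of mbstring): extract the charset parameter
// from a Content-Type header value, or return null when none is declared.
function charsetFromContentType(string $headerValue): ?string
{
    // Matches `; charset=NAME` with optional quotes around NAME.
    if (preg_match('/;\s*charset\s*=\s*"?([^\s;"]+)"?/i', $headerValue, $m)) {
        return $m[1];
    }
    return null;
}

// Sample header value, assumed for the example.
$declared = charsetFromContentType('text/html; charset=ISO-8859-1');
var_dump($declared);
```

Only when the helper returns null would a call to mb_detect_encoding() be worth considering as a fallback.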

This function is most useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding is rejected and the next encoding is checked.
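
The rejection rule can be observed directly with mb_check_encoding(). The sketch below uses a hypothetical helper, validEncodings(), that lists every candidate in which a byte string is valid; it only illustrates the validity test and is not how mb_detect_encoding() is implemented internally.

```php
<?php
// Hypothetical helper: list every candidate encoding in which $bytes is valid.
function validEncodings(string $bytes, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        fn(string $encoding): bool => mb_check_encoding($bytes, $encoding)
    ));
}

// 'áéóú' encoded in ISO-8859-1: not valid ASCII, not valid UTF-8.
print_r(validEncodings("\xE1\xE9\xF3\xFA", ['ASCII', 'UTF-8', 'ISO-8859-1']));

// "\xC4\xA2" is valid in several encodings at once, so validity alone
// cannot tell which one was intended.
print_r(validEncodings("\xC4\xA2", ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
```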

Warning

The result is not reliable

The name of this function is misleading: it performs "guessing" rather than "detection".

The guesses are far from accurate, so this function cannot be used to reliably detect the correct character encoding.

Parameters

string

The string being inspected.

encodings

A list of character encodings to try, in order. The list may be specified as an array of strings, or as a single string separated by commas.

If encodings is omitted or null, the current detect_order (set with the mbstring.detect_order configuration option, or the mb_detect_order() function) is used.

strict

Controls the behaviour when string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding is returned; if strict is set to true, false is returned.

The default value of strict can be set with the mbstring.strict_detection configuration option.
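
A short sketch of how the parameters interact with the detect_order setting, assuming the mbstring extension is loaded. The encoding list is chosen only for illustration, and strict is passed explicitly so the result does not depend on the mbstring.strict_detection setting.

```php
<?php
// Set the detection order at runtime (the list here is only an example).
mb_detect_order(['ASCII', 'UTF-8', 'ISO-8859-1']);

// Called without arguments, mb_detect_order() returns the current order.
print_r(mb_detect_order());

// With encodings = null, the order set above is used. The bytes are
// 'áéóú' in ISO-8859-1, which is invalid in both ASCII and UTF-8.
$encoding = mb_detect_encoding("\xE1\xE9\xF3\xFA", null, true);
var_dump($encoding);
```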

Return Values

The detected character encoding, or false if the string is not valid in any of the listed encodings.
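
Because the return type is string|false, callers should compare against false with ===. The following is a sketch under the assumption that refusing to guess is acceptable for the application; toUtf8OrFail() is a hypothetical helper name, not an mbstring function.

```php
<?php
// Hypothetical helper: convert $bytes to UTF-8, refusing to guess when the
// input is not valid in any candidate encoding.
function toUtf8OrFail(string $bytes, array $candidates): string
{
    // strict = true makes the function return false instead of a best guess.
    $encoding = mb_detect_encoding($bytes, $candidates, true);

    if ($encoding === false) {
        throw new UnexpectedValueException('Input is not valid in any candidate encoding.');
    }

    return mb_convert_encoding($bytes, 'UTF-8', $encoding);
}

// 'áéóú' in ISO-8859-1 is invalid UTF-8, so only ISO-8859-1 remains.
$utf8 = toUtf8OrFail("\xE1\xE9\xF3\xFA", ['UTF-8', 'ISO-8859-1']);
var_dump(bin2hex($utf8));
```

Throwing keeps a failed detection from silently flowing into later string handling; returning null would work equally well if the caller prefers to branch on the result.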

Changelog

Version Description
8.2.0 mb_detect_encoding() will no longer return the following non-text encodings: "Base64", "QPrint", "UUencode", "HTML entities", "7 bit" and "8 bit".

Examples

Example #1 mb_detect_encoding() example

<?php

$str = "\x95\xB6\x8E\x9A\x83\x52\x81\x5B\x83\x68";

// Detect the encoding with the current detect_order
var_dump(mb_detect_encoding($str));

// "auto" is expanded according to mbstring.language
var_dump(mb_detect_encoding($str, "auto"));

// Specify the "encodings" parameter as a comma-separated list
var_dump(mb_detect_encoding($str, "JIS, eucjp-win, sjis-win"));

// Use an array to specify the "encodings" parameter
$encodings = [
    "ASCII",
    "JIS",
    "EUC-JP"
];
var_dump(mb_detect_encoding($str, $encodings));
?>

The above example will output:

string(5) "ASCII"
string(5) "ASCII"
string(8) "SJIS-win"
string(5) "ASCII"

Example #2 Effect of the strict parameter

<?php
// 'áéóú' encoded in ISO-8859-1
$str = "\xE1\xE9\xF3\xFA";

// The string is not valid in ASCII or UTF-8, but UTF-8 is considered a closer match
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8'], true));

// If a valid encoding is found, the strict parameter does not change the result
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false));
var_dump(mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true));
?>

The above example will output:

string(5) "UTF-8"
bool(false)
string(10) "ISO-8859-1"
string(10) "ISO-8859-1"

In some cases, the same sequence of bytes may form a valid string in several character encodings, and it is impossible to know which interpretation was intended. For example, the byte sequence "\xC4\xA2" could be:

  • "Ä¢" (U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS followed by U+00A2 CENT SIGN) encoded in ISO-8859-1, ISO-8859-15, or Windows-1252
  • "ФЂ" (U+0424 CYRILLIC CAPITAL LETTER EF followed by U+0402 CYRILLIC CAPITAL LETTER DJE) encoded in ISO-8859-5
  • "Ģ" (U+0122 LATIN CAPITAL LETTER G WITH CEDILLA) encoded in UTF-8

Example #3 Effect of order when multiple encodings match

<?php
$str = "\xC4\xA2";

// The string is valid in all three encodings, so the first one listed is returned
var_dump(mb_detect_encoding($str, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5']));
var_dump(mb_detect_encoding($str, ['ISO-8859-1', 'ISO-8859-5', 'UTF-8']));
var_dump(mb_detect_encoding($str, ['ISO-8859-5', 'UTF-8', 'ISO-8859-1']));
?>

The above example will output:

string(5) "UTF-8"
string(10) "ISO-8859-1"
string(10) "ISO-8859-5"

See Also

mb_detect_order()


User Contributed Notes 19 notes

Gerg Tisza
14 years ago
If you try to use mb_detect_encoding to detect whether a string is valid UTF-8, use the strict mode, it is pretty worthless otherwise.

<?php
$str = 'áéóú'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8'); // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false
?>
mta59066 at gmail dot com
2 years ago
The documentation is no longer correct for PHP 8.1, and mb_detect_encoding no longer respects the order of encodings. The example outputs given in the documentation are also no longer correct for PHP 8.1. This is somewhat explained here: https://github.com/php/php-src/issues/8279

I understand the previous ambiguity in these functions, but in my opinion 8.1 should have deprecated mb_detect_encoding and mb_detect_order and come up with different functions. It now tries to find the encoding that will use the least amount of space regardless of the order, and I am not sure who needs that.

Below is an example function that will do what mb_detect_encoding was doing prior to the 8.1 change.

<?php

function mb_detect_encoding_in_order(string $string, array $encodings): string|false
{
    // Return the first listed encoding in which the string is valid
    foreach ($encodings as $enc) {
        if (mb_check_encoding($string, $enc)) {
            return $enc;
        }
    }
    return false;
}

?>
geompse at gmail dot com
2 years ago
Major undocumented breaking change since 8.1.7:
https://3v4l.org/BLjZ3

Make sure to replace mb_detect_encoding with a loop of calls to mb_check_encoding.
Chrigu
20 years ago
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
chris AT w3style.co DOT uk
18 years ago
Based upon the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from ISO-8859-1 to UTF-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
}

?>
nat3738 at gmail dot com
16 years ago
A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (this does not work with strings or files that have no BOM):

<?php
// The Unicode BOM is U+FEFF; once encoded, it looks like this.
define('UTF32_BIG_ENDIAN_BOM', chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define('UTF16_BIG_ENDIAN_BOM', chr(0xFE) . chr(0xFF));
define('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define('UTF8_BOM', chr(0xEF) . chr(0xBB) . chr(0xBF));

function detect_utf_encoding($filename) {
    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // four bytes are needed for the UTF-32 BOMs

    // UTF-32 is checked before UTF-16 because the UTF-32LE BOM starts
    // with the UTF-16LE BOM.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
    // No BOM found: the function returns null.
}
?>
rl at itfigures dot nl
17 years ago
I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to ISO-8859-1, which works fine. I did have a problem with the subsequent iconv conversion.

The problem is that the iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice that \x80 is used as the euro sign in the 8859-1 charset.

I could not use 8859-15 since that mangled some other characters, so I added two str_replace calls:

if (detectUTF8($str)) {
    $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
    $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
    $str = str_replace("&euro;", "\x80", $str);
}

If html-output is needed the last line is not necessary (and even unwanted).
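
For context, the byte 0x80 is a control character in ISO-8859-1 itself; the euro sign at 0x80 comes from Windows-1252, which is what many systems actually mean by "8859-1". A possible alternative to the placeholder trick, sketched here with mb_convert_encoding() instead of the note's iconv() call and assuming Windows-1252 is acceptable as the target, is to convert to that encoding directly:

```php
<?php
// The euro sign U+20AC encoded as UTF-8.
$utf8 = "\xE2\x82\xAC";

// Windows-1252 maps the euro sign to the single byte 0x80.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
var_dump(bin2hex($cp1252));
```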
eyecatchup at gmail dot com
12 years ago
Just a note: instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

<?php
if (preg_match("//u", $string)) {
    // $string is valid UTF-8
}
?>
php-note-2005 at ryandesign dot com
20 years ago
Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$%xs', $string);

} // function is_utf8

?>
hmdker at gmail dot com
16 years ago
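A function to detect UTF-8; it may be useful when mb_detect_encoding is not available.

<?php
function is_utf8($str) {
    $c = 0; $b = 0;
    $bits = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        // Non-ASCII byte. The original tested "> 128", which let a stray
        // 0x80 continuation byte pass as valid.
        if ($c >= 128) {
            // Determine the sequence length from the lead byte
            if ($c >= 254) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false;
            // The sequence must not run past the end of the string
            if (($i + $bits) > $len) return false;
            // Every following byte must be a continuation byte (0x80-0xBF)
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>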
garbage at iglou dot eu
8 years ago
To detect UTF-8, you can use:

if (preg_match('!!u', $str)) { echo 'utf-8'; }

- Norihiori
d_maksimov
3 years ago
It was helpful for my exec(...) call, when it returned cp866 or cp1251:

try {
    $line = iconv('CP866', 'CP1251', $line);
} catch (Exception $e) {
}
return iconv('CP1251', 'UTF-8', $line);
emoebel at web dot de
11 years ago
If the function "mb_detect_encoding" does not exist ...

... try:

<?php
// ----------------------------------------------------
if (!function_exists('mb_detect_encoding')) {

    // ----------------------------------------------------------------
    function mb_detect_encoding($string, $enc = null, $ret = null) {

        static $enclist = array(
            'UTF-8', 'ASCII',
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5',
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10',
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
            'Windows-1251', 'Windows-1252', 'Windows-1254',
        );

        $result = false;

        foreach ($enclist as $item) {
            // A string that survives a round trip unchanged is valid in $item
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($ret === NULL) { $result = $item; } else { $result = true; }
                break;
            }
        }

        return $result;
    }
    // ----------------------------------------------------------------

}
// ----------------------------------------------------
?>

Example usage of mb_detect_encoding():

<?php
// ------------------------------------------------------
function str_to_utf8($str) {

    if (mb_detect_encoding($str, 'UTF-8', true) === false) {
        $str = utf8_encode($str);
    }

    return $str;
}
// ------------------------------------------------------
?>

$txtstr = str_to_utf8($txtstr);
maarten
20 years ago
Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:

//
// utf8 encoding validation developed based on Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C-implementation to be included in PHP core
//
function valid_1byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
}

function valid_2byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
}

function valid_3byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
}

function valid_4byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
}

function valid_nextbyte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
}

function valid_utf8($string) {
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $char = ord(substr($string, $i++, 1));
        if (valid_1byte($char)) { // continue
            continue;
        } else if (valid_2byte($char)) { // check 1 byte
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if (valid_3byte($char)) { // check 2 bytes
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if (valid_4byte($char)) { // check 3 bytes
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } // goto next char
    }
    return true; // done
}

For a drawing of the state machine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png
bmrkbyet at web dot de
12 years ago
a) If the function mb_detect_encoding is not available:

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------

if (!function_exists('mb_detect_encoding')) {
    function mb_detect_encoding($string, $enc = null) {

        static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

        foreach ($list as $item) {
            // A string that survives a round trip unchanged is valid in $item
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                if ($enc == $item) { return true; } else { return $item; }
            }
        }
        return null;
    }
}

// -------------------------------------------
?>

b) If the function mb_convert_encoding is not available:

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($string, $target_encoding, $source_encoding) {
        $string = iconv($source_encoding, $target_encoding, $string);
        return $string;
    }
}

// -------------------------------------------
?>
telemach
19 years ago
Beware: even if you need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.
recentUser at example dot com
7 years ago
In my environment (PHP 7.1.12),
"mb_detect_encoding()" doesn't work
where "mb_detect_order()" is not set appropriately.

To enable "mb_detect_encoding()" to work in such a case,
simply put "mb_detect_order('...')"
before "mb_detect_encoding()" in your script file.

Both
"ini_set('mbstring.language', '...');"
and
"ini_set('mbstring.detect_order', '...');"
DON'T work in script files for this purpose
whereas setting them in PHP.INI file may work.
lotushzy at gmail dot com
7 years ago
About the function mb_detect_encoding: the page http://php.net/manual/zh/function.mb-detect-encoding.php shows this:
mb_detect_encoding('áéóú', 'UTF-8', true); // false
But now the result is not false. Can you give me the reason? Thanks!