grapheme_extract

(PHP 5 >= 5.3.0, PHP 7, PHP 8, PECL intl >= 1.0.0)

grapheme_extract — Extract a sequence of grapheme clusters from a UTF-8 string

Description

Procedural style

grapheme_extract(
    string $haystack,
    int $size,
    int $type = GRAPHEME_EXTR_COUNT,
    int $offset = 0,
    int &$next = null
): string|false

This function extracts a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8.

Parameters

haystack

The string to search.

size

The maximum number of items to return, counted in the units selected by type.

type

Defines the type of units that the size parameter refers to:

GRAPHEME_EXTR_COUNT (default): size is the number of grapheme clusters to extract.
GRAPHEME_EXTR_MAXBYTES: size is the maximum number of bytes to return.
GRAPHEME_EXTR_MAXCHARS: size is the maximum number of UTF-8 characters to return.
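
As a rough illustration of how the three type constants change the meaning of size, the sketch below runs one string through each mode. The sample string and the variable names are assumptions made for this illustration only.

```php
<?php
// Two grapheme clusters in NFD: "a" + combining ring, "o" + combining diaeresis.
// That is 6 bytes, 4 code points, 2 grapheme clusters.
$text = "a\xCC\x8Ao\xCC\x88";

// size counts grapheme clusters: take exactly one cluster.
$byCount = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT);

// size is a byte budget: only the first 3-byte cluster fits into 4 bytes.
$byBytes = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// size is a code point budget: the first cluster uses 2 of the 3 allowed.
$byChars = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXCHARS);

echo urlencode($byCount), "\n"; // a%CC%8A
echo urlencode($byBytes), "\n"; // a%CC%8A
echo urlencode($byChars), "\n"; // a%CC%8A
?>
```

In the byte and character modes the result is cut at a grapheme cluster boundary, so it can be shorter than the budget given in size.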

offset

The starting position in haystack, in bytes. It must be zero or a positive value that does not exceed the length of haystack in bytes, or a negative value, which counts from the end of haystack. If offset does not point to the first byte of a UTF-8 character, the starting position is moved forward to the next character boundary.
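
The offset rules can be sketched as follows, assuming PHP 7.1 or later for the negative offset. The sample string and variable names are illustrative assumptions.

```php
<?php
// Bytes: 0 = "a", 1-2 = combining ring, 3 = "o", 4-5 = combining diaeresis.
$text = "a\xCC\x8Ao\xCC\x88";

// Offset 3 is already a character boundary.
$fromBoundary = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 3);

// Offset 2 is a continuation byte, so the start moves forward to byte 3.
$fromMiddle = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, 2);

// Offset -3 counts back from the end of the 6-byte string, landing on byte 3.
$fromEnd = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, -3);

echo urlencode($fromBoundary), "\n"; // o%CC%88
echo urlencode($fromMiddle), "\n";   // o%CC%88
echo urlencode($fromEnd), "\n";      // o%CC%88
?>
```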

next

Reference to a variable that will be set to the next valid starting position. When the call returns, this may point to the first byte position past the end of the string.
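
One way to use next is to feed it back in as the offset of the following call. The helper below is a minimal sketch of that pattern; splitIntoGraphemes() is a hypothetical name introduced here, not part of the intl extension.

```php
<?php
// Hypothetical helper: collects every grapheme cluster of a UTF-8 string.
function splitIntoGraphemes(string $text): array
{
    $clusters = [];
    $length = strlen($text);
    $next = 0;

    // Stop once $next has moved past the last byte of the string.
    while ($next < $length) {
        $offset = $next;
        $cluster = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $offset, $next);

        // Bail out on failure or when nothing was extracted, so the loop cannot stall.
        if ($cluster === false || $cluster === '') {
            break;
        }
        $clusters[] = $cluster;
    }

    return $clusters;
}

$clusters = splitIntoGraphemes("a\xCC\x8Ao\xCC\x88xyz");
echo count($clusters), "\n"; // 5
?>
```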

Return Values

A string starting at offset and ending on a valid grapheme cluster boundary that satisfies the given size and type, or false on failure.

Changelog

Version	Description
7.1.0	Support for negative offset values was added.

Examples

Example #1 grapheme_extract() example

<?php

$char_a_ring_nfd = "a\xCC\x8A";  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_o_diaeresis_nfd = "o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6) normalization form "D"

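// Offset 2 falls inside the combining ring (bytes 1-2), so extraction starts
// at the next character boundary, byte 3.
print urlencode(grapheme_extract( $char_a_ring_nfd . $char_o_diaeresis_nfd, 1, GRAPHEME_EXTR_COUNT, 2));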

?>

The above example will output:

o%CC%88

See Also

grapheme_substr() - Return part of a string
» Unicode Text Segmentation: Grapheme Cluster Boundaries

User Contributed Notes 3 notes

AJH

14 years ago

Here's how to use grapheme_extract() to loop across a UTF-8 string character by character.

<?php

$str = "سabcक’…";
// if the previous line didn't come through, the string contained:
//U+0633,U+0061,U+0062,U+0063,U+0915,U+2019,U+2026

$n = 0;
$next = 0;
$maxbytes = strlen($str);

while ($next < $maxbytes) {
    // Extract at most one UTF-8 character; $next receives the following byte offset.
    $c = grapheme_extract($str, 1, GRAPHEME_EXTR_MAXCHARS, $next, $next);
    // Compare strictly: empty() would also reject the valid character "0".
    if ($c === false || $c === '') {
        break;
    }
    echo "This utf8 character is " . strlen($c) . " bytes long and its first byte is " . ord($c[0]) . "\n";
    $n++;
}
echo "$n UTF-8 characters in a string of $maxbytes bytes!\n";
// Should print: 7 UTF-8 characters in a string of 14 bytes!
?>

Philo

2 years ago

The other comments on this page were helpful for me.
However, consider using something better than empty($value) when checking the value returned by grapheme_extract since it could as well return something like "0" (which of course evaluates to false).
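
A small sketch of that pitfall, using an illustrative string and counter names that are assumptions for this example: the strict comparison sees all three clusters, while an empty() check silently drops the "0".

```php
<?php
$text = "a0b";
$length = strlen($text);
$next = 0;

$countedStrictly = 0;   // clusters accepted by a strict comparison
$countedWithEmpty = 0;  // clusters accepted by an empty() check

while ($next < $length) {
    $cluster = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $next, $next);
    if ($cluster === false || $cluster === '') {
        break;
    }
    $countedStrictly++;

    // empty("0") is true, so this check skips the middle character.
    if (!empty($cluster)) {
        $countedWithEmpty++;
    }
}

echo $countedStrictly, "\n";  // 3
echo $countedWithEmpty, "\n"; // 2
?>
```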

yevgen dot grytsay at gmail dot com

5 years ago

Looping through grapheme clusters:

<?php

// Example taken from Rust documentation: https://doc.rust-lang.org/book/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my
$str = "नमस्ते";
// Alternatively:
//$str = pack('C*', ...[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]);
$next = 0;
$maxbytes = strlen($str);

var_dump($str);

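while ($next < $maxbytes) {
    $char = grapheme_extract($str, 1, GRAPHEME_EXTR_COUNT, $next, $next);
    // Stop on failure or an empty result: "continue" would loop forever here
    // because $next would not advance, and empty() would also reject "0".
    if ($char === false || $char === '') {
        break;
    }
    echo "{$char} - This grapheme cluster is " . strlen($char) . ' bytes long', PHP_EOL;
}

// Output observed by the author; cluster boundaries depend on the Unicode
// version of the underlying ICU library.
//string(18) "नमस्ते"
//न - This grapheme cluster is 3 bytes long
//म - This grapheme cluster is 3 bytes long
//स् - This grapheme cluster is 6 bytes long
//ते - This grapheme cluster is 6 bytes long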
?>
