Name
Convert HTML to plain text
Description
If you want to extract the text content of an HTML document (for example, to get rid of all the HTML tags and JavaScript), try the following code:
PHP Snippet
<?php
 
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.
 
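$search = array("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                "'<[/!]*?[^<>]*?>'si",             // Strip out HTML tags
                "'([\r\n])[\s]+'",                 // Strip out white space
                "'&(quot|#34);'i",                 // Replace HTML entities
                "'&(amp|#38);'i",
                "'&(lt|#60);'i",
                "'&(gt|#62);'i",
                "'&(nbsp|#160);'i",
                "'&(iexcl|#161);'i",
                "'&(cent|#162);'i",
                "'&(pound|#163);'i",
                "'&(copy|#169);'i");

$replace = array("",
                 "",
                 "\\1",     // Keep the line break, drop the white space after it
                 "\"",
                 "&",
                 "<",
                 ">",
                 " ",
                 chr(161),  // chr() yields single bytes (ISO-8859-1), not UTF-8
                 chr(162),
                 chr(163),
                 chr(169));

$text = preg_replace($search, $replace, $document);

// Convert remaining numeric entities (&#NNN;) to single bytes.
// The "e" (evaluate as php) modifier was removed in PHP 7,
// so a callback does the conversion instead.
$text = preg_replace_callback("'&#(\d+);'",
                              function ($m) { return chr((int) $m[1]); },
                              $text);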
 
?>
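
Example
To show how the snippet might be used, here is a minimal sketch that wraps the same patterns in a function and runs it on a small sample document. The function name html_to_text and the sample markup are made up for illustration and are not part of the original snippet. The sketch assumes PHP 7 or later, so it uses preg_replace_callback in place of the removed "e" modifier.

```php
<?php

// Hypothetical helper (name chosen for this example only):
// applies the snippet's patterns and returns the plain text.
function html_to_text($document)
{
    $search = array("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                    "'<[/!]*?[^<>]*?>'si",             // Strip out HTML tags
                    "'([\r\n])[\s]+'",                 // Strip out white space
                    "'&(quot|#34);'i",                 // Replace HTML entities
                    "'&(amp|#38);'i",
                    "'&(lt|#60);'i",
                    "'&(gt|#62);'i",
                    "'&(nbsp|#160);'i");

    $replace = array("", "", "\\1", "\"", "&", "<", ">", " ");

    $text = preg_replace($search, $replace, $document);

    // Numeric entities (&#NNN;) become single bytes via chr().
    return preg_replace_callback("'&#(\d+);'",
                                 function ($m) { return chr((int) $m[1]); },
                                 $text);
}

// Sample input, invented for this example.
$document = '<html><head><script type="text/javascript">alert("hi");</script></head>'
          . '<body><p>Fish &amp; chips &lt; &#36;5</p></body></html>';

echo html_to_text($document), "\n";  // prints: Fish & chips < $5
```

Because chr() returns one byte per entity, this approach only suits single-byte encodings such as ISO-8859-1; numeric entities above 255 will not come out correctly.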