PHP: September 2006 Archives

2006 Sep 21

Getting a website's title and description is easy. PHP's built-in file_get_contents function, together with a regex pattern, lets us capture any website's title and description without complex methods, provided the site has a title or a description. In case a site has no description, a simple excerpt function is also provided.

Getting the site title:

function getMetaTitle($content) {
    // Capture the text between <title> and </title>, ignoring case.
    $pattern = "|<[\s]*title[\s]*>([^<]+)<[\s]*/[\s]*title[\s]*>|Ui";
    if (preg_match($pattern, $content, $match)) {
        return $match[1];
    }
    return false;
}

The code above returns the title of the site, that is, the text enclosed by the <title> and </title> tags. The function returns boolean false if there is none.
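The regex above only matches a plain <title> tag with no attributes and no markup inside it. As a rough alternative, the same lookup can be sketched with PHP's DOM extension. The helper name getTitleWithDom is hypothetical (it is not part of the original script), and the sketch assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper, not from the original post: read the <title>
// element with the DOM extension instead of a regular expression.
function getTitleWithDom($content) {
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; @ suppresses the parser warnings.
    if (!@$doc->loadHTML($content)) {
        return false;
    }
    $nodes = $doc->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return false;
    }
    $title = trim($nodes->item(0)->textContent);
    return $title === '' ? false : $title;
}

$html = '<html><head><title> Example Page </title></head><body></body></html>';
echo getTitleWithDom($html), "\n";
```

Like getMetaTitle, it returns boolean false when no title is found, so it could be swapped in without changing the calling code.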

Getting the meta description:

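function getMetaDescription($content) {
    $metaDescription = false;
    // One pattern for a double-quoted content attribute, one for single-quoted.
    $metaDescriptionPatterns = array(
        "/<meta[^>]*name=[\"']description[\"'][^>]*content=\"([^\"]*)\"[^>]*>/Ui",
        "/<meta[^>]*name=[\"']description[\"'][^>]*content='([^']*)'[^>]*>/Ui"
    );
    foreach ($metaDescriptionPatterns as $pattern) {
        if (preg_match($pattern, $content, $match)) {
            $metaDescription = $match[1];
            // Stop at the first pattern that matches.
            break;
        }
    }
    return $metaDescription;
}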

The code above returns the site's meta description, whether its content is enclosed in single quotes or double quotes. It returns boolean false if there isn't one. If that happens, we could take an excerpt, perhaps the first sentence on the page, to serve as the website description instead. However, getting an excerpt is not very efficient, and I had some trouble with my code. Please feel free to leave a comment on how to optimize it.

Getting the first website sentence:

function getExcerpt($content) {
    $text = html_entity_decode($content);
    // Match all elements: an opening tag, its content, and a closing tag.
    preg_match_all("|<[^>]+>(.*)</[^>]+>|", $text, $p, PREG_PATTERN_ORDER);
    for ($x = 0; $x < sizeof($p[0]); $x++) {
        // Only consider matches that contain a <p> tag.
        if (preg_match('/<p[\s>]/i', $p[0][$x])) {
            $strip = strip_tags($p[0][$x]);
            // Return the first phrase that ends with a period.
            if (preg_match("/([^.]+\.)/", $strip, $matches)) {
                return $matches[1];
            }
        }
    }
    return false;
}

The code above reads the entire page, looks for the <p> tag, strips all the HTML inside it, and returns the first phrase that ends with a period.
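One possible way to simplify the excerpt step is to let the DOM extension find the paragraphs, so that no tag-matching regex is needed. This is only a sketch under assumptions: getExcerptWithDom is a hypothetical name, the DOM extension must be enabled, and loadHTML may misread non-ASCII pages unless the page declares its encoding.

```php
<?php
// Hypothetical alternative to getExcerpt(): walk the <p> elements with DOM
// and return the first phrase that ends with a period.
function getExcerptWithDom($content) {
    $doc = new DOMDocument();
    // @ suppresses warnings caused by malformed markup.
    if (!@$doc->loadHTML($content)) {
        return false;
    }
    foreach ($doc->getElementsByTagName('p') as $paragraph) {
        // textContent already has nested tags removed; collapse whitespace.
        $text = trim(preg_replace('/\s+/', ' ', $paragraph->textContent));
        // A run of non-period characters followed by a period.
        if (preg_match('/[^.]+\./', $text, $match)) {
            return trim($match[0]);
        }
    }
    return false;
}

$html = '<html><body><p>No period here</p>'
      . '<p>First <b>real</b> sentence. Second sentence.</p></body></html>';
echo getExcerptWithDom($html), "\n"; // prints "First real sentence."
```

Paragraphs without a period are skipped, which mirrors the behaviour described above.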

Here's some sample code to test our script:

<?php
$url = 'http://www.tildemark.com/';
$content = file_get_contents($url);
$title = getMetaTitle($content);
$description = getMetaDescription($content);
$excerpt = getExcerpt($content);
print "title: $title ";
print "<br />";
print "description: $description ";
print "<br />";
print "excerpt: $excerpt";
?>
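The test script assumes the download succeeds and that a description exists. A minimal sketch of the two missing checks might look like the following. Both fetchPage and chooseSummary are hypothetical helpers, not part of the original script, and fetching a URL this way assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical wrapper: fetch a page and return false on failure
// instead of letting file_get_contents emit a warning.
function fetchPage($url, $timeoutSeconds = 10) {
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    $content = @file_get_contents($url, false, $context);
    return ($content === false || $content === '') ? false : $content;
}

// Hypothetical helper: prefer the meta description and fall back to the
// excerpt. Returns false when neither is a non-empty string.
function chooseSummary($description, $excerpt) {
    foreach (array($description, $excerpt) as $candidate) {
        if (is_string($candidate) && trim($candidate) !== '') {
            return trim($candidate);
        }
    }
    return false;
}

// No description was found here, so the excerpt is used instead.
echo chooseSummary(false, ' First sentence of the page. '), "\n";
```

In the test script, $content = fetchPage($url) could replace the direct file_get_contents call, with the scraping functions called only when $content is not false.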

You may download a working copy of the title and description scraper script.

Thank you for the comment:
Yes, indeed. We could use the built-in get_meta_tags function to get the website description without any knowledge of regular expressions. Here's how:

<?php $meta_data= get_meta_tags('http://www.tildemark.com/'); echo $meta_data['description']; ?>

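Aside from the description, you could also get the author, keywords and geo.position meta data using the same get_meta_tags() function. The array keys are the lowercased name attributes with special characters replaced by underscores, so geo.position becomes geo_position.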
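To see the other meta fields without relying on a live site, here is a small sketch that writes a sample page to a temporary file and reads it back with get_meta_tags. The sample HTML and its values are made up for illustration.

```php
<?php
// Made-up sample page, written to a temporary file so the example runs offline.
$html = "<html><head>\n"
      . "<meta name=\"Author\" content=\"Jane Doe\">\n"
      . "<meta name=\"keywords\" content=\"php, scraping\">\n"
      . "<meta name=\"geo.position\" content=\"49.33;-86.59\">\n"
      . "<meta name=\"description\" content=\"A sample page.\">\n"
      . "</head><body></body></html>\n";
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $html);

// get_meta_tags accepts a local path as well as a URL.
$meta = get_meta_tags($path);
unlink($path);

// Keys are lowercased, and special characters such as "." become "_".
echo $meta['author'], "\n";
echo $meta['keywords'], "\n";
echo $meta['geo_position'], "\n";
echo $meta['description'], "\n";
```

A tag that is missing from the page has no key in the array, so checking with isset() before reading a value avoids an undefined index notice.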
