2006 Oct 1

Scrape web contents faster

When scraping websites, I usually use the function file_get_contents. However, there are times when we need only a specific portion of the page, for instance the title or the description.

Instead of fetching the whole page with file_get_contents, we can use the built-in fopen and fgets functions like this:

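<?php
$url = 'http://www.tildemark.com/';
$fp = fopen($url, 'r');            // 'r' opens the URL for reading (needs allow_url_fopen)
if ($fp === false) {
    die("Could not open $url");
}
$buffer = trim(fgets($fp, 1024));  // read the first line, at most 1023 bytes
fclose($fp);
print '<pre>' . htmlspecialchars($buffer) . '</pre>';
?>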

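However, using the cURL functions can be a lot faster, since a server that honors range requests sends only the bytes we ask for. We will use CURLOPT_RANGE to get a specific amount of data from a specified URL. CURLOPT_RANGE defines the range(s) of data to retrieve in the format "X-Y", where X or Y is optional. Both offsets are inclusive, so "0-1023" covers the first 1024 bytes. HTTP transfers also support several intervals, separated with commas in the format "X-Y,N-M". A server that does not support range requests may ignore the option and return the whole page.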
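
To make the range format concrete, here is a minimal sketch of a helper that turns an offset and a length into an inclusive "X-Y" string. The function name byte_range is hypothetical; it is not part of PHP or cURL.

```php
<?php
// Hypothetical helper: build an inclusive "X-Y" byte range for CURLOPT_RANGE.
// $offset is the first byte wanted; $length is how many bytes to request.
// A null $length leaves Y open, meaning "from $offset to the end".
function byte_range(int $offset, ?int $length = null): string
{
    if ($length === null) {
        return $offset . '-';
    }
    // Both ends are inclusive, so the last byte is offset + length - 1.
    return $offset . '-' . ($offset + $length - 1);
}

echo byte_range(0, 1024), "\n";   // 0-1023
echo byte_range(1024), "\n";      // 1024-
```

Several ranges can be joined with commas: implode(',', [byte_range(0, 100), byte_range(500, 100)]) gives "0-99,500-599".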

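<?php
$url = 'http://www.tildemark.com/';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RANGE, '0-1023');       // bytes 0 through 1023: the first 1024 bytes
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
$content = curl_exec($curl);
if ($content === false) {
    die('cURL error: ' . curl_error($curl));
}
curl_close($curl);
echo '<pre>' . htmlspecialchars($content) . '</pre>';
?>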
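
Once the first kilobyte or so is in hand, the title mentioned at the start can be pulled out of it. The following is a minimal sketch, assuming the complete title tag falls within the fetched portion; extract_title is a hypothetical helper, and a regular expression is only a rough stand-in for a real HTML parser.

```php
<?php
// Hypothetical helper: return the text of the first <title> element found in
// a (possibly truncated) chunk of HTML, or null if no complete title is there.
function extract_title(string $html): ?string
{
    // i = case-insensitive, s = let "." match newlines inside the title
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

// A truncated sample, cut off mid-tag the way a partial fetch would be.
$sample = "<html><head>\n<title> Example page </title>\n<meta name=";
echo extract_title($sample), "\n";   // Example page
```

With the cURL example above, calling extract_title($content) would return the page title, or null if the title lies beyond the requested range.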

About this Entry

This page contains a single entry by tildemark published on October 1, 2006 2:59 PM.
