PHP Code for Scraping Data from Remote Websites

2008-12-11 futuregod

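Many programming enthusiasts probably still run into the same question: how do you fetch another site's HTML the way a search engine does, and then collect and organize it into data you can use? Here are a few simple examples.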

Ⅰ. Example: fetching the title of a remote page

The code snippet follows:

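<?php
/*
+-------------------------------------------------------------
+ Fetches a page title. Copy this snippet, save it as a .php file, and run it.
+-------------------------------------------------------------
*/

error_reporting(E_ERROR | E_WARNING | E_PARSE); // numeric value 7
$url = ""; // set this to the page to fetch (the URL is missing from the post)
$file = fopen($url, "r"); // opening an http:// URL needs allow_url_fopen
if (!$file) {
    echo "Unable to open remote file.\n";
    exit;
}
while (!feof($file)) {
    $line = fgets($file, 1024);
    // Match the <title> element, case-insensitively. Reading line by line
    // only finds a title that sits entirely on one line.
    if (preg_match('/<title>(.*?)<\/title>/i', $line, $out)) {
        $title = $out[1];
        echo $title;
        break;
    }
}
fclose($file);

//End
?>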
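
Because the loop above reads one line at a time, it misses a title that spans several lines. As a rough sketch of one way around that, the helper below works on the whole HTML string at once. The function name `extract_title` and the sample string are illustrative assumptions, not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): pull the <title> text
// out of an HTML string, even when the element spans several lines.
function extract_title($html) {
    // "s" lets "." match newlines, "i" ignores case;
    // ".*?" stops at the first closing </title>.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        // Collapse runs of whitespace, then decode entities such as &amp;.
        $text = preg_replace('/\s+/', ' ', $m[1]);
        return html_entity_decode(trim($text), ENT_QUOTES, 'UTF-8');
    }
    return null; // no title element found
}

// Offline demonstration on a literal string.
$sample = "<html><head><TITLE>\n  Hello &amp; welcome\n</TITLE></head><body></body></html>";
echo extract_title($sample), "\n"; // prints "Hello & welcome"
```

For a live page you would pass in the result of `file_get_contents($url)` instead of the sample string, which needs `allow_url_fopen` to be enabled.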

Ⅱ. Example: fetching the HTML source of a remote page

The code snippet follows:
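<?php
/*
+----------------
+DNSing Sprider
+----------------
*/

$host = ""; // set this to the host to contact (the host name is missing from the post)
$fp = fsockopen($host, 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)\n";
} else {
    // Minimal HTTP/1.1 request; the Host header is mandatory in HTTP/1.1.
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: $host\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fputs($fp, $out);
    // Print the raw response: status line, headers, then the body.
    while (!feof($fp)) {
        echo fgets($fp, 128);
    }
    fclose($fp);
}
//End
?>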
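
The socket example prints the raw response, so the status line and headers arrive mixed in with the HTML. The sketch below shows one possible way to separate them. The function name `split_http_response` and the hand-written response are assumptions made for illustration, and the sketch does not decode chunked or compressed bodies, which HTTP/1.1 servers may send.

```php
<?php
// Hypothetical helper (not from the original post): split a raw HTTP
// response, as read from the socket, into status code, headers, and body.
function split_http_response($raw) {
    // The header section ends at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $head  = $parts[0];
    $body  = isset($parts[1]) ? $parts[1] : '';

    $lines  = explode("\r\n", $head);
    $status = array_shift($lines);           // e.g. "HTTP/1.1 200 OK"
    $code   = 0;
    if (preg_match('#^HTTP/\d(?:\.\d)? (\d{3})#', $status, $m)) {
        $code = (int) $m[1];
    }

    $headers = array();
    foreach ($lines as $line) {
        $pair = explode(':', $line, 2);      // split at the first colon only
        if (count($pair) === 2) {
            // Header names are case-insensitive, so store them in lower case.
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }
    return array('code' => $code, 'headers' => $headers, 'body' => $body);
}

// Offline demonstration with a hand-written response.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$res = split_http_response($raw);
echo $res['code'], " ", $res['headers']['content-type'], " ", $res['body'], "\n";
// prints "200 text/html <html>hi</html>"
```

In the socket example you would collect the `fgets()` output into a string instead of echoing it, then pass that string to the helper.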

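Copy either snippet, point it at a page, and run it to see what it does. These examples are only a rough starting point for scraping web data; how to adapt them depends on your own needs, so I leave that for you to explore.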

===============================

The more interesting functions below are get_content_by_socket(), get_url(), get_content_url(), and get_content_object(); they may give you some ideas.


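<?php

// Collect every content URL from the index pages and save them to a file.
// $host_url (an assumed extra parameter) is the base URL prepended to each
// extracted link; get_content_url() requires it.
function get_index($save_file, $prefix="index_", $host_url=""){
     $count = 68;
     $i = 1;
     if (file_exists($save_file)) @unlink($save_file);
     $fp = fopen($save_file, "a+") or die("Open ". $save_file ." failed");
     while($i<$count){
         // get_url() accepts only http:// URLs, so $prefix must be a full URL prefix
         $url = $prefix . $i .".htm";
         echo "Get ". $url ."...";
         $url_str = get_content_url($host_url, get_url($url));
         echo " OK\n";
         fwrite($fp, $url_str);
         ++$i;
     }
     fclose($fp);
}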

// Fetch the target multimedia objects
function get_object($url_file, $save_file, $split="|--:**:--|"){
     if (!file_exists($url_file)) die($url_file ." does not exist");
     // Strip trailing newlines and skip blank lines so each entry is a clean URL
     $file_arr = file($url_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
     if (!is_array($file_arr) || empty($file_arr)) die($url_file ." has no content");
     $url_arr = array_unique($file_arr);
     if (file_exists($save_file)) @unlink($save_file);
     $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
     foreach($url_arr as $url){
         if (empty($url)) continue;
         echo "Get ". $url ."...";
         $html_str = get_url($url);
         $obj_str = get_content_object($html_str, $split);
         echo " OK\n";
         fwrite($fp, $obj_str);
     }
     fclose($fp);
}

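// Walk a directory and extract content from every file in it
function get_dir($save_file, $dir){
     $dp = opendir($dir) or die("Open dir ". $dir ." failed");
     if (file_exists($save_file)) @unlink($save_file);
     $fp = fopen($save_file, "a+") or die("Open save file ". $save_file ." failed");
     // Compare with !== so that a file named "0" does not end the loop early
     while(($file = readdir($dp)) !== false){
         if ($file!="." && $file!=".."){
             echo "Read file ". $file ."...";
             // Works whether or not $dir ends with a slash
             $file_content = file_get_contents(rtrim($dir, "/") . "/" . $file);
             $obj_str = get_content_object($file_content);
             echo " OK\n";
             fwrite($fp, $obj_str);
         }
     }
     closedir($dp);
     fclose($fp);
}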


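// Fetch the content of the given URL
function get_url($url){
     $reg = '/^http:\/\/[^\/].+$/';
     if (!preg_match($reg, $url)) die($url ." invalid");
     // Opening an http:// URL with fopen() needs allow_url_fopen
     $fp = fopen($url, "r") or die("Open url: ". $url ." failed.");
     $content = "";
     // Loop on feof() so that a chunk equal to "0" does not stop the read
     while(!feof($fp)){
         $fc = fread($fp, 8192);
         if ($fc === false) break;
         $content .= $fc;
     }
     fclose($fp);
     if (empty($content)){
         die("Get url: ". $url ." content failed.");
     }
     return $content;
}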

// Fetch the given page over a raw socket
function get_content_by_socket($url, $host){
     $fp = fsockopen($host, 80) or die("Open ". $url ." failed");
     $header = "GET /".$url ." HTTP/1.1\r\n";
     $header .= "Accept: */*\r\n";
     $header .= "Accept-Language: zh-cn\r\n";
     // No Accept-Encoding header: the response is stored as received, so it must not be compressed
     $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
     $header .= "Host: ". $host ."\r\n";
     //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
     // A single Connection header; Close makes the server end the stream so feof() becomes true
     $header .= "Connection: Close\r\n\r\n";

     fwrite($fp, $header);
     $contents = "";
     while (!feof($fp)) {
         $contents .= fgets($fp, 8192);
     }
     fclose($fp);
     // The raw response: status line, headers, then the body
     return $contents;
}


// Extract the URLs found in the given content
function get_content_url($host_url, $file_contents){

     //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
     //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
     // Capture the value of every href attribute (the i flag makes the match case-insensitive)
     $rex = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
     // Keep only links of the form down*.html
     $reg = '/^(down.*?\.html)$/i';
     preg_match_all($rex, $file_contents, $r);
     $result = "";
     // $r[1] holds the captured href values
     foreach($r[1] as $d){
         if (preg_match($reg, $d)){ $result .= $host_url . $d."\n"; }
     }
     return $result;
}

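// Extract the multimedia entries from the given content
function get_content_object($str, $split="|--:**:--|"){
     $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(.*?<\/b>)/i";
     preg_match_all($regx, $str, $result);

     // $result[1] holds the href values, $result[2] the text up to </b>
     if (empty($result[1])) return "";
     // "多媒体: " ("Multimedia: ") is the label text on the target page
     $result[2] = str_replace("多媒体: ", "", $result[2]);
     // The search string here appears damaged in the post; as printed it is a newline
     $result[2] = str_replace("\n", "", $result[2]);
     // Only the first match is returned, as "url<split>text"
     return $result[1][0] . $split . $result[2][0] . "\n";
}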

?>
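
get_content_url() pulls links out of the page with a regular expression, which can misfire on unusual markup. As an alternative sketch, the helper below uses PHP's DOM extension to read the href attributes and then applies the same kind of filter. The function name `extract_links`, its parameters, and the `http://www.example.com/` base URL are assumptions made for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): collect href values
// with the DOM extension and keep those that match $pattern.
function extract_links($html, $pattern, $base = '') {
    if ($html === '') {
        return array();          // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; "@" silences the parser warnings.
    if (!@$doc->loadHTML($html)) {
        return array();
    }
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '' && preg_match($pattern, $href)) {
            $links[] = $base . $href;   // prepend the base URL, as get_content_url() does
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

// Offline demonstration; the base URL is a placeholder.
$html = '<a href="down1.html">one</a> <A HREF=\'down2.html\'>two</A> '
      . '<a href="index.htm">home</a> <a href="down1.html">again</a>';
echo implode("\n", extract_links($html, '/^down.*?\.html$/i', 'http://www.example.com/')), "\n";
```

Unlike get_content_url(), this sketch returns an array instead of a newline-joined string, so the caller can decide how to store the links.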