Determining if a URL exists with Curl

May 31st, 2006

It's quite common for people to enter their URL when signing up - but what if you want to verify that it points to a real page? You can validate the URL with a regular expression up to a point, but all that tells us is that the URL is well formed. What I wanted to do was check that the page exists - i.e. that we don't get a 404 for it.
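
For the "well formed" half of the problem, here is a minimal sketch that uses PHP's built-in filter_var() with FILTER_VALIDATE_URL rather than a hand-written regular expression. The function name isWellFormedUrl() and the restriction to http/https are my own assumptions for illustration, not something the rest of this post depends on.

```php
<?php
// Hypothetical helper: returns true when $url is a syntactically valid http(s) URL.
// This only checks the shape of the string; it makes no network request.
function isWellFormedUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Assumption: only web URLs are of interest, so other schemes are rejected.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

var_dump(isWellFormedUrl('http://www.jellyandcustard.com/')); // bool(true)
var_dump(isWellFormedUrl('not a url'));                       // bool(false)
```

A URL that passes this check can still point at a page that doesn't exist, which is what the rest of this post deals with.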

Luckily, this is quite easy if you have the Curl extension installed.

Using Curl in PHP

Curl is a collection of client URL library functions that we can use to perform a multitude of URL-related activities. You can send a POST or file upload to a URL, use it as a proxy, send/receive XML requests, and much more. In fact, there are over 100 configuration options that let you customise it to your needs.

Still, for this tutorial, all we want is to check that a URL exists. We can do this by using Curl to grab only the headers when requesting the URL, and then checking the status code on the first line of the response.

PHP:
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, "http://www.jellyandcustard.com/");
  curl_setopt($ch, CURLOPT_HEADER, true);         // include the headers in the output
  curl_setopt($ch, CURLOPT_NOBODY, true);         // don't download the page body
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the output instead of printing it
  $data = curl_exec($ch);
  curl_close($ch);
  echo $data;

The result from running the above code is:

HTTP/1.1 200 OK
Date: Wed, 31 May 2006 12:25:26 GMT
Server: Apache
X-Powered-By: PHP/5.1.2
X-Pingback: http://www.jellyandcustard.com/xmlrpc.php
Status: 200 OK
Content-Type: text/html; charset=UTF-8

As you can see, the first line is the status line, HTTP/1.1 200 OK, and its 200 code means the page exists. If we get a 404 code instead, then the page doesn't exist. A list of HTTP status codes can be found at Wikipedia.

If the domain can't be resolved, curl_exec() returns false, so $data will be false, which we can check for with a simple if() statement.
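
As a rough sketch of that check, the snippet below wraps the request in a function and reads curl_error() when curl_exec() returns false. The function name fetchHeaders(), the shape of the returned array, the timeout values and the test hostname are all assumptions of mine for illustration.

```php
<?php
// Hypothetical helper: requests only the headers for $url and reports transport failures.
// Returns ['ok' => bool, 'headers' => string|null, 'error' => string|null].
function fetchHeaders(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // assumed limits, tune as needed
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $data = curl_exec($ch);

    if ($data === false) {
        // The request itself failed (unresolvable host, refused connection, timeout...)
        return ['ok' => false, 'headers' => null, 'error' => curl_error($ch)];
    }
    return ['ok' => true, 'headers' => $data, 'error' => null];
}

// The .invalid top-level domain is reserved, so this host should never resolve.
$result = fetchHeaders('http://no-such-host.invalid/');
if (!$result['ok']) {
    echo "Request failed: " . $result['error'] . "\n";
}
```

Comparing with === false matters here: a plain !$data would also treat an empty string as a failure.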

In order to utilise the information returned, we need to use a regular expression to grab the HTTP code, and then based upon that, we can let the user know our results:

PHP:
  // [01] matches HTTP/1.0 or HTTP/1.1; the captured group is the three-digit code
  preg_match("/HTTP\/1\.[01]\s(\d{3})/", $data, $matches);
  print_r($matches);

Array
(
    [0] => HTTP/1.1 200
    [1] => 200
)

So $matches[1] has our status code. If this is 200 then the page exists, and if it is 404, then the page doesn't exist.
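
Here is the same idea as a small self-contained sketch that you can run without making a request. The function name parseStatusCode() and the sample header string are my own, made up for illustration.

```php
<?php
// Hypothetical helper: pulls the three-digit status code out of the first
// HTTP/1.0 or HTTP/1.1 status line in $headers, or returns null if none is found.
function parseStatusCode(string $headers): ?int
{
    if (preg_match('/HTTP\/1\.[01]\s(\d{3})/', $headers, $matches) === 1) {
        return (int) $matches[1];
    }
    return null;
}

// Shortened sample of the kind of header block shown above
$sample = "HTTP/1.1 200 OK\r\nServer: Apache\r\nStatus: 200 OK\r\n\r\n";

var_dump(parseStatusCode($sample));      // int(200)
var_dump(parseStatusCode('no headers')); // NULL
```

One caveat: this pattern only recognises HTTP/1.0 and HTTP/1.1 status lines. A server answering over HTTP/2 sends a status line like "HTTP/2 200", which it would not match.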

A slight change has to be made for pages that use a 301/302/307 redirect. We need to instruct Curl to follow the redirect, and report what's found. These redirects produce a lot of headers, as seen when testing http://ukdomains.jellyandcustard.com:

PHP:
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, "http://ukdomains.jellyandcustard.com/");
  curl_setopt($ch, CURLOPT_HEADER, true);
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // follow up to 10 redirections - avoids loops
  $data = curl_exec($ch);
  curl_close($ch);
  echo $data;
  // one match per response in the redirect chain
  preg_match_all("/HTTP\/1\.[01]\s(\d{3})/", $data, $matches);
  print_r($matches);

HTTP/1.1 302 Found
Date: Wed, 31 May 2006 12:50:38 GMT
Server: Apache
Location: http://www.123-reg.co.uk/affiliate.cgi?id=AF106554
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 302 Found
Date: Wed, 31 May 2006 12:50:42 GMT
Server: Apache/2.0.53
URL: http://www.123-reg.co.uk/
Set-Cookie: 123reg_affiliate=106554-92bab777b0ae96c938107985db7a98e4; path=/secure/; expires=Sat, 28-May-2016 12:50:43 GMT; secure
Location: http://www.123-reg.co.uk/
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Wed, 31 May 2006 12:50:43 GMT
Server: Apache/2.0.53
Accept-Ranges: bytes
Content-Length: 33854
Content-Type: text/html

Array
(
    [0] => Array
        (
            [0] => HTTP/1.1 302
            [1] => HTTP/1.1 302
            [2] => HTTP/1.1 200
        )

    [1] => Array
        (
            [0] => 302
            [1] => 302
            [2] => 200
        )

)

The status code of the final page is the last entry in $matches[1], which we can get with:

PHP:
  $code = end($matches[1]);
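
To try this without hitting the network, here is a small sketch that runs the same pattern over a shortened copy of the redirect headers above. The function name statusCodesInChain() is hypothetical, and the header string is trimmed down from the real output.

```php
<?php
// Hypothetical helper: returns every status code found in a block of headers
// that may contain several responses (one per redirect hop), in order.
function statusCodesInChain(string $headers): array
{
    preg_match_all('/HTTP\/1\.[01]\s(\d{3})/', $headers, $matches);
    return array_map('intval', $matches[1]);
}

// Trimmed-down version of the three responses shown above
$headers = "HTTP/1.1 302 Found\r\nLocation: http://www.123-reg.co.uk/\r\n\r\n"
         . "HTTP/1.1 302 Found\r\nLocation: http://www.123-reg.co.uk/\r\n\r\n"
         . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";

$codes = statusCodesInChain($headers);

// end() returns false on an empty array, so guard against "no status line found"
$final = $codes ? end($codes) : null;

print_r($codes);  // 302, 302, 200
var_dump($final); // int(200)
```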

We can now check $code for 200, 404 and so on, to make sure that we've been given a real URL and not one that is made up. So the entire code all together:

PHP:
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, "http://www.jellyandcustard.com/");
  curl_setopt($ch, CURLOPT_HEADER, true);
  curl_setopt($ch, CURLOPT_NOBODY, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // follow up to 10 redirections - avoids loops
  $data = curl_exec($ch);
  curl_close($ch);

  if ($data === false) {
    // curl_exec() failed, e.g. the domain could not be resolved
    echo "Domain could not be found";
  } else {
    // one status line per response; the last one belongs to the final page
    preg_match_all("/HTTP\/1\.[01]\s(\d{3})/", $data, $matches);
    $code = end($matches[1]);
    if ($code == 200) {
      echo "Page Found";
    } elseif ($code == 404) {
      echo "Page Not Found";
    }
  }
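
As an alternative to parsing the headers yourself, Curl can report the status code directly through curl_getinfo() with CURLINFO_HTTP_CODE. The sketch below shows one way that might look. The function names fetchStatusCode() and describeStatus(), the timeout value and the messages are my own assumptions, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Hypothetical helper: returns the final HTTP status code for $url after
// following redirects, or null if the URL is malformed or the request fails.
function fetchStatusCode(string $url): ?int
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null; // malformed, so not worth a network round trip
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // assumed limit, tune as needed

    if (curl_exec($ch) === false) {
        return null; // e.g. host not resolved, timeout, too many redirects
    }
    // With CURLOPT_FOLLOWLOCATION set, this is the code of the last response received.
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// Hypothetical helper: turns a status code into a message for the user.
function describeStatus(?int $code): string
{
    if ($code === null) {
        return 'URL could not be checked';
    }
    if ($code >= 200 && $code < 300) {
        return 'Page Found';
    }
    if ($code === 404) {
        return 'Page Not Found';
    }
    return "Unexpected status: $code";
}

echo describeStatus(200), "\n";                          // Page Found
echo describeStatus(404), "\n";                          // Page Not Found
echo describeStatus(fetchStatusCode('not a url')), "\n"; // URL could not be checked
```

For a live check you would call describeStatus(fetchStatusCode($url)) with the URL the user entered. This version needs no regular expression, and it doesn't depend on the exact format of the status line.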

Enjoy!


Entry Filed under: PHP

6 Comments

  • 1. jd  |  June 8th, 2006 at 8:01 pm

    your regex to match the header is just missing some escape slashes. instead, try:

    "/HTTP/1.[1|0]s(d{3})/"

    otherwise, thanks!

  • 2. jd  |  June 8th, 2006 at 8:04 pm

    hrm…anyway, backslashes before the first "1", the "s" and the "d"

  • 3. Khalid  |  June 9th, 2006 at 9:32 am

    Strange, it is definitely in the code I have saved. I think WordPress is removing it.

    I'll have a go at fixing it

    Thanks for the heads up!

    Khalid

  • 4. FettesPS  |  June 11th, 2007 at 8:54 pm

    Great article! Far better than just a preg_match.

    Thanks

  • 5. Tobias  |  August 27th, 2007 at 9:16 am

    Hi there,

    I was just looking up (again :-) ) how to get the HTTP status code via curl.

    Your method might work, but curl itself collects this information and provides it much more reliably, like this:
    $code = curl_getinfo($ch,CURLINFO_HTTP_CODE);

  • 6. @TheKeyboard » Blog…  |  January 11th, 2008 at 9:42 pm

    […] The only tricky thing here really is the use of the end function to grab that last match of the status codes. I got the code for doing the preg_match from this site and it seems to work just fine. The reason to use end(…) as far as I can tell is to make sure that I only get the last match of the group. Neat little trick and the link I posted demonstrates it. […]
