Determining if a URL exists with Curl
May 31st, 2006
It's quite common for people to enter their URL when signing up, but what if you want to verify that it is a real page? You can validate the URL with a regular expression up to a point, but all that tells us is that the URL is well formed. What I wanted to do was check that the page exists, i.e. that we don't get a 404 for it.
Luckily, this is quite easy if you have the Curl extension installed.
Using Curl in PHP
Curl is a collection of client URL library functions that we can use to perform a multitude of URL-related activities. You can send a POST or file upload to a URL, use it as a proxy, send/receive XML requests, and much more. In fact, there are over 100 configuration options that let you customise it to your needs.
Still, for this tutorial, all we want is to check that a URL exists. We can do this by using curl to grab the headers when requesting the URL and checking the Status: header returned.
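// Create a curl handle and point it at the URL we want to test
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.jellyandcustard.com/");
curl_setopt($ch, CURLOPT_HEADER, true);         // include the response headers in the output
curl_setopt($ch, CURLOPT_NOBODY, true);         // skip the body, we only want the headers
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the output instead of printing it
$data = curl_exec($ch);
curl_close($ch);
echo $data;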
The result from running the above code is:
HTTP/1.1 200 OK
Date: Wed, 31 May 2006 12:25:26 GMT
Server: Apache
X-Powered-By: PHP/5.1.2
X-Pingback: http://www.jellyandcustard.com/xmlrpc.php
Status: 200 OK
Content-Type: text/html; charset=UTF-8
As you can see, we have our Status: 200 OK HTTP Code which means the page exists. If we get a 404 HTTP code then the page doesn't exist. A list of HTTP Codes can be found at Wikipedia.
If the domain can't be resolved, curl_exec() returns false, so $data will be false, which we can check for with a simple if() statement.
In order to utilise the information returned, we need to use a regular expression to grab the HTTP code, and then based upon that, we can let the user know our results:
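A minimal sketch of that step, assuming a pattern that matches the literal text "Status: " followed by a three-digit code (the pattern and the sample $data are illustrative assumptions, chosen to be consistent with the output shown next). Not every server sends a Status: header, so it is worth handling the no-match case:

```php
<?php
// Sample headers standing in for the $data returned by curl_exec()
$data = "HTTP/1.1 200 OK\r\n"
      . "Server: Apache\r\n"
      . "Status: 200 OK\r\n"
      . "Content-Type: text/html; charset=UTF-8\r\n";

// Assumed pattern: the literal "Status: " followed by a three-digit code
if (preg_match('/Status: (\d{3})/', $data, $matches)) {
    print_r($matches);   // $matches[1] holds the code as a string
} else {
    echo "No Status: header found";
}
```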
Array
(
[0] => Status: 200
[1] => 200
)
So $matches[1] has our status code. If it is 200 then the page exists, and if it is 404 then the page doesn't exist.
A slight change has to be made for pages that use a 301/302/307 redirect. We need to instruct Curl to follow the redirect, and report what's found. These redirects produce a lot of headers, as seen when testing http://ukdomains.jellyandcustard.com:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://ukdomains.jellyandcustard.com/");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // follow up to 10 redirections - avoids loops
$data = curl_exec($ch);
curl_close($ch);
echo $data;
Running this prints one set of headers for each response in the redirect chain:
HTTP/1.1 302 Found
Date: Wed, 31 May 2006 12:50:38 GMT
Server: Apache
Location: http://www.123-reg.co.uk/affiliate.cgi?id=AF106554
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 302 Found
Date: Wed, 31 May 2006 12:50:42 GMT
Server: Apache/2.0.53
URL: http://www.123-reg.co.uk/
Set-Cookie: 123reg_affiliate=106554-92bab777b0ae96c938107985db7a98e4; path=/secure/; expires=Sat, 28-May-2016 12:50:43 GMT; secure
Location: http://www.123-reg.co.uk/
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Wed, 31 May 2006 12:50:43 GMT
Server: Apache/2.0.53
Accept-Ranges: bytes
Content-Length: 33854
Content-Type: text/html

Grabbing every status line with a regular expression gives:
Array
(
    [0] => Array
        (
            [0] => HTTP/1.1 302
            [1] => HTTP/1.1 302
            [2] => HTTP/1.1 200
        )
    [1] => Array
        (
            [0] => 302
            [1] => 302
            [2] => 200
        )
)
Our Status code can be found by:
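A minimal sketch of that step, assuming preg_match_all() to collect every status line and end() to pick the last one, which belongs to the final page. The pattern is based on the corrected regex posted in the comments, widened here to accept other HTTP versions; that widening and the sample $data are assumptions for illustration:

```php
<?php
// Made-up sample standing in for $data: two redirects, then the final page
$data = "HTTP/1.1 302 Found\r\nLocation: http://www.123-reg.co.uk/\r\n\r\n"
      . "HTTP/1.1 302 Found\r\nLocation: http://www.123-reg.co.uk/\r\n\r\n"
      . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";

// $matches[1] receives the three-digit code of every response, in order
preg_match_all('/HTTP\/[\d.]+\s(\d{3})/', $data, $matches);

// end() returns the last element, or false if nothing matched
$code = end($matches[1]);
echo $code;
```

Because end() returns false when there are no matches, a failed match falls through both the 200 and the 404 checks rather than raising an error.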
We can now check that code for 200, 404 and so on to make sure we've been given a real URL, and not one that is made up. So the entire code all together:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.jellyandcustard.com/");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // follow up to 10 redirections - avoids loops
$data = curl_exec($ch);
curl_close($ch);

if ($data === false) {
    echo "Domain could not be found";
} else {
    // Collect the status code of every response in the redirect chain
    preg_match_all('/HTTP\/[\d.]+\s(\d{3})/', $data, $matches);
    // The last match is the status of the final page
    $code = end($matches[1]);
    if ($code == 200) {
        echo "Page Found";
    } elseif ($code == 404) {
        echo "Page Not Found";
    }
}
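As one of the comments below points out, curl can also report the status code itself through curl_getinfo() with CURLINFO_HTTP_CODE, which avoids parsing the headers altogether. A rough sketch of that variant, assuming a hypothetical helper name urlStatus() and timeout values that are not part of the original code:

```php
<?php
// Hypothetical helper (not from the original post): returns the status code
// of the final response for $url, or null if no response was received.
function urlStatus(string $url): ?int
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't print anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // assumed value
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // assumed value

    // With NOBODY set, success can be an empty string, so compare with ===
    if (curl_exec($ch) === false) {
        return null;
    }
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// The .invalid top-level domain is reserved and never resolves
$code = urlStatus("http://nonexistent.invalid/");
echo $code === null ? "Domain could not be found" : "Status: $code";
```

When redirects are followed, CURLINFO_HTTP_CODE reports the code of the last response received, so it plays the same role as taking the last regex match.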
Enjoy!
Entry Filed under: PHP
10 Comments
1. jd | June 8th, 2006 at 8:01 pm
your regex to match the header is just missing some escape slashes. instead, try:
"/HTTP/1.[1|0]s(d{3})/"
otherwise, thanks!
2. jd | June 8th, 2006 at 8:04 pm
hrm…anyway, backslashes before the first "1", the "s" and the "d"
3. Khalid | June 9th, 2006 at 9:32 am
Strange, it is definitely in the code I have saved, I think WordPress is removing it.
I'll have a go at fixing it
Thanks for the heads up!
Khalid
4. FettesPS | June 11th, 2007 at 8:54 pm
Great article! Far better than just a preg_match.
Thanks
5. Tobias | August 27th, 2007 at 9:16 am
Hi there,
I was just looking up (again) how to get the HTTP status code via curl.
Your method might work, but curl itself collects this information and provides it much more reliably, like this:
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
6. @TheKeyboard » Blog… | January 11th, 2008 at 9:42 pm
[...] The only tricky thing here really is the use of the end function to grab that last match of the status codes. I got the code for doing the preg_match from this site and it seems to work just fine. The reason to use end(…) as far as I can tell is to make sure that I only get the last match of the group. Neat little trick and the link I posted demonstrates it. [...]
7. Anthony | July 26th, 2008 at 12:48 am
Thank you so much! I have been looking all over the net for this searching for "http header check" etc. etc. I could only locate some script of which the download link did not work. Again - THANK YOU! Also, there is a small error in your first code: curl_close($ch) should have ";" behind it
God Bless & Thank you again
8. Jayapal Chandran | December 17th, 2008 at 3:10 pm
ha, wha, i was thinking it for some time and then typed in google and i am here for what i need. thank you. and i've yet to experiment tobias comment on using curl_getinfo and if that too works well then my snippet will be a complete one. anyway thanks for the snippet.
9. DK Ping | February 7th, 2009 at 5:17 pm
This code works really nicely, thanks!
10. dmsr | May 10th, 2009 at 10:20 pm
Hi, great script! But how to test on a different port, like :8888 please? ex with this site http://papa.indstate.edu:8888/ the script return an empty array