cURL is a tool for transferring files and data with URL syntax, supporting many protocols including HTTP, FTP, TELNET and more. Initially, cURL was designed to be a command line tool. Lucky for us, the cURL library is also supported by PHP. In this article, we will look at some of the advanced features of cURL, and how we can use them in our PHP scripts.

 

Why cURL?

It’s true that there are other ways of fetching the contents of a web page. Many times, mostly due to laziness, I have just used simple PHP functions instead of cURL:

$content = file_get_contents("http://www.nettuts.com");
// or
$lines = file("http://www.nettuts.com");
// or
readfile("http://www.nettuts.com");

However, they offer virtually no flexibility and lack proper error handling. Also, there are certain tasks that you simply cannot do with them, such as dealing with cookies, authentication, form posts, and file uploads.
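To see how thin that error handling is: on failure, file_get_contents() returns FALSE (plus a PHP warning) and nothing else. A small sketch (the path is deliberately bogus to force a failure):

```php
<?php
// file_get_contents() returns FALSE on failure (plus a PHP warning) --
// there is no status code, error message, or timing info to inspect
$content = @file_get_contents("/no/such/file");
if ($content === FALSE) {
    echo "request failed, but we cannot tell why\n";
}
```

With cURL, as we will see below, the same failure gives you an error message, an HTTP status code, and timing details.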

cURL is a powerful library that supports many different protocols, options, and provides detailed information about the URL requests.

Basic Structure

Before we move on to more complicated examples, let’s review the basic structure of a cURL request in PHP. There are four main steps:

  1. Initialize
  2. Set Options
  3. Execute and Fetch Result
  4. Free up the cURL handle
// 1. initialize
$ch = curl_init();
// 2. set the options, including the url
curl_setopt($ch, CURLOPT_URL, "http://www.nettuts.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
// 3. execute and fetch the resulting HTML output
$output = curl_exec($ch);
// 4. free up the curl handle
curl_close($ch);

Step #2 (i.e. curl_setopt() calls) is going to be a big part of this article, because that is where all the magic happens. There is a long list of cURL options that can be set, which can configure the URL request in detail. It might be difficult to go through the whole list and digest it all at once. So today, we are just going to use some of the more common and useful options in various code examples.
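If you find yourself writing many curl_setopt() calls, PHP also provides curl_setopt_array(), which sets a whole batch of options in one call. A minimal sketch of the same basic request using that approach (the URL is just the example from above):

```php
<?php
// same request as the four steps above, but with the options set in one call
$ch = curl_init();
$ok = curl_setopt_array($ch, array(
    CURLOPT_URL            => "http://www.nettuts.com",
    CURLOPT_RETURNTRANSFER => 1,  // return the output instead of printing it
    CURLOPT_HEADER         => 0,  // keep the HTTP headers out of the output
));
// curl_setopt_array() returns false if any option could not be set,
// so you can bail out before ever calling curl_exec($ch)
curl_close($ch);
```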

Checking for Errors

Optionally, you can also add error checking:

// ...
$output = curl_exec($ch);
if ($output === FALSE) {
    echo "cURL Error: " . curl_error($ch);
}
// ...

Please note that we need to use "=== FALSE" for the comparison instead of "== FALSE", because we need to distinguish between empty output and the boolean value FALSE, which indicates an error.

Getting Information

Another optional step is to get information about the cURL request, after it has been executed.

// ...
curl_exec($ch);
$info = curl_getinfo($ch);
echo 'Took ' . $info['total_time'] . ' seconds for url ' . $info['url'];
// ...

The following information is included in the returned array:

  • “url”
  • “content_type”
  • “http_code”
  • “header_size”
  • “request_size”
  • “filetime”
  • “ssl_verify_result”
  • “redirect_count”
  • “total_time”
  • “namelookup_time”
  • “connect_time”
  • “pretransfer_time”
  • “size_upload”
  • “size_download”
  • “speed_download”
  • “speed_upload”
  • “download_content_length”
  • “upload_content_length”
  • “starttransfer_time”
  • “redirect_time”
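As a quick illustration, a small helper (hypothetical, not part of the cURL API) can pull a few of these fields into a one-line summary:

```php
<?php
// format a few of the curl_getinfo() fields into a readable summary line
function curl_info_summary(array $info) {
    return sprintf(
        "%s -> HTTP %d, %.3f seconds, %d bytes",
        $info['url'],
        $info['http_code'],
        $info['total_time'],
        $info['size_download']
    );
}

// usage, after curl_exec($ch):
//   echo curl_info_summary(curl_getinfo($ch));
```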

Detect Redirection Based on Browser

In this first example, we will write a script that can detect URL redirections based on different browser settings. For example, some websites redirect cellphone browsers, or even surfers from different countries.

We are going to be using the CURLOPT_HTTPHEADER option to set our outgoing HTTP Headers including the user agent string and the accepted languages. Finally we will check to see if these websites are trying to redirect us to different URLs.

// test URLs
$urls = array(
    "http://www.cnn.com",
    "http://www.mozilla.com",
    "http://www.facebook.com"
);
// test browsers
$browsers = array(
    "standard" => array (
        "user_agent" => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)",
        "language" => "en-us,en;q=0.5"
    ),
    "iphone" => array (
        "user_agent" => "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A537a Safari/419.3",
        "language" => "en"
    ),
    "french" => array (
        "user_agent" => "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 2.0.50727)",
        "language" => "fr,fr-FR;q=0.5"
    )
);

foreach ($urls as $url) {
    echo "URL: $url\n";
    foreach ($browsers as $test_name => $browser) {
        $ch = curl_init();
        // set url
        curl_setopt($ch, CURLOPT_URL, $url);
        // set browser specific headers
        curl_setopt($ch, CURLOPT_HTTPHEADER, array(
            "User-Agent: {$browser['user_agent']}",
            "Accept-Language: {$browser['language']}"
        ));
        // we don't want the page contents
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        // we need the HTTP Header returned
        curl_setopt($ch, CURLOPT_HEADER, 1);
        // return the results instead of outputting it
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $output = curl_exec($ch);
        curl_close($ch);
        // was there a redirection HTTP header?
        if (preg_match("!Location: (.*)!", $output, $matches)) {
            echo "$test_name: redirects to $matches[1]\n";
        } else {
            echo "$test_name: no redirection\n";
        }
    }
    echo "\n\n";
}

First we have a set of URLs to test, followed by a set of browser settings to test each of these URLs against. Then we loop through these test cases and make a cURL request for each.

Because of the way we set up the cURL options, the returned output will contain only the HTTP headers (saved in $output). With a simple regex, we can check whether a "Location:" header was included.

When you run this script, you should get an output like this:

[screenshot: sample script output]

POSTing to a URL

On a GET request, data can be sent to a URL via the “query string”. For example, when you do a search on Google, the search term is located in the query string part of the URL:

http://www.google.com/search?q=nettuts

You may not need cURL to simulate this in a web script. You can just be lazy and hit that url with “file_get_contents()” to receive the results.
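If you do build GET URLs by hand, http_build_query() saves you from escaping the parameters yourself. A quick sketch, using the Google example above:

```php
<?php
// build a query string from an array; values are urlencoded for you
$params = array("q" => "nettuts");
$url = "http://www.google.com/search?" . http_build_query($params);
echo $url; // http://www.google.com/search?q=nettuts
```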

But some HTML forms are set to the POST method. When these forms are submitted through the browser, the data is sent via the HTTP Request body, rather than the query string. For example, if you do a search on the CodeIgniter forums, you will be POSTing your search query to:

http://codeigniter.com/forums/do_search/

We can write a PHP script to simulate this kind of URL request. First let’s create a simple file for accepting and displaying the POST data. Let’s call it post_output.php:

print_r($_POST);

Next we create a PHP script to perform a cURL request:

$url = "http://localhost/post_output.php";
$post_data = array (
    "foo" => "bar",
    "query" => "Nettuts",
    "action" => "Submit"
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// we are doing a POST request
curl_setopt($ch, CURLOPT_POST, 1);
// adding the post variables to the request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
$output = curl_exec($ch);
curl_close($ch);
echo $output;

When you run this script, you should get an output like this:

[screenshot: sample script output]

It sent a POST to the post_output.php script, which dumped the $_POST variable, and we captured that output via cURL.
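One detail worth knowing: when CURLOPT_POSTFIELDS receives an array, cURL sends the request as multipart/form-data. If you need a regular urlencoded form post instead, pass a query string built with http_build_query(). A sketch using the same data:

```php
<?php
$ch = curl_init();
$post_data = array(
    "foo"    => "bar",
    "query"  => "Nettuts",
    "action" => "Submit"
);
// passing a string ("foo=bar&query=Nettuts&action=Submit") makes cURL send
// application/x-www-form-urlencoded instead of multipart/form-data
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data));
curl_close($ch);
```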

File Upload

Uploading files works very similarly to the previous POST example, since all file upload forms have the POST method.

First let’s create a file for receiving the request and call it upload_output.php:

print_r($_FILES);

And here is the actual script performing the file upload:

$url = "http://localhost/upload_output.php";
$post_data = array (
    "foo" => "bar",
    // file to be uploaded
    "upload" => "@C:/wamp/www/test.zip"
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
$output = curl_exec($ch);
curl_close($ch);
echo $output;

When you want to upload a file, all you have to do is pass its file path just like a post variable, and put the @ symbol in front of it. Now when you run this script you should get an output like this:

[screenshot: sample script output]
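A caveat for newer PHP versions: the "@" prefix was how uploads worked when this article was written, but on PHP 5.5 and later it is deprecated (and disabled by default) in favor of the CURLFile class. A minimal sketch, assuming the same test.zip path:

```php
<?php
// PHP 5.5+ upload: a CURLFile object replaces the "@" prefix
$post_data = array(
    "foo"    => "bar",
    // arguments: path on disk, mime type, filename reported to the server
    "upload" => new CURLFile("C:/wamp/www/test.zip", "application/zip", "test.zip")
);
// then, exactly as before:
//   curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
```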

Multi cURL

One of the more advanced features of cURL is the ability to create a “multi” cURL handle. This allows you to open connections to multiple URLs simultaneously and asynchronously.

On a regular cURL request, the script execution stops and waits for the URL request to finish before it can continue. If you intend to hit multiple URLs, this can take a long time, as you can only request one URL at a time. We can overcome this limitation by using the multi handle.

Let’s look at this sample code from php.net:

// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();
// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "http://lxr.php.net/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch2, CURLOPT_HEADER, 0);
// create the multi cURL handle
$mh = curl_multi_init();
// add the two handles
curl_multi_add_handle($mh, $ch1);
curl_multi_add_handle($mh, $ch2);
$active = null;
// execute the handles
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}
// close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);

The idea is that you can open multiple cURL handles and assign them to a single multi handle. Then you can wait for them to finish executing while in a loop.

There are two main loops in this example. The first do-while loop repeatedly calls curl_multi_exec(). This function is non-blocking: it does as much work as it can without waiting, then returns a status value. As long as the returned value is the constant CURLM_CALL_MULTI_PERFORM, there is still immediate work to do (for example, sending HTTP headers to the URLs), so we keep calling it until the return value is something else.

In the following while loop, we continue as long as the $active variable is true. It was passed as the second argument to the curl_multi_exec() call, and it is set to true as long as there are active connections within the multi handle. Next, we call curl_multi_select(), which blocks until there is connection activity, such as receiving a response. When that happens, we go into yet another do-while loop to continue executing.

Let’s see if we can create a working example ourselves, that has a practical purpose.

WordPress Link Checker

Imagine a blog with many posts containing links to external websites. Some of these links might end up dead after a while, for various reasons: maybe the page is no longer there, or the entire website is gone.

We are going to build a script that analyzes all the links, finds non-loading websites and 404 pages, and returns a report to us.

Note that this is not going to be an actual WordPress plug-in. It is only a standalone utility script, and it is just for demonstration purposes.

So let’s get started. First we need to fetch the links from the database:

// CONFIG
$db_host = 'localhost';
$db_user = 'root';
$db_pass = '';
$db_name = 'wordpress';
$excluded_domains = array('localhost', 'www.mydomain.com');
$max_connections = 10;
// initialize some variables
$url_list = array();
$working_urls = array();
$dead_urls = array();
$not_found_urls = array();
$active = null;
// connect to MySQL
if (!mysql_connect($db_host, $db_user, $db_pass)) {
    die('Could not connect: ' . mysql_error());
}
if (!mysql_select_db($db_name)) {
    die('Could not select db: ' . mysql_error());
}
// get all published posts that have links
$q = "SELECT post_content FROM wp_posts
    WHERE post_content LIKE '%href=%'
    AND post_status = 'publish'
    AND post_type = 'post'";
$r = mysql_query($q) or die(mysql_error());
while ($d = mysql_fetch_assoc($r)) {
    // get all links via regex
    if (preg_match_all("!href=\"(.*?)\"!", $d['post_content'], $matches)) {
        foreach ($matches[1] as $url) {
            // exclude some domains
            $tmp = parse_url($url);
            if (in_array($tmp['host'], $excluded_domains)) {
                continue;
            }
            // store the url
            $url_list []= $url;
        }
    }
}
// remove duplicates
$url_list = array_values(array_unique($url_list));
if (!$url_list) {
    die('No URL to check');
}

First we have some database configuration, followed by an array of domain names we will ignore ($excluded_domains). Also we set a number for maximum simultaneous connections we will be using later ($max_connections). Then we connect to the database, fetch posts that contain links, and collect them into an array ($url_list).

The following code might be a little complex, so I will try to explain it in small steps.

// 1. multi handle
$mh = curl_multi_init();
// 2. add multiple URLs to the multi handle
for ($i = 0; $i < $max_connections; $i++) {
    add_url_to_multi_handle($mh, $url_list);
}
// 3. initial execution
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
// 4. main loop
while ($active && $mrc == CURLM_OK) {
    // 5. there is activity
    if (curl_multi_select($mh) != -1) {
        // 6. do work
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        // 7. is there info?
        if ($mhinfo = curl_multi_info_read($mh)) {
            // this means one of the requests was finished
            // 8. get the info on the curl handle
            $chinfo = curl_getinfo($mhinfo['handle']);
            // 9. dead link?
            if (!$chinfo['http_code']) {
                $dead_urls []= $chinfo['url'];
            // 10. 404?
            } else if ($chinfo['http_code'] == 404) {
                $not_found_urls []= $chinfo['url'];
            // 11. working
            } else {
                $working_urls []= $chinfo['url'];
            }
            // 12. remove the handle
            curl_multi_remove_handle($mh, $mhinfo['handle']);
            curl_close($mhinfo['handle']);
            // 13. add a new url and do work
            if (add_url_to_multi_handle($mh, $url_list)) {
                do {
                    $mrc = curl_multi_exec($mh, $active);
                } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            }
        }
    }
}
// 14. finished
curl_multi_close($mh);
echo "==Dead URLs==\n";
echo implode("\n", $dead_urls) . "\n\n";
echo "==404 URLs==\n";
echo implode("\n", $not_found_urls) . "\n\n";
echo "==Working URLs==\n";
echo implode("\n", $working_urls);

// 15. adds a url to the multi handle
function add_url_to_multi_handle($mh, $url_list) {
    static $index = 0;
    // if we have another url to get
    if (isset($url_list[$index])) {
        // new curl handle
        $ch = curl_init();
        // set the url
        curl_setopt($ch, CURLOPT_URL, $url_list[$index]);
        // to prevent the response from being outputted
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // follow redirections
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        // we do not need the body; this saves bandwidth and time
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        // add it to the multi handle
        curl_multi_add_handle($mh, $ch);
        // increment so the next url is used next time
        $index++;
        return true;
    } else {
        // we are done adding new URLs
        return false;
    }
}

And here is the explanation for the code above. Numbers in the list correspond to the numbers in the code comments.

  1. Created a multi handle.
  2. We will be creating the add_url_to_multi_handle() function later on. Every time it is called, it will add a url to the multi handle. Initially, we add 10 (based on $max_connections) URLs to the multi handle.
  3. We must run curl_multi_exec() for the initial work. As long as it returns CURLM_CALL_MULTI_PERFORM, there is work to do. This is mainly for creating the connections. It does not wait for the full URL response.
  4. This main loop runs as long as there is some activity in the multi handle.
5. curl_multi_select() blocks the script until there is activity on any of the URL requests.
  6. Again we must let cURL do some work, mainly for fetching response data.
  7. We check for info. There is an array returned if a URL request was finished.
  8. There is a cURL handle in the returned array. We use that to fetch info on the individual cURL request.
  9. If the link was dead or timed out, there will be no http code.
  10. If the link was a 404 page, the http code will be set to 404.
  11. Otherwise we assume it was a working link. (You may add additional checks for 500 error codes etc…)
  12. We remove the cURL handle from the multi handle since it is no longer needed, and close it.
  13. We can now add another url to the multi handle, and again do the initial work before moving on.
  14. Everything is finished. We can close the multi handle and print a report.
  15. This is the function that adds a new url to the multi handle. The static variable $index is incremented every time this function is called, so we can keep track of where we left off.
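Steps 9 through 11 boil down to a small classification rule, which could be factored into a helper like this (a hypothetical refactor, not part of the script above):

```php
<?php
// classify a finished request by the http_code from curl_getinfo()
function classify_url($http_code) {
    if (!$http_code) {
        return 'dead';      // no response at all (timeout, DNS failure, ...)
    } else if ($http_code == 404) {
        return 'not_found';
    }
    return 'working';       // anything else; add 500-range checks as needed
}
```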

I ran the script on my blog (with some broken links added on purpose, for testing), and here is what it looked like:

[screenshot: link checker report]

It took less than 2 seconds to go through about 40 URLs, and the performance gains are even more significant with larger sets of URLs. If you open ten connections at the same time, the job can run up to ten times faster. You can also utilize the non-blocking nature of the multi cURL handle to perform URL requests without stalling your web script.

Some Other Useful cURL Options

HTTP Authentication

If there is HTTP based authentication on a URL, you can use this:

$url = "http://www.somesite.com/members/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// send the username and password
curl_setopt($ch, CURLOPT_USERPWD, "myusername:mypassword");
// if you allow redirections
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// this lets cURL keep sending the username and password
// after being redirected
curl_setopt($ch, CURLOPT_UNRESTRICTED_AUTH, 1);
$output = curl_exec($ch);
curl_close($ch);
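By default, CURLOPT_USERPWD is sent using Basic authentication. If the server might require Digest or NTLM instead, the CURLOPT_HTTPAUTH option lets cURL negotiate the method. A sketch (the URL and credentials are just the placeholders from above):

```php
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.somesite.com/members/");
curl_setopt($ch, CURLOPT_USERPWD, "myusername:mypassword");
// let cURL negotiate Basic, Digest, NTLM -- whatever the server offers
$ok = curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
// ... then curl_exec($ch) as in the example above
curl_close($ch);
```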

FTP Upload

PHP does have an FTP library, but you can also use cURL:

// open a file pointer
$fp = fopen("/path/to/file", "r");
// the url contains most of the info needed
$url = "ftp://username:[email protected]:21/path/to/new/file";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// upload related options
curl_setopt($ch, CURLOPT_UPLOAD, 1);
curl_setopt($ch, CURLOPT_INFILE, $fp);
curl_setopt($ch, CURLOPT_INFILESIZE, filesize("/path/to/file"));
// set for ASCII mode (e.g. text files)
curl_setopt($ch, CURLOPT_FTPASCII, 1);
$output = curl_exec($ch);
curl_close($ch);

Using a Proxy

You can perform your URL request through a proxy:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// set the proxy address to use
curl_setopt($ch, CURLOPT_PROXY, '11.11.11.11:8080');
// if the proxy requires a username and password
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'user:pass');
$output = curl_exec($ch);
curl_close($ch);

Callback Functions

It is possible to have cURL call a given callback function during the URL request, before it is finished. For example, as the contents of the response are being downloaded, you can start using the data without waiting for the whole download to complete.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://net.tutsplus.com');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, "progress_function");
curl_exec($ch);
curl_close($ch);

function progress_function($ch, $str) {
    echo $str;
    return strlen($str);
}

The callback function MUST return the length of the string, which is a requirement for this to work properly.

As the URL response is being fetched, the callback function is called every time a data packet is received.
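You can also use that return-value rule deliberately: if the callback returns a length different from the one it received, cURL treats it as a write error and aborts the transfer. That lets you, for example, stop downloading once you have seen enough data. A sketch (the 1024-byte cutoff is arbitrary):

```php
<?php
$GLOBALS['buffer'] = '';

// collect response data, but tell cURL to abort once we have 1024 bytes
function limited_write($ch, $str) {
    $GLOBALS['buffer'] .= $str;
    if (strlen($GLOBALS['buffer']) >= 1024) {
        return 0; // mismatched length = abort the transfer
    }
    return strlen($str);
}

// usage:
//   curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'limited_write');
//   curl_exec($ch); // returns false once the callback aborts
```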

Conclusion

We have explored the power and the flexibility of the cURL library today. I hope you enjoyed and learned from this article. Next time you need to make a URL request in your web application, consider using cURL.

Thank you and have a great day!

Write a Plus Tutorial

We're looking for in-depth and well-written tutorials on HTML, CSS, PHP, and JavaScript. If you have the ability, please contact Jeffrey at [email protected]

Please note that actual compensation will be dependent upon the quality of the final tutorial and screencast.
