PHP Web Scraping Of Javascript Generated Contents
Solution 1:
Usually, this kind of page loads a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
So what you need to do is open that page in Firefox or a similar browser, with a tool such as Firebug (or the browser's built-in developer tools), to see which requests are actually being made. If you're lucky, you will find the one you need directly in the list of XHR requests, as in this case:
http://www.govliquidation.com/json/buyer_ux/salescalendar.js
Note that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.
Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems to be the case here) with very simple code:
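<?php
$url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";
$json = file_get_contents($url);
if ($json === false) die("Cannot retrieve $url");
$data = json_decode($json);
if ($data === null) die("Response is not valid JSON");
?>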
This yields a data object that you can inspect and convert to CSV with a simple loop:
stdClass Object
(
    [result] => stdClass Object
        (
            [events] => Array
                (
                    [0] => stdClass Object
                        (
                            [yahoo_dur] => 11300
                            [closing_today] => 0
                            [language_code] => en
                            [mixed_id] => 9297
                            [event_id] => 9297
                            [close_meridian] => PM
                            [commercial_sale_flag] => 0
                            [close_time] => 01/06/2014
                            [award_time_unixtime] => 1389070800
                            [category] => Tires, Parts & Components
                            [open_time_unixtime] => 1388638800
                            [yahoo_date] => 20140102T000000Z
                            [open_time] => 01/02/2014
                            [event_close_time] => 2014-01-06 17:00:00
                            [display_event_id] => 9297
                            [type_code] => X3
                            [title] => Truck Drive Axles @ Killeen, TX
                            [special_flag] => 1
                            [demil_flag] => 0
                            [google_close] => 20140106
                            [event_open_time] => 2014-01-02 00:00:00
                            [google_open] => 20140102
                            [third_party_url] =>
                            [bid_package_flag] => 0
                            [is_open] => 1
                            [fda_count] => 0
                            [close_time_unixtime] => 1389045600
You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.
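A minimal sketch of that last step, assuming the decoded structure matches the dump above. The helper name eventsToCsv() is hypothetical, and the inline JSON is an abbreviated one-event sample (a few fields copied from the dump) so the snippet runs without a network call; in practice $json would come from file_get_contents():

```php
<?php
// Hypothetical helper: turns the decoded events list into CSV text.
function eventsToCsv(array $events)
{
    // In-memory stream for the demo; use fopen('output.csv', 'w') for a real file.
    $fp = fopen('php://temp', 'w+');
    $headerWritten = false;
    foreach ($events as $event) {
        $row = (array) $event; // stdClass -> associative array
        if (!$headerWritten) {
            // The first event's property names become the header row.
            fputcsv($fp, array_keys($row), ',', '"', '\\');
            $headerWritten = true;
        }
        fputcsv($fp, array_values($row), ',', '"', '\\');
    }
    rewind($fp);
    $csv = stream_get_contents($fp);
    fclose($fp);
    return $csv;
}

// Abbreviated sample in the shape shown above (assumption: one event, five fields).
$json = '{"result":{"events":[{"event_id":"9297","category":"Tires, Parts & Components","title":"Truck Drive Axles @ Killeen, TX","open_time":"01/02/2014","close_time":"01/06/2014"}]}}';

$data = json_decode($json);
if ($data === null || !isset($data->result->events)) {
    die("Unexpected or invalid JSON");
}
echo eventsToCsv($data->result->events);
```

Note that fputcsv() quotes fields containing commas or spaces on its own, so values such as the category and title need no manual escaping. The sketch also assumes every event carries the same properties in the same order; if that does not hold for the real feed, build each row from a fixed list of keys instead.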
Solution 2:
In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.
By inspecting the source code you see something like this:
<tr>
<td> Allendale</td>
<td> Eastern Time
</td>
</tr>
<tr>
<td> Alpine</td>
<td> Eastern Time
</td>
So you just grab all the TRs:
<?php
include("simple_html_dom.php");

$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
if (!$html) die("Cannot fetch or parse the page");

$fp = fopen('output.csv', 'w');
if (!$fp) die("Cannot open output CSV - permission problems maybe?");

foreach ($html->find('tr') as $tr) {
    $csv = array(); // Start empty. A new CSV row for each TR.
    // Now find the TD children of $tr. They will make up a row.
    foreach ($tr->find('td') as $td) {
        // Get TD's innertext as-is; it still needs cleaning.
        $csv[] = $td->innertext;
    }
    if ($csv) fputcsv($fp, $csv); // Skip rows that have no TD cells at all.
}
fclose($fp);
?>
You will notice that the CSV text is "dirty". That is because the actual text is:
<td> Alpine</td>
<td> Eastern Time[CARRIAGE RETURN HERE]
</td>
So to have "Alpine" and "Eastern Time", you have to replace
$csv[] = $td->innertext;
with something like
$csv[] = trim(
    html_entity_decode(
        $td->innertext,
        ENT_COMPAT | ENT_HTML401,
        'UTF-8'
    )
);
Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work -- and an ought and fifty cents will get you a cup of coffee :-)
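One thing that may bite here: if the leading blank in a cell is really a &nbsp; entity, it decodes to U+00A0, which PHP's trim() does not remove by default. A hedged sketch of a stricter cleaning step follows; the helper name cleanCell() and the sample strings are assumptions for illustration, not something taken from the page or from simple_html_dom:

```php
<?php
// Hypothetical helper: decode HTML entities, then strip whitespace
// and non-breaking spaces from both ends of the text.
function cleanCell($text)
{
    $decoded = html_entity_decode($text, ENT_COMPAT | ENT_HTML401, 'UTF-8');
    // \s covers spaces, tabs and line breaks; \x{00A0} is what &nbsp; decodes to.
    // The /u modifier makes the pattern work on UTF-8 characters, not bytes.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $decoded);
}

// Illustrative cell contents modelled on the table rows above.
$cells = array(" Alpine", " Eastern Time\r\n", "&nbsp;Allendale", "Tires &amp; Parts");
foreach ($cells as $cell) {
    echo '[' . cleanCell($cell) . "]\n";
}
```

In the scraping loop you would then write $csv[] = cleanCell($td->innertext); in place of the plain assignment. A regular expression is used rather than passing the UTF-8 bytes of U+00A0 to trim(), because trim() works byte by byte and could cut a multibyte character such as a trailing "à" in half.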