Off Topic : Need Help with simple_html_dom
Posted:
Thu Jan 06, 2011 10:22 am
by gmcust3
I am using simple_html_dom.php.
I am stuck on how to parse the content below:
<div id="entry_4" class="entry clearfix "><div class="entry_title clearfix"><h1 class=" ">Smith J</h1></div><div class="full_listing"><div class="blocks"><div id="entry_4_block_0" class="block indent-level-0"><div class="share_link" wpol:entryId="719183066N00W" wpol:contactPointId="719183066N00W"><div class="save_menu"><div class="icon"></div></div><div class="share_menu"><div class="icon"></div></div><a class="screen_reader_only" rel="nofollow"
href="/mobile/send-to-mobile-accessible?entryId=719183066N00W&listingId=719183066N00W&searchType=R&channel=WP"
name="Smith">Send this listing to your mobile</a></div><span class="phone_number ">0457 599 539</span>
<div class="address"><span class="street_line">1 Martin Pl</span><span class="locality">Sydney</span><span class="state">NSW</span><span class="postcode">2000</span></div><a rel="nofollow"
class="show_map"
name="Smith"
href="/search/where-is?locality=Sydney&streetNumber=1&streetName=Martin&streetType=Pl&state=NSW&product=N00W%23719183066N00W%23Smith+J&channel=WP"
onclick="return false;">Show map...</a></div></div></div></div>
I am trying
- Code: Select all
if(!$html->find('div[id=entry_' .$i.']',0)==""){
echo "inside0000";
foreach($html->find('div[id=entry_' .$i.']') as $result){
$resultdata[]=array(
'name' => $result->find('h[class=" "]',0)->innertext,
'streetLine' => $result->find('span[class=street_line]',0)->innertext,
'locality' => $result->find('span[class=locality]',0)->innertext,
'state' => $result->find('span[class=state]',0)->innertext,
'postcode' => $result->find('span[class=postcode]',0)->innertext,
'phone' => $result->find('span[phone_number ]',0)->innertext
);
It reaches the echo and prints
inside0000
but doesn't parse the data.
Can anyone help me, please?
Posted:
Thu Jan 06, 2011 12:16 pm
by williamconley
I would think you would get a WHOLE LOT more help with this from a more appropriate venue.
Have you posted this on a php forum?
Also: the phrase "it doesn't parse" is extremely vague. It DOES parse, actually, it just does not populate the resultdata array as you would expect, right?
So TEST. Break it into smaller bits and use print_r statements within the code to show the true values, so you know whether the item being searched for is present and exactly what its structure is. Remember to echo BOTH the item being sought AND the item which should contain it (either one being incorrect will invalidate the function).
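A minimal sketch of that advice, assuming the page source is held in a plain string: print both the marker being searched for and the size of the text it should be in, then report where (or whether) it was found. The sample string and the `reportMarker()` helper are illustrative assumptions, not code from the thread.

```php
<?php
// Hypothetical stand-in for the page source; the real $content is not shown in the thread.
$content = '<div id="entry_4" class="entry clearfix "><h1 class=" ">Smith J</h1></div>';

// Print the marker being sought and report where it occurs in the haystack.
// Returns the offset, or false when the marker is absent.
function reportMarker($label, $haystack, $needle)
{
    $pos = strpos($haystack, $needle);
    echo $label . ': looking for ';
    var_dump($needle);
    echo $pos === false ? "  not found\n" : "  found at offset $pos\n";
    return $pos;
}

// Check the container first, then the item expected inside it.
echo 'Haystack is ' . strlen($content) . " bytes\n";
$entryPos = reportMarker('container', $content, 'id="entry_4"');
$namePos  = reportMarker('item', $content, '<h1');
```

If the container check already fails, there is no point debugging the lookups that depend on it.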
Posted:
Thu Jan 06, 2011 10:11 pm
by gmcust3
101% right.
That's what I am trying to achieve.
Actually, I am writing a script to extract data from WhitePages so that it can then be uploaded into the dialer.
When I tried
echo "Found:::" .$html->find('div[id=entry_' .$i.']',0);
It returned:
Found:::
which means it didn't return anything.
I tried a PHP forum but didn't get any reply in the last 3 days, so I posted here as off topic.
$html=str_get_html($content);
//echo $html;
$i=0;
echo "Found:::" .$html->find('div[id=entry_' .$i.']',0);
if(!$html->find('div[id=entry_' .$i.']',0)==""){
echo "inside";
foreach($html->find('div[id=entry_' .$i.']') as $result){
$resultdata[]=array(
'name' => $result->find('h[class=" "]',0)->innertext,
'streetLine' => $result->find('span[class=street_line]',0)->innertext,
'locality' => $result->find('span[class=locality]',0)->innertext,
'state' => $result->find('span[class=state]',0)->innertext,
'postcode' => $result->find('span[class=postcode]',0)->innertext,
'phone' => $result->find('span[phone_number ]',0)->innertext
);
Posted:
Thu Jan 06, 2011 10:55 pm
by williamconley
You probably didn't get a reply because your "find" function is not defined in your sample. It's not (AFAIK) a standard PHP function, so its definition has to be shown before anyone can tell how it should be used. Since it's not ... no one can answer the question.
Posted:
Thu Jan 06, 2011 11:03 pm
by gmcust3
- Code: Select all
function find($selector, $idx=null) {
$selectors = $this->parse_selector($selector);
if (($count=count($selectors))===0) return array();
$found_keys = array();
// find each selector
for ($c=0; $c<$count; ++$c) {
if (($levle=count($selectors[0]))===0) return array();
if (!isset($this->_[HDOM_INFO_BEGIN])) return array();
$head = array($this->_[HDOM_INFO_BEGIN]=>1);
// handle descendant selectors, no recursive!
for ($l=0; $l<$levle; ++$l) {
$ret = array();
foreach($head as $k=>$v) {
$n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k];
$n->seek($selectors[$c][$l], $ret);
}
$head = $ret;
}
foreach($head as $k=>$v) {
if (!isset($found_keys[$k]))
$found_keys[$k] = 1;
}
}
// sort keys
ksort($found_keys);
$found = array();
foreach($found_keys as $k=>$v)
$found[] = $this->dom->nodes[$k];
// return nth-element or array
if (is_null($idx)) return $found;
else if ($idx<0) $idx = count($found) + $idx;
return (isset($found[$idx])) ? $found[$idx] : null;
}
// seek for given conditions
protected function seek($selector, &$ret) {
list($tag, $key, $val, $exp, $no_key) = $selector;
// xpath index
if ($tag && $key && is_numeric($key)) {
$count = 0;
foreach ($this->children as $c) {
if ($tag==='*' || $tag===$c->tag) {
if (++$count==$key) {
$ret[$c->_[HDOM_INFO_BEGIN]] = 1;
return;
}
}
}
return;
}
$end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;
if ($end==0) {
$parent = $this->parent;
while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) {
$end -= 1;
$parent = $parent->parent;
}
$end += $parent->_[HDOM_INFO_END];
}
for($i=$this->_[HDOM_INFO_BEGIN]+1; $i<$end; ++$i) {
$node = $this->dom->nodes[$i];
$pass = true;
if ($tag==='*' && !$key) {
if (in_array($node, $this->children, true))
$ret[$i] = 1;
continue;
}
// compare tag
if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;}
// compare key
if ($pass && $key) {
if ($no_key) {
if (isset($node->attr[$key])) $pass=false;
}
else if (!isset($node->attr[$key])) $pass=false;
}
// compare value
if ($pass && $key && $val && $val!=='*') {
$check = $this->match($exp, $val, $node->attr[$key]);
// handle multiple class
if (!$check && strcasecmp($key, 'class')===0) {
foreach(explode(' ',$node->attr[$key]) as $k) {
$check = $this->match($exp, $val, $k);
if ($check) break;
}
}
if (!$check) $pass = false;
}
if ($pass) $ret[$i] = 1;
unset($node);
}
}
protected function match($exp, $pattern, $value) {
switch ($exp) {
case '=':
return ($value===$pattern);
case '!=':
return ($value!==$pattern);
case '^=':
return preg_match("/^".preg_quote($pattern,'/')."/", $value);
case '$=':
return preg_match("/".preg_quote($pattern,'/')."$/", $value);
case '*=':
if ($pattern[0]=='/')
return preg_match($pattern, $value);
return preg_match("/".$pattern."/i", $value);
}
return false;
}
It's part of the simple_html_dom.php file:
http://sourceforge.net/projects/simplehtmldom/
Posted:
Thu Jan 06, 2011 11:20 pm
by williamconley
And now that we have the "find" function coding ... we don't have the $content value that it failed to find the pattern in. Also, according to notes on the (unmaintained for 18 months) sourceforge package site, it does not support UTF8 among various other bugs.
You're trying to parse a white pages site to extract phone numbers to harvest for placement into Vicidial ... why don't you just use CURL and parse the results with standard php?
This is a fairly deep ravine to dig to fix someone else's Open Source package. Looks like fun.
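As a rough sketch of the cURL route, the helper below asks cURL to return the response body as a string, so it can be searched directly with standard string functions. `fetchPage()` is a hypothetical name and the option values are assumptions, not anything from the thread; it needs PHP's cURL extension.

```php
<?php
// Fetch a URL and return the response body as a string, or false on failure.
// fetchPage() is a hypothetical helper name used only for this sketch.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed limit, in seconds
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
    }
    // The handle is released when $ch goes out of scope at the end of the function.
    return $body;
}
```

It could be called as `$content = fetchPage($url);`, checking the result for `false` before handing it to any parsing code.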
Posted:
Fri Jan 07, 2011 12:09 am
by gmcust3
Thanks a lot, William, for showing me a new way. I had been fighting with this for the last 20 days.
- Code: Select all
<?php
ini_set('display_errors',true);//Just in case we get some errors, let us know....
// create a new cURL resource
$ch = curl_init();
$fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
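// fetch the URL; with CURLOPT_FILE set, the response body is written to a.txt
$data = curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
fclose($fp);//flush and close the output file before reading it back
$file = fopen(dirname(__FILE__) . '/a.txt', "r") or exit("Unable to open file!");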
while(!feof($file))
{
echo fgets($file). "<br />";
// $sline = fgets($file)
// echo preg_match("<h1 class=>", $sline);
// while(preg_match("<h1 class=>", fgets($file))
// {
// echo "I found It;"
// }
}
fclose($file);
?>
I can see the entire page in my browser.
When I save the page as HTML and open it in Notepad, I see lines like the ones below (basically one record):
- Code: Select all
<DIV class="entry clearfix " id=entry_9>
<DIV class="entry_title clearfix">
<H1 class=" ">Smith Colin P</H1></DIV>
<DIV class=full_listing>
<DIV class=blocks>
<DIV class="block indent-level-0" id=entry_9_block_0>
<DIV class=share_link wpol:contactPointId="711357117V00W"
wpol:entryId="711357117V00W">
<DIV class=save_menu>
<DIV class=icon></DIV></DIV>
<DIV class=share_menu>
<DIV class=icon></DIV></DIV><A class=screen_reader_only
href="http://192.168.0.2/mobile/send-to-mobile-accessible?entryId=711357117V00W&listingId=711357117V00W&searchType=R&channel=WP"
rel=nofollow name=Smith>Send this listing to your mobile</A></DIV>
<SPAN class="phone_number ">(03) 9650 4978</SPAN>
<DIV class=address><SPAN class=street_line>118 Russell St</SPAN>
<SPAN class=locality>Melbourne</SPAN><SPAN class=state>VIC</SPAN>
<SPAN class=postcode>3000</SPAN></DIV>
How can I extract the data from the fetched page itself using preg_match?
Posted:
Fri Jan 07, 2011 8:58 am
by williamconley
preg_match is one way. Substring searches with indexes are another.
You use a function that returns the position of the substring within the variable, then use that position to extract a substring.
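A minimal sketch of the substring approach, using `strpos()` to locate the markers and `substr()` to cut out what lies between them. The `textBetween()` helper name and the sample line are assumptions made for illustration.

```php
<?php
// Return the text between the first $start marker and the next $end marker,
// or null if either marker is missing.
// textBetween() is a hypothetical helper name.
function textBetween($haystack, $start, $end)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);              // skip past the start marker itself
    $to = strpos($haystack, $end, $from); // search for the end marker after that point
    if ($to === false) {
        return null;
    }
    // The third argument of substr() is a length, not an end position.
    return substr($haystack, $from, $to - $from);
}

$line = '<div class="entry_title clearfix"><h1 class=" ">Smith J</h1></div>';
$name = textBetween($line, '<h1 class=" ">', '</h1>');
echo $name, "\n"; // prints "Smith J"
```

Comparing the `strpos()` result with `=== false` matters, because a match at offset 0 is also falsy.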
Posted:
Fri Jan 07, 2011 10:29 am
by gmcust3
Substring looks easier than preg_match.
Let me fight with it!
Posted:
Sat Jan 08, 2011 1:45 am
by gmcust3
- Code: Select all
<?php
ini_set('display_errors',true);//Just in case we get some errors, let us know....
// create a new cURL resource
$ch = curl_init();
$fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
$data = curl_exec($ch);
//$file = fopen("a.txt", "r") or exit("Unable to open file!");
$fileTxt = "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
//echo "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
while(!feof($fileTxt))
{
// echo fgets($file);
// Get Value of First Name and Last Name
// $StrPoSt = strpos(fgets($fileTxt),"clearfix\"><h1 class=",0);
// $StrPoEnd = strpos(fgets($fileTxt),"</h1></div><div class",0);
// echo $StrPoSt ."- Start<br>";
// echo $StrPoEnd ."- End<br>";
// echo "Value-" . substr(fgets($file),$StrPoSt+22,$StrPoEnd-1) . "<br>";
}
fclose($fileTxt);
// close cURL resource, and free up system resources
curl_close($ch);
?>
At while(!feof($fileTxt)) it looks like it gets into an infinite loop, as my page hangs.
Is the problem that I am passing $fileTxt?
Posted:
Sat Jan 08, 2011 1:53 am
by williamconley
possibly just improper use of feof. I don't use that function, but it doesn't look like it belongs there to me.
Posted:
Sat Jan 08, 2011 1:56 am
by gmcust3
http://php.net/manual/en/function.feof.php
If the passed file pointer is not valid you may get an infinite loop, because feof() fails to return TRUE.
But my file pointer is
$fileTxt = "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
and I can do
echo $fileTxt;
Posted:
Sat Jan 08, 2011 2:02 am
by williamconley
on the other hand, if that can be "echo'd" that does not mean it's a valid FILE POINTER. echo does not require a file pointer to operate, it merely passes a string to stdout, right?
try it with a real filename from your HD one time, see if the loop dies.
perhaps you should hit it with an "is this a valid file pointer" function .. FIRST.
I love makin' work for other programmers.
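A small sketch of that check, assuming a throwaway temporary file so it runs on its own: `file_get_contents()` gives back a string, while `fopen()` gives back a stream handle, and `is_resource()` tells the two apart. Only the handle is something `feof()` and `fgets()` can work with.

```php
<?php
// Build a small temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'wp_');
file_put_contents($path, "line one\nline two\n");

$asString = file_get_contents($path); // a string: fine for echo, not for feof()/fgets()
$handle   = fopen($path, 'r');        // a stream handle: what feof()/fgets() expect

var_dump(is_resource($asString)); // bool(false)
var_dump(is_resource($handle));   // bool(true)

// Loop on the handle; stopping when fgets() returns false avoids relying on feof() alone.
$lines = array();
if (is_resource($handle)) {
    while (($line = fgets($handle)) !== false) {
        $lines[] = rtrim($line, "\r\n");
    }
    fclose($handle);
}
unlink($path);
print_r($lines);
```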
Posted:
Sat Jan 08, 2011 2:04 am
by gmcust3
Trying
<?php
if ($f = fopen('myfile.txt', 'r')) do {
$line = fgets($f);
// do any stuff here...
} while (!feof($f));
fclose($f);
Let's see...
Posted:
Sat Jan 08, 2011 2:07 am
by williamconley
in that case $f is a "struct" (a resource, in PHP terms) that holds information necessary to track the open file. certainly not a string. LOL
in fact: try echoing $f! bet that opens your eyes a bit. (then try print_r($f), that may be even more interesting if it works)
Posted:
Sat Jan 08, 2011 2:10 am
by gmcust3
I have saved the HTML content in a text file.
Now I want to read the file in a loop, extract the value between the tags, and display it.
- Code: Select all
<?php
/**
*
* @get text between tags
*
* @param string $tag The tag name
*
* @param string $html The XML or XHTML string
*
* @param int $strict Whether to use strict mode
*
* @return array
*
*/
function getTextBetweenTags($tag, $html, $strict=0)
{
/*** a new dom object ***/
$dom = new domDocument;
/*** load the html into the object ***/
if($strict==1)
{
$dom->loadXML($html);
}
else
{
$dom->loadHTML($html);
}
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the tag by its tag name ***/
$content = $dom->getElementsByTagname($tag);
/*** the array to return ***/
$out = array();
foreach ($content as $item)
{
/*** add node value to the out array ***/
$out[] = $item->nodeValue;
}
/*** return the results ***/
return $out;
}
?>
<?php
// create a new cURL resource
$ch = curl_init();
$fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
$data = curl_exec($ch);
$file = fopen("a.txt", "r") or exit("Unable to open file!");
//$fileTxt = "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
while(!feof($file))
{
$html = fgets($file);
$content = getTextBetweenTags('h1', $html);
$content1 = getTextBetweenTags('street_line', $html);
foreach( $content as $itemName )
{
echo "Name : " .$itemName.'<br />';
foreach( $content1 as $itemAdd)
{
echo "Address : " .$itemAdd.'<br />';
}
}
}
fclose($file);
// close cURL resource, and free up system resources
curl_close($ch);
?>
I am getting output for the name, as its tag looks like:
- Code: Select all
<h1 class=" ">Smith A B</h1>
How do I get the address, as there are many span tags?
- Code: Select all
<span class="street_line">398 Lonsdale St</span>
<span class="locality">Melbourne</span>
<span class="state">VIC</span>
<span class="postcode">3000</span>
Posted:
Sat Jan 08, 2011 9:06 am
by williamconley
The span tag identifies which part of the address it holds. You'll have to extract that first and use it to control the assignment of the associated text.
Lots of extra work (i.e., yep, that's programming).
Posted:
Sat Jan 08, 2011 9:29 am
by gmcust3
- Code: Select all
<div class="address"><span class="street_line">25 Spring St</span><span class="locality">Melbourne</span><span class="state">VIC</span><span class="postcode">3000</span></div>
This div has the address, but there are other divs as well.
I am just wondering how to extract only the address one.
Posted:
Sat Jan 08, 2011 9:34 am
by williamconley
the address one is a "wrapper" around the others.
address is not "one", it is "four": it is a div that contains four spans, each of which holds one properly identified line of the address.
you extract the address by finding its div statement and then locating the NEXT /div and extracting what is between them
then you extract the individual address lines from the included spans using the same method.
lots of fun :)
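A sketch of that two-step extraction, assuming the address markup looks exactly like the sample posted above: first cut out everything between the address div and the next closing div, then pull each span out of that block the same way. The `between()` helper is a hypothetical name; the approach only works here because the address div contains no nested div.

```php
<?php
// Return the text between the first $start marker and the next $end marker, or null.
// between() is a hypothetical helper name used only for this sketch.
function between($text, $start, $end)
{
    $from = strpos($text, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($text, $end, $from);
    return $to === false ? null : substr($text, $from, $to - $from);
}

// Sample markup copied from the thread.
$html = '<div class="address"><span class="street_line">25 Spring St</span>'
      . '<span class="locality">Melbourne</span><span class="state">VIC</span>'
      . '<span class="postcode">3000</span></div>';

// Step 1: the wrapper div. Step 2: each span inside it, keyed by its class name.
$address = array();
$block = between($html, '<div class="address">', '</div>');
if ($block !== null) {
    foreach (array('street_line', 'locality', 'state', 'postcode') as $field) {
        $address[$field] = between($block, '<span class="' . $field . '">', '</span>');
    }
}
print_r($address);
```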
Posted:
Sat Jan 08, 2011 9:44 am
by gmcust3
Thanks, William.
I understand that.
The primary one is <div class="address">
But before the address div there are many more divs.
Now, the getTextBetweenTags function above takes h1 or div. I tried sending "<div class=address" as $tag, but it didn't work.
Posted:
Sat Jan 08, 2011 9:57 am
by williamconley
It's probably not designed to select a tag by an attribute such as class, only by tag name.
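For reference, `getElementsByTagName()` matches on tag name only, but PHP's DOM extension also ships `DOMXPath`, which can filter on attributes. The sketch below is one possible way to select the spans inside the address div by class; the sample markup is assembled from snippets in the thread, and the exact-match test on `@class` is an assumption that would need `contains()` if the div carried several classes.

```php
<?php
// Sample markup assembled from snippets posted in the thread.
$html = '<div class="entry"><h1 class=" ">Smith A B</h1>'
      . '<div class="address"><span class="street_line">398 Lonsdale St</span>'
      . '<span class="locality">Melbourne</span><span class="state">VIC</span>'
      . '<span class="postcode">3000</span></div></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real pages are rarely valid HTML; collect warnings quietly
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$address = array();
// Every <span> directly inside a <div> whose class attribute is exactly "address".
foreach ($xpath->query('//div[@class="address"]/span') as $span) {
    $address[$span->getAttribute('class')] = trim($span->nodeValue);
}
print_r($address);
```

Using each span's class as the array key means the code does not have to know the order of the address parts in advance.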
Posted:
Sat Jan 08, 2011 10:20 am
by gmcust3
Working on another version:
- Code: Select all
<?php
ini_set('display_errors',true);//Just in case we get some errors, let us know....
// create a new cURL resource
$ch = curl_init();
$fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
curl_setopt($ch, CURLOPT_FILE, $fp);
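curl_setopt($ch, CURLOPT_HEADER, 0);
// fetch the URL into a.txt, then close the output file before reading it back
curl_exec($ch);
fclose($fp);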
$file = fopen("a.txt", "r") or exit("Unable to open file!");
while(!feof($file))
{
$regex="/clearfix\"><h1 class(.*)<\/h1><\/div><div class/";
preg_replace($regex,"",fgets($file));
}
fclose($file);
// close cURL resource, and free up system resources
curl_close($ch);
?>
Posted:
Wed Jan 12, 2011 9:52 pm
by gmcust3
Trying
document.getElementsByTagName
Let's see.