vicidial.org

Posted: **Thu Jan 06, 2011 10:22 am**

I am using simple_html_dom.php

I am stuck on how to parse the content below:

<div id="entry_4" class="entry clearfix "><div class="entry_title clearfix"><h1 class=" ">Smith J</h1></div><div class="full_listing"><div class="blocks"><div id="entry_4_block_0" class="block indent-level-0"><div class="share_link" wpol:entryId="719183066N00W" wpol:contactPointId="719183066N00W"><div class="save_menu"><div class="icon"></div></div><div class="share_menu"><div class="icon"></div></div><a class="screen_reader_only" rel="nofollow"
href="/mobile/send-to-mobile-accessible?entryId=719183066N00W&listingId=719183066N00W&searchType=R&channel=WP"
name="Smith">Send this listing to your mobile</a></div><span class="phone_number ">0457 599 539</span>
<div class="address"><span class="street_line">1 Martin Pl</span><span class="locality">Sydney</span><span class="state">NSW</span><span class="postcode">2000</span></div><a rel="nofollow"
class="show_map"
name="Smith"
href="/search/where-is?locality=Sydney&streetNumber=1&streetName=Martin&streetType=Pl&state=NSW&product=N00W%23719183066N00W%23Smith+J&channel=WP"
onclick="return false;">Show map...</a></div></div></div></div>

I am trying

Code: Select all:

    if(!$html->find('div[id=entry_' .$i.']',0)==""){
        echo "inside0000";
        foreach($html->find('div[id=entry_' .$i.']') as $result){
            $resultdata[]=array(
                'name' => $result->find('h[class=" "]',0)->innertext,
                'streetLine' => $result->find('span[class=street_line]',0)->innertext,
                'locality' => $result->find('span[class=locality]',0)->innertext,
                'state' => $result->find('span[class=state]',0)->innertext,
                'postcode' => $result->find('span[class=postcode]',0)->innertext,
                'phone' => $result->find('span[phone_number ]',0)->innertext
            );

It gets into

inside0000

but doesn't parse the data.

Can anyone help me, please?

Posted: **Thu Jan 06, 2011 12:16 pm**

I would think you would get a WHOLE LOT more help with this from a more appropriate venue.

Have you posted this on a php forum?

Also: the phrase "it doesn't parse" is extremely vague. It DOES parse, actually; it just does not populate the resultdata array as you would expect, right?

So TEST. Break it into smaller bits and use print_r statements within the code to show the true values, so you know whether the item being searched for is present and exactly what its structure is. Remember to echo BOTH the item being sought AND the item which should contain it (either one being incorrect will invalidate the function).

Posted: **Thu Jan 06, 2011 10:11 pm**

101% right.

That's what I am trying to achieve.

Actually, I am writing a script that extracts data from WhitePages so it can then be uploaded into the dialer.

When I tried

echo "Found:::" .$html->find('div[id=entry_' .$i.']',0);

It returned:

Found:::

Meaning it didn't return anything.

I tried a PHP forum but didn't get any reply for the last 3 days, so I posted here as Off Topic.

    $html=str_get_html($content);
    //echo $html;

    $i=0;
    echo "Found:::" .$html->find('div[id=entry_' .$i.']',0);
    if(!$html->find('div[id=entry_' .$i.']',0)==""){
        echo "inside";
        foreach($html->find('div[id=entry_' .$i.']') as $result){
            $resultdata[]=array(
                'name' => $result->find('h[class=" "]',0)->innertext,
                'streetLine' => $result->find('span[class=street_line]',0)->innertext,
                'locality' => $result->find('span[class=locality]',0)->innertext,
                'state' => $result->find('span[class=state]',0)->innertext,
                'postcode' => $result->find('span[class=postcode]',0)->innertext,
                'phone' => $result->find('span[phone_number ]',0)->innertext
            );

Posted: **Thu Jan 06, 2011 10:55 pm**

You probably didn't get a reply because your "find" function is not defined in your sample. It's not (AFAIK) a standard PHP function, so it must be defined in your code for anyone to have any idea how to use it. Since it's not ... no one can answer the question.

Posted: **Thu Jan 06, 2011 11:03 pm**

Code: Select all:

    function find($selector, $idx=null) {
        $selectors = $this->parse_selector($selector);
        if (($count=count($selectors))===0) return array();
        $found_keys = array();

        // find each selector
        for ($c=0; $c<$count; ++$c) {
            if (($levle=count($selectors[0]))===0) return array();
            if (!isset($this->_[HDOM_INFO_BEGIN])) return array();

            $head = array($this->_[HDOM_INFO_BEGIN]=>1);

            // handle descendant selectors, no recursive!
            for ($l=0; $l<$levle; ++$l) {
                $ret = array();
                foreach($head as $k=>$v) {
                    $n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k];
                    $n->seek($selectors[$c][$l], $ret);
                }
                $head = $ret;
            }

            foreach($head as $k=>$v) {
                if (!isset($found_keys[$k]))
                    $found_keys[$k] = 1;
            }
        }

        // sort keys
        ksort($found_keys);

        $found = array();
        foreach($found_keys as $k=>$v)
            $found[] = $this->dom->nodes[$k];

        // return nth-element or array
        if (is_null($idx)) return $found;
        else if ($idx<0) $idx = count($found) + $idx;
        return (isset($found[$idx])) ? $found[$idx] : null;
    }

    // seek for given conditions
    protected function seek($selector, &$ret) {
        list($tag, $key, $val, $exp, $no_key) = $selector;

        // xpath index
        if ($tag && $key && is_numeric($key)) {
            $count = 0;
            foreach ($this->children as $c) {
                if ($tag==='*' || $tag===$c->tag) {
                    if (++$count==$key) {
                        $ret[$c->_[HDOM_INFO_BEGIN]] = 1;
                        return;
                    }
                }
            }
            return;
        }

        $end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;
        if ($end==0) {
            $parent = $this->parent;
            while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) {
                $end -= 1;
                $parent = $parent->parent;
            }
            $end += $parent->_[HDOM_INFO_END];
        }

        for($i=$this->_[HDOM_INFO_BEGIN]+1; $i<$end; ++$i) {
            $node = $this->dom->nodes[$i];
            $pass = true;

            if ($tag==='*' && !$key) {
                if (in_array($node, $this->children, true))
                    $ret[$i] = 1;
                continue;
            }

            // compare tag
            if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;}
            // compare key
            if ($pass && $key) {
                if ($no_key) {
                    if (isset($node->attr[$key])) $pass=false;
                }
                else if (!isset($node->attr[$key])) $pass=false;
            }
            // compare value
            if ($pass && $key && $val && $val!=='*') {
                $check = $this->match($exp, $val, $node->attr[$key]);
                // handle multiple class
                if (!$check && strcasecmp($key, 'class')===0) {
                    foreach(explode(' ',$node->attr[$key]) as $k) {
                        $check = $this->match($exp, $val, $k);
                        if ($check) break;
                    }
                }
                if (!$check) $pass = false;
            }
            if ($pass) $ret[$i] = 1;
            unset($node);
        }
    }

    protected function match($exp, $pattern, $value) {
        switch ($exp) {
            case '=':
                return ($value===$pattern);
            case '!=':
                return ($value!==$pattern);
            case '^=':
                return preg_match("/^".preg_quote($pattern,'/')."/", $value);
            case '$=':
                return preg_match("/".preg_quote($pattern,'/')."$/", $value);
            case '*=':
                if ($pattern[0]=='/')
                    return preg_match($pattern, $value);
                return preg_match("/".$pattern."/i", $value);
        }
        return false;
    }

It's part of the simple_html_dom.php file:

http://sourceforge.net/projects/simplehtmldom/

Posted: **Thu Jan 06, 2011 11:20 pm**

And now that we have the "find" function coding ... we don't have the $content value that it failed to find the pattern in. Also, according to notes on the (unmaintained for 18 months) sourceforge package site, it does not support UTF8 among various other bugs.

You're trying to parse a white pages site to extract phone numbers to harvest for placement into Vicidial ... why don't you just use CURL and parse the results with standard php?
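One way to read "parse the results with standard php" is PHP's bundled DOM extension. The sketch below is only an assumption about how that could look, not something posted in this thread: it loads a trimmed copy of the sample listing into `DOMDocument` and queries it with `DOMXPath`. The helper names `hasClass()` and `firstText()` are made up for illustration. Note that it asks for `h1` (the earlier selector asked for `h`, which is not a tag in the sample) and matches `phone_number` as a class token (the earlier `span[phone_number ]` appears to test for an attribute of that name). Those two selectors may be why the array never filled, though that has not been verified against simple_html_dom here.

```php
<?php
// Sample markup trimmed from the listing posted earlier in this thread.
$content = '<div id="entry_4" class="entry clearfix ">'
    . '<div class="entry_title clearfix"><h1 class=" ">Smith J</h1></div>'
    . '<span class="phone_number ">0457 599 539</span>'
    . '<div class="address"><span class="street_line">1 Martin Pl</span>'
    . '<span class="locality">Sydney</span><span class="state">NSW</span>'
    . '<span class="postcode">2000</span></div></div>';

// Hypothetical helper: XPath test for "class attribute contains this token".
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Hypothetical helper: text of the first match under $context, or null.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : null;
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$dom->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// One row per <div id="entry_N" class="entry ...">.
$entryQuery = "//div[starts-with(@id, 'entry_') and " . hasClass('entry') . "]";

$resultdata = array();
foreach ($xpath->query($entryQuery) as $entry) {
    $resultdata[] = array(
        'name'       => firstText($xpath, './/h1', $entry),
        'streetLine' => firstText($xpath, './/span[' . hasClass('street_line') . ']', $entry),
        'locality'   => firstText($xpath, './/span[' . hasClass('locality') . ']', $entry),
        'state'      => firstText($xpath, './/span[' . hasClass('state') . ']', $entry),
        'postcode'   => firstText($xpath, './/span[' . hasClass('postcode') . ']', $entry),
        'phone'      => firstText($xpath, './/span[' . hasClass('phone_number') . ']', $entry),
    );
}

print_r($resultdata);
```

Because `firstText()` returns null when nothing matches, a missing field shows up as an empty slot in the `print_r` output rather than as an error on a non-object.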

This is a fairly deep ravine to dig to fix someone else's Open Source package. Looks like fun.

Posted: **Fri Jan 07, 2011 12:09 am**

Thanks a lot, William, for showing me a new way. I have been fighting with this for the last 20 days.

Code: Select all:

    <?php
    ini_set('display_errors',true);//Just in case we get some errors, let us know....
    // create a new cURL resource
    $ch = curl_init();
    $fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // grab URL and pass it to the browser
    $data = curl_exec($ch);
    // close cURL resource, and free up system resources
    curl_close($ch);
    $file = fopen("a.txt", "r") or exit("Unable to open file!");
    while(!feof($file))
    {
        echo fgets($file). "<br />";
        // $sline = fgets($file)
        // echo preg_match("<h1 class=>", $sline);
        // while(preg_match("<h1 class=>", fgets($file))
        // {
        // echo "I found It;"
        // }
    }
    fclose($file);
    ?>

I can see the entire page in my browser.

When I save the page as HTML and open it in Notepad, I see lines like the ones below (basically one record):

Code: Select all:

    <DIV class="entry clearfix " id=entry_9>
      <DIV class="entry_title clearfix">
        <H1 class=" ">Smith Colin P</H1></DIV>
      <DIV class=full_listing>
        <DIV class=blocks>
          <DIV class="block indent-level-0" id=entry_9_block_0>
            <DIV class=share_link wpol:contactPointId="711357117V00W" wpol:entryId="711357117V00W">
              <DIV class=save_menu>
                <DIV class=icon></DIV></DIV>
              <DIV class=share_menu>
                <DIV class=icon></DIV></DIV><A class=screen_reader_only href="http://192.168.0.2/mobile/send-to-mobile-accessible?entryId=711357117V00W&listingId=711357117V00W&searchType=R&channel=WP" rel=nofollow name=Smith>Send this listing to your mobile</A></DIV>
            <SPAN class="phone_number ">(03) 9650 4978</SPAN>
            <DIV class=address><SPAN class=street_line>118 Russell St</SPAN> <SPAN class=locality>Melbourne</SPAN><SPAN class=state>VIC</SPAN> <SPAN class=postcode>3000</SPAN></DIV>

How can I extract the data from the fetched page itself using preg_match?

Posted: **Fri Jan 07, 2011 8:58 am**

preg_match is one way. Substring searches with indexes are another.

You use a function that returns the position of the substring within the variable, then use that position to extract a substring.
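A minimal sketch of that position-then-extract idea, assuming the page is already in a string and the opening tag appears exactly as in the sample listing (the variable names are made up for illustration):

```php
<?php
// One line of the sample listing stands in for the downloaded page.
$page = '<span class="phone_number ">0457 599 539</span>';

$open  = '<span class="phone_number ">';
$close = '</span>';

$phone = null;
$start = strpos($page, $open);              // false when the marker is absent
if ($start !== false) {
    $start += strlen($open);                // skip past the opening tag itself
    $end = strpos($page, $close, $start);   // first closing tag after it
    if ($end !== false) {
        // substr() takes a LENGTH as its third argument, not an end position.
        $phone = substr($page, $start, $end - $start);
    }
}

echo $phone, "\n";
```

The comparisons use `!== false` because `strpos()` can legitimately return 0 when the marker sits at the very start of the string.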

Posted: **Fri Jan 07, 2011 10:29 am**

Substring looks easier than preg_match :-)

Let me fight with it!

Posted: **Sat Jan 08, 2011 1:45 am**

Code: Select all:

    <?php
    ini_set('display_errors',true);//Just in case we get some errors, let us know....
    // create a new cURL resource
    $ch = curl_init();
    $fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // grab URL and pass it to the browser
    $data = curl_exec($ch);
    //$file = fopen("a.txt", "r") or exit("Unable to open file!");
    $fileTxt = "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
    //echo "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
    while(!feof($fileTxt))
    {
        // echo fgets($file);
        // Get Value of First Name and Last Name
        // $StrPoSt = strpos(fgets($fileTxt),"clearfix\"><h1 class=",0);
        // $StrPoEnd = strpos(fgets($fileTxt),"</h1></div><div class",0);
        // echo $StrPoSt ."- Start<br>";
        // echo $StrPoEnd ."- End<br>";
        // echo "Value-" . substr(fgets($file),$StrPoSt+22,$StrPoEnd-1) . "<br>";
    }
    fclose($fileTxt);
    // close cURL resource, and free up system resources
    curl_close($ch);
    ?>

At while(!feof($fileTxt)) it looks like it gets into an infinite loop, as my page hangs.

Is the problem that I am using $fileTxt?

Posted: **Sat Jan 08, 2011 1:53 am**

possibly just improper use of feof. I don't use that function, but it doesn't look like it belongs there to me.

Posted: **Sat Jan 08, 2011 1:56 am**

http://php.net/manual/en/function.feof.php

If the passed file pointer is not valid you may get an infinite loop, because feof() fails to return TRUE.

But my file pointer is

$fileTxt = "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";

and I can do

echo $fileTxt;

Posted: **Sat Jan 08, 2011 2:02 am**

on the other hand, if that can be "echo'd" that does not mean it's a valid FILE POINTER. echo does not require a file pointer to operate, it merely passes a string to stdout, right?

try it with a real filename from your HD one time, see if the loop dies.

perhaps you should hit it with an "is this a valid file pointer" function .. FIRST.
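In PHP that check could be `is_resource()`. The sketch below is an illustration only: it uses a temporary file in place of a.txt so it runs anywhere, and shows that the result of `file_get_contents()` is a plain string while the result of `fopen()` is the stream handle that `feof()` and `fgets()` expect. As far as I know, older PHP versions only warn when `feof()` is given a string and keep returning a falsy value (hence the endless loop), while PHP 8 throws a TypeError instead.

```php
<?php
// A temporary file stands in for a.txt so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'wp');
file_put_contents($path, "line one\nline two\n");

$text   = file_get_contents($path);  // a string: fine for echo, not for feof()
$handle = fopen($path, 'r');         // a stream resource: what feof()/fgets() need

$textIsHandle   = is_resource($text);
$handleIsHandle = is_resource($handle);
var_dump($textIsHandle, $handleIsHandle);

$lines = array();
if ($handleIsHandle) {
    // Looping on the return value of fgets() avoids relying on feof() at all.
    while (($line = fgets($handle)) !== false) {
        $lines[] = rtrim($line, "\r\n");
    }
    fclose($handle);
}
unlink($path);

print_r($lines);
```

If the whole file is already in a string, there is no need for a read loop at all; the string functions can work on `$text` directly.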

I love makin' work for other programmers.

Posted: **Sat Jan 08, 2011 2:04 am**

Trying:

    <?php
    if ($f = fopen('myfile.txt', 'r')) do {
        $line = fgets($f);
        // do any stuff here...
    } while (!feof($f));
    fclose($f);

Let's see ...

Posted: **Sat Jan 08, 2011 2:07 am**

in that case $f is a "struct" that holds information necessary to track the open file. certainly not a string. LOL

in fact: try echoing $f! bet that opens your eyes a bit. (then try print_r($f), that may be even more interesting if it works)

Posted: **Sat Jan 08, 2011 2:10 am**

I have saved the HTML content in a text file.

Now I want to read that file in a loop, extract the value between the tags, and display the content between those tags.

Code: Select all:

    <?php
    /**
     *
     * @get text between tags
     *
     * @param string $tag The tag name
     *
     * @param string $html The XML or XHTML string
     *
     * @param int $strict Whether to use strict mode
     *
     * @return array
     *
     */
    function getTextBetweenTags($tag, $html, $strict=0)
    {
        /*** a new dom object ***/
        $dom = new domDocument;

        /*** load the html into the object ***/
        if($strict==1)
        {
            $dom->loadXML($html);
        }
        else
        {
            $dom->loadHTML($html);
        }

        /*** discard white space ***/
        $dom->preserveWhiteSpace = false;

        /*** the tag by its tag name ***/
        $content = $dom->getElementsByTagname($tag);

        /*** the array to return ***/
        $out = array();
        foreach ($content as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
        /*** return the results ***/
        return $out;
    }
    ?>
    <?php
    // create a new cURL resource
    $ch = curl_init();
    $fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // grab URL and pass it to the browser
    $data = curl_exec($ch);
    $file = fopen("a.txt", "r") or exit("Unable to open file!");
    //$fileTxt = "<pre>".htmlspecialchars(file_get_contents("a.txt"))."</pre>";
    while(!feof($file))
    {
        $html = fgets($file);
        $content = getTextBetweenTags('h1', $html);
        $content1 = getTextBetweenTags('street_line', $html);
        foreach( $content as $itemName )
        {
            echo "Name : " .$itemName.'<br />';
            foreach( $content1 as $itemAdd)
            {
                echo "Address : " .$itemAdd.'<br />';
            }
        }
    }
    fclose($file);
    // close cURL resource, and free up system resources
    curl_close($ch);
    ?>

I am getting output for the name, as its tag looks like

Code: Select all: <h1 class=" ">Smith A B</h1>

How do I get the address, as there are many span tags?

Code: Select all: <span class="street_line">398 Lonsdale St</span>

Code: Select all: <span class="locality">Melbourne</span>

Code: Select all: <span class="state">VIC</span>

Code: Select all: <span class="postcode">3000</span>

Posted: **Sat Jan 08, 2011 9:06 am**

the span statement contains the address specifics. you'll have to extract that first and use it to control the assignment of the associated text.

lots of extra work. (ie: yep, that's programming)

Posted: **Sat Jan 08, 2011 9:29 am**

Code: Select all: <div class="address"><span class="street_line">25 Spring St</span><span class="locality">Melbourne</span><span class="state">VIC</span><span class="postcode">3000</span></div>

This div has the address, but there are other divs as well.

Just thinking about how to extract only the address one.

Posted: **Sat Jan 08, 2011 9:34 am**

the address one is a "wrapper" around the others.

address is not "one", it is "four": it is a div that contains four spans, each of which holds one properly identified line of the address.

you extract the address by finding its div statement and then locating the NEXT /div and extracting what is between them

then you extract the individual address lines from the included spans using the same method.
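The two-step extraction described here could be sketched as follows. This is only an illustration under a few assumptions: `between()` is a made-up helper, the attribute quoting matches the sample exactly, and the address div contains no nested div (otherwise "the NEXT /div" would cut the block short).

```php
<?php
// Hypothetical helper: text between $open and the next $close, searching
// from $offset. Returns null when either marker is missing.
function between(string $haystack, string $open, string $close, int $offset = 0): ?string
{
    $start = strpos($haystack, $open, $offset);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);
    $end = strpos($haystack, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Sample address block as posted in this thread.
$page = '<div class="address"><span class="street_line">25 Spring St</span>'
      . '<span class="locality">Melbourne</span><span class="state">VIC</span>'
      . '<span class="postcode">3000</span></div>';

// Step 1: cut out the wrapper div.
$block = between($page, '<div class="address">', '</div>');

// Step 2: pull each span out of that block with the same method.
$address = array();
if ($block !== null) {
    foreach (array('street_line', 'locality', 'state', 'postcode') as $field) {
        $address[$field] = between($block, '<span class="' . $field . '">', '</span>');
    }
}

print_r($address);
```

Searching inside `$block` rather than the whole page keeps a span with the same class elsewhere on the page from being picked up by mistake.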

lots of fun

Posted: **Sat Jan 08, 2011 9:44 am**

Thanks William.

I understand that.

The primary one is <div class="address">

But before the address div, there are many more divs.

Now, the getTextBetweenTags function above takes a tag name such as h1 or div. I tried sending "<div class=address" as $tag, but it didn't work.

Posted: **Sat Jan 08, 2011 9:57 am**

it's probably not designed to find a tag by an attribute such as its class, only by tag name.
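`getElementsByTagName()` does indeed match on the tag name only. One possible way to select by class instead is an XPath query through `DOMXPath`. The function below is a hypothetical companion to getTextBetweenTags (its name and shape are assumptions, not part of any library), shown as a sketch:

```php
<?php
// Hypothetical companion to getTextBetweenTags(): returns the text of every
// element whose class attribute contains $class as a whole token.
function getTextByClass(string $class, string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy HTML quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    $out = array();
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
}

// Sample address block as posted in this thread.
$html = '<div class="address"><span class="street_line">25 Spring St</span>'
      . '<span class="locality">Melbourne</span><span class="state">VIC</span>'
      . '<span class="postcode">3000</span></div>';

print_r(getTextByClass('street_line', $html));
print_r(getTextByClass('postcode', $html));
```

It is probably safer to pass the whole page to a function like this in one go rather than one `fgets()` line at a time, since an element that spans two lines would otherwise be split before the parser sees it.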

Posted: **Sat Jan 08, 2011 10:20 am**

Working on another version:

Code: Select all:

    <?php
    ini_set('display_errors',true);//Just in case we get some errors, let us know....
    // create a new cURL resource
    $ch = curl_init();
    $fp = fopen (dirname(__FILE__) . '/a.txt', 'w+');//This is the file where we save the information
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, "http://www.whitepages.com.au/resSearch.do?subscriberName=smith&givenName=&location=Melbourne+VIC");
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $file = fopen("a.txt", "r") or exit("Unable to open file!");
    while(!feof($file))
    {
        $regex="/clearfix\"><h1 class(.*)<\/h1><\/div><div class/";
        preg_replace($regex,"",fgets($file));
    }
    fclose($file);
    // close cURL resource, and free up system resources
    curl_close($ch);
    ?>

Posted: **Wed Jan 12, 2011 9:52 pm**

Trying

document.getElementsByTagName

Lets see.


Off Topic : Need Help with simple_html_dom