HTML scraping and CSS queries

What are the advantages and disadvantages of the following libraries?

Of the above, I have used QP, which failed to parse invalid HTML, and simpleDomParser, which does a good job but leaks some memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need an object.

Are there any more scrapers? What are your experiences with them? I am going to make this a community wiki so that we can build a list of libraries that are useful when scraping.

I ran some tests based on Byron's answer:

<?php
    include("lib/simplehtmldom/simple_html_dom.php"); 
    include("lib/phpQuery/phpQuery/phpQuery.php"); 


    echo "<pre>"; 

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon"); 
    $data['pq'] = $data['dom'] = $data['simple_dom'] = array(); 

    $timer_start = microtime(true); 

    $dom = new DOMDocument(); 
    @$dom->loadHTML($html); 
    $x = new DOMXPath($dom); 

    foreach($x->query("//a") as $node) 
    { 
     $data['dom'][] = $node->getAttribute("href"); 
    } 

    foreach($x->query("//img") as $node) 
    { 
     $data['dom'][] = $node->getAttribute("src"); 
    } 

    foreach($x->query("//input") as $node) 
    { 
     $data['dom'][] = $node->getAttribute("name"); 
    } 

    $dom_time = microtime(true) - $timer_start;
    echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

    // phpQuery: iterating a result set yields plain DOMElement nodes,
    // so attributes are read with getAttribute() rather than as properties.
    $timer_start = microtime(true);
    $doc = phpQuery::newDocument($html);
    foreach($doc->find("a") as $node)
    {
        $data['pq'][] = $node->getAttribute("href");
    }

    foreach($doc->find("img") as $node)
    {
        $data['pq'][] = $node->getAttribute("src");
    }

    foreach($doc->find("input") as $node)
    {
        $data['pq'][] = $node->getAttribute("name");
    }
    $time = microtime(true) - $timer_start;
    echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";

    // Simple HTML DOM
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom(); 
    $simple_dom->load($html); 
    foreach($simple_dom->find("a") as $node) 
    { 
     $data['simple_dom'][] = $node->href; 
    } 

    foreach($simple_dom->find("img") as $node) 
    { 
     $data['simple_dom'][] = $node->src; 
    } 

    foreach($simple_dom->find("input") as $node) 
    { 
     $data['simple_dom'][] = $node->name; 
    } 
    $simple_dom_time = microtime(true) - $timer_start; 
    echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n"; 


    echo "</pre>";

and got:

dom:   0.00359296798706 . Got 115 items 
PQ:   0.010568857193 . Got 115 items 
simple_dom: 0.0770139694214 . Got 115 items


2010-08-30 Quamis

I used to use Simple HTML DOM exclusively, until some bright SO'ers showed me the light. Hallelujah.

Just use the built-in DOM functions. They are written in C and are part of the PHP core. They are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is very simple. This one change has made my PHP-based scrapers run faster while saving my precious time.
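To illustrate the built-in approach, here is a minimal sketch that pulls attribute values straight out of a document with DOMDocument and DOMXPath. The sample markup is made up for the example (a real scraper would fetch the page first), and the attribute-selecting XPath expressions are just one way to write the queries.

```php
<?php
// Hypothetical sample markup; a real scraper would download this over HTTP.
$html = <<<HTML
<html><body>
  <a href="/questions/1">First</a>
  <a href="/questions/2">Second</a>
  <img src="/img/logo.png" alt="logo">
  <form><input name="q" type="text"></form>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the string values of every attribute node matched by an XPath expression.
function attribute_values(DOMXPath $xpath, string $expression): array
{
    $values = [];
    foreach ($xpath->query($expression) as $attr) {
        $values[] = $attr->nodeValue;
    }
    return $values;
}

$hrefs   = attribute_values($xpath, '//a/@href');
$sources = attribute_values($xpath, '//img/@src');
$names   = attribute_values($xpath, '//input/@name');

print_r(['hrefs' => $hrefs, 'sources' => $sources, 'names' => $names]);
```

Selecting the attribute nodes directly (`//a/@href`) skips elements that lack the attribute, whereas calling getAttribute() on every element returns an empty string for them.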

My scrapers used to take ~60 megabytes to scrape 10 sites asynchronously with cURL. That was even with the Simple HTML DOM memory fix you mentioned.

Now my PHP processes never go above 8 megabytes.

Highly recommended.

EDIT

OK, I ran some benchmarks. The built-in DOM is at least an order of magnitude faster.

Built in php DOM: 0.007061 
Simple html DOM: 0.117781 

<?php
include("../lib/simple_html_dom.php"); 

$html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon"); 
$data['dom'] = $data['simple_dom'] = array(); 

$timer_start = microtime(true); 

$dom = new DOMDocument(); 
@$dom->loadHTML($html); 
$x = new DOMXPath($dom); 

foreach($x->query("//a") as $node) 
{ 
    $data['dom'][] = $node->getAttribute("href"); 
} 

foreach($x->query("//img") as $node) 
{ 
    $data['dom'][] = $node->getAttribute("src"); 
} 

foreach($x->query("//input") as $node) 
{ 
    $data['dom'][] = $node->getAttribute("name"); 
} 

$dom_time = microtime(true) - $timer_start; 

echo "built in php DOM : $dom_time\n"; 

$timer_start = microtime(true); 
$simple_dom = new simple_html_dom(); 
$simple_dom->load($html); 
foreach($simple_dom->find("a") as $node) 
{ 
    $data['simple_dom'][] = $node->href; 
} 

foreach($simple_dom->find("img") as $node) 
{ 
    $data['simple_dom'][] = $node->src; 
} 

foreach($simple_dom->find("input") as $node) 
{ 
    $data['simple_dom'][] = $node->name; 
} 
$simple_dom_time = microtime(true) - $timer_start; 

echo "simple html DOM : $simple_dom_time\n";
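A single timed run like the one above also includes warm-up effects and is fairly noisy. As a rough sketch, a small helper can average several runs over a fixed in-memory document. The helper name `average_seconds`, the run count, and the generated markup are all assumptions made for this example, not part of the original test.

```php
<?php
// Hypothetical helper: call $fn $runs times and return the average seconds per call.
function average_seconds(callable $fn, int $runs = 20): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (microtime(true) - $start) / $runs;
}

// Generated sample document so the timing does not depend on the network.
$html = '<html><body>'
      . str_repeat('<p><a href="/x">x</a><img src="/y.png"></p>', 2000)
      . '</body></html>';

// The work being measured: parse the document and count links and images.
$extract = function () use ($html): int {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    return $xpath->query('//a')->length + $xpath->query('//img')->length;
};

$found = $extract();
$avg   = average_seconds($extract);

printf("built-in DOM: %d nodes, %.6f s per run\n", $found, $avg);
```

The same helper could wrap a Simple HTML DOM or phpQuery closure to compare the libraries on identical input.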


2010-08-30 19:32:25

Does this work for invalid markup? How much faster is this versus Simple HTML DOM? – Quamis

This **does** work for invalid markup. I don't have benchmarks, but it is at least an order of magnitude faster. On large pages, Simple HTML DOM would take 1-2 seconds; the built-in DOM does it in the blink of an eye. I have written a lot of scrapers with this and would never go back to Simple HTML DOM for anything. –

@Quamis Note the @ in front of loadHTML(). With it removed you will see a bunch of warnings from invalid HTML being forced into the DOM tree. It works for browsers, and it works for PHP too ;) –
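As an alternative to the @ operator, the parser warnings can be collected instead of silenced by using libxml_use_internal_errors(). The following is a minimal sketch on deliberately broken, made-up markup; how many warnings are reported depends on the libxml version installed.

```php
<?php
// Deliberately invalid sample markup (unclosed tags, stray closing tag); hypothetical input.
$html = '<div><p>Unclosed paragraph<a href="/a">link one</div></span><a href="/b">link two';

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Keep the collected warnings for logging, then reset libxml's error state.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The repaired tree can be queried as usual.
$hrefs = [];
foreach ((new DOMXPath($dom))->query('//a') as $node) {
    $hrefs[] = $node->getAttribute('href');
}

printf("%d parser warnings, %d links recovered\n", count($warnings), count($hrefs));
```

Unlike @, this leaves unrelated errors visible and makes the markup problems available for inspection if a page starts parsing differently.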
