We use Apache Tika from PHP, calling its command-line utility with the -j flag to get the metadata as JSON:
http://tika.apache.org/
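<?php
// Run the Tika CLI with -j (metadata as JSON) and capture what it prints.
$url = 'http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying';
$json = shell_exec('java -jar tika-app-1.4.jar -j ' . escapeshellarg($url));
echo $json;
?>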
This is sample output for a random Guardian article:
{
"Content-Encoding":"UTF-8",
"Content-Length":205599,
"Content-Type":"text/html; charset\u003dUTF-8",
"DC.date.issued":"2013-07-21",
"X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
"application-name":"The Guardian",
"article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
"article:modified_time":"2013-07-21T22:42:21+01:00",
"article:published_time":"2013-07-21T22:00:03+01:00",
"article:section":"Politics",
"article:tag":[
"Lynton Crosby",
"Health policy",
"NHS",
"Health",
"Healthcare industry",
"Society",
"Public services policy",
"Lobbying",
"Conservatives",
"David Cameron",
"Politics",
"UK news",
"Business"
],
"content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
"description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
"fb:app_id":180444840287,
"keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
"msapplication-TileColor":"#004983",
"msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
"news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
"og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
"og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
"og:site_name":"the Guardian",
"og:title":"Tory strategist Lynton Crosby in new lobbying row",
"og:type":"article",
"og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"resourceName":"tory-strategist-lynton-crosby-lobbying",
"title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
"twitter:app:id:googleplay":"com.guardian",
"twitter:app:id:iphone":409128287,
"twitter:app:name:googleplay":"The Guardian",
"twitter:app:name:iphone":"The Guardian",
"twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
"twitter:card":"summary_large_image",
"twitter:site":"@guardian"
}
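Once the JSON is captured, it can be decoded into a PHP array and filtered by key. The following is only a minimal sketch: the JSON string is an abbreviated copy of the output above, and `metaWithPrefix` is a hypothetical helper name, not part of Tika or PHP.

```php
<?php
// Abbreviated copy of the Tika output shown above; in practice this string
// would be whatever shell_exec() returned.
$json = <<<'JSON'
{
  "article:tag": ["Lynton Crosby", "NHS", "Lobbying"],
  "og:title": "Tory strategist Lynton Crosby in new lobbying row",
  "og:type": "article",
  "title": "Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian"
}
JSON;

// Hypothetical helper: keep only the entries whose key starts with $prefix.
function metaWithPrefix(array $meta, string $prefix): array
{
    $out = [];
    foreach ($meta as $key => $value) {
        if (strpos($key, $prefix) === 0) {
            $out[$key] = $value;
        }
    }
    return $out;
}

// Decode into an associative array; json_decode() returns null on bad input.
$meta = json_decode($json, true);
if (!is_array($meta)) {
    throw new RuntimeException('Could not decode Tika output: ' . json_last_error_msg());
}

$og = metaWithPrefix($meta, 'og:');

// Assumption: a field may hold either a list or a single string,
// so cast to array before iterating.
$tags = (array) ($meta['article:tag'] ?? []);

echo $og['og:title'], "\n";
echo implode(', ', $tags), "\n";
```

Passing `'twitter:'` or `'article:'` as the prefix would select the other metadata groups in the same way.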
This doesn't work for all websites, for example http://www.baidu.com – Prakash
Two years later and it still works without problems. I changed the attributes to get the 'og:' tags used by OpenGraph and it works perfectly well – Yaroslav
Awesome! Thanks Shamittomar. Add 'strtolower' around '$meta->getAttribute()'!! Sometimes the names start with a capital letter – JoeRocc