Python's Scrapy doesn't seem to get data from all available URLs

I'm trying to scrape thesession.org to build a table of how many times each tune has been added to members' tunebooks, so that I can find some popular tunes to learn. I started with the Scrapy tutorial and am trying to modify it for my purposes. The problem is that although thesession.org appears to have about 10,390 tunes, my scraper returns data for only 10 of them (just the ones listed at http://www.thesession.org/tunes/index.php). How can I get data on all the tunes (or on the hundred top-ranked tunes)? Any advice would be much appreciated.
This is what I have so far:
items.py
from scrapy.item import Item, Field


class tuneItem(Item):
    url = Field()
    name1 = Field()
    name2 = Field()
    key = Field()
    count = Field()
tune_spider.py
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings


class tunesSpider(CrawlSpider):
    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    # Send every tune page (/display/<id>) to parse_tune, skipping member,
    # recording and index pages as well as sub-pages of a tune.
    rules = [
        Rule(
            SgmlLinkExtractor(
                allow=[r'/display/\d+'],
                deny=['/members/', '/recordings/', '/index/', r'/display/\d+/.'],
            ),
            'parse_tune',
        )
    ]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)
        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key'] = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re(r'\d+')
        return tune
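To see which links the rule above would accept, the allow and deny patterns can be tried on a few URLs outside Scrapy. The following is only a minimal sketch: it assumes the link extractor applies each pattern as a regular-expression search against the full URL, the helper name would_follow is made up for illustration, and two of the sample URLs are hypothetical rather than taken from the site.

```python
import re

# Patterns copied from the spider's rule.
ALLOW = [r'/display/\d+']
DENY = [r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']


def would_follow(url):
    """Return True if url matches an allow pattern and no deny pattern.

    Assumption: patterns are applied with re.search against the whole URL.
    """
    allowed = any(re.search(pattern, url) for pattern in ALLOW)
    denied = any(re.search(pattern, url) for pattern in DENY)
    return allowed and not denied


samples = [
    "http://www.thesession.org/tunes/display/11602",           # seen in the log
    "http://www.thesession.org/tunes/index.php",               # mentioned in the question
    "http://www.thesession.org/tunes/display/11602/comments",  # hypothetical sub-page
    "http://www.thesession.org/members/display/1",             # hypothetical member page
]

for url in samples:
    print(would_follow(url), url)
```

Under these assumptions only the first URL passes the filter; in particular, a listing page such as index.php does not match the allow pattern at all.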
I run the scraper by opening my console, going to the directory that contains the tutorial's cfg file, and running:
scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv
This is what I get:
C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://ww
w.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
{'count': [u'1'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Brendan Begley's"],
'name2': [u'polka'],
'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
{'count': [u'3'],
'key': [u'Key signature: Amajor'],
'name1': [u'Carleton County Breakdown'],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
{'count': [u'3'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Kasper's Rant"],
'name2': [u'hornpipe'],
'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
{'count': [u'5'],
'key': [u'Key signature: Gmajor'],
'name1': [u'The Full Of The Bag'],
'name2': [u'hornpipe'],
'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
{'count': [u'1'],
'key': [u'Key signature: Adorian'],
'name1': [u'The New Steamboat'],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
{'count': [u'4'],
'key': [u'Key signature: Gmajor'],
'name1': [u"Galen's Arrival"],
'name2': [u'reel'],
'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
{'count': [u'2'],
'key': [u'Key signature: Amixolydian'],
'name1': [u'Culloden Day'],
'name2': [u'strathspey'],
'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
{'count': [u'2'],
'key': [u'Key signature: Aminor'],
'name1': [u'Miss Sine Flemington'],
'name2': [u'barndance'],
'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
{'count': [u'2'],
'key': [u'Key signature: Dmajor'],
'name1': [u"Joan Martin's"],
'name2': [u'polka'],
'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
{'count': [u'2'],
'key': [u'Key signature: Gmajor'],
'name1': [u'My Time Inside 2005'],
'name2': [u'waltz'],
'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
{'downloader/request_bytes': 3655,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 31620,
'downloader/response_count': 12,
'downloader/response_status_count/200': 11,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
'item_scraped_count': 10,
'request_depth_max': 1,
'scheduler/memory_enqueued': 12,
'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
{}
EDIT: The answer from @reclosedev set me on the right path. For anyone wondering about the result, here is a snapshot ...
(1) The vast majority of tunes are in fewer than 10 members' tunebooks
(2) The popularity of the 10,379 tunes I could scrape from the site (measured by the number of tunebooks they are in) follows a power-law distribution
(3) And here are the tunes that are in more than 1000 tunebooks on the site, showing the names of the top-ranked tunes and how many tunebooks they are in
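The three observations above can be reproduced from the scraped counts with a few lines of plain Python. This is a rough sketch rather than the analysis actually used: the rows are the ten (name, count) pairs from the log output earlier in the post, the function names are made up for illustration, and a negative log-log slope is only an informal hint of a power law, not a rigorous test.

```python
import math

# (tune name, tunebook count) pairs taken from the log output above.
rows = [
    ("Brendan Begley's", 1),
    ("Carleton County Breakdown", 3),
    ("Kasper's Rant", 3),
    ("The Full Of The Bag", 5),
    ("The New Steamboat", 1),
    ("Galen's Arrival", 4),
    ("Culloden Day", 2),
    ("Miss Sine Flemington", 2),
    ("Joan Martin's", 2),
    ("My Time Inside 2005", 2),
]


def share_below(rows, threshold):
    """Fraction of tunes that are in fewer than `threshold` tunebooks."""
    return sum(1 for _, count in rows if count < threshold) / len(rows)


def top_tunes(rows, n):
    """The n tunes with the highest tunebook counts, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


def loglog_slope(rows):
    """Least-squares slope of log(count) against log(rank).

    Tunes are ranked from most to least popular; a roughly straight,
    downward-sloping line on log-log axes is what a power law would give.
    """
    counts = sorted((count for _, count in rows if count > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


print(share_below(rows, 10))
print(top_tunes(rows, 2))
print(loglog_slope(rows))
```

With the full CSV, the same functions could be fed rows read through the csv module, with a larger n or a filter on count > 1000 for the last list.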
Interesting results, but you could have saved yourself the trouble: http://www.irishtune.info/session/tunes.php – alanng