Download all the links (related documents) on a web page using Python

I have to download a lot of documents from a web page. They are wmv files, PDF, BMP, etc. Of course, all of them have links pointing to them. So each time I have to right-click (RMC) a file, select 'Save Link As', and then save it with the type 'All Files'. Is it possible to do this in Python? I searched the SO DB, and people have answered the question of how to get the links from a web page. I want to download the actual files. Thanks in advance. (This is not a HW question :)).

Source

2011-05-12 Sumod

Here is an example of how you can download some chosen files from http://pypi.python.org/pypi/xlwt

You will need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html

import mechanize
from time import sleep

# Make a Browser (think of this as Chrome or Firefox, etc.)
br = mechanize.Browser()

# Visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
# for more ways to set up your br browser object, e.g. so it looks like Mozilla,
# and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

# Save the page source; this can be helpful for debugging
with open("source.html", "wb") as f:
    f.write(br.response().read())

# You will need to do some kind of pattern matching on your files
filetypes = [".zip", ".exe", ".tar.gz"]
myfiles = []
for l in br.links():  # you can also iterate through br.forms() to print forms on the page!
    # Check whether this link has a file extension we want
    # (you may choose to use regular expressions or something similar)
    if any(t in l.url for t in filetypes):
        myfiles.append(l)


def downloadlink(l):
    # click_link() only builds a Request; br.open() actually fetches it
    br.open(br.click_link(l))
    # Perhaps you should open in a better way and ensure the file doesn't already exist
    with open(l.text, "wb") as f:  # binary mode, since the downloads are not text
        f.write(br.response().read())
    print(l.text, "has been downloaded")
    br.back()  # return to the HTML page so the next link can be clicked


for l in myfiles:
    sleep(1)  # throttle so you don't hammer the site
    downloadlink(l)

Note: click_link returns a Request object without opening it, whereas follow_link opens the link directly, so in some cases you may want to use br.follow_link(l) instead of br.click_link(l). See "Mechanize difference between br.click_link() and br.follow_link()".
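The comment in downloadlink about opening the file "in a better way" can be made concrete. What follows is a minimal sketch and not part of the original answer: the helper names local_name and unique_path are hypothetical, and it assumes the file name can be taken from the last path segment of the link's URL rather than from the link text.

```python
import os
from urllib.parse import urlsplit, unquote


def local_name(url, default="download.bin"):
    """Derive a local file name from the last path segment of a URL.

    Query strings and fragments (e.g. '#md5=...') are ignored.
    """
    path = urlsplit(url).path
    name = os.path.basename(unquote(path))
    return name or default


def unique_path(directory, name):
    """Return a path in `directory` that does not clash with an existing file."""
    stem, ext = os.path.splitext(name)
    candidate = os.path.join(directory, name)
    counter = 1
    while os.path.exists(candidate):
        # Append _1, _2, ... before the extension until the name is free
        candidate = os.path.join(directory, "%s_%d%s" % (stem, counter, ext))
        counter += 1
    return candidate
```

Inside downloadlink, the file could then be opened with something like open(unique_path(".", local_name(l.absolute_url)), "wb"), assuming the mechanize Link object exposes absolute_url. This way an existing file is never overwritten and link text containing characters that are invalid in file names does not matter.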

Source

2011-05-12 10:08:06

+1 for mechanize! – jathanism

+1 for completely working code! –

robert kink, I ran your code to download only zip files: the code runs without errors, but I don't see the files in the Chrome downloads folder – newGIS

Follow the Python code in this link: wget-vs-urlretrieve-of-python.
You can also do this easily with Wget. Try the --level, --recursive and --accept command-line options in Wget. For example: wget --accept wmv,doc --level 2 --recursive http://www.example.com/files/
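For comparison, roughly the same accept-list idea can be sketched in Python using only the standard library, with urllib.request.urlretrieve doing the download. This is an illustrative sketch and not the answerer's code: the names LinkCollector, wanted_links and download_all are hypothetical, and it only looks at the links on a single page instead of recursing as Wget does.

```python
import os
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen, urlretrieve


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def wanted_links(html, base_url, accept):
    """Return absolute URLs whose path ends with one of the accepted extensions."""
    parser = LinkCollector()
    parser.feed(html)
    suffixes = tuple("." + ext.lower() for ext in accept)
    urls = [urljoin(base_url, href) for href in parser.hrefs]
    return [u for u in urls if urlsplit(u).path.lower().endswith(suffixes)]


def download_all(page_url, accept, delay=1.0):
    """Fetch page_url and save every accepted link into the current directory."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    for url in wanted_links(html, page_url, accept):
        filename = os.path.basename(urlsplit(url).path)
        urlretrieve(url, filename)
        print(filename, "has been downloaded")
        time.sleep(delay)  # throttle so the site is not hammered
```

Calling download_all("http://www.example.com/files/", ["wmv", "doc"]) with the placeholder URL from the Wget example would save the matching files into the directory the script is run from, not into a browser's downloads folder.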

Source

2011-05-12 07:28:44 gsbabil
