
Python Script to Find and Download PDFs from a Webpage

Updated: Mar 31, 2023


I am currently writing my first novel, and I came across a real-life character to inspire a fictional one. The thematic core of my novel is conspiracy, both theoretical and practical, and one of the most out-there conspiracy theorists on the internet is Miles Mathis. I don't endorse or reject his conspiracy theories; most of them are impossible for me to confirm or deny, as I have no firsthand knowledge of the people, events, or evidence. Nevertheless, I found myself perusing his website and finding a large library of PDFs on all kinds of subjects, from the Beatles to the Beer Hall Putsch. I wanted to run some text analysis on the documents to confirm my intuitions about his main theses without having to read every single one (there are hundreds). So I wrote this script to scan a given webpage for PDF links and download the files to a folder of your choosing. Feel free to use it; just remember to change the webpage variables, url and url2, and the folder destination variables, folder and folder2.


In the future I'd like to update the script so it can iterate through all pages of a website and pull all PDFs into folders created dynamically based on the webpage where the PDFs were posted. Given my needs at the time, though, that would have been overengineering: I only needed PDFs from two pages. There are also a lot of pages on the site without PDF links, not to mention redundant links across pages, so there it is. Enjoy, and I hope this script makes your life and research easier!

 


from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
import os


# create soup object by scraping webpage (url)
def soupify(url):
    response = requests.get(url)
    # stop early on an HTTP error instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, features="html.parser")

    return soup


# takes soup object of webpage as input, and
# outputs a list of pdf urls
def get_pdf_urls(soup_object):

    pdf_urls = []
    for link in soup_object.select("[href$='.pdf']"):
        pdf_urls.append(link.get("href"))

    return pdf_urls

# takes a list of pdf links and the url of the page they came
# from, and downloads each pdf file into folder
def download_pdfs(pdf_list, folder, base_url):

    # create the destination folder (and any missing parents) if needed
    os.makedirs(folder, exist_ok=True)

    for link in pdf_list:
        filename = os.path.join(folder, link.split('/')[-1])
        # urljoin turns a relative link into an absolute url
        response = requests.get(urljoin(base_url, link))
        with open(filename, 'wb') as f:
            f.write(response.content)

# needs to be worked on to dynamically create different
# folders for pdfs coming from different urls
def full_pdf_cycle(url_list, folder):
    for url in url_list:
        soup_object = soupify(url)
        pdf_urls = get_pdf_urls(soup_object)
        download_pdfs(pdf_urls, folder, url)



# pass arguments to functions and complete task
# of downloading all pdfs from one Miles Mathis updates page
# (here the scientific essays; swap in url and folder for
# the political essays)
if __name__ == '__main__':

    # folder paths for pdfs (change these to your own)
    folder = r'C:\Users\rlf21\Documents\Python Scripts\pdf project\Miles Mathis political essays'
    folder2 = r'C:\Users\rfl21\Documents\Python Scripts\pdf project\Miles Mathis scientific essays'

    # relevant urls to scrape from Miles Mathis Site
    url = "http://mileswmathis.com/updates.html"
    url2 = "http://milesmathis.com/updates.html"

    # call functions to download all pdfs from given url
    soup_object = soupify(url2)
    pdf_urls = get_pdf_urls(soup_object)
    download_pdfs(pdf_urls, folder2, url2)
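
For anyone who wants to try the future idea mentioned earlier (one folder per source page, and no repeat downloads when the same link shows up on several pages), here is a rough sketch of the two helper pieces it might need. It is not part of the script above. The names folder_for_page and unique_absolute_urls are made up for illustration, example.com is a placeholder site, and the folder naming scheme is just one assumption among many possible ones.

```python
import os
import re
from urllib.parse import urljoin, urlparse


# hypothetical helper: build a subfolder path named after the page url
def folder_for_page(base_folder, page_url):
    parts = urlparse(page_url)
    # drop the extension from the page path, e.g. "updates.html" -> "updates"
    page_name = os.path.splitext(parts.path.strip("/"))[0] or "index"
    # replace anything that is awkward in a folder name with an underscore
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", f"{parts.netloc}_{page_name}")
    return os.path.join(base_folder, safe_name)


# hypothetical helper: resolve links against the page url and skip
# any absolute url that is already recorded in the seen set
def unique_absolute_urls(page_url, links, seen):
    fresh = []
    for link in links:
        absolute = urljoin(page_url, link)
        if absolute not in seen:
            seen.add(absolute)
            fresh.append(absolute)
    return fresh


if __name__ == '__main__':
    seen = set()
    page = "http://example.com/updates.html"  # placeholder url
    print(folder_for_page("pdfs", page))
    print(unique_absolute_urls(page, ["a.pdf", "/a.pdf", "b.pdf"], seen))
```

Sharing one seen set across every page in a loop is what would keep a PDF linked from several pages from being downloaded more than once, and the folder name gives each page its own destination.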
   
