Download Complete Set of PDFs from a Company Portal

I'm sure this is probably simple for a lot of you. I have access to a bunch of information in PDF format, spread across many different files, on a password-protected server. When I'm logged in, I can access documents directly at company.com/publications/xxxxxx.pdf, where xxxxxx is the document number.

Is there any way to automate the downloading of all available files in the "publication" folder on this server? I would have done a Google search, but had no idea where to start, keyword-wise.

Comments

  • +8

    You could use something like DownThemAll!, which can automatically download certain files based on rules you set, or you can create custom filters using regular expressions.

  • There are many download managers, like FlashGet, that integrate into browsers and can scrape an entire website's contents for auto-downloading, although nowadays you can probably get a browser extension to do this.

  • +2

    I wrote a piece of software in Java many years ago, where you could list in a text file the files you wanted to download and the local file locations you wanted to save them to. It could extract zips and save only the files you specified within them. It would also handle torrents. It had a neat little UI and would show progress bars for the downloading and extracting. I don't remember whether you could give it a folder location and download everything in that folder, or whether you had to list all the file locations individually. If you want to see whether it could help you, it's at https://github.com/quantumcat1/ConfigurableDownloader/

  • The obvious benefit of not downloading the docs is that the current version is always on the server. Your downloaded version could be obsolete by later today.

  • +1

    This sounds a little like something you perhaps should not be doing, especially if they contain proprietary information.

    Just say'n.

  • +1

    https://eternallybored.org/misc/wget/

    Copy the wget.exe file to the C:\Windows\System32 folder.

    Press Win+R, type 'cmd' and hit Enter in the box, then at the command prompt type:

    wget --no-parent -r http://asdfasdf.com/DIRECTORY

    • +1

      Cookie cookies cookies start with C

      • Oh yeah, I missed the login part. Sorry, disregard, OP.

  • Can you get access to the file location and do a copy and paste?

  • HTTrack Website Copier
    https://www.httrack.com/

    You can easily set it to download only the PDFs in your target location.

    It's free under the GNU General Public License.

  • Is there any way to automate the downloading of all available files in the "publication" folder on this server?

    Sure. Others have suggested programs that do so, but if you're skittish about using third-party stuff (like myself, to be honest) and/or have the chops to roll your own:

    • You can use curl together with cookies (reproduce the HTTP POST request that a browser would send during a successful manual log-in), then embed the curl call in a script that loops over all of the files and downloads them one by one. A rough sketch of this idea follows below this list.
    • Alternatively (easier to use but harder to set up, in my experience), you can set up Selenium to automate a web browser instance. You'd bring up a browser instance in a Python script (say), log in manually in the browser window, and then have the rest of the script navigate to each of the pages and download everything in a loop. There's a sketch of this at the end of this comment.
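
    Here's a minimal sketch of the first option, using Python's requests library in place of a raw curl command. The login URL, the form field names, and the document-number range are all placeholders; check the real login request in your browser's dev tools (Network tab) and substitute your own values:

      import requests

      BASE = "https://company.com"

      session = requests.Session()  # the Session object keeps the login cookies for us

      # Reproduce the POST the browser sends when you log in; field names are placeholders.
      session.post(f"{BASE}/login", data={"username": "me", "password": "secret"})

      # Loop over candidate document numbers and save whatever actually exists.
      for num in range(1, 1000):
          doc_id = f"{num:06d}"  # 000001, 000002, ...
          resp = session.get(f"{BASE}/publications/{doc_id}.pdf")
          if resp.ok and resp.headers.get("Content-Type", "").startswith("application/pdf"):
              with open(f"{doc_id}.pdf", "wb") as f:
                  f.write(resp.content)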

    There are always ways.

    (Fair warning if you're planning to use this for less-than-goodboy purposes: it's trivial for a web portal administrator to notice, and mitigate against or report, a lot of rapid-fire page requests followed by downloads. There is such a thing as being too efficient, after all. So y'know, proceed with caution or whatever.)
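
    And a rough sketch of the Selenium route: log in by hand in the browser window, then reuse that session's cookies for the downloads. Handing the cookies over to requests for the download step is my own shortcut rather than the only way to do it, and the URLs and number range are again placeholders:

      import requests
      from selenium import webdriver

      driver = webdriver.Firefox()  # or webdriver.Chrome()
      driver.get("https://company.com/login")
      input("Log in in the browser window, then press Enter here... ")

      # Copy the logged-in browser session's cookies into a requests session.
      session = requests.Session()
      for cookie in driver.get_cookies():
          session.cookies.set(cookie["name"], cookie["value"])
      driver.quit()

      # Same download loop as in the earlier sketch.
      for num in range(1, 1000):
          doc_id = f"{num:06d}"
          resp = session.get(f"https://company.com/publications/{doc_id}.pdf")
          if resp.ok:
              with open(f"{doc_id}.pdf", "wb") as f:
                  f.write(resp.content)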

  • Thanks for the suggestions all, I'll give them a try, from easiest to hardest 😂. No issues with legalities etc.; they are legitimate documents that I have access to, and they aren't updated. I simply want them all on my laptop, organised into folders etc., so I have them when I need them.
