No. 3378
You can archive entire websites with the command
wget -r --page-requisites --html-extension --convert-links WEBSITEYOUWANTTOARCHIVE
I just archived the entirety of tohno-chan. It took 10 hours and it ended up being only 27 GB.
No. 3379
Please add some delay between requests (wget's --wait option), otherwise you'll end up bringing down small sites that are hosted out of a tin can. Also, 27 GB seems quite large. Is the majority coming from images?
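
A minimal sketch of a gentler run, assuming a hypothetical target URL and arbitrary example values for the delay and the rate cap (tune both to the site). It only prints the command so it can be checked first:

```shell
# Hypothetical target; replace with the site you want to archive.
SITE="https://example.org/"

# Throttling options (the values are illustrative assumptions):
#   --wait=2           pause 2 seconds between retrievals
#   --random-wait      vary that pause so requests are less bursty
#   --limit-rate=200k  cap the download speed at roughly 200 KB/s
POLITE_OPTS="--wait=2 --random-wait --limit-rate=200k"

# Dry run: print the full command. Remove "echo" to actually start the crawl.
echo wget -r --page-requisites --adjust-extension --convert-links $POLITE_OPTS "$SITE"
```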
No. 3380
For an imageboard that's been running for over a decade, doesn't really auto-delete threads, and allows 7-10 MB files, that's pretty small.
No. 3381
On a side note, does tohno-chan's software offer an API endpoint? Requesting a JSON file is way better than downloading a raw static page and then parsing it as an HTMLDocument.
No. 3382
>API endpoint
I don't believe kusaba has this. The HTML is very structured and easy to parse anyway.
No. 3447
Why use wget over httrack? I'm asking because I archive a lot of webpages too, but I do it with httrack, and I want to know if wget has any advantages.
No. 3448
I've never heard of httrack.
No. 3449
They seem to be similar, but I can't find good comparisons. It looks like HTTrack allows multiple connections and multithreading. HTTrack seems to be more GUI-focused, although there seems to be a CLI version too.

I think that, given httrack is more focused on mirroring, it might work better out of the box for more complex websites.
No. 3476
There is also this:

wget --mirror --page-requisites --adjust-extension --convert-links

The --mirror flag is, as far as I understand it, different in that it doesn't pick up external links and just focuses on plain mirroring. The man page doesn't say much about this flag, except that it is '[..] suitable for mirroring'.
Also, --adjust-extension is probably irrelevant for TC, but it appends .html to the local names of HTML pages whose URLs end in something else (.cgi, .php, .py, .asp). The man page states that --adjust-extension is the same as --html-extension, just under a different name.

Post edited on 7th Oct 2024, 8:39am
No. 3477
Which version of the man page are you looking at? Mine seems to say that mirror is a superset of recursive: the "--mirror" option "turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing."

Which seems fairly clear. (Other GNU projects do have unfortunately terse man pages. For some reason GNU hates man pages and prefers its "info" doc tool, which is pretty stupid.)
No. 3478
>Which version of man pages are you looking at
GNU Wget 1.21.3 2022-05-14 WGET(1) on Debian 12.
I copied the relevant section and it reads as follows:

Turn on options suitable for mirroring. This option turns on
recursion and time-stamping, sets infinite recursion depth and
keeps FTP directory listings. It is currently equivalent to -r -N
-l inf --no-remove-listing.
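
To make the equivalence concrete, here is a small sketch that spells out the short form and the expanded form side by side. The URL is a hypothetical placeholder, and the commands are only printed, not run:

```shell
# Hypothetical target; replace with the site you want to mirror.
SITE="https://example.org/"

# Short form.
MIRROR_SHORT="--mirror"

# Expanded form, per the man page text quoted above:
#   -r                   recursive retrieval
#   -N                   time-stamping
#   -l inf               infinite recursion depth
#   --no-remove-listing  keep FTP directory listings
MIRROR_LONG="-r -N -l inf --no-remove-listing"

# Print both variants; remove "echo" to actually run one of them.
for opts in "$MIRROR_SHORT" "$MIRROR_LONG"; do
  echo wget $opts --page-requisites --adjust-extension --convert-links "$SITE"
done
```

In principle, -N lets a later re-run skip files that have not changed on the server; how well that plays together with --convert-links is not something this sketch verifies.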

No. 3500
I archive YouTube channels like this:
yt-dlp --embed-chapters --sleep-request 5 --download-archive archive (link to video tab of channel)
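
The same command spelled out with comments, as a sketch: the channel URL is a hypothetical placeholder, and "archive" is simply the file name used in the command above. The command is printed rather than executed:

```shell
# Hypothetical placeholder; use the link to the channel's video tab.
CHANNEL_URL="https://www.youtube.com/@SOMECHANNEL/videos"

# --embed-chapters            write chapter markers into the video file
# --sleep-request 5           wait 5 seconds between requests during data extraction
# --download-archive archive  record downloaded video IDs in the file "archive"
#                             and skip anything already listed there
YTDLP_OPTS="--embed-chapters --sleep-request 5 --download-archive archive"

# Dry run; remove "echo" to start downloading.
echo yt-dlp $YTDLP_OPTS "$CHANNEL_URL"
```

Because of the archive file, re-running the same command later only fetches videos that are not listed in it yet.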

