Beep Boop Bip
File 171088592917.jpg - (29.89KB , 304x383 , r07-rika-figure.jpg )
No. 3378
You can archive entire websites with the command:
wget -r --page-requisites --html-extension --convert-links WEBSITEYOUWANTTOARCHIVE
I just archived the entirety of tohno-chan. It took 10 hours and it ended up being only 27 GB.
>> No. 3379
>>3378
Please add some timeout (a delay between requests), otherwise you'll end up bringing down small sites that are hosted out of a tin can. Also, 27 GB seems quite large; is the majority coming from images?
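
A minimal sketch of what a more polite crawl could look like, assuming wget's --wait, --random-wait and --limit-rate options. The helper names (run, polite_mirror), the example URL and the numbers are placeholders to tune, not anything from the original post. --adjust-extension is the current name of --html-extension. It only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: print the command. Set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# polite_mirror is a hypothetical helper; the values below are guesses.
polite_mirror() {
  # --wait=2         pause 2 seconds between requests
  # --random-wait    vary that pause so the requests are less bursty
  # --limit-rate     cap the download bandwidth
  run wget -r --page-requisites --adjust-extension --convert-links \
    --wait=2 --random-wait --limit-rate=500k "$1"
}

polite_mirror "https://example.org/"
```

A longer wait makes a full-board crawl take longer, but it is much kinder to a small host.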
>> No. 3380
>>3379
For an imageboard that's been running for over a decade, doesn't really auto-delete threads, and allows 7-10 MB files, that's pretty small.
>> No. 3381
>>3379
On a side note, does tohno-chan's software offer an API endpoint? Requesting a JSON file is way better than downloading a raw static page and then parsing it as an HTMLDocument.
>> No. 3382
>>3381
>API endpoint
I don't believe kusaba has this. The HTML is very structured and easy to parse anyway.
>> No. 3447
Why use wget over httrack? I'm asking because I archive a lot of webpages too, but I do that with httrack, and I want to know if wget has any advantages.
>> No. 3448
>>3447
I've never heard of httrack.
>> No. 3449
>>3447
They seem to be similar, and I can't find good comparisons. It looks like HTTrack allows multiple connections plus multithreading. HTTrack also seems to be more GUI-focused, although there is a CLI version?

I think that, given httrack is more focused on mirroring, it might work better out of the box for more complex websites.
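
For the CLI side, a rough sketch of an httrack call, assuming its -O (output directory) and -cN (number of simultaneous connections) options. Check `httrack --help` on your version before relying on this. The helper names, URL and directory are made up for illustration, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: print the command. Set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# httrack_mirror is a hypothetical helper: $1 is the site, $2 the output dir.
httrack_mirror() {
  # -O DIR   where the mirror is written
  # -c2      use 2 simultaneous connections (kept low to be gentle on the host)
  run httrack "$1" -O "$2" -c2
}

httrack_mirror "https://example.org/" "./mirror"
```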
>> No. 3476
There is also this:

wget --mirror --page-requisites --adjust-extension --convert-links

The --mirror flag is, as far as I understand it, different in that it doesn't pick up on external links and just focuses on plain mirroring. The man page doesn't say much about this flag, except that it is '[..] suitable for mirroring'.
Also, --adjust-extension is probably irrelevant for TC, but it appends .html to pages saved from .cgi, .php, .py or .asp URLs. The man page states that --adjust-extension is the same as --html-extension, but with a different name.

Post edited on 7th Oct 2024, 8:39am
>> No. 3477
>>3476
Which version of the man pages are you looking at? Mine seems to say mirror is a superset of recursive: the "--mirror" option "turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing."

Which seems fairly clear. (Other GNU projects do have unfortunately terse man pages; for some reason GNU hates man pages and prefers the "info" doc tool, which is pretty stupid.)
>> No. 3478
>>3477
>Which version of man pages are you looking at
GNU Wget 1.21.3 2022-05-14 WGET(1) on Debian 12.
I copied the relevant section and it reads as follows:

--mirror
Turn on options suitable for mirroring. This option turns on
recursion and time-stamping, sets infinite recursion depth and
keeps FTP directory listings. It is currently equivalent to -r -N
-l inf --no-remove-listing.
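
Going by that man page text, the two invocations below should crawl the same way. This is only a sketch: the helper names and the URL are placeholders, and it prints the commands unless DRY_RUN=0 is set. As far as I know, plain -r also stays on the starting host unless --span-hosts (-H) is given, so --mirror is not what keeps wget off external links.

```shell
#!/bin/sh
# Dry run by default: print the command. Set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Short form, as posted in the thread.
mirror_short() {
  run wget --mirror --page-requisites --adjust-extension --convert-links "$1"
}

# Same thing with --mirror spelled out as the man page describes it:
#   -r                    recursive retrieval
#   -N                    time-stamping (skip files that are not newer)
#   -l inf                no recursion depth limit
#   --no-remove-listing   keep FTP directory listings
mirror_expanded() {
  run wget -r -N -l inf --no-remove-listing \
    --page-requisites --adjust-extension --convert-links "$1"
}

mirror_short "https://example.org/"
mirror_expanded "https://example.org/"
```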

>> No. 3500
File 172940954229.jpg - (342.64KB , 1101x891 , f1262753d8f8516c4f07d273e958b4cb.jpg )
I archive YouTube channels like this:
yt-dlp --embed-chapters --sleep-requests 5 --download-archive archive (link to video tab of channel)
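
A small sketch of the same idea wrapped so it can be re-run later. The helper names, the archive.txt file name and the channel URL are placeholders. It prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: print the command. Set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# archive_channel is a hypothetical helper; $1 is the channel's videos tab URL.
archive_channel() {
  # --embed-chapters          write chapter markers into the video file
  # --sleep-requests 5        wait 5 seconds between requests during extraction
  # --download-archive FILE   record finished video IDs in FILE, skip them next run
  run yt-dlp --embed-chapters --sleep-requests 5 \
    --download-archive archive.txt "$1"
}

archive_channel "https://www.youtube.com/@CHANNEL/videos"
```

Because finished videos are recorded in the archive file, running the same command again should only fetch uploads that are new since the last run.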

