Archive a website with wget
On occasion I have needed to download the contents of a website before it was decommissioned. I wanted to keep the content, but had no interest in maintaining the entire website and its data. A quick and easy way to do this is with wget.
To do this, open a terminal and type:
mkdir website-name
cd website-name
You can name the folder anything you like.
Now run the following command:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla https://joshdawes.com
Replace joshdawes.com with the site you actually want to archive, and make sure it is one you have permission to copy, such as a site you own. Here is what each flag does:
- --limit-rate=200k: Limit the download to 200 KB/s; higher rates can look suspicious to the server.
- --no-clobber: Do not overwrite files that already exist, which is useful if you need to resume an interrupted run.
- --convert-links: Rewrite links in the downloaded pages so they work locally, offline (see the check after this list).
- --random-wait: Insert random pauses between requests.
- -r: Recursive; follows links to download the full site.
- -p: Downloads page requisites such as images and stylesheets so pages render properly.
- -E: Saves files with the appropriate extension (e.g. .html).
- -e robots=off: Ignore robots.txt restrictions (only use this where you have permission).
- -U mozilla: Send a browser-like user agent string.
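Once the download finishes, wget puts everything in a folder named after the host (joshdawes.com in this example), and thanks to --convert-links the copy should be browsable offline. As a quick check, you can open the saved pages directly, or serve the folder locally. The following is a minimal sketch that assumes Python 3 is installed and that the site has an index.html at its root:

cd joshdawes.com
python3 -m http.server 8000

Then visit http://localhost:8000 in your browser and click around to confirm the internal links resolve to your local copies rather than the live site.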