Thursday, August 30, 2007

using wget for data and file transfers

Did you ever have to transfer more than 50 files from a single site using a browser?
How about downloading gigabyte-sized files between hosts?
Have you ever run a large, unattended file transfer between hosts without watching it for brief disconnections or timeouts?
Have you ever downloaded a single file larger than 4 GB from a remote host that is not backed by a generator or UPS?
What about downloading multiple files from a site with different source locations, over FTP or HTTP, with irregular filename patterns?


Most Linux servers I know, and all the servers I have been managing, boot into runlevel 3, especially unattended servers administered from far-off locations.

With that in mind, data and file transfers are done with terminal commands between two or more hosts, either over the local network or across the internet. Here is how to accomplish file transfers over your network and via the internet.

This entry covers data and file transfers using the Linux command wget, which can transfer files interactively or in unattended mode. The aim is to make the most of your systems administration time on large backup and file transfers, both local and from remote hosts, while you stay proactive, busy, and effective on other work across hundreds of other servers.


USING WGET FOR FILE TRANSFERS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From man wget:

Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.

Wget is non-interactive, meaning that it can work in the background, while the user is not logged on. This allows you to start a retrieval and disconnect from the system, letting Wget finish the work. By contrast, most of the Web browsers require constant user’s presence, which can be a great hindrance when transferring a lot of data.

Wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as ‘‘recursive downloading.’’ While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be instructed to convert the links in downloaded HTML files to the local files for offline viewing.



WGET USAGE
~~~~~~~~~~

Transfer a 4 GB file from a website

# wget http://website.com/folder/bigisofile.iso

Suppose that while downloading bigisofile.iso, the remote host server suddenly went kaput due to a power failure and came back after 30 minutes. Resume the partially downloaded file using wget like so

# wget -c http://website.com/folder/bigisofile.iso

With the -c argument, any download interrupted by a network failure or disconnection is resumed from where it left off as soon as connectivity is re-established, instead of starting from scratch.

If a partially downloaded file already exists in the current folder and wget is issued without -c, wget will download the file again and save it under a different name, such as bigisofile.iso.1.

You can also specify the retry threshold with the --tries argument. The command below allows 10 retries before wget gives up.

# wget -c --tries=10 http://website.com/folder/bigisofile.iso

or

# wget -c -t 10 http://website.com/folder/bigisofile.iso

You can apply the same command to FTP, HTTP, and retrievals through HTTP proxies, like so

# wget -c --tries=10 ftp://website.com/folder/bigisofile.iso

For a visual progress display while downloading, you can issue it like so

# wget -c --progress=dot http://website.com/folder/bigisofile.iso

Rate limiting is also possible with the --limit-rate argument, as shown below, which caps the download rate at 100.5 KB per second

# wget -c --limit-rate=100.5k http://website.com/folder/bigisofile.iso

Alternatively, to limit the rate to 1 MB per second, it would be like so

# wget -c --limit-rate=1m http://website.com/folder/bigisofile.iso

Wget supports HTTP and FTP authentication as well, which can be used like so

# wget -c --user=user --password=passwd http://website.com/folder/bigisofile.iso

These can be overridden with the protocol-specific arguments --ftp-user/--ftp-password and --http-user/--http-password, like so

# wget -c --ftp-user=ftp-user --ftp-password=ftp-passwd ftp://10.10.0.100/file.txt

# wget -c --http-user=http-user --http-password=http-passwd http://10.10.0.100/file.txt

The wget command can also be used to post data to sites and save the resulting cookies, like so

# wget --save-cookies cookies.txt --post-data 'name=ben&passwd=ver' "http://localhost/auth.php"

After the one-time authentication with cookies shown above, we can proceed to grab the files we want to retrieve, like so

# wget --load-cookies cookies.txt -p http://localhost/goods/items.php

Recursion is also supported by wget. If you wish to download all files from a site recursively, this can be done like so

# wget -r "http://localhost/starthere/"

Recursive retrieval with no directory creation is also possible. This approach downloads only the files and does not recreate the remote directory structure locally

# wget -r -nd "http://localhost/starthere/"

Retrieving only the first two levels (or any depth you choose) with wget is possible like so

# wget -r -l2 "http://localhost/starthere/"


File globbing is also supported by wget for FTP URLs, using the special characters * ? [ ] . For HTTP, a similar effect is achieved with recursive retrieval and an accept list (-A). Here are more samples of wget with file globs and accept patterns

# wget "ftp://localhost/*.txt"
# wget "ftp://domain.com/pub/file??.vbs"
# wget "ftp://domain.com/pub/files??.*"
# wget -r -A "*.jpg" http://domain.com/pub/

Converting document links for local, offline viewing of the downloaded files and images is also supported by wget. This is possible using -k .
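For example, to mirror a section of a site and rewrite the links for offline browsing (sample URL only; -p is added here to also fetch images and other page requisites)

# wget -r -k -p "http://localhost/starthere/"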

Logging to a file is another nice feature we can get from wget by using -o, like so

# wget -c -o /var/log/logfile http://localhost/file.txt

Running wget in the background can be done with wget's own -b option or through the shell, just like backgrounding any other application, like so

# wget -b http://localhost/file.txt
or
# wget http://localhost/file.txt &

Wget is also capable of reading URLs from a file. This approach makes wget function in batch mode, like so

# wget -i URL-list.txt

With the -i argument, no source URL is expected on the command line anymore.
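The URL list is just a plain text file with one URL per line; for example, a hypothetical URL-list.txt might contain

http://website.com/folder/bigisofile.iso
ftp://10.10.0.100/file.txt
http://localhost/goods/items.php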

Values for retry, network, and DNS timeouts can also be defined explicitly with wget, like so

Network timeout of 3 seconds (sets all timeout values at once)
# wget -T 3 URL

DNS timeout of 3 seconds
# wget --dns-timeout=3 URL

Connect timeout of 3 seconds
# wget --connect-timeout=3 URL

Read timeout of 3 seconds
# wget --read-timeout=3 URL

A wait between retrievals can also be specified, like so
# wget -w 3 URL

Forcing wget to use IPv6 or IPv4 is done with the -6 and -4 arguments respectively, like so
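For example, using the same sample URL as above

# wget -4 http://website.com/folder/bigisofile.iso
# wget -6 http://website.com/folder/bigisofile.iso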

Disabling caching and cookies can be done with the --no-cache and --no-cookies arguments, like so
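For example, using a sample URL

# wget --no-cache --no-cookies http://localhost/file.txt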

Proxy authentication can also be supplied with the --proxy-user and --proxy-password arguments, as shown below

# wget --proxy-user=user --proxy-password=passwd URL

Additionally, HTTPS (SSL/TLS) is also supported by wget using the additional arguments shown below. The values in parentheses are the choices available for that particular argument; "file" refers to a local file and "directory" refers to a local directory. A sample command follows the list.

--secure-protocol= (auto, SSLv2, SSLv3, TLSv1)
--certificate=client_certificate_file
--certificate-type= (PEM, DER)
--private-key=private_key_file
--private-key-type= (PEM, DER)
--ca-certificate=certificate_file
--ca-directory=directory_source
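
A sample HTTPS retrieval with a client certificate might look like the command below; the certificate and key file names are placeholders only

# wget --secure-protocol=TLSv1 --certificate=/etc/ssl/client.pem --certificate-type=PEM --private-key=/etc/ssl/client.key --ca-certificate=/etc/ssl/ca.pem https://website.com/folder/file.txt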

--no-parent needs to be specified when doing recursive retrievals to keep wget from ascending into the parent directory, like so
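For example, to limit a recursive download to a single subdirectory (sample path only)

# wget -r --no-parent "http://localhost/starthere/subdir/"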

You can also redirect wget's output using pipes or the usual Linux redirection characters, like so
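For instance, -O - writes the retrieved document to standard output, where it can be piped or redirected (sample URL only)

# wget -qO- http://localhost/file.txt | grep keyword
# wget -O - http://localhost/file.txt > /tmp/file.txt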

Happy wget!
