Multi-threaded web scraper to download all the tutorials from www.learncpp.com and convert them to PDF files concurrently.
Please consider supporting learncpp.com here: https://www.learncpp.com/about/
Get the image:

```shell
docker pull amalrajan/learncpp-download:latest
```

And run the container:

```shell
docker run --rm --name=learncpp-download --mount type=bind,destination=/app/learncpp,source=/home/amalr/temp/downloads amalrajan/learncpp-download
```

Replace /home/amalr/temp/downloads with a local path on your system where you'd want the files to get downloaded.
To run the scraper without Docker, you need Python 3.10 and wkhtmltopdf installed on your system.
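As a quick sanity check, you can verify both prerequisites from Python. This is a minimal sketch, not part of the repository, and the file name check_prereqs.py is made up for illustration:

```python
# check_prereqs.py -- illustrative sketch; not shipped with the repository.
import shutil
import sys

# The README calls for Python 3.10; this check accepts 3.10 or newer.
assert sys.version_info >= (3, 10), "Python 3.10+ is required"

# wkhtmltopdf must be discoverable on PATH for the HTML-to-PDF conversion.
assert shutil.which("wkhtmltopdf") is not None, "wkhtmltopdf not found on PATH"

print("Prerequisites look good.")
```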
Clone the repository:

```shell
git clone https://github.com/amalrajan/learncpp-download.git
```

Install Python dependencies:

```shell
cd learncpp-download
pip install -r requirements.txt
```

Run the script:

```shell
scrapy crawl learncpp
```

You'll find the downloaded files inside the learncpp directory under the repository root directory.
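If you prefer launching the crawl from Python rather than the scrapy CLI, the sketch below uses Scrapy's programmatic API. It assumes it is run from the repository root (so the project settings can be found) and that the spider is registered under the name learncpp, as in the command above:

```python
# run_spider.py -- illustrative alternative to `scrapy crawl learncpp`.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py (requires running from the repository root).
process = CrawlerProcess(get_project_settings())
process.crawl("learncpp")  # spider name, as used by `scrapy crawl learncpp`
process.start()  # blocks until the crawl finishes
```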
Rate Limit Errors:
- Modify settings.py.
- Increase DOWNLOAD_DELAY (default: 0) to 0.2, as sketched below.
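For reference, the change amounts to a single Scrapy setting; the exact layout of settings.py in the repository may differ from this sketch:

```python
# settings.py -- throttle requests to ease rate limiting.
# DOWNLOAD_DELAY is the pause (in seconds) Scrapy waits between
# consecutive requests to the same site; this project's default is 0.
DOWNLOAD_DELAY = 0.2
```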
High CPU Usage:
- Adjust max_workers in learncpp.py.
- Decrease from the default 192 to reduce CPU load.

```python
self.executor = ThreadPoolExecutor(
    max_workers=192
)  # Limit to 192 concurrent PDF conversions
```
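If a fixed 192 is too aggressive for your machine, one option (a sketch, not how the repository configures it) is to derive the worker count from the CPU count:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Illustrative sizing: cap the pool by CPU count instead of a fixed 192.
# The 4x multiplier is an arbitrary assumption; tune it to your CPU budget.
max_workers = min(192, (os.cpu_count() or 2) * 4)
executor = ThreadPoolExecutor(max_workers=max_workers)  # fewer workers, lower CPU load
```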
Further Issues:
- Report at https://github.com/amalrajan/learncpp-download/issues. Attach console logs.