Description
During our test domain crawls (front page only) with Btrix 1.6.4, we found one URL with some embedded, huge YouTube videos:
https://www.skvi-katzenberger.at/
The crawl containing this domain came to around 69 GB, with a seed list of 400.
In the domain crawls we generate the seeds with all variations (http://, http://www., https://, https://www.).
After that we checked the domains of the domain list and started only 5 seeds from the list per container.
Still, the amount on disk was ~10 GB.
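For context, a minimal sketch of how such seed variations might be generated from a plain list of bare domains (file names are hypothetical, not our actual tooling):

# expand each bare domain into the four scheme/www variants used as seeds
while read -r domain; do
  for prefix in "https://www." "https://" "http://www." "http://"; do
    echo "${prefix}${domain}"
  done
done < domains.txt > seeds.txt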
Then we decided we don't want to crawl the videos and deactivated autoplay; here is the docker command:
docker run -d --name ONB_Btrix_HUGE_DC_LIST_TEST_BULK_7_20250922103146 -e NODE_OPTIONS='--max-old-space-size=32768' -p 42667:42667 -p 37185:37185 -v /home/user/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.6.4 crawl --screencastPort 42667 --healthCheckPort 37185 --scopeType page --headless --delay 0 --behaviorTimeout 60 --pageLoadTimeout 60 --waitUntil networkidle0 --saveState always --logging stats,info --url https://www.skvi-katzenberger.at/ --depth 0 --workers 1 --useSHA1 --collection HUGE_DC_LIST_TEST_BULK_7_20250922103146 --behaviors ["autofetch","autoscroll","siteSpecific"]
This ended with a crawl with no videos (as expected) and a size of ~30 MB.
Then I compared it with videos enabled, removing --behaviors again (as it was in the domain crawl),
but it still ended up with no videos (because of the timeout!) and a crawl of 50 MB.
But the size increased dramatically when we added the seed list again via config (not changing any other value).
config file:
seeds:
  - url: https://www.skvi-katzenberger.at
    depth: 0
  - url: https://skvi-katzenberger.at
    depth: 0
  - url: http://skvi-katzenberger.at
    depth: 0
  - url: http://www.skvi-katzenberger.at
    depth: 0
start cmd:
docker run -d --name ONB_Btrix_HUGE_DC_LIST_TEST_BULK_7_20250922103146 -e NODE_OPTIONS='--max-old-space-size=32768' -p 42667:42667 -p 37185:37185 -v /home/user/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.6.4 crawl --screencastPort 42667 --healthCheckPort 37185 --scopeType page --headless --delay 0 --behaviorTimeout 60 --pageLoadTimeout 60 --waitUntil networkidle0 --saveState always --logging stats,info --config /crawls/config/HUGE_DC_LIST_TEST_BULK_7_20250922103146.yaml --depth 0 --workers 1 --useSHA1 --collection HUGE_DC_LIST_TEST_BULK_7_2025092210314
And suddenly the crawl had 2.7 GB [remember, in the domain crawl there were 400 seeds for one container].
My suspicion now is that the download of the videos is not limited by the behavior timeout; it seems to me the crawler uses the entire crawl time for downloading the videos triggered by the first autoplay clicks.
I think this is not the expected behavior, and not what is written in the docs.
I would expect the crawler to download for a maximum of 60 seconds after the first autoplay click and to stop the download once that time limit is reached.
This means it makes an incredible difference at which position a seed sits in the config.yaml.
Just imagine a bunch of seeds with no videos plus my example at the beginning: this will lead to a huge crawl of more than 50 GB.
But when the config.yaml starts with the bunch of seeds with no videos and my example comes at the end as the last seed, the crawl would probably stay small, < 1 GB.
From my point of view the resulting size should be the same whichever seed comes first!
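Until this is clarified, a possible mitigation sketch (not a fix for the behavior-timeout issue itself) would be to cap overall crawl growth so that a single video-heavy seed cannot consume the whole run. The --sizeLimit and --timeLimit options are taken from the browsertrix-crawler CLI help as I understand it; the values below are examples only:

# hypothetical mitigation, example values only:
#   --sizeLimit : save state and exit once the archive exceeds ~1 GB
#   --timeLimit : save state and exit after 1 hour of crawl time
docker run -d --name SIZE_CAPPED_TEST \
  -v /home/user/browsertrix/crawls/:/crawls/ \
  webrecorder/browsertrix-crawler:1.6.4 crawl \
  --config /crawls/config/HUGE_DC_LIST_TEST_BULK_7_20250922103146.yaml \
  --scopeType page --headless --workers 1 \
  --behaviorTimeout 60 --pageLoadTimeout 60 --waitUntil networkidle0 \
  --sizeLimit 1073741824 --timeLimit 3600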