
Huge Size Difference when capturing Videos #887

@gitreich


During our test domain crawls (front page only) with Btrix 1.6.4, we found one URL with several huge embedded YouTube videos:
https://www.skvi-katzenberger.at/

The crawl containing this domain came to around 69 GB for a seed list of 400.

In the domain crawl we generate the seeds in all variations (http://, http://www., https://, https://www.).
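For readers who want to reproduce the seed list, here is a minimal sketch of how the four variants could be generated from a bare domain. The function name `seed_variants` is hypothetical and not part of Browsertrix; our actual tooling may differ.

```python
def seed_variants(domain: str) -> list[str]:
    """Return the four scheme/www variants of a domain (hypothetical helper)."""
    # Normalize so that "www.example.org" and "example.org" give the same result.
    bare = domain.removeprefix("www.")
    return [
        f"{scheme}://{prefix}{bare}"
        for scheme in ("http", "https")
        for prefix in ("", "www.")
    ]


if __name__ == "__main__":
    for url in seed_variants("skvi-katzenberger.at"):
        print(url)
```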

After that we checked the domains in the domain list and started containers with only 5 seeds each.
Still, the amount on disk was ~10 GB.

Then we decided we did not want to crawl the videos and deactivated autoplay. Here is the docker command:

docker run -d --name ONB_Btrix_HUGE_DC_LIST_TEST_BULK_7_20250922103146 \
  -e NODE_OPTIONS='--max-old-space-size=32768' \
  -p 42667:42667 -p 37185:37185 \
  -v /home/user/browsertrix/crawls/:/crawls/ \
  webrecorder/browsertrix-crawler:1.6.4 crawl \
  --screencastPort 42667 --healthCheckPort 37185 \
  --scopeType page --headless --delay 0 \
  --behaviorTimeout 60 --pageLoadTimeout 60 --waitUntil networkidle0 \
  --saveState always --logging stats,info \
  --url https://www.skvi-katzenberger.at/ \
  --depth 0 --workers 1 --useSHA1 \
  --collection HUGE_DC_LIST_TEST_BULK_7_20250922103146 \
  --behaviors ["autofetch","autoscroll","siteSpecific"]

This ended with a crawl with no videos (as expected) and a size of ~30 MB.

Then, to compare against a crawl with videos, I removed --behaviors again (as it was in the domain crawl).

But it ended up with no videos (because of the timeout!) and a crawl of 50 MB.

But suddenly the size increased dramatically when we added the seed list again via config (not changing any other value).
Config file:

seeds:
 - url: https://www.skvi-katzenberger.at
   depth: 0
 - url: https://skvi-katzenberger.at
   depth: 0
 - url: http://skvi-katzenberger.at
   depth: 0
 - url: http://www.skvi-katzenberger.at
   depth: 0
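
A config in this shape could be written out programmatically. The following is only a sketch: `render_seed_config` is a hypothetical helper that emits plain text in the same layout as the file above, without a YAML library.

```python
def render_seed_config(urls: list[str], depth: int = 0) -> str:
    """Render a seed list in the layout shown above (hypothetical helper)."""
    lines = ["seeds:"]
    for url in urls:
        # One entry per seed, each with its own depth setting.
        lines.append(f" - url: {url}")
        lines.append(f"   depth: {depth}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_seed_config([
        "https://www.skvi-katzenberger.at",
        "https://skvi-katzenberger.at",
    ]), end="")
```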

start cmd:

docker run -d --name ONB_Btrix_HUGE_DC_LIST_TEST_BULK_7_20250922103146 \
  -e NODE_OPTIONS='--max-old-space-size=32768' \
  -p 42667:42667 -p 37185:37185 \
  -v /home/user/browsertrix/crawls/:/crawls/ \
  webrecorder/browsertrix-crawler:1.6.4 crawl \
  --screencastPort 42667 --healthCheckPort 37185 \
  --scopeType page --headless --delay 0 \
  --behaviorTimeout 60 --pageLoadTimeout 60 --waitUntil networkidle0 \
  --saveState always --logging stats,info \
  --config /crawls/config/HUGE_DC_LIST_TEST_BULK_7_20250922103146.yaml \
  --depth 0 --workers 1 --useSHA1 \
  --collection HUGE_DC_LIST_TEST_BULK_7_2025092210314

And suddenly the crawl had 2.7 GB (remember that in the domain crawl there were 400 seeds for one container).

My suspicion now is that the download of the videos is not limited by the behavior timeout (--behaviorTimeout); it seems to me that the crawler uses the entire crawl time for downloading the videos started by the first autoplay clicks.
I think this is not the expected behavior, and not what is written in the docs.
I would expect the crawler to download for a maximum of 60 seconds after the first click on autoplay and to stop the download when the time limit is reached.

This means it makes an incredible difference at which position a seed is in the config.yaml.
Just imagine a bunch of seeds with no videos and my example at the beginning: this will lead to a huge crawl of more than 50 GB in size.
But when the config.yaml has the bunch of seeds with no videos at the beginning and my example at the end as the last seed, the crawl would be small, probably < 1 GB.
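
To make the ordering argument concrete, here is a toy model of my suspicion. It is not the crawler's actual logic, and all numbers (seconds per seed, download rate) are made-up assumptions. It compares a download bounded by the behavior timeout with one that keeps running until the whole crawl ends.

```python
def video_download_mb(seeds, page_time_s, rate_mb_s, behavior_timeout_s, bounded):
    """Toy estimate of downloaded video data for one crawl (hypothetical model).

    seeds: list of bools, True if that seed starts a video download.
    bounded=True:  each download stops after behavior_timeout_s (what I expect).
    bounded=False: each download runs until the crawl ends (what I suspect).
    """
    total_time_s = len(seeds) * page_time_s
    size_mb = 0
    for position, has_video in enumerate(seeds):
        if not has_video:
            continue
        start_s = position * page_time_s
        # Under the suspected behavior, the download lasts for the rest of the crawl.
        duration_s = behavior_timeout_s if bounded else total_time_s - start_s
        size_mb += duration_s * rate_mb_s
    return size_mb


if __name__ == "__main__":
    video_first = [True] + [False] * 399
    video_last = [False] * 399 + [True]
    for name, seeds in (("video first", video_first), ("video last", video_last)):
        for bounded in (True, False):
            mb = video_download_mb(seeds, 60, 2, 60, bounded)
            print(f"{name}, bounded={bounded}: {mb} MB")
```

In this model the bounded variant gives the same size for both orderings, while the unbounded variant grows with the number of seeds that follow the video page.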

From my point of view, the resulting size should be the same whichever seed comes first!
