
Conversation

@janbuchar
Contributor

@janbuchar janbuchar commented Aug 7, 2025

Plan

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

  • Introduce a ContextPipeline abstraction
  • Update crawlers to use it
  • Make sure that existing tests pass
  • Refine the ContextPipeline.compose signature and the semantics of BasicCrawlerOptions.contextPipelineEnhancer to maximize DX
  • Write tests for the contextPipelineEnhancer
  • Resolve added TODO comments (fix immediately or make issues)
  • Update documentation

Intent

The context-pipeline branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.

The Problem

  1. Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of BasicCrawler lifecycle methods made the execution flow even harder to follow.

  2. Context enhancement via monkey-patching: Manual property assignments (crawlingContext.page = page, crawlingContext.$ = $) were scattered everywhere - hard to follow and impossible to reason about.

  3. Cleanup coordination: Resource cleanup was handled by separate _cleanupContext methods that were not co-located with the initialization.

  4. Extension mechanism was broken: The CrawlerExtension.use() API tried to let you extend crawlers (the ones based on HttpCrawler) by overwriting properties - completely type-unsafe and fragile as hell.

The Solution

Introduces ContextPipeline - a middleware-based composition pattern where:

  • Each crawler layer defines how it enhances the context through explicit action functions
  • Cleanup logic is co-located with initialization via optional cleanup functions
  • Type safety is maintained through TypeScript generics that track context transformations
  • The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup
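The bullet points above can be sketched as a minimal pipeline. This is an illustration of the pattern only, not the actual Crawlee implementation - the `ContextPipeline` name and the action/cleanup vocabulary come from this PR, while `create`, `run`, and all internals are made up for the sketch:

```typescript
// Minimal sketch of the middleware pattern described above (not the real
// Crawlee implementation). Each middleware enhances the context and may
// register a cleanup that runs after the consumer, even on failure.
interface ContextMiddleware<TIn, TExt> {
    action: (context: TIn) => Promise<TExt>;
    cleanup?: (context: TIn & TExt, error?: unknown) => Promise<void>;
}

class ContextPipeline<TBase extends object, TFinal extends object> {
    private constructor(private readonly middlewares: ContextMiddleware<any, any>[]) {}

    static create<T extends object>(): ContextPipeline<T, T> {
        return new ContextPipeline<T, T>([]);
    }

    compose<TExt extends object>(
        middleware: ContextMiddleware<TFinal, TExt>,
    ): ContextPipeline<TBase, TFinal & TExt> {
        return new ContextPipeline([...this.middlewares, middleware]);
    }

    // Executes middlewares sequentially, then the consumer; cleanups run in
    // reverse order and are guaranteed even when the consumer throws.
    async run(base: TBase, consumer: (context: TFinal) => Promise<void>): Promise<void> {
        const cleanups: Array<(error?: unknown) => Promise<void>> = [];
        const context: any = { ...base };
        let error: unknown;
        try {
            for (const middleware of this.middlewares) {
                Object.assign(context, await middleware.action(context));
                const { cleanup } = middleware;
                if (cleanup) cleanups.push((e) => cleanup(context, e));
            }
            await consumer(context);
        } catch (e) {
            error = e;
            throw e;
        } finally {
            for (const cleanup of cleanups.reverse()) await cleanup(error);
        }
    }
}
```

With this shape, each crawler layer contributes one compose call, and its teardown lives next to its setup instead of in a separate _cleanupContext method.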

Key Design Decisions

1. Middleware Pattern

Declarative middleware composition with co-located cleanup:

contextPipeline.compose({
  action: async (context) => ({ page, $ }),
  cleanup: async (context) => { await context.page.close(); }
})

2. Type-Safe Context Building

The ContextPipeline<TBase, TFinal> tracks type transformations through the chain:

ContextPipeline<CrawlingContext, CrawlingContext>
  .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
  .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>

3. New Extension Mechanism

The CrawlerExtension.use() is gone. New approach via contextPipelineEnhancer:

new BasicCrawler({
  contextPipelineEnhancer: (pipeline) => 
    pipeline.compose({
      action: async (context) => ({ myCustomProp: ... })
    })
})

Discussion Topics

1. The API

The current way to express a context pipeline middleware has some shortcomings (ContextPipeline.compose, BasicCrawlerOptions.contextPipelineEnhancer). I suggest resolving this in another PR.

2. Migration Path

For most legitimate use cases, this should be non-breaking. Those who extend the Crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of BasicCrawler and HttpCrawler changed quite a bit.

3. Performance

The pipeline uses Object.defineProperties for each middleware. Is this a serious performance consideration?

@janbuchar janbuchar added the t-tooling label (Issues with this label are in the ownership of the tooling team) on Aug 7, 2025
/** The main middleware function that enhances the context */
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
Contributor Author

Returning a cleanup callback from action may be a better approach. A benefit of that would be having access to the outer scope
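Such an API might look like this - a hypothetical shape just to illustrate the outer-scope benefit; the names (ActionWithCleanup, page) are stand-ins, not anything from the PR:

```typescript
// Hypothetical alternative middleware shape: the action returns both the
// context extension and a cleanup callback, so cleanup can close over local
// variables instead of reading them back off the context.
type ActionWithCleanup<TIn, TExt> = (context: TIn) => Promise<{
    extension: TExt;
    cleanup?: (error?: unknown) => Promise<void>;
}>;

const closed: string[] = [];

const browserMiddleware: ActionWithCleanup<{ url: string }, { page: string }> = async (context) => {
    const page = `page(${context.url})`; // stand-in for browser.newPage()
    return {
        extension: { page },
        // `page` is captured from the enclosing scope - no context lookup needed
        cleanup: async () => { closed.push(page); },
    };
};
```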

@janbuchar janbuchar marked this pull request as ready for review October 16, 2025 13:26
import { Router } from '../index.js';
import { parseContentTypeFromResponse } from './utils.js';

export type FileDownloadOptions<
Contributor Author

I made quite substantial changes to the options here so that we don't need to accept requestHandler xor streamHandler. My motivation was that I thought it'd be difficult to shoehorn this into ContextPipeline, but it actually may be pretty easy.

So... is this worth it or do we want the original interface?

Mainly for @barjin 's eyes

Contributor

I think that with the changes from #3163, the streamHandler / requestHandler might not be necessary anyway.

I'm not too happy about the current design anyway, so feel free to change the interface as you see fit 👍

Contributor

@barjin barjin left a comment

Let's merge this quick so we unblock all the other PRs 😄

A few ideas to get the discussion started:


export class ContextPipelineInitializationError extends Error {
constructor(
public error: unknown,
Contributor

Perhaps you'd want to use the Error cause field (see MDN)? It seems more idiomatic that way.

Contributor Author

Great idea, thanks!

export class ContextPipelineInitializationError extends Error {
constructor(
public error: unknown,
public crawlingContext: {},
Contributor

Logging the error instance will also log the entire attached crawling context into the console. This can be quite heavy - see all the stuff we're attaching to the context e.g. in the browser crawlers.

I see the context being useful for debugging as well, so this is rather a question - are you aware / sure about this?

Contributor Author

I'm aware of that, it actually bit me earlier 😁 After some consideration, we probably don't need the context here. The Python version does, but here we can make do with a simple cast.

*/
let numberOfRotations = -1;
const failedRequestHandler = vitest.fn();
const got = new GotScrapingHttpClient();
Contributor

afaiac we want to sunset got-scraping in favour of impit. Can you please use ImpitHttpClient here?

) {
super({
...options,
contextPipelineBuilder: () => this.buildContextPipeline(),
Contributor

Is this a necessary step when subclassing Crawler instances now? Shouldn't this be the default behaviour anyway?

Contributor Author

Only BrowserCrawler, and that one is special in many ways...

...options,
contextPipelineBuilder: () =>
this.buildContextPipeline().compose({
action: async (context) => this.parseContent(context),
Contributor

nit: maybe making two methods (parseContent, addHelpers) would help with dogfooding the ContextPipeline design? The current parseContent method does more than just parse content.

Contributor Author

Just to be sure, you mean that we should split the parseContent middleware into two separate steps, right? Sounds good to me...

Contributor

Yes, that was the idea 👍
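The split discussed above might look like this - two independent middleware steps instead of one. The step names come from the thread; the bodies are stand-ins (no real Cheerio involved):

```typescript
// Hypothetical split of the single parseContent middleware into two steps:
// one that parses the response body, one that attaches helper functions.
const parseContent = {
    action: async (context: { body: string }) => ({
        // stand-in for cheerio.load(context.body)
        $: (selector: string) => `${selector}@${context.body.length}b`,
    }),
};

const addHelpers = {
    action: async (context: { $: (selector: string) => string }) => ({
        waitForSelector: async (selector: string) => context.$(selector),
    }),
};
```

Each step then only depends on the context shape produced by the previous one, which is exactly what the ContextPipeline generics are meant to track.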

};
}
async waitForSelector(selector: string, timeoutMs = 5_000) {
const $ = cheerio.load(this.body);
Contributor

Are we sure that `this` in the context extensions will always be bound to the context? I'd rather reference the crawlingContext passed as the param to the outer function, just to be sure.
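The failure mode being flagged here, demonstrated in isolation with toy objects: request handlers commonly destructure the crawling context, which unbinds any method that relies on `this`.

```typescript
// Why `this` inside context helpers is fragile: destructuring the context
// (which request handlers commonly do) unbinds the method.
const crawlingContext = {
    body: '<html></html>',
    waitForSelectorViaThis(this: { body: string }, selector: string) {
        return `${selector} in ${this.body}`;
    },
};

// Safer: capture the context in a closure when building the helpers, as the
// comment above suggests.
const makeHelpers = (context: { body: string }) => ({
    waitForSelector: (selector: string) => `${selector} in ${context.body}`,
});

const { waitForSelector } = makeHelpers(crawlingContext);
// works even when destructured - no `this` involved
```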

/**
* Get a key-value store with given name or id, or the default one for the crawler.
*/
getKeyValueStore: (idOrName?: string) => Promise<KeyValueStore>;
Contributor

The RestrictedContext's getKeyValueStore returns a "limited" KVS type - some methods (e.g. recordExists or getAutosavedValue) are missing from there. Those are imo safe to call from a regular (non-adaptive) crawler.

Was this intentional?

Comment on lines +174 to +177
contextPipelineBuilder: () =>
this.buildContextPipeline().compose({
action: async (context) => await this.parseContent(context),
}),
Contributor

idea: what about replacing this with

override async buildContextPipeline() {
    return super.buildContextPipeline().compose({
        ...
    });
}

It feels slightly more OOP.

Also, then there might be no need for the contextPipelineBuilder constructor option and we'd only have the ...Enhancer method? I feel like having two such methods might be too much for the users (plus, imo a user will only want to extend the fully initialized context anyway).

Contributor Author

I will extend this later with more detail, but the issue with a simple abstract method lies in correct generic types - when implementing CheerioCrawler for instance, you want to extend HttpCrawler<CheerioCrawlingContext>, but the inheritance "language" cannot express overriding HttpCrawler.buildContextPipeline (that returns a ContextPipeline<CrawlingContext, HttpCrawlingContext>) with CheerioCrawler.buildContextPipeline (that returns ContextPipeline<CrawlingContext, CheerioCrawlingContext>).

Oof. I hope that makes at least some sense. In any case, contextPipelineBuilder is not meant for your regular user - you shouldn't need it unless you're implementing a new crawler type. We should probably make that clearer in the docs.

Member

@B4nan B4nan left a comment

first round of comments, it looks good overall, well done

* {@apilink BasicCrawlerOptions.requestHandlerTimeoutSecs|`requestHandlerTimeoutSecs`} for {@apilink BasicCrawler}
* that would not impair functionality (not timeout before crawlers).
*/
export const BASIC_CRAWLER_TIMEOUT_BUFFER_SECS = 10;
Member

seeing this is removed, does the PR also handle #2951?


export interface CrawlingContext<Crawler = unknown, UserData extends Dictionary = Dictionary>
extends RestrictedCrawlingContext<UserData> {
crawler: Crawler;
Member

Let's document all breaking changes right when they happen (including those on the type level, if there are any)

https://github.com/apify/crawlee/blob/v4/docs/upgrading/upgrading_v4.md

also this is used in the docs in several examples, e.g. here or here or here

*/
parseWithCheerio(selector?: string, timeoutMs?: number): Promise<CheerioRoot>;

enqueueLinks(options?: EnqueueLinksOptions): Promise<BatchAddRequestsResult>;
Member

this should have a jsdoc

JSONData extends Dictionary = any, // with default to Dictionary we cant use a typed router in untyped crawler
> extends InternalHttpCrawlingContext<UserData, JSONData, CheerioCrawler> {
> extends InternalHttpCrawlingContext<UserData, JSONData> {
body: string;
Member

same here i guess

* @template TCrawlingContext - The input context type for this middleware
* @template TCrawlingContextExtension - The enhanced output context type
*/
export interface ContextMiddleware<TCrawlingContext extends {}, TCrawlingContextExtension extends {}> {
Member

do we really need the extends {} (not just here)?

{} doesn't represent an object on type level. i see you use it as a type here, that also feels wrong. if you want it to represent anything, lets use any or unknown. if it should be an object, object would be better, Record<string, unknown> even better.

if i change the type in the errors to unknown, all of those can be removed from here.
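For reference, the distinction being drawn above, as a quick demonstration - these are general TypeScript facts, not Crawlee code:

```typescript
// `{}` at the type level means "anything non-nullish", not "an object":
const a: {} = 42;          // compiles - a number satisfies {}
const b: {} = 'hello';     // compiles - so does a string
// const c: {} = null;     // error: null is not assignable to {}

// `object` rejects primitives:
// const d: object = 42;   // error: number is not assignable to object

// Record<string, unknown> is the most precise for plain bags of properties
// and additionally allows safe indexed access:
const e: Record<string, unknown> = { page: 'fake-page' };
```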

Comment on lines +19 to +21
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
Member

Let's allow sync callbacks too

Suggested change

- action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
+ action: (context: TCrawlingContext) => Awaitable<TCrawlingContextExtension>;
  /** Optional cleanup function called after the consumer finishes or fails */
- cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
+ cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Awaitable<void>;
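For context, Awaitable is conventionally defined as a union with PromiseLike, so a plain `await` at the call site normalizes both forms; the definition below is the usual one (assumed here, not quoted from the PR):

```typescript
// The usual definition of the Awaitable utility type:
type Awaitable<T> = T | PromiseLike<T>;

// Both a sync and an async callback now satisfy the same signature:
type Action = (context: { url: string }) => Awaitable<{ ok: boolean }>;

const syncAction: Action = () => ({ ok: true });
const asyncAction: Action = async () => ({ ok: true });

// The pipeline just awaits the result either way:
const runAction = async (action: Action) => (await action({ url: 'https://example.com' })).ok;
```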

Comment on lines +526 to +543
private async performNavigation(crawlingContext: Context) {
if (crawlingContext.request.skipNavigation) {
return {
request: new Proxy(crawlingContext.request, {
get(target, propertyName, receiver) {
if (propertyName === 'loadedUrl') {
throw new Error(
'The `request.loadedUrl` property is not available - `skipNavigation` was used',
);
}
return Reflect.get(target, propertyName, receiver);
},
}) as LoadedRequest<Request>,
get response() {
throw new Error('The `response` property is not available - `skipNavigation` was used');
},
};
}
Member

nice touch 👏
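For readers skimming the diff, the guard can be exercised standalone (simplified - a plain object stands in for the real Request class):

```typescript
// Simplified version of the skipNavigation guard quoted above: reading
// `loadedUrl` on a request that never navigated throws an explicit error
// instead of silently yielding undefined.
const request = { url: 'https://example.com', loadedUrl: undefined as string | undefined };

const guarded = new Proxy(request, {
    get(target, propertyName, receiver) {
        if (propertyName === 'loadedUrl') {
            throw new Error('The `request.loadedUrl` property is not available - `skipNavigation` was used');
        }
        return Reflect.get(target, propertyName, receiver);
    },
});
```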

*
* TODO example
*/
contextPipelineEnhancer?: (
Member

what about calling this extendContext or something less scary :]

}
}
this.useSessionPool = useSessionPool;
this.crawlingContexts = new Map();
Member

let's document this as a BC, and hope nobody will ask us to put it back 🙃 but i guess people can do this easily in their own code

(back in the day, we suggested this here, can't see other references in the docs)
