Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
236 changes: 236 additions & 0 deletions clang/docs/ClangIRCodeDuplication.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,236 @@
================================
ClangIR Code Duplication Roadmap
================================

.. contents::
:local:

Introduction
============

This document describes the general approach to code duplication in the ClangIR
code generation implementation. It acknowledges specific problems with the
current implementation, discusses strategies for mitigating the risk inherent in
the current approach, and describes a general long-term plan for addressing the
issue.

Background
==========

The ClangIR code generation is very closely modeled after Clang's LLVM IR code
generation, and we intend for the CIR produced to eventually be semantically
equivalent to the LLVM IR produced when not going through ClangIR. However, we
acknowledge that as the ClangIR implementation is under development, there will
be differences in semantics, both because we have not yet implemented all
features of the classic codegen and because the CIR dialect is still evolving
and does not yet have a way to represent all of the necessary semantics.

We have chosen to model the ClangIR code generation directly after the classic
codegen, to the point of following identical code structure, using similar names
and often duplicating the logic because this seemed to be the most certain path
to producing equivalent results. Having such nearly identical code allows for
direct comparison between the CIR codegen and the LLVM IR codegen to find what
is missing or incorrect in the CIR implementation.

However, we recognize that this is not a sustainable permanent solution. As
bugs are fixed and new features are added to the classic codegen, the process of
keeping the analogous CIR code up to date will be a purely manual process.

Long term, we need a more sustainable approach.

Current Strategy
================

Practical considerations require that we make steady progress towards a working
implementation of ClangIR. This necessity is directly opposed to the goal of
minimizing code duplication.

For this reason, we have decided to accept a large amount of code duplication
in the short term, even with the explicit understanding that this is producing
a significant amount of technical debt as the project progresses.

As the CIR implementation is developed, we often note small pieces of code that
could be shared with the classic codegen if they were moved to a different part
of the source, such as a shared utility class in some directory available to
both codegen implementations or by moving the function into a related AST class.
It is left to the discretion of the developer and reviewers to decide whether
such refactoring should be done during the CIR development, or if it is
sufficient to leave a comment in the code indicating this as an opportunity for
future improvement. Because much of the current code is likely to change when
the long term code sharing strategy is complete, we will lean towards only
implementing refactorings that make sense independent of the code sharing
problem.

We have discussed various ways that major classes such as CGCXXABI/CIRGenCXXABI
could be refactored to allow parts of there implementation to be shared today
through inheritence and templated base classes. However, this may prove to be
wasted effort when the permanent solution is developed, so we have decided that
it is better to accept significant amounts of code duplication now, and defer
this type of refactoring until it is clear what the permanent solution will be.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth noting that duplicating the code also makes removal of CIR less risky should CIR fail to pan out long-term. I know this isn't an expected (or desired) outcome, but the less interdependence we have between CIR and current IR gen, the easier it is to make those kinds of significant decisions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!


Mitigation Through Testing
==========================

The most important tactic that we are using to mitigate the risk of CIR diverging
from classic codegen is to incorporate two sets of LLVM IR checks in the CIR
codegen LIT tests. One set checks the LLVM IR that is produced by first
generating CIR and then lowering that to LLVM IR. Another set checks the LLVM IR
that is produced directly by the classic codegen.

At the time that tests are created, we compare the LLVM IR output from these two
paths to verify (manually) that any meaningful differences between them are the
result of known missing features in the current CIR implementation. Whenever
possible, differences are corrected in the same PR that the test is being added,
updating the CIR implementation as it is being developed.

However, these tests serve a second purpose. They also serve as sentinels to
alert us to changes in the classic codegen behavior that will need to be
accounted for in the CIR implementation. While we appreciate any help from
developers contributing to classic codegen, our current expectation is that it
will be the responsibility of the ClangIR contributors to update the CIR
implementation when these tests fail.

As the CIR implementation gets closer to the goal of IR that is semantically
equivalent to the LLVM IR produced by the classic codegen, we would like to
enhance the CIR tests to perform some automatic verification of the equivalence
of the generated LLVM IR, perhaps using a tool such as Alive2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"perhaps using a tool such as Alive2." -> "perhaps using a mix of tools such as opt -passes=normalize and Alive2."


Eventually, we would like to be able to run all existing classic codegen tests
using the CIR path as well.

Other Considerations
====================

The close modeling of CIR after classic codegen has also meant that the CIR
dialect often represents language details at a much lower level than it ideally
should.

In the interest of having a complete working implementation of ClangIR as soon
as is practical, we have chosen to take the approach of following the classic
codegen implementation closely in the initial implementation and only raising
the representation in the CIR dialect to a higher level when there is a clear
and immediate benefit to doing so.

Over time, we expect to progressively raise the CIR representation to a higher
level and remove low level details, including ABI-specific handling from the
dialect. However, having a working implementation in place makes it easier to
verify that the high level representation and subsequent lowering are correct.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumping my mental model as feedback, this is what I'd write:

Over time, we anticipate that CIR will initially operate at a higher abstraction level, especially when it is efficient to do so during code generation. Gradually, CIR will be lowered to other dialects or backends, including handling that is specific to particular ABIs. More complex transformations—such as library-aware idiom recognition or advanced loop representations—may occur later in the compilation pipeline through additional passes, which can be controlled by specific compiler flags. For now, having a functional implementation in place helps us validate both the high-level representation and the correctness of subsequent lowering steps.


Mixing With Other Dialects
==========================

Mixing of dialects is a central design feature of MLIR. The CIR dialect is
currently more self-contained than most dialects, but even now we generate
the ACC (OpenACCC) dialect in combination with CIR, and when support for OpenMP
and CUDA are added, similar mixing will occur.

We also expect CIR to be at least partially lowered to other dialects during
the optimization phase to enable features such as data dependence analysis, even
if we will eventually be lowering it to LLVM IR.

Therefore, any plan for generating LLVM IR from CIR must be integrated with the
general MLIR lowering design, which typically involves lowering to the LLVM
dialect, which is then transformed to LLVM IR.

Other Consumers of CIR and MLIR
===============================

We must also consider that we will not always be lowering CIR to LLVM IR. CIR,
usually mixed with other dialects, will also be directed to offload targets
and other code generators through interfaces that are opaque to Clang. We must
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Clang, examples include things like SPIR-V and MLIR core dialects.

still produce semantically correct CIR for these consumers.

Long Term Vision
================

As the CIR implementation matures, we will eliminate target-specific handling
from the high-level CIR generated by Clang. The high-level CIR will then be
progressively lowered to a form that is closer to LLVM IR, including a pass
that inserts ABI-specific handling, potentially representing the target-specific
details in another dialect.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very similar to the paragraph I added the brain dump above, maybe we just need it in one place?


As we raise CIR to this higher level implementation, there will naturally be
less code duplication, and less need to have the same logic repeated in the
CIR generation.

We will continue to use that same basic design and structure for CIR code
generation, with classes like CIRGenModule and CIRGenFunction that serve the
same purpose as their counterparts in classic codegen, but the handling there
will be more closely tied to core semantics and therefore less likely to require
frequent changes to stay in sync with classic codegen.

As the handling of low-level details is moved to later lowering phases, we will
need to move away from the current tight coupling of the CIR and classic codegen
implementations. As this happens, we will look for ways that this handling can
be moved to new classes that are specifically designed to be shared among
clients that are targeting different IR substrates. That is, rather than trying
to overlay reuse onto the existing implementations, we will replace relevant
parts of the existing implementation, piece by piece, as appropriate, with new
implementations that perform the same function but with a more general design.

Example: C Calling Convention Handling
======================================

C calling convention handling is an example of a general purpose redesign that
is already underway. This was started independently of CIR, but it will be
directly useful for lowering from high-level call representation in CIR to a
representation that includes the target- and calling convention-specific details
of function signatures, parameter type coercion, and so on.

The current CIR implementation duplicates most of the classic codegen handling
for function call handling, but it omits several pieces that handle type
coercion. This leads to an implementation that has all of the complexity of the
class codegen without actually achieving the goals of that complexity. It will
be a significant improvement to the CIR implementation to simplify the function
call handling in such a way that it generates a high-level representation of the
call, while preserving all information that will be needed to lower the call to
an ABI-compliant representation in a later phase of compilation.

This provides a clear example where trying to refactor the classic codegen in
some way to be reused by CIR would have been counterproductive. The classic
codegen implementation was tightly coupled with Clang's LLVM IR generation. The
implementation is being completely redesigned to allow general reuse, not just by
CIR, but also by other front ends.

The CIR calling convention lowering will make use of the general purpose C
calling convention library that is being created, but it should create an MLIR
transform pass on top of that library that is general enough to be used by other
dialects, such as FIR, that also need the same calling convention handling.

Significant Areas For Improvement
=================================

The following list enumerates some of the areas where significant restructuring
of the code is needed to enable better code sharing between CIR and classic
codegen. Each of these areas is relatively self-contained in the codegen
implementation, making the path to a shared implementation relatively clear.

- Constant expression evaluation
- Complex multiplication and division expansion
- Builtin function handling
- Exception Handling and C++ Cleanups
- Inline assembly handling
- C++ ABI Handling

- VTable generation
- Virtual function calls
- Constructor and destructor arguments
- Dynamic casts
- Base class address calculation
- Type descriptors
- Array new and delete

Pervasive Low-Level Issues
==========================

This section lists some of the features where a non-trivial amount of code
is shared between CIR and classic codegen, but the handling of the feature
is distributed across the codegen implementation, making it more difficult
to design an abstraction that can easily be shared.

- Global variable and function linkage
- Alignment management
- Debug information
- TBAA handling
- Sanitizer integration
- Lifetime markers
Loading