[Clang][CIR][Doc] Document CIR code duplication plans #166457
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,236 @@ | ||
| ================================ | ||
| ClangIR Code Duplication Roadmap | ||
| ================================ | ||
|
|
||
| .. contents:: | ||
|    :local: | ||
|
|
||
| Introduction | ||
| ============ | ||
|
|
||
| This document describes the general approach to code duplication in the ClangIR | ||
| code generation implementation. It acknowledges specific problems with the | ||
| current implementation, discusses strategies for mitigating the risk inherent in | ||
| the current approach, and describes a general long-term plan for addressing the | ||
| issue. | ||
|
|
||
| Background | ||
| ========== | ||
|
|
||
| The ClangIR code generation is very closely modeled after Clang's LLVM IR code | ||
| generation, and we intend for the CIR produced to eventually be semantically | ||
| equivalent to the LLVM IR produced when not going through ClangIR. However, we | ||
| acknowledge that as the ClangIR implementation is under development, there will | ||
| be differences in semantics, both because we have not yet implemented all | ||
| features of the classic codegen and because the CIR dialect is still evolving | ||
| and does not yet have a way to represent all of the necessary semantics. | ||
|
|
||
| We have chosen to model the ClangIR code generation directly after the classic | ||
| codegen, to the point of following identical code structure, using similar names | ||
| and often duplicating the logic because this seemed to be the most certain path | ||
| to producing equivalent results. Having such nearly identical code allows for | ||
| direct comparison between the CIR codegen and the LLVM IR codegen to find what | ||
| is missing or incorrect in the CIR implementation. | ||
|
|
||
| However, we recognize that this is not a sustainable permanent solution. As | ||
| bugs are fixed and new features are added to the classic codegen, keeping the | ||
| analogous CIR code up to date will be a purely manual process. | ||
|
|
||
| Long term, we need a more sustainable approach. | ||
|
|
||
| Current Strategy | ||
| ================ | ||
|
|
||
| Practical considerations require that we make steady progress towards a working | ||
| implementation of ClangIR. This necessity is directly opposed to the goal of | ||
| minimizing code duplication. | ||
|
|
||
| For this reason, we have decided to accept a large amount of code duplication | ||
| in the short term, even with the explicit understanding that this is producing | ||
| a significant amount of technical debt as the project progresses. | ||
|
|
||
| As the CIR implementation is developed, we often note small pieces of code that | ||
| could be shared with the classic codegen if they were moved to a different part | ||
| of the source, such as a shared utility class in some directory available to | ||
| both codegen implementations or by moving the function into a related AST class. | ||
| It is left to the discretion of the developer and reviewers to decide whether | ||
| such refactoring should be done during the CIR development, or if it is | ||
| sufficient to leave a comment in the code indicating this as an opportunity for | ||
| future improvement. Because much of the current code is likely to change when | ||
| the long-term code sharing strategy is complete, we will lean towards only | ||
| implementing refactorings that make sense independent of the code sharing | ||
| problem. | ||
|
|
||
| We have discussed various ways that major classes such as CGCXXABI/CIRGenCXXABI | ||
| could be refactored to allow parts of their implementation to be shared today | ||
| through inheritance and templated base classes. However, this may prove to be | ||
| wasted effort when the permanent solution is developed, so we have decided that | ||
| it is better to accept significant amounts of code duplication now, and defer | ||
| this type of refactoring until it is clear what the permanent solution will be. | ||
|
|
||
| Mitigation Through Testing | ||
| ========================== | ||
|
|
||
| The most important tactic that we are using to mitigate the risk of CIR diverging | ||
| from classic codegen is to incorporate two sets of LLVM IR checks in the CIR | ||
| codegen LIT tests. One set checks the LLVM IR that is produced by first | ||
| generating CIR and then lowering that to LLVM IR. Another set checks the LLVM IR | ||
| that is produced directly by the classic codegen. | ||
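As a rough illustration of the two sets of LLVM IR checks described above, a CIR codegen test might be laid out like the sketch below. The function ``add``, the target triple, and the ``CIR``, ``LLVM``, and ``OGCG`` check prefixes are assumptions chosen for illustration, and the check lines are deliberately loose rather than copied from an existing test.

```cpp
// Path 1: emit CIR and check it directly.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-cir %s -o %t.cir
// RUN: FileCheck --check-prefix=CIR --input-file=%t.cir %s
// Path 2: emit CIR, lower it to LLVM IR, and check the result.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -fclangir -emit-llvm %s -o %t-cir.ll
// RUN: FileCheck --check-prefix=LLVM --input-file=%t-cir.ll %s
// Path 3: emit LLVM IR through the classic codegen and check it.
// RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o %t.ll
// RUN: FileCheck --check-prefix=OGCG --input-file=%t.ll %s

int add(int a, int b) { return a + b; }

// CIR: cir.func{{.*}} @_Z3addii

// LLVM: define{{.*}} i32 @_Z3addii(i32 {{.*}}, i32 {{.*}})

// OGCG: define{{.*}} i32 @_Z3addii(i32 {{.*}}, i32 {{.*}})
```

Keeping the ``LLVM`` and ``OGCG`` checks side by side in one file makes a divergence between the two paths visible in review, and a change in classic codegen output fails the same test that covers the CIR path.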
|
|
||
| At the time that tests are created, we compare the LLVM IR output from these two | ||
| paths to verify (manually) that any meaningful differences between them are the | ||
| result of known missing features in the current CIR implementation. Whenever | ||
| possible, differences are corrected in the same PR in which the test is added, | ||
| updating the CIR implementation as it is being developed. | ||
|
|
||
| These tests also serve a second purpose: they act as sentinels that alert us | ||
| to changes in the classic codegen behavior that will need to be | ||
| accounted for in the CIR implementation. While we appreciate any help from | ||
| developers contributing to classic codegen, our current expectation is that it | ||
| will be the responsibility of the ClangIR contributors to update the CIR | ||
| implementation when these tests fail. | ||
|
|
||
| As the CIR implementation gets closer to the goal of IR that is semantically | ||
| equivalent to the LLVM IR produced by the classic codegen, we would like to | ||
| enhance the CIR tests to perform some automatic verification of the equivalence | ||
| of the generated LLVM IR, perhaps using a tool such as Alive2. | ||
|
Member
"perhaps using a tool such as Alive2." -> "perhaps using a mix of tools such as |
||
|
|
||
| Eventually, we would like to be able to run all existing classic codegen tests | ||
| using the CIR path as well. | ||
|
|
||
| Other Considerations | ||
| ==================== | ||
|
|
||
| The close modeling of CIR after classic codegen has also meant that the CIR | ||
| dialect often represents language details at a much lower level than it ideally | ||
| should. | ||
|
|
||
| In the interest of having a complete working implementation of ClangIR as soon | ||
| as is practical, we have chosen to take the approach of following the classic | ||
| codegen implementation closely in the initial implementation and only raising | ||
| the representation in the CIR dialect to a higher level when there is a clear | ||
| and immediate benefit to doing so. | ||
|
|
||
| Over time, we expect to progressively raise the CIR representation to a higher | ||
| level and remove low-level details, including ABI-specific handling, from the | ||
| dialect. However, having a working implementation in place makes it easier to | ||
| verify that the high level representation and subsequent lowering are correct. | ||
|
Member
Dumping my mental model as feedback, this is what I'd write:
|
||
|
|
||
| Mixing With Other Dialects | ||
| ========================== | ||
|
|
||
| Mixing of dialects is a central design feature of MLIR. The CIR dialect is | ||
| currently more self-contained than most dialects, but even now we generate | ||
| the ACC (OpenACC) dialect in combination with CIR, and when support for OpenMP | ||
| and CUDA is added, similar mixing will occur. | ||
|
|
||
| We also expect CIR to be at least partially lowered to other dialects during | ||
| the optimization phase to enable features such as data dependence analysis, even | ||
| if we will eventually be lowering it to LLVM IR. | ||
|
|
||
| Therefore, any plan for generating LLVM IR from CIR must be integrated with the | ||
| general MLIR lowering design, which typically involves lowering to the LLVM | ||
| dialect, which is then transformed to LLVM IR. | ||
|
|
||
| Other Consumers of CIR and MLIR | ||
| =============================== | ||
|
|
||
| We must also consider that we will not always be lowering CIR to LLVM IR. CIR, | ||
| usually mixed with other dialects, will also be directed to offload targets | ||
| and other code generators through interfaces that are opaque to Clang. We must | ||
| still produce semantically correct CIR for these consumers. | ||
|
|
||
| Long Term Vision | ||
| ================ | ||
|
|
||
| As the CIR implementation matures, we will eliminate target-specific handling | ||
| from the high-level CIR generated by Clang. The high-level CIR will then be | ||
| progressively lowered to a form that is closer to LLVM IR, including a pass | ||
| that inserts ABI-specific handling, potentially representing the target-specific | ||
| details in another dialect. | ||
|
Member
This is very similar to the paragraph I added the brain dump above, maybe we just need it in one place? |
||
|
|
||
| As we raise CIR to this higher level implementation, there will naturally be | ||
| less code duplication, and less need to have the same logic repeated in the | ||
| CIR generation. | ||
|
|
||
| We will continue to use the same basic design and structure for CIR code | ||
| generation, with classes like CIRGenModule and CIRGenFunction that serve the | ||
| same purpose as their counterparts in classic codegen, but the handling there | ||
| will be more closely tied to core semantics and therefore less likely to require | ||
| frequent changes to stay in sync with classic codegen. | ||
|
|
||
| As the handling of low-level details is moved to later lowering phases, we will | ||
| need to move away from the current tight coupling of the CIR and classic codegen | ||
| implementations. As this happens, we will look for ways that this handling can | ||
| be moved to new classes that are specifically designed to be shared among | ||
| clients that are targeting different IR substrates. That is, rather than trying | ||
| to overlay reuse onto the existing implementations, we will replace relevant | ||
| parts of the existing implementation, piece by piece, as appropriate, with new | ||
| implementations that perform the same function but with a more general design. | ||
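One way to picture a class "designed to be shared among clients that are targeting different IR substrates" is a helper written against a small builder interface instead of a concrete IR. The sketch below is purely illustrative: ``emitComplexMul``, ``ComplexValue``, ``EvalBuilder``, and ``TextBuilder`` are hypothetical names, not existing Clang or MLIR APIs, and the expansion is the textbook formula without the NaN and infinity handling that real codegen performs.

```cpp
#include <string>

// Hypothetical value pair; the element type comes from the builder in use.
template <typename Builder>
struct ComplexValue {
  typename Builder::Value real, imag;
};

// Substrate-independent expansion of (a + bi) * (c + di).
// It only calls operations supplied by the builder, so the same logic
// can serve any client that provides add, sub, and mul.
template <typename Builder>
ComplexValue<Builder> emitComplexMul(Builder &b, ComplexValue<Builder> lhs,
                                     ComplexValue<Builder> rhs) {
  auto ac = b.mul(lhs.real, rhs.real);
  auto bd = b.mul(lhs.imag, rhs.imag);
  auto ad = b.mul(lhs.real, rhs.imag);
  auto bc = b.mul(lhs.imag, rhs.real);
  return {b.sub(ac, bd), b.add(ad, bc)};
}

// Toy client 1: evaluates the operations directly on doubles.
struct EvalBuilder {
  using Value = double;
  Value add(Value x, Value y) { return x + y; }
  Value sub(Value x, Value y) { return x - y; }
  Value mul(Value x, Value y) { return x * y; }
};

// Toy client 2: records the operations as text, standing in for an
// IR builder that would create instructions or operations.
struct TextBuilder {
  using Value = std::string;
  Value add(const Value &x, const Value &y) { return "(" + x + " + " + y + ")"; }
  Value sub(const Value &x, const Value &y) { return "(" + x + " - " + y + ")"; }
  Value mul(const Value &x, const Value &y) { return "(" + x + " * " + y + ")"; }
};
```

The two toy builders stand in for the classic codegen and CIR clients; the point of the sketch is only that the expansion logic is written once and is unaware of which substrate it is emitting into.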
|
|
||
| Example: C Calling Convention Handling | ||
| ====================================== | ||
|
|
||
| C calling convention handling is an example of a general purpose redesign that | ||
| is already underway. This was started independently of CIR, but it will be | ||
| directly useful for lowering from high-level call representation in CIR to a | ||
| representation that includes the target- and calling convention-specific details | ||
| of function signatures, parameter type coercion, and so on. | ||
|
|
||
| The current CIR implementation duplicates most of the classic codegen's | ||
| function call handling, but it omits several pieces that handle type | ||
| coercion. This leads to an implementation that has all of the complexity of the | ||
| classic codegen without actually achieving the goals of that complexity. It will | ||
| be a significant improvement to the CIR implementation to simplify the function | ||
| call handling in such a way that it generates a high-level representation of the | ||
| call, while preserving all information that will be needed to lower the call to | ||
| an ABI-compliant representation in a later phase of compilation. | ||
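The split between a high-level call representation and a later ABI lowering phase can be sketched roughly as follows. Everything here is hypothetical: the type names, ``classifyArgument``, and ``lowerCallArguments`` are invented for illustration, and the classification rule is a toy that does not correspond to any real ABI or to the calling convention library mentioned in this document.

```cpp
#include <string>
#include <vector>

// Source-level view of an argument, as a high-level call might record it.
struct HighLevelType {
  std::string name;      // e.g. "i32" or "struct Pair"
  unsigned sizeInBytes;  // size preserved for the later lowering phase
  bool isAggregate;
};

enum class PassKind { Direct, CoercedToInteger, Indirect };

struct LoweredArg {
  PassKind kind;
  std::string loweredType;
};

// Toy target rule (an assumption, NOT a real ABI): scalars pass directly,
// aggregates of at most 8 bytes are coerced to an integer of the same
// width, and larger aggregates are passed through a pointer.
LoweredArg classifyArgument(const HighLevelType &ty) {
  if (!ty.isAggregate)
    return {PassKind::Direct, ty.name};
  if (ty.sizeInBytes <= 8)
    return {PassKind::CoercedToInteger,
            "i" + std::to_string(ty.sizeInBytes * 8)};
  return {PassKind::Indirect, "ptr"};
}

// Later lowering phase: rewrites the high-level argument list using the
// target rule, without the front end having made any ABI decisions.
std::vector<LoweredArg>
lowerCallArguments(const std::vector<HighLevelType> &args) {
  std::vector<LoweredArg> lowered;
  lowered.reserve(args.size());
  for (const HighLevelType &ty : args)
    lowered.push_back(classifyArgument(ty));
  return lowered;
}
```

In this arrangement the code that generates the call only needs to preserve enough type information, and all coercion decisions live in one later phase.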
|
|
||
| This provides a clear example where trying to refactor the classic codegen in | ||
| some way to be reused by CIR would have been counterproductive. The classic | ||
| codegen implementation was tightly coupled with Clang's LLVM IR generation. The | ||
| implementation is being completely redesigned to allow general reuse, not just by | ||
| CIR, but also by other front ends. | ||
|
|
||
| The CIR calling convention lowering will make use of the general purpose C | ||
| calling convention library that is being created, but it should create an MLIR | ||
| transform pass on top of that library that is general enough to be used by other | ||
| dialects, such as FIR, that also need the same calling convention handling. | ||
|
|
||
| Significant Areas For Improvement | ||
| ================================= | ||
|
|
||
| The following list enumerates some of the areas where significant restructuring | ||
| of the code is needed to enable better code sharing between CIR and classic | ||
| codegen. Each of these areas is relatively self-contained in the codegen | ||
| implementation, making the path to a shared implementation relatively clear. | ||
|
|
||
| - Constant expression evaluation | ||
| - Complex multiplication and division expansion | ||
| - Builtin function handling | ||
| - Exception Handling and C++ Cleanups | ||
| - Inline assembly handling | ||
| - C++ ABI Handling | ||
|
|
||
|   - VTable generation | ||
|   - Virtual function calls | ||
|   - Constructor and destructor arguments | ||
|   - Dynamic casts | ||
|   - Base class address calculation | ||
|   - Type descriptors | ||
|   - Array new and delete | ||
|
|
||
| Pervasive Low-Level Issues | ||
| ========================== | ||
|
|
||
| This section lists some of the features where a non-trivial amount of code | ||
| is shared between CIR and classic codegen, but the handling of the feature | ||
| is distributed across the codegen implementation, making it more difficult | ||
| to design an abstraction that can easily be shared. | ||
|
|
||
| - Global variable and function linkage | ||
| - Alignment management | ||
| - Debug information | ||
| - TBAA handling | ||
| - Sanitizer integration | ||
| - Lifetime markers | ||
Review comment: It may be worth noting that duplicating the code also makes removal of CIR less risky should CIR fail to pan out long-term. I know this isn't an expected (or desired) outcome, but the less interdependence we have between CIR and current IR gen, the easier it is to make those kinds of significant decisions.
Reply: Good point!