Hacking CPython: Syntax
This post is the first in a series on adding functionality to the CPython interpreter. In it, we implement the almost-equal operator described in CPython Internals. In this first post, we update the CPython tokenizer, parser, and abstract-syntax tree to support our new feature.
Motivation
CPython Internals by Anthony Shaw and the Real Python team includes some fun tutorials that walk through modifying the CPython interpreter to support new language constructs. This series of posts does not reproduce all of the information in the book. Instead, I hope to augment the information that Shaw provides with my own thoughts and observations.
At the time of writing, I am a complete newcomer to the CPython source. Accordingly, many of my observations may appear obvious to those more familiar with the codebase.
Expected Behavior
In this series, we implement an almost-equal operator in CPython. This operator, denoted by ~=
, is capable of performing equality comparisons between integer (int
) and floating-point (float
) types. When performing a comparison between an int
and a float
, the float
argument is implicitly converted to an int
before the comparison is performed. With two int
operands, the operator behaves identically to equality comparison with ==
.
As described in the book, the almost-equal operator should behave as follows:
>>> 1 ~= 1
True
>>> 1 ~= 1.0
True
>>> 1 ~= 1.1
True
>>> 1 ~= 1.9
True
Adding a Token
Instead of beginning with updating the Python grammar for our almost-equal operator as Shaw does in the book, I elect to start with updating the tokenizer because I believe it represents the simplest possible modification to the Python syntax. All that is required to update the tokenizer is the addition of a token for ~=
in the tokens file (Grammar/Tokens
):
...
COLONEQUAL ':='
ALMOSTEQUAL '~='
With the tokens updated, we can regenerate the tokenizer:
make regen-token
After regenerating, we see some modifications to the tokenizer in Parser/token.c
. First, an ALMOSTEQUAL
token name is added to the global array:
const char * const _PyParser_TokenNames[] = {
...
"ALMOSTEQUAL",
Second, a rule to tokenize this syntax is added to the PyToken_TwoChars()
function:
int
PyToken_TwoChars(int c1, int c2)
{
switch (c1) {
...
case '~':
switch (c2) {
case '=': return ALMOSTEQUAL;
}
break;
The tokenizer is now prepared to tokenize our almost-equal operator. Next, we must modify the parser so that it knows what to do when it encounters this token.
Aside: The Grammar File
The Python grammar file (python.gram
) reflects the transition to the new Parsing Expression Grammar (PEG) parser for Python. This parser replaced the previous LL(1) parser. PEP 617 has more details on the parser transition.
In implementing the almost-equal operator, we modify the compare_op_bitwise_or_pair
expression.
compare_op_bitwise_or_pair[CmpopExprPair*]:
| eq_bitwise_or
| noteq_bitwise_or
| lte_bitwise_or
| lt_bitwise_or
| gte_bitwise_or
| gt_bitwise_or
| notin_bitwise_or
| in_bitwise_or
| isnot_bitwise_or
| is_bitwise_or
compare_op_bitwise_or_pair
is the name of the rule. This merely provides a way to reference this production in other rules of the grammar (and also identifies it to human readers).
Immediately following the rule is [CmpopExprPair*]
; this is the return type of the C or Python function that corresponds to the rule. The inclusion of this return type is optional; if it is omitted, void*
is inferred for C functions and Any
is inferred for Python functions.
Finally, following the :
are the rule’s expressions. When separated by the |
symbol, these expressions are checked for a match against the syntax of the program in order - this is is a departure from the manner in which the LL(1) grammar was defined. In the case of the compare_op_bitwise_or_pair
rule, all of these expressions are references to other rules in the grammar, rather than concrete pieces of syntax. If we follow one of these rules, we can determine more about how the grammar works. For instance, the first rule in the list of expressions for compare_op_bitwise_or_pair
is eq_bitwise_or
:
eq_bitwise_or[CmpopExprPair*]: '==' a=bitwise_or { _PyPegen_cmpop_expr_pair(p, Eq, a) }
In this rule, we see some additional features of the Python grammar. The first expression that follows the rule name is '=='
, meaning that this rule will match against a literal ==
symbol in the source. Following this, we see an example of a grammar variable: a=bitwise_or
. This is a bitwise_or
expression whose matched syntax is assigned a name in the expression list such that it can be referred to later in a grammar action. In this example, { _PyPegen_cmpop_expr_pair(p, Eq, a) }
is a grammar action that allows us to specify how an AST node is generated for the matched syntax directly within the grammar definition.
As a final note, you might be wondering why we match against a bitwise_or
operator as the second expression in the eq_bitwise_or
rule - why is this not merely the second operand of the ==
expression? For instance, if we are performing a comparison between two variables:
a: int = 1
b: int = 2
if a == b:
pass
We would expect this second expression to be a NAME
that we can match against the literal b
syntax that we observe in the program. However, the bitwise_or
expression is necessary because the Python syntax must be sufficiently general to allow for complex constructions. For example:
if a == b | c & d + 1:
pass
Python must be able to parse the expression on the right of the ==
operator and resolve it to a value (at runtime) for comparison against the left operand. If we follow the chain of productions from the bitwise_or
rule all the way down to atom
, we find the following expressions:
atom[expr_ty]:
| NAME
| 'True' { _Py_Constant(Py_True, NULL, EXTRA) }
| 'False' { _Py_Constant(Py_False, NULL, EXTRA) }
| 'None' { _Py_Constant(Py_None, NULL, EXTRA) }
| '__peg_parser__' { RAISE_SYNTAX_ERROR("You found it!") }
| &STRING strings
| NUMBER
| &'(' (tuple | group | genexp)
| &'[' (list | listcomp)
| &'{' (dict | set | dictcomp | setcomp)
| '...' { _Py_Constant(Py_Ellipsis, NULL, EXTRA) }
Here we see some literal values like NAME
and NUMBER
to match against concrete syntax in the program.
Updating the Grammar
Now that we understand the syntax of the Python grammar in python.gram
, we can modify it to support our almost-equal operator. First, we add a new expression for ale_bitwise_or
under the compare_op_bitwise_or_pair
rule:
compare_op_bitwise_or_pair[CmpopExprPair*]:
| eq_bitwise_or
...
| ale_bitwise_or
Next, we define the rule for this expression:
ale_bitwise_or[CmpopExprPair*]: '~=' a=bitwise_or { _PyPegen_cmpop_expr_pair(p, AlE, a) }
Here we see the same familiar features that were described above:
~=
is the syntax for our almost-equal operatora=bitwise_or
is a named expression for the right operand of our operator{ _PyPegen_cmpop_expr_pair(p, AlE, a) }
defines the function that generates an AST node for our operator
With that, we can regenerate the parser:
make regen-pegen
Regenerating the parser updates the parser implementation in Parser/pegen/parse.c
. In particular, we see the following updates to the compare_op_bitwise_or_pair_rule()
function:
static CmpopExprPair*
compare_op_bitwise_or_pair_rule(Parser *p) {
...
{ // ale_bitwise_or
if (p->error_indicator) {
p->level--;
return NULL;
}
D(fprintf(stderr, "%*c> compare_op_bitwise_or_pair[%d-%d]: %s\n", p->level, ' ', _mark, p->mark, "ale_bitwise_or"));
CmpopExprPair* ale_bitwise_or_var;
if (
(ale_bitwise_or_var = ale_bitwise_or_rule(p)) // ale_bitwise_or
)
{
D(fprintf(stderr, "%*c+ compare_op_bitwise_or_pair[%d-%d]: %s succeeded!\n", p->level, ' ', _mark, p->mark, "ale_bitwise_or"));
_res = ale_bitwise_or_var;
goto done;
}
p->mark = _mark;
D(fprintf(stderr, "%*c%s compare_op_bitwise_or_pair[%d-%d]: %s failed!\n", p->level, ' ',
p->error_indicator ? "ERROR!" : "-", _mark, p->mark, "ale_bitwise_or"));
}
}
Most of this is just error-handling boilerplate; the important line is:
if (
(ale_bitwise_or_var = ale_bitwise_or_rule(p)) // ale_bitwise_or
)
where a new function, ale_bitwise_or_rule()
, is called to determine if the current state of the parser can match against this rule:
// ale_bitwise_or: '~=' bitwise_or
static CmpopExprPair*
ale_bitwise_or_rule(Parser *p)
{
if (p->level++ == MAXSTACK) {
p->error_indicator = 1;
PyErr_NoMemory();
}
if (p->error_indicator) {
p->level--;
return NULL;
}
CmpopExprPair* _res = NULL;
int _mark = p->mark;
{ // '~=' bitwise_or
if (p->error_indicator) {
p->level--;
return NULL;
}
D(fprintf(stderr, "%*c> ale_bitwise_or[%d-%d]: %s\n", p->level, ' ', _mark, p->mark, "'~=' bitwise_or"));
Token * _literal;
expr_ty a;
if (
(_literal = _PyPegen_expect_token(p, 54)) // token='~='
&&
(a = bitwise_or_rule(p)) // bitwise_or
)
{
D(fprintf(stderr, "%*c+ ale_bitwise_or[%d-%d]: %s succeeded!\n", p->level, ' ', _mark, p->mark, "'~=' bitwise_or"));
_res = _PyPegen_cmpop_expr_pair ( p , AlE , a );
if (_res == NULL && PyErr_Occurred()) {
p->error_indicator = 1;
p->level--;
return NULL;
}
goto done;
}
p->mark = _mark;
D(fprintf(stderr, "%*c%s ale_bitwise_or[%d-%d]: %s failed!\n", p->level, ' ',
p->error_indicator ? "ERROR!" : "-", _mark, p->mark, "'~=' bitwise_or"));
}
_res = NULL;
done:
p->level--;
return _res;
}
Again, most of this function is boilerplate; the important piece is the conditional where the parser checks if it can match am ALMOSTEQUAL
token and a bitwise_or
expression. If this succeeds, the almost-equal expression is successfully parsed, and _PyPegen_cmpop_expr_pair()
is called to construct the return value for the function.
Updating the AST
If we attempt to recompile CPython now, it will fail with the following message:
Parser/pegen/parse.c: In function ‘ale_bitwise_or_rule’:
Parser/pegen/parse.c:9313:51: error: ‘AlE’ undeclared (first use in this function)
9313 | _res = _PyPegen_cmpop_expr_pair ( p , AlE , a );
The parser is capable of parsing our operator, but it fails when we attempt to generate an AST node.
CPython uses ASDL to generate an AST programmatically from a concise definition. Abstract Syntax Definition Language (ASDL) is a language-agnostic tool for defining and generating the code to construct an AST. More information is available on its website.
If we look in Parser/Python.asdl
we can find the cmpop
enumeration that describes all of the valid comparison operator AST-node types:
cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn
We can fix the compilation issue by adding an AlE
type to this enumeration:
cmpop = Eq | NotEq | Lt | LtE | Gt | GtE | Is | IsNot | In | NotIn | AlE
make regen-ast
With the AST regenerated, we can successfully compile CPython, but if we attempt to use our almost-equal operator we are met with a nasty error:
>>> 1 ~= 1
Fatal Python error: compiler_addcompare: We've reached an unreachable state. Anything is possible.
The limits were in our heads all along. Follow your dreams.
https://xkcd.com/2200
Python runtime state: initialized
Current thread 0x00007fddeab30280 (most recent call first):
<no Python frame>
[1] 127061 abort (core dumped) ./python
This error is raised because we still need to associate the ALMOSTEQUAL
token with the AlE
comparison type. Without this, the AST does not have the knowledge that the ALMOSTEQUAL
token corresponds to the AlE
comparison. We remedy this by modifying ast_for_comp_op()
in Python/ast.c
:
static cmpop_ty
ast_for_comp_op(struct compiling *c, const node *n)
{
/* comp_op: '<'|'>'|'=='|'>='|'<='|'!='|'in'|'not' 'in'|'is'
|'is' 'not'
*/
REQ(n, comp_op);
if (NCH(n) == 1) {
n = CHILD(n, 0);
switch (TYPE(n)) {
case LESS:
return Lt;
case GREATER:
return Gt;
case ALMOSTEQUAL:
return AlE;
Once this is resolved, we can successfully recompile CPython. Furthermore, we now have all of the additions we need to parse our almost-equal operator:
>>> import ast
>>> m = ast.parse('1 ~= 2')
>>> m.body[0].value.ops[0]
<ast.AlE object at 0x7f7188233eb0>
This snippet demonstrates that we can successfully produce a well-formed AST from an expression that contains our new operator. However, if we attempt to execute code that contains almost-equals, we encounter the same “Fatal Python error” that we saw previously. This is because nothing beyond the parser and AST, namely the compiler, knows how to handle our new expression. This is the topic of the next post.