Saturday, May 12, 2018

The origins of pgen

David Beazley's talk at US PyCon 2018, about parser generators, reminded me I should write a bit about the history of pgen. Here's a brief brain dump (maybe I'll expand later).

There are actually two pgens. The original, in C, and a rewrite in Python, under lib2to3/pgen2.

I wrote both. The original was actually the first code I wrote for Python. Although technically I had to write the lexer first (pgen and Python share their lexer, pgen just doesn't have any use for most tokens).

The reason for writing my own parser generator was that at the time the landscape of parser generators (that I was familiar with) was pretty sparse -- basically there was Yacc (the GNU rewrite named Bison existed, but I don't know if I knew about it at the time); or you could write a parser by hand (which is what most people did). I had used Yacc in college and was familiar with how it worked from the Dragon book, but for some reason I didn't like it much; IIRC I found it hard to reason about what the limitations on an LALR(1) grammar would be. I was also familiar with LL(1) parsers and had written some serious recursive-descent LL(1) parsers by hand -- I liked that fine, but I was also familiar (again from the Dragon book) with LL(1) parser generation techniques, and I had one improvement over those that I wanted to try out: using regular expressions (sort of) instead of the standard BNF format. The Dragon book had also taught me how to turn regular expressions into DFAs, so I combined all that and pgen was born. [Update: see below for a slightly different version of the reason.]

I was not familiar with more advanced techniques, or I deemed them too inefficient. (I think that was true of most people working on parsers at the time.)

For the lexer I decided I would not use a generator -- I had a much lower opinion of Lex than of Yacc, since the version of Lex that I was familiar with would just segfault when trying to scan a token longer than 255 bytes (really!). Also I figured that indentation would be hard to teach to a lexer generator.

The story of pgen2 is quite different: I was employed by a startup in San Mateo (Elemental Security, it died in 2007, after I had left and joined Google) where I was given the task to design a custom language (for security assertions about system configuration), and was given pretty much free rein. I decided to design something vaguely like Python, implemented in Python, and decided that I wanted to reuse pgen, but with a backend in Python, using tokenize.py as the lexer. So I just rewrote the exact algorithm from pgen in Python and then moved on to build the rest of that language. Open-sourcing the tooling made sense to management so they quickly approved that, and some time later (I had probably moved on to Google by then?) it made sense to use those tools for 2to3. (It was easy to generate a Python parser with it because the input format was identical to the original pgen -- I just had to feed the Grammar file into the tool. :-)

Update: more color behind my reason for creating pgen

I don't quite recall why I did things this way but I just peeked at https://en.wikipedia.org/wiki/LL_parser#Conflicts and I may have seen this as a new (to me) way of addressing the various conflicts that may occur without adding helper rules. E.g. what that page calls left-factoring (replace A -> X | X Y Z with A -> X B; B -> Y Z | <empty>) I would rewrite instead as A -> X [Y Z]. IIRC the parsing engine (function syntacticAnalysis from earlier on that page) still works with parsing tables derived from such rules through the regex -> NFA -> DFA transformation; I think the restriction against empty productions was needed to make this work.

The other thing I came up with was that the parse tree nodes produced by the parsing engine would have a variable number of children, e.g. for the rule A -> X [Y Z] above the node for A would have either 1 child (X) or 3 (X Y Z). A simple check in the code generator would then be needed to determine which alternative the parser had encountered. (This has turned out to be a double-edged sword, and later we added a parse tree -> AST step driven by a separate generator to simplify the byte code generator.)
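As a sketch (the names here are hypothetical, not the actual pgen or CPython source), the variable-children idea and the code generator's check might look something like this:

```python
# Hypothetical sketch (not the real pgen code): a parse tree node with
# a variable number of children, for the rule A -> X [Y Z].
class Node:
    def __init__(self, type_, children):
        self.type = type_
        self.children = children

def gen_A(node):
    # The code generator decides which alternative the parser took
    # simply by counting the node's children.
    if len(node.children) == 1:
        (x,) = node.children
        return f"A({x.type})"
    else:
        x, y, z = node.children
        return f"A({x.type}, {y.type}, {z.type})"

short = Node("A", [Node("X", [])])
full = Node("A", [Node("X", []), Node("Y", []), Node("Z", [])])
print(gen_A(short))  # A(X)
print(gen_A(full))   # A(X, Y, Z)
```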

So the reason I used regular expressions was probably to make the grammar easier to read: after resolving the conflicts using the needed rewrites I found the grammar not all that easy to read (insert Zen of Python reference here :-) hence the regular expressions -- no helper rules with awkward names, and [optional] parts or repeated* parts are closer to how I think about a typical language's syntax. It certainly did not change the power beyond LL(1), nor did it reduce the power. And yes by "regular expressions" I meant the EBNF conventions -- I'm not sure that at the time "EBNF" was a well-defined notation, it might just have meant any extension to BNF.

Converting EBNF to BNF and taking it from there would cause awkward extra parse tree node types, so I don't think that would have been an improvement.

If I had to do it all over again I might opt for a more powerful parsing engine, perhaps a version of LALR(1) (i.e. Yacc/Bison). There are several areas of the grammar where something more powerful than LL(1) would be helpful, e.g. keyword args: the rule 'arg: [NAME =] expr' is not a valid rule, since NAME occurs in the FIRST-set of expr and the algorithm doesn't handle that. IIRC LALR(1) can handle it. But keyword args were several years in the future when I wrote the first version of pgen, and I didn't want to redo the parser at that point.
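To make the LL(1) limitation concrete, here is a toy illustration (the grammar fragment is drastically simplified and the set names are mine, not pgen's) of why 'arg: [NAME =] expr' fails:

```python
# Simplified illustration (not pgen's actual code). With the toy grammar
#   arg  -> [NAME '='] expr
#   expr -> NAME | NUMBER
# one token of lookahead cannot tell whether a NAME starts the optional
# "NAME '='" prefix or is itself the expr.
FIRST_expr = {"NAME", "NUMBER"}
FIRST_optional_prefix = {"NAME"}  # the [NAME '='] part starts with NAME

conflict = FIRST_optional_prefix & FIRST_expr
print(sorted(conflict))  # ['NAME'] -- the overlap an LL(1) generator rejects
```

An LALR(1) parser avoids the problem because it can defer the decision until it has seen (or not seen) the '=' token.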

Monday, November 11, 2013

The history of bool, True and False

Writing up the reasons why True and False, when introduced, weren't reserved words, I realized there's another interesting lesson in the history of Python's bool type. It was formally introduced in Python 2.3, as a new type with two constants, and the type was introduced in PEP 285 ("Adding a bool type").

But bool, True and False were also introduced in Python 2.2.1 (a bugfix release!). The Misc/NEWS file said:
What's New in Python 2.2.1 final?
Release date: 10-Apr-2002
=================================

Core and builtins

- Added new builtin function bool() and new builtin constants True and
  False to ease backporting of code developed for Python 2.3.  In 2.2,
  bool() returns 1 or 0, True == 1, and False == 0.

This was the last (and the most criticized) time we added a new feature in a bugfix release -- we'd never do that these days. Also note that bool/True/False in 2.2.1 were different from 2.3: in 2.3, bool is a new type; in 2.2.1, bool() is a built-in function and the constants are just ints.
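The 2.3 design is still visible in any modern Python: bool is a real type, but it subclasses int, so the 2.2.1-era equalities still hold:

```python
# bool is a distinct type that subclasses int, so True and False
# still compare equal to 1 and 0.
print(type(True))             # <class 'bool'>
print(isinstance(True, int))  # True
print(True == 1, False == 0)  # True True
print(True + True)            # 2 -- bools participate in int arithmetic
```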

The chronology is also interesting: the proper new bool type was introduced in 2.3a1, released on Dec 31 2002, well after the above-mentioned 2.2.1 release. And the final 2.3 release didn't come out until July 29 2003. And yet, the above comment talks about backporting from 2.3 to 2.2.1. PEP 285 was created on March 8, 2002, accepted on April 3, and declared final on April 11 (i.e. after Python 2.2.1 was released). I'm assuming that by then the proper bool implementation had landed in the 2.3 branch. That's a breakneck pace compared to how we do things a decade later!

Sunday, November 10, 2013

The story of None, True and False (and an explanation of literals, keywords and builtins thrown in)

I received an interesting question in the mail recently:
What is the difference between keywords and literals? Why are True and False keywords rather than literals in python3?
I was horrified recently to find that assigning to True/False works in python2. So I went digging, and found that True and False were created to be 'constants' like None in PEP 285. Assignment to None was disallowed in 2.4, but not to True/False until python3. Was there a reason None was originally built as a variable rather than a literal?
Let's start with the first question: keywords and literals.

A keyword, in the context of defining the syntax of a language, also known as a reserved word, is something that looks like an identifier in the language, but from the parser's point of view acts like a token of the language. An identifier is defined as a sequence of one or more letters, digits and underscores, not starting with a digit. (This is Python's definition, but many languages, like C or Java, use the same or a very similar definition.)

The important thing to remember about keywords is that a keyword cannot be used to name a variable (or function, class, etc.). Some well-known keywords in Python include 'if', 'while', 'for', 'and', 'or'.

A literal, on the other hand, is an element of an expression that describes a constant value. Examples of literals are numbers (e.g. 42, 3.14, or 1.6e-10) and strings (e.g. "Hello, world"). Literals are recognized by the parser, and the exact rules for how literals are parsed are often quite subtle. For example, these are all numeric literals in Python 3:
123
1.0
1.
.01e10
.1e+42
123.456e-100
0xfffe
0o755
but these are not:
. (dot)
e10 (identifier)
0y12 (the literal 0 followed by the identifier y12)
0xffe+10 (the literal 0xffe followed by a plus sign and the number 10)
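You can watch the tokenizer make these decisions with the standard tokenize module; for example, 0xffe+10 really does come out as two number tokens around an operator:

```python
import io
import tokenize

def toks(source):
    """Return (token-name, text) pairs for a one-line expression."""
    return [
        (tokenize.tok_name[t.type], t.string)
        for t in tokenize.generate_tokens(io.StringIO(source).readline)
        if tokenize.tok_name[t.type] not in ("NEWLINE", "ENDMARKER")
    ]

print(toks(".1e+42"))    # [('NUMBER', '.1e+42')] -- a single literal
print(toks("0xffe+10"))  # [('NUMBER', '0xffe'), ('OP', '+'), ('NUMBER', '10')]
```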
Note the distinction between a constant and a literal. We often write code defining "constants", e.g.
MAX_LEVELS = 15
Here, 15 is a literal, but MAX_LEVELS is not -- it is an identifier, and the all-caps form of the name suggests to the reader that it is probably not changed anywhere in the code, which means that we can consider it a constant -- but this is just a convention, and the Python parser doesn't know about that convention, nor does it enforce it.

On the other hand, the parser won't let you write
15 = MAX_LEVELS
This is because the left-hand side of the assignment operator (=) must be a variable, and a literal is not a variable. (The exact definition of variable is complex, since some things that look like expressions are also considered to be variables, such as d[k], (a, b), and foo.bar -- but not f() or () or 42. This definition of variable is also used by the "del" statement.)

Now on to None, True and False.

Let's begin with None, because it has always been in the language. (True and False were relatively recent additions -- they first made their appearance in Python 2.2.1, to be precise.) None is a singleton object (meaning there is only one None), used in many places in the language and library to represent the absence of some other value. For example, if d is a dictionary, d.get(k) will return d[k] if it exists, but None if d has no key k. In earlier versions of Python, None was just a "built-in name". The parser had no special knowledge of None -- just like it doesn't have special knowledge of built-in types like int, float or str, or built-in exceptions like KeyError or ZeroDivisionError. All of these are treated by the parser as identifiers, and when your code is being interpreted they are looked up just like any other names (e.g. the functions and variables you define yourself). So from the parser's perspective, the following are treated the same, and the parse tree it produces (<name> = <name>) is the same in each case:
x = None
x = int
x = foobar
On the other hand, the following produce different parse trees (<name> = <literal>):
x = 42
x = 'hello'
because the parser treats numeric and string literals as different from identifiers. Combining this with the earlier MAX_LEVEL examples, we can see that if we swap the left and right hand sides, the first three will still be accepted by the parser (<name> = <name>), while the swapped version of the second set will be rejected (<literal> = <name> is invalid).
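In today's Python 3 you can see the distinction directly with the ast module -- with the historical caveat that None now shows up as a constant in the parse tree, which is exactly the change this story leads up to:

```python
import ast

def rhs(source):
    """Return the AST node type of the right-hand side of an assignment."""
    tree = ast.parse(source)
    return type(tree.body[0].value).__name__

print(rhs("x = int"))     # Name
print(rhs("x = foobar"))  # Name
print(rhs("x = 42"))      # Constant
print(rhs("x = None"))    # Constant -- None has since become a keyword

# The swapped form is rejected by the parser outright:
try:
    ast.parse("42 = x")
except SyntaxError as e:
    print("rejected:", type(e).__name__)  # rejected: SyntaxError
```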

The practical consequence is that, if you really want to mess with your readers, you can write code that reassigns built-ins; for example, you could write:
int = float
def parse_string(s):
    return int(s)
print(parse_string('42'))    # Will print 42.0
Some of you may respond to this with "So what? Reasonable programmers don't write such code." Others may react in the opposite way, saying "Why on earth does the language allow assignment to a built-in name like 'int' at all?!"

The answer is subtle, and has to do with consistency and evolution of the language. I bet that without looking it up you won't be able to give a complete list of all built-in names defined by Python. (I know I can't.) Moreover, I bet that many of you won't recognize every single name on that list. (To see the list, try typing dir(__builtins__) at the Python command prompt.)

Take for example the weird built-ins named copyright, credits or license. They exist so that we can mention them in the greeting shown when you start Python interactively:
Python 3.4.0a4+ (default:0917f6c62c62, Oct 22 2013, 10:55:35)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> credits
Thanks to CWI, CNRI, BeOpen.com, Zope Corporation and a cast of thousands
for supporting Python development.  See www.python.org for more information.
>>> 
In order for this to work, we made them built-ins. But does this mean you shouldn't be allowed to use 'credits' as a variable or parameter name? I think not. Certainly many people don't realize that these esoteric built-ins even exist, and they would be surprised if they were prevented from using them as variable names. From here, it's just a gradual path. Many people write functions or methods with arguments named str or len, or with names like compile or format. Moreover, suppose you wrote some Python 2.5 code where you used bytes as a variable name. In Python 2.6, we added a built-in function named 'bytes' (it's an alias for str, actually). Should your code now be considered invalid? There's no reason for that, and in fact your code will be fine. (Even in Python 3, where bytes is one of the fundamental types.)

On the other hand, you cannot have a variable named 'if' or 'try', because these are reserved words (keywords) that are treated specially by the parser. Because you cannot use these as variable or function names anywhere, ever, in any Python program, everyone using Python has to know about all the reserved words in the language, even if they don't have any need for them. For this reason, we try to keep the list of reserved words small, and the core developers hem and haw a lot before adding a new reserved word to the language.

In fact, many proposed new features have been killed because they would require a new keyword; others have been modified to avoid that need. Also, when we do decide to add a new keyword, we start a deprecation campaign at least one release before the new keyword is introduced, warning developers to choose a different name for their variables. (There's also a trick to allow developers to choose to use the new keyword right away; this is why we have e.g. "from __future__ import with_statement".)

There's no such concern for built-ins. Code that happens to use the name of a new built-in as a variable or function name will continue to function (as long as you don't also try to use the new built-in in the same function). While we still try to be conservative with the introduction of new built-ins, at least we don't have to worry about breaking working code by merely adding something to the language. The (small) price we pay for this is the possibility that some joker intentionally redefines a built-in just to confuse others. But there are tons of other ways to write unreadable code, and I don't see this as a particularly bad problem.

So, after this long detour about built-ins vs. keywords, back to None. Why did we eventually make None a reserved word? Frankly, the reasons were perhaps mostly social. Unlike some built-ins and many exceptions, None is so central to using Python that you really can't be using Python without knowing about None. So people were (like our question-asker) "horrified" when they found that assignment to None was actually allowed at all. Worse, there was the concern (whether founded or not) that the way name lookup in Python works, "evaluating" the expression None is slow, because it requires at least two dictionary lookups (all names are looked up in the globals dict before being looked up in the built-ins dict).

In the end we decided that there was no downside to making None a keyword (there is no code that actually assigns to it) and it might make some code a tiny bit faster, or catch rare typos. There was still a one-time cost to the developer community (changes to the parser and documentation) but this was small enough that we didn't hesitate very long.

The situation for True/False is a little different. They weren't always part of the language, and many people had invented their own convention. People would define constants named true and false, True and False, or TRUE and FALSE, and use those consistently throughout their code. I don't recall which spelling was most popular, but when we introduced True and False into the language, we definitely did not want to break any packages that were defining their own True and False constants. (One concern was that those packages would have to have a way to continue to run on previous Python versions.)

So, essentially our hand was forced in this case, and we had to introduce True and False as built-in constants, not as keywords. But over time, code defining its own versions of True and False (by whichever name) became more and more frowned upon, and by the time Python 3 came around, when we looked at opportunities for cleaning up the language, we found that it was logical to make True and False keywords, by analogy to None.

And there you have it. It's all completely logical, once you understand the context. :-) Sorry for the long response; I hope it's been educational.

UPDATE: I still forgot to answer whether None/True/False are literals or keywords. My answer is that they are both. They are keywords because that's how the parser recognizes them. They are literals because that's their role in expressions and because they stand for constant values. One could argue about whether things like {'foo': 42} are literals; personally I'd prefer to give these some other name, because otherwise what would you call {'foo': x+1}? The language reference calls both of these "displays".

Thursday, October 24, 2013

Origin of metaclasses in Python

There was some speculation on python-ideas today on whether Python's metaclass design came from Ruby. It did not. And as long as we are speculating about the origins of language features, I feel the need to set the record straight.

I was not inspired by Ruby at that point (or ever :-). Ruby was in fact inspired by Python. Matz once told me that his inspiration was 20% Python, 80% Perl, and that Larry Wall is his hero.
 
I wrote about metaclasses in Python in 1998: http://www.python.org/doc/essays/metaclasses/.
 
New-style classes were just the second or third iteration of the idea.
 
I was inspired to implement new-style classes by a very specific book, "Putting Metaclasses to Work" by Ira Forman and Scott Danforth (http://www.amazon.com/Putting-Metaclasses-Work-Ira-Forman/dp/0201433052).
But even Python's original design (in 1990, published in 1991) had the notion that 'type' was itself an object. The type pointer in any object has always been a pointer to a special object, whose "data" was a bunch of C function pointers implementing the behavior of other objects, similar to a C++ vtable. The type of a type was always a special type object, which you could call a meta-type, to be recognized because it was its own type.
I was only vaguely aware of Smalltalk at the time; I remember being surprised by its use of metaclasses (which is quite different from that in Python or Ruby!) when I read about them much later. Smalltalk's bytecode was a bigger influence on Python's bytecode, though. I'd read about it in a book by Adele Goldberg and others, I believe "Smalltalk-80: The Language and its Implementation" (http://www.amazon.com/Smalltalk-80-The-Language-its-Implementation/dp/0201113716).

Why Python uses 0-based indexing

I was asked on Twitter why Python uses 0-based indexing, with a link to a new (fascinating) post on the subject (http://exple.tive.org/blarg/2013/10/22/citation-needed/). I recall thinking about it a lot; ABC, one of Python's predecessors, used 1-based indexing, while C, the other big influence, used 0-based. My first few programming languages (Algol, Fortran, Pascal) used 1-based or variable-based. I think that one of the issues that helped me decide was slice notation.

Let's first look at use cases. Probably the most common use cases for slicing are "get the first n items" and "get the next n items starting at i" (the first is a special case of the latter for i == the first index). It would be nice if both of these could be expressed without awkward +1 or -1 compensations.

Using 0-based indexing, half-open intervals, and suitable defaults (as Python ended up having), they are beautiful: a[:n] and a[i:i+n]; the former is long for a[0:n].

Using 1-based indexing, if you want a[:n] to mean the first n elements, you either have to use closed intervals or you can use a slice notation that uses start and length as the slice parameters. Using half-open intervals just isn't very elegant when combined with 1-based indexing. Using closed intervals, you'd have to write a[i:i+n-1] for the n items starting at i. So perhaps using the slice length would be more elegant with 1-based indexing? Then you could write a[i:n]. And this is in fact what ABC did -- it used a different notation so you could write a@i|n. (See http://homepages.cwi.nl/~steven/abc/qr.html#EXPRESSIONS.)

But how does the index:length convention work out for other use cases? TBH this is where my memory gets fuzzy, but I think I was swayed by the elegance of half-open intervals. Especially the invariant that when two slices are adjacent, the first slice's end index is the second slice's start index is just too beautiful to ignore. For example, suppose you split a string into three parts at indices i and j -- the parts would be a[:i], a[i:j], and a[j:].
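The adjacency invariant is easy to check:

```python
# Half-open slices: adjacent slices share their boundary index,
# so a three-way split reassembles exactly.
a = "monty python"
i, j = 5, 8
parts = (a[:i], a[i:j], a[j:])
print(parts)                # ('monty', ' py', 'thon')
assert "".join(parts) == a  # holds for any 0 <= i <= j <= len(a)

# "First n" and "next n starting at i" need no +1/-1 compensation:
n = 3
print(a[:n], a[i:i+n])      # mon  py
```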

So that's why Python uses 0-based indexing.

Friday, July 8, 2011

Karin Dewar, Indentation and the Colon

In a recent post on my other blog I mentioned a second-hand story about how Python's indentation was invented by the wife of Robert Dewar. I added that I wasn't very sure of the details, and I'm glad I did, because the truth was quite different. I received a long email from Lambert Meertens with the real story. I am going to quote it almost completely here, except for some part which he requested not to be quoted. In summary: Karin Dewar provided the inspiration for the use of the colon in ABC (and hence in Python) leading up to the indentation, not for indentation itself. Here is Lambert's email:

The Dewar story is not about indentation, but about the invention of the colon.

First about indentation in B. Already B0, the first iteration in the B0, B1, B2, ... sequence of designs leading to ABC, had non-optional indentation for grouping, supplemented by enclosing the group between BEGIN and END delimiters. This can be seen in [GM76], section 4.1 (Layout). The indentation was supposed to be added, like pretty printing, by a dedicated B editor, and the user had no control over this; they were not supposed to be able to turn this off or otherwise modify the indentation regime.

Given the mandatory indentation, separate BEGIN and END delimiters are of course superfluous; in B1 we had no BEGIN, but only END IF, END FOR, and so on, and then the latter delimiters were also dropped in B2, leaving pure indentation as the sole indicator of grouping. See [ME81], Section 4 (STATEMENT SYNTAX).

The origin of indentation in ABC is thus simply the desire to have the program text look neat and be suggestive of the meaning, codifying what was already common practice but not enforced. The decision to remove the noise of BEGIN ... END may have been influenced by [PL75], which actually advocated using pure indentation for grouping. Since occam came later (1983), the feature can't have been copied from that language. Same for Miranda (1985). As far as I am aware, B was the first actually published (and implemented) language to use indentation for grouping.

Now the Dewar story, which is how I got the idea of the colon, as I wrote it down in a memoir of the ABC design rationale:

And here I will paraphrase, at Lambert's request.

In 1978, in a design session in a mansion in Jabłonna (Poland), Robert Dewar, Peter King, Jack Schwartz and Lambert were comparing various alternative proposed syntaxes for B, by comparing (buggy) bubble sort implementations written down in each alternative. Since they couldn't agree, Robert Dewar's wife was called from her room and asked for her opinion, like a modern-day Paris asked to compare the beauty of Hera, Athena, and Aphrodite. But after the first version was explained to her, she remarked: "You mean, in the line where it says: 'FOR i ... ', that it has to be done for the lines that follow; not just for that line?!" And here the scientists realized that the misunderstanding would have been avoided if there had been a colon at the end of that line.

Lambert also included the following useful references:

[PL75] P. J. Plauger. Signal and noise in programming language. In J. D. White, editor, Proc. ACM Annual Conference 1975, page 216. ACM, 1975.

[GM76] Leo Geurts and Lambert Meertens. Designing a beginners' programming language. In S.A. Schuman, editor, New Directions in Algorithmic Languages 1975, pages 1–18. IRIA, Rocquencourt, 1976.

[ME81] Lambert Meertens. Issues in the design of a beginners' programming language. In J.W. de Bakker and J.C. van Vliet, editors, Algorithmic Languages, pages 167–184. North-Holland Publishing Company, Amsterdam, 1981.

Tuesday, August 24, 2010

Why Python's Integer Division Floors

I was asked (again) today to explain why integer division in Python returns the floor of the result instead of truncating towards zero like C.

For positive numbers, there's no surprise:

>>> 5//2
2

But if one of the operands is negative, the result is floored, i.e., rounded away from zero (towards negative infinity):

>>> -5//2
-3
>>> 5//-2
-3

This disturbs some people, but there is a good mathematical reason. The integer division operation (//) and its sibling, the modulo operation (%), go together and satisfy a nice mathematical relationship (all variables are integers):

a/b = q with remainder r

such that

b*q + r = a and 0 <= r < b

(assuming a and b are >= 0).

If you want the relationship to extend for negative a (keeping b positive), you have two choices: if you truncate q towards zero, r will become negative, so that the invariant changes to 0 <= abs(r) < b; otherwise, you can floor q towards negative infinity, and the invariant remains 0 <= r < b. [update: fixed this para]

In mathematical number theory, mathematicians always prefer the latter choice (see e.g. Wikipedia). For Python, I made the same choice because there are some interesting applications of the modulo operation where the sign of a is uninteresting. Consider taking a POSIX timestamp (seconds since the start of 1970) and turning it into the time of day. Since there are 24*3600 = 86400 seconds in a day, this calculation is simply t % 86400. But if we were to express times before 1970 using negative numbers, the "truncate towards zero" rule would give a meaningless result! Using the floor rule it all works out fine.

Other applications I've thought of are computations of pixel positions in computer graphics. I'm sure there are more.

For negative b, by the way, everything just flips, and the invariant becomes:

0 >= r > b.
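A quick check of the invariant for every sign combination, including the timestamp use case:

```python
# The invariant b*q + r == a holds for every sign combination,
# and r always has the same sign as b (or is zero).
for a, b in [(5, 2), (-5, 2), (5, -2), (-5, -2)]:
    q, r = a // b, a % b
    assert b * q + r == a
    print(f"{a:3d} // {b:2d} = {q:2d}   {a:3d} % {b:2d} = {r:2d}")

# The time-of-day use case: one hour before the epoch is 23:00:00.
t = -3600         # POSIX timestamp, 1969-12-31 23:00:00 UTC
print(t % 86400)  # 82800 == 23 * 3600, a meaningful time of day
```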

So why doesn't C do it this way? Probably the hardware didn't do this at the time C was designed. And the hardware probably didn't do it this way because in the oldest hardware, negative numbers were represented as "sign + magnitude" rather than the two's complement representation used these days (at least for integers). My first computer was a Control Data mainframe and it used one's complement for integers as well as floats. A pattern of 60 ones meant negative zero!

Tim Peters, who knows where all Python's floating point skeletons are buried, has expressed some worry about my desire to extend these rules to floating point modulo. He's probably right; the truncate-towards-negative-infinity rule can cause precision loss for x%1.0 when x is a very small negative number. But that's not enough for me to break integer modulo, and // is tightly coupled to that.

PS. Note that I am using // instead of / -- this is Python 3 syntax, and also allowed in Python 2 to emphasize that you know you are invoking integer division. The / operator in Python 2 is ambiguous, since it returns a different result for two integer operands than for an int and a float or two floats. But that's a totally separate story; see PEP 238.