[Tutor] subtyping builtin type

spir denis.spir at gmail.com
Thu Jan 2 10:28:43 CET 2014


On 01/02/2014 03:21 AM, Steven D'Aprano wrote:
> On Wed, Jan 01, 2014 at 02:49:17PM +0100, spir wrote:
>> On 01/01/2014 01:26 AM, Steven D'Aprano wrote:
>>> On Tue, Dec 31, 2013 at 03:35:55PM +0100, spir wrote:
> [...]
>> I take the opportunity to add a few features, but would do
>> without Source altogether if it were not for 'i'.
>> The reason is: it is for a parsing library, or hand-made parsers. Every
>> matching func, representing a pattern (or "rule"), advances in source
>> whenever the match is ok, right? Thus in addition to returning the form (of
>> what was matched), they must return the new match index:
>> 	return (form, i)
>
> The usual way to do this is to make the matching index an attribute of
> the parser, not the text being parsed. In OOP code, you make the parser
> an object:
>
> class Parser:
>      def __init__(self, source):
>          self.current_position = 0  # Always start at the beginning
>          self.source = source
>      def parse(self):
>          ...
>
> parser = Parser("some text to be parsed")
> for token in parser.parse():
>      handle(token)
>
> The index is not an attribute of the source text, because the source
> text doesn't care about the index. Only the parser cares about the
> index, so it should be the responsibility of the parser to manage.

There is no (need for a) distinct Parser class, or even a notion of parser. A
parser is a top-level pattern (rule, object, or match func if one designs more
like FP than OOP). Symmetrically, every pattern is a parser for what it matches.

Think of branches in a tree: the tree is a top-level branch and every branch is
a sub-tree.

This conception is not semantically meaningful but highly useful, if not
necessary, in practice: it permits using every pattern on its own, for what it
matches; it permits trying and testing patterns individually. (One could always
find and implement workarounds, but they would be needless complications.)
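
A minimal sketch of this idea, under assumptions: the names Literal and
Sequence are hypothetical and not taken from any actual library. Each pattern
exposes the same match method, so the top-level pattern and any of its parts
can be tried in exactly the same way:

```python
class Literal:
    """Pattern matching a fixed string at a given position (hypothetical)."""

    def __init__(self, text):
        self.text = text

    def match(self, source, i=0):
        # Return (form, new_index) on success, None on failure.
        if source.startswith(self.text, i):
            return self.text, i + len(self.text)
        return None


class Sequence:
    """Pattern matching several sub-patterns one after another (hypothetical)."""

    def __init__(self, *patterns):
        self.patterns = patterns

    def match(self, source, i=0):
        forms = []
        for pattern in self.patterns:
            result = pattern.match(source, i)
            if result is None:
                return None
            form, i = result
            forms.append(form)
        return forms, i


hello = Literal("hello")
space = Literal(" ")
world = Literal("world")
greeting = Sequence(hello, space, world)  # the "top-level" pattern

# Any pattern, top-level or not, can be tried on its own.
print(world.match("world"))           # ('world', 5)
print(greeting.match("hello world"))  # (['hello', ' ', 'world'], 11)
```

Note that this sketch still passes the index in and out of every match call.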

However, I see and partially share what you say -- see below.

>> Symmetrically, every match func using another (meaning nearly all) receives
>> this pair. (Less annoyingly, every match func also takes i as input, in
>> addition to the src str.) (There are also a handful of other annoying
>> points, consequences of those ones.)
>
> The match functions are a property of the parser, not the source text.
> So they should be methods on a Parser object. Since they need to track
> the index (or indexes), the index ought to be an attribute on the
> parser, not the source text.

This does not hold for me. Think, e.g., of 2-phase parsing (like when using lex &
yacc): the lexer (lex) provides the parser (yacc) with a stream of lexemes
completely opaquely for the parser, which does not know about indexes (neither
in the original source string, nor in the stream of lexemes). Since I (usually)
parse in a single phase, the source string is in the position of lex above: it
feeds the parser with a stream of ucodes, advancing in a coherent manner; the
parser does not need, nor want (lol!), to manage the source's index. That's the
point. The index is a given for the parser, which it just uses to try & match at
the correct position.

>> If I have a string that stores its index, all of this mess is gone.
>
> What you are describing is covered by Martin Fowler's book
> "Refactoring". He describes the problem:
>
>      A field is, or will be, used by another class more than the
>      class on which it is defined.
>
> and the solution is to move the field from that class to the class where
> it is actually used.
>
> ("Refactoring - Ruby Edition", by Jay Fields, Shane Harvie and Martin
> Fowler.)
>
> Having a class (in your case, Source) carry around state which is only
> used by *other functions* is a code-smell. That means that Source is
> responsible for things it has no need of. That's poor design.

I don't share this view. A source has a current index, just like an open file
currently being read conceptually (if not in practice) has one.

It's also sane & simple, and thread-safe, even if two Source objects happened to
share (refs to) the same underlying actual source string (e.g. read from the same
source file): each has its own current index.
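
A minimal sketch of what is meant, under assumptions: this Source is a plain
wrapper around a str (not a subclass of the builtin type), and the method names
startswith and advance are made up for illustration. Two Source objects share
one underlying string, yet each keeps its own index:

```python
class Source:
    """A source string that carries its own current match index (sketch)."""

    def __init__(self, text):
        self.text = text
        self.i = 0  # current position, owned by this Source object

    def startswith(self, prefix):
        # Test for prefix at the current position, not at the start.
        return self.text.startswith(prefix, self.i)

    def advance(self, n):
        self.i += n


text = "abcdef"
first = Source(text)
second = Source(text)  # refers to the same underlying str object

first.advance(3)
print(first.i, second.i)            # 3 0
print(first.text is second.text)    # True
print(first.startswith("def"))      # True
print(second.startswith("def"))     # False
```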

I don't see how Fowler's views apply to this case. Whether a Source or a Parser
holds the index does not change attribute access or its desirable properties.

> By making the parser a class, instead of a bunch of functions, they can
> share state -- the *parser state*. That state includes:
>
> - the text being parsed;
> - the tokens that can be found; and
> - the position in the text.
>
> The caller can create as many parsers as they need:
>
> parse_this = Parser("some text")
> parse_that = Parser("different text")
>
> without them interfering, and then run the parsers independently of each
> other. The implementer, that is you, can change the algorithm used by
> the Parser without the caller needing to know. With your current design,
> you start with this:
>
> # caller is responsible for tracking the index
> source = Source("some text")
> assert source.i == 0
> parse(source)

Maybe we just don't have the same experience or practice of parsing, but your
reasoning does not match (sic!) anything I know. I don't see how having a source
hold its index prevents anything above. In particular, how does it prevent me
from being able to
"change the algorithm used by the Parser without the caller needing to know"?

> What happens if next month you decide to change the parsing algorithm?
> Now it needs not one index, but two.

?
What do you mean?

>  You change the parse() function,
> but the caller's code breaks because Source only has one index.

?
Every one of us constantly changes the algorithm during the development phase,
and when setting up patterns for a new kind of source, don't we? What does it
have to do with where the match index is stored?

>  You
> can't change Source, because other parts of the code are relying on
> Source having exactly a single index.

?

>  So you have to introduce *two* new
> pieces of code, and the caller has to make two changes::
>
> source = SourceWithTwoIndexes("some text")
> assert source.i == 0 and source.j == -1
> improved_parse(source)

???

> Instead, if the parser is responsible for tracking its own data (the
> index, or indexes), then the caller doesn't need to care if the parsing
> algorithm changes. The internal details of the parser are irrelevant to
> the caller. This is a good thing!

A source is just a source, not part of, container of, or in any way related to 
the parsing algorithm. Changing an algo does not interfere with the source in 
any manner I can imagine.

> parser = Parser("some text")
> parser.parse()
>
> With this design, if you change the internal details of the parser, the
> caller doesn't need to change a thing. They get the improved parser for
> free.
>
> Since the parser tracks both the source text and the index, it doesn't
> need to worry that the Source object might change the index.
>
> With your design, the index is part of the source text. That means that
> the source text is free to change the index at any time. But it can't do
> that, since there might be a parser in the middle of processing it. So
> the Source class has to carry around data that it isn't free to use.
>
> This is the opposite of encapsulation. It means that the Source object
> and the parsing code are tightly coupled. The Source object has no way
> of knowing whether it is being parsed or not, but has to carry around
> this dead weight, an unused (unused by Source) field, and avoid using it
> for any reason, *just in case* it is being used by a parser. This is the
> very opposite of how OOP is supposed to work.

I understand, I guess, the underlying concerns expressed in your views, like
separation of concerns, and things holding their own data. This is in fact
related to why I want sources to know their index. An alternative is to
consider the whole library, or parser module (a bunch of matching patterns), as a
global tool, a parsing machine, and make i a module-level var. Not a global in
the usual sense: it's really part of the parsing machine (part of the machine's
state), thus an attribute rather than a plain var. But this prevented it from
being thread-safe and (however unlikely it may be to ever parse sources in
parallel; I don't know of any example) this looked somewhat aesthetically
unsatisfying to me.

The price is having attribute access (in code and execution time) for every 
actual read into the source. I dislike this, but less than having a "lost" var 
roaming around ;-).

>> It
>> makes for clean and simple interfaces everywhere. Also (one of the
>> consequences) I can directly provide match funcs to the user, instead of
>> having to wrap them inside a func whose only utility is to hide the
>> additional index (in both input & output).
>
> I don't quite understand what you mean here.

If every match func, or 'match' method of pattern objects, takes & returns the
index (in addition to their source input and form output), then they don't have
the simple interface expected by users of the tool (lib or hand-made parser,
or a mix). You want to write this:
	form = pat.match(source)
Not this:
	form, i = pat.match(source, i)

I'd thus need to provide wrapper funcs to get rid of i in both input and output.
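
To make the difference concrete, here is a rough sketch under assumptions: all
names (match_word_indexed, match_word, match_word_source, and this small Source
class) are hypothetical and only illustrate the two interface styles. The first
function passes the index in and out and so needs a wrapper for users; the last
one relies on a source that stores its own index and needs none:

```python
def match_word_indexed(source, i):
    """Index-passing style: takes the index and returns (form, new_index)."""
    j = i
    while j < len(source) and source[j].isalpha():
        j += 1
    if j == i:
        return None  # no letter at position i
    return source[i:j], j


def match_word(source):
    """Wrapper whose only purpose is to hide the index from the user."""
    result = match_word_indexed(source, 0)
    if result is None:
        return None
    form, _ = result
    return form


class Source:
    """Hypothetical source object that stores its own current index."""

    def __init__(self, text):
        self.text = text
        self.i = 0


def match_word_source(source):
    """Index-in-source style: no index in the signature or the result."""
    start = source.i
    while source.i < len(source.text) and source.text[source.i].isalpha():
        source.i += 1
    if source.i == start:
        return None
    return source.text[start:source.i]


print(match_word_indexed("foo bar", 0))  # ('foo', 3)
print(match_word("foo bar"))             # foo
src = Source("foo bar")
print(match_word_source(src), src.i)     # foo 3
```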

Denis

