ASPN ActiveState Programmer Network
ActiveState
/ Home / Perl / PHP / Python / Tcl / XSLT /
/ Safari / My ASPN /
Cookbooks | Documentation | Mailing Lists | Modules | News Feeds | Products | User Groups


Recent Messages
List Archives
About the List
List Leaders
Subscription Options

View Subscriptions
Help

View by Topic
ActiveState
.NET Framework
Open Source
Perl
PHP
Python
Tcl
Web Services
XML & XSLT

View by Category
Database
General
SOAP
System Administration
Tools
User Interfaces
Web Programming
XML Programming


MyASPN >> Mail Archive >> tcl-core
tcl-core
[TCLCORE] Cloverfield references (was Re: Variable access)
by fbonnet other posts by this author
May 15 2008 4:50AM messages near this date
Re: [TCLCORE] Variable access | Re: [TCLCORE] Cloverfield references (was Re: Variable access)
Hi Neil,

Neil Madden wrote:
>  I think a reference implementation (pun intended!), or some semi-formal 
>  description, would clarify this discussion immensely. Until then, some 
>  not-so-brief comments are below. I've tried to trim down to just the key 
>  points, so I've trimmed much of the rest of the discussion.

OK, so I'm going to give a little more information about Cloverfield 
references here, but first, I suggest you read the Cloverfield 
Tridekalogue on the Wiki here:

http://wiki.tcl.tk/20643

Especially the parts on references. Also read the section on references 
on the main Cloverfield page:

http://wiki.tcl.tk/20638 (near the end of the page).


The reason why I haven't given much details before is that the original 
discussion revolved on syntactic rather than semantic issues. But now I 
understand that a little background info on what is meant by "reference" 
is certainly useful to get the whole concept and its potential usage.

So the present discussion is no longer about the proposed syntax 
enhancement, but about the semantics of references in Cloverfield. Both 
are orthogonal issues. I've changed the subject line accordingly.


First, references as they are meant in Cloverfield are distinct from C 
pointers or Java references. The {ref <id> } syntax is closer in 
principle to XML ids, in the sense that it can be used to tag data 
within a structure, so that potentially circular relations can be 
expressed. And the distinction between $var and $&var has more to do 
with immutable vs mutable than with value vs reference. Also note that 
Cloverfield is a work in progress, so the syntax is not frozen.

     % set v ({ref a}(1 2 3) {ref a}{})
     % puts $v
     {ref a}(1 2 3) {ref a}{}
     % puts $v{0}
     1 2 3
     % puts $v{1}
     1 2 3

COW still applies to structures with references. Mutations don't 
propagate through simple values ($) but only through references, 
explicitly ($&) or implicity (mutable operations).

     % set v2 $v
     % puts $v2
     {ref a}(1 2 3) {ref a}{}

     % set v3 $&v
     % puts $v2
     {ref a}(1 2 3) {ref a}{}

     % set v{0 2} foo
     % puts $v{0}
     1 2 foo
     % puts $v{1}
     1 2 foo
     % puts $v
     {ref a}(1 2 foo) {ref a}{}
     % puts $v2
     {ref a}(1 2 3) {ref a}{}
     % puts $v3
     {ref a}(1 2 foo) {ref a}{}

Moreover, the {ref <id> } tag is only output for values that are 
referenced several times in a given structure (=>  self-referential or 
circular structures). So simple structures are serialized simply, even 
if references are internally used everywhere.

     % set v{1} bar
     % puts $v
     (1 2 3) bar
     % set v{1} $&v{0}
     % puts $v
     {ref a}(1 2 3) {ref a}{}

Another important point is that the ID space is word-local, i.e. IDs are 
resolved in the current substitution context. So references with the 
same name don't point to the same value unless they are in the same context:

     # Two structures using the same ID but in different contexts.
     % set a {ref a}(1 2 3)
     % set b {ref a}(4 5 6)
     % puts $a
     1 2 3
     % puts $b
     4 5 6

     # One structure using the same ID twice in the same context.
     # (Using parentheses).
     % set c (({ref a}1 2) ({ref a}3 4))
     % puts $c
     ({ref a}1 2) ({ref a}{} 4)
     % puts $c{0}
     1 2
     % puts $c{1}
     1 4
     % set c{0 0} foo
     % puts $c{0}
     foo 2
     % puts $c{1}
     foo 4
     % puts $c
     ({ref a}foo 2) ({ref a}{} 4)

     % One structure using the same ID twice but in distinct contexts.
     # (Using braces).
     % set d {{{ref a}1 2} {{ref a}3 4}}
     % puts $d
     {{ref a}1 2} {{ref a}3 4}
     % puts $d{0}
     {ref a}1 2
     % puts $d{1}
     {ref a}3 4
     % set d{0 0} foo
     % puts $d{0}
     foo 2
     % puts $d{1}
     {ref a}3 4
     % puts $d
     {foo 2} {{ref a}3 4}


Finally, Cloverfield allows globally scoped IDs for complex cases 
(values shared by otherwise distinct structures). They are the only 
potential source of security problems. Also, global IDs are per 
interpreter so propagation is limited. So safe interpreters could for 
example manage separate pools of global IDs depending on the source, or 
let them resolve to totally distinct references (which would break the 
bindings but preserve the data). But I don't think it's a big problem 
with regular applications, as applications having security issues should 
use safe interps anyway. I don't know if I make myself clear, however 
it's quite clear in my mind ;-)

> > A --> B
> > |   / |
> > |  /  |
> > V L   V
> > C --> D
> >
> > Obviously this cannot be represented by a tree, so pure Tcl lists are
> > useless in this case. Now if we use nodes with references, we get the
> > following :
> >
> > Node  Children
> > A     B C
> > B     C D
> > C     D
> > D     {}
>  
>  My immediate thought is that graphs are best represented by relations, 
>  rather than references, so for any serious graph work, I'd use a 
>  relational database like SQLite, or a relational language like Prolog. 
>  For a small example like this, I'd probably use my own digraph package 
>  (http://wiki.tcl.tk/18129 - based on dicts), which results in a similar 
>  solution to your first array solution:

Keep in mind that this was a very simple example, you have to put it in 
perspective for a more generic usage. You can't mandate a RDB for every 
application that needs even mildly complex structures. The typical 
example I have in mind is DOM trees.

> >      array set graph {
> >          A {graph(B) graph(C)}
> >          B {graph(C) graph(D)}
> >          C {graph(D)}
> >          D {}
> >      }
>  
>  Why would you do this? As I mentioned before, references always need 
>  some mapping (interpretation) which defines what they refer to. Trading 
>  a reference which is defined relative to the graph itself (i.e. simple 
>  keys) for one which is defined relative to the current variable 
>  namespace, doesn't seem to gain much (and makes it harder to e.g. clone 
>  the graph).

We put ourselves in the general case where references must point to 
globally scoped variables. For example you may want to use several 
graphs and move some nodes between them. If you prefer I can express the 
above graph as a set of variables:

     set a {A {b c}}
     set b {B {c d}}
     set c {C {d}}
     set d {D {}}

But this doesn't change the point.

> > Now, if we want to manage the lifetime of each node, we have to maintain
> > bookkeeping info somewhere. IOW, we have to write an ad-hoc garbage
> > collector.
>  
>  Sure. If you use a new system of reference (e.g. a dict, or an array 
>  etc), then you need to manage the lifetimes of the objects contained in 
>  it. It's not clear in this example what particular strategy of memory 
>  management is desired: do you want a node to disappear when no edge 
>  leads to it? Or when the only reference to it is itself (via a circular 
>  reference)? Or do you want edges to be removed when the node they target 
>  is removed? It's not obvious to me that a general-purpose reference GC 
>  does the right thing for the application, without knowing what the 
>  application is.

We're in the classical case where structures can be arbitrarily shared, 
and get GC'd when they are no longer reachable. That's the kind of 
application where Tcl doesn't do well when compared to competing 
solutions such as Java or Python.

> > The idea beind Cloverfield's serialization format is that any reference
> > can be recreated using any of its string representation. The technique
> > consists in serializing both the referenced value and a suitable
> > identifier. [...]
>  
>  Right, but this still leaves lots of questions. For instance, it is 
>  quite easy to manufacture references in this system:
>  
>  set a {ref A}(1 2 3) ;# $a is now a reference with id "A" and val (1 2 3)
>  set b {ref A}(a b c)
 > 
 >  Now what? Does setting b override the reference in $a? What is used to
 >  determine which of these conflicting reference definitions is current?
[...]

See my previous explanation. Also, reference IDs are designed to be 
manufactured easily, but their scope is limited to the context in which 
they are declared, so they are pretty useless for potential attacks.

>  (Being able to 
>  pass mutable references where only values were previously allowed in 
>  general would need a careful security audit).

With the proposed syntax, one cannot pass a mutable value other than 
explicity. For example (let's suppose that incr works on values or 
references and not var names):

     % set v ({ref a}1 {ref a}{})
     % set v{1}
     1
     % incr $v{0}
     2
     % set v{1}
     1
     % incr $&v{0}
     2
     % set v{1}
     2

Other than with global IDs, mutable ops behind your back is impossible.

>  Another scenario:
>  set a {ref A}(1 2 3)
>  set b $&a
>  
>  As I understand it, both $a and $b now refer to the reference ("A") that 
>  points at the value (1 2 3). Now, let's say that for some reason that 
>  reference "A" loses its internal rep (say somebody performs an operation 
>  with $a that converts it to some other type). 

First, {ref A} only tags data within a. So in the first line, {ref A} is 
essentially a no-op. The second line, though, binds variable b to a's 
current value. If the value of variable a changes then b is unaffected. 
However if the content of a's value is altered through a mutable 
operation, then this change affects the content of b as well.

     % set a (1 2 3)
     % set b $&a
     % set a{0} foo ; # Mutable op on a's value, also ref'd by b.
     % set b
     foo 2 3
     % set a (4 5 6) ; # Variable now refers to another value.
     % set b
     foo 2 3

>  Furthermore, lets say that 
>  by some further operation causes a copy-on-write so that $a and $b now 
>  are distinct Tcl_Objs, but that they still have an identical string rep. 
>  IIRC, the following might cause that situation:
 > 
>    append a " "
>    set a [string range $a 0 end-1]
>  
>  So now we have two distinct Tcl_Objs $a and $b that used to be the same 
>  reference but now aren't and neither has a reference internal rep. Now, 
>  say $a is stashed away in some data structure and forgotten about. 
>  Meanwhile, $b is treated as a reference again and re-establishes the 
>  reference, recreating it if necessary from the string rep. Now the user 
>  happily updates $b, mutating the reference, to say (a b c). A little 
>  later, $b is unset and the reference disappears again. Now, $a is 
>  brought out of hiding, and brushed down, and then an attempt is made to 
>  read the reference: what does it return? Presumably, as the reference 
>  has now gone, it is recreated again from the string rep of $a, reverting 
>  to (1 2 3) -- all the updates to $b have been lost, even though these 
>  references were supposed to be the same.

Since the internal link is broken when a's value is set again, this 
won't happen hopefully.

     % set a foo
     % set b $&a
     % append a bar ; # Mutable op on a's value, also ref'd by b.
     % set b
     foobar
     % set a [string range $a 0 end-1] ; # a now refers to another value.
     fooba
     % set b
     foobar

I hope this answers all your concerns.

> >  To handle circular references, a value is only serialized
> > when it is first encountered in the current serialization process. In
> > the above example, A's children is (B C D), so when serializing the
> > whole graph, A is first serialized, then its first child B, then B's
> > children and so on, so that when nodes need to be serialized again
> > later, we only need to output the ref id, and the whole value can be
> > omitted.
>  
>  But then, presumably if this entire structure gets shimmered (e.g. by an 
>  append), all of these embedded references lose their internal reps 
>  (possibly causing the actual reference to be GC'd). Now if we extract 
>  one of these later nodes that only have the reference ID then we no 
>  longer have enough information to recreate that reference, as we have 
>  lost both the original reference and any string rep that could be used 
>  to recreate it. The integrity of each individual reference is only 
>  preserved as long as the integrity of the entire structure is preserved.

No, references can be recreated because both their ID and value are 
serialized.

Oh, you mean extracting the value-less nodes from their string rep? Sure 
the value would be lost, but I don't think this makes much sense in a 
real-world scenario, as it involves slicing the string rep manually 
without using access commands. That would be the same as using [string 
range] to extract an element from a list's string rep (and it wouldn't 
work anyway because of quoting rules).

>  My point is that real "strong" references are entirely incompatible with 
>  EIAS.

Hmmm. I'm not sure what you mean by that. More precisely, what does EIAS 
means exactly? To me, it's the fact that everything can be expressed as 
strings, and that internal structures can be recreated from their string 
representation. The Cloverfield reference serialization format meets 
these criteria IMHO.


> >      % set jim (1 2 3)
> >      % set d [dict create foo 12 bar $&jim baz $&jim]
> >      foo 12 bar {ref <id>}(1 2 3) baz {ref <id>}{}
> >      % set jim (a b c)
> >      % set d
> >      foo 12 bar {ref <id>}(a b c) baz {ref <id>}{}
> >      % set d(baz)
> >      a b c
>  
>  Why does the item denoted by $d(baz) have a different string rep when 
>  serialised inside the dictionary to when it is used on its own? I.e., 
>  why isn't the output of that last command:
>  
>      % set d(baz)
>      {ref <id>}{}

String reps are generated on demand and depending on the context. So 
d(baz) refers to the list (a b c). The fact that this value is used 
elsewhere is out of context. So the string rep is simply that of the 
refered value.

>  Presumably, [set] is looking at the value at $d(baz) and seeing (by 
>  inspection of the internal rep, or by string matching if the rep has 
>  been lost) that it looks like a reference and then doing another 
>  de-reference to get the actual value? Whereas the $&foo syntax just 
>  grabs the reference itself and returns it's string rep? Except that in 
>  this case 'jim' doesn't contain a reference to begin with, but is a 
>  normal variable, so $&jim creates the reference, presumably. Then, do 
>  subsequent $&jim's guarantee to return the same reference, or a fresh 
>  one? Or are variables themselves implemented as references?

Variables are just entry points to references. IOW, variables are names 
in the current calling context that hold references to values. So $&var 
returns a reference to var's value at the time of substitution. $var 
returns the value itself. $@var would return a weak reference to var 
(semantically equivalent to its variable name), meaning that it would 
dereference var at each access. If var's value changes, then $@var will 
reflect this change, contrary to $&var which will continue to point to 
the previous value.

     % set var foo
     % set v1 $&var
     % set v2 $v1
     % set v3 $@var
     % set var bar
     % set v1
     foo
     % set v2
     foo
     % set v3
     bar
     % set v1 baz
     % set v2
     foo

>  Again, I worry about the security problems this introduces, and more 
>  generally the difficulties it introduces. For example, consider some 
>  code like the following:
>  
>  proc execute-query {query} {
>      lappend query -user foo -password sekret
>      db eval $query
>  }
>  set i [interp create -safe]
>  $i alias query execute-query
>  $i eval $untrustedcode
>  
>  In current Tcl that is safe: the username and password remain in trusted 
>  code. With mutable references there is now a brand new line of attack 
>  where the caller simply passes a reference instead of a value and can 
>  now read back the user name and password once the query completes. 
>  Mutable references should not be transparently interchangeable with 
>  values. The security aspects are just the most extreme example of the
>  problems this can cause. Much code is written now with the assumption 
>  that changes to local variables aren't visible outside of the proc 
>  unless a specific broadening of scope is performed (via upvar, variable, 
>  global etc).
>  
>  (Incidentally, while testing this I discovered that a safe slave can 
>  introspect the aliases defined in it, including retrieving the target 
>  command that would be executed in the master interpreter, which seems 
>  like a security flaw of its own).

I see your point. In the code above, execute-query should avoid 
modifying the query variable, as it can hold a reference. Hopefully with 
the proposed changes, lappend would become [list append] or something 
and work in a purely functional way:

proc execute-query {query} {
     db eval [list append $query -user foo -password sekret]
}

But this would require a careful audit of aliased procs (which is needed 
anyway). A more general solution would be to disallow reference passing 
from safe interpreters and downgrade them to plain values, which would 
do the trick.

Anyway, I think my proposal won't break the existing pass-by-value 
semantics, and rest assured that I'll work hard to ensure that the 
implementation lives up to this promise.


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Tcl-Core mailing list
Tcl-Core@[...].net
https://lists.sourceforge.net/lists/listinfo/tcl-core
Thread:
fbonnet
fredericbonnet
Neil Madden
fbonnet
Neil Madden
fbonnet
Lars Hellstrom
fbonnet
Neil Madden
fredericbonnet
David Welton
fbonnet
David Welton
Larry McVoy
Alexandre Ferrieux
Andreas Leitgeb
fbonnet
Neil Madden
Donal K. Fellows
Alexandre Ferrieux
Larry McVoy
Neil Madden
Gustaf Neumann
Neil Madden
Larry McVoy
Neil Madden
Alexandre Ferrieux
fbonnet
Neil Madden
Alexandre Ferrieux
Donal K. Fellows
Larry McVoy
Alexandre Ferrieux
Donal K. Fellows
Alexandre Ferrieux

Privacy Policy | Email Opt-out | Feedback | Syndication
© ActiveState Software Inc. All rights reserved