RE: [xml-dev] Normalizing XML [was: XML information modeling best practices]
by Joshua Allen
Apr 30 2002 11:05PM
And in my opinion, normalization is overrated.  Too many times I have
seen people fresh from a database theory class build a system that
is completely normalized, with no thought as to how the data is going to
be coming in or going out.  They end up with a system that is
unmanageable and doesn't scale.  That is not to say that database
normalization is bad, but normalization is harmful if people are just
following rote guidelines without understanding *why* those guidelines
exist, and more importantly, when those guidelines do *not* apply, and
when other information modeling techniques are more useful.

Joshua Allen
Microsoft WebData XML
425.705.7857

>  -----Original Message-----
>  From: Michael Rys [mailto:mrys@[...].com]
>  Sent: Tuesday, April 30, 2002 3:27 PM
>  To: Ronald Bourret; xml-dev@[...].org
>  
>  Ah, for once an interesting topic.
>  
>  I don't have much time now, but I would like to point out the
>  following:
>  
>  1. The best definition of normal forms is based on the functional
>  dependencies and what constraints the normal forms impose on allowed
>  functional dependencies. The one exception is the 1st NF.
>  
>  1. 1st NF: All nested relational models are also known as NF^2
>  (non-first-normal-form). Since XML is nested and allows repetition, I
>  would consider this to also not apply to XML data. However, some of
>  the guidelines given below are certainly good guidelines.
>  
>  2. 2nd NF etc: I think they all certainly apply, since FD analysis
>  basically allows you to determine what property groups should be
>  considered "entities" (ie complex elements) and how they should relate
>  to each other (nesting vs IDREF).
>  
>  Note that this of course assumes that you still apply the notion of
>  functional dependencies and some form of ER modeling which may or may
>  not be appropriate for your XML domain (ie, markup may not care too
>  much).
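
To make the functional-dependency analysis above concrete, here is a minimal sketch that is not part of the original thread. It assumes the sales order data has been flattened into rows; the column names, the helper `fd_holds`, and order 124 are hypothetical and exist only for illustration. A dependency "determinant -> dependent" holds if no two rows agree on the determinant but differ on the dependent.

```python
# Hypothetical flattened rows; order 124 is invented for illustration.
ROWS = [
    {"order": "123", "customer": "456", "customer_name": "Customers, Inc.",
     "item": "1", "part": "1b-10"},
    {"order": "123", "customer": "456", "customer_name": "Customers, Inc.",
     "item": "2", "part": "zyx23"},
    {"order": "124", "customer": "456", "customer_name": "Customers, Inc.",
     "item": "1", "part": "1b-10"},
]


def fd_holds(rows, determinant, dependent):
    """Return True if the determinant columns functionally determine the dependent columns."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # setdefault stores the first value seen for this key and returns it;
        # a later, different value for the same key violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(fd_holds(ROWS, ["customer"], ["customer_name"]))  # customer number determines name
print(fd_holds(ROWS, ["order"], ["part"]))              # order number alone does not determine part
print(fd_holds(ROWS, ["order", "item"], ["part"]))      # order plus item number does
```

Under this reading, the customer properties would plausibly form their own "entity" (a complex element or a separate document), while parts depend on the order and item number together and so nest under the order.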
>  
>  Best regards
>  Michael
>  
>  PS: I think the above basically agrees with Michael Kay's reply...
>  
>  > -----Original Message-----
>  > From: Ronald Bourret [mailto:rpbourret@[...].com]
>  > Sent: Tuesday, April 30, 2002 2:19 AM
>  > To: xml-dev@[...].org
>  > Subject: [xml-dev] Normalizing XML [was: XML information modeling
>  > best practices]
>  >
>  > Manos Batsis wrote:
>  > > > XML is pretty good for tables, but not so good
>  > > > for enforcing relational normalization rules.
>  > >
>  > > Of course. Enforcing relational normalization rules shuts down
>  > > most good reasons to use XML in the first place, with the possible
>  > > exception of exchanging DB data between servers via http.
>  >
>  > I don't think this is quite true. One of the questions I wrestled
>  > with in trying to understand native XML databases was what
>  > "normalization" meant. While there's undoubtedly a lot of room for
>  > thought here, I did take a look at the first, second, and third
>  > (relational) normal forms and tried to see what they meant in XML
>  > terms.
>  >
>  > The following example supposes you are viewing a sales order
>  > hierarchy:
>  >
>  >    <Order>
>  >       <Number>123</Number>
>  >       <Date>1/2/03</Date>
>  >       <Customer>
>  >          <Number>456</Number>
>  >          <Name>Customers, Inc.</Name>
>  >       </Customer>
>  >       <Item>
>  >          <Number>1</Number>
>  >          <Part>1b-10</Part>
>  >          <Quantity>10</Quantity>
>  >       </Item>
>  >       <Item>
>  >          <Number>2</Number>
>  >          <Part>zyx23</Part>
>  >          <Quantity>5</Quantity>
>  >       </Item>
>  >    </Order>
>  >
>  > First normal form:
>  > ------------------
>  > Data is in first normal form if it (a) has a primary key and (b) has
>  > no repeating fields.
>  >
>  > A primary key basically means that there is a set of fields that
>  > uniquely identify the other fields in the row. In XML terms, this
>  > implies that you only store one "thing" per document -- such as one
>  > sales order or one chapter -- unless the collection of "things"
>  > itself has identity. (It's not clear that all "things" stored in XML
>  > documents can, in fact, be identified by a proper subset of their
>  > data.)
>  >
>  > This is not to say that it's not useful to place multiple "things"
>  > in a single XML document -- for example, it is quite useful to batch
>  > a bunch of sales orders together to ship them over the wire -- just
>  > that such documents are not normal.
>  >
>  > Repeating fields just means that you don't see field names like
>  > Author1 and Author2. In XML terms, this means you use repeating
>  > children (*, +) rather than enumerated children.
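
As an illustration of the "repeating versus enumerated children" rule (a sketch that is not part of the original thread), the following uses Python's standard `xml.etree.ElementTree` to flag sibling elements whose names differ only by a trailing number. The two `Book` documents and the helper `enumerated_children` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical documents, invented for illustration.
ENUMERATED = "<Book><Title>T</Title><Author1>A</Author1><Author2>B</Author2></Book>"
REPEATING = "<Book><Title>T</Title><Author>A</Author><Author>B</Author></Book>"


def enumerated_children(xml_text):
    """Return base names that occur as numbered siblings (Author1, Author2, ...)."""
    root = ET.fromstring(xml_text)
    found = set()
    for parent in root.iter():
        bases = {}
        for child in parent:
            # Split a tag such as "Author2" into the base "Author" and the number "2".
            match = re.fullmatch(r"(.*?\D)(\d+)", child.tag)
            if match:
                bases.setdefault(match.group(1), set()).add(child.tag)
        # Only report a base when at least two distinct numbered tags share it.
        found.update(base for base, tags in bases.items() if len(tags) > 1)
    return sorted(found)


print(enumerated_children(ENUMERATED))
print(enumerated_children(REPEATING))
```

This is only a naming heuristic: a lone element such as `H1` is not flagged, and a schema-level check on the content model would be the more reliable approach.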
>  >
>  > Second normal form:
>  > -------------------
>  > Data is in second normal form if the entire primary key is needed to
>  > predict each field value. The effect is to split the one and many
>  > parts of a one-to-many relationship into separate tables. For
>  > example, store sales order header information and line item
>  > information in separate tables.
>  >
>  > This form exists in the relational model to avoid duplicate data: if
>  > you store sales order header and line item data in the same table,
>  > the header information gets repeated on each line item row. XML
>  > doesn't have this problem -- it stores hierarchies quite nicely
>  > without duplicate data -- so I don't think the second normal form
>  > really applies.
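
The duplication described above can be seen by flattening the sample order into one row per line item. This is a minimal sketch, not from the original thread; the helper `flatten` is hypothetical, the customer element is omitted for brevity, and the item elements are written on one line each.

```python
import xml.etree.ElementTree as ET

# The sample sales order from the message, without the Customer element.
ORDER = """
<Order>
   <Number>123</Number>
   <Date>1/2/03</Date>
   <Item><Number>1</Number><Part>1b-10</Part><Quantity>10</Quantity></Item>
   <Item><Number>2</Number><Part>zyx23</Part><Quantity>5</Quantity></Item>
</Order>
"""


def flatten(order_xml):
    """Return one (order number, date, item number, part, quantity) row per Item."""
    order = ET.fromstring(order_xml)
    # findtext looks only at direct children, so this is the order's own Number.
    header = (order.findtext("Number"), order.findtext("Date"))
    return [
        header + (item.findtext("Number"), item.findtext("Part"), item.findtext("Quantity"))
        for item in order.findall("Item")
    ]


for row in flatten(ORDER):
    print(row)
```

Every row produced this way carries the same order number and date, which is the repetition that second normal form removes in a single-table design. In the XML hierarchy the header values appear once.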
>  >
>  > Third normal form:
>  > ------------------
>  > Data is in third normal form if you can't predict one non-key field
>  > from another non-key field. The effect of this is to split the many
>  > and one parts of a many-to-one relationship into separate tables.
>  > For example, store customer data in a separate table from sales
>  > order data.
>  >
>  > This poses a real problem in the XML world, since many real-world
>  > documents contain duplicate data. For example, many sales orders
>  > contain customer information -- name, address, phone number, etc.
>  >
>  > I think that this does apply to XML, but that you need to decide
>  > when it is useful to apply this form. That is, if you want truly
>  > normal XML data, you should store this sort of data in a separate
>  > document and link to it from your main document. For example, store
>  > the data for each customer in a separate document and link to it
>  > from your sales order documents.
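
One way to picture "store the customer in a separate document and link to it" is sketched below. It is not part of the original thread: the `CustomerRef` element, its `href` attribute, the file names, order 124, and the in-memory `DOCUMENTS` store are all assumptions made for illustration, not a standard linking mechanism.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store keyed by invented file names.
DOCUMENTS = {
    "customer-456.xml":
        "<Customer><Number>456</Number><Name>Customers, Inc.</Name></Customer>",
    "order-123.xml":
        '<Order><Number>123</Number><CustomerRef href="customer-456.xml"/></Order>',
    "order-124.xml":
        '<Order><Number>124</Number><CustomerRef href="customer-456.xml"/></Order>',
}


def customer_name(order_name, documents):
    """Follow the order's CustomerRef link and return the customer's Name."""
    order = ET.fromstring(documents[order_name])
    href = order.find("CustomerRef").get("href")
    customer = ET.fromstring(documents[href])
    return customer.findtext("Name")


print(customer_name("order-123.xml", DOCUMENTS))

# Updating the single customer document is visible from every order that links to it.
updated = dict(DOCUMENTS)
updated["customer-456.xml"] = (
    "<Customer><Number>456</Number><Name>Customers, Ltd.</Name></Customer>"
)
print(customer_name("order-123.xml", updated))
print(customer_name("order-124.xml", updated))
```

Because the name lives in one place, a change cannot leave two orders disagreeing about it. The cost is an extra lookup, and an order no longer records the customer data as it stood at the time of the transaction.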
>  >
>  > However, I also think that this only makes sense if XML is the
>  > primary storage format for your data, since it allows you to avoid
>  > update anomalies (as Jonathan Robie pointed out in another email).
>  > If XML is a secondary storage format, then you probably don't need
>  > to worry about the duplicate data, since it is really a historical
>  > record, not a set of live data.
>  >
>  > To explain: Consider our sales order documents. It is unlikely that
>  > the data for these documents lives in XML. More likely, the data
>  > lives in a relational database. In this case, the sales order
>  > document is a historical record of a given transaction, so the fact
>  > that the same customer data is used in multiple sales order
>  > documents doesn't matter -- nobody is going to try to update it and
>  > there is no/low risk of update anomalies.
>  >
>  > Now consider genealogical data that I am storing in a native XML
>  > database because it is too irregular to fit into a relational
>  > database. In this case, I probably do want to store shared data in
>  > separate documents so it lives in only one place in the database.
>  >
>  > For example, my documents for each person contain data such as
>  > birthplace, birthdate, parents, siblings, career information, etc.,
>  > but point to separate documents for things like information about
>  > the agency where the birth certificate is stored and the contact
>  > information for the administrator of the cemetery where the person
>  > is buried.
>  >
>  > Comments?
>  >
>  > -- Ron
>  >
>  > -----------------------------------------------------------------
>  > The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
>  > initiative of OASIS <http://www.oasis-open.org>
>  >
>  > The list archives are at http://lists.xml.org/archives/xml-dev/
>  >
>  > To subscribe or unsubscribe from this list use the subscription
>  > manager: <http://lists.xml.org/ob/adm.pl>