RE: [xml-dev] Normalizing XML [was: XML information modeling best practices]
by Michael Rys
Apr 30 2002 10:27PM
Ah, for once an interesting topic.
I don't have much time now, but I would like to point out the following:
1. The best definition of normal forms is based on the functional
dependencies and what constraints the normal forms impose on allowed
functional dependencies. The one exception is the 1st NF.
1. 1st NF: Nested relational models are also known as NF^2
(non-first-normal-form) models. Since XML is nested and allows
repetition, I would consider 1st NF not to apply to XML data either.
However, some of the guidelines given below are certainly good
guidelines.
2. 2nd NF etc.: I think they all certainly apply, since FD analysis
basically allows you to determine which property groups should be
considered "entities" (i.e., complex elements) and how they should
relate to each other (nesting vs. IDREF).
Note that this of course assumes that you still apply the notion of
functional dependencies and some form of ER modeling, which may or may
not be appropriate for your XML domain (i.e., markup may not care too
much).
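
To make the FD analysis concrete, here is a minimal sketch of checking
whether a functional dependency holds over a set of flattened records.
The helper name fd_holds, the column names, and order 124 are
assumptions made up for illustration; only order 123 and customer 456
come from the example quoted below.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if equal values for `lhs` always imply equal values for `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, val) != val:
            return False
    return True


# Flattened view of sales order data (order 124 is hypothetical).
rows = [
    {"order": "123", "customer": "456", "customer_name": "Customers, Inc.",
     "item": "1", "part": "1b-10"},
    {"order": "123", "customer": "456", "customer_name": "Customers, Inc.",
     "item": "2", "part": "zyx23"},
    {"order": "124", "customer": "456", "customer_name": "Customers, Inc.",
     "item": "1", "part": "zyx23"},
]

# customer -> customer_name holds, which suggests Customer is its own entity.
print(fd_holds(rows, ["customer"], ["customer_name"]))
# order -> part does not hold; (order, item) -> part does, so Item nests under Order.
print(fd_holds(rows, ["order"], ["part"]))
print(fd_holds(rows, ["order", "item"], ["part"]))
```

A dependency that holds on sample data is only a hint, not proof that it
holds in the domain; the sketch is meant to show the kind of question FD
analysis asks, not to replace the modeling step.
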
Best regards
Michael
PS: I think the above basically agrees with Michael Kay's reply...
> -----Original Message-----
> From: Ronald Bourret [mailto:rpbourret@[...].com]
> Sent: Tuesday, April 30, 2002 2:19 AM
> To: xml-dev@[...].org
> Subject: [xml-dev] Normalizing XML [was: XML information modeling best
> practices]
>
> Manos Batsis wrote:
> > > XML is pretty good for tables, but not so good
> > > for enforcing relational normalization rules.
> >
> > Of course. Enforcing relational normalization rules shuts down most
> > good reasons to use XML in the first place, with the possible
> > exception of exchanging DB data between servers via http.
>
> I don't think this is quite true. One of the questions I wrestled with
> in trying to understand native XML databases was what "normalization"
> meant. While there's undoubtedly a lot of room for thought here, I did
> take a look at the first, second, and third (relational) normal forms
> and tried to see what they meant in XML terms.
>
> The following example supposes you are viewing a sales order
> hierarchy:
>
> <Order>
>   <Number>123</Number>
>   <Date>1/2/03</Date>
>   <Customer>
>     <Number>456</Number>
>     <Name>Customers, Inc.</Name>
>   </Customer>
>   <Item>
>     <Number>1</Number>
>     <Part>1b-10</Part>
>     <Quantity>10</Quantity>
>   </Item>
>   <Item>
>     <Number>2</Number>
>     <Part>zyx23</Part>
>     <Quantity>5</Quantity>
>   </Item>
> </Order>
>
> First normal form:
> ------------------
> Data is in first normal form if it (a) has a primary key and (b) has
> no repeating fields.
>
> A primary key basically means that there is a set of fields that
> uniquely identify the other fields in the row. In XML terms, this
> implies that you only store one "thing" per document -- such as one
> sales order or one chapter -- unless the collection of "things" itself
> has identity. (It's not clear that all "things" stored in XML
> documents can, in fact, be identified by a proper subset of their
> data.)
>
> This is not to say that it's not useful to place multiple "things" in
> a single XML document -- for example, it is quite useful to batch a
> bunch of sales orders together to ship them over the wire -- just that
> such documents are not normal.
>
> No repeating fields just means that you don't see field names like
> Author1 and Author2. In XML terms, this means you use repeating
> children (*, +) rather than enumerated children.
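
As a rough illustration of the difference, the sketch below reads authors
from both shapes using Python's standard xml.etree.ElementTree. The Book
and Title elements, the author names, and the two helper functions are
assumptions made up for the example; only the Author1/Author2 naming
comes from the text above.

```python
import xml.etree.ElementTree as ET

# Enumerated children: the element name encodes the position.
enumerated = ET.fromstring(
    "<Book><Title>T</Title>"
    "<Author1>A. One</Author1><Author2>B. Two</Author2></Book>"
)

# Repeating children: one element name, declared with * or + in a DTD.
repeating = ET.fromstring(
    "<Book><Title>T</Title>"
    "<Author>A. One</Author><Author>B. Two</Author></Book>"
)


def authors_enumerated(book, max_authors=2):
    """Probe Author1..AuthorN; the reader must know the upper bound in advance."""
    names = []
    for i in range(1, max_authors + 1):
        el = book.find(f"Author{i}")
        if el is not None:
            names.append(el.text)
    return names


def authors_repeating(book):
    """Collect every Author child; works for any number of authors."""
    return [el.text for el in book.findall("Author")]


print(authors_enumerated(enumerated))
print(authors_repeating(repeating))
```

Both calls return the same two names here, but only the repeating form
keeps working unchanged when a third author is added.
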
>
> Second normal form:
> -------------------
> Data is in second normal form if the entire primary key is needed to
> predict each field value. The effect is to split the one and many
> parts of a one-to-many relationship into separate tables. For example,
> store sales order header information and line item information in
> separate tables.
>
> This form exists in the relational model to avoid duplicate data: if
> you store sales order header and line item data in the same table, the
> header information gets repeated on each line item row. XML doesn't
> have this problem -- it stores hierarchies quite nicely without
> duplicate data -- so I don't think the second normal form really
> applies.
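
The duplication described above can be seen by flattening the sales
order into single-table rows. This is only a sketch: the flatten helper
and the tuple layout are assumptions for illustration, and the Customer
element is left out of the sample to keep it short.

```python
import xml.etree.ElementTree as ET

ORDER_XML = """
<Order>
  <Number>123</Number>
  <Date>1/2/03</Date>
  <Item><Number>1</Number><Part>1b-10</Part><Quantity>10</Quantity></Item>
  <Item><Number>2</Number><Part>zyx23</Part><Quantity>5</Quantity></Item>
</Order>
""".strip()


def flatten(order):
    """Produce one row per line item, repeating the order header on each row."""
    header = (order.findtext("Number"), order.findtext("Date"))
    return [
        header + (
            item.findtext("Number"),
            item.findtext("Part"),
            item.findtext("Quantity"),
        )
        for item in order.findall("Item")
    ]


rows = flatten(ET.fromstring(ORDER_XML))
for row in rows:
    print(row)
```

In the XML the order number and date appear once; in the flattened rows
they appear once per line item, which is the redundancy the second
normal form removes in a relational schema.
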
>
> Third normal form:
> ------------------
> Data is in third normal form if you can't predict one non-key field
> from another non-key field. The effect of this is to split the many
> and one parts of a many-to-one relationship into separate tables. For
> example, store customer data in a separate table from sales order data.
>
> This poses a real problem in the XML world, since many real-world
> documents contain duplicate data. For example, many sales orders
> contain customer information -- name, address, phone number, etc.
>
> I think that this does apply to XML, but that you need to decide when
> it is useful to apply this form. That is, if you want truly normal XML
> data, you should store this sort of data in a separate document and
> link to it from your main document. For example, store the data for
> each customer in a separate document and link to it from your sales
> order documents.
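
One way this linking might look is sketched below. The CustomerRef
element, its href attribute, the document name customer-456.xml, and the
in-memory dictionary standing in for a document store are all
assumptions for illustration; the thread does not prescribe a linking
mechanism.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store: one document per customer, keyed by name.
CUSTOMER_DOCS = {
    "customer-456.xml":
        "<Customer><Number>456</Number><Name>Customers, Inc.</Name></Customer>",
}

# The sales order links to the customer document instead of embedding a copy.
ORDER_XML = (
    "<Order><Number>123</Number>"
    '<CustomerRef href="customer-456.xml"/></Order>'
)


def customer_name(order, load):
    """Follow the order's customer link and return the customer's name."""
    ref = order.find("CustomerRef")
    customer = ET.fromstring(load(ref.get("href")))
    return customer.findtext("Name")


order = ET.fromstring(ORDER_XML)
print(customer_name(order, CUSTOMER_DOCS.__getitem__))
```

With this layout a change to the customer's name is made in one document
and every order that links to it sees the new value, at the cost of an
extra lookup when reading an order.
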
>
> However, I also think that this only makes sense if XML is the primary
> storage format for your data, since it allows you to avoid update
> anomalies (as Jonathan Robie pointed out in another email). If XML is
> a secondary storage format, then you probably don't need to worry
> about the duplicate data, since it is really a historical record, not
> a set of live data.
>
> To explain: Consider our sales order documents. It is unlikely that
> the data for these documents lives in XML. More likely, the data lives
> in a relational database. In this case, the sales order document is a
> historical record of a given transaction, so the fact that the same
> customer data is used in multiple sales order documents doesn't matter
> -- nobody is going to try to update it and there is no/low risk of
> update anomalies.
>
> Now consider genealogical data that I am storing in a native XML
> database because it is too irregular to fit into a relational
> database. In this case, I probably do want to store shared data in
> separate documents so it lives in only one place in the database.
>
> For example, my documents for each person contain data such as
> birthplace, birthdate, parents, siblings, career information, etc.,
> but point to separate documents for things like information about the
> agency where the birth certificate is stored and the contact
> information for the administrator of the cemetery where the person is
> buried.
>
> Comments?
>
> -- Ron
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>