perl-xml
[ANNOUNCE] Sax Machines v0.3 and related modules
by Barrie Slaymaker
Jan 14 2002 11:24PM
At Matt's suggestion, there's a new machine that allows record-oriented
processing of XML, XML::SAX::ByRecord, along with a supporting SAX
filter, XML::Filter::DocSplitter.  XML::SAX::ByRecord is documented below;
feedback, testing, and patches on any or all of this are quite welcome.

Here is the current set of files:

To use machines:

  file: $CPAN/authors/id/R/RB/RBS/XML-SAX-Machines-0.3.tar.gz

To trace SAX events:

  file: $CPAN/authors/id/R/RB/RBS/Devel-TraceCalls-0.02.tar.gz
  file: $CPAN/authors/id/R/RB/RBS/Devel-TraceSAX-0.02.tar.gz

To graph machine topologies:

  file: $CPAN/authors/id/R/RB/RBS/XML-Filter-Dispatcher-0.11.tar.gz
  file: $CPAN/authors/id/R/RB/RBS/XML-Handler-Machine2GraphViz-0.2.tar.gz

Thanks,

Barrie

--------------------------------------------------------------------------

NAME
    XML::SAX::ByRecord - Record oriented processing of (data) documents

SYNOPSIS
        use XML::SAX::Machines qw( ByRecord ) ;

        my $m = ByRecord(
            "My::RecordFilter1",
            "My::RecordFilter2",
            ...
            {
                Handler => $h, ## optional
            }
        );

        $m->parse_uri( "foo.xml" );

DESCRIPTION
    XML::SAX::ByRecord is a SAX machine that treats a document as a series
    of records. Everything before and after the records is emitted as-is,
    while the records are excerpted into mini-documents and run one at a
    time through the filter pipeline contained in ByRecord.

    The output is a document that has exactly the same content before,
    after, and between the records as the input document, but with each
    record run through the filter pipeline. So if a document has 10 records
    in it, the per-record filter pipeline will see 10 sets of
    ( start_document, body of record, end_document ) events. An example is
    below.

    This has several use cases:

    *   Big, record oriented documents

        Big documents can be processed one record at a time with
        DOM-oriented processors like XML::Filter::XSLT.

    *   Streaming XML

        Small sections of an XML stream can be run through a document
        processor without holding up the stream.

    *   Record oriented style sheets / processors

        Sometimes it's just plain easier to write a style sheet or SAX
        filter that applies to a single record at a time, rather than
        having to run through a series of records.

  Topology

    Here's how the innards look:

       +-----------------------------------------------------------+
       |                  An XML::SAX::ByRecord                    |
       |    Intake                                                 |
       |   +----------+    +---------+         +--------+  Exhaust |
     --+-->| Splitter |--->| Stage_1 |-->...-->| Merger |----------+----->
       |   +----------+    +---------+         +--------+          |
       |               \                            ^              |
       |                \                           |              |
       |                 +---------->---------------+              |
       |                   Events not in any records               |
       |                                                           |
       +-----------------------------------------------------------+

    The "Splitter" is an XML::Filter::DocSplitter by default,
    and the "Merger" is an XML::Filter::Merger by default. The
    line that bypasses the "Stage_1 ..." filter pipeline is used for all
    events that do not occur in a record. All events that occur in a record
    pass through the filter pipeline.
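
    The routing in the diagram can be sketched in plain Perl without any
    SAX modules. This is only an illustration of the idea, not the code in
    XML::Filter::DocSplitter or XML::Filter::Merger: the event list, the
    helper names (run_record_pipeline, by_record, to_xml), and the rule
    that a record is any direct child of the root element are all
    assumptions made for the example.

```perl
use strict;
use warnings;

## Hypothetical stand-in for a SAX event stream: [ type, payload ].
my @events = (
    [ start => 'root' ], [ text => ' a ' ],
    [ start => 'rec'  ], [ text => 'b'   ], [ end => 'rec' ], [ text => ' c ' ],
    [ start => 'rec'  ], [ text => 'd'   ], [ end => 'rec' ], [ text => ' e ' ],
    [ end   => 'root' ],
);

## Stand-in for the per-record pipeline ("Stage_1 ..."): uppercases text.
## It is called once per record and only ever sees that record's events.
sub run_record_pipeline {
    my @record = @_;
    return map { $_->[0] eq 'text' ? [ text => uc $_->[1] ] : $_ } @record;
}

## Splitter and merger rolled into one loop.
## Assumption: a record is any element directly below the root (depth 2).
sub by_record {
    my @in    = @_;
    my $depth = 0;    ## current element nesting depth
    my $count = 0;    ## records sent through the pipeline
    my ( @record, @out );

    for my $ev (@in) {
        my ($type) = @$ev;
        $depth++ if $type eq 'start';

        if ( $depth >= 2 ) {
            push @record, $ev;    ## inside a record: collect it
            if ( $type eq 'end' && $depth == 2 ) {
                ## Record complete: filter it and merge the result back.
                push @out, run_record_pipeline(@record);
                @record = ();
                $count++;
            }
        }
        else {
            push @out, $ev;       ## not in any record: bypass the pipeline
        }

        $depth-- if $type eq 'end';
    }
    return ( $count, @out );
}

## Serialize the toy events back to text for inspection.
sub to_xml {
    return join '', map {
        my ( $type, $val ) = @$_;
          $type eq 'start' ? "<$val>"
        : $type eq 'end'   ? "</$val>"
        :                    $val;
    } @_;
}

my ( $count, @out ) = by_record(@events);
my $xml = to_xml(@out);
print "$count records\n$xml\n";
```

    Only the text inside the two records is uppercased; the text around
    and between them takes the bypass and comes out unchanged.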

  Example

    Here's a quick little filter to uppercase text content:

        package My::Filter::Uc;

        use strict;
        use vars qw( @ISA );

        use XML::SAX::Base;
        @ISA = qw( XML::SAX::Base );

        ## Uppercase the text, then pass the event downstream.
        sub characters {
            my $self = shift;
            my ( $data ) = @_;
            $data->{Data} = uc $data->{Data};
            $self->SUPER::characters( @_ );
        }

    And here's a little machine that uses it:

        use XML::SAX::Machines qw( Pipeline ByRecord );

        my $out;  ## receives the serialized output
        my $m = Pipeline(
            ByRecord( "My::Filter::Uc" ),
            \$out,
        );

    When fed a document like:

        <root> a
            <rec>b</rec>c
            <rec>d</rec>e
            <rec>f</rec>g
        </root>

    the output looks like:

        <root> a
            <rec>B</rec>c
            <rec>D</rec>e
            <rec>F</rec>g
        </root>

    and My::Filter::Uc received three sets of events:

        start_document
        start_element: <rec> 
        characters:    'b'
        end_element:   </rec> 
        end_document

        start_document
        start_element: <rec> 
        characters:    'd'
        end_element:   </rec> 
        end_document

        start_document
        start_element: <rec> 
        characters:    'f'
        end_element:   </rec> 
        end_document

METHODS
    new
            my $d = XML::SAX::ByRecord->new( @channels, \%options );

        Longhand for calling the ByRecord function exported by
        XML::SAX::Machines.

CREDIT
    Proposed by Matt Sergeant, with advice by Kip Hampton and Robin Berjon.

Writing an aggregator.
    To be written. In short, an aggregator needs to provide
    "start_manifold_processing" and "end_manifold_processing". See
    XML::Filter::Merger and its source code for a starting point.

_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
http://listserv.ActiveState.com/mailman/listinfo/perl-xml