[ANNOUNCE] Sax Machines v0.3 and related modules
by Barrie Slaymaker
Jan 14 2002 11:24PM
At Matt's suggestion, there's a new machine that allows record-oriented
processing of XML, XML::SAX::ByRecord, along with a supporting SAX
filter, XML::Filter::DocSplitter. XML::SAX::ByRecord is documented below;
feedback, testing, and patches on any or all of this are quite welcome.
Here is the current set of files:
To use machines:
file: $CPAN/authors/id/R/RB/RBS/XML-SAX-Machines-0.3.tar.gz
To trace SAX events:
file: $CPAN/authors/id/R/RB/RBS/Devel-TraceCalls-0.02.tar.gz
file: $CPAN/authors/id/R/RB/RBS/Devel-TraceSAX-0.02.tar.gz
To graph machine topologies:
file: $CPAN/authors/id/R/RB/RBS/XML-Filter-Dispatcher-0.11.tar.gz
file: $CPAN/authors/id/R/RB/RBS/XML-Handler-Machine2GraphViz-0.2.tar.gz
Thanks,
Barrie
--------------------------------------------------------------------------
NAME
XML::SAX::ByRecord - Record oriented processing of (data) documents
SYNOPSIS
use XML::SAX::Machines qw( ByRecord ) ;
my $m = ByRecord(
"My::RecordFilter1",
"My::RecordFilter2",
...
{
Handler => $h, ## optional
}
);
$m->parse_uri( "foo.xml" );
DESCRIPTION
XML::SAX::ByRecord is a SAX machine that treats a document as a series
of records. Everything before and after the records is emitted as-is
while the records are excerpted into little mini-documents and run one
at a time through the filter pipeline contained in ByRecord.
The output is a document that has exactly the same things before, after,
and between the records that the input document did, but which has run
each record through a filter. So if a document has 10 records in it, the
per-record filter pipeline will see 10 sets of ( start_document, body of
record, end_document ) events. An example is below.
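To make the per-record event sets concrete, here is a minimal core-Perl sketch that does not use XML::SAX::Machines at all. It models SAX events as plain strings and assumes that a record is any direct child of the root element; that assumption, the function name split_records, and the event encoding are all illustrative and not part of the module's API.

```perl
use strict;
use warnings;

# Hypothetical helper: separate a flat list of event strings into
# events outside any record and one event set per record.
# Assumption: a record is any element that is a direct child of the root.
sub split_records {
    my @events = @_;
    my $depth  = 0;
    my ( @outside, @records );
    my $current;    # event set of the record being collected, if any

    for my $ev (@events) {
        if ( $ev =~ /^start_element/ ) {
            $depth++;
            # Entering a child of the root: open a new mini-document.
            $current = ['start_document'] if $depth == 2;
        }

        if   ($current) { push @$current, $ev }
        else            { push @outside,  $ev }

        if ( $ev =~ /^end_element/ ) {
            if ( $current && $depth == 2 ) {
                # Leaving the record: close its mini-document.
                push @$current, 'end_document';
                push @records, $current;
                undef $current;
            }
            $depth--;
        }
    }
    return ( \@outside, \@records );
}

my ( $outside, $records ) = split_records(
    'start_document',
    'start_element: root',
    'characters: a',
    'start_element: rec', 'characters: b', 'end_element: rec',
    'characters: c',
    'start_element: rec', 'characters: d', 'end_element: rec',
    'end_element: root',
    'end_document',
);

print scalar(@$records), " record(s)\n";
print join( ' | ', @$_ ), "\n" for @$records;
```

Each entry in $records starts with start_document and ends with end_document, which is the shape a per-record filter would see; everything in $outside corresponds to the events emitted as-is.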
This has several use cases:
* Big, record oriented documents
Big documents can be treated a record at a time with various DOM
oriented processors like XML::Filter::XSLT.
* Streaming XML
Small sections of an XML stream can be run through a document
processor without holding up the stream.
* Record oriented style sheets / processors
Sometimes it's just plain easier to write a style sheet or SAX
filter that applies to a single record at a time, rather than
having to run through a series of records.
Topology
Here's how the innards look:
  +-----------------------------------------------------------+
  |                   An XML::SAX::ByRecord                   |
  |    Intake                                                 |
  |   +----------+    +---------+         +--------+  Exhaust |
--+-->| Splitter |--->| Stage_1 |-->...-->| Merger |----------+----->
  |   +----------+    +---------+         +--------+          |
  |               \                            ^              |
  |                \                           |              |
  |                 +---------->---------------+              |
  |                   Events not in any records               |
  |                                                           |
  +-----------------------------------------------------------+
The "Splitter" is an XML::Filter::DocSplitter by default,
and the "Merger" is an XML::Filter::Merger by default. The
line that bypasses the "Stage_1 ..." filter pipeline is used for all
events that do not occur in a record. All events that occur in a record
pass through the filter pipeline.
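The routing described above can be mimicked in a few lines of core Perl. This is only a toy model under stated assumptions: events are [ type, data ] pairs, records are the direct children of the root element, and the merger is assumed to drop each mini-document's start_document/end_document before splicing the record back in. The name by_record and the stage callback are hypothetical and are not the module's API.

```perl
use strict;
use warnings;

# Toy model of the topology. Assumption: records are the direct
# children of the root element.
sub by_record {
    my ( $stage, @events ) = @_;
    my $depth = 0;
    my @merged;
    my $record;    # events of the record being collected, if any

    for my $ev (@events) {
        my $type = $ev->[0];
        $depth++ if $type eq 'start_element';

        # "Splitter": a child of the root starts a new mini-document.
        $record = [ ['start_document'] ]
            if $type eq 'start_element' && $depth == 2;

        if ($record) {
            push @$record, $ev;
        }
        else {
            # Bypass line: events outside records skip the stage.
            push @merged, $ev;
        }

        if ( $type eq 'end_element' ) {
            if ( $record && $depth == 2 ) {
                push @$record, ['end_document'];
                # Run the mini-document through the stage, then drop its
                # document wrapper before merging ("Merger").
                my @out = $stage->(@$record);
                push @merged, grep { $_->[0] !~ /_document\z/ } @out;
                undef $record;
            }
            $depth--;
        }
    }
    return @merged;
}

# Stand-in for Stage_1: uppercase character data, pass the rest through.
my $uc_stage = sub {
    return map {
        my ( $type, $data ) = @$_;
        $type eq 'characters' ? [ $type, uc $data ] : $_;
    } @_;
};

my @merged = by_record(
    $uc_stage,
    ['start_document'],
    [ start_element => 'root' ],
    [ characters    => ' a ' ],
    [ start_element => 'rec' ],
    [ characters    => ' b' ],
    [ end_element   => 'rec' ],
    [ characters    => ' c ' ],
    [ end_element   => 'root' ],
    ['end_document'],
);

print join( ': ', @$_ ), "\n" for @merged;
```

Only the text inside the record reaches the stage, so ' b' is uppercased while ' a ' and ' c ' travel along the bypass untouched.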
Example
Here's a quick little filter to uppercase text content:
package My::Filter::Uc;

use strict;
use vars qw( @ISA );
use XML::SAX::Base;
@ISA = qw( XML::SAX::Base );

## Uppercase the character data in place, then pass the event downstream.
sub characters {
    my $self = shift;
    my ( $data ) = @_;
    $data->{Data} = uc $data->{Data};
    $self->SUPER::characters( @_ );
}
And here's a little machine that uses it:
use XML::SAX::Machines qw( Pipeline ByRecord );

my $out;    ## receives the serialized output
my $m = Pipeline(
    ByRecord( "My::Filter::Uc" ),
    \$out,
);
When fed a document like:
<root> a
<rec> b</rec> c
<rec> d</rec> e
<rec> f</rec> g
</root>
the output looks like:
<root> a
<rec> B</rec> c
<rec> D</rec> e
<rec> F</rec> g
</root>
and My::Filter::Uc received three sets of events like:
start_document
start_element: <rec>
characters: 'b'
end_element: </rec>
end_document
start_document
start_element: <rec>
characters: 'd'
end_element: </rec>
end_document
start_document
start_element: <rec>
characters: 'f'
end_element: </rec>
end_document
METHODS
new
my $d = XML::SAX::ByRecord->new( @channels, \%options );
Longhand for calling the ByRecord function exported by
XML::SAX::Machines.
CREDIT
Proposed by Matt Sergeant, with advice from Kip Hampton and Robin Berjon.
Writing an aggregator.
To be written. Pretty much just that "start_manifold_processing" and
"end_manifold_processing" need to be provided. See
XML::Filter::Merger and its source code for a starter.
_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
http://listserv.ActiveState.com/mailman/listinfo/perl-xml