Lingua::EN::AddressParse 1.05 - Perl module, part of the CPAN distribution Lingua-EN-AddressParse 1.05.
Lingua::EN::AddressParse - manipulate geographical addresses
use Lingua::EN::AddressParse;

my %args =
(
    country               => 'Australia',
    auto_clean            => 1,
    force_case            => 1,
    abbreviate_subcountry => 1
);

my $address = Lingua::EN::AddressParse->new(%args);
my $error = $address->parse("14A MAIN RD. ST JOHNS WOOD NEW SOUTH WALES 2000");

my %my_address = $address->components;
my $suburb = $my_address{suburb};

my $correct_casing = $address->case_all;
Requires Perl, version 5.004 or higher, and the modules Lingua::EN::NameParse, Locale::SubCountry and Parse::RecDescent.
This module takes as input an address or post box in free format text such as,
12/3-5 AUBREY ST VERMONT VIC 3133
"OLD REGRET" WENTWORTH FALLS NSW 2782 AUSTRALIA
2A OLD SOUTH LOW ST. KEW NEW SOUTH WALES 2123
GPO Box K318, HAYMARKET, NSW 2000
and attempts to parse it. If successful, the address is broken
down into components, and useful functions can be performed, such as:
converting upper or lower case values to name case (2 Low St. Kew NSW 2123)
extracting the address's individual components (2, Low St., KEW, NSW, 2123)
determining the type of format the address is in ('suburban')
If the address cannot be parsed you have the option of cleaning the address
of bad characters, or extracting any portion that was parsed and the portion
that failed.
This module can be used for analysing and improving the quality of
lists of addresses.
The following terms are used by AddressParse to define
the components that can make up an address or post box.
Post Box - GPO Box K123, LPO 2345, RMS 23 ...
Property Identifier
  Sub property description - Level, Unit, Apartment, Lot ...
  Property number - 12/66A, 24-34, 2A, 23B/12C, 12/42-44
  Property name - "Old Regret"
Street
  Street name - O'Hare, New South Head, The Causeway
  Street type - Road, Rd., St, Lane, Highway, Crescent, Circuit ...
Suburb - Dee Why, St. John's Wood ...
Sub country - NSW, New South Wales, ACT, NY, AZ ...
Post code - 2062, 34532, SG12A 9ET
Country - Australia, UK, US or Canada
Refer to the component grammar defined in the AddressGrammar module for a
list of combinations.
The following address formats are currently supported :
'suburban' - property_identifier(?) street street_type suburb subcountry post_code country(?)
'post_box' - post_box suburb subcountry post_code country(?)
'rural' - property_name suburb subcountry post_code country(?)
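The three formats differ mainly in which components are present, so the format of a set of components can be guessed from the fields that hold data. The following is only a sketch of that idea in plain Perl, not the module's own logic. guess_format is a hypothetical helper, and it assumes the hash keys use the component names shown in the format list.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the address format from a hash of components.
# Keys are assumed to match the component names in the format list above.
sub guess_format {
    my (%c) = @_;
    my $has = sub { defined $c{ $_[0] } && length $c{ $_[0] } };

    # Every supported format needs suburb, subcountry and post code.
    return 'unknown'
        unless $has->('suburb') && $has->('subcountry') && $has->('post_code');

    return 'post_box' if $has->('post_box');
    return 'suburban' if $has->('street') && $has->('street_type');
    return 'rural'    if $has->('property_name');
    return 'unknown';
}

print guess_format(
    street => 'AUBREY',  street_type => 'ST',
    suburb => 'VERMONT', subcountry  => 'VIC', post_code => '3133',
), "\n";    # suburban
```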
The new method creates an instance of an address object and sets up
the grammar used to parse addresses. This must be called before any of the
following methods are invoked. Note that the object only needs to be
created once, and can be reused with new input data.
Various setup options may be defined in a hash that is passed as an
optional argument to the new method.
my %args =
(
    country               => 'Australia',
    auto_clean            => 1,
    force_case            => 1,
    abbreviate_subcountry => 1
);
my $address = Lingua::EN::AddressParse->new(%args);
The country argument must be specified. It determines the possible list of
valid sub countries (states, counties etc, defined in the Locale::SubCountry
module) and post code formats. Formats are currently supported for:
Australia
Canada
UK
US
All forms of upper/lower case are acceptable in the country's spelling. If a
country name is supplied that the module doesn't recognise, it will die.
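Because an unrecognised country makes the constructor die, code that takes the country from user input or a configuration file may want to trap that error. The following is a minimal sketch that relies only on the behaviour described above. try_new is a hypothetical wrapper, and the class is passed in by name, so the caller is assumed to have loaded it already.

```perl
use strict;
use warnings;

# Hypothetical wrapper: call $class->new(%args) and turn a die into undef.
# For this module, $class would be 'Lingua::EN::AddressParse', loaded by
# the caller with "use Lingua::EN::AddressParse;".
sub try_new {
    my ($class, %args) = @_;
    my $object = eval { $class->new(%args) };
    if (!$object) {
        my $reason = $@ || 'unknown error';
        warn "Could not create $class object: $reason";
        return;
    }
    return $object;
}
```

A caller might then write something like my $address = try_new('Lingua::EN::AddressParse', country => $country) and skip the record when the result is undefined.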
The force_case option forces the case_all method to address case the entire
input string, including any unmatched sections that failed parsing. It is
useful when you know your data contains invalid addresses, but you cannot
filter them out or reject them.
When the auto_clean option is set to a positive value, any call to the parse
method that fails will attempt to 'clean' the address and then reparse it. See
the clean method for details. This is useful for dirty data with embedded
unprintable or non-alphabetic characters.
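To give a feel for what a cleaning step can involve, the following sketch removes unprintable characters and tidies whitespace. rough_clean is a hypothetical stand-in written for illustration; the module's clean method may apply different rules.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a cleaning step; not the module's clean method.
# Note that it also drops non-ASCII letters, which may not suit all data.
sub rough_clean {
    my ($text) = @_;
    $text =~ s/[^\x20-\x7E\s]//g;    # drop anything that is not printable ASCII or whitespace
    $text =~ s/\s+/ /g;              # collapse runs of whitespace to a single space
    $text =~ s/^ | $//g;             # trim a leading or trailing space
    return $text;
}

print rough_clean("12  MAIN\x00 RD\tKEW  NSW 2000 "), "\n";    # 12 MAIN RD KEW NSW 2000
```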
When the abbreviate_subcountry option is set to a positive value, the sub
country is forced to its abbreviated form, so "New South Wales" becomes "NSW".
If the sub country is already abbreviated then its value is not altered.
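The effect of abbreviating the sub country can be pictured as a lookup from full name to code that leaves unknown or already abbreviated values alone. The following sketch uses a small hand-written table for illustration, whereas the module takes its list from Locale::SubCountry. to_abbreviation and %ABBREVIATION are hypothetical names.

```perl
use strict;
use warnings;

# Hand-written subset of Australian states, for illustration only.
my %ABBREVIATION = (
    'new south wales' => 'NSW',
    'victoria'        => 'VIC',
    'queensland'      => 'QLD',
);

# Hypothetical helper: return the abbreviated form of a sub country.
sub to_abbreviation {
    my ($name) = @_;
    my $known = $ABBREVIATION{ lc $name };
    # Values that are already abbreviated, or not in the table, pass through.
    return defined $known ? $known : $name;
}

print to_abbreviation('New South Wales'), "\n";    # NSW
print to_abbreviation('NSW'), "\n";                # NSW
```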
$error = $address->parse("12/3-5 AUBREY ST VERMONT VIC 3133");
The parse method takes a single parameter of a text string containing an
address. It attempts to parse the address and break it down into the components
described above. If the address was parsed successfully, a 0 is returned,
otherwise a 1. This step is a prerequisite for the following functions.
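Since parse returns 0 on success and 1 on failure, a list of addresses can be split into those that parsed and those that need review. The following is a sketch under that assumption only. audit_addresses is a hypothetical helper that accepts any object offering the parse and case_all methods, such as one created with the new method.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of addresses into cased results and rejects.
# $parser is any object with parse and case_all methods as documented here.
sub audit_addresses {
    my ($parser, @addresses) = @_;
    my (@cased, @rejected);
    for my $raw (@addresses) {
        if ($parser->parse($raw) == 0) {    # 0 signals a successful parse
            push @cased, scalar $parser->case_all;
        }
        else {
            push @rejected, $raw;           # keep failures for manual review
        }
    }
    return (\@cased, \@rejected);
}
```

With the module loaded and $address created as in the examples above, a call might look like my ($cased, $rejected) = audit_addresses($address, @list).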
$correct_casing = $address->case_all;
The case_all method converts the first letter of each component to
capitals and the remainder to lower case, with the following exception:
Proper name capitalisation, such as MacNay and O'Brien, is observed.
The method returns the entire cased address as text.
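The casing rule can be illustrated in a few lines of plain Perl. The following sketch is not the module's implementation. name_case_word and name_case are hypothetical helpers, and the simple Mc/Mac rule would wrongly change ordinary words such as "Machine", so a real implementation needs an exception list.

```perl
use strict;
use warnings;

# Hypothetical helper: name case a single word.
sub name_case_word {
    my ($word) = @_;
    my $cased = ucfirst lc $word;
    $cased =~ s/^([A-Z]')([a-z])/$1 . uc $2/e;    # O'brien -> O'Brien
    $cased =~ s/^(Ma?c)([a-z])/$1 . uc $2/e;      # Macnay  -> MacNay (naive rule)
    return $cased;
}

# Hypothetical helper: name case every whitespace separated word.
sub name_case {
    my ($text) = @_;
    return join ' ', map { name_case_word($_) } split ' ', $text;
}

print name_case("O'BRIEN RD"), "\n";        # O'Brien Rd
print name_case('NEW SOUTH HEAD'), "\n";    # New South Head
```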
%my_address = $address->case_components;
$cased_suburb = $my_address{suburb};
The case_components method does the same thing as the case_all method,
but returns the address's cased components in a hash. The following keys are
used for each component:
post_box
property_identifier
property_name
street
street_type
suburb
subcountry
post_code
country
If a key has no matching data for a given address, its value will be
set to the empty string.
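Because unused keys hold the empty string, code that reports on the components usually wants to skip them. The following sketch walks the keys in the order listed above and keeps only those with data. describe_components is a hypothetical helper, and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Component keys in the order they are listed above.
my @COMPONENT_KEYS = qw(
    post_box property_identifier property_name street street_type
    suburb subcountry post_code country
);

# Hypothetical helper: describe the components that actually hold data.
sub describe_components {
    my (%component) = @_;
    return join ', ',
        map  { "$_=$component{$_}" }
        grep { defined $component{$_} && $component{$_} ne '' }
        @COMPONENT_KEYS;
}

print describe_components(
    post_box   => '',    street    => 'Low',  street_type => 'St.',
    suburb     => 'Kew', post_code => '2123', country     => '',
    subcountry => 'NSW',
), "\n";    # street=Low, street_type=St., suburb=Kew, subcountry=NSW, post_code=2123
```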
%address = $address->components;
$suburb = $address{suburb};
The components method does the same thing as the case_components method,
but each component is returned as it appears in the input string, with no case
conversion.
The properties method returns several properties of the address as a hash.
The type of format the address is in, as one of the following strings:
suburban
rural
post_box
unknown
Returns any unmatched section that was found.
The huge number of character combinations that can form a valid address makes
it impossible to correctly identify them all.
Valid addresses must contain a suburb, subcountry (state) and post code,
in that order. This format is widely accepted in Australia and the US. UK
addresses will often include suburb, town, city and county, formats that
are very difficult to parse.
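Given the required order of suburb, subcountry and post code, a cheap check on the end of the text can flag records that are unlikely to parse before they reach the parser. The following sketch covers Australian addresses with an abbreviated state only. has_au_tail is a hypothetical helper and deliberately ignores full state names such as "NEW SOUTH WALES".

```perl
use strict;
use warnings;

# Hypothetical pre-check: does the text end with an abbreviated Australian
# state and a four digit post code, optionally followed by AUSTRALIA?
# Full state names are not handled, so a 0 here does not prove the address
# is unparseable.
sub has_au_tail {
    my ($text) = @_;
    return $text =~ /\b(?:NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+\d{4}(?:\s+AUSTRALIA)?\s*$/i
        ? 1
        : 0;
}

print has_au_tail('12/3-5 AUBREY ST VERMONT VIC 3133'), "\n";    # 1
print has_au_tail('VERMONT 3133 VIC'), "\n";                     # 0
```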
Property names must be enclosed in quotes, like "Old Regret".
Because of the large combination of possible addresses defined in the grammar,
the program is not very fast.
"The Wordsworth Dictionary of Abbreviations & Acronyms" (1997)
Australian Standard AS4212-1994 "Geographic Information Systems -
Data Dictionary for transfer of street addressing information"
ISO 3166-2:1998, Codes for the representation of names of countries and their subdivisions
Also released as AS/NZS 2632.2:1999
Define grammars for other languages. Hopefully, all that would be needed is
to specify a new module with its own grammar, and inherit all the existing
methods. I don't have the knowledge of the naming conventions for non-English
languages.
Lingua::EN::NameParse, Parse::RecDescent, Locale::SubCountry
Streets such as "The Esplanade" will return a street of "The" and a street type
of "Esplanade".
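Until this is fixed, callers can patch the returned components themselves. The following sketch folds the street type back into the street name when the street is just "The". merge_the_street is a hypothetical helper that works on a plain hash using the street and street_type keys.

```perl
use strict;
use warnings;

# Hypothetical post-processing step for the "The Esplanade" case: when the
# street is just "The", fold the street type back into the street name.
sub merge_the_street {
    my (%component) = @_;
    if (   defined $component{street}
        && lc $component{street} eq 'the'
        && defined $component{street_type}
        && $component{street_type} ne '')
    {
        $component{street}      = "$component{street} $component{street_type}";
        $component{street_type} = '';
    }
    return %component;
}

my %fixed = merge_the_street(street => 'The', street_type => 'Esplanade');
print "$fixed{street}\n";    # The Esplanade
```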
Copyright (c) 2000-1 Kim Ryan. All rights reserved.
This program is free software; you can redistribute it
and/or modify it under the terms of the Perl Artistic License
(see http://www.perl.com/perl/misc/Artistic.html).
AddressParse was written by Kim Ryan <kimaryan@ozemail.com.au>.
<http://members.ozemail.com.au/~kimaryan/data_distillers/>