Description:
Appendix B of IETF RFC 2396 provides this regular expression, which breaks down
a Uniform Resource Identifier (URI) into its component parts.
Usage:
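my $uri = "http://www.ics.uci.edu/pub/ietf/uri/#Related";
# Groups that do not take part in the match ($6 and $7 here) are undef,
# so print them as empty strings instead of interpolating them directly.
print join(", ", map { defined $_ ? $_ : "" } $1, $2, $3, $4, $5, $6, $7, $8, $9), "\n" if
    $uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};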
Discussion:
Every group in the pattern is optional, so the match succeeds on any input. A URI such as
http://www.ics.uci.edu/pub/ietf/uri/#Related
is broken down into the following capture group variables:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
In general, this regular expression breaks a URI down into the following parts,
as defined in the RFC:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
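The mapping above can be wrapped in a small helper so that callers work with names rather than group numbers. This is only a sketch: the sub name split_uri and the hash keys are illustrative choices, not part of the recipe or the RFC.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original recipe): split a URI
# reference into the five components listed above.
sub split_uri {
    my ($uri) = @_;
    # In list context the match returns the nine captures ($1 .. $9);
    # groups that did not take part in the match come back as undef.
    my @m = $uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};
    return (
        scheme    => $m[1],   # $2
        authority => $m[3],   # $4
        path      => $m[4],   # $5
        query     => $m[6],   # $7
        fragment  => $m[8],   # $9
    );
}

my %part = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related");
for my $name (qw(scheme authority path query fragment)) {
    my $value = defined $part{$name} ? $part{$name} : "<undefined>";
    print "$name = $value\n";
}
```

For the sample URI the query has no value, so it is reported as <undefined>, while the path is always a defined (possibly empty) string.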
Number of comments: 1
My mistake, John Liu, 2002/03/14
I mistook this for a regular expression that picks URLs out of arbitrary text, and for that purpose it does quite a poor job, because it matches everything that fits ^[^#?]
Upon re-reading the description (use this to break a URI down into its component parts), I realized my error. Use a different regular expression to extract URLs from text.
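The point made in this comment can be checked with a short sketch. The sample strings, including the example.com address, are made up for illustration.

```perl
use strict;
use warnings;

my $re = qr{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

# None of these inputs is a bare URL, yet every one of them matches:
# each group is optional and the path group accepts an empty string.
for my $text ("", "not a url at all", "see http://example.com/ for details") {
    my $verdict = $text =~ $re ? "matches" : "does not match";
    print "'$text' $verdict\n";
}
```

Because the pattern accepts any string, it is suited to splitting a string already known to be a URI, not to detecting or validating one.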