php-dev
#41896 [Opn->Bgs]: preg_replace crashes with large input
by tony2001
Jul 4 2007 12:14PM
ID:               41896
 Updated by:       tony2001@[...].net
 Reported By:      giacomoread at hotmail dot com
-Status:           Open
+Status:           Bogus
-Bug Type:         Scripting Engine problem
+Bug Type:         PCRE related
 Operating System: All
 PHP Version:      5.2.3
 New Comment:

> I found a similar bug which was closed with status bogus. 
Surely it's bogus, since it's not a PHP issue.

> There is nothing in the documentation which states limits to the
> input of preg_replace or any portable workarounds documented.
Right, we can't and we won't document any bugs in third-party libs.

> Stating that 'it is just a stack overflow' just to keep the bug
> count down is more than a little unprofessional. 
"It's just a stack overflow" that happens outside of PHP and we cannot
control it. I guess you failed to read the second part of the sentence.

> A scripting language should either make the workaround internal 
> or document input limits NOT cause seg faults.

We do accept patches both to the source code and to the documentation.

> This is a bug whether the php community is willing to accept it or
> not.
Yes, it's a known bug in PCRE.
Please report it to PCRE developers.
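
For readers hitting the same crash, here is a minimal sketch of one possible mitigation. It assumes PHP 5.2.0 or later, where the pcre.recursion_limit ini setting and preg_last_error() are available. The wrapper name checked_preg_replace and the limit value 10000 are illustrative assumptions, not part of PHP or of this report; a safe limit depends on the stack size available to the PHP process.

```php
<?php
// Hypothetical wrapper (not part of PHP): run preg_replace under a lowered
// recursion limit and report PCRE failures explicitly.
function checked_preg_replace($pattern, $replacement, $subject, $limit = 10000)
{
  // Assumption: 10000 is only an example value, not a recommended setting.
  $previous = ini_set('pcre.recursion_limit', (string) $limit);

  $result = preg_replace($pattern, $replacement, $subject);
  $error  = preg_last_error();

  // Restore the earlier setting if ini_set() succeeded.
  if ($previous !== false)
    ini_set('pcre.recursion_limit', $previous);

  // preg_replace returns NULL on failure; $error then holds a code such as
  // PREG_RECURSION_LIMIT_ERROR or PREG_BACKTRACK_LIMIT_ERROR.
  if ($result === null)
    return array(false, $error);

  return array(true, $result);
}

list($ok, $value) = checked_preg_replace('/\\s+/', ' ', "a  \n b");
echo $ok ? $value : "PCRE error code $value", "\n";
```

With a low enough limit, PCRE should give up with an error code before it exhausts the stack, so the script can fall back to another strategy (for example, processing the page in smaller pieces) rather than crash.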


Previous Comments:
------------------------------------------------------------------------

[2007-07-04 19:03:15] giacomoread at hotmail dot com

Description:
------------
I found a similar bug which was closed with status bogus. Unacceptable!
There is nothing in the documentation which states limits to the input
of preg_replace or any portable workarounds documented. Stating that
'it is just a stack overflow' just to keep the bug count down is more
than a little unprofessional. A scripting language should either make
the workaround internal or document input limits NOT cause seg faults.
This is a bug whether the php community is willing to accept it or not.

Reproduce code:
---------------
function parse($html, &$title, &$text, &$anchors)
{
  $pstring1 = "'[^']*'";
  $pstring2 = '"[^"]*"';
  $pnstring = "[^'\">]";
  $pintag   = "(?:$pstring1|$pstring2|$pnstring)*";
  $pattrs   = "(?:\\s$pintag){0,1}";

  $pcomment = enclose("<!--", "-", "->");
  $pscript  = enclose("<script$pattrs>", "<", "\\/script>");
  $pstyle   = enclose("<style$pattrs>", "<", "\\/style>");
  $pexclude = "(?:$pcomment|$pscript|$pstyle)";

  $ptitle   = enclose("<title$pattrs>", "<", "\\/title>");
  $panchor  = "<a(?:\\s$pintag){0,1}>";
  $phref    = "href\\s*=[\\s'\"]*([^\\s'\">]*)";

  $html = preg_replace("/$pexclude/iX", " ", $html);

  if ($title !== false)
    $title = preg_match("/$ptitle/iX", $html, $title)
             ? $title[1] : '';

  if ($text !== false)
  {
    $text = preg_replace("/<$pintag>/iX",   " ", $html);
    $text = preg_replace("/\\s+|&nbsp;/iX", " ", $text);
  }

  if ($anchors !== false)
  {
    preg_match_all("/$panchor/iX", $html, $anchors);
    $anchors = $anchors[0];

    reset($anchors);
    while (list($i, $x) = each($anchors))
      $anchors[$i] =
        preg_match("/$phref/iX", $x, $x) ? $x[1] : '';

    $anchors = array_unique($anchors);
  }
}

function enclose($start, $end1, $end2)
{
  return "$start((?:[^$end1]|$end1(?!$end2))*)$end1$end2";
}

Expected result:
----------------
The code should split the HTML pages into title, text and links. It
works fine until large pages are downloaded; then it segfaults, with gdb
pointing at preg_replace.
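
A possible restructuring of the enclose() helper from the reproduce code, offered as a sketch and not as a confirmed fix. The original pattern repeats a group once per character of the enclosed text, which is the kind of construct that tends to drive deep recursion inside PCRE on large inputs. The variant below (enclose_unrolled is a hypothetical name, not from the report) is intended to accept the same strings while repeating the group only once per occurrence of $end1. Whether this avoids the crash on a given PCRE build is an assumption that would need testing.

```php
<?php
// Original helper from the report, reproduced for comparison.
function enclose($start, $end1, $end2)
{
  return "$start((?:[^$end1]|$end1(?!$end2))*)$end1$end2";
}

// Hypothetical variant: consume runs of non-$end1 characters with a plain
// character-class repeat, and enter the group only at each $end1 that does
// not begin the closing delimiter.
function enclose_unrolled($start, $end1, $end2)
{
  return "$start([^$end1]*(?:$end1(?!$end2)[^$end1]*)*)$end1$end2";
}

$html = "a<!-- x - y -- z -->b<!---->c";
$old  = preg_replace("/" . enclose("<!--", "-", "->") . "/i", " ", $html);
$new  = preg_replace("/" . enclose_unrolled("<!--", "-", "->") . "/i", " ", $html);

// Both patterns should strip the same comments from the sample.
var_dump($old === $new);
```

The two alternatives inside the original group start with different characters, so the unrolled form needs no extra lookahead and leaves the capture group in the same position.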



------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=41896&edit=1