Error processing request

Parameters

CONTENT_LENGTH: 0
REQUEST_METHOD: GET
REQUEST_URI: /revision/Web+Scraping+with+htmlparse?V=26
QUERY_STRING: V=26
CONTENT_TYPE:
DOCUMENT_URI: /revision/Web+Scraping+with+htmlparse
DOCUMENT_ROOT: /var/www/nikit/nikit/nginx/../docroot
SCGI: 1
SERVER_PROTOCOL: HTTP/1.1
HTTPS: on
REMOTE_ADDR: 172.70.127.146
REMOTE_PORT: 14198
SERVER_PORT: 4443
SERVER_NAME: wiki.tcl-lang.org
HTTP_HOST: wiki.tcl-lang.org
HTTP_CONNECTION: Keep-Alive
HTTP_ACCEPT_ENCODING: gzip, br
HTTP_X_FORWARDED_FOR: 18.223.110.131
HTTP_CF_RAY: 88cd54e6cb622a66-ORD
HTTP_X_FORWARDED_PROTO: https
HTTP_CF_VISITOR: {"scheme":"https"}
HTTP_ACCEPT: */*
HTTP_USER_AGENT: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])
HTTP_REFERER: http://wiki.tcl.tk/revision/Web+Scraping+with+htmlparse?V=26
HTTP_CF_CONNECTING_IP: 18.223.110.131
HTTP_CDN_LOOP: cloudflare
HTTP_CF_IPCOUNTRY: US

Body


Error

Unknow state transition: LINE -> END

-code

1

-level

0

-errorstack

INNER {returnImm {Unknow state transition: LINE -> END} {}} CALL {my render_wikit {Web Scraping with htmlparse} ''\[JM\]\ 4\ Dec\ 2012''\ -\ Here\ is\ a\ minimal\ example\ of\ \[Web\ scraping\]\ using\ \[htmlparse\]\n\nAs\ I\ am\ a\ \[RS\]\ fan,\ I\ am\ getting\ a\ list\ of\ all\ his\ recent\ projects.\n\n\ \ \ *\ This\ is\ an\ unfinished\ code\ just\ to\ show\ the\ overall\ mechanism.\n\ \ \ *\ notice\ that\ I\ am\ getting\ just\ one\ link\ per\ bullet,\ so,\ for\ example,\ I\ am\ missing\ the\ link\ for\ \[A\ pocket\ Wiki\],\ which\ is\ the\ second\ link\ on\ the\ 5th\ bullet.\ see\ how\ ONLY\ \[Profiling\ with\ execution\ traces\]\ is\ being\ listed.\n\ \ \ *\ also,\ notice\ the\ error\ message\ \"node\ \"\"\ does\ not\ exist\ in\ tree\ \"t\"\"\ when\ there\ is\ no\ link\ on\ the\ bullet,\ as\ in\ \"simplicite\"\n\ngetting\ as\ many\ links\ per\ bullet\ could\ be\ a\ good\ exercise\ for\ the\ reader.\n\n\[wsWithImg1\]\n\nAs\ a\ side\ note,\ I\ used\ \[LemonTree\ branch\]\ to\ easily\ find\ the\ location\ of\ the\ bulleted\ list\ block\ that\ I\ am\ parsing.\n\n\[wsWithImg2\]\n\n**Ways\ of\ accessing\ the\ data**\n***Walking\ the\ tree***\n\n======tcl\npackage\ require\ struct\npackage\ require\ htmlparse\npackage\ require\ http\n\nnamespace\ eval\ ::scraper\ \{\n\ \ \ \ #\ The\ tag\ at\ \$startNodePath\ should\ be\ a\ <ul>\ with\ its\ children\ having\ the\n\ \ \ \ #\ structure\ of\ <li><a\ href=\"...\">...</a><li>.\n\ \ \ \ proc\ parse-list-of-links\ \{url\ startNodePath\}\ \{\n\ \ \ \ \ \ \ \ set\ documentTree\ \[::struct::tree\]\n\n\ \ \ \ \ \ \ \ set\ conn\ \[::http::geturl\ \$url\]\n\ \ \ \ \ \ \ \ set\ html\ \[::http::data\ \$conn\]\n\n\ \ \ \ \ \ \ \ htmlparse::2tree\ \$html\ \$documentTree\n\ \ \ \ \ \ \ \ htmlparse::removeVisualFluff\ \$documentTree\n\ \ \ \ \ \ \ \ htmlparse::removeFormDefs\ \$documentTree\n\n\ \ \ \ \ \ \ \ set\ base\ \[walk\ \$documentTree\ \$startNodePath\]\n\ \ \ \ \ \ \ \ puts\ \"data:\ \[\$documentTree\ get\ \$base\ data\]\"\n\ \ 
\ \ \ \ \ \ puts\ \"type(tag):\ \[\$documentTree\ get\ \$base\ type\]\\n\"\n\n\ \ \ \ \ \ \ \ #\ Start\ with\ the\ first\ child\ of\ the\ base\ tag.\n\ \ \ \ \ \ \ \ set\ li\ \[walkf\ \$documentTree\ \$base\ \{0\}\]\n\ \ \ \ \ \ \ \ while\ \{\$li\ ne\ \"\"\}\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ set\ link\ \[\$documentTree\ get\ \[walkf\ \$documentTree\ \$li\ \{0\}\]\ data\]\n\ \ \ \ \ \ \ \ \ \ \ \ catch\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \$documentTree\ get\ \[walkf\ \$documentTree\ \$li\ \{0\ 0\}\]\ data\n\ \ \ \ \ \ \ \ \ \ \ \ \}\ title\n\ \ \ \ \ \ \ \ \ \ \ \ puts\ \"\$link:\ \$title\"\n\ \ \ \ \ \ \ \ \ \ \ \ #\ Go\ from\ the\ current\ li\ to\ its\ sibling\ node.\n\ \ \ \ \ \ \ \ \ \ \ \ set\ li\ \[\$documentTree\ next\ \$li\]\n\ \ \ \ \ \ \ \ \}\n\n\ \ \ \ \ \ \ \ \$documentTree\ destroy\n\ \ \ \ \ \ \ \ return\n\ \ \ \ \}\n\n\ \ \ \ proc\ walkf\ \{tree\ startNode\ path\}\ \{\n\ \ \ \ \ \ \ \ set\ node\ \$startNode\n\ \ \ \ \ \ \ \ foreach\ idx\ \$path\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ if\ \{\$node\ eq\ \"\"\}\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ break\n\ \ \ \ \ \ \ \ \ \ \ \ \}\n\ \ \ \ \ \ \ \ \ \ \ \ set\ node\ \[lindex\ \[\$tree\ children\ \$node\]\ \$idx\]\n\ \ \ \ \ \ \ \ \}\n\ \ \ \ \ \ \ \ return\ \$node\n\ \ \ \ \}\n\n\ \ \ \ proc\ walk\ \{tree\ path\}\ \{\n\ \ \ \ \ \ \ \ return\ \[walkf\ \$tree\ root\ \$path\]\n\ \ \ \ \}\n\}\n\n::scraper::parse-list-of-links\ \"http://wiki.tcl.tk/1683\"\ \{1\ 15\ 0\}\n======\n\n\[dbohdan\]\ 2015-01-11:\ I\ found\ the\ example\ code\ above\ hard\ to\ understand,\ so\ I\ updated\ it\ with\ some\ comments\ as\ well\ as\ variable\ and\ proc\ names\ that\ I\ think\ clarify\ what\ the\ script\ does\ at\ each\ step.\ JM,\ I\ hope\ you\ don't\ mind\ my\ changes.\n\n\[JM\]\ 2015-01-14:\ Of\ course\ not,\ this\ is\ much\ better,\ thanks!\n\n***TreeQL***\n\n\[dbohdan\]\ 2015-01-11:\ The\ following\ script\ scrapes\ the\ same\ data\ as\ the\ one\ above\ but\ processes\ multiple\ links\ in\ each\ list\ item,\ not\ just\ the\ first\ 
one.\ This\ is\ done\ using\ \[TreeQL\]\ queries\ with\ which\ manipulating\ every\ child\ node\ of\ a\ given\ node\ comes\ naturally.\ \n\n======\npackage\ require\ struct\npackage\ require\ fileutil\npackage\ require\ htmlparse\npackage\ require\ http\npackage\ require\ treeql\ 1.3\n\nproc\ parse-treeql\ \{url\}\ \{\n\ \ \ \ set\ documentTree\ \[::struct::tree\]\n\n\ \ \ \ set\ conn\ \[::http::geturl\ \$url\]\n\ \ \ \ set\ html\ \[::http::data\ \$conn\]\n\n\ \ \ \ htmlparse::2tree\ \$html\ \$documentTree\n\ \ \ \ htmlparse::removeVisualFluff\ \$documentTree\n\ \ \ \ htmlparse::removeFormDefs\ \$documentTree\n\n\ \ \ \ treeql\ q1\ -tree\ \$documentTree\n\ \ \ \ treeql\ q2\ -tree\ \$documentTree\n\n\ \ \ \ q1\ query\ tree\ withatt\ type\ ul\n\ \ \ \ set\ ul\ \[lindex\ \[q1\ result\]\ 2\]\n\ \ \ \ q1\ query\ replace\ \$ul\ children\ children\ map\ x\ \{\n\ \ \ \ \ \ \ \ #\ For\ each\ li\ in\ the\ ul...\n\ \ \ \ \ \ \ \ q2\ query\ replace\ \$x\ get\ data\n\ \ \ \ \ \ \ \ set\ link\ \[lindex\ \[q2\ result\]\ 0\]\n\ \ \ \ \ \ \ \ q2\ query\ replace\ \$x\ children\ get\ data\n\ \ \ \ \ \ \ \ set\ title\ \[lindex\ \[q2\ result\]\ 0\]\n\ \ \ \ \ \ \ \ if\ \{\$title\ ne\ \"\"\}\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ puts\ \"\$link:\ \$title\"\n\ \ \ \ \ \ \ \ \}\n\ \ \ \ \}\n\n\ \ \ \ q1\ discard\n\ \ \ \ q2\ discard\n\ \ \ \ \$documentTree\ destroy\n\n\ \ \ \ return\n\}\n\nparse-treeql\ \"http://wiki.tcl.tk/1683\"\n======\n\n***Selectors***\n\nWith\ \[treeselect\]\ you\ can\ use\ CSS\ selector-like\ queries\ to\ access\ the\ elements\ of\ an\ HTML\ document\ stored\ in\ a\ tree\ object.\n\nTo\ run\ this\ example\ you\ will\ need\ a\ copy\ of\ the\ treeselect\ module\ in\ the\ same\ directory.\ You\ can\ download\ it\ with\ \[wiki-reaper\]:\ `wiki-reaper\ 41023\ 0\ 8\ >\ treeselect-0.3.1.tm`.\n\n======\n::tcl::tm::path\ add\ .\npackage\ require\ treeselect\ 0.3\n\nset\ tree\ \[::treeselect::url-to-tree\ \"http://wiki.tcl.tk/1683\"\]\nset\ anchorNodes\ \[::treeselect::query\ \$tree\ 
\{\n\ \ \ \ hmstart\ html\ body\ .container\ #wrapper\ div#content\n\ \ \ \ p:nth-child(10)\ ul\ li\ a\n\}\]\nforeach\ node\ \$anchorNodes\ \{\n\ \ \ \ set\ link\ \[\$tree\ get\ \$node\ data\]\n\ \ \ \ set\ title\ \[\$tree\ get\ \\\n\ \ \ \ \ \ \ \ \ \ \ \ \[::treeselect::query\ \$tree\ \"PCDATA\"\ \$node\]\ data\]\n\ \ \ \ puts\ \"\$link:\ \$title\"\n\}\n======\nRelated\ links:<<br>>\nhttp://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/struct/struct_tree.html\n======\n<<categories>>\ Internet\ |\ Web regexp2} CALL {my render {Web Scraping with htmlparse} ''\[JM\]\ 4\ Dec\ 2012''\ -\ Here\ is\ a\ minimal\ example\ of\ \[Web\ scraping\]\ using\ \[htmlparse\]\n\nAs\ I\ am\ a\ \[RS\]\ fan,\ I\ am\ getting\ a\ list\ of\ all\ his\ recent\ projects.\n\n\ \ \ *\ This\ is\ an\ unfinished\ code\ just\ to\ show\ the\ overall\ mechanism.\n\ \ \ *\ notice\ that\ I\ am\ getting\ just\ one\ link\ per\ bullet,\ so,\ for\ example,\ I\ am\ missing\ the\ link\ for\ \[A\ pocket\ Wiki\],\ which\ is\ the\ second\ link\ on\ the\ 5th\ bullet.\ see\ how\ ONLY\ \[Profiling\ with\ execution\ traces\]\ is\ being\ listed.\n\ \ \ *\ also,\ notice\ the\ error\ message\ \"node\ \"\"\ does\ not\ exist\ in\ tree\ \"t\"\"\ when\ there\ is\ no\ link\ on\ the\ bullet,\ as\ in\ \"simplicite\"\n\ngetting\ as\ many\ links\ per\ bullet\ could\ be\ a\ good\ exercise\ for\ the\ reader.\n\n\[wsWithImg1\]\n\nAs\ a\ side\ note,\ I\ used\ \[LemonTree\ branch\]\ to\ easily\ find\ the\ location\ of\ the\ bulleted\ list\ block\ that\ I\ am\ parsing.\n\n\[wsWithImg2\]\n\n**Ways\ of\ accessing\ the\ data**\n***Walking\ the\ tree***\n\n======tcl\npackage\ require\ struct\npackage\ require\ htmlparse\npackage\ require\ http\n\nnamespace\ eval\ ::scraper\ \{\n\ \ \ \ #\ The\ tag\ at\ \$startNodePath\ should\ be\ a\ <ul>\ with\ its\ children\ having\ the\n\ \ \ \ #\ structure\ of\ <li><a\ href=\"...\">...</a><li>.\n\ \ \ \ proc\ parse-list-of-links\ \{url\ startNodePath\}\ \{\n\ \ \ \ \ \ \ \ set\ 
documentTree\ \[::struct::tree\]\n\n\ \ \ \ \ \ \ \ set\ conn\ \[::http::geturl\ \$url\]\n\ \ \ \ \ \ \ \ set\ html\ \[::http::data\ \$conn\]\n\n\ \ \ \ \ \ \ \ htmlparse::2tree\ \$html\ \$documentTree\n\ \ \ \ \ \ \ \ htmlparse::removeVisualFluff\ \$documentTree\n\ \ \ \ \ \ \ \ htmlparse::removeFormDefs\ \$documentTree\n\n\ \ \ \ \ \ \ \ set\ base\ \[walk\ \$documentTree\ \$startNodePath\]\n\ \ \ \ \ \ \ \ puts\ \"data:\ \[\$documentTree\ get\ \$base\ data\]\"\n\ \ \ \ \ \ \ \ puts\ \"type(tag):\ \[\$documentTree\ get\ \$base\ type\]\\n\"\n\n\ \ \ \ \ \ \ \ #\ Start\ with\ the\ first\ child\ of\ the\ base\ tag.\n\ \ \ \ \ \ \ \ set\ li\ \[walkf\ \$documentTree\ \$base\ \{0\}\]\n\ \ \ \ \ \ \ \ while\ \{\$li\ ne\ \"\"\}\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ set\ link\ \[\$documentTree\ get\ \[walkf\ \$documentTree\ \$li\ \{0\}\]\ data\]\n\ \ \ \ \ \ \ \ \ \ \ \ catch\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \$documentTree\ get\ \[walkf\ \$documentTree\ \$li\ \{0\ 0\}\]\ data\n\ \ \ \ \ \ \ \ \ \ \ \ \}\ title\n\ \ \ \ \ \ \ \ \ \ \ \ puts\ \"\$link:\ \$title\"\n\ \ \ \ \ \ \ \ \ \ \ \ #\ Go\ from\ the\ current\ li\ to\ its\ sibling\ node.\n\ \ \ \ \ \ \ \ \ \ \ \ set\ li\ \[\$documentTree\ next\ \$li\]\n\ \ \ \ \ \ \ \ \}\n\n\ \ \ \ \ \ \ \ \$documentTree\ destroy\n\ \ \ \ \ \ \ \ return\n\ \ \ \ \}\n\n\ \ \ \ proc\ walkf\ \{tree\ startNode\ path\}\ \{\n\ \ \ \ \ \ \ \ set\ node\ \$startNode\n\ \ \ \ \ \ \ \ foreach\ idx\ \$path\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ if\ \{\$node\ eq\ \"\"\}\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ break\n\ \ \ \ \ \ \ \ \ \ \ \ \}\n\ \ \ \ \ \ \ \ \ \ \ \ set\ node\ \[lindex\ \[\$tree\ children\ \$node\]\ \$idx\]\n\ \ \ \ \ \ \ \ \}\n\ \ \ \ \ \ \ \ return\ \$node\n\ \ \ \ \}\n\n\ \ \ \ proc\ walk\ \{tree\ path\}\ \{\n\ \ \ \ \ \ \ \ return\ \[walkf\ \$tree\ root\ \$path\]\n\ \ \ \ \}\n\}\n\n::scraper::parse-list-of-links\ \"http://wiki.tcl.tk/1683\"\ \{1\ 15\ 0\}\n======\n\n\[dbohdan\]\ 2015-01-11:\ I\ found\ the\ example\ code\ above\ hard\ to\ 
understand,\ so\ I\ updated\ it\ with\ some\ comments\ as\ well\ as\ variable\ and\ proc\ names\ that\ I\ think\ clarify\ what\ the\ script\ does\ at\ each\ step.\ JM,\ I\ hope\ you\ don't\ mind\ my\ changes.\n\n\[JM\]\ 2015-01-14:\ Of\ course\ not,\ this\ is\ much\ better,\ thanks!\n\n***TreeQL***\n\n\[dbohdan\]\ 2015-01-11:\ The\ following\ script\ scrapes\ the\ same\ data\ as\ the\ one\ above\ but\ processes\ multiple\ links\ in\ each\ list\ item,\ not\ just\ the\ first\ one.\ This\ is\ done\ using\ \[TreeQL\]\ queries\ with\ which\ manipulating\ every\ child\ node\ of\ a\ given\ node\ comes\ naturally.\ \n\n======\npackage\ require\ struct\npackage\ require\ fileutil\npackage\ require\ htmlparse\npackage\ require\ http\npackage\ require\ treeql\ 1.3\n\nproc\ parse-treeql\ \{url\}\ \{\n\ \ \ \ set\ documentTree\ \[::struct::tree\]\n\n\ \ \ \ set\ conn\ \[::http::geturl\ \$url\]\n\ \ \ \ set\ html\ \[::http::data\ \$conn\]\n\n\ \ \ \ htmlparse::2tree\ \$html\ \$documentTree\n\ \ \ \ htmlparse::removeVisualFluff\ \$documentTree\n\ \ \ \ htmlparse::removeFormDefs\ \$documentTree\n\n\ \ \ \ treeql\ q1\ -tree\ \$documentTree\n\ \ \ \ treeql\ q2\ -tree\ \$documentTree\n\n\ \ \ \ q1\ query\ tree\ withatt\ type\ ul\n\ \ \ \ set\ ul\ \[lindex\ \[q1\ result\]\ 2\]\n\ \ \ \ q1\ query\ replace\ \$ul\ children\ children\ map\ x\ \{\n\ \ \ \ \ \ \ \ #\ For\ each\ li\ in\ the\ ul...\n\ \ \ \ \ \ \ \ q2\ query\ replace\ \$x\ get\ data\n\ \ \ \ \ \ \ \ set\ link\ \[lindex\ \[q2\ result\]\ 0\]\n\ \ \ \ \ \ \ \ q2\ query\ replace\ \$x\ children\ get\ data\n\ \ \ \ \ \ \ \ set\ title\ \[lindex\ \[q2\ result\]\ 0\]\n\ \ \ \ \ \ \ \ if\ \{\$title\ ne\ \"\"\}\ \{\n\ \ \ \ \ \ \ \ \ \ \ \ puts\ \"\$link:\ \$title\"\n\ \ \ \ \ \ \ \ \}\n\ \ \ \ \}\n\n\ \ \ \ q1\ discard\n\ \ \ \ q2\ discard\n\ \ \ \ \$documentTree\ destroy\n\n\ \ \ \ return\n\}\n\nparse-treeql\ \"http://wiki.tcl.tk/1683\"\n======\n\n***Selectors***\n\nWith\ \[treeselect\]\ you\ can\ use\ CSS\ selector-like\ queries\ to\ 
access\ the\ elements\ of\ an\ HTML\ document\ stored\ in\ a\ tree\ object.\n\nTo\ run\ this\ example\ you\ will\ need\ a\ copy\ of\ the\ treeselect\ module\ in\ the\ same\ directory.\ You\ can\ download\ it\ with\ \[wiki-reaper\]:\ `wiki-reaper\ 41023\ 0\ 8\ >\ treeselect-0.3.1.tm`.\n\n======\n::tcl::tm::path\ add\ .\npackage\ require\ treeselect\ 0.3\n\nset\ tree\ \[::treeselect::url-to-tree\ \"http://wiki.tcl.tk/1683\"\]\nset\ anchorNodes\ \[::treeselect::query\ \$tree\ \{\n\ \ \ \ hmstart\ html\ body\ .container\ #wrapper\ div#content\n\ \ \ \ p:nth-child(10)\ ul\ li\ a\n\}\]\nforeach\ node\ \$anchorNodes\ \{\n\ \ \ \ set\ link\ \[\$tree\ get\ \$node\ data\]\n\ \ \ \ set\ title\ \[\$tree\ get\ \\\n\ \ \ \ \ \ \ \ \ \ \ \ \[::treeselect::query\ \$tree\ \"PCDATA\"\ \$node\]\ data\]\n\ \ \ \ puts\ \"\$link:\ \$title\"\n\}\n======\nRelated\ links:<<br>>\nhttp://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/struct/struct_tree.html\n======\n<<categories>>\ Internet\ |\ Web} CALL {my revision {Web Scraping with htmlparse}} CALL {::oo::Obj161023 process revision/Web+Scraping+with+htmlparse} CALL {::oo::Obj161021 process}

-errorcode

NONE

-errorinfo

Unknow state transition: LINE -> END
    while executing
"error $msg"
    (class "::Wiki" method "render_wikit" line 6)
    invoked from within
"my render_$default_markup $N $C $mkup_rendering_engine"
    (class "::Wiki" method "render" line 8)
    invoked from within
"my render $name $C"
    (class "::Wiki" method "revision" line 31)
    invoked from within
"my revision $page"
    (class "::Wiki" method "process" line 56)
    invoked from within
"$server process [string trim $uri /]"

-errorline

4