Simple PHP regex tokenizer question

Narks

So I have a web page stored in a string var, and it has a whole bunch of stuff like this:

<td>asldkfmasdf</td><td>more stuff</td> etc...

What I want to do is tokenize the web page, to extract the text between the td tags, then strip any html tags inside the extracted text (probably by removing < and > characters).

I'm really a novice when it comes to PHP, and I think I need to use regular expressions and preg_match to get all the information into a string array. I'm having trouble coming up with a regex for:
<td> at start of string
</td> at end of string

Can someone help me out?
 

Artificial

Parsing HTML with regular expressions is usually a bad idea, since it cannot be done reliably. In non-trivial cases, you're better off using a real HTML parser. Luckily for you, plenty of those are available, and PHP itself ships with DOM, whose loadHTML method copes with real-world HTML (XMLReader is bundled too, but it expects well-formed XML).
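
To make the reliability point concrete, here is a small sketch of how the obvious pattern falls over. The sample string is made up for illustration (the class attribute and the upper-case tag are my assumptions, not something from your page), but both are common in real markup:

```php
<?php
// Hypothetical sample input: one plain cell, one cell with an attribute,
// and one cell written with upper-case tags.
$html = '<td>plain</td><td class="x">with attribute</td><TD>upper case</TD>';

// Naive pattern: literal <td>, lazy capture, literal </td>.
preg_match_all('#<td>(.*?)</td>#', $html, $matches);

// Only the first cell is found; the other two are silently skipped.
print_r($matches[1]);
```

You can patch the pattern for each of these cases, but nested tables, comments and unclosed tags will keep producing new ones, which is why a parser is the safer route.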

Using DOM for your problem could look something like this (don't read it if you wanna figure it out yourself!):
PHP:
<?php

$html = <<<E
<html>
    <head>
        <title>Til</title>
    </head>
    <body>
        <table>
            <tr>
                <td>Data1 <b>More data 1</b></td>
                <td>Data2</td>
            </tr>
        </table>
    </body>
</html>
E;

// loadHTML is an instance method; calling it statically is an error in PHP 8.
$dom = new DOMDocument();
$dom->loadHTML($html);
$elems = $dom->getElementsByTagName('td');
foreach ($elems as $elem) {
    echo $elem->textContent, "\n";
}

?>
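
Since the goal was a string array rather than printed output, the same idea can be wrapped in a function. This is only a sketch: the name extractCellText is hypothetical, the type declarations assume PHP 7 or later, and the libxml_use_internal_errors calls are there because loadHTML tends to emit warnings on sloppy real-world markup.

```php
<?php
// Hypothetical helper (not from the original post): returns the text of
// every <td> in $html as an array of strings.
function extractCellText(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $cells = [];
    foreach ($dom->getElementsByTagName('td') as $cell) {
        // textContent already drops nested tags such as <b>.
        $cells[] = trim($cell->textContent);
    }
    return $cells;
}

$cells = extractCellText(
    '<table><tr><td>Data1 <b>More data 1</b></td><td>Data2</td></tr></table>'
);
print_r($cells);
```

Because textContent returns only the text nodes, there is no need to strip < and > characters by hand afterwards.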
 