Simple PHP regex tokenizer question

Narks · Dec 24, 2011

So I have a web page stored in a string var, and it has a whole bunch of stuff like this:

<td>asldkfmasdf</td><td>more stuff</td> etc...

What I want to do is tokenize the web page, to extract the text between the td tags, then strip any html tags inside the extracted text (probably by removing < and > characters).

I'm really a novice when it comes to PHP, and I think I need to use regular expressions and preg_match to get all the information into a string array. I'm having trouble coming up with a regex for:
<td> at start of string
</td> at end of string

Can someone help me out?

Artificial · Dec 24, 2011

Parsing HTML using regular expressions is usually a bad idea, since it cannot be done reliably: nested tags, attributes, and sloppy markup all trip up a pattern. In non-trivial cases, you're better off using a real HTML parser. Luckily for you, plenty of those are available, and PHP itself ships with two (DOM and XMLReader).
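For the flat, non-nested case in the question, a regex can still get by. Here is a minimal sketch, assuming the markup really is as simple as the example; `$page` and `$cells` are illustrative names, not from the original post. It stops at the first `</td>` it sees, so nested tables will break it.

```php
<?php

// Hypothetical input, modelled on the snippet in the question.
$page = '<td>asldkfmasdf</td><td class="x">more <b>stuff</b></td>';

// Lazily capture everything between an opening <td ...> and the next </td>.
// i = case-insensitive, s = let "." match newlines too.
preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $page, $matches);

// Remove any markup left inside each cell, then trim whitespace.
$cells = array_map(
    function ($cell) {
        return trim(strip_tags($cell));
    },
    $matches[1]
);

print_r($cells);
```

Using `strip_tags()` here is safer than deleting `<` and `>` characters, which would leave the tag names behind in the text.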

Using DOM for your problem could look something like this (don't read it if you wanna figure it out yourself!):

PHP:

<?php

$html = <<<E
<html>
    <head>
        <title>Til</title>
    </head>
    <body>
        <table>
            <tr>
                <td>Data1 <b>More data 1</b></td>
                <td>Data2</td>
            </tr>
        </table>
    </body>
</html>
E;

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
$dom->loadHTML($html);
$elems = $dom->getElementsByTagName('td');
foreach ($elems as $elem) {
    echo $elem->textContent, "\n";
}

?>
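Since the question asks for a string array rather than printed output, the same DOM approach can collect the cell text instead. This is a sketch under a few assumptions: the `$cells` name and the shortened input are made up for illustration, and `libxml_use_internal_errors()` is used only to keep warnings about sloppy real-world markup out of the output.

```php
<?php

// Hypothetical fragment; real pages are rarely this tidy.
$html = '<table><tr><td>Data1 <b>More data 1</b></td><td> Data2 </td></tr></table>';

// Collect libxml's complaints about malformed markup instead of emitting warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    // textContent already drops nested tags such as <b>.
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

Because `textContent` returns only the text nodes, there is no separate tag-stripping step.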
