PHP - How to Get Main HTML Content Like Reader Mode in Firefox

June 22, 2024

In the Firefox app for Android and in Safari on the iPad, 'Reader Mode' lets us read only the main content of a page. How can I recognize only the main content in HTML with PHP? I need to detect main news

Solution 1:

A new PHP library named PHP Goose seems to do a very good job at this too. It's pretty easy to use and is Composer friendly.

Here's the usage example given in the project's readme:

use Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$movies = $article->getMovies();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();

Solution 2:

Readability.php works pretty well, but I've found you get better results if you fetch the HTML with cURL and spoof the user agent. You can also follow redirects in case the URL you are trying to hit is giving you the runaround. Here is what I'm using now, slightly modified from another post (PHP Curl following redirects). Hope you find it useful.

// Fetch a page with cURL, following redirects and sending a browser-like user agent.
function getData($url) {
    // Undo HTML-escaped ampersands and URL encoding in the incoming URL.
    $url = str_replace('&amp;', '&', urldecode(trim($url)));
    $timeout = 5; // seconds
    $cookie = tempnam('/tmp', 'CURLCOOKIE');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_ENCODING, ''); // accept any encoding cURL supports
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    $content = curl_exec($ch); // false on failure
    curl_close($ch);
    return $content;
}

Implementation:

$url = 'http://';
//$html = file_get_contents($url);
$html = getData($url);

if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

$readability = new Readability($html, $url);

//...

Solution 3:

There is no such built-in function in PHP. I am afraid you will have to parse and analyse the HTML document yourself. You will probably need to use some XML parser; the SimpleXML library is a good candidate.

I am not familiar with the "Reader mode" feature you are referring to, but a good starting point would probably be removing all <img> contents. The actual "cleaning" algorithm it uses is certainly not trivial, and it seems it is actually implemented as a call to a third-party, closed-source service in JavaScript.
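
As a rough illustration of what "parse and analyse the HTML document yourself" can look like, here is a minimal sketch. It uses PHP's DOMDocument and DOMXPath instead of SimpleXML, because DOMDocument::loadHTML tolerates markup that is not well-formed XML. The function name extractMainText, the list of stripped tags, and the scoring rule (the container whose direct <p> children hold the most text wins) are all assumptions made for illustration; this is not the algorithm Reader Mode uses.

```php
<?php
// Hypothetical helper, not part of PHP or of any library named in this thread.
function extractMainText(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings from malformed markup instead of printing them.
    libxml_use_internal_errors(true);
    // The XML prolog is a common way to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Drop elements that rarely carry article text (assumed list).
    $noise = $xpath->query('//script|//style|//img|//nav|//header|//footer|//aside');
    foreach (iterator_to_array($noise) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Score each container by the amount of text in its direct <p> children.
    $best = null;
    $bestScore = 0;
    foreach ($xpath->query('//div|//article|//section|//main|//body') as $container) {
        $score = 0;
        foreach ($xpath->query('./p', $container) as $p) {
            $score += strlen(trim($p->textContent));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $container;
        }
    }

    if ($best === null) {
        return '';
    }

    // Return the winning container's paragraphs, whitespace-normalised.
    $paragraphs = [];
    foreach ($xpath->query('./p', $best) as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n\n", $paragraphs);
}
```

Calling extractMainText($html) on a fetched page returns the paragraphs of the highest-scoring container joined by blank lines, or an empty string when no container has paragraph text. Real pages need a far more careful heuristic, which is what the libraries in the other solutions provide.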

Solution 4:

Hooray!!!

I found this source code:

1) create Readability.php

2) create JSLikeHTMLElement.php

3) create index.php with this code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>!</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
</head>
<body dir="rtl">
<?php
include_once 'Readability.php';

// get latest Medialens alert
// (change this URL to whatever you'd like to test)
$url = 'http://';
$html = file_get_contents($url);

// Note: PHP Readability expects UTF-8 encoded content.
// If your content is not UTF-8 encoded, convert it
// first before passing it to PHP Readability.
// Both iconv() and mb_convert_encoding() can do this.

// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often doesn't do a great job and results in strange output.
if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

// give it to Readability
$readability = new Readability($html, $url);
// print debug output?
// useful to compare against Arc90's original JS version -
// simply click the bookmarklet with FireBug's console window open
$readability->debug = false;
// convert links to footnotes?
$readability->convertLinksToFootnotes = true;
// process it
$result = $readability->init();
// does it look like we found what we wanted?
if ($result) {
    echo "== Title =====================================\n";
    echo $readability->getTitle()->textContent, "\n\n";
    echo "== Body ======================================\n";
    $content = $readability->getContent()->innerHTML;
    // if we've got Tidy, let's clean it up for output
    if (function_exists('tidy_parse_string')) {
        $tidy = tidy_parse_string($content, array('indent' => true, 'show-body-only' => true), 'UTF8');
        $tidy->cleanRepair();
        $content = $tidy->value;
    }
    echo $content;
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
?>
</body>
</html>

Set your site's URL in the line $url = 'http://';

Thank you;)
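
The comments in the script say that PHP Readability expects UTF-8 input and that mb_convert_encoding() can do the conversion. A minimal sketch of that step follows. The helper name toUtf8 and the default source encoding of ISO-8859-1 are assumptions for illustration; in practice the source encoding would come from the Content-Type header or a charset meta tag. It needs the mbstring extension.

```php
<?php
// Hypothetical helper: convert fetched HTML to UTF-8 before handing it on.
// The default source encoding is an assumption, not something detected here.
function toUtf8(string $html, string $sourceEncoding = 'ISO-8859-1'): string
{
    // Leave the input alone if it is already valid UTF-8.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}
```

In the script above this would be called as $html = toUtf8($html); right after file_get_contents() and before the Tidy step.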

Solution 5:

This displays the whole content. If you want more information, search Google for regular expressions and how to get the value between tags in an HTML file. I will show you with a demo :)

First off, when you use the function file_get_contents you get the file as HTML code, but the server or browser will display it as a page. Look at this code:

$html = file_get_contents('http://coder-dz.com');
preg_match_all('/<li>(.*?)<\/li>/s', $html, $matches);
foreach ($matches[1] as $mytitle) {
    echo $mytitle . "<br/>";
}

Well, what did I do here? I fetched the content of my website, which runs on WordPress. I get the titles because they sit inside HTML li tags, and then I used a regular expression to get the values between those tags.

I hope you get my point, because I'm not good at English. If you have any questions, feel free to ask me.
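
One caveat with the pattern above is that it only matches a bare <li> tag, so an item written as <li class="..."> is skipped. A minimal alternative sketch using PHP's DOMDocument is shown here. The helper name listItemTexts is an assumption for illustration, and unlike the regular expression it returns the text of each item, not its inner HTML.

```php
<?php
// Hypothetical helper: return the text of every <li> element in an HTML string.
function listItemTexts(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings from malformed markup instead of printing them.
    libxml_use_internal_errors(true);
    // The XML prolog is a common way to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent drops nested tags such as <a> and keeps their text.
        $items[] = trim($li->textContent);
    }
    return $items;
}
```

It can be used the same way as the loop above: foreach (listItemTexts($html) as $mytitle) { echo $mytitle . "<br/>"; }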
