Simplest way to extract text content regularly from HTML?

When I call wget on a webpage, I get raw HTML in response. I would like to write a simple parsing script which extracts the main article text content from web pages with similar structure, i.e. different documentation articles about Microsoft Visual Basic for Applications. I am pretty sure I should inspect the HTML tree, figure out which nodes tend to contain article headers and paragraphs, and then just write a script that retrieves those nodes. What would be the simplest way to inspect the HTML tree to find the nodes, and then with which library should I extract the text content from those nodes? Thank you

All Replies (2)

Are you saying the webpage is not loading properly?

Load the web page, then press Ctrl+Shift+R (Mac: Command+Shift+R) to reload it, bypassing the cache and forcing a fresh retrieval.

Try this several times.

View web pages in Reader View: https://support.mozilla.org/en-US/kb/view-web-pages-reader-view-firefox-ios

Reader View in Firefox for iOS strips away images, ads, videos and menus from a web page, so you can focus on reading. Reader View is available for articles, blog posts and other web pages that can be simplified.

This question is beyond the scope of Firefox support. You could consider Stack Overflow, or if this is specific to Microsoft tooling, one of their developer forums.

If you are using JavaScript, you could look at the Readability library, which is the foundation for Firefox's Reader View feature. It has some code to "guess" the important parts of a page:

https://github.com/mozilla/readability

(I have always found it difficult to follow, but you may have better code-reading skills than me.)
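
If a heuristic library like Readability is more than you need, the approach described in the question (find the nodes by hand, then pull their text) can be sketched with Python's standard library alone. To find the nodes, one option is to open a page in Firefox, right-click the article text, and choose Inspect to see which element wraps it. The sketch below is only an illustration under stated assumptions: it supposes the article body sits inside a `<main>` element and that headings and paragraphs are properly closed. `CONTAINER_TAG`, `CONTENT_TAGS`, `ArticleTextExtractor`, `extract_article`, and the sample HTML are all made-up names and data, not taken from any real Microsoft documentation page.

```python
from html.parser import HTMLParser

# Assumption: the article body is wrapped in <main>. Replace this with
# whatever container the Inspector shows for the pages you care about.
CONTAINER_TAG = "main"
# Assumption: these tags carry the headings and paragraphs you want.
CONTENT_TAGS = {"h1", "h2", "h3", "p"}


class ArticleTextExtractor(HTMLParser):
    """Collects (tag, text) pairs for content tags inside the container."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.container_depth = 0  # greater than 0 while inside the container
        self.current_tag = None   # content tag currently being read, if any
        self.buffer = []          # text fragments of the current tag
        self.blocks = []          # collected (tag, text) results

    def handle_starttag(self, tag, attrs):
        if tag == CONTAINER_TAG:
            self.container_depth += 1
        elif (self.container_depth and tag in CONTENT_TAGS
              and self.current_tag is None):
            self.current_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == CONTAINER_TAG and self.container_depth:
            self.container_depth -= 1
        elif tag == self.current_tag:
            # Join fragments and collapse runs of whitespace and newlines.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append((tag, text))
            self.current_tag = None

    def handle_data(self, data):
        # Inline tags such as <b> are ignored, but their text is kept.
        if self.current_tag is not None:
            self.buffer.append(data)


def extract_article(html):
    """Returns a list of (tag, text) pairs found inside the container."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Made-up sample standing in for a page saved with wget.
SAMPLE = """
<html><body>
<nav><p>Site menu</p></nav>
<main>
  <h1>Range object</h1>
  <p>Represents a cell, a row,
     or a <b>column</b>.</p>
</main>
<footer><p>Feedback</p></footer>
</body></html>
"""

for tag, text in extract_article(SAMPLE):
    print(f"{tag}: {text}")
```

For a file saved by wget, you would read it with `open(path, encoding="utf-8").read()` and pass the string to `extract_article`. If the pages turn out to have messy or inconsistent markup, a dedicated HTML parsing library that supports CSS selectors is likely to be more forgiving than this hand-rolled parser.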