How does FFX read XML in Atom feeds?
I have been given access to an Atom 1.0 feed whose content I want to read. The feed page is formatted as XML without any style info. In particular, I want to be able to read the XML from Python. The only way I have found to display the content satisfactorily is to view the page source in FFX; in other words, type the page address into the address bar. It then returns the appropriately formatted XML. [In Python, Feedparser doesn't deal with it satisfactorily.] So my question is: how does FFX handle such a page? If I know, I can presumably replicate this in Python. There are over fifty pages that I need to extract info from, so automating it is the only acceptable solution.
All Replies (8)
I'm not sure I understand the problem: do you want the XML transformed using an indicated XSL stylesheet, or displayed in its raw form?
Firefox's source code is available, but it's a C/C++ application. Seems it would be easier to find a better Python library. (But I have never used Python, so perhaps that isn't as easy as it sounds.)
Thanks.
I want its raw form. I can display its raw form in FFX by asking for its source code in which case I get something like:
<Standard xml version 1.0 encoding="UTF-8" header>
<custom_header_1> <custom_header_2> stuff <field_ident>crown_jewels</field_ident> </custom_header_2> </custom_header_1>
Naturally, I want to get the crown_jewels, which is a field whose value will govern what my Python system does.
If I can get it into this form as a file that I can access from Python, it's just a question of regex searches.
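A minimal sketch of the regex approach, assuming the raw XML is already available as text and that the field of interest looks like the <field_ident> placeholder above. The sample string and the function name are made up for illustration. Regexes are fragile on real XML (attributes, CDATA, nested tags), so this is a quick check rather than a parser.

```python
import re

# Placeholder markup modelled on the example above; not the real feed.
SAMPLE = (
    "<custom_header_1> <custom_header_2> stuff "
    "<field_ident>crown_jewels</field_ident> "
    "</custom_header_2> </custom_header_1>"
)

# Non-greedy match on the text between the opening and closing tag.
# DOTALL lets a value span several lines.
FIELD_RE = re.compile(r"<field_ident>(.*?)</field_ident>", re.DOTALL)


def find_field_values(xml_text):
    """Return the text of every <field_ident> element, in document order."""
    return [value.strip() for value in FIELD_RE.findall(xml_text)]
```

Calling find_field_values(SAMPLE) returns a list with one entry per <field_ident> element found.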
The ability to display like this is something that FFX has which other browsers don't. So rather than actually playing with the code in FFX I wanted a description of what it is that FFX does with a feed of the type I described last time that displays it in the form above.
Firefox retrieves the raw XML using an HTTP request and receives the response in text form.
In Python, you can retrieve the same raw text into a string using the urlopen function of the urllib2 library (urllib.request in Python 3). (I say the same, but browsers also send cookies and their own user agent string, which could influence the content of the server's response...)
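As a rough sketch of that fetch, assuming Python 3 (where urllib2's urlopen lives in urllib.request). The user agent string and the function name are hypothetical, and cookies or authentication, if the feed needs them, are not handled here.

```python
import urllib.request

# Hypothetical user agent; some servers vary their response by user agent.
USER_AGENT = "Mozilla/5.0 (feed-reader sketch)"


def fetch_raw_text(url, timeout=30):
    """Fetch a URL and return the response body decoded to a string."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw_bytes = response.read()
        # Use the charset from the Content-Type header if the server sent one,
        # otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset)
```

Calling fetch_raw_text with the feed's address should give roughly the same text that view-source shows, provided the server does not serve different content to different clients.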
It certainly could get more complicated, as I was reading here: http://www.diveintopython.net/http_web_services/index.html
jscher2000, thanks very much. I will look at that and get back if I need more advice; otherwise I will close this query.
I wonder if you can get it "just" by urlopen. It was the other bits that you point out which might complicate things.
Thanks for the help so far.
I believe you can fetch the raw text into a string variable just using urlopen, but of course, I haven't tried it. Then you can parse it as a string or use the xml.dom.minidom module to parse it as a document. The "Dive Into" site has a chapter on the latter as well.
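For the minidom route, something along these lines might work, assuming the fetched text is well-formed XML and the target element is named like the <field_ident> placeholder from earlier in the thread. The sample document and the function name are illustrative only.

```python
from xml.dom import minidom

# Placeholder document modelled on the earlier example; not the real feed.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<custom_header_1>
  <custom_header_2> stuff <field_ident>crown_jewels</field_ident> </custom_header_2>
</custom_header_1>"""


def field_values(xml_text, tag_name="field_ident"):
    """Parse xml_text and return the text content of every tag_name element."""
    document = minidom.parseString(xml_text)
    values = []
    for element in document.getElementsByTagName(tag_name):
        # Join the direct text and CDATA children of the element.
        text = "".join(
            node.data
            for node in element.childNodes
            if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
        )
        values.append(text.strip())
    return values
```

Note that getElementsByTagName matches on the tag name as written in the document, so a prefixed element such as <x:field_ident> would need the prefix included in tag_name.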
I read through and applied some of the earlier material in Dive Into Python, which is really good and thoroughly recommended. However, when he says "this will give you pretty much everything", it produces acres of stuff but not what I see in FFX. Which brings me back to my original question: what is it that FFX is doing that I am failing to duplicate in Python?
See:
- view-source:chrome://global/content/xml/XMLPrettyPrint.xml
- view-source:chrome://global/content/xml/XMLPrettyPrint.xsl
Cor-el, thanks for that; I will look into it. The trouble is I don't really want to use C++, but I will see if there is something in there that gives me the hint that I need.
The issue seems to be that there are several xml docs embedded in the feed. FFX seems to look into the code (in source mode) and burrow down, giving me their contents.
Going back to yesterday's example what I see from FFX as source is really more like:
<Standard xml version 1.0 encoding="UTF-8" header>
<custom_header_1>
  <custom_header_2> stuff <field_ident>crown_jewels_1</field_ident> </custom_header_2>
  <custom_header_2> stuff <field_ident>crown_jewels_2</field_ident> </custom_header_2>
  <custom_header_2> stuff <field_ident>crown_jewels_3</field_ident> </custom_header_2>
</custom_header_1>
And I need to be able to get to all the crown_jewels_xx of which there are several to a page and a large number of pages.
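One possible way to collect every such value across many pages is sketched below with xml.etree.ElementTree. It matches elements by local name at any depth, so it does not matter how deeply they are nested or which XML namespace they sit in. The namespace URI, sample document, and function names are assumptions for illustration; whether namespaces are actually what hides the values in the real feed is not known from this thread.

```python
import xml.etree.ElementTree as ET

# Placeholder document modelled on the example above. The namespace URI is
# hypothetical; the real feed's namespaces are unknown.
SAMPLE = """<custom_header_1 xmlns="http://example.com/ns">
  <custom_header_2> stuff <field_ident>crown_jewels_1</field_ident> </custom_header_2>
  <custom_header_2> stuff <field_ident>crown_jewels_2</field_ident> </custom_header_2>
  <custom_header_2> stuff <field_ident>crown_jewels_3</field_ident> </custom_header_2>
</custom_header_1>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def collect_values(xml_text, wanted="field_ident"):
    """Return the text of every element whose local name is `wanted`."""
    root = ET.fromstring(xml_text)
    return [
        (element.text or "").strip()
        for element in root.iter()  # walks the whole tree, at any depth
        if isinstance(element.tag, str) and local_name(element.tag) == wanted
    ]


def collect_from_pages(pages, wanted="field_ident"):
    """Run collect_values over an iterable of XML strings, one per page."""
    results = []
    for xml_text in pages:
        results.extend(collect_values(xml_text, wanted))
    return results
```

With the text of each of the fifty-odd pages fetched into a list, collect_from_pages would return all the values in page order.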
Python using dom and feedparser give me a lot; it just ain't enough.
I suspect that I am not doing - or rather not seeing in the xml - something fairly straightforward. The easiest thing would be to publish the url here and let you see it for yourself. Unfortunately...