Reading/Parsing RSS and Atom feeds in Java with Rome

As you might have already guessed, Podcastpedia.org is all about podcasts, and podcasting is all about distributing audio or video content via RSS or Atom. This post presents how Atom and RSS podcast feeds are parsed and added to the directory, with the help of the Java project Rome.
Maven dependencies
In order to use Rome in your Java project, you have to add rome.jar and jdom.jar to your classpath, or, if you use Maven, add the following dependencies to your pom.xml file:
<dependency>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
    <version>1.0</version>
</dependency>
<dependency>
    <groupId>org.jdom</groupId>
    <artifactId>jdom</artifactId>
    <version>1.1</version>
</dependency>
Building a SyndFeed object
ROME represents syndication feeds (RSS and Atom) as instances of the com.sun.syndication.feed.synd.SyndFeed interface. The SyndFeed interface and its properties follow the Java Bean patterns. The default implementations provided with ROME are all lightweight classes.
XmlReader
ROME includes parsers to process syndication feeds into SyndFeed instances. The SyndFeedInput class selects and drives the correct parser based on the syndication feed being processed, so the developer does not need to worry about choosing one; SyndFeedInput takes care of it by peeking at the syndication feed structure. All it takes to read a syndication feed using ROME are the following two lines of code:
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedUrl));
The first line creates a SyndFeedInput instance that will work with any syndication feed type (RSS and Atom versions). The second line instructs the SyndFeedInput to read the syndication feed from the character-based input stream of a URL pointing to the feed. The XmlReader is a character-based Reader that resolves the encoding following the HTTP MIME type and the XML rules for it. The SyndFeedInput.build() method returns a SyndFeed instance that can be easily processed.
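To see these pieces working together, here is a minimal, self-contained sketch; the feed URL is only a placeholder, and the imports reflect the Rome 1.0 package names used above:

import java.net.URL;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class FeedReaderExample {

    public static void main(String[] args) throws Exception {
        // placeholder URL - replace with the feed you actually want to read
        URL feedUrl = new URL("http://example.com/podcast.xml");

        SyndFeedInput input = new SyndFeedInput();
        // XmlReader resolves the character encoding from the HTTP headers and the XML prolog
        SyndFeed feed = input.build(new XmlReader(feedUrl));

        System.out.println("Feed title: " + feed.getTitle());
        for (Object o : feed.getEntries()) {
            SyndEntry entry = (SyndEntry) o;
            System.out.println(entry.getTitle() + " - " + entry.getPublishedDate());
        }
    }
}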
InputSource
Using the approach just mentioned works fine for most of the podcast feeds out there, but for some, nasty exceptions like "Content is not allowed in prolog" or "Invalid byte 2 of 3-byte UTF-8 sequence" started to occur. To tackle these exceptions I replaced the XmlReader with an InputSource, which solved most of the problems – thank you Paŭlo Ebermann on Stack Overflow for researching this. The following code snippet presents how this is used to parse the feeds:
public SyndFeed getSyndFeedForUrl(String url) throws MalformedURLException, IOException, IllegalArgumentException, FeedException {

    SyndFeed feed = null;
    InputStream is = null;
    try {
        URLConnection openConnection = new URL(url).openConnection();
        is = openConnection.getInputStream();
        // some hosts serve the feed gzip-compressed, so unwrap it before parsing
        if ("gzip".equals(openConnection.getContentEncoding())) {
            is = new GZIPInputStream(is);
        }
        InputSource source = new InputSource(is);
        SyndFeedInput input = new SyndFeedInput();
        feed = input.build(source);
    } catch (Exception e) {
        LOG.error("Exception occurred when building the feed object out of the url", e);
    } finally {
        if (is != null) {
            is.close();
        }
    }

    return feed;
}
Note the line if("gzip".equals(openConnection.getContentEncoding())
– this was needed because some web sites use gzip to compress the files, and althogh in the browsers you might not recognize this (they decompress the files automatically), if you have to decompress it programatically in your code.
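If you want to go one step further and explicitly ask the server for a gzip-compressed response, a sketch along the following lines should work; this is plain java.net and java.util.zip usage, nothing Rome-specific, and the FeedStreams helper class is just a hypothetical name for illustration:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: opens a feed URL and transparently unwraps gzip responses.
public final class FeedStreams {

    public static InputStream open(String url) throws Exception {
        URLConnection connection = new URL(url).openConnection();
        // advertise that we can handle gzip; many podcast hosts will then compress the response
        connection.setRequestProperty("Accept-Encoding", "gzip");

        InputStream in = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            // the server actually compressed the body, so decompress it before parsing
            in = new GZIPInputStream(in);
        }
        return in;
    }
}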
FileInputStream
If for some reason ("Content is not allowed in prolog", "Invalid byte 2 of 3-byte UTF-8 sequence" etc.) you cannot parse the feed online via its URL, you can store it in a local file, adjust the encoding to your needs (very easy with Notepad++, for example) and parse it from there:
public SyndFeed getSyndFeedFromLocalFile(String filePath) throws MalformedURLException, IOException, IllegalArgumentException, FeedException {

    SyndFeed feed = null;
    FileInputStream fis = null;
    try {
        fis = new FileInputStream(filePath);
        InputSource source = new InputSource(fis);
        SyndFeedInput input = new SyndFeedInput();
        feed = input.build(source);
    } finally {
        if (fis != null) {
            fis.close();
        }
    }

    return feed;
}
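A short usage sketch, assuming the problematic feed has already been downloaded (for example with a browser's "Save as...") and re-encoded; both the caller and the file path are hypothetical placeholders:

// Hypothetical caller: parse a feed that was saved locally after URL-based parsing failed.
public void importFromLocalCopy() throws Exception {
    // placeholder path - e.g. the file saved and re-encoded with Notepad++
    SyndFeed feed = getSyndFeedFromLocalFile("/tmp/problematic-feed.xml");
    LOG.info("Parsed local copy of feed '" + feed.getTitle() + "' with "
            + feed.getEntries().size() + " entries");
}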
Using the SyndFeed interface
Once the SyndFeed instance is created, it is used to extract the metadata of the podcast (like title, description, author, copyright etc.):
@SuppressWarnings("unchecked") public void setPodcastFeedAttributes(Podcast podcast, boolean feedPropertyHasBeenSet) throws Exception { SyndFeed syndFeed = podcast.getPodcastFeed() if(syndFeed!=null){ //set DESCRIPTION for podcast - used in search if(syndFeed.getDescription()!=null && !syndFeed.getDescription().equals("")){ String description = syndFeed.getDescription(); //out of description remove tags if any exist and store also short description String descWithoutTabs = description.replaceAll("\\<[^>]*>", ""); if(descWithoutTabs.length() > MAX_LENGTH_DESCRIPTION) { podcast.setDescription(descWithoutTabs.substring(0, MAX_LENGTH_DESCRIPTION)); } else { podcast.setDescription(descWithoutTabs); } } else { podcast.setDescription("NO DESCRIPTION AVAILABLE for FEED"); } //set TITLE - used in search String podcastTitle = syndFeed.getTitle(); podcast.setTitle(podcastTitle); //set author podcast.setAuthor(syndFeed.getAuthor()); //set COPYRIGHT podcast.setCopyright(syndFeed.getCopyright()); //set LINK podcast.setLink(syndFeed.getLink()); //set url link of the podcast's image when selecting the podcast in the main application - mostly used through SyndImage podcastImage = syndFeed.getImage(); if(null!= podcastImage){ if(podcastImage.getUrl() != null){ podcast.setUrlOfImageToDisplay(podcastImage.getUrl()); } else if (podcastImage.getLink() != null){ podcast.setUrlOfImageToDisplay(podcastImage.getLink()); } else { podcast.setUrlOfImageToDisplay(configBean.get("NO_IMAGE_LOCAL_URL")); } } else { podcast.setUrlOfImageToDisplay(configBean.get("NO_IMAGE_LOCAL_URL")); } podcast.setPublicationDate(null);//default value is null, if cannot be set //set url media link of the last episode - this is used when generating the ATOM and RSS feeds from the Start page for example for(SyndEntryImpl entry: (List)syndFeed.getEntries()){ //get the list of enclosures List enclosures = (List) entry.getEnclosures(); if(null != enclosures){ //if in the enclosure list is a media type (either audio or video), this will set as the link of the episode for(SyndEnclosureImpl enclosure : enclosures){ if(null!= enclosure){ podcast.setLastEpisodeMediaUrl(enclosure.getUrl()); break; } } } if(entry.getPublishedDate() == null){ LOG.warn("PodURL[" + podcast.getUrl() + "] - " + "COULD NOT SET publication date for podcast, default date 08.01.1983 will be used " ); } else { podcast.setPublicationDate(entry.getPublishedDate()); } //first episode in the list is last episode - normally (are there any exceptions?? TODO -investigate) break; } } }
Well, that’s all folks. Many thanks to the Rome creators and contributors, to the open source communities, to Google, Stack Overflow and to all the great people out there.
Thanks for sharing and connecting with us
Don’t forget to check out Podcastpedia.org – you might find it really interesting. We are grateful for your support.
Resources
- Rome project
- Reads and prints any RSS/Atom feed type
- Problem with charset and Rome – Stack Overflow
- JAVA: Resolving org.xml.sax.SAXParseException: Content is not allowed in prolog
- Stack Overflow – Getting strange characters when trying to read UTF-8 document from URL
P.S. The stack trace of the nasty “Content is not allowed in prolog” error is listed below:
2013-09-19 06:23:43,529 ERROR [org.podcastpedia.admin.service.impl.UpdateServiceImpl:?] -
com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 1: Content is not allowed in prolog.
	at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:226)
	at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:136)
	at org.podcastpedia.admin.service.utils.impl.UtilsImpl.getSyndFeedForUrl(UtilsImpl.java:552)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.getSyndFeedForUpdate(UpdateServiceImpl.java:472)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.getNewEpisodes(UpdateServiceImpl.java:389)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.updatePodcastById(UpdateServiceImpl.java:221)
	at org.podcastpedia.admin.service.impl.UpdateServiceImpl.updatePodcastsFromRange(UpdateServiceImpl.java:607)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
	at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:89)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)
Caused by: org.jdom.input.JDOMParseException: Error on line 1: Content is not allowed in prolog.
	at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
	at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:222)
	... 19 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
	at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
	... 20 more