Generate sitemaps with sitemapgen4j

This post is about automatically generating sitemaps. I chose this topic because it is fresh in my mind: I have recently started using sitemaps for Podcastpedia.org. After some research I came to the conclusion that this would be a good thing. At the time of posting, Google had 3171 URLs indexed for the website (it has been live for 3 months now), whereas after generating sitemaps there were 87,818 URLs submitted. I am curious how many will get indexed after that…
Because I didn’t want to enter over 80k URLs manually, I had to come up with an automated solution. Since Podcastpedia.org was developed in Java, sitemapgen4j was an easy choice.
Maven dependency
Check out the latest version here:
<dependency>
    <groupId>com.google.code</groupId>
    <artifactId>sitemapgen4j</artifactId>
    <version>1.0.1</version>
</dependency>
The podcasts from Podcastpedia.org have an update frequency (DAILY, WEEKLY, MONTHLY, TERMINATED, UNKNOWN) associated with them, so it made sense to organize sub-sitemaps by frequency and use the lastMod and changeFreq properties accordingly. This way you can modify the lastMod of the daily sitemap in the sitemap index without modifying the lastMod of the monthly sitemap, and the Google bot doesn’t need to check the monthly sitemap every day.
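The generation method in this post relies on a helper named changeFrequencyFromUpdateFrequency that is not shown. A minimal sketch of what such a mapping might look like is given here. Both enums are local stand-ins so the example compiles on its own (in the real project they would be the site's UpdateFrequencyType and the library's change-frequency enum), and the fallback chosen for UNKNOWN is purely an assumption.

```java
// Hypothetical stand-in for the site's own enum (values taken from the post).
enum UpdateFrequencyType { DAILY, WEEKLY, MONTHLY, TERMINATED, UNKNOWN }

// Hypothetical stand-in for the library's change-frequency enum;
// the values mirror the ones allowed by the sitemap protocol.
enum ChangeFreq { ALWAYS, HOURLY, DAILY, WEEKLY, MONTHLY, YEARLY, NEVER }

final class FrequencyMapping {

    private FrequencyMapping() {
    }

    /** Maps a podcast's update frequency to a sitemap change frequency. */
    static ChangeFreq changeFrequencyFromUpdateFrequency(UpdateFrequencyType type) {
        switch (type) {
            case DAILY:
                return ChangeFreq.DAILY;
            case WEEKLY:
                return ChangeFreq.WEEKLY;
            case MONTHLY:
                return ChangeFreq.MONTHLY;
            case TERMINATED:
                // nothing new will be published, so the page never changes
                return ChangeFreq.NEVER;
            default:
                // UNKNOWN: assumption, pick a moderate value
                return ChangeFreq.WEEKLY;
        }
    }
}
```

With a mapping like this, TERMINATED content ends up with a changefreq of never, which matches the episode entries in the generated sitemap.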
Generation of sitemap
Method : createSitemapForPodcastsWithFrequency – generates one sitemap file
/**
 * Creates sitemap for podcasts/episodes with update frequency
 *
 * @param updateFrequency update frequency of the podcasts
 * @param sitemapsDirectoryPath the location where the sitemap will be generated
 */
public void createSitemapForPodcastsWithFrequency(
        UpdateFrequencyType updateFrequency, String sitemapsDirectoryPath) throws MalformedURLException {

    // number of URLs counted
    int nrOfURLs = 0;

    File targetDirectory = new File(sitemapsDirectoryPath);
    WebSitemapGenerator wsg = WebSitemapGenerator.builder("https://github.com/CodepediaOrg/podcastpedia", targetDirectory)
            .fileNamePrefix("sitemap_" + updateFrequency.toString()) // name of the generated sitemap
            .gzip(true) // recommended - as it decreases the file's size significantly
            .build();

    // reads reachable podcasts with episodes from the database for the given update frequency
    List<Podcast> podcasts = readDao.getPodcastsAndEpisodeWithUpdateFrequency(updateFrequency);
    for (Podcast podcast : podcasts) {
        String url = "https://github.com/CodepediaOrg/podcastpedia" + "/podcasts/" + podcast.getPodcastId() + "/" + podcast.getTitleInUrl();
        WebSitemapUrl wsmUrl = new WebSitemapUrl.Options(url)
                .lastMod(podcast.getPublicationDate()) // date of the last published episode
                .priority(0.9) // high priority, just below the start page, which has a default priority of 1
                .changeFreq(changeFrequencyFromUpdateFrequency(updateFrequency))
                .build();
        wsg.addUrl(wsmUrl);
        nrOfURLs++;

        for (Episode episode : podcast.getEpisodes()) {
            url = "https://github.com/CodepediaOrg/podcastpedia" + "/podcasts/" + podcast.getPodcastId() + "/" + podcast.getTitleInUrl()
                    + "/episodes/" + episode.getEpisodeId() + "/" + episode.getTitleInUrl();

            // build websitemap url
            wsmUrl = new WebSitemapUrl.Options(url)
                    .lastMod(episode.getPublicationDate()) // publication date of the episode
                    .priority(0.8) // high priority but smaller than podcast priority
                    .changeFreq(changeFrequencyFromUpdateFrequency(UpdateFrequencyType.TERMINATED))
                    .build();
            wsg.addUrl(wsmUrl);
            nrOfURLs++;
        }
    }

    // One sitemap can contain a maximum of 50,000 URLs.
    if (nrOfURLs <= 50000) {
        wsg.write();
    } else {
        // in this case multiple files will be created and sitemap_index.xml file describing the files which will be ignored
        // workaround to resolve the issue described at https://code.google.com/p/sitemapgen4j/issues/attachmentText?id=8&aid=80003000&name=Admit_Single_Sitemap_in_Index.patch&token=p2CFJZ5OOE5utzZV1UuxnVzFJmE%3A1375266156989
        wsg.write();
        wsg.writeSitemapsWithIndex();
    }
}
The generated file contains URLs to podcasts and episodes, with changeFreq and lastMod set accordingly.
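The 50,000-URL check in the method above is the one piece of arithmetic worth spelling out. The sketch below is not how sitemapgen4j splits files internally; it only illustrates, under that caveat, how a list of URLs breaks into sitemap-sized chunks. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SitemapChunks {

    /** Maximum number of URLs a single sitemap file may contain. */
    static final int MAX_URLS_PER_SITEMAP = 50_000;

    private SitemapChunks() {
    }

    /** Splits the given list into consecutive chunks of at most maxPerFile elements. */
    static <T> List<List<T>> partition(List<T> urls, int maxPerFile) {
        if (maxPerFile <= 0) {
            throw new IllegalArgumentException("maxPerFile must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < urls.size(); start += maxPerFile) {
            int end = Math.min(start + maxPerFile, urls.size());
            // copy the view so every chunk is independent of the source list
            chunks.add(new ArrayList<>(urls.subList(start, end)));
        }
        return chunks;
    }
}
```

Calling SitemapChunks.partition(urls, SitemapChunks.MAX_URLS_PER_SITEMAP) on a list that fits within the limit yields a single chunk, which corresponds to the simple wsg.write() branch.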
Snippet from the generated sitemap_MONTHLY.xml:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >
  <url>
    <loc>https://github.com/CodepediaOrg/podcastpedia/podcasts/581/heise-Developer-SoftwareArchitekTOUR-Podcast</loc>
    <lastmod>2013-07-05T17:01+02:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://github.com/CodepediaOrg/podcastpedia/podcasts/581/heise-Developer-SoftwareArchitekTOUR-Podcast/episodes/130/Episode-40-Mobile-Multiplattform-Anwendungen-am-Beispiel-von-jQuery-Mobile</loc>
    <lastmod>2013-07-05T17:01+02:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://github.com/CodepediaOrg/podcastpedia/podcasts/581/heise-Developer-SoftwareArchitekTOUR-Podcast/episodes/90/Episode-39-Entwicklung-fr-Embedded-Systeme-mit-mbeddr</loc>
    <lastmod>2013-03-11T15:40+01:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
  </url>
  .....
</urlset>
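To make the link between the builder options and the XML above explicit, here is a small sketch that renders one url entry by hand using only the JDK. It is an illustration, not the library's implementation: the class name, method names and the minute-precision date pattern are assumptions chosen to reproduce the format seen in the snippet.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class UrlEntrySketch {

    // W3C datetime with minute precision, e.g. 2013-07-05T17:01+02:00
    private static final DateTimeFormatter W3C_MINUTES =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mmXXX", Locale.ROOT);

    private UrlEntrySketch() {
    }

    /** Renders a single <url> element with loc, lastmod, changefreq and priority. */
    static String render(String loc, OffsetDateTime lastMod, String changeFreq, double priority) {
        return "<url>\n"
                + "  <loc>" + escapeXml(loc) + "</loc>\n"
                + "  <lastmod>" + W3C_MINUTES.format(lastMod) + "</lastmod>\n"
                + "  <changefreq>" + escapeXml(changeFreq) + "</changefreq>\n"
                + "  <priority>" + String.format(Locale.ROOT, "%.1f", priority) + "</priority>\n"
                + "</url>";
    }

    /** Escapes the five characters that are special in XML text and attributes. */
    static String escapeXml(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```

The ampersand is escaped first so that the entities produced by the later replacements are not escaped a second time.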
Generation of sitemap index
After sitemaps are generated for all update frequencies, a sitemap index is generated to list all the sitemaps. This file will be submitted to Google Webmaster Tools.
Method : createSitemapIndexFile
/**
 * Creates a sitemap index from all the files from the specified directory excluding the test files and sitemap_index.xml files
 *
 * @param sitemapsDirectoryPath the location where the sitemap index will be generated
 */
public void createSitemapIndexFile(String sitemapsDirectoryPath) throws MalformedURLException {

    File targetDirectory = new File(sitemapsDirectoryPath);
    // generate the sitemap index
    File outFile = new File(sitemapsDirectoryPath + "/sitemap_index.xml");
    SitemapIndexGenerator sig = new SitemapIndexGenerator("https://github.com/CodepediaOrg/podcastpedia", outFile);

    // get all the files from the specified directory
    File[] files = targetDirectory.listFiles();
    for (int i = 0; i < files.length; i++) {
        // skip the index file itself and the test files: both conditions must hold
        boolean isNotSitemapIndexFile = !files[i].getName().startsWith("sitemap_index")
                && !files[i].getName().startsWith("test");
        if (isNotSitemapIndexFile) {
            SitemapIndexUrl sitemapIndexUrl = new SitemapIndexUrl(
                    "https://github.com/CodepediaOrg/podcastpedia/" + files[i].getName(),
                    new Date(files[i].lastModified()));
            sig.addUrl(sitemapIndexUrl);
        }
    }
    sig.write();
}
The process is quite simple: the method looks in the folder where the sitemap files were created and generates a sitemap index from these files, setting each entry's lastmod value to the time the file was last modified (the files[i].lastModified() call).
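The exclusion rule stated in the method's Javadoc (skip the test files and the sitemap_index files) is a conjunction: a file belongs in the index only if its name starts with neither prefix. A minimal, library-free sketch of that check, with hypothetical class and method names, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

final class SitemapFileFilter {

    private SitemapFileFilter() {
    }

    /** Returns true if the file should be listed in the sitemap index. */
    static boolean belongsInIndex(String fileName) {
        // both conditions must hold, hence && rather than ||
        return !fileName.startsWith("sitemap_index") && !fileName.startsWith("test");
    }

    /** Keeps only the file names that should be listed, preserving their order. */
    static List<String> selectForIndex(List<String> fileNames) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (belongsInIndex(name)) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

Joining the two negated checks with || instead would accept every file, because no name can start with both prefixes at once.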
Et voilà sitemap_index.xml:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://github.com/CodepediaOrg/podcastpedia/sitemap_DAILY.xml.gz</loc>
    <lastmod>2013-08-01T07:24:38.450+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://github.com/CodepediaOrg/podcastpedia/sitemap_MONTHLY.xml.gz</loc>
    <lastmod>2013-08-01T07:25:01.347+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://github.com/CodepediaOrg/podcastpedia/sitemap_TERMINATED.xml.gz</loc>
    <lastmod>2013-08-01T07:25:10.392+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://github.com/CodepediaOrg/podcastpedia/sitemap_UNKNOWN.xml.gz</loc>
    <lastmod>2013-08-01T07:26:33.067+02:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://github.com/CodepediaOrg/podcastpedia/sitemap_WEEKLY.xml.gz</loc>
    <lastmod>2013-08-01T07:24:53.957+02:00</lastmod>
  </sitemap>
</sitemapindex>
If you liked this, please show your support by helping us with Podcastpedia.org
Source code
- SitemapService.zip – the archive contains the interface and class implementation for the methods described in the post