Have you ever wondered how some websites display combined data from various other websites? This bit of trickery can be done in a few ways, including having SELECT privileges on the sources’ databases, scraping data and using RSS feeds, or indeed any combination of these three methods.
In this tutorial we’ll focus on how to retrieve and combine data from other sources using RSS feeds. RSS, which stands for Rich Site Summary (and later Really Simple Syndication), has been around a long time. It stems from Ramanathan V. Guha’s work at Apple Computer’s Advanced Technology Group in the mid-1990s, and you probably use RSS feeds every day to display news among other useful data.
RSS feeds are generally XML files that contain data within organised tags. For example, if you go to Craigslist and browse through regular listings you will see the usual list of items, and the source code for that page is HTML. But if you look at the Craigslist RSS feed for those same items you won’t see HTML code in your browser; you’ll see an XML file that stores each listing within a parent ‘item’ tag.
Each item tag contains information for an entry such as title, date and description. Grabbing and aggregating RSS feeds with a simple PHP script is a fast and simple way to tap into your desired data and output the results you want. RSS feeds are provided by many large classified and auction websites (such as eBay, eBay Classifieds, Monster and Craigslist), and job websites such as CareerBuilder and Indeed let you obtain XML feeds that can be parsed too.
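To make the item structure concrete, here is a minimal sketch that reads a small, made-up feed with PHP’s built-in SimpleXML extension. The sample feed, the listItems() name and the example.com links are assumptions for illustration only; real feeds carry more fields, and this is not the code from the tutorial’s download.

```php
<?php
// Hypothetical sample feed: two 'item' tags inside a channel.
$sampleFeed = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Example listings</title>
    <item>
      <title>Carpenter wanted</title>
      <link>http://example.com/jobs/1</link>
      <pubDate>Mon, 02 Jun 2014 09:00:00 +0000</pubDate>
      <description>Contract work in London.</description>
    </item>
    <item>
      <title>Electrician needed</title>
      <link>http://example.com/jobs/2</link>
      <pubDate>Tue, 03 Jun 2014 10:30:00 +0000</pubDate>
      <description>Short-term rewiring job.</description>
    </item>
  </channel>
</rss>
XML;

// Turn every <item> in the feed into a plain PHP array.
function listItems($xmlString)
{
    $feed = simplexml_load_string($xmlString);
    if ($feed === false) {
        return array(); // not parseable as XML
    }
    $items = array();
    foreach ($feed->channel->item as $item) {
        $items[] = array(
            'title'       => (string) $item->title,
            'link'        => (string) $item->link,
            'date'        => (string) $item->pubDate,
            'description' => (string) $item->description,
        );
    }
    return $items;
}

// Print one line per entry.
foreach (listItems($sampleFeed) as $entry) {
    echo $entry['title'] . ' (' . $entry['date'] . ")\n";
}
```

Each array returned by listItems() mirrors one item tag, which is the shape an aggregator needs before it merges entries from several feeds.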
In many instances, you can just use a specific URL to acquire RSS feeds, while at other times you’ll need to use an API with a publisher key in order to obtain the feeds. One example of a website that requires a publisher key for its API is the popular job site Indeed.com. Since we had to pick an example topic, we’ll explain how to aggregate RSS feeds from the various sources we’ve mentioned above.
Our examples will show you how to narrow down a job search by location and job position, but you could just as easily make this a personal RSS and XML feed aggregator that displays bargains on items you want to buy from eBay, Craigslist and so on. The output will load in your browser from a local web server, such as a LAMP setup running on the Ubuntu distribution (distro). Installing LAMP only takes a minute or two, but we’ll explain how to do that as well. When your output is shown in the web browser, each entry will link back to the original post and you can do what you want from there, eg enquire about the job position.
So let’s roll up our sleeves and find out how to get the edge on data. The first thing you’ll need to get started with aggregating and parsing RSS feeds is a working web server along with a few extra packages, such as PHP. Since it only takes a minute, the instructions to set it up are shown next. If you’re thinking to yourself “Not another web server installation”, you can skip ahead. To install the Apache server, run the commands below:
sudo apt-get update
sudo apt-get install apache2
Next, install the PHP packages with:
sudo apt-get install php5 libapache2-mod-php5 php5-mcrypt
In this tutorial, we’ll use a downloadable script to parse RSS feeds and also show how to do the parsing manually. The parser we’ll be using is Magpie RSS. Note: This is hosted on Sourceforge (http://bit.ly/MagpieRSS) but there are alternatives hosted elsewhere, such as SimplePie (http://simplepie.org/downloads).
Once you download Magpie, place it inside the www or html folder. New installations of Ubuntu with Apache locate the web folder at /var/www/html, while older installations such as Ubuntu 12.04 use /var/www as the root web folder.
Whatever your root folder may be, create a folder inside it called rss. Within this folder, add the extracted Magpie folder, which will have a name like magpierss-0.72. For simplicity, right-click the folder and rename it to magpie. In addition to the magpie folder, you’ll have two files: index.php and customs.php. The index.php file displays your content when you open it in a browser, and customs.php contains the functions that take the URLs and gather the data from the RSS sources (Craigslist, eBay Classifieds, Monster.com, Careerbuilder.com and Indeed.com).
Note: Most of the feeds in the coding sample are parsed with Magpie, with the exception of those from Indeed and Careerbuilder. The parsing for the latter two is done using PHP’s built-in DOMDocument class. The code used in this tutorial is provided on the Linux Format website, which means the only lines you may want to edit are those of the $urls array that sits near line 3 in the index.php file. This is your source list of URLs for the RSS feeds that you want to aggregate.
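As a rough illustration of the DOMDocument approach, the sketch below parses a small inline XML string. The element names (result, jobtitle, url), the parseResults() name and the example.com links are assumptions made for this example; check the actual XML returned by your provider, and treat this as a sketch rather than the code in customs.php.

```php
<?php
// Hypothetical XML response; real API responses will differ.
$sampleXml = '<response><results>'
    . '<result><jobtitle>PHP Developer</jobtitle>'
    . '<url>http://example.com/job/1</url></result>'
    . '<result><jobtitle>Linux Admin</jobtitle>'
    . '<url>http://example.com/job/2</url></result>'
    . '</results></response>';

// Collect the title and link of every <result> element.
function parseResults($xmlString)
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xmlString)) {
        return array(); // malformed XML
    }
    $jobs = array();
    foreach ($doc->getElementsByTagName('result') as $node) {
        $title = $node->getElementsByTagName('jobtitle')->item(0);
        $url   = $node->getElementsByTagName('url')->item(0);
        $jobs[] = array(
            'title' => $title ? $title->nodeValue : '',
            'link'  => $url ? $url->nodeValue : '',
        );
    }
    return $jobs;
}

// Output each job as a link back to the original post.
foreach (parseResults($sampleXml) as $job) {
    echo '<a href="' . htmlspecialchars($job['link']) . '">'
        . htmlspecialchars($job['title']) . "</a><br>\n";
}
```

Escaping the values with htmlspecialchars() before echoing them is a sensible habit, since the text comes from a third-party source.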
However, you may want to make some changes to the custom parser starting near line 108 in the file customs.php. The custom parser is set up for Indeed and Careerbuilder, but you will have to uncomment their URLs since both require your publisher key. Once you sign up and get your developer keys, all you need to do is replace YOUR_KEY_HERE with the proper keys. The Indeed Publisher Program can be accessed at www.indeed.com/publisher, while information on how to gather search results from the Careerbuilder website is available at http://bit.ly/CareerBuilderAPI.
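If you would rather not edit the key into each URL by hand, the placeholder can be swapped in code. This is only a sketch: the api.example.com URL and its query parameters are made up, and the function names withPublisherKey() and isConfigured() are hypothetical helpers, not part of the tutorial’s download or of either provider’s API.

```php
<?php
// Replace the YOUR_KEY_HERE placeholder with a real publisher key.
function withPublisherKey($urlTemplate, $key)
{
    return str_replace('YOUR_KEY_HERE', rawurlencode($key), $urlTemplate);
}

// True once the placeholder is gone, ie the URL is ready to fetch.
function isConfigured($url)
{
    return strpos($url, 'YOUR_KEY_HERE') === false;
}

// Hypothetical template URL; the host and parameters are placeholders.
$template = 'http://api.example.com/search?publisher=YOUR_KEY_HERE&q=php';
$ready    = withPublisherKey($template, 'abc123');

echo isConfigured($template) ? "template ready\n" : "template needs a key\n";
echo isConfigured($ready) ? "URL ready\n" : "URL needs a key\n";
```

A check like isConfigured() lets a script skip any source whose key has not been filled in yet instead of sending a request that is bound to fail.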
Now that we’ve covered the basics, it’s time to go over the code and explain how it works. Start by opening the index.php file in a web browser. The URL on your local machine would be http://localhost/rss or http://localhost/rss/index.php. Near the top of the file, an array called $urls is created; it contains the feeds you want to aggregate. Getting RSS feeds from various sources is quite simple, but each provider has a different method for finding them, and some are definitely easier to find than others. Let’s start with Craigslist.
If you navigate to http://craigslist.co.uk and look at jobs in skilled trade/craft in London, you will see many listings. You could just look at the bottom right corner of your screen and click the RSS button to get the feed, or you could narrow your search and get a feed from that. To narrow the search, check the ‘contract’ box at the upper left of your screen. Next, click the yellow RSS icon on the bottom right and you will see the feed page. The URL in your browser is an example of what you could add to the list of feed URLs at the top of the index.php file.
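A source list along these lines might look like the sketch below. The two URLs are placeholders rather than real feeds, and the validation loop is an extra safety step assumed for this example; it is not necessarily present in the tutorial’s index.php.

```php
<?php
// Source list of feeds to aggregate. Both URLs are placeholders;
// paste in the feed URLs copied from your browser instead.
$urls = array(
    'http://example.org/search/trd?format=rss',
    'http://example.com/jobs/feed.xml',
    'not a url', // a typo like this is filtered out below
);

// Keep only entries that look like well-formed URLs.
$validUrls = array();
foreach ($urls as $url) {
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        $validUrls[] = $url;
    }
}

echo count($validUrls) . " feed(s) ready to fetch\n";
```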
Another example would be gathering eBay RSS feeds, which aren’t so obvious. However, you can narrow your search to whatever you want and then add &_rss=1 to the end of the URL in your web browser to see a new page with the desired RSS feed. Back in index.php, the next two lines after the $urls array include the file Magpie uses for parsing and the customs.php file, which provides the custom functions and checks.
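The eBay trick of appending _rss=1 can also be done in code. The helper below is a small sketch; toRssUrl() is a hypothetical name and the search URL is a placeholder on example.com, not a real eBay address.

```php
<?php
// Append the _rss=1 parameter to a search URL unless it is already there.
function toRssUrl($searchUrl)
{
    if (strpos($searchUrl, '_rss=1') !== false) {
        return $searchUrl; // already an RSS URL
    }
    // Use '?' for the first parameter, '&' for any later one.
    $separator = (strpos($searchUrl, '?') === false) ? '?' : '&';
    return $searchUrl . $separator . '_rss=1';
}

// Placeholder search URL standing in for a narrowed-down search.
echo toRssUrl('http://www.example.com/sch/i.html?_nkw=laptop') . "\n";
```

The separator check matters because a URL with no query string needs ? rather than & before the first parameter.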
Now, things get a little trickier. If you look at around line 108 in customs.php, you’ll see a function that checks whether the $urls array exists. When it confirms the array exists, it runs through a foreach loop for each URL. If the first set of conditions is met, it then checks the URL to see whether the string doesn’t contain the text ‘indeed’
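The routing idea described here can be sketched as follows. This is an illustration only, not the function from customs.php: chooseParser() and groupByParser() are hypothetical names, the URLs are placeholders, and the ‘careerbuilder’ check is assumed from the earlier note that both Indeed and Careerbuilder are parsed with DOMDocument.

```php
<?php
// Decide which parser a URL should go to, based on its text.
function chooseParser($url)
{
    if (stripos($url, 'indeed') !== false
        || stripos($url, 'careerbuilder') !== false) {
        return 'dom'; // custom DOMDocument parsing
    }
    return 'magpie'; // everything else goes through Magpie
}

// Group the feed URLs by parser, after checking the list exists.
function groupByParser($urls)
{
    $groups = array('magpie' => array(), 'dom' => array());
    if (!is_array($urls)) {
        return $groups; // no source list, nothing to do
    }
    foreach ($urls as $url) {
        $groups[chooseParser($url)][] = $url;
    }
    return $groups;
}

// Placeholder URLs for illustration.
$groups = groupByParser(array(
    'http://example.org/search/trd?format=rss',
    'http://api.indeed.example/search?q=php',
));
echo count($groups['magpie']) . ' Magpie feed(s), '
    . count($groups['dom']) . " custom feed(s)\n";
```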