org.xenbase.scraper
Class BasicScraper

java.lang.Object
  extended by org.xenbase.scraper.BasicScraper
Direct Known Subclasses:
Scraper_CurrBio_DevCell_Cell, Scraper_DevDyn, Scraper_Development, Scraper_JCellBio, Scraper_MechDev_DevBio, Scraper_PNAS

public abstract class BasicScraper
extends java.lang.Object


Constructor Summary
BasicScraper()
           
 
Method Summary
 byte[] getData(java.lang.String url)
          Takes a URL as a String and returns the contents found at that URL as a byte array.
abstract  java.lang.String getRedirURL(java.lang.String url)
          The URLs come from PubMed and each journal publisher's website is different, so this method follows a series of HTTP 301 redirects, then searches the resulting page for the URL of the full article.
abstract  ScrapedData scrape(java.lang.String url)
          Takes the URL produced by getRedirURL and returns the images and captions of that article.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BasicScraper

public BasicScraper()
Method Detail

scrape

public abstract ScrapedData scrape(java.lang.String url)
                            throws java.lang.Exception,
                                   java.lang.Error
Takes the URL produced by getRedirURL and returns the images and captions of that article. This is the core of the scraper. Because each journal's pages are laid out differently, each subclass does its own string parsing.

Parameters:
url - Direct URL to the full article (usually produced by getRedirURL(String url))
Returns:
ScrapedData - The object containing all the images and captions
Throws:
java.lang.Exception
java.lang.Error
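
The string parsing described above might look roughly like the following sketch. It is not taken from any of the known subclasses: the page markup (a div with class "fig" holding an img and a caption paragraph), the class FigureParsingSketch, and the FigureEntry holder are all assumptions made for illustration, since the fields of ScrapedData are not documented on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FigureParsingSketch {

    // Hypothetical stand-in for one image/caption pair; the real
    // ScrapedData layout is not documented here.
    static final class FigureEntry {
        final String imageUrl;
        final String caption;

        FigureEntry(String imageUrl, String caption) {
            this.imageUrl = imageUrl;
            this.caption = caption;
        }
    }

    // Assumed markup, invented for this example:
    // <div class="fig"><img src="..."><p class="caption">...</p></div>
    private static final Pattern FIGURE = Pattern.compile(
        "<div class=\"fig\">\\s*<img src=\"([^\"]+)\"[^>]*>\\s*"
            + "<p class=\"caption\">(.*?)</p>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns every image URL and caption found in the page source.
    static List<FigureEntry> parseFigures(String html) {
        List<FigureEntry> figures = new ArrayList<>();
        Matcher m = FIGURE.matcher(html);
        while (m.find()) {
            // Drop inline tags from the caption and collapse whitespace.
            String caption = m.group(2)
                .replaceAll("<[^>]+>", "")
                .replaceAll("\\s+", " ")
                .trim();
            figures.add(new FigureEntry(m.group(1), caption));
        }
        return figures;
    }
}
```

A concrete subclass would presumably swap in a pattern matching its own publisher's markup and copy the results into a ScrapedData instance.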

getRedirURL

public abstract java.lang.String getRedirURL(java.lang.String url)
                                      throws java.lang.Exception,
                                             java.lang.Error
The URLs come from PubMed, so reaching the full article means following a series of HTTP 301 redirects, then searching the resulting page for the URL of the full article. Because each publisher's website is different, every journal publisher needs its own implementation of this method.

Parameters:
url - URL to the full article from PubMed
Returns:
String - The actual URL of the full journal article
Throws:
java.lang.Exception
java.lang.Error
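
The two steps described above (follow the redirects, then search the landing page for the full-article link) can be sketched as follows. This is an illustration under stated assumptions, not the code of any subclass: it uses the JDK's HttpURLConnection to stay self-contained, the class RedirectSketch and the hop limit are invented, and the "Full Text" link text is a hypothetical example of a publisher-specific marker.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RedirectSketch {

    // Assumed limit so that a redirect loop cannot run forever.
    private static final int MAX_HOPS = 10;

    // Hypothetical publisher-specific marker: a link labelled "Full Text".
    private static final Pattern FULL_TEXT_LINK = Pattern.compile(
        "<a\\s+[^>]*href=\"([^\"]+)\"[^>]*>\\s*Full Text\\s*</a>",
        Pattern.CASE_INSENSITIVE);

    // Follows 3xx responses by hand and returns the last URL reached.
    static String followRedirects(String startUrl) throws IOException {
        String current = startUrl;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false);
            try {
                int status = conn.getResponseCode();
                String location = conn.getHeaderField("Location");
                if (status < 300 || status >= 400 || location == null) {
                    return current; // not a redirect: this is the landing page
                }
                current = resolve(current, location);
            } finally {
                conn.disconnect();
            }
        }
        throw new IOException("Too many redirects starting from " + startUrl);
    }

    // Resolves a possibly relative link against the page it came from.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    // Returns the absolute full-article URL, or null if no link matches.
    static String findFullTextUrl(String pageUrl, String html) {
        Matcher m = FULL_TEXT_LINK.matcher(html);
        return m.find() ? resolve(pageUrl, m.group(1)) : null;
    }
}
```

Only findFullTextUrl would need to change from one publisher to the next; the redirect loop is the same for every site.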

getData

public byte[] getData(java.lang.String url)
               throws java.lang.Exception,
                      java.lang.Error
Takes a URL as a String and returns the contents found at that URL as a byte array. This is how pages and images are actually downloaded. The method uses the Apache HttpClient class, one of the few HTTP classes that provide enough functionality for browser spoofing, which was required to access the journal websites correctly.

Parameters:
url - URL of the page or image to download
Returns:
byte[] - The contents retrieved from the URL
Throws:
java.lang.Exception
java.lang.Error
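
As a rough illustration of downloading a page or image as bytes while presenting a browser-like identity, consider the sketch below. It deliberately substitutes the JDK's HttpURLConnection for Apache HttpClient so that it has no external dependencies, so it does not show how getData is actually implemented. The class name DownloadSketch, the User-Agent value, and the timeouts are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

final class DownloadSketch {

    // Assumed browser-like header value; a real scraper would likely
    // copy the string sent by an actual browser.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";

    // Fetches the resource at the given URL and returns its raw bytes.
    static byte[] download(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000); // milliseconds, assumed value
        conn.setReadTimeout(30_000);    // milliseconds, assumed value
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    // Copies a stream into memory; works for HTML and image data alike.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Returning byte[] rather than a String keeps the same method usable for binary image files and for HTML, whose text can be decoded later once the page's character encoding is known.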