The DJ Web Scraper is designed specifically to scrape articles. That means any web page
you want to scrape MUST use the HTML article tag.
STEP 1: If you haven't already downloaded Data Juggernaut's SSIS custom Web Scraper,
you can do so now by clicking here.
Installation and troubleshooting instructions can be found here.
STEP 2: Navigate to https://www.engadget.com/reviews/latest/page/1/ . This is the website we'll be scraping. In
this example we will be using Google Chrome, but you can use any web browser that has developer tools available. To open
developer tools in Chrome, press Ctrl+Shift+I.
STEP 3: Next we need to traverse the HTML code for this page using the developer tools window we just opened.
We are looking for the article tag. One of the nice features of Chrome's developer tools is that it highlights the
matching area on the web page as you click on the HTML tags. In the screenshot below, notice how the article is highlighted.
STEP 4: The DJ Web Scraper will scrape 3 pieces of information: (1) the headline, (2) the summary, and (3) the image. In
this example, we can see that the headline for the first article is a review of the Alienware m15: Dell's first thin gaming laptop.
We're not going to worry about this first review, because its article tag differs from all the other article tags:
it doesn't have a class defined. We are looking for uniformity, and the rest of the articles do use a class, "o-hit".
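If you would like to double-check outside the browser which article tags share a class, the idea can be sketched in a few lines of Python. This is only an illustration using the standard library's html.parser, not how the DJ Web Scraper works internally; the sample markup and the ArticleClassCollector name are made up for the example, and only the "o-hit" class comes from the page described above.

```python
from html.parser import HTMLParser

# Hypothetical markup that mimics the structure described in Step 4.
# It is NOT copied from the real page.
SAMPLE_HTML = """
<main>
  <article><h2>Featured review</h2></article>
  <article class="o-hit"><h2>Second review</h2></article>
  <article class="o-hit"><h2>Third review</h2></article>
</main>
"""


class ArticleClassCollector(HTMLParser):
    """Record the class attribute of every <article> tag encountered."""

    def __init__(self):
        super().__init__()
        # One entry per <article>; None when the tag has no class attribute.
        self.article_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_classes.append(dict(attrs).get("class"))


collector = ArticleClassCollector()
collector.feed(SAMPLE_HTML)

# Keep only the articles that carry the uniform class we want to scrape.
uniform = [
    c for c in collector.article_classes
    if c is not None and "o-hit" in c.split()
]
print(len(collector.article_classes), "article tags,", len(uniform), "with class o-hit")
```

In the sample, the first article has no class and is left out, which mirrors why the featured review is skipped in this tutorial.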
STEP 5: Open your SSIS project in Visual Studio and drag the DJ Web Scraper custom component into your package.
You can find it under Common. Next we will begin to populate the required fields.
We already know the specific article class we want to use, so we can start with that. In the properties of the Web Scraper
Task, find the Web Scraper section.
STEP 6: We will work on the image next. In the developer tools window, navigate the HTML and find the class
name for the image, the tag the image sits under, and the image attribute. Note that the image
attribute is optional. Not all web pages will use image attributes.
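To make the three image values more concrete, here is a small Python sketch of how a tag name, a class name, and an optional attribute can together pin down an image URL. It assumes the attribute holds the image address (for example src); the ImageFinder class, the "stretch-img" class name, and the sample markup are all hypothetical and are not taken from the DJ Web Scraper or from the real page.

```python
from html.parser import HTMLParser

# Hypothetical article markup; class names and URLs are invented.
SAMPLE_ARTICLE = """
<article class="o-hit">
  <a class="o-hit__link" href="/review-1">
    <img class="stretch-img" src="https://example.com/review-1.jpg" alt="Review 1">
  </a>
</article>
"""


class ImageFinder(HTMLParser):
    """Collect one attribute value from every tag matching a name and class."""

    def __init__(self, tag, class_name, attribute):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            value = attr_map.get(self.attribute)
            # The attribute is optional, so skip tags that do not have it.
            if value is not None:
                self.values.append(value)


finder = ImageFinder(tag="img", class_name="stretch-img", attribute="src")
finder.feed(SAMPLE_ARTICLE)
print(finder.values)
```

If a page stores the image address in a different attribute, only the attribute argument would need to change in this sketch.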
STEP 7: Next, let's find the information we need for the summary.
STEP 8: Lastly, we can set the page start and page end fields. The FREE version has a limit of 25 pages per run, but you can
always reset the page start and page end fields and then rerun. You will also want to review the filename and scrape results properties,
which have been defaulted for you. Our recommendation is to create a directory called DataJuggernaut on the C: drive and save your
scrape results there, along with the DJ Web Scraper .exe file. You can then save the web scraping results to
your database for later use in your data science projects!
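Because of the 25-page limit, a longer scrape has to be split into several runs. As a rough planning aid, the Python sketch below works out which page start and page end values to enter for each run. The page_batches helper is hypothetical and is not part of the DJ Web Scraper; the URL pattern is inferred from the address used in Step 2.

```python
# URL pattern inferred from the Step 2 address (page number at the end).
BASE_URL = "https://www.engadget.com/reviews/latest/page/{page}/"
FREE_PAGE_LIMIT = 25  # page limit stated for the FREE version


def page_batches(first_page, last_page, limit=FREE_PAGE_LIMIT):
    """Yield (page_start, page_end) pairs that cover first_page..last_page."""
    start = first_page
    while start <= last_page:
        end = min(start + limit - 1, last_page)
        yield start, end
        start = end + 1


batches = list(page_batches(1, 60))
print(batches)  # [(1, 25), (26, 50), (51, 60)]

# First and last URL covered by the first run.
print(BASE_URL.format(page=batches[0][0]))
print(BASE_URL.format(page=batches[0][1]))
```

Each pair corresponds to one run: enter the first number as the page start, the second as the page end, then rerun with the next pair.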
Have questions about this tutorial or about a page you're trying to scrape? We'd love to hear from you! Email us at DataJuggernaut@gmail.com.