DJ Web Scraper Example

The DJ Web Scraper is designed specifically to scrape articles. That means any web page you want to scrape must use the HTML article tag.

STEP 1: If you haven't already downloaded Data Juggernaut's SSIS custom Web Scraper you can do so now by clicking here. Installation and troubleshooting instructions can be found here.

STEP 2: Navigate to https://www.engadget.com/reviews/latest/page/1/ . This is the page we'll be scraping. This example uses Google Chrome, but any web browser with developer tools will work. To open developer tools in Chrome, press Ctrl+Shift+I.

STEP 3: Next we need to traverse the HTML for this page using the developer tools window we just opened. We are looking for the article tag. One nice feature of Chrome's developer tools is that they highlight the corresponding area of the web page as you click on the HTML tags. In the screenshot below, notice how the article is highlighted.
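If you would rather check for article tags programmatically than by eye, the same search can be sketched in Python using only the standard library. This is a minimal illustration, not part of the DJ Web Scraper itself: SAMPLE_HTML is a made-up stand-in for the page source, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ArticleTagFinder(HTMLParser):
    """Records the class attribute of every <article> tag it sees."""

    def __init__(self):
        super().__init__()
        self.article_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            # attrs is a list of (name, value) pairs; class may be absent
            self.article_classes.append(dict(attrs).get("class"))


# Hypothetical markup standing in for the real page source
SAMPLE_HTML = """
<div>
  <article><h2>Featured review</h2></article>
  <article class="o-hit"><h2>Second review</h2></article>
  <article class="o-hit"><h2>Third review</h2></article>
</div>
"""


def find_article_classes(html):
    """Return the class of each <article> tag, in document order."""
    finder = ArticleTagFinder()
    finder.feed(html)
    return finder.article_classes


print(find_article_classes(SAMPLE_HTML))  # [None, 'o-hit', 'o-hit']
```

An empty list would mean the page has no article tags, in which case it is not a fit for this scraper.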

STEP 4: The DJ Web Scraper scrapes 3 pieces of information: (1) the headline, (2) the summary, and (3) the image. In this example, we can see that the headline for the first article is a review of the Alienware m15, Dell's first thin gaming laptop. We're not going to worry about this first review, because its article tag differs from all the others: it doesn't have a class defined. We are looking for uniformity, and the rest of the articles do use a class, "o-hit".
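The "look for uniformity" idea can be sketched as a simple count: tally the class on each article tag and pick the one most articles share. The list below is hypothetical input that mirrors the page described above (one article with no class, the rest with "o-hit"), and the function name is an assumption for illustration.

```python
from collections import Counter


def most_common_article_class(article_classes):
    """Return the class shared by the most article tags, ignoring tags with no class."""
    counts = Counter(c for c in article_classes if c is not None)
    if not counts:
        return None  # no article on the page declares a class
    return counts.most_common(1)[0][0]


# Hypothetical input: the first article has no class, the others use "o-hit"
classes = [None, "o-hit", "o-hit", "o-hit"]
print(most_common_article_class(classes))  # o-hit
```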

STEP 5: Open your SSIS project in Visual Studio and drag the DJ Web Scraper custom component into your package. You can find it under Common. Next we will begin to populate the required fields. We already know the article class we want to use, so we can start with that. In the properties of the Web Scraper Task, find the Web Scraper section and enter the article class there.

STEP 6: We will work on the image next. In the developer tools window, navigate the HTML and find the class name for the image, the tag the image sits under, and the image attribute. Note that the image attribute is optional; not all web pages use image attributes.

STEP 7: Next, let's find the information we need for the summary.
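To make the three pieces of information concrete, here is a rough Python sketch of what extracting a headline, summary, and image from each "o-hit" article could look like. It is not how the DJ Web Scraper is implemented. Apart from "o-hit", every tag, class, and attribute name below (h2, summary, thumb, data-src) is a hypothetical placeholder for whatever you find in developer tools, and SAMPLE_HTML is invented markup. Falling back to src when the configured attribute is missing is also this sketch's own choice.

```python
from html.parser import HTMLParser

ARTICLE_CLASS = "o-hit"                      # from the tutorial
HEADLINE_TAG = "h2"                          # hypothetical
SUMMARY_TAG, SUMMARY_CLASS = "p", "summary"  # hypothetical
IMAGE_CLASS, IMAGE_ATTR = "thumb", "data-src"  # hypothetical


class ArticleScraper(HTMLParser):
    """Collects headline, summary, and image for each matching article."""

    def __init__(self):
        super().__init__()
        self.results = []
        self.current = None  # dict for the article being read, or None
        self.field = None    # text field currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "article" and ARTICLE_CLASS in classes:
            self.current = {"headline": "", "summary": "", "image": None}
        elif self.current is None:
            return  # outside a matching article, ignore everything
        elif tag == HEADLINE_TAG:
            self.field = "headline"
        elif tag == SUMMARY_TAG and SUMMARY_CLASS in classes:
            self.field = "summary"
        elif tag == "img" and IMAGE_CLASS in classes:
            # The image attribute is optional, so fall back to src
            self.current["image"] = attrs.get(IMAGE_ATTR) or attrs.get("src")

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] += data

    def handle_endtag(self, tag):
        if tag in (HEADLINE_TAG, SUMMARY_TAG):
            self.field = None
        elif tag == "article" and self.current is not None:
            for key in ("headline", "summary"):
                # Collapse runs of whitespace left over from the markup
                self.current[key] = " ".join(self.current[key].split())
            self.results.append(self.current)
            self.current = None


def scrape_articles(html):
    scraper = ArticleScraper()
    scraper.feed(html)
    return scraper.results


# Invented markup; the first article has no class and is skipped
SAMPLE_HTML = """
<article><h2>Featured review</h2></article>
<article class="o-hit">
  <img class="thumb" data-src="https://example.com/one.jpg">
  <h2><a href="/one">First headline</a></h2>
  <p class="summary">First summary.</p>
</article>
<article class="o-hit">
  <img class="thumb" src="https://example.com/two.jpg">
  <h2><a href="/two">Second headline</a></h2>
  <p class="summary">Second summary.</p>
</article>
"""

for row in scrape_articles(SAMPLE_HTML):
    print(row)
```

The second sample article has no data-src attribute, which shows why an optional image attribute needs a fallback.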

STEP 8: Lastly, we can set the page start and page end fields. The FREE version has a limit of 25 pages, but you can always reset the page start and page end fields and rerun. You will also want to review the filename and scrape results properties, which have been defaulted for you. Our recommendation is to create a directory called DataJuggernaut on the C: drive and save your scrape results there, along with the DJ Web Scraper .exe file. You can then save the web scraping results to your database for later use in your data science projects!
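If you need more than 25 pages, it can help to plan the page start and page end values for each run ahead of time. The sketch below splits a page range into runs of at most 25 pages. It assumes later pages follow the same URL pattern as page 1 (page/2/, page/3/, and so on), which the tutorial does not confirm, and the function names are hypothetical.

```python
PAGE_LIMIT = 25  # per-run limit of the free version, as stated above

# Assumption: every page follows the pattern of the page 1 URL
URL_TEMPLATE = "https://www.engadget.com/reviews/latest/page/{page}/"


def page_batches(first_page, last_page, limit=PAGE_LIMIT):
    """Split an inclusive page range into (start, end) pairs of at most `limit` pages."""
    batches = []
    start = first_page
    while start <= last_page:
        end = min(start + limit - 1, last_page)
        batches.append((start, end))
        start = end + 1
    return batches


def page_urls(start, end):
    """List the URLs covered by one run, from page start to page end inclusive."""
    return [URL_TEMPLATE.format(page=n) for n in range(start, end + 1)]


print(page_batches(1, 60))  # [(1, 25), (26, 50), (51, 60)]
print(page_urls(1, 2))
```

Each (start, end) pair corresponds to one run of the component with those values in the page start and page end fields.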

Have questions about this tutorial or about a page you're trying to scrape? We'd love to hear from you! Email us at DataJuggernaut@gmail.com