Wednesday, 27 August 2014

Extract data from Web Scraping C#


I am MVC ASP.NET developer.

I have received the contents from any url, i.e. http, https etc. using WebRequest class.

I have received all the content of that particular url. (for now I took http://google.com)

My next step is to extract buttons, header, footer, colors, text etc.

Here is my code for now:

public ActionResult GetContent(UrlModel model) //model having a string URL
which is entered in a text box and method hits using submit button.
{
    //WebRequest request = WebRequest.Create(model.URL);

    WebRequest request = WebRequest.Create(model.URL);

    request.Credentials = CredentialCache.DefaultCredentials;

    WebResponse response = request.GetResponse();

    Stream dataStream = response.GetResponseStream();

    StreamReader reader = new StreamReader(dataStream);

    string responseFromServer = reader.ReadToEnd();
    ViewBag.Response = responseFromServer;

    reader.Close();
    response.Close();
    return View();
}

Can someone help me with writing the code ?

Also do suggest me with some techniques of data extraction in C#.



Source: http://stackoverflow.com/questions/21901162/extract-data-from-web-scraping-c-sharp

Scrapy, scraping price data from StubHub


I've been having a difficult time with this one.

I want to scrape all the prices listed for this Bruno Mars concert at the Hollywood Bowl so I can get the average price.

http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/

I've located the prices in the HTML and the xpath is pretty straightforward but I cannot get any values to return.

I think it has something to do with the content being generated via javascript or ajax but I can't figure out how to send the correct request to get the code to work.

Here's what I have:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from deeptix.items import DeeptixItem

class TicketSpider(BaseSpider):
    name = "deeptix"
    allowed_domains = ["stubhub.com"]
    start_urls = ["http://www.stubhub.com/bruno-mars-tickets/bruno-mars-hollywood-hollywood-bowl-31-5-2014-4449604/"]

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath('//div[contains(@class, "q_cont")]')
    items = []
    for site in sites:
        item = DeeptixItem()
        item['price'] = site.xpath('span[contains(@class, "q")]/text()').extract()
        items.append(item)
    return items

Any help would be greatly appreciated I've been struggling with this one for quite some time now. Thank you in advance!


Source: http://stackoverflow.com/questions/22770917/scrapy-scraping-price-data-from-stubhub

Tuesday, 26 August 2014

How do you scrape AJAX pages?

Overview:

All screen scraping first requires manual review of the page you want to extract resources from. When dealing with AJAX you usually just need to analyze a bit more than just simply the HTML.

When dealing with AJAX this just means that the value you want is not in the initial HTML document that you requested, but that javascript will be exectued which asks the server for the extra information you want.

You can therefore usually simply analyze the javascript and see which request the javascript makes and just call this URL instead from the start.

Example:

Take this as an example, assume the page you want to scrape from has the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is instead do an HTTP request to time.asp of the same server instead. Example from w3schools.


Sporce: http://stackoverflow.com/questions/260540/how-do-you-scrape-ajax-pages

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That website is the amount of white men born in the year 1923 who live in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the # 1975, it displays unknown. The number 1975 should be in $val\n.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;

use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt");
 my $oldh = select(O);
 $| = 1;
 select($oldh);
 while (my $location = <L>) {
     chomp($location);
     $location =~ s/ /+/g;
      foreach my $year (1923..1923) {
                 my $u = $url;
                 $u =~ s/%LOCATION%/$location/;
                 $u =~ s/%YEAR%/$year/;
                 #print "$u\n";
                 my $content = get($u);
                 my $val = 'unknown';
                 if ($content =~ / of .strong.([0-9,]+)..strong. /) {
                         $val = $1;
                 }
                 $val =~ s/,//g;
                 $location =~ s/\+/ /g;
                 print "'$location',$year,$val\n";
                 print O "'$location',$year,$val\n";
         }
     }

Update: API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicbale.



Source: http://stackoverflow.com/questions/14654288/using-perl-to-scrape-a-website

Monday, 25 August 2014

Data Scraping using php


Here is my code

    $ip=$_SERVER['REMOTE_ADDR'];

    $url=file_get_contents("http://whatismyipaddress.com/ip/$ip");

    preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);

    $isp=$output[1][2];

    $city=$output[9][2];

    $state=$output[8][2];

    $zipcode=$output[12][2];

    $country=$output[7][2];

    ?>
    <body>
    <table align="center">
    <tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
    <tr><td>City :</td><td><?php echo $city;?></td></tr>
    <tr><td>State :</td><td><?php echo $state;?></td></tr>
    <tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
    <tr><td>Country :</td><td><?php echo $country;?></td></tr>
    </table>
    </body>

How do I find out the ISP provider of a person viewing a PHP page?

Is it possible to use PHP to track or reveal it?

Error: http://i.imgur.com/LGWI8.png

Curl Scrapping

<?php
$curl_handle=curl_init();
curl_setopt( $curl_handle, CURLOPT_FOLLOWLOCATION, true );
$url='http://www.whatismyipaddress.com/ip/132.123.23.23';
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, Array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15") );
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Your application name');
$query = curl_exec($curl_handle);

curl_close($curl_handle);
preg_match_all('/<th>(.*?)<\/th><td>(.*?)<\/td>/s',$url,$output,PREG_SET_ORDER);
echo $query;
$isp=$output[1][2];

$city=$output[9][2];

$state=$output[8][2];

$zipcode=$output[12][2];

$country=$output[7][2];
?>
<body>
<table align="center">
<tr><td>ISP :</td><td><?php echo $isp;?></td></tr>
<tr><td>City :</td><td><?php echo $city;?></td></tr>
<tr><td>State :</td><td><?php echo $state;?></td></tr>
<tr><td>Zipcode :</td><td><?php echo $zipcode;?></td></tr>
<tr><td>Country :</td><td><?php echo $country;?></td></tr>
</table>
</body>

Error: http://i.imgur.com/FJIq6.png

What's is wrong with my code here? Any alternative code , that i can use here.

I am not able to scrape that data as described here. http://i.imgur.com/FJIq6.png

P.S. Please post full code. It would be easier for me to understand.



Source: http://stackoverflow.com/questions/10461088/data-scraping-using-php

PDF scraping using R

I have been using the XML package successfully for extracting HTML tables but want to extend to PDF's. From previous questions it does not appear that there is a simple R solution but wondered if there had been any recent developments

Failing that, is there some way in Python (in which I am a complete Novice) to obtain and manipulate pdfs so that I could finish the job off with the R XML package

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system



Source: http://stackoverflow.com/questions/7918718/pdf-scraping-using-r

Saturday, 23 August 2014

Scraping data in dynamic sites

I'm trying to scrape data from our local government. What I want is address from kids adoption offices. Here, in Brazil, all adoptions go through the government. So I have the URL of one office, there are 2 or 3 thousands more. But if I can manage to get one, the others will be easy. I made many attempts, bellow I show three.

The problem could be related to a Javascript (Ajax maybe) that refresh the page.

Note: I am not a PHP developer.

First attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP GET 1</h1>';

echo ini_get("allow_url_fopen");
echo ini_get("allow_url_fopen");

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$html = file_get_contents($url);
var_dump($html);

echo '</body></html>';

// Output
// 11
// Warning:
file_get_contents(http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?
transacao=CONSULTA&vara=2673) [function.file-get-contents]: failed to open stream: HTTP
request failed! HTTP/1.1 404 Not Found in /home/rsl/www/sc01_get.php on line 14
// bool(false)

Second attempt

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 3</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");;

$html=@curl_exec($curl);

if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
   echo '<br>begin HTML[';
    echo  $html;
   echo '<br>]end html ';
}
echo '</body></html>';

// Output
// 1

third attempt

function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.6 (KHTML, like Gecko) Chrome/16.0.897.0 Safari/535.6');
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_REFERER, "http://www.windowsphone.com");

    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

echo '<html><head></head><body>';
echo '<h1>Scraper PHP CURL 5</h1>';

// I used this url for test
//$url = 'http://www.portaldaadocao.com.br';

//This is the URL that I really want
$url = 'http://www.cnj.jus.br/cna/Controle/ConsultaPublicaBuscaControle.php?transacao=CONSULTA&vara=2673';

$curl = curl_init($url);
@curl_setopt($curl, CURLOPT_POSTFIELDS, "foo");
@curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
@curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");;

$html=@curl($curl);


if (!$html) {
    echo "<br />cURL error number:" .curl_errno($curl);
    echo "<br />cURL error:" . curl_error($curl);
    exit;
}
else{
    echo '<br>begin HTML[';
    echo  $html;
    echo '<br>]end html ';
}
echo '</body></html>';

// Output
// cURL error number:0
// cURL error:

If the pages are really ajax based meaning the information that you need to scrape is loaded or shown through javascript execution, you will need another approach. You would need to automate with a real browser. You can go the Selenium route which can be written in a number of languages or use CasperJS with Javascript as the programming language.



Source: http://stackoverflow.com/questions/24611046/scraping-data-in-dynamic-sites

Friday, 22 August 2014

What is the right way of storing screen-scraping data?



i'm working on a web site. it is scraping product details(names, features, prices etc.) from various web sites, processing and displaying them. i'am considering to run update script on each day and keep data fresh.

    scrape data
    process them
    store on database
    read(from db) and display them

i'am already storing all the data in a sql schema but i'm not sure. After each update, all the old records are vanishing. if the scraped new data comes corrupted somehow, there is nothing to show.

so, is there any common way to archive the old data? which one is more convenient: seperate sql schemas or xml files? or something else?

Source: http://stackoverflow.com/questions/13686474/what-is-the-right-way-of-storing-screen-scraping-data

Scraping dynamic data


I am scraping profiles on ask.fm for a research question. The problem is that only the top most recent questions are viewable and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping it. I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

    https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

    http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

    http://twill.idyll.org/



Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Wednesday, 20 August 2014

Web Scraping data from different sites


I am looking for a few ideas on how can I solve a design problem I'm going to be faced with building a web scraper to scrape multiple sites. Writing the scraper(s) is not the problem, matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If i scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then when I scrape site A or B I can, over time, build up a list of failures and store them in a list for each scraper so that it can know (if i find id=100 then i know that the firstname will be William etc). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB


Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Friday, 1 August 2014

How to Trick Google With Your SEO Articles and Web Content

So you're spending time writing SEO articles and creating highly optimised web content, or you're using an article service to create articles for you? What made you click on the link that brought you to this article then?

Perhaps you're looking for a sneaky little trick that will power your articles to the top of the search results in no time at all? You're looking for an edge that no one else has got that will let your content rush to the top of the results like a flatulent cork in water wings? Well read on...

Even the most average internet marketer cannot help but to have become aware that keyword stuffing is no longer effective. Indeed, keyword stuffing is highly likely to see a website demoted or even blacklisted. Today there is a need for high quality content, and for content which is unique and original, as well as popular. The trouble is that this can make the job much harder. Having to spend time creating good, solid, readable content which is useful and interesting is time-consuming.

Having to spend time creating content which might be considered worthwhile by real people is a lengthy an involved process. It used to be so much easier when you could just fling any old rubbish online and let the search engines lap it all up like hungry dogs. Today it seems that those dogs have turned, and unless you want them to bite, you need to spend time actually thinking about your potential customers, rather than just those nice friendly bots and spiders you've been so used to.

This is clearly a difficult situation, and the only option seems to be to succumb to the will of the search engines and spend time creating well-written, highly optimised content that appeals to both the search engines and real people. Goodness - you might even write something people really find interesting, and may want to link to. You never do know these days.

But of course, you clicked the link for this article, because you're looking to change all that. Rather than spending time crafting you'd rather be churning; rather than writing readable content you'd prefer to be chucking out text that looks as though your word processor and your thesaurus have been having an affair!

What you really want is to be able to press a magic button and have your articles fly up the search results, and magically draw thousands of keen, enthusiastic customers flooding to your website, ripping open their purses and wallets with such feverish excitement that you'll hardly know what to do with all that easy cash you'll be wallowing in.

As someone who provides an article service to internet marketers and business owners, and who writes SEO articles for a living, I have a few words of advice for those of you who want to try to get your articles above mine, who want to see your articles power ahead of mine and take hold of the search results pages by the horns.

Whilst I may sit here taking time to research each and every article I write, plan every article so that it has something to say, write it in a way that makes it entertaining, enjoyable and informative for those real live people who exist out there on the other side of the web, craft articles in a way that takes full advantage of Google's algorithms, optimised for latent semantic indexing, yet making it almost entirely undetectable, you want to discover a secret formula that will launch your articles with barely more than a flick of your wrist.

You probably want to find out what this secret formula is so that you can spend less time hurling hundreds, perhaps thousands of articles out every week just to scrape by. Meanwhile, I'll write an article once every week or so. You'll notice them because they always end up boosting my website up to the very top of Google for all the major keywords and key phrases I have chosen, despite several billion other sites all appearing for the same searches.

Well, here it is. The magic formula, the button you want to press is coming right up. Forget those black hat techniques that simply blast meaningless content at thousands of identical directories. To really achieve success with your SEO articles and enjoy the same level of exposure as my article service, the magic formula is this: forget writing SEO articles. That's it. When you're writing your next article, forget that it's an SEO article.

Source: http://ezinearticles.com/?How-to-Trick-Google-With-Your-SEO-Articles-and-Web-Content&id=4078570