[BOINCstats] Willy
 
Forum moderator - Administrator - Developer - Tester - Translator
BAM!ID: 1
Joined: 2006-01-09
Posts: 9419
Credits: 350,105,499
World-rank: 4,518

2006-06-16 11:33:35

The greatest challenge of BOINCstats is keeping it going. Over time, the number of BOINC users and the features on this site have greatly increased. This alone puts enormous stress on the server. At the beginning of this year, the problems had become so bad that the site was down whenever the stats were updated, simply because the combined load of running the stats and serving the web pages was too high.

Thanks to some generous donations and a collaboration with PrimeGrid, I was able to add a dedicated server just for updating the stats. The ‘old’ server had already been upgraded once a few months earlier, and is now serving as the webserver. It keeps a copy of the stats database, so essentially there are two copies of the stats on two servers, to keep the load under control.

Since then, more users have come to BOINC and BOINCstats, and BAM! has been launched. BAM! runs for the most part on the database server; only the web interface runs on the webserver.

The load was still increasing and I have made drastic changes in the code of the website to lower the load on the databases. These measures include caching of webpages and images to prevent the same data being requested from the database.

The highest load is seen just after the stats are updated (either incremental or daily update): the caches are empty and need to be filled again. And by now, almost everybody knows when BOINCstats has the new daily stats online and this is reflected by the number of visitors at that time.
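As a rough illustration of the caching described above (this is not the actual BOINCstats code, which is not shown in this thread), here is a minimal Python sketch of a page cache that is flushed whenever the stats are updated. The class name `PageCache`, the one-hour lifetime and the `render_user_page` stand-in are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Cache rendered pages until they expire or the stats are refreshed."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str, render: Callable[[], str]) -> str:
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        # Cache miss: this is the only path that touches the database.
        self.misses += 1
        page = render()
        self.store[key] = (now, page)
        return page

    def flush(self) -> None:
        """Call after a stats update so stale numbers are never served."""
        self.store.clear()


db_queries = 0


def render_user_page() -> str:
    """Hypothetical stand-in for a database query plus HTML rendering."""
    global db_queries
    db_queries += 1
    return "<html>stats for user 1</html>"


cache = PageCache(ttl_seconds=3600)
for _ in range(100):
    cache.get("/user/1", render_user_page)  # first call renders, the rest hit the cache
cache.flush()                               # the stats were just updated
cache.get("/user/1", render_user_page)      # cache is cold again, so the page is re-rendered
```

The flush is what makes the period right after an update the most expensive one: every page that visitors ask for has to be rendered from the database once before the cache can absorb the load again.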

With all these things I have no problems. BOINCstats is pretty popular, and is for me personally a huge success. And I want to make it even better and attract more visitors. High load is (I think) perfectly normal for any popular website. I simply have to find ways to keep it going, and I have already explained some of the measures I took. If everything continues to grow as it has in recent months, I anticipate needing a new (faster) server by the beginning of next year.

But, the one thing I can’t get under control is scraping. As explained before, scraping is the automated downloading of BOINCstats webpages, to extract just a small part of the page for use on another site or for other stats.
‘Professional’ scrapers write a program that fetches hundreds to thousands of pages from BOINCstats in sequence, which cripples the database. This can be compared with a load of ten times the number of visitors BOINCstats now has.
Instead of writing their own stats engine, they simply take the numbers from BOINCstats in the most inefficient way, without asking permission.
To accommodate all scrapers, I should add at least one web/database server to handle their requests, and that’s simply not an option.

If you watch your stats on BOINCstats and copy some of the numbers to an Excel sheet or something else, or if you show the BOINCstats signature or another BOINCstats image on your site, you are NOT considered to be a scraper.

Most scrapers are found either by checking how much bandwidth a single IP address uses, or when the site goes down due to a large number of requests from a single IP. By simply viewing your stats you can’t bring the site down or generate much traffic.
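To make the per-IP check concrete, here is a minimal Python sketch that counts requests and bytes per IP address inside a sliding time window. It is not how BOINCstats actually does it (that is not described in this thread); the class name `ScraperDetector`, the thresholds and the example IP addresses are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class ScraperDetector:
    """Flag IPs that exceed a request or bandwidth budget per time window."""

    def __init__(self, window_seconds: float, max_requests: int,
                 max_bytes: int) -> None:
        self.window = window_seconds
        self.max_requests = max_requests
        self.max_bytes = max_bytes
        self.log: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)
        self.bytes_in_window: Dict[str, int] = defaultdict(int)

    def record(self, ip: str, timestamp: float, response_bytes: int) -> bool:
        """Record one request; return True if the IP now looks like a scraper."""
        entries = self.log[ip]
        entries.append((timestamp, response_bytes))
        self.bytes_in_window[ip] += response_bytes
        # Drop requests that have fallen out of the time window.
        while entries and entries[0][0] <= timestamp - self.window:
            _, old_bytes = entries.popleft()
            self.bytes_in_window[ip] -= old_bytes
        return (len(entries) > self.max_requests
                or self.bytes_in_window[ip] > self.max_bytes)


# Assumed limits: 120 requests or 5 MB per minute per IP.
detector = ScraperDetector(window_seconds=60, max_requests=120,
                           max_bytes=5_000_000)

# A normal visitor: one 20 kB page every 10 seconds for 10 minutes.
visitor_flagged = any(
    detector.record("192.0.2.1", t * 10.0, 20_000) for t in range(60)
)

# A scraper: 1000 pages fetched back to back, 20 per second.
scraper_flagged = any(
    detector.record("192.0.2.2", t * 0.05, 20_000) for t in range(1000)
)
```

With these assumed limits the visitor never has more than a handful of requests in the window, while the scraper crosses the request limit within a few seconds.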

99.99% of you will never be accused of being a scraper. I’m pretty good at filtering them out. The other 0.01% knows perfectly well what they are doing (especially after this warning).

I hope this clarifies things, but most probably I have left you with more questions.
Please do not PM, IM or email me for support (they will go unread/ignored). Use the forum for support.
bjango
 
Translator
BAM!ID: 7
Joined: 2006-01-11
Posts: 197
Credits: 598,676
World-rank: 275,609

2006-06-16 22:52:12

The other 0.01% knows perfectly well what they are doing (especially after this warning).


Willy, would naming and shaming be an option, i.e. embarrass those responsible?
Lee Carre
 
BAM!ID: 41
Joined: 2006-04-19
Posts: 262
Credits: 299,581
World-rank: 394,917

2006-06-18 02:44:06

The other 0.01% knows perfectly well what they are doing (especially after this warning).

Willy, would naming and shaming be an option, i.e. embarrass those responsible?
That could end up being a really good idea. Since team stats will be purged as well, it would encourage team managers to ban/reject members of their team who scrape the site.
Want to search the BOINC Wiki, BOINCstats, or various BOINC forums from within firefox? Try the BOINC related Firefox Search Plugins
Guest

2006-06-19 00:36:36
last modified: 2006-06-19 00:40:49

Here's an idea, though I'm not sure how feasible it would be to implement. Those sites which are known to do this, could they somehow be encouraged to pay a nominal fee/help reimburse this site for the BW taken by their scraping activities? That way, if they don't want to write their own engine, OK, as long as they're willing to contribute some of the funds needed to let this site's servers handle the extra BW... It'd be a "well, if you insist on using us for this sorta service, these are the problems, and we really need some financial support in exchange for this service you have sorta been using us for..."

The only other option (other than blocking the hardest-hitting IPs), if things grow to a prohibitive degree (both in terms of required hardware and BW perhaps?), would be something that I'm sure will be very unpopular, and could result in some negative rep being thrown for the suggestion: a little advertising to help defray some of the expenses necessary to keep the site running smoothly, and hopefully, if it becomes necessary, some of the less obnoxious/intrusive forms of it...

True, visitors aren't really thrilled with such a prospect, though some other large sites have found it a necessary evil if you will, to pull in the sorta revenue necessary to remain open and support the needs of a rather large site...
Lee Carre
 
BAM!ID: 41
Joined: 2006-04-19
Posts: 262
Credits: 299,581
World-rank: 394,917

2006-06-19 14:06:04

Those sites which are known to do this, could they somehow be encouraged to pay a nominal fee/help reimburse this site for the BW taken by their scraping activities?

The issue with scraping is that it's very inefficient, and there are much better methods. Bandwidth isn't the problem; the issue is server load.

I'm not saying that they shouldn't have to pay, by all means charge them, but prevent them from scraping and make them use a more efficient method.

A little advertising to help defray some of the expenses necessary to keep the site running smoothly, and hopefully if it becomes necessary some of the less obnoxious/intrusive forms of it...

True, visitors aren't really thrilled with such a prospect, though some other large sites have found it a necessary evil if you will, to pull in the sorta revenue necessary to remain open and support the needs of a rather large site...
Advertising is already used, and Willy has said that the site runs smoothly under normal use. All the excessive load is caused by these damn scrapers, so Willy's actions are quite reasonable.

Bad practices should never be encouraged.
Luigi R.
BAM!ID: 169503
Joined: 2014-07-27
Posts: 10
Credits: 0
World-rank: 0

2017-01-23 19:30:18

Hello, I'm very sorry to be reading this only after I did the scraping and my domain got banned. I sent a PM to Willy on Saturday, but he didn't see it.

Please, let me explain my point. I would like not to be banned by BOINCstats.

I didn't know that you consider this a very bad practice, and I didn't even know the name of it.
I maintain a site that is totally free and non-profit, which collects data to create "inner" rankings for the BOINC.Italy team. The team is not responsible for it; I program those stats scripts for fun in my free time.

I wrote a PHP script that reads XML files via web-rpc for independent challenges (good practice), but I have another one that scans your site for your challenges. I used it to scan the user stats of your challenges and pick out the BOINC.Italy members.
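For readers wondering what the XML-based approach looks like compared with scraping HTML pages, here is a minimal Python sketch that parses an XML export and keeps only the members of one team. It is not the poster's PHP script, and the element names (`user`, `name`, `teamid`, `total_credit`), the team ID 42 and the sample data are all hypothetical; a real project export will have its own schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample of an XML user export; not real data.
SAMPLE_EXPORT = """\
<users>
  <user><name>alice</name><teamid>42</teamid><total_credit>1500.5</total_credit></user>
  <user><name>bob</name><teamid>7</teamid><total_credit>9000</total_credit></user>
  <user><name>carol</name><teamid>42</teamid><total_credit>3200</total_credit></user>
</users>
"""


def team_ranking(xml_text: str, team_id: int) -> List[Tuple[str, float]]:
    """Return (name, credit) pairs for one team, highest credit first."""
    root = ET.fromstring(xml_text)
    members = []
    for user in root.iter("user"):
        # Keep only users whose team ID matches the requested team.
        if int(user.findtext("teamid", default="0")) == team_id:
            name = user.findtext("name", default="")
            credit = float(user.findtext("total_credit", default="0"))
            members.append((name, credit))
    return sorted(members, key=lambda member: member[1], reverse=True)


ranking = team_ranking(SAMPLE_EXPORT, team_id=42)
```

One structured file is fetched and filtered locally, instead of requesting one HTML page per user from the stats site.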

I would like to apologize again. I will not scrape your data anymore.

P.S.
If you could offer a way to export just the challenge data, that would be great. I can understand if you disagree because of the advertisements.
Alternatively, I suggest adding a 'team filter' to your site.
Finally, would you allow me to export your challenge list so that I can generate my own independent stats (via web-rpc)?
Index :: BOINCstats general :: Scraping, an explanation.