Photo counter value increased by robots?


MMCM
30th of December 2004 (Thu), 16:20
After the worm attack on my EE server, which totally messed up the photo counters, I tried to figure out a way to rebuild them.

My first attempt was to parse the access_log, count all accesses to photos generated by worms, and reduce the counters in the database accordingly. Then I realized that in some cases, where the generated URL did not contain a valid parameter for EE, the photo displayed was in no way related to the input, so the log could not tell me which counter to reduce.

So the second attempt was to count all valid photo requests which were NOT generated by the worm. That method worked well: I generated an SQL script which corrected the counter values in the database.
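A minimal sketch of that second attempt, for anyone who wants to try the same. It assumes an Apache common/combined access_log and a photo URL of the form /photo.php?photo=<number>; that parameter name is only a placeholder, not necessarily what EE uses. Any request with extra or malformed parameters is simply treated as invalid, which is a crude stand-in for "generated by the worm". The table and column names are the ones from EE's counter query, but the idea that the counter id equals the photo id is also an assumption, as is the idea that the log covers the whole period the counter should reflect.

```php
<?php
// Count requests whose URL is exactly /photo.php?photo=<digits>.
// The parameter name "photo" is hypothetical; adjust it to the real URL scheme.
function count_valid_photo_requests(array $log_lines)
{
    $counts = array();
    foreach ($log_lines as $line) {
        // Request field of an Apache log line: "GET <url> HTTP/1.x"
        if (preg_match('#"GET /photo\.php\?photo=(\d+) HTTP/[\d.]+"#', $line, $m)) {
            $id = (int) $m[1];
            $counts[$id] = isset($counts[$id]) ? $counts[$id] + 1 : 1;
        }
    }
    return $counts;
}

// Turn the counts into one UPDATE statement per photo.
// Assumes ee_counter_id matches the photo id, which may not hold in EE.
function build_counter_sql(array $counts)
{
    $sql = array();
    foreach ($counts as $id => $hits) {
        $sql[] = sprintf(
            "UPDATE ee_counter SET ee_counter_value = %d WHERE ee_counter_id = '%d';",
            $hits,
            $id
        );
    }
    return $sql;
}
```

Feeding it the result of file('access_log') and writing the returned statements to a file gives a script that can be reviewed by hand before it is run against the database.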

Then I wondered why there were still so many photo accesses, and I realized that accesses by robots from Google, MSN, ... were counted too.

Now I am thinking of a way to prevent that. If it's not possible to do, the best thing for me is to forget all about counters :-(

After a quick look at photo.php, one method could be to query the user agent parameter, and if it's a robot, simply not count the access. One would of course have to list all possible types of robots, which I think is nearly impossible.

Then I took a look at www.fotocommunity.de, where I have several accounts, to see how they manage it. Only the owner can see the click counter there, and I think it's not affected by robots. Alas, my attempt to read their robots.txt failed. In the page source, however, I discovered that they reference a small dummy GIF image (like <img src="http://www.fotocommunity.de/pc/fotocount.php?id=9999999" width=1 height=1>), and that access is obviously used to increase the counter. With my own robots.txt, it's an easy task to prevent well-behaved robots from retrieving that URL.

What do the other EE users think of that solution? Removing the counter code from photo.php and putting it into a new small php script would be an easy task to do.
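A rough sketch of what such a small script might look like, not a finished implementation. The file name count.php, the id parameter and the $run_query callback are all placeholders of mine; the UPDATE statement is the one EE already uses for its counter. The script only accepts plain positive numbers as id and always answers with a 1x1 GIF, so the page layout is never disturbed.

```php
<?php
// count.php (hypothetical name): bump a counter, then return a 1x1 GIF.

// A transparent 1x1 GIF, base64-encoded.
define('PIXEL_GIF_BASE64', 'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');

// Accept plain positive integers only; everything else is rejected.
function parse_counter_id($raw)
{
    if (is_string($raw) && ctype_digit($raw) && (int) $raw > 0) {
        return (int) $raw;
    }
    return null;
}

// Same UPDATE as in photo.php, with the id forced to an integer.
function counter_update_sql($counter_id)
{
    return "UPDATE ee_counter SET ee_counter_value = ee_counter_value+1"
         . " WHERE ee_counter_id = '" . (int) $counter_id . "'";
}

// $run_query is a placeholder for whatever database layer EE provides.
function handle_count_request($raw_id, $run_query)
{
    $id = parse_counter_id($raw_id);
    if ($id !== null) {
        call_user_func($run_query, counter_update_sql($id));
    }
    return base64_decode(PIXEL_GIF_BASE64);
}

// Only act when called through the web server with an id parameter.
if (PHP_SAPI !== 'cli' && isset($_GET['id'])) {
    $run_query = function ($sql) {
        // Hand $sql to EE's existing database connection here.
    };
    header('Content-Type: image/gif');
    header('Cache-Control: no-store'); // ask browsers not to cache the pixel
    echo handle_count_request($_GET['id'], $run_query);
}
```

photo.php would then only need an <img src="count.php?id=..." width=1 height=1> tag in its output, and a Disallow: /count.php line under User-agent: * in robots.txt should keep well-behaved robots away from the counter.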

Is anybody else interested? My concern is to avoid all unnecessary code changes to EE, because when Pekka rolls out the next (official) version, I would have to redo them all.

Pekka
30th of December 2004 (Thu), 17:30
If you want to exclude certain user agents (see http://www.psychedelix.com/agents1.html and http://www.jafsoft.com/searchengines/webbots.html#search_engine_robots_and_others and http://support.free-conversant.com/2701) I would prefer this method:

photo.php

$start = ee_get_microtime();

// Keywords to look for in the user agent string (matched case-insensitively).
$ignore_ua = array(
    "inktomi",
    "infoseek",
    "google",
    "copernic",
    "Slurp",
    "whatever"
);

// $agent is assumed to already hold the visitor's user agent string.
$ua_found = 0;
foreach ($ignore_ua as $key) {
    if (stripos($agent, $key) !== false) {
        $ua_found = 1;
        break;
    }
}

// Only count the access when no robot keyword matched.
if ($ua_found == 0) {
    $updatecounter = mysql_query(
        "
        UPDATE
            ee_counter
        SET
            ee_counter_value = ee_counter_value+1
        WHERE ee_counter.ee_counter_id = '$counter_id'
        "
    );
    ee_error ($updatecounter,"updatecounter",$currentpage);
}

stop_mysql_timer (__LINE__ . " - update counter");

This takes virtually no processing time. Put the keywords for the most frequent and biggest engines (i.e. the strings to search for in the user agent) at the beginning of the $ignore_ua array, so that the loop breaks early. Of course the ignore list should be stored in the database and be editable in the editors so that you don't have to update the files... perhaps it will be :)
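The keyword check can also be wrapped in a small function, which makes it easier to try against real user agent strings before touching photo.php. This is only a sketch: is_ignored_agent is a name made up for illustration, and it uses stripos() (PHP 5 and later) for the case-insensitive match.

```php
<?php
// Returns true when any keyword occurs in the user agent string,
// ignoring case. Empty keywords are skipped so they never match.
function is_ignored_agent($agent, array $ignore_ua)
{
    foreach ($ignore_ua as $key) {
        if ($key !== '' && stripos($agent, $key) !== false) {
            return true;
        }
    }
    return false;
}

// Example list, most frequent robots first so the loop returns early.
$ignore_ua = array("google", "Slurp", "inktomi", "infoseek", "copernic");
```

In photo.php the counter update would then run only when is_ignored_agent($agent, $ignore_ua) returns false.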

stevehof
1st of January 2005 (Sat), 07:30
The intrusive robot thing is something I've been struggling with since mid November. MSNBot and Alexa in particular have on more than one occasion used multiple gigabytes of bandwidth on a DAILY basis. These two seem to be based on the same new prototype search engine, and I think this robot gets confused by some of the dynamic web site content and reads the same links over and over and over... MSNBot will honor a deny in your robots.txt. I wasn't too sure Alexa was honoring my robots.txt, so I ended up tracing the IP numbers of all the Alexa servers and entered them in my .htaccess file as deny rules. While I'd like to be listed with Microsoft's search engines, I can't afford the bandwidth...

Put these lines in your robots.txt file:

User-agent: msnbot
Disallow: /

# Alexa's crawler identifies itself as ia_archiver
User-agent: ia_archiver
Disallow: /

and put this code in your .htaccess file if you want to completely eliminate Alexa:

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from 209.237.238.172
deny from 209.237.238.173
deny from 209.237.238.174
deny from 209.237.238.175
deny from 209.237.238.170
deny from 209.237.238.176
deny from 209.237.238.177
deny from 209.237.238.178
deny from 209.237.238.179
deny from 209.237.238.180
deny from 209.237.238.181
deny from 209.237.238.182
deny from 209.237.238.184
deny from 209.237.238.185
deny from 209.237.238.186
deny from 209.237.238.187
deny from 209.237.238.188
deny from 209.237.238.190