Site Scrapers

I have become extremely annoyed with site scrapers. These websites exist solely to bring in ad revenue, and they take all of their content from other people’s RSS feeds. I think the way they work is that the operator configures the site to update itself from RSS feeds filtered by keywords. I have noticed, at any rate, that certain key terms attract the attention of scraper sites. Why do site scrapers do this? Because their sites are littered with ads, and they want to generate revenue without doing any work.

I think publishing an RSS feed is essential. I rely on RSS feeds to keep up with all the blogs I read. If a blog doesn’t publish an RSS feed, I probably won’t remember to check it for updates. I have no plans to stop publishing an RSS feed. However, I think educators should be aware that publishing an RSS feed will leave your site vulnerable to scraper sites, and there really isn’t a whole lot you can do about it. Yes, most of the time, scraper sites are violating copyright law, but fighting them may or may not be worth the time it would take.

First of all, how do you know you’ve been scraped? The answer to that one is that you might not, but I have noticed some site scrapers’ links in Technorati results for sites that link to mine. Once I visit the site to see why I am being linked, I discover a blog with a series of posts on the same topic and a sidebar full of ads.
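Once you have a suspect page in front of you, one rough way to confirm that it really is a copy of your post is to count how many runs of consecutive words the two share. What follows is only a sketch of that idea, not a tool I rely on: the function names and the five-word window are assumptions I made up for illustration, and you would paste in the text of your post and of the suspect page yourself.

```python
import re


def shingles(text, size=5):
    """Return the set of lowercase word runs of length `size` found in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than the window: treat the whole thing as one run.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, suspect, size=5):
    """Fraction of the original's word runs that also appear in the suspect text."""
    original_runs = shingles(original, size)
    if not original_runs:
        return 0.0
    shared = original_runs & shingles(suspect, size)
    return len(shared) / len(original_runs)
```

A result near 1.0 suggests the page reproduces the post wholesale, while a result near 0.0 suggests it merely links to you or covers the same topic. Short quotations will land somewhere in between, so treat the number as a hint, not proof.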

If you want to fight site scraping, my suggestion would be to find out who hosts the domain of the website that is scraping your material. If the blog is hosted on Blogger, WordPress.com, or some other blog hosting platform, the blog is most likely violating that platform’s terms of service, and reporting the blog should take care of it. If the site is hosted independently and runs on WordPress, Movable Type, or Blogger (or some other platform), then look up the hosting provider. You can do this by searching Whois.net. You will find out who the site’s host, owner, and registrar are (if you searched huffenglish.com, for instance, you’d find that my host is Bluehost). Then you can report the abuse through the host’s website, or even write to the e-mail address listed for the abuser.
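If you would rather not use a website for the lookup, the same Whois information can be requested directly, because Whois is just a line of plain text sent over TCP port 43. The sketch below is a minimal illustration under several assumptions: the function names are mine, the default server `whois.iana.org` is only a starting point (its reply usually contains a `refer` line naming the server to ask next), and the field names it looks for vary from registry to registry.

```python
import socket


def whois_query(domain, server="whois.iana.org", timeout=10):
    """Send a Whois query (plain text over TCP port 43) and return the raw reply."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def extract_fields(reply, wanted=("registrar", "refer", "name server")):
    """Collect 'Key: value' lines whose key contains one of the wanted words."""
    found = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if sep and value and any(word in key for word in wanted):
            found.setdefault(key, []).append(value)
    return found
```

Calling `extract_fields(whois_query("huffenglish.com"))` should give you a dictionary of whatever matching lines the server sent back. The name servers often hint at who the host is, but they are not a guarantee, so check the host’s own abuse page before you report anything.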

I have found three site scrapers stealing my content lately. All three were registered through Go Daddy, which reported that it is not the sites’ host and therefore not responsible for their content. All three sites did list an administrative contact when I looked up their Whois information. I will let you know what, if anything, results from my writing to these contacts (two of the offending domains appear to have been registered by the same person).

And now it’s time to address the root of the problem: blog ads. I completely understand why someone would want to make some extra money. The concept behind blog ads is that when readers click on ads, they will generate revenue for the blog owner. Let me go on the record as saying I hate blog ads. I will never put them on my blogs, and I don’t like it when I see them on other blogs. I know some people who have them, and my husband even tried them for a while, but found they were really useless in terms of generating revenue. If you want to generate revenue, you will probably earn more through generous PayPal donors than you will through ads. However, in order to receive donations, you have to provide content that people might feel is valuable enough to pay for. I have donation buttons only on the pages that contain such content, but it is freely offered, and anyone who takes the material is not obligated to put a tip in the jar. I think ads have become the bane of blogging. Because of Google AdSense and its ilk, scraper sites find that it might indeed be lucrative to steal other writers’ work in order to generate income for themselves. Frankly, I don’t know; it might be. However, what I do know is that if it weren’t for blog ads, we wouldn’t have site scrapers. And if it weren’t for the people who make blog ads lucrative, the hapless readers who click on ads, we wouldn’t have blog ads.

I recently posted about comment spam at EduStat Blog, and one astute commenter, Pete T., noted:

Great post, but I’d like to bring another element to the SPAM control discussion, Education.

In 2006, 40% of all email was SPAM, 2200 messages per user costing $8.9 billion to US Corporations and $255 million to others. It’s estimated that 2007 will bring a 63% increase, why? Because 8% of the people who receive the stuff actually buy something.

Enter Web 2.0 with its Blogs, Wikis, and forums. These new media outlets open a whole new horizon for these spammers to not only to pitch their wares, but also to gain search engine link popularity (another form of spamming.)

Yes, we need to continually develop technology to identify and filter spam as the virus protection industry has done – but there needs to be an education campaign that teaches the community the risks of doing business with a spammer.

Legislation and filtering can’t do it completely, only when it’s not making them any money – SPAM will really go away.

The same goes for site scraping. I am not going to tell you not to put ads on your site, but I would ask you to think about it and be sure it’s really right for you. Educators are not paid a great deal; no one goes into education for the money. Another thing to think about is that ads are generated automatically rather than chosen by you. I think bloggers should be responsible for all the content and links on their sites. I think that if a blog links to a questionable site, then it is the blogger’s responsibility either to take down the link or to stand by the decision to link and weather whatever fallout results. Ads take away some of that control, and the possibility exists that the ads might link to sites that the blog owner (or his or her employer) doesn’t approve of.

Food for thought, as the cliché goes.


2 thoughts on “Site Scrapers”

  1. Someone scraped my site from wordpress.com. There is no way to contact the owner or find out who is sponsoring the site. There are no ads, either, but no credit or links are given for any of the articles. It did say at the bottom that it is an RSS feed. I don't understand their motive if they have no ads; perhaps it is something more sinister, like malware. My content is free to readers, but not free to reproduce, and I put a copyright notice on my article, which went with it. There seems to be no way for me to stop it. I found the site scraper by doing a Google search for some of my keywords. My site didn't even show up, but the site scraper was at the TOP of the list, which means MY article and site should have been at the top of the list. They stole my article the FIRST day I published it with wordpress!!

    • Karen, I'm sorry that happened to you. If you can find out more about who owns the site or hosts it, you may have recourse.