November 16, 2005
Does Anything Really Disappear from the Internet?
I just posted about the Wayback Machine and that got me wondering whether anything really disappears from the Internet when it is deleted. Certainly, a ton gets archived in the Wayback Machine as well as in Google cache and in RSS readers. Of course, if something appears on the Internet, somebody could see it and copy it before it gets taken down.
But I was wondering to what extent information can vanish completely from the Internet. Thus, if a blogger posts something and then deletes it a minute later, can it escape from permanent fame? Maybe some ill-fated performances might be so brief that they can sneak on and off the Internet without being caught. What about a comment to a blog post that gets zapped quickly by the blog author? Can this escape becoming part of some permanent record?
The question, put another way: Can something posted briefly on the Internet, seen and heard by hardly anyone, not snatched up by anybody, and then deleted, be gone forever? Is there an Internet equivalent to a tree falling in the forest that nobody hears?
I don't know the answer to this question, and I would like to hear from those with more technical expertise.
UPDATE: People with expertise have answered, and their replies are worth checking out if you're interested in the issue.
Posted by Daniel J. Solove at November 16, 2005 12:10 AM
Comments
Sure it can vanish... if you didn't send it anywhere (e.g. via RSS), and if no one came to read it in the interim. For unpopular unlinked sites that can be a very long time. For a site like this one...well, just hope you didn't ping any sites to come and do updates, or have the bad luck to be visited by a Googlebot or other robot (not to mention a person who kept a copy). As a practical matter, how long the 'window of forgiveness' may be depends on your traffic...and luck. But sure, lots of old stuff is gone forever, and new stuff too can vanish un-archived, especially if you get to it quickly enough.
[And if your server is on Unix, when a file is deleted/changed it is much more erased than on Windows.]
Posted by: Michael Froomkin at November 16, 2005 12:38 AM
Yes, you can stay out of the Wayback Machine on archive.org and out of most search engines: you put a robot exclusion on your website. You can also stay out of just the Wayback Machine and remain in the search engines. Search on "robot exclusion" and you will find the magic incantations.
-brewster
Posted by: brewster kahle at November 16, 2005 01:05 AM
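For readers wondering what a robot exclusion looks like, here is a minimal sketch in Python. It assumes the Wayback Machine's crawler identifies itself as "ia_archiver" (the name commonly cited for it at the time; check archive.org for the current one), and the /drafts/ rule, the example.com URLs, and the helper name allowed are made up for illustration. It uses the standard library's robots.txt parser to show how a well-behaved robot would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, served from the root of the site.
# "ia_archiver" is an assumed crawler name for the Wayback Machine.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robot honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "Googlebot"):
        for url in ("http://example.com/post.html",
                    "http://example.com/drafts/regret.html"):
            print(agent, url, allowed(agent, url))
```

Note that this only keeps out robots that choose to obey the file. It does nothing about a person, or a badly behaved robot, that copies the page anyway.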
I've actually spent time trying to run things down that went away. Especially comments on blogs that have been deleted, etc.
The archive cycle, especially a while back, was not an hourly or even a daily one.
Posted by: Stephen M (Ethesis) at November 16, 2005 07:53 AM
It always surprises me that the engines seem to obey robots.txt, but they do. I don't know why, unless they think it will insulate them from liability.
I have successfully destroyed files from the early days of the internet, but today...
Posted by: Paul Gowder at November 16, 2005 10:00 AM
I used to blog back in college before the word was invented. Had to hard-code the HTML myself. That site is mostly dust in the wind. I've only ever been able to find the splash page archived anywhere, imploring surfers to look on my works and despair.
Could a site be up for more than a year and be scoured from the net anymore? I doubt it.
Posted by: John Armstrong at November 16, 2005 11:16 AM
Now, as someone noted, if you publish an RSS feed, you have a shorter lead time. It depends on how long it takes for various RSS readers to pick up your feed. Bloglines (rough estimate) usually takes from 1 to 3 hours to pick up a feed. So if you write something you regret, but delete it quickly, you have only a short time before your Bloglines subscribers read it.
Of course, as your blog gets more popular, there is a good chance that at any given time, more than one person will be reading your blog. So, putting aside technical issues to answer your question... Concurring Opinions ain't no empty forest.
Posted by: Mike at November 16, 2005 02:45 PM
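To put rough numbers on the window Mike describes, here is a back-of-the-envelope sketch in Python. It rests on a simplifying assumption that is not from the thread: each feed reader polls on a fixed interval, and the moment of posting falls at a random point within that interval. The function names are made up for illustration, and the 1 to 3 hour figures are Mike's rough estimate for Bloglines.

```python
def chance_polled(minutes_live: float, poll_interval_minutes: float) -> float:
    """Chance that one poller fetched the post while it was up.

    Assumes a fixed polling interval with a uniformly random phase,
    so the chance is the fraction of the interval the post was live.
    """
    if poll_interval_minutes <= 0:
        raise ValueError("poll interval must be positive")
    if minutes_live <= 0:
        return 0.0
    return min(1.0, minutes_live / poll_interval_minutes)


def chance_any_polled(minutes_live: float, intervals: list[float]) -> float:
    """Chance that at least one of several independent pollers fetched it."""
    missed_by_all = 1.0
    for interval in intervals:
        missed_by_all *= 1.0 - chance_polled(minutes_live, interval)
    return 1.0 - missed_by_all


if __name__ == "__main__":
    # A post deleted after one minute, with a single reader polling
    # every 1, 2, or 3 hours.
    for hours in (1, 2, 3):
        print(hours, "hour interval:", chance_polled(1, hours * 60))
```

Under these assumptions a post pulled within a minute is very likely to be missed by any single hourly poller, but the odds shrink quickly as more robots and readers watch the site.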
Deleting the URL from your blog program doesn't necessarily delete it from the net.
Tip: To "delete" content from places like Bloglines' cache, replace the words on the page with new words, or a dot, or something else. But do it quickly, before you have published as many newer posts as the number (lastn="_") specified in your feed file, because after that the edited post drops out of the feed. I hope that makes sense.
Anyway, when Bloglines makes its next pass, it'll cache the new words and the old words will be gone. In theory, this should work with Google's bot, too.
Posted by: Marie at November 16, 2005 05:09 PM









