Does Anything Really Disappear from the Internet?
posted by Daniel Solove
I just posted about the Wayback Machine and that got me wondering whether anything really disappears from the Internet when it is deleted. Certainly, a ton gets archived in the Wayback Machine as well as in Google cache and in RSS readers. Of course, if something appears on the Internet, somebody could see it and copy it before it gets taken down.
But I was wondering to what extent information can vanish completely from the Internet. Thus, if a blogger posts something and then deletes it a minute later, can it escape from permanent fame? Maybe some ill-fated performances might be so brief that they can sneak on and off the Internet without being caught. What about a comment to a blog post that gets zapped quickly by the blog author? Can this escape becoming part of some permanent record?
The question, put another way: Can something posted briefly on the Internet, seen and heard by hardly anyone, not snatched up by anybody, and then deleted, be gone forever? Is there an Internet equivalent to a tree falling in the forest that nobody hears?
I don’t know the answer to this question, and I would like to hear from those with more technical expertise.
UPDATE: People with expertise have answered, and their replies are worth checking out if you’re interested in the issue.
November 16, 2005 at 12:10 am
Posted in: Blogging, Privacy, Technology
Responses (7)
Michael Froomkin - November 16, 2005 at 12:38 am
Sure it can vanish… if you didn’t send it anywhere (e.g. via RSS), and if no one came to read it in the interim. For unpopular unlinked sites that can be a very long time. For a site like this one…well, just hope you didn’t ping any sites to come and do updates, or have the bad luck to be visited by a googlebot or other robot (not to mention a person who kept a copy). As a practical matter, how long the ‘window of forgiveness’ may be depends on your traffic…and luck. But sure, lots of old stuff is gone forever, and new stuff too can vanish un-archived, especially if you get to it quickly enough.
[And if your server is on Unix, when a file is deleted/changed it is much more erased than on Windows.]
brewster kahle - November 16, 2005 at 1:05 am
Yes, you can stay out of the wayback machine on archive.org and out of most search engines: you put a robot exclusion on your website. You can also just stay out of the wayback machine and in the search engines etc. Search on “robot exclusion” and you will find the magic incantations.
-brewster
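A minimal sketch of the robot exclusion described above, checked with Python's standard `urllib.robotparser`. The user-agent name `ia_archiver` is the one historically associated with the Wayback Machine's crawler; that name, the example.com URL, and the helper `allowed` are assumptions for illustration and should be checked against archive.org's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the archive crawler out, let everyone else in.
# "ia_archiver" is assumed here as the archive crawler's user-agent name.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a well-behaved robot with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/2005/11/some-post.html"  # placeholder URL
    print("archive crawler allowed:", allowed("ia_archiver", url))
    print("search crawler allowed:", allowed("Googlebot", url))
```

This only describes what a robot that honors the file will do; nothing in robots.txt stops a crawler, or a person, that ignores it.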
Stephen M (Ethesis) - November 16, 2005 at 7:53 am
I’ve actually spent time trying to run things down that went away. Especially comments on blogs that have been deleted, etc.
The archive cycle, especially a while back, was not an hourly or even a daily one.
Paul Gowder - November 16, 2005 at 10:00 am
It always surprises me that the engines seem to obey robots.txt, but they do. I don’t know why, unless they think it will insulate them from liability.
I have successfully destroyed files from the early days of the internet, but today…
John Armstrong - November 16, 2005 at 11:16 am
I used to blog back in college before the word was invented. Had to hard-code the HTML myself. That site is mostly dust in the wind. I’ve only ever been able to find the splash page archived anywhere, imploring surfers to look on my works and despair.
Could a site be up for more than a year and be scoured from the net anymore? I doubt it.
Mike - November 16, 2005 at 2:45 pm
Google Concurring Opinions. You’ll notice a date on the 4th line of the first result. In green font it reads “Nov 14, 2005.” That’s the last time Google crawled your site. Generally Google crawls popular (as measured by incoming links/page rank) and regularly updated blogs every 2 or 3 days. So you have at least a day to keep something you posted out of Google.
Now, as someone noted, if you publish an RSS feed, you have a shorter lead time. It depends on how long it takes for various RSS readers to pick up your feed. Bloglines (rough estimate) usually takes from 1 to 3 hours to pick up a feed. So if you write something you regret, but delete it quickly, you have only a short time before your Bloglines subscribers read it.
Of course, as your blog gets more popular, there is a good chance that at any given time, more than one person will be reading your blog. So, putting aside technical issues to answer your question… Concurring Opinions ain’t no empty forest.
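The timing argument above can be put as a rough back-of-the-envelope sketch. Assume, purely for illustration, that each crawler or feed reader polls at a fixed interval and that the post goes up at a random point in that cycle; the chance it is captured before deletion is then roughly the time it stays up divided by the polling interval, capped at 1. The function names and the example intervals (2 hours for a feed reader, 2 days for a search crawler, loosely based on the estimates in the comment) are assumptions, and real crawlers do not keep a fixed schedule.

```python
from typing import Iterable


def capture_probability(minutes_online: float, poll_interval_minutes: float) -> float:
    """Chance that one fixed-interval poller sees a post that stays up this long."""
    if poll_interval_minutes <= 0:
        raise ValueError("poll interval must be positive")
    if minutes_online < 0:
        raise ValueError("time online cannot be negative")
    return min(1.0, minutes_online / poll_interval_minutes)


def capture_by_any(minutes_online: float, intervals: Iterable[float]) -> float:
    """Chance that at least one of several independent pollers sees the post."""
    miss_all = 1.0
    for interval in intervals:
        miss_all *= 1.0 - capture_probability(minutes_online, interval)
    return 1.0 - miss_all


if __name__ == "__main__":
    # Post deleted after 10 minutes; one feed reader and one search crawler.
    print(capture_by_any(10, [120, 2 * 24 * 60]))
```

Human readers are left out of this model entirely, which is the commenter's closing point: on a busy blog, someone is probably already looking.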
Marie - November 16, 2005 at 5:09 pm
Deleting the URL from your blog program doesn’t necessarily delete it from the net.
Tip: To “delete” content from places like Bloglines’ cache, replace the words on the page with new words, or a dot, or something else. But do it quickly – before you have published as many newer posts as the number (lastn=”_”) specified in your feed file. I hope that makes sense.
Anyway, when Bloglines makes its next pass, it’ll cache the new words and the old words will be gone. In theory, this should work with Google’s bot, too.
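One way to read this tip is as a toy model of a feed reader's cache. The sketch below assumes, hypothetically, that the reader stores one copy per post id and that each pass re-reads only the newest `last_n` items in the feed, overwriting what it holds for those ids. This is not a description of how Bloglines actually works; `poll`, `demo`, and the post ids are made up for illustration.

```python
from typing import Dict, List, Tuple

FeedItem = Tuple[str, str]  # (post id, post text)


def poll(cache: Dict[str, str], feed: List[FeedItem], last_n: int) -> None:
    """Hypothetical reader pass: re-read the newest last_n items, overwrite by id."""
    if last_n <= 0:
        return
    for post_id, text in feed[-last_n:]:
        cache[post_id] = text


def demo() -> Tuple[str, str]:
    last_n = 2

    # Case 1: the post is overwritten while it is still inside the feed window.
    feed = [("post-1", "hello"), ("post-2", "something regrettable")]
    quick_cache: Dict[str, str] = {}
    poll(quick_cache, feed, last_n)
    feed[1] = ("post-2", ".")
    poll(quick_cache, feed, last_n)

    # Case 2: two newer posts push it out of the window before it is overwritten.
    feed = [("post-1", "hello"), ("post-2", "something regrettable")]
    slow_cache: Dict[str, str] = {}
    poll(slow_cache, feed, last_n)
    feed += [("post-3", "newer"), ("post-4", "newest")]
    feed[1] = ("post-2", ".")
    poll(slow_cache, feed, last_n)

    return quick_cache["post-2"], slow_cache["post-2"]


if __name__ == "__main__":
    print(demo())
```

Under these assumptions the quick edit replaces the cached text, while the late edit never reaches the reader because the post has already dropped out of the feed.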