discuss AT lists.opennicproject.org
Subject: Discuss mailing list
List archive
- From: Jeff Taylor <shdwdrgn AT sourpuss.net>
- To: discuss AT lists.opennicproject.org
- Subject: Re: [opennic-discuss] Grep.geek offline for maintenance
- Date: Thu, 04 Oct 2012 09:04:33 -0600
That discussion makes sense at *any* time, and in fact has been
looked at before. The main concern is that even when using multiple
nodes to collect the data, it still needs to be housed in a single
database at each redundant location.

Collecting actual OpenNic website data is not really a strain for a
single server. The problem has always been in finding software that
correctly identifies when exact copies of the same data are present at
multiple sites (and prevents EACH site from being fully spidered in
this case).

The software I am using now (mnogosearch) has been told not to spider
anything outside of the OpenNic TLDs; however, when it gets a redirect
to another website, it seems to take that as an invitation to break
the rules.

I have a script which collects the list of domains from each of our
TLDs and builds a control list instructing mnogosearch on what to
index. My changes last night now actively go through each discovered
domain and check the HTTP headers. If a redirect is found, only the
homepage of that site is indexed. I am also tossing out every instance
where no webserver responded, because there are a large number of
registered domains that do not actually have a website associated with
them.

On 10/04/2012 02:48 AM, mike wrote:
>
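
For illustration only, the header check described in this message could
be sketched roughly as below. This is not the actual script: the
function names, the mode labels, and the choice of a HEAD request on
port 80 are all assumptions, and turning the resulting plan into
mnogosearch configuration is deliberately left out.

```python
import http.client

# Hypothetical labels for how a domain should be treated by the indexer.
FULL, HOMEPAGE_ONLY, SKIP = "full", "homepage-only", "skip"

# Status codes treated as redirects (304 Not Modified is not one).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify(status):
    """Map a homepage response to an indexing mode.

    status is the HTTP status code, or None if no web server answered.
    """
    if status is None:
        return SKIP            # registered domain without a website
    if status in REDIRECT_CODES:
        return HOMEPAGE_ONLY   # do not follow the redirect off-site
    return FULL


def probe(domain, timeout=5.0):
    """Send HEAD / to the domain; return the status code or None.

    http.client does not follow redirects, so a 3xx is seen as-is.
    """
    conn = None
    try:
        conn = http.client.HTTPConnection(domain, 80, timeout=timeout)
        conn.request("HEAD", "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        if conn is not None:
            conn.close()


def build_plan(domains, probe_fn=probe):
    """Return (domain, mode) pairs, dropping domains with no web server."""
    plan = []
    for domain in domains:
        mode = classify(probe_fn(domain))
        if mode != SKIP:
            plan.append((domain, mode))
    return plan
```

Passing a different probe_fn (for example a dictionary lookup) makes it
possible to try the classification logic without touching the network.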
- [opennic-discuss] Grep.geek offline for maintenance, Jeff Taylor, 10/03/2012
- Re: [opennic-discuss] Grep.geek offline for maintenance, Jeff Taylor, 10/04/2012
- Re: [opennic-discuss] Grep.geek offline for maintenance, mike, 10/04/2012
- Re: [opennic-discuss] Grep.geek offline for maintenance, Jeff Taylor, 10/04/2012
- Re: [opennic-discuss] Grep.geek offline for maintenance, mike, 10/04/2012
- Re: [opennic-discuss] Grep.geek offline for maintenance, Jeff Taylor, 10/04/2012