Re: [opennic-discuss] Grep.geek offline for maintenance


  • From: Jeff Taylor <shdwdrgn AT sourpuss.net>
  • To: discuss AT lists.opennicproject.org
  • Subject: Re: [opennic-discuss] Grep.geek offline for maintenance
  • Date: Thu, 04 Oct 2012 09:04:33 -0600

That discussion makes sense at *any* time, and in fact has been looked at before.  The main concern is that even when using multiple nodes to collect the data, it still needs to be housed in a single database at each redundant location.

Collecting actual OpenNIC website data is not really a strain for a single server.  The problem has always been in finding software that correctly identifies when exact copies of the same data are present at multiple sites (and prevents EACH of those sites from being fully spidered in that case).  The software I am using now (mnogosearch) has been told not to spider anything outside of the OpenNIC TLDs; however, when it gets a redirect to another website, it seems to take that as an invitation to break the rules.
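
One common way to recognize exact copies of the same data at multiple sites (a general technique, not necessarily what mnogosearch does internally) is to hash each fetched page body and only pass one URL per hash along to the indexer.  A rough Python sketch, with made-up URLs standing in for the real candidate list:

    import hashlib
    import urllib.request

    def body_hash(url, timeout=10):
        """Fetch a URL and return a SHA-256 digest of its body."""
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    seen = {}          # digest -> first URL found with that content
    unique_urls = []   # URLs worth handing to the indexer

    # Hypothetical candidates; the real list would come from the TLD zone data.
    for url in ["http://wiki.geek/", "http://mirror.example.geek/"]:
        try:
            digest = body_hash(url)
        except OSError:
            continue  # no webserver answered, nothing to index
        if digest in seen:
            continue  # exact copy of content already queued, skip this mirror
        seen[digest] = url
        unique_urls.append(url)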

I have a script that collects the list of domains from each of our TLDs and builds a control list instructing mnogosearch on what to index.  With my changes last night, the script now actively goes through each discovered domain and checks the HTTP headers.  If a redirect is found, only the homepage of that site is indexed.  I am also tossing out every instance where no webserver responded, because a large number of registered domains do not actually have a website associated with them.
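
As a rough illustration of that header check (the function name, timeout, and return values below are assumptions rather than the actual script), the per-domain decision could look something like this in Python:

    import urllib.error
    import urllib.request

    def classify_domain(domain, timeout=10):
        """Return 'full', 'homepage-only', or 'skip' for a discovered domain."""
        url = "http://" + domain + "/"
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                # urlopen follows redirects; compare the final URL to the request.
                if resp.geturl().rstrip("/") != url.rstrip("/"):
                    return "homepage-only"  # redirect found: index only the homepage
                return "full"               # normal response: let the indexer spider it
        except (urllib.error.URLError, OSError):
            return "skip"                   # no webserver responded: drop the domain

A domain classified as "skip" would simply be left out of the control list handed to mnogosearch.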

On 10/04/2012 02:48 AM, mike wrote:
> Since grep.geek is such an important part of OpenNIC and is subject to
> scalability issues, would it be appropriate to have a discussion
> around some sort of distributed kind of architecture, such that others
> could pool resources toward grep.geek and have some redundancy and
> more storage capacity at the same time?
>
> In other words, would a discussion around laying the foundation for a
> scalable grep.geek make any sense at this time?
>
> --Mike



