discuss - Re: [opennic-discuss] seed for initial crawl...

discuss AT lists.opennicproject.org

Subject: Discuss mailing list

  • From: Jeff Taylor <shdwdrgn AT sourpuss.net>
  • To: discuss AT lists.opennicproject.org
  • Subject: Re: [opennic-discuss] seed for initial crawl...
  • Date: Thu, 26 May 2011 21:52:30 -0600
  • List-archive: <http://lists.darkdna.net/pipermail/discuss>
  • List-id: <discuss.lists.opennicproject.org>

In the case of grep.geek...

First you need a list of TLDs that you want to query:
# dig @ns0.opennic.glue TXT tlds.opennic.glue +short | sed s/\"//g | sed 's/^[.] //'
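
If you want one TLD per line (easier to loop over), something like the following might do. This is only a sketch: parse_tlds is a made-up helper name, and it assumes the TXT record is a quoted, space-separated list that starts with a "." entry, which is what the sed expressions above imply.

```shell
#!/bin/sh
# parse_tlds: hypothetical helper, not part of the original commands.
# Reads the raw `dig ... +short` TXT answer on stdin, prints one TLD per line.
# Assumes a quoted, space-separated list with a leading "." (root) entry.
parse_tlds() {
    sed 's/"//g' |            # drop the quotes dig puts around TXT data
    tr -s ' \t' '\n\n' |      # split on blanks, one name per line
    sed '/^[.]*$/d'           # drop the root entry (".") and empty lines
}

# Usage (needs network access to an OpenNIC server):
#   dig @ns0.opennic.glue TXT tlds.opennic.glue +short | parse_tlds > tlds.txt
```

The file name tlds.txt is just an example.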

Next, you want to generate a list of 'possible' domains by checking each TLD zone. In the following, $TLD is of course the TLD name, and $SRV is any server that will give you the answers (probably have to check the tier-1 servers for this... I am at ns2.opennic.glue):
# dig $TLD. @$SRV AXFR | grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" -e "IN[[:space:]]NS" | grep -v "^$TLD.[[:space:]]" | grep -v "^;" | sed 's/.[ \t].*//g' | sed 's/^\*.//' | sort | uniq | sed /^'\t'*$/d | sed /^$1*$/d

That will AXFR the zone, strip it down to A, AAAA, CNAME, and NS records, and clean up the results to just the base domain names. You could also remove any names that begin with "ns##." by inserting the following two commands before the | sort stage:
# | sed 's/^[nN][sS][0-9][.]//'
# | sed 's/^[nN][sS][0-9][0-9][.]//'
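
For anyone who finds the sed chain hard to follow, here is a rough awk equivalent of the filtering step. It is a sketch, not what grep.geek actually runs: extract_names is a hypothetical name, it assumes the usual dig output layout (name, TTL, class, type, data), and it does not strip the "ns##." prefixes.

```shell
#!/bin/sh
# extract_names: hypothetical helper, not the original pipeline.
# Reads `dig ... AXFR` output on stdin and prints the unique owner names of
# A, AAAA, CNAME and NS records, without the trailing dot or wildcard label.
# $1 is the TLD name without a trailing dot, e.g. "geek".
extract_names() {
    awk -v tld="$1" '
        /^;/ || NF < 5 { next }                 # skip comments and non-record lines
        $3 != "IN" { next }                     # only Internet-class records
        $4 != "A" && $4 != "AAAA" && $4 != "CNAME" && $4 != "NS" { next }
        {
            name = $1
            sub(/^\*\./, "", name)              # "*.example.geek." -> "example.geek."
            sub(/\.$/, "", name)                # drop the trailing dot
            if (name != "" && name != tld) print name   # skip the TLD apex itself
        }
    ' | LC_ALL=C sort -u
}

# Usage (needs a server that allows AXFR for the zone):
#   dig "$TLD." @"$SRV" AXFR | extract_names "$TLD"
```

Matching on the type column avoids accidentally catching other record types whose name happens to start with "A".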

If you pipe the results of each TLD into a single file, you will have a massive list of around 6000 potential domain names. From there, I simply rely on the indexing program to attempt to spider each domain, and throw out the name if there is no website present.
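
Since the question was about seeding a crawler, here is one way the merged list could be turned into a seed file. Treat it as a sketch under assumptions: make_seeds and the file names are made up, and it simply guesses that each name answers on plain http, leaving it to the crawler to discard the ones that do not.

```shell
#!/bin/sh
# make_seeds: hypothetical helper, not part of the original commands.
# Merges one or more name lists (one name per line; stdin if no files given),
# drops blank lines and duplicates, and prints one http:// URL per name.
make_seeds() {
    cat "$@" |
    sed '/^[[:space:]]*$/d' |          # drop blank lines
    LC_ALL=C sort -u |                 # merge duplicates across TLD lists
    sed 's|^|http://|; s|$|/|'         # assumption: try plain http on each name
}

# Usage (file names are examples only):
#   make_seeds names-*.txt > urls/seed.txt
```

A plain text file with one URL per line is the usual seed format for crawlers such as Nutch, though the exact path depends on your setup.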


On 05/25/2011 05:08 AM, Rene Paulokat wrote:
hey there,

again for the subject of search-engines.

looking for a starting point / initial seed of (opennic-tld) urls to seed an
incrementing crawler (nutch.apache.org)

openniclist.ing has somehow no more the content i would expect.

any ideas / hints?

how is grep.geek/search.geek initiating its data?

lg
rene



