- From: Jeff Taylor <shdwdrgn AT sourpuss.net>
- To: discuss AT lists.opennicproject.org
- Subject: Re: [opennic-discuss] seed for initial crawl...
- Date: Thu, 26 May 2011 21:52:30 -0600
- List-archive: <http://lists.darkdna.net/pipermail/discuss>
- List-id: <discuss.lists.opennicproject.org>
In the case of grep.geek...
First you need a list of TLDs that you want to query:
# dig @ns0.opennic.glue TXT tlds.opennic.glue +short | sed 's/"//g' | sed 's/^[.] //'
Next, you want to generate a list of 'possible' domains by checking each TLD zone. In the following, $TLD is of course the TLD name, and $SRV is any server that will give you the answers (probably have to check the tier-1 servers for this... I am at ns2.opennic.glue):
# dig $TLD. @$SRV AXFR | grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" -e "IN[[:space:]]NS" | grep -v "^$TLD.[[:space:]]" | grep -v "^;" | sed 's/.[ \t].*//g' | sed 's/^\*.//' | sort | uniq | sed /^'\t'*$/d | sed /^$1*$/d
That will AXFR the zone, strip it down to A, AAAA, CNAME, and NS records (the "IN[[:space:]]A" pattern also matches AAAA), and clean up the results to just the base domain names. You could also remove any names that begin with "ns##." by inserting the following two commands before the | sort | stage:
# | sed 's/^[nN][sS][0-9][.]//'
# | sed 's/^[nN][sS][0-9][0-9][.]//'
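For anyone who wants to try the cleanup step without hammering a tier-1 server, here is a rough sketch of the same idea as a shell function that reads saved AXFR output on stdin. The function name extract_names is made up for this example, and it is not the exact pipeline above: it uses awk to pick the owner name, and it always applies the optional "ns##." stripping.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# Reads dig AXFR-style output on stdin and prints the unique owner
# names of A, AAAA, CNAME and NS records, minus the zone apex ($1).
extract_names() {
    tld="$1"
    # keep only the record types of interest
    grep -E "IN[[:space:]]+(A|AAAA|CNAME|NS)[[:space:]]" |
        # drop dig's comment lines
        grep -v '^;' |
        # owner name is the first field; normalise to lower case
        awk '{ print tolower($1) }' |
        # strip trailing dot, wildcard prefix, and a leading "ns#." / "ns##."
        sed -e 's/\.$//' -e 's/^\*\.//' -e 's/^ns[0-9]\{1,2\}\.//' |
        # drop lines that are just the TLD itself
        grep -v -x -F "$tld" |
        sort -u
}
```

Something like "dig geek. @ns2.opennic.glue AXFR | extract_names geek" should then give one candidate name per line, assuming the server allows the transfer.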
If you pipe the results of each TLD into a single file, you will have a massive list of around 6000 potential domain names. From there, I simply rely on the indexing program to attempt to spider each domain, and throw out the name if there is no website present.
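The "pipe each TLD into a single file" step could be sketched roughly as below. This is only an illustration under a few assumptions: fetch_zone and build_seed_list are hypothetical names, the TLD list arrives one name per line on stdin, and the record filtering is done with awk rather than the grep/sed chain shown earlier.

```shell
#!/bin/sh
# Hypothetical wrapper around the zone transfer; $1 = TLD, $2 = server.
# Only works against a server that permits AXFR.
fetch_zone() {
    dig "$1." "@$2" AXFR
}

# Reads TLD names (one per line) on stdin and prints a merged,
# de-duplicated list of candidate names. $1 = server to query.
build_seed_list() {
    srv="$1"
    while IFS= read -r tld; do
        # skip blank lines in the TLD list
        [ -n "$tld" ] || continue
        # stdin is redirected so the fetch cannot eat the TLD list
        fetch_zone "$tld" "$srv" < /dev/null |
            awk -v tld="$tld" '
                # dig prints: name ttl class type rdata
                $3 == "IN" && $4 ~ /^(A|AAAA|CNAME|NS)$/ &&
                tolower($1) != (tld ".") {
                    sub(/\.$/, "", $1)   # drop the trailing dot
                    print tolower($1)
                }'
    done | sort -u
}
```

Usage would be along the lines of "build_seed_list ns2.opennic.glue < tlds.txt > seeds.txt", where tlds.txt and seeds.txt are placeholder file names. The resulting file is what gets handed to the indexer, which still has to weed out names with no website.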
On 05/25/2011 05:08 AM, Rene Paulokat wrote:
hey there,
again for the subject of search-engines.
looking for a starting point / initial seed of (opennic-tld) urls to seed an
incrementing crawler (nutch.apache.org)
openniclist.ing somehow no longer has the content i would expect.
any ideas / hints?
how is grep.geek/search.geek initiating its data?
lg
rene