discuss - Re: [opennic-discuss] Opennic+IANA or can more zones added be later (like SARZ and Unifieedrrom.com): init crawling: Re: seed for initial crawl...

discuss AT lists.opennicproject.org

Subject: Discuss mailing list

  • From: Jeff Taylor <shdwdrgn AT sourpuss.net>
  • To: discuss AT lists.opennicproject.org
  • Subject: Re: [opennic-discuss] Opennic+IANA or can more zones added be later (like SARZ and Unifieedrrom.com): init crawling: Re: seed for initial crawl...
  • Date: Fri, 27 May 2011 10:24:51 -0600
  • List-archive: <http://lists.darkdna.net/pipermail/discuss>
  • List-id: <discuss.lists.opennicproject.org>

We also currently peer with NewNations, and I have code in place to include their TLDs as well.  When we set up peering with any group, I will ask for two things: a way to programmatically generate a list of their TLDs (such as the TXT record that OpenNIC uses), and the IP of a master server that can be queried for the zone files.  Both items are required for generating our custom root zone and for ensuring that any new TLDs are added automatically as soon as they are created.

With this knowledge in hand, I can easily add information for new peers to the indexer.  However, if you want full discovery of your domain names, I will need AXFR access to your zones.  Once that information has been entered, the new TLDs will be added to the search engine.


On 05/27/2011 07:20 AM, JP Blankert (thuis & PC based) wrote:
Dear all, OpenNIC, and Jeff Taylor,

From the message below, I understand that you want to index:
- OpenNIC extensions
- IANA extensions
by having them crawled quickly at the start (the 'seed').

What about other and later extensions, such as .2m, .golf and guitar.music, that are often hyperlinked (by SARZ/altrootzone.org and by Unifiedroot.com)? Will they be indexed by the 'normal procedure' in nmgosearch later, or could a separate indexing run be made later in which those 'non-IANA, non-OpenNIC' extensions are crawled early?

Thanks, jpblankert info AT altrootzone.org & jpblankert AT zonnet.nl
May 27, 2011



First you need a list of TLDs that you want to query:
# dig @ns0.opennic.glue TXT tlds.opennic.glue +short | sed s/\"//g | sed 's/^[.] //'
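
As a minimal sketch of the step above, the TXT answer can be split into one TLD per line so that it is easy to loop over later. The function name parse_tlds and the output file tlds.txt are made up for illustration, and the sketch assumes the record is a quoted, space-separated list whose first entry is a bare "." (which is what the sed commands above suggest):

```shell
#!/bin/sh
# parse_tlds (hypothetical helper): read `dig +short TXT` output on stdin
# and print one TLD per line.
# Assumption: the record is a quoted, space-separated list that starts
# with a bare "." entry for the root zone.
parse_tlds() {
    tr -d '"' |                      # drop the quotes dig prints around TXT data
        tr -s ' \t' '\n\n' |         # one entry per line
        grep -v -e '^\.$' -e '^$'    # drop the bare "." and any empty lines
}

# Example use (needs a resolver that can reach the OpenNIC servers):
#   dig @ns0.opennic.glue TXT tlds.opennic.glue +short | parse_tlds > tlds.txt
```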

Next, you want to generate a list of 'possible' domains by checking each TLD 
zone.  In the following, $TLD is of course the TLD name, and $SRV is any server 
that will give you the answers (probably have to check the tier-1 servers for 
this... I am at ns2.opennic.glue):
# dig $TLD. @$SRV AXFR \
    | grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" -e "IN[[:space:]]NS" \
    | grep -v "^$TLD.[[:space:]]" | grep -v "^;" \
    | sed 's/.[ \t].*//g' | sed 's/^\*.//' \
    | sort | uniq | sed /^'\t'*$/d | sed /^$1*$/d

That will AXFR the zone, strip it down to A, AAAA, CNAME, and NS records, and 
clean up the results to just the base domain names.  You could also remove any 
names that begin with "ns##." by inserting the following two commands before the 
|sort|
#  | sed 's/^[nN][sS][0-9][.]//'
#  | sed 's/^[nN][sS][0-9][0-9][.]//'
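
The same filtering can also be sketched with awk, which matches on the record-type column instead of grepping the whole line. This is an alternative sketch, not the pipeline used for the search engine; the function names are invented here, and it assumes dig's usual five-column output (name, TTL, class, type, data):

```shell
#!/bin/sh
# extract_names TLD (hypothetical helper): read AXFR output as printed by dig
# on stdin and print the owner names of A, AAAA, CNAME and NS records,
# lowercased and without the trailing dot.
extract_names() {
    awk -v tld="$1" '
        /^;/ || NF < 5               { next }   # comments and short lines
        $3 != "IN"                   { next }   # only class IN
        $4 !~ /^(A|AAAA|CNAME|NS)$/  { next }   # only the wanted record types
        {
            name = tolower($1)
            sub(/\.$/, "", name)                # drop the trailing dot
            sub(/^\*\./, "", name)              # drop a leading wildcard label
            if (name == "" || name == tolower(tld)) next   # skip the zone apex
            print name
        }
    '
}

# strip_ns_prefix: remove a leading "ns#." or "ns##." label, like the two
# optional sed commands above.
strip_ns_prefix() {
    sed 's/^[nN][sS][0-9]\{1,2\}[.]//'
}

# zone_names TLD: full filter, de-duplicated.
zone_names() {
    extract_names "$1" | strip_ns_prefix | sort -u
}

# Example use (needs a server that allows AXFR):
#   dig geek. @ns2.opennic.glue AXFR | zone_names geek
```

Matching on the type column avoids picking up lines where "IN A" merely appears somewhere in the record data, at the cost of depending on dig's column layout.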

If you pipe the results of each TLD into a single file, you will have a massive 
list of around 6000 potential domain names.  From there, I simply rely on the 
indexing program to attempt to spider each domain, and throw out the name if 
there is no website present.
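
Collecting everything into one file could look roughly like the sketch below. The function name collect_seeds and the file names are placeholders, the inline awk filter is a simplified stand-in for the longer pipeline above, and mktemp is assumed to be available:

```shell
#!/bin/sh
# collect_seeds TLD_FILE SERVER OUT_FILE (hypothetical helper)
# For each TLD in TLD_FILE (one per line), AXFR the zone from SERVER and
# gather the owner names of A, AAAA, CNAME and NS records into OUT_FILE,
# sorted and de-duplicated.
collect_seeds() {
    tld_file=$1
    server=$2
    out_file=$3
    tmp=$(mktemp) || return 1
    while IFS= read -r tld; do
        [ -n "$tld" ] || continue            # skip blank lines
        dig "$tld." "@$server" AXFR </dev/null |
            awk -v apex="$tld." '
                !/^;/ && $3 == "IN" && $4 ~ /^(A|AAAA|CNAME|NS)$/ && $1 != apex {
                    sub(/\.$/, "", $1)       # drop the trailing dot
                    print $1
                }' >> "$tmp"
    done < "$tld_file"
    sort -u "$tmp" > "$out_file"
    rm -f "$tmp"
}

# Example use (tlds.txt and seeds.txt are placeholder names):
#   collect_seeds tlds.txt ns2.opennic.glue seeds.txt
```

The resulting file can then be handed to the crawler as its seed list; names without a website are still expected to be discarded by the indexer itself.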


On 05/25/2011 05:08 AM, Rene Paulokat wrote:
> hey there,
>
> again for the subject of search-engines.
>
> looking for a starting point / initial seed of (opennic-tld) urls to seed an incrementing crawler (nutch.apache.org)
>
> openniclist.ing somehow no longer has the content i would expect.
>
> any ideas / hints?
>
> how is grep.geek/search.geek initiating its data?
>
> lg
> rene
> _______________________________________________
> discuss mailing list
> discuss AT lists.opennicproject.org
> http://lists.darkdna.net/mailman/listinfo/discuss






