[opennic-discuss] OpenNIC+IANA, or can more zones be added later (like SARZ and Unifiedroot.com): initial crawling: Re: seed for initial crawl...


  • From: "JP Blankert (thuis & PC based)" <jpblankert AT zonnet.nl>
  • To: "Blankert (privé), Jean Philippe" <jpBlankert AT zonnet.nl>, Midlifetop <contact AT midlifetop.nl>, "Blankert (privé), Jean Philippe" <jpBlankert AT zonnet.nl>, discuss AT lists.opennicproject.org, Jeff Taylor <shdwdrgn AT sourpuss.net>, "info AT AlternativeRootZone.org" <info AT alternativerootzone.org>
  • Subject: [opennic-discuss] OpenNIC+IANA, or can more zones be added later (like SARZ and Unifiedroot.com): initial crawling: Re: seed for initial crawl...
  • Date: Fri, 27 May 2011 15:20:05 +0200
  • List-archive: <http://lists.darkdna.net/pipermail/discuss>
  • List-id: <discuss.lists.opennicproject.org>

Dear all, OpenNIC, + Jeff Taylor,

From the message below, you want to index:
- OpenNIC extensions
- IANA extensions
by having them crawled quickly initially (the 'seed').

What about other, later extensions, such as .2m, .golf and guitar.music, which are often hyperlinked (by SARZ/altrootzone.org and by Unifiedroot.com)? Will they be indexed by the 'normal procedure' in nmgosearch later, or can a separate indexing run be made later by having those 'non-IANA, non-OpenNIC' names crawled early?

Thanks, jpblankert info AT altrootzone.org & jpblankert AT zonnet.nl
May 27, 2011



First you need a list of TLDs that you want to query:
# dig @ns0.opennic.glue TXT tlds.opennic.glue +short | sed s/\"//g | sed 's/^[.] //'

Next, you want to generate a list of 'possible' domains by checking each TLD 
zone.  In the following, $TLD is of course the TLD name, and $SRV is any server 
that will give you the answers (probably have to check the tier-1 servers for 
this... I am at ns2.opennic.glue):
# dig $TLD. @$SRV AXFR | grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" \
    -e "IN[[:space:]]NS" | grep -v "^$TLD.[[:space:]]" | grep -v "^;" \
    | sed 's/.[ \t].*//g' | sed 's/^\*.//' | sort | uniq \
    | sed /^'\t'*$/d | sed /^$1*$/d

That will AXFR the zone, strip it down to A, AAAA, CNAME, and NS records, and 
clean up the results to just the base domain names.  You could also remove any 
names that begin with "ns##." by inserting the following two commands before the 
|sort|
#  | sed 's/^[nN][sS][0-9][.]//'
#  | sed 's/^[nN][sS][0-9][0-9][.]//'
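The effect of those two filters can be sanity-checked offline on made-up names (the host names below are illustrative only):

```shell
# Hypothetical names; the two filters strip a leading "ns<digit>." or
# "ns<digit><digit>." label and leave everything else untouched.
printf 'ns1.example.geek\nns12.example.geek\nwww.example.geek\n' \
  | sed 's/^[nN][sS][0-9][.]//' \
  | sed 's/^[nN][sS][0-9][0-9][.]//'
# ns1.example.geek  -> example.geek
# ns12.example.geek -> example.geek
# www.example.geek  -> www.example.geek (unchanged)
```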

If you pipe the results of each TLD into a single file, you will have a massive 
list of around 6000 potential domain names.  From there, I simply rely on the 
indexing program to attempt to spider each domain, and throw out the name if 
there is no website present.
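
Since AXFR access depends on each server's transfer policy, the cleanup stages can also be exercised offline against a fabricated zone dump (all records below are made up, and the `sed` expressions are tidied equivalents of the ones above; a real run would feed `dig $TLD. @$SRV AXFR` into the same pipeline):

```shell
TLD=geek
# Fabricated AXFR output for a made-up "geek" zone (illustrative only)
SAMPLE='geek. 86400 IN SOA ns0.opennic.glue. admin.opennic.glue. 1 2 3 4 5
geek. 86400 IN NS ns1.geek.
example.geek. 86400 IN A 10.0.0.1
www.example.geek. 86400 IN CNAME example.geek.
*.wild.geek. 86400 IN A 10.0.0.2
; Transfer completed'

printf '%s\n' "$SAMPLE" \
  | grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" -e "IN[[:space:]]NS" \
  | grep -v "^$TLD\.[[:space:]]" \
  | grep -v "^;" \
  | sed 's/\.[[:space:]].*//' \
  | sed 's/^\*\.//' \
  | sort -u
# -> example.geek
#    wild.geek
#    www.example.geek
```

The SOA line and the zone's own NS line are dropped by the `grep -v "^$TLD\."` stage, the comment by `grep -v "^;"`, and the wildcard prefix by the final `sed`, leaving only base domain names.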


On 05/25/2011 05:08 AM, Rene Paulokat wrote:
> hey there,
>
> again for the subject of search-engines.
>
> looking for a starting point / initial seed of (opennic-tld) urls to seed an incrementing crawler (nutch.apache.org)
>
> openniclist.ing somehow no longer has the content I would expect.
>
> any ideas / hints?
>
> how is grep.geek/search.geek initiating its data?
>
> best regards
> rene
> _______________________________________________
> discuss mailing list
> discuss AT lists.opennicproject.org
> http://lists.darkdna.net/mailman/listinfo/discuss