discuss AT lists.opennicproject.org
Subject: Discuss mailing list
[opennic-discuss] Opennic+IANA, or can more zones be added later (like SARZ and Unifiedroot.com): init crawling: Re: seed for initial crawl...
- From: "JP Blankert (thuis & PC based)" <jpblankert AT zonnet.nl>
- To: "Blankert (privé), Jean Philippe" <jpBlankert AT zonnet.nl>, Midlifetop <contact AT midlifetop.nl>, "Blankert (privé), Jean Philippe" <jpBlankert AT zonnet.nl>, discuss AT lists.opennicproject.org, Jeff Taylor <shdwdrgn AT sourpuss.net>, "info AT AlternativeRootZone.org" <info AT alternativerootzone.org>
- Subject: [opennic-discuss] Opennic+IANA, or can more zones be added later (like SARZ and Unifiedroot.com): init crawling: Re: seed for initial crawl...
- Date: Fri, 27 May 2011 15:20:05 +0200
- List-archive: <http://lists.darkdna.net/pipermail/discuss>
- List-id: <discuss.lists.opennicproject.org>
Dear all, opennic, + Jeff Taylor,
In the message below, you want to index:
- opennic extensions
- IANA extensions
by having them crawled quickly initially ('seed').
What about other and later extensions, such as .2m, .golf and guitar.music, that are often hyperlinked (by SARZ/altrootzone.org and by Unifiedroot.com)? Will they be indexed by the 'normal procedure' in nmgosearch later, or can a separate indexing run be made later by having those 'non-IANA, non-opennic' names crawled early?
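[Mechanically, such a separate run would only need the extra names merged into the crawler's seed list. A minimal sketch, with made-up file names and domains; this is not OpenNIC or Nutch practice, only an illustration:]

```shell
# Sketch: merge hypothetical extra zone lists (SARZ / Unifiedroot style
# names) with an OpenNIC seed list into one deduplicated file that a
# separate crawl run could consume. All names here are invented.
printf 'grep.geek\nsearch.geek\n' > opennic.seed
printf 'guitar.music\nclubs.golf\n' > extra.seed
sort -u opennic.seed extra.seed > all.seed   # combined, deduplicated seed
cat all.seed
```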
Thanks, jpblankert info AT altrootzone.org & jpblankert AT zonnet.nl
May 27, 2011
First you need a list of TLDs that you want to query:

# dig @ns0.opennic.glue TXT tlds.opennic.glue +short | sed s/\"//g | sed 's/^[.] //'

Next, you want to generate a list of 'possible' domains by checking each TLD zone. In the following, $TLD is of course the TLD name, and $SRV is any server that will give you the answers (you will probably have to check the tier-1 servers for this... I am at ns2.opennic.glue):

# dig $TLD. @$SRV AXFR | grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" -e "IN[[:space:]]NS" | grep -v "^$TLD.[[:space:]]" | grep -v "^;" | sed 's/.[ \t].*//g' | sed 's/^\*.//' | sort | uniq | sed /^'\t'*$/d | sed /^$1*$/d

That will AXFR the zone, strip it down to A, AAAA, CNAME, and NS records, and clean up the results to just the base domain names. You could also remove any names that begin with "ns##." by inserting the following two commands before the sort:

# | sed 's/^[nN][sS][0-9][.]//'
# | sed 's/^[nN][sS][0-9][0-9][.]//'

If you pipe the results for each TLD into a single file, you will have a massive list of around 6000 potential domain names. From there, I simply rely on the indexing program to attempt to spider each domain, and throw out the name if there is no website present.

On 05/25/2011 05:08 AM, Rene Paulokat wrote:
> hey there,
>
> again on the subject of search engines.
>
> I am looking for a starting point / initial seed of (opennic-TLD) URLs to feed an incremental crawler (nutch.apache.org).
>
> openniclist.ing somehow no longer has the content I would expect.
>
> any ideas / hints?
>
> how does grep.geek/search.geek initiate its data?
>
> lg
> rene
> _______________________________________________
> discuss mailing list
> discuss AT lists.opennicproject.org
> http://lists.darkdna.net/mailman/listinfo/discuss
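[The zone-cleanup step of the pipeline above can be sketched offline against a canned AXFR dump, so no tier-1 server is needed to see what it does. The sample records below are hypothetical, the ns-stripping sed is folded in, and the final empty-line cleanup is simplified to a plain blank-line delete:]

```shell
#!/bin/sh
# Offline sketch of the AXFR cleanup pipeline: feed it fake zone data
# instead of "dig $TLD. @$SRV AXFR" output. Sample records are made up.
TLD=geek

cleanup() {
  grep -e "IN[[:space:]]A" -e "IN[[:space:]]CNAME" -e "IN[[:space:]]NS" |
    grep -v "^$TLD.[[:space:]]" |   # drop records for the bare TLD itself
    grep -v "^;" |                  # drop dig's comment lines
    sed 's/.[ \t].*//g' |           # keep only the owner name, minus trailing dot
    sed 's/^\*.//' |                # strip wildcard labels
    sed 's/^[nN][sS][0-9][.]//' |   # strip "ns#." nameserver prefixes
    sort -u |
    sed '/^$/d'                     # drop empty lines
}

sample() {
  printf '%s\n' '; <<>> DiG <<>> geek. AXFR'
  printf 'geek.\t86400\tIN\tNS\tns0.opennic.glue.\n'
  printf 'grep.geek.\t86400\tIN\tA\t10.0.0.1\n'
  printf 'www.grep.geek.\t86400\tIN\tCNAME\tgrep.geek.\n'
  printf 'ns1.search.geek.\t86400\tIN\tA\t10.0.0.2\n'
}

sample | cleanup
```

Run on this sample it prints grep.geek, search.geek and www.grep.geek: the comment line and the TLD's own NS record are filtered out, and the ns1. prefix is stripped.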
Archive powered by MHonArc 2.6.19.