
discuss - Re: [opennic-discuss] search engine opennicproject?

discuss AT lists.opennicproject.org

Subject: Discuss mailing list

List archive

  • From: Maximi89 <maximi89 AT gmail.com>
  • To: discuss AT lists.opennicproject.org
  • Subject: Re: [opennic-discuss] search engine opennicproject?
  • Date: Sun, 8 May 2011 23:01:01 -0400
  • List-archive: <http://lists.darkdna.net/pipermail/discuss>
  • List-id: <discuss.lists.opennicproject.org>



2011/5/8 Jeff Taylor <shdwdrgn AT sourpuss.net>
Sounds like an awful lot of maintenance to run it.  Is there any advantage to YaCy over other packages that would make it worthwhile?

YaCy is P2P-based. The memory problems are being worked on, according to the developer f1ori on #yacy at irc.freenode.net.
Grep.geek runs off of Sphider, a PHP-based solution.  I give it 32MB of ram, and it is fairly passive, running in the background without interfering with other processes.  I can only think of one time I had to repair the database, and that was from a bad upgrade.  I have a fairly short script I run (written in bash) which generates the list of domains by reading all of the TLD zones.  A second script reads domains from the list, and spiders up to three domains at a time, refreshing the list whenever it runs out.  The last script I wrote just checks the database, and removes any domains that have not been reachable for several weeks.

I have three servers, each taking a 30-minute shift at spidering each day.  Despite the short run-time, I still cycle through all of the domains every 3-4 days.  The only maintenance I ever do is when we change our peering info, otherwise I completely forget about it for months at a time.



On 05/08/2011 12:37 PM, Morten Oesterlund Joergensen wrote:

Maybe someone should set up an instance of YaCy
(http://en.wikipedia.org/wiki/YaCy) beginning the crawling at some of
the websites using the top-level domains of OpenNIC?
I have had YaCy running for years up until about half a year ago. It
requires several GiBs of RAM; else the Java virtual machine runs out of
memory and that often results in internal corruption or something, which
requires a reinstall of YaCy itself. It should of course be possible to
find an easier fix, for instance like clearing its internal database.
That was the reason why I stopped running it. Maybe I should spend some
time installing it again.
One also needs to tweak the I/O usage, as it really slows down the
system if not configured correctly.
Even though not strictly necessary, I recommend doing a bit of
maintenance about every two weeks or so: restart the crawl from the
website it originally started at (otherwise it may never reach that
site again) and clear the old data from the index. I believe that
the search engine always returns the results from the most recent
crawl, and by default old data is overwritten when the crawler
stumbles upon an already visited website, but there is really
no reason to store old and possibly outdated data. Unfortunately there
isn't an automated way to delete only the old data, so one has to clear
the entire database at once.
It seems that one can test the search engine here:
http://yacy.net/en/Searchportal.html

_______________________________________________
discuss mailing list
discuss AT lists.opennicproject.org
http://lists.darkdna.net/mailman/listinfo/discuss



--
Maximiliano Augusto Castañón Araneda
Santiago, Chile
Linux user # 394821

Skype: maximi89
MSN: maximi89 AT gmail.com
XMPP/Jabber: maximi89 AT gmail.com


