The Nutch project is now a couple of years old and can rightfully be considered a gold standard in crawling.
So far so good, but what was the problem? When Nutch was created, Solr was the state of the art in text search. Since Elasticsearch has become more and more popular in recent years, Nutch of course also ships with an integrated Elasticsearch indexer.
However, Nutch 1.15 only supports Elasticsearch 5, which may have caused massive problems, and not only on our systems.
We revised the dependencies and adapted the original indexer code, so Nutch can now also be used with Elasticsearch 6. This means we can keep using this great tool in combination with what is currently the best text search solution.
You can download the indexer for free on GitHub, or you can contribute to it.
Elasticsearch is a search engine based on Lucene. It is written in Java and stores documents as JSON in a NoSQL fashion. Clients communicate with it via a RESTful web interface. Alongside Solr, Elasticsearch is the most widely used search server.
If you are wondering what the hell Nutch actually is: it is a crawler written in Java. In other words, it is a tool that grabs the contents of web pages and processes them further, for example by indexing them in a search engine.
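To make the RESTful interface a bit more concrete, here is a minimal sketch of what indexing a single crawled page over HTTP could look like. It does not show how Nutch or our indexer works internally; it only illustrates the kind of request an Elasticsearch 6 node accepts. The index name `pages`, the field names `url`, `title` and `content`, the `--send` flag and the address `http://localhost:9200` are assumptions chosen for illustration. The sketch uses the HTTP client that comes with Java 11.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class EsRestSketch {
    // Assumption: a local Elasticsearch 6 node on the default HTTP port.
    static final String BASE_URL = "http://localhost:9200";

    // Escapes a string so it can be embedded in a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds the JSON document for one crawled page (hypothetical field names).
    static String toJson(String url, String title, String content) {
        return "{\"url\":\"" + escape(url)
             + "\",\"title\":\"" + escape(title)
             + "\",\"content\":\"" + escape(content) + "\"}";
    }

    // Elasticsearch 6 addresses a document as /{index}/_doc/{id}.
    // The id is URL-encoded because a page URL is used as the id here.
    static URI documentUri(String index, String id) {
        String encodedId = URLEncoder.encode(id, StandardCharsets.UTF_8);
        return URI.create(BASE_URL + "/" + index + "/_doc/" + encodedId);
    }

    // Elasticsearch 6 requires an explicit Content-Type header on requests with a body.
    static HttpRequest buildIndexRequest(String index, String id, String json) {
        return HttpRequest.newBuilder(documentUri(index, id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String pageUrl = "https://example.org/";
        String json = toJson(pageUrl, "Example Domain", "Text extracted by the crawler");
        HttpRequest request = buildIndexRequest("pages", pageUrl, json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);

        // Only talk to the network when explicitly asked to (hypothetical flag).
        if (args.length > 0 && args[0].equals("--send")) {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```

Run without arguments, the program only prints the request it would send. With `--send` and a reachable node, it stores the document under the page URL as its id, so indexing the same page again overwrites the earlier version instead of creating a duplicate.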
If you have any questions or requests regarding the module, we would be pleased to hear from you.
If you would like to stay on top of our Pimcore Module, please sign up for our newsletter.
Copyright © 2024 asioso. All Rights Reserved.