
Best Self-Hosted Search Engines  

Does your boss know that you’re looking for another job? Have you told your significant other that you can’t decide whether you want to have children? Do your parents know about your sexual orientation? Well, Google and other major search engines do.

“Most users search Google while signed in, so all of the information on their online life is available: YouTube searches, emails, and past search history,” says Adam Tauber, the lead developer of privacy-respecting metasearch engine Searx.

Of course, you could use Tor for anonymity and delete all traces of your activity after every search, but that would most likely get old pretty quickly. Instead, consider installing a self-hosted search engine that can retrieve information for you without disclosing anything sensitive about you.

We have selected two such search engines, and we also introduce three additional search engines to show you that excellent alternatives to proprietary search engines such as Google or Bing already exist and are easier to install and use than you might think.

1. YaCy

YaCy is a free distributed peer-to-peer search engine whose core component is written in Java. Because all YaCy users are equal, and because the search engine doesn’t store user search requests, censorship is simply not possible.

Currently, YaCy’s index contains about 1.4 billion documents thanks to the activity of more than 600 peer operators who contribute to it each month. For comparison, the Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size.

While YaCy still has a long way to go before it can rival the largest centralized search engines in the world, it’s already usable as a search portal for private intranets and project-specific applications because YaCy can operate as a single search appliance without networking with other peers.

YaCy can be easily integrated into any web page thanks to its simple code snippets that can be effortlessly copied and pasted without any modification.

2. Searx

Searx is described as a privacy-respecting, hackable metasearch engine. It’s available under the GNU Affero General Public License version 3, and its main goal is to protect the privacy of its users by never sharing users’ IP addresses or search history with the search engines from which it gathers results.

“When using Searx, the IP address of Searx, a random User-Agent and a search query is sent to Google by default,” says Adam Tauber, aka asciimoo, explaining how his metasearch engine works. “Of course, you can customize Searx to forward other extra parameters like search language or the page number of the requested result page.”
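The kind of request Tauber describes can be sketched in Python roughly as follows. This is an illustration only, not Searx's actual code: the `USER_AGENTS` list, the `build_upstream_request` function, and the `www.example.com` URL are all assumptions made for the example. The point is that the outgoing request carries just the query and a randomly chosen User-Agent, with no cookies or user identifiers attached.

```python
import random
import urllib.parse
import urllib.request

# Hypothetical pool of User-Agent strings; a real deployment would maintain its own.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0",
]


def build_upstream_request(query, base_url="https://www.example.com/search"):
    """Build (but do not send) a request with only the query and a random User-Agent."""
    url = base_url + "?" + urllib.parse.urlencode({"q": query})
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


request = build_upstream_request("self-hosted search")
print(request.full_url)
# urllib stores header names capitalized as "User-agent"
print(request.get_header("User-agent"))
```

Because the request is built on the server that runs the metasearch engine, the upstream search engine would see that server's IP address rather than the user's.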

Searx automatically blocks all tracking cookies served by the search engines. This prevents the engines from tailoring results to a profile built from what they know about the user. Searx is 100 percent free, and anyone can modify it as needed. You can even take the Searx code and run the metasearch engine on your own server, which should address any concerns you might have regarding logs.

3. ElasticSearch

ElasticSearch is a search engine based on Lucene, a free and open-source information retrieval software library that is supported by the Apache Software Foundation and released under the Apache License.

ElasticSearch provides a full-text search engine with an HTTP web interface. The search engine can be used to search all kinds of documents, and it can be easily distributed across multiple nodes.
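As a rough sketch of what talking to that HTTP interface can look like, the Python snippet below builds a full-text `match` query for the `_search` endpoint. The index name `articles`, the field name `body`, the `build_search_request` helper, and the local address `http://localhost:9200` are assumptions for illustration; adjust them to your own installation.

```python
import json
import urllib.request


def build_search_request(index, field, text, host="http://localhost:9200"):
    """Build (but do not send) a full-text search request for one index."""
    # A "match" query runs a full-text search on a single field.
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{host}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names.
req = build_search_request("articles", "body", "privacy")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With a node actually running at that address, passing `req` to `urllib.request.urlopen` would send the query and return the matching documents as JSON.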

It’s possible to build a self-hosted search engine using ElasticSearch and Docker, and you can find a tutorial that describes the process here.

4. Ambar

Ambar is an open-source document search engine with many useful features. It supports automated crawling, tagging, and instant full-text search, just to give a few examples. One of the most exciting features of Ambar is its ability to perform OCR on images and PDF files. The supported languages include English, German, Russian, Italian, French, Spanish, Polish, and Dutch.

Ambar can be easily deployed with a single docker-compose file, and you can learn how to do it here.

5. Apache Solr

Written in Java, Apache Solr is an enterprise search platform that includes full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, and many other important features. It was created in 2004 for an in-house project at CNET Networks. CNET Networks kindly donated it to the Apache Software Foundation in 2006, where it graduated from incubation status into a standalone top-level project in 2007.
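To give an idea of how features such as faceted search and hit highlighting are requested, here is a minimal Python sketch that builds a query URL for Solr's `select` handler. The core name `articles`, the field names, the `build_solr_query_url` helper, and the local address `http://localhost:8983` are assumptions for the example, not details from any particular installation.

```python
import urllib.parse


def build_solr_query_url(core, query, facet_field, highlight_field,
                         host="http://localhost:8983"):
    """Build a select URL that asks for facet counts and highlighted snippets."""
    params = {
        "q": query,                  # the full-text query
        "facet": "true",             # turn on faceted search
        "facet.field": facet_field,  # field to count facet values for
        "hl": "true",                # turn on hit highlighting
        "hl.fl": highlight_field,    # field to highlight matches in
    }
    return f"{host}/solr/{core}/select?" + urllib.parse.urlencode(params)


# Hypothetical core and field names.
url = build_solr_query_url("articles", "text:privacy", "category", "text")
print(url)
```

Opening the resulting URL against a running Solr instance would return the matching documents together with facet counts and highlighted snippets.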

Today, Solr is a highly reliable, scalable, and fault-tolerant enterprise search platform that powers the search and navigation features of many of the world’s largest internet sites, including DuckDuckGo, eHarmony, and BestBuy. You can

How to Install and Configure YaCy

The installation of YaCy is very simple, and it takes only a couple of minutes because you don’t need to install an external database or web server—YaCy comes with everything needed.

  1. Go to the official website of YaCy and download the latest package for Linux.
  2. Install the OpenJDK 8 runtime environment.
    • If you’re using a Debian-based distribution, use the following command: $ sudo apt-get install openjdk-8-jre
    • If not, follow the instructions specific to your distribution.
  3. Extract the downloaded package to your preferred location.
  4. Go to the new folder and run the “startYACY.sh” script in a terminal.
  5. You should see a confirmation message informing you that YaCy started as a daemon.
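Once the steps above are done, one way to double-check that the daemon is really running is to test whether its web interface accepts connections. The Python sketch below assumes YaCy listens on port 8090 of the local machine, which is its usual default. The `is_port_open` helper is a name made up for this example.

```python
import socket


def is_port_open(host="localhost", port=8090, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


# Port 8090 is assumed here; change it if you configured YaCy differently.
print("YaCy reachable:", is_port_open())
```

If this prints `False` right after starting the script, give YaCy a little time to finish starting up and try again.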

Conclusion

Search engines know more about us than most people would like to admit. If you would like to stop feeding big corporations with juicy data, you can take things into your own hands and set up a self-hosted search engine to protect your privacy. Although self-hosted search engines still have a long way to go to become fully usable, the potential for them to outperform the likes of Google is there, and realizing it is just a matter of attracting more users.

About the author

David Morelo

David Morelo is a professional content writer in the technology niche, covering everything from consumer products to emerging technologies and their cross-industry application.