Towards a renaissance in search

By: Abilian, 31/05/2023

The last few years have seen a renaissance in search, just when the game seemed to be over, with Google dominating public web search and Elasticsearch dominating open source projects.

Here are a few trends that bear witness to this renaissance:

Artificial Intelligence and Machine Learning: These technologies are increasingly being used to improve the relevance of search results. They make it possible to analyse user behaviour, learn from it and personalise search results according to individual preferences. AI is also used to understand natural language, enabling better results to be obtained for more complex queries.
Semantic search: Applying the technologies from the previous point, semantic search focuses on understanding the meaning and context of the words used in a query, rather than the words themselves. This produces more precise and relevant results.
Privacy: With growing awareness of online privacy and security issues, privacy-focused search engines such as DuckDuckGo have become increasingly popular, as has the idea of indexing one's own data without relying on an external service.
Visual search: Visual search, which allows users to search for information based on images, is another fast-growing trend. Deep learning technologies, another application of the AI and machine learning trend above, are increasingly being used to analyse and understand the content of images.
Real-time search: As the volume of available data increases, the ability to provide real-time search results becomes increasingly important.

Some open source projects

Here are a few recent projects that demonstrate the dynamism of the field.

Classic indexing engines

These tools can be seen as a 'new wave' of projects that follow on from older projects such as Lucene (2000) and its derivatives (including Elasticsearch (2010)).

They provide robust, flexible solutions for indexing, searching and retrieving data quickly and accurately. Most include advanced features such as full-text search, typo tolerance and complex filtering. They are designed to integrate easily with a variety of applications and systems without compromising performance.

Meilisearch: Meilisearch is an ultra-fast, open-source search engine. It offers full-text search with low latency and is easy to use, deploy and integrate into web applications.
Sonic: Sonic is an open-source search index server written in Rust. It is lightweight and schema-free. Sonic indexes identifiers rather than whole documents and relies on a Levenshtein automaton for typo correction, which makes it a minimalist and resource-efficient alternative to heavier search backends.
Typesense: Typesense is a lightweight, speed-optimised, open-source search engine written in C++. It is designed to provide instant search experiences with a focus on simplicity and ease of installation.
Bleve: Inspired by Apache Lucene, Bleve is a full-text search and indexing library for Go. It provides components for tokenising, filtering and parsing text, then creating a searchable index.
Tantivy: Tantivy is a text search engine library inspired by Lucene and written in Rust. It is a low-level library designed to be the basis for building a complete search engine.
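To make "full-text search" and "typo tolerance" more concrete, here is a minimal sketch in pure Python of the two ideas these engines build on: an inverted index (token to document ids) and fuzzy matching by edit distance. It does not reproduce the API or the internals of any of the projects above; the names `TinyIndex`, `edit_distance` and `max_typos` are hypothetical, and real engines use far more efficient structures than the linear scan shown here.

```python
from __future__ import annotations

from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


class TinyIndex:
    """Toy inverted index: maps each lowercase token to the ids of documents containing it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        # Naive tokenisation: lowercase and split on whitespace.
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query: str, max_typos: int = 1) -> set[int]:
        """Return ids of documents matching every query term (Boolean AND),
        accepting indexed tokens within `max_typos` edits of each term."""
        results: set[int] | None = None
        for term in query.lower().split():
            matches: set[int] = set()
            # Linear scan over the vocabulary: fine for a sketch, too slow at scale.
            for token, doc_ids in self.postings.items():
                if edit_distance(term, token) <= max_typos:
                    matches |= doc_ids
            results = matches if results is None else results & matches
        return results or set()


index = TinyIndex()
index.add(1, "open source search engine")
index.add(2, "vector database for embeddings")
index.add(3, "fast search library")

print(sorted(index.search("serch")))         # → [1, 3]
print(sorted(index.search("serch engine")))  # → [1]
```

The misspelt query "serch" still finds both documents containing "search", because the two words are one edit apart.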

Client-side search

There are several JavaScript libraries and frameworks that can be used to implement client-side search. These tools typically offer full-text search, with features such as search term highlighting, Boolean query support, typo tolerance and more.

Here are some popular options:

MiniSearch: MiniSearch is a client-side full-text search library for JavaScript. It offers advanced search features such as Boolean query support and typo tolerance, while remaining lightweight and fast.
Js-search: Js-search is a library for performing full-text searches on client-side data. It offers great flexibility in terms of configuration and supports several search and indexing strategies.
Lunr.js: Lunr.js is a small full-text search library for the browser. It offers a simple but powerful search syntax and is designed to be easy to install and use.
Fuse.js: Fuse.js is a lightweight library that provides a very flexible fuzzy search. It is perfect for situations where you need to search through a list of objects.
Elasticlunr.js: Elasticlunr.js is a lightweight, fast, and flexible full-text search library in JavaScript. It is based on Lunr.js, but adds extra functionality, such as the ability to add or remove documents from the index after it has been created.

Vector databases

Vector databases are designed to enable the storage and retrieval of large quantities of high-dimensional data, such as embeddings generated by machine learning models. These databases are particularly useful in AI applications, as they enable fast and efficient similarity searches.
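As an illustration of what a similarity search computes, here is a minimal brute-force sketch in pure Python. The three-dimensional "embeddings" are made up for the example (real models produce hundreds or thousands of dimensions), and the function names `cosine_similarity` and `nearest` are hypothetical. Production vector databases typically avoid this exhaustive scan by using approximate nearest-neighbour indexes.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k stored items most similar to the query, best match first."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Made-up toy embeddings, for illustration only.
embeddings = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.9, 0.1, 0.2],
    "car": [0.0, 1.0, 0.0],
}

for key, score in nearest([1.0, 0.0, 0.0], embeddings, k=2):
    print(key, round(score, 3))
```

With these toy values, "cat" and "dog" rank ahead of "car" because their vectors point in nearly the same direction as the query.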

Here are a few significant open source projects:

Qdrant: Qdrant is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API for storing, searching and managing points, that is, vectors with an additional payload. Qdrant is tailored to support extended filtering, which makes it useful for all kinds of neural-network or semantic-based matching, faceted search, and other applications. Qdrant is written in Rust.
Milvus: Milvus is an open-source vector database designed specifically for AI and machine learning applications. It supports a variety of distance metrics and is scalable, reliable, and capable of handling hybrid (vector and scalar) search. Milvus is written mainly in Go and C++.
Weaviate: Weaviate is an open-source vector database that stores both objects and vectors. It combines vector search and structured filtering with the fault tolerance and scalability of a cloud-native database, all accessible via GraphQL, REST, and various language clients. Weaviate is written in Go.
Deep Lake: Deep Lake is a vector database powered by a single storage format optimised for deep learning and large language model (LLM) applications. It simplifies the deployment of enterprise-class LLM-based products by providing storage for all data types (embeddings, audio, text, video, images, PDFs, annotations, etc.), vector search, data streaming while training large-scale models, versioning and data lineage for all workloads, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many others. Deep Lake is written in Python.

These databases are designed to allow efficient manipulation of vector data; which one to choose depends on the specific needs of your application or project. Given the importance of Python in machine learning and AI, all these projects offer Python SDKs.

Summary

The projects we have listed illustrate how the open-source community is pushing the boundaries of search technology, taking advantage of the possibilities offered by new programming languages and machine learning, and focusing on specific problems, such as resource optimisation, ease of installation and fine-grained understanding of indexed documents.