
IResearch search engine

Version 1.0

Overview

The IResearch library is meant to be treated as a standalone index capable of both indexing and storing individual values verbatim. Indexed data is treated on a per-version/per-revision basis, i.e. an existing data version/revision is never modified and updates/removals are treated as new versions/revisions of the said data. This allows for trivial multi-threaded read/write operations on the index.

The index exposes its data-processing functionality via a multi-threaded 'writer' interface that treats each document abstraction as a collection of fields to index and/or store. It exposes its data-retrieval functionality via a 'reader' interface that returns records from an index matching a specified query. The queries themselves are constructed either from string IQL (index query language) requests or from query trees built directly using the query building blocks available in the API. The querying infrastructure provides the capability of ordering the result set by one or more ranking/scoring implementations. The ranking/scoring logic is plugin-based and lazy-initialized at runtime as needed, allowing custom ranking/scoring logic to be added without even recompiling the IResearch library.

High level architecture and main concepts

Index

An index consists of multiple independent parts: index metadata and a number of segments. The index metadata stores information about the active index segments for a particular index version/revision. Each index segment is an index in itself and consists of the following logical components:
  • segment metadata
  • field metadata
  • term dictionary
  • postings lists
  • list of deleted documents
  • stored values
Read/write access to these components is carried out via plugin-based formats. An index may contain segments created using different formats.

Document

A database record is represented as an abstraction called a document. A document is in fact a collection of indexed/stored fields. In order to be processed, each field must satisfy at least the IndexedField or StoredField concept.

IndexedField concept

For a type T to satisfy IndexedField, the following expressions must be valid for an object m of type T:
  • m.name(): the result must be convertible to iresearch::string_ref; the value is used as the field (key) name.
  • m.boost(): the result must be convertible to float_t; the value is used as the boost factor for the document.
  • m.get_tokens(): the result must be convertible to iresearch::token_stream*; the token stream is used to populate the inverted index. If the value is nullptr the field is treated as non-indexed.
  • m.features(): the result must be convertible to const iresearch::flags&; a set of features requested for evaluation during indexing, e.g. it may request processing of positions and frequencies. The evaluated information can later be used during querying.

StoredField concept

For a type T to satisfy StoredField, the following expressions must be valid for an object m of type T:
  • m.name(): the result must be convertible to iresearch::string_ref; the value is used as the field (key) name.
  • m.write(iresearch::data_output& out): the result must be convertible to bool. Arbitrary data may be written to the stream denoted by out in order to retrieve the written value later via the index_reader API. If nothing has been written but the returned value is true, the stored value is treated as a flag. If the returned value is false, nothing is stored even if something has been written to the out stream.

Directory

A data-storage abstraction that stores data either in memory or on the filesystem, depending on which implementation is instantiated. A directory stores at least all currently in-use index data versions/revisions. If there are no active users of the directory, at least the last data version/revision is retained. Unused data versions/revisions may be removed via the directory_cleaner. A single version/revision of the index is composed of one or more segments associated with, and possibly shared by, the said version/revision.

Writer

A single-instance-per-directory object used for indexing data. Data may be indexed on a per-document basis or sourced from another reader for trivial directory-merge functionality. Each commit() of a writer produces a new version/revision of the view of the data in the corresponding directory. Additionally, the interface provides directory defragmentation capabilities that allow compacting multiple smaller version/revision segments into larger, more compact representations. A writer supports two-phase transactions via its begin()/commit()/rollback() methods.

Reader

A reusable/refreshable view of an index at a given point in time. Multiple readers can use the same directory and may point to different versions/revisions of data in the said directory.

Build prerequisites

CMake

v3.2 or later

Boost

v1.57.0 or later (filesystem, locale, system, thread)

install (*nix)

Note: it is important to pass all arguments to the bootstrap script on a single line:
./bootstrap.sh --with-libraries=filesystem,locale,system,regex,thread
./b2

install (MacOS)

Do not link Boost against 'iconv': on MacOS this causes problems when linking against Boost locale. Unfortunately this requires linking against ICU instead.
./bootstrap.sh --with-libraries=filesystem,locale,system,regex,thread
./b2 -sICU_PATH="${ICU_ROOT}" boost.locale.iconv=off boost.locale.icu=on

install (win32)

bootstrap.bat --with-libraries=filesystem
bootstrap.bat --with-libraries=test
bootstrap.bat --with-libraries=thread
b2 --build-type=complete stage address-model=64

set environment

BOOST_ROOT=<path-to>/boost_1_57_0

Lz4

install (*nix)

make
make install
or point LZ4_ROOT at the source directory to build together with IResearch

install (win32)

If compiling IResearch with /MT, add add_definitions("/MTd") to the end of cmake_unofficial/CMakeLists.txt, since CMake will ignore the command-line argument -DCMAKE_C_FLAGS=/MTd
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=<install-path> -DBUILD_LIBS=on -G "Visual Studio 12" -A x64 ../cmake_unofficial
cmake --build .
cmake --build . --target install
or point LZ4_ROOT at the source directory to build together with IResearch

set environment

LZ4_ROOT=<install-path>

Bison

v2.4 or later. win32 binaries are also available from:
  • https://git-scm.com/download/win
  • http://sourceforge.net/projects/mingw/files
  • http://sourceforge.net/projects/mingwbuilds/files/external-binary-packages

ICU

install (*nix)

./configure --disable-samples --disable-tests --enable-static --srcdir="$(pwd)" --prefix=<install-path> --exec-prefix=<install-path>
make install
or point ICU_ROOT at the source directory to build together with IResearch, or install via the distribution's package manager: libicu

install (win32)

look for link: "ICU4C Binaries"

set environment

ICU_ROOT=<path-to-icu>

Snowball

install (*nix)

Note: the custom CMakeLists.txt was based on revision 5137019d68befd633ce8b1cd48065f41e77ed43e; later versions may be used at your own risk of compilation failure.
git clone https://github.com/snowballstem/snowball.git
cd snowball
git reset --hard 5137019d68befd633ce8b1cd48065f41e77ed43e
mkdir build && cd build
cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -G "Unix Makefiles" ..
cmake --build .
cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -G "Unix Makefiles" ..
cmake --build .
or point SNOWBALL_ROOT at the source directory to build together with IResearch, or install via the distribution's package manager: libstemmer

install (win32)

Note: the custom CMakeLists.txt was based on revision 5137019d68befd633ce8b1cd48065f41e77ed43e; later versions may be used at your own risk of compilation failure.
git clone https://github.com/snowballstem/snowball.git
cd snowball
git reset --hard 5137019d68befd633ce8b1cd48065f41e77ed43e
mkdir build && cd build
set PATH=%PATH%;<path-to>/build/Debug
cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -G "Visual Studio 12" -A x64 ..
cmake --build .
cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -G "Visual Studio 12" -A x64 ..
cmake --build .
or point SNOWBALL_ROOT at the source directory to build together with IResearch
For static builds:
  1. in MSVC open: build/snowball.sln
  2. set: stemmer -> Properties -> Configuration Properties -> C/C++ -> Code Generation -> Runtime Library = /MTd
  3. BUILD -> Build Solution

set environment

SNOWBALL_ROOT=<path-to-snowball>

BFD

install (*nix)

via the distribution's package manager (libbfd), or build from source via:
cd libiberty
env CFLAGS=-fPIC ./configure
make

cd ../zlib
env CFLAGS=-fPIC ./configure
make

cd ../bfd
env LDFLAGS='-L../libiberty -liberty' ./configure --enable-targets=all --enable-shared
make

install (win32)

not yet available for win32

set environment

Note: BINUTILS_ROOT is a "reserved" variable internally used by some of the gcc compiler tools.
BFD_ROOT=<path-to-binutils>

Unwind

install (*nix)

via the distribution's package manager (libunwind), or build from source via:
./configure
make
make install

install (win32)

not yet available for win32

set environment

UNWIND_ROOT=<path-to-unwind>

Google Test

install (*nix)

mkdir build && cd build
cmake ..
make
or point GTEST_ROOT at the source directory to build together with IResearch

install (win32)

mkdir build && cd build
cmake -G "Visual Studio 12" -A x64 -Dgtest_force_shared_crt=ON -DCMAKE_DEBUG_POSTFIX="" ..
cmake --build .
mv Debug ../lib
or point GTEST_ROOT at the source directory to build together with IResearch

set environment

GTEST_ROOT=<path-to-gtest>

Stopword list (for use with analysis::text_analyzer)

download any number of stopword lists, e.g. from:
  • https://github.com/snowballstem/snowball-website/tree/master/algorithms/*/stop.txt
  • https://code.google.com/p/stop-words/

install

  1. create a directory to hold the stopword lists
  2. for each language (e.g. "c", "en", "es", "ru"), create a corresponding subdirectory (a directory name has 2 letters, except the default locale "c" which has 1 letter)
  3. place the files with stopwords (UTF-8 encoded, one word per line; any text after the first whitespace is ignored) in the directory corresponding to their language (multiple files per language are supported and will be interpreted as a single list)
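For example, the resulting layout for the default locale plus English and Russian stopword lists might look like the following (file names here are arbitrary; only the directory names matter):

```text
<path-to-stopword-lists>/
  c/stopwords.txt
  en/stopwords.txt
  en/extra.txt
  ru/stopwords.txt
```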

set environment

IRESEARCH_TEXT_STOPWORD_PATH=<path-to-stopword-lists>
If the variable IRESEARCH_TEXT_STOPWORD_PATH is left unset, the locale-specific stopword-list subdirectories are assumed to be located in the current working directory.

Build

git clone <IResearch code repository>/iresearch.git iresearch
cd iresearch
mkdir build && cd build
generate build files (*nix):
cmake -DCMAKE_BUILD_TYPE=[Debug|Release|Coverage] -G "Unix Makefiles" ..
  1. if some libraries are not found by the build, set the needed environment variables (e.g. BOOST_ROOT, BOOST_LIBRARYDIR, LZ4_ROOT, OPENFST_ROOT, GTEST_ROOT)
  2. if ICU or Snowball from the distribution paths are not found, the following additional environment variables might be required: ICU_ROOT_SUFFIX=x86_64-linux-gnu SNOWBALL_ROOT_SUFFIX=x86_64-linux-gnu
generate build files (win32):
cmake -G "Visual Studio 12" -A x64 ..
If some libraries are not found by the build, set the needed environment variables (e.g. BOOST_ROOT, BOOST_LIBRARYDIR, LZ4_ROOT, OPENFST_ROOT, GTEST_ROOT)
set a Build Identifier for this build (optional):
echo "<build_identifier>" > BUILD_IDENTIFIER
build library:
cmake --build .
test library:
cmake --build . --target iresearch-check
install library:
cmake --build . --target install
code coverage:
cmake --build . --target iresearch-coverage

Included 3rd party dependencies

Code for all included 3rd party dependencies is located in the "external" directory.

MurMurHash

used for fast computation of hashes for byte arrays

OpenFST

used to generate very compact term-dictionary prefix tries which can be loaded into memory even for huge dictionaries

External 3rd party dependencies

External 3rd party dependencies must be made available to the IResearch library separately. They may either be installed through the distribution's package-management system or built from source, with the appropriate environment variables set accordingly.

Boost

v1.57.0 or later (filesystem, locale, system, thread); used for functionality not available in the STL (excluding functionality available in ICU)

Lz4

used for compression/decompression of byte/string data

Bison

v2.4 or later; used for compilation of the IQL (index query language) grammar

ICU

used by analysis::text_analyzer for parsing, transforming and tokenising string data

Snowball

used by analysis::text_analyzer for computing word stems (i.e. roots) for more flexible matching; words from languages not supported by 'snowball' are matched verbatim

Google Test

used for writing tests for the IResearch library

Stopword list

used by analysis::text_analyzer for filtering out noise words that should not impact text ranking, e.g. for 'en' these are usually 'a', 'the', etc. Download any number of stopword lists, e.g. from:
  • https://github.com/snowballstem/snowball-website/tree/master/algorithms/*/stop.txt
  • https://code.google.com/p/stop-words/
or create a custom language-specific list of stopwords. Place the files with stopwords (UTF-8 encoded, one word per line; any text after the first whitespace is ignored) in the directory corresponding to their language (multiple files per language are supported and will be interpreted as a single list).

Query filter building blocks

  • iresearch::by_granular_range: for faster filtering of numeric values within a given range, with the possibility of specifying open/closed ranges
  • iresearch::by_phrase: for word-position-sensitive filtering of values, with the possibility of skipping selected positions
  • iresearch::by_prefix: for filtering of exact value prefixes
  • iresearch::by_range: for filtering of values within a given range, with the possibility of specifying open/closed ranges
  • iresearch::by_same_position: for term-insertion-order-sensitive filtering of exact values
  • iresearch::by_term: for filtering of exact values
  • iresearch::And: boolean conjunction of multiple filters, influencing document ranks/scores as appropriate
  • iresearch::Or: boolean disjunction of multiple filters, influencing document ranks/scores as appropriate (including "minimum match" functionality)
  • iresearch::Not: boolean negation of multiple filters

Index Query Language

The IResearch index may be queried either via query trees built directly using the query building blocks available in the API, or via the IQL query builder, which generates a comparable query from a string representation of the query expressed in the IQL syntax.

API

The IQL parser is defined via a Bison grammar and is accessible via the iresearch::iql::parser and iresearch::iql::parser_context classes. The latter class is intended to be extended to expose at least the following methods as required:
  • query_state current_state() const;
  • query_node const& find_node(parser::semantic_type const& value) const;
The iresearch::iql::parser_context::query_state object provides access to the results of the query parsing as well as any parse errors reported by Bison:
  • nOffset: next offset position to be parsed (size_t)
  • pnFilter: the filter portion (nodeID) of the query, or nullptr if unset (size_t const*)
  • order: the order portion (nodeID, ascending) of the query (std::vector<std::pair> const&)
  • pnLimit: the limit value of the query, or nullptr if unset (size_t const*)
  • pError: the last encountered error, or nullptr if no errors seen (iresearch::iql::query_position const*)

Grammar

The following grammar is currently defined via Bison (the nonterminal names shown in angle brackets are reconstructed from context; the root is <query>):

    <query> ::= <sep>? <union> <sep>? <order> <limit>

    <sep> ::= [[:space:]]+
            | "/*" ... "*/"

    <list> ::= <union>
             | <list> <sep>? "," <sep>? <union>

    <union> ::= <intersection>
              | <union> <sep> "OR" <sep> <intersection>
              | <union> <sep>? "||" <sep>? <intersection>

    <intersection> ::= <expression>
                     | <intersection> <sep> "AND" <sep> <expression>
                     | <intersection> <sep>? "&&" <sep>? <expression>

    <expression> ::= <compare>
                   | <boost>
                   | <negation>
                   | <subexpression>

    <boost> ::= <plain-literal> <sep>? "*" <sep>? <expression>
              | <expression> <sep>? "*" <sep>? <plain-literal>

    <negation> ::= "NOT" <sep>? <expression>
                 | "!" <sep>? <expression>

    <subexpression> ::= "(" <sep>? <union> <sep>? ")"
                      | <name> "(" <sep>? ")"
                      | <name> "(" <sep>? <list> <sep>? ")"

    <compare> ::= <term> <sep>? "~=" <sep>? <term>
                | <term> <sep>? "!=" <sep>? <term>
                | <term> <sep>? "<"  <sep>? <term>
                | <term> <sep>? "<=" <sep>? <term>
                | <term> <sep>? "==" <sep>? <term>
                | <term> <sep>? ">=" <sep>? <term>
                | <term> <sep>? ">"  <sep>? <term>
                | <term> <sep>? "!=" <sep>? <range>
                | <term> <sep>? "==" <sep>? <range>

    <range> ::= "[" <sep>? <term> <sep> <term> <sep>? "]"
              | "[" <sep>? <term> <sep> <term> <sep>? ")"
              | "(" <sep>? <term> <sep> <term> <sep>? ")"
              | "(" <sep>? <term> <sep> <term> <sep>? "]"

    <term> ::= <sequence>
             | <function>

    <function> ::= <name> "(" <sep>? ")"
                 | <name> "(" <sep>? <list> <sep>? ")"

    <sequence> ::= <string>
                 | <sequence> <sep> <string>

    <string> ::= <plain-literal>
               | <dquoted-literal>
               | <squoted-literal>

    <name> ::= <plain-literal>
             | <plain-literal> <name>

    <plain-literal> ::= [^[:space:][:punct:]]+
                      | [[:punct:]][^[:space:][:punct:]]*

    <dquoted-literal> ::= """ [^"]* """
                        | <dquoted-literal> """ [^"]* """

    <squoted-literal> ::= "'" [^']* "'"
                        | <squoted-literal> "'" [^']* "'"

    <limit> ::= ""
              | <sep> "LIMIT" <sep> <term>

    <order> ::= ""
              | <sep> "ORDER" <sep> <order-list>

    <order-list> ::= <order-term>
                   | <order-term> <sep> <order-list>

    <order-term> ::= <term>
                   | <term> <sep> "ASC"
                   | <term> <sep> "DESC"

License

Copyright (c) 2017 ArangoDB GmbH
Copyright (c) 2016-2017 EMC Corporation

This software is provided under the Apache 2.0 Software license provided in the LICENSE.md file. Licensing information for third-party products used by the IResearch search engine can be found in THIRD_PARTY_README.md.

Information

  • 14 Stars
  • 5 Forks
  • 4 Contributors
  • C+
  • Tools and Libraries
  • bm25 / C++ / Library / relevant-search / search-engine / tf-idf
  • From the {code} Blog

    • The Importance of Open Source Communities

      The Importance of Open Source Communities Have you ever wanted to sit down with community leaders from the open source community, and chat with them about what they look for when growing ambassador groups ...
      November 16, 2017
    • Ocopea: Application Copies for Kubernetes and Cloud Foundry

      Introducing DevHigh5 Project Ocopea: Application Copies for Kubernetes and Cloud Foundry By Amit Lieberman and Vijay Tirumalai A new DevHigh5 open source project, Ocopea, (pronounced Oh Copy!), introduces application copies for Kubernetes and Cloud Foundry ...
      November 16, 2017
    • Analysis of the CSI Spec

      The Container Storage Interface (CSI) is making steady progress on mapping out how it will eventually look. If this is the first time you’ve heard about CSI, we would recommend that you read The ...
      November 3, 2017
    More related posts on the {code} Blog

    Branch Status
    master Build Status Build status

    IResearch search engine

    Version 1.0

    Table of contents

    Overview

    The IResearch library is meant to be treated as a standalone index that is capable of both indexing and storing individual values verbatim. Indexed data is treated on a per-version/per-revision basis, i.e. existing data version/revision is never modified and updates/removals are treated as new versions/revisions of the said data. This allows for trivial multi-threaded read/write operations on the index. The index exposes its data processing functionality via a multi-threaded 'writer' interface that treats each document abstraction as a collection of fields to index and/or store. The index exposes its data retrieval functionality via 'reader' interface that returns records from an index matching a specified query. The queries themselves are constructed from either string IQL (index query language) requests or query trees built directly using the query building blocks available in the API. The querying infrastructure provides the capability of ordering the result set by one or more ranking/scoring implementations. The ranking/scoring implementation logic is plugin-based and lazy-initialized during runtime as needed, allowing for addition of custom ranking/scoring logic without the need to even recompile the IResearch library.

    High level architecture and main concepts

    Index

    An index consists of multiple independent parts, called segments and index metadata. Index metadata stores information about active index segments for the particular index version/revision. Each index segment is an index itself and consists of the following logical components:
    • segment metadata
    • field metadata
    • term dictionary
    • postings lists
    • list of deleted documents
    • stored values
    Read/write access to the components carried via plugin-based formats. Index may contain segments created using different formats.

    Document

    A database record is represented as an abstraction called a document. A document is actually a collection of indexed/stored fields. In order to be processed each field should satisfy at least IndexedField or StoredField concept.

    IndexedField concept

    For type T to be IndexedField, the following conditions have to be satisfied for an object m of type T:
    Expression Requires Effects
    m.name() The output type must be convertible to iresearch::string_ref A value uses as a key name.
    m.boost() The output type must be convertible to float_t A value uses as a boost factor for a document.
    m.get_tokens() The output type must be convertible to iresearch::token_stream* A token stream uses for populating in invert procedure. If value is nullptr field is treated as non-indexed.
    m.features() The output type must be convertible to const iresearch::flags&amp; A set of features requested for evaluation during indexing. E.g. it may contain request of processing positions and frequencies. Later the evaluated information can be used during querying.

    StoredField concept

    For type T to be StoredField, the following conditions have to be satisfied for an object m of type T:
    Expression Requires Effects
    m.name() The output type must be convertible to iresearch::string_ref A value uses as a key name.
    m.write(iresearch::data_output&amp; out) The output type must be convertible to bool. One may write arbitrary data to stream denoted by out in order to retrieve written value using index_reader API later. If nothing has written but returned value is true then stored value is treated as flag. If returned value is false then nothing is stored even if something has been written to out stream.

    Directory

    A data storage abstraction that can either store data in memory or on the filesystem depending on which implementation is instantiated. A directory stores at least all the currently in-use index data versions/revisions. For the case where there are no active users of the directory then at least the last data version/revision is stored. Unused data versions/revisions may be removed via the directory_cleaner. A single version/revision of the index is composed of one or more segments associated, and possibly shared, with the said version/revision.

    Writer

    A single instance per-directory object that is used for indexing data. Data may be indexed in a per-document basis or sourced from another reader for trivial directory merge functionality. Each commit() of a writer produces a new version/revision of the view of the data in the corresponding directory. Additionally the interface also provides directory defragmentation capabilities to allow compacting multiple smaller version/revision segments into larger more compact representations. A writer supports two-phase transactions via begin()/commit()/rollback() methods.

    Reader

    A reusable/refreshable view of an index at a given point in time. Multiple readers can use the same directory and may point to different versions/revisions of data in the said directory.

    Build prerequisites

    CMake

    v3.2 or later

    Boost

    v1.57.0 or later (filesystem locale system thread)

    install (*nix)

    It looks like it is important to pass arguments to the bootstrap script in one line
    ./bootstrap.sh --with-libraries=filesystem,locale,system,regex,thread
    ./b2
    

    install (MacOS)

    Do not link Boost against 'iconv' because on MacOS it causes problems when linking against Boost locale. Unfortunately this requires linking against ICU.
    ./bootstrap.sh --with-libraries=filesystem,locale,system,regex,thread
    ./b2 -sICU_PATH="${ICU_ROOT}" boost.locale.iconv=off boost.locale.icu=on
    

    install (win32)

    bootstrap.bat --with-libraries=filesystem
    bootstrap.bat --with-libraries=test
    bootstrap.bat --with-libraries=thread
    b2 --build-type=complete stage address-model=64
    

    set environment

    BOOST_ROOT=<path-to>/boost_1_57_0
    

    Lz4

    install (*nix)

    make
    make install
    
    or point LZ4_ROOT at the source directory to build together with IResearch

    install (win32)

    If compiling IResearch with /MT add add_definitions("/MTd") to the end of cmake_unofficial/CMakeLists.txt since cmake will ignore the command line argument -DCMAKE_C_FLAGS=/MTd
    mkdir build && cd build
    cmake -DCMAKE_INSTALL_PREFIX=<install-path> -DBUILD_LIBS=on -g "Visual studio 12" -Ax64 ../cmake_unofficial
    cmake --build .
    cmake --build . --target install
    
    or point LZ4_ROOT at the source directory to build together with IResearch

    set environment

    LZ4_ROOT=<install-path>
    

    Bison

    v2.4 or later win32 binaries also available in: - https://git-scm.com/download/win - http://sourceforge.net/projects/mingw/files - http://sourceforge.net/projects/mingwbuilds/files/external-binary-packages

    ICU

    install (*nix)

    ./configure --disable-samples --disable-tests --enable-static --srcdir="$(pwd)" --prefix=<install-path> --exec-prefix=<install-path>
    make install
    
    or point ICU_ROOT at the source directory to build together with IResearch or via the distributions' package manager: libicu

    install (win32)

    look for link: "ICU4C Binaries"

    set environment

    ICU_ROOT=<path-to-icu>
    

    Snowball

    install (*nix)

    the custom CMakeLists.txt was based on revision 5137019d68befd633ce8b1cd48065f41e77ed43e later versions may be used at your own risk of compilation failure
    git clone https://github.com/snowballstem/snowball.git
    git reset --hard 5137019d68befd633ce8b1cd48065f41e77ed43e
    mkdir build && cd build
    cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -g "Unix Makefiles" ..
    cmake --build .
    cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -g "Unix Makefiles" ..
    cmake --build .
    
    or point SNOWBALL_ROOT at the source directory to build together with IResearch or via the distributions' package manager: libstemmer

    install (win32)

    the custom CMakeLists.txt was based on revision 5137019d68befd633ce8b1cd48065f41e77ed43e later versions may be used at your own risk of compilation failure
    git clone https://github.com/snowballstem/snowball.git
    git reset --hard 5137019d68befd633ce8b1cd48065f41e77ed43e
    mkdir build && cd build
    set PATH=%PATH%;<path-to>/build/Debug
    cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -g "Visual studio 12" -Ax64 ..
    cmake --build .
    cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -g "Visual studio 12" -Ax64 ..
    cmake --build .
    
    or point SNOWBALL_ROOT at the source directory to build together with IResearch
    For static builds: 1. in MSVC open: build/snowball.sln 2. set: stemmer -> Properties -> Configuration Properties -> C/C++ -> Code Generation -> Runtime Library = /MTd 3. BUILD -> Build Solution

    set environment

    SNOWBALL_ROOT=<path-to-snowball>
    

    BFD

    install (*nix)

    via the distributions' package manager: libbfd or build from source via:
    cd libiberty
    env CFLAGS=-fPIC ./configure
    make
    
    cd ../zlib
    env CFLAGS=-fPIC ./configure
    make
    
    cd ../bfd
    env LDFLAGS='-L../libiberty -liberty' ./configure --enable-targets=all --enable-shared
    make
    

    install (win32)

    not yet available for win32

    set environment

    Note: BINUTILS_ROOT is a "reserved" variable internally used by some of the gcc compiler tools.
    BFD_ROOT=<path-to-binutils>
    

    Unwind

    install (*nix)

    via the distributions' package manager: libunwind or build from source via:
    configure
    make
    make install
    

    install (win32)

    not yet available for win32

    set environment

    UNWIND_ROOT=<path-to-unwind>
    

    Gooogle test

    install (*nix)

    mkdir build && cd build
    cmake ..
    make
    
    or point GTEST_ROOT at the source directory to build together with IResearch

    install (win32)

    mkdir build && cd build
    cmake -g "Visual studio 12" -Ax64 -Dgtest_force_shared_crt=ON -DCMAKE_DEBUG_POSTFIX="" ..
    cmake --build .
    mv Debug ../lib
    
    or point GTEST_ROOT at the source directory to build together with IResearch

    set environment

    GTEST_ROOT=<path-to-gtest>
    

    Stopword list (for use with analysis::text_analyzer)

    download any number of lists of stopwords, e.g. from: https://github.com/snowballstem/snowball-website/tree/master/algorithms/*/stop.txt https://code.google.com/p/stop-words/

    install

    1. mkdir
    2. for each language, (e.g. "c", "en", "es", "ru"), create a corresponding subdirectory (a directory name has 2 letters except the default locale "c" which has 1 letter)
    3. place the files with stopwords, (utf8 encoded with one word per line, any text after the first whitespace is ignored), in the directory corresponding to its language (multiple files per language are supported and will be interpreted as a single list)

    set environment

    IRESEARCH_TEXT_STOPWORD_PATH=<path-to-stopword-lists>
    
    If the variable IRESEARCH_TEXT_STOPWORD_PATH is left unset then locale specific stopword-list subdirectories are deemed to be located in the current working directory

    Build

    git clone <IResearch code repository>/iresearch.git iresearch
    cd iresearch
    mkdir build && cd build
    
    generate build file :
    cmake -DCMAKE_BUILD_TYPE=[Debug|Release|Coverage] -g "Unix Makefiles" ..
    
    1. if some libraries are not found by the build then set the needed environment variables (e.g. BOOST_ROOT, BOOST_LIBRARYDIR, LZ4_ROOT, OPENFST_ROOT, GTEST_ROOT)
    2. if ICU or Snowball from the distribution paths are not found, the following additional environment variables might be required: ICU_ROOT_SUFFIX=x86_64-linux-gnu SNOWBALL_ROOT_SUFFIX=x86_64-linux-gnu
    generate build file (win32):
    cmake -g "Visual studio 12" -Ax64 ..
    
    If some libraries are not found by the build then set the needed environment variables (e.g. BOOST_ROOT, BOOST_LIBRARYDIR, LZ4_ROOT, OPENFST_ROOT, GTEST_ROOT)
    set Build Identifier for this build (optional)
    echo "<build_identifier>" > BUILD_IDENTIFIER
    
    build library:
    cmake --build .
    
    test library:
    cmake --build . --target iresearch-check
    
    install library:
    cmake --build . --target install
    
    code coverage:
    cmake --build . --target iresearch-coverage
    

    Included 3rd party dependencies

    Code for all included 3rd party dependencies is located in the "external" directory.

    MurMurHash

    used for fast computation of hashes for byte arrays

    OpenFST

    used to generate very compact term dictionary prefix tries which can to be loaded in memory even for huge dictionaries

    External 3rd party dependencies

    External 3rd party dependencies must be made available to the IResearch library separately. They may either be installed through the distribution package management system or build from source and the appropriate environment variables set accordingly.

    Boost

    v1.57.0 or later (filesystem locale system thread) used for functionality not available in the STL (excluding functionality available in ICU)

    Lz4

    used for compression/decompression of byte/string data

    Bison

    v2.4 or later used for compilation of the IQL (index query language) grammar

    ICU

    used by analysis::text_analyzer for parsing, transforming and tokenising string data

    Snowball

    used by analysis::text_analyzer for computing word stems (i.e. roots) for more flexible matching; matching of words from languages not supported by Snowball is done verbatim

    Google Test

    used for writing tests for the IResearch library

    Stopword list

    used by analysis::text_analyzer for filtering out noise words that should not impact text ranking, e.g. for 'en' these are usually 'a', 'the', etc. Download any number of stopword lists, e.g. from: https://github.com/snowballstem/snowball-website/tree/master/algorithms/*/stop.txt or https://code.google.com/p/stop-words/ , or create a custom language-specific list of stopwords. Place the stopword files (utf8 encoded, one word per line; any text after the first whitespace is ignored) in the directory corresponding to their language (multiple files per language are supported and are interpreted as a single list).
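    As an illustration, a stopword directory tree for two languages might look as follows (the directory names correspond to language codes; file names are arbitrary):

    ```
    stopwords/
    ├── en/
    │   ├── stop1.txt    (utf8, one word per line)
    │   └── stop2.txt    (merged with stop1.txt into a single 'en' list)
    └── de/
        └── stop.txt
    ```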

    Query filter building blocks

    Filter                          Description
    iresearch::by_granular_range    faster filtering of numeric values within a given range, with the possibility of specifying open/closed ranges
    iresearch::by_phrase            word-position-sensitive filtering of values, with the possibility of skipping selected positions
    iresearch::by_prefix            filtering of exact value prefixes
    iresearch::by_range             filtering of values within a given range, with the possibility of specifying open/closed ranges
    iresearch::by_same_position     term-insertion-order sensitive filtering of exact values
    iresearch::by_term              filtering of exact values
    iresearch::And                  boolean conjunction of multiple filters, influencing document ranks/scores as appropriate
    iresearch::Or                   boolean disjunction of multiple filters, influencing document ranks/scores as appropriate (including "minimum match" functionality)
    iresearch::Not                  boolean negation of multiple filters
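    As a sketch of how these building blocks compose (the field/term setter pattern shown here is illustrative; exact method signatures may differ between library versions):

    ```cpp
    // Sketch: match documents where field 'name' equals "John"
    // OR field 'surname' starts with "Smi".
    iresearch::Or root;
    root.add<iresearch::by_term>().field("name").term("John");
    root.add<iresearch::by_prefix>().field("surname").term("Smi");
    // The composed filter would then be prepared against an index reader
    // and executed to obtain the matching document set.
    ```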

    Index Query Language

    The IResearch index may be queried either via query trees built directly using the query building blocks available in the API, or via the IQL query builder that generates a comparable query from a string representation of the query expressed using the IQL syntax.

    API

    The IQL parser is defined via a Bison grammar and is accessible via the iresearch::iql::parser and iresearch::iql::parser_context classes. The latter class is intended to be extended to expose at least the following methods as required:
    - query_state current_state() const;
    - query_node const& find_node(parser::semantic_type const& value) const;
    The iresearch::iql::parser_context::query_state object provides access to the results of the query parsing as well as any parse errors reported by Bison:
    - nOffset - the next offset position to be parsed (size_t)
    - pnFilter - the filter portion (nodeID) of the query, or nullptr if unset (size_t const*)
    - order - the order portion (nodeID, ascending) of the query (std::vector<std::pair> const&)
    - pnLimit - the limit value of the query, or nullptr if unset (size_t const*)
    - pError - the last encountered error, or nullptr if no errors seen (iresearch::iql::query_position const*)
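    A minimal sketch of the extension described above, which simply surfaces the two protected helpers (purely illustrative; the actual access specifiers and base-class layout may differ):

    ```cpp
    // Illustrative only: expose the parser_context helpers named above.
    class query_context : public iresearch::iql::parser_context {
     public:
      using iresearch::iql::parser_context::current_state; // query_state current_state() const
      using iresearch::iql::parser_context::find_node;     // query_node const& find_node(...) const
    };
    ```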

    Grammar

    The following grammar is currently defined via Bison (the root is <query>; nonterminal names are reconstructed and illustrative):
    
    <query> ::= <sep>? <union> <sep>? <order> <limit>
    
    <sep> ::= [[:space:]]+
            | <sep> "/*" ... "*/"
    
    <comma> ::= <sep>? "," <sep>?
    
    <union> ::= <intersection>
              | <union> <sep> "OR" <sep> <intersection>
              | <union> <sep>? "||" <sep>? <intersection>
    
    <intersection> ::= <expression>
                     | <intersection> <sep> "AND" <sep> <expression>
                     | <intersection> <sep>? "&&" <sep>? <expression>
    
    <expression> ::= <boost>
                   | <negation>
                   | <subexpression>
                   | <compare>
    
    <boost> ::= <number> <sep>? "*" <sep>? <expression>
              | <expression> <sep>? "*" <sep>? <number>
    
    <negation> ::= "NOT" <sep>? <expression>
                 | "!" <sep>? <expression>
    
    <subexpression> ::= "(" <sep>? <union> <sep>? ")"
                      | <function-name> "(" <sep>? ")"
                      | <function-name> "(" <sep>? <arg-list> <sep>? ")"
    
    <compare> ::= <term> <sep>? "~=" <sep>? <term>
                | <term> <sep>? "!=" <sep>? <term>
                | <term> <sep>? "<"  <sep>? <term>
                | <term> <sep>? "<=" <sep>? <term>
                | <term> <sep>? "==" <sep>? <term>
                | <term> <sep>? ">=" <sep>? <term>
                | <term> <sep>? ">"  <sep>? <term>
                | <term> <sep>? "!=" <sep>? <range>
                | <term> <sep>? "==" <sep>? <range>
    
    <range> ::= "[" <sep>? <term> <comma> <term> <sep>? "]"
              | "[" <sep>? <term> <comma> <term> <sep>? ")"
              | "(" <sep>? <term> <comma> <term> <sep>? ")"
              | "(" <sep>? <term> <comma> <term> <sep>? "]"
    
    <term> ::= <literal>
             | <function>
    
    <function> ::= <function-name> "(" <sep>? ")"
                 | <function-name> "(" <sep>? <arg-list> <sep>? ")"
    
    <arg-list> ::= <term>
                 | <arg-list> <comma> <term>
    
    <literal> ::= <plain-literal>
                | <dquoted-literal>
                | <squoted-literal>
    
    <plain-literal> ::= <plain-chars>
                      | <plain-literal> <plain-chars>
    
    <plain-chars> ::= [^[:space:][:punct:]]+
                    | [[:punct:]][^[:space:][:punct:]]*
    
    <dquoted-literal> ::= '"' [^"]* '"'
                        | <dquoted-literal> '"' [^"]* '"'
    
    <squoted-literal> ::= "'" [^']* "'"
                        | <squoted-literal> "'" [^']* "'"
    
    <limit> ::= ""
              | <sep> "LIMIT" <sep> <number>
    
    <order> ::= ""
              | <sep> "ORDER" <sep> <order-list>
    
    <order-list> ::= <order-element>
                   | <order-list> <comma> <order-element>
    
    <order-element> ::= <term>
                      | <term> <sep> "ASC"
                      | <term> <sep> "DESC"
    
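    For illustration, a query conforming to this grammar might look as follows (the field names are hypothetical):

    ```
    name == 'John' && age >= 21 ORDER age DESC LIMIT 10
    ```

    Here the two comparisons are combined via the "&&" intersection, the result is sorted descending on the 'age' field, and at most 10 documents are returned.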

    License

    Copyright (c) 2017 ArangoDB GmbH
    Copyright (c) 2016-2017 EMC Corporation
    
    This software is provided under the Apache 2.0 Software license, as provided in the LICENSE.md file. Licensing information for third-party products used by the IResearch search engine can be found in THIRD_PARTY_README.md.
