dataset

The core feature of Sayt.

class sayt.dataset.BaseField(name: str)
to_dict() -> dict

Serialize to dict.

classmethod from_dict(dct: dict) -> Union[StoredField, IdField, IdListField, KeywordField, TextField, NumericField, DatetimeField, BooleanField, NgramField, NgramWordsField]

Deserialize from a dict, automatically choosing the matching field class.
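
The round trip described by to_dict() and from_dict() can be pictured with a minimal sketch. It does not use sayt: the Sketch* classes, the "type" key, and the _REGISTRY lookup are all hypothetical, chosen only to show one common way a base class can pick the right subclass while deserializing. The way sayt actually tags its dicts may differ.

```python
from dataclasses import dataclass, asdict


@dataclass
class SketchField:
    """Hypothetical stand-in for a field base class (not sayt's BaseField)."""

    name: str

    def to_dict(self) -> dict:
        # Store the class name next to the attributes so that the reader
        # knows which subclass to rebuild later.
        data = asdict(self)
        data["type"] = type(self).__name__
        return data

    @classmethod
    def from_dict(cls, dct: dict) -> "SketchField":
        # Look up the subclass by its stored tag, then pass the remaining
        # items as keyword arguments.
        kwargs = dict(dct)
        type_name = kwargs.pop("type")
        return _REGISTRY[type_name](**kwargs)


@dataclass
class SketchIdField(SketchField):
    stored: bool = False
    unique: bool = False


@dataclass
class SketchNgramField(SketchField):
    stored: bool = False
    minsize: int = 2
    maxsize: int = 4


# Assumed registry: maps the tag written by to_dict() back to a class.
_REGISTRY = {klass.__name__: klass for klass in (SketchIdField, SketchNgramField)}

original = SketchNgramField(name="title", stored=True, minsize=2, maxsize=6)
restored = SketchField.from_dict(original.to_dict())
print(type(restored).__name__, restored == original)
```

Calling from_dict() on the base class is enough here because the tag, not the caller, decides which class is instantiated.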

class sayt.dataset.StoredField(name: str)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.STORED

class sayt.dataset.IdField(name: str, stored: bool = False, unique: bool = False, field_boost: Union[int, float] = 1.0, sortable: bool = False, ascending: bool = True, analyzer: Optional[str] = None)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.ID

class sayt.dataset.IdListField(name: str, stored: bool = False, unique: bool = False, expression: Optional[str] = None, field_boost: Union[int, float] = 1.0)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.IDLIST

class sayt.dataset.KeywordField(name: str, stored: bool = False, lowercase: bool = False, commas: bool = False, scorable: bool = False, unique: bool = False, field_boost: Union[int, float] = 1.0, sortable: bool = False, ascending: bool = True, vector: Optional = None, analyzer: Optional = None)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.KEYWORD

class sayt.dataset.TextField(name: str, stored: bool = False, analyzer: Optional = None, phrase: bool = True, chars: bool = False, field_boost: Union[int, float] = 1.0, multitoken_query: str = 'default', spelling: bool = False, sortable: bool = False, ascending: bool = True, lang: Optional = None, vector: Optional = None, spelling_prefix: str = 'spell_')

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.TEXT

class sayt.dataset.NumericField(name: str, stored: bool = False, numtype: Union[Type[int], Type[float]] = int, bits: int = 32, unique: bool = False, field_boost: Union[int, float] = 1.0, decimal_places: int = 0, shift_step: int = 4, signed: bool = True, sortable: bool = False, ascending: bool = True, default: Optional[Union[int, float]] = None)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.NUMERIC

numtype

alias of int

class sayt.dataset.DatetimeField(name: str, stored: bool = False, unique: bool = False, sortable: bool = False, ascending: bool = True)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.DATETIME

class sayt.dataset.BooleanField(name: str, stored: bool = False, field_boost: Union[int, float] = 1.0)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.BOOLEAN

class sayt.dataset.NgramField(name: str, stored: bool = False, minsize: int = 2, maxsize: int = 4, field_boost: Union[int, float] = 1.0, queryor: bool = False, phrase: bool = False, sortable: bool = False, ascending: bool = True)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.NGRAM

class sayt.dataset.NgramWordsField(name: str, stored: bool = False, minsize: int = 2, maxsize: int = 4, field_boost: Union[int, float] = 1.0, queryor: bool = False, tokenizer: Optional = None, at: Optional[str] = None, sortable: bool = False, ascending: bool = True)

Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.NGRAMWORDS
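
To make the minsize and maxsize parameters of the two n-gram fields concrete, here is a small plain-Python sketch. It does not call Whoosh; char_ngrams and word_ngrams are hypothetical helpers that only illustrate the idea of cutting text into substrings between minsize and maxsize characters long, first over the whole text and then word by word. Whoosh's own tokenizers may normalize or order the grams differently.

```python
from typing import List


def char_ngrams(text: str, minsize: int = 2, maxsize: int = 4) -> List[str]:
    """Return every substring of text whose length is within [minsize, maxsize]."""
    grams = []
    for start in range(len(text)):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this start would also run past the end
            grams.append(text[start:end])
    return grams


def word_ngrams(text: str, minsize: int = 2, maxsize: int = 4) -> List[str]:
    """Split on whitespace first, then cut each word into n-grams."""
    grams = []
    for word in text.split():
        grams.extend(char_ngrams(word, minsize, maxsize))
    return grams


print(char_ngrams("sayt"))
# ['sa', 'say', 'sayt', 'ay', 'ayt', 'yt']
print(word_ngrams("go fast", minsize=2, maxsize=3))
# ['go', 'fa', 'fas', 'as', 'ast', 'st']
```

A larger maxsize produces more grams per document, which generally means a bigger index in exchange for matching longer fragments of what the user types.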

class sayt.dataset.T_Hit

Represent a hit in the search result.

class sayt.dataset.T_Result

Return type of the DataSet.search() method when simple_response = False.

Reference:

class sayt.dataset.DataSet(dir_index: Path = <sayt.dataset._Nothing object>, index_name: str = <sayt.dataset._Nothing object>, fields: List[Union[StoredField, IdField, IdListField, KeywordField, TextField, NumericField, DatetimeField, BooleanField, NgramField, NgramWordsField]] = <factory>, dir_cache: Optional[Path] = None, cache: Optional[Cache] = None, cache_key: str = <sayt.dataset._Nothing object>, cache_tag: Optional[str] = None, cache_expire: Optional[int] = None, downloader: Callable[[...], Iterable[Dict[str, Any]]] = <function DataSet.<lambda>>, skip_validation: bool = False)

An abstraction of a searchable dataset. It defines:

  • how you want to index and search your dataset.

  • how to download your dataset.

Run DataSet.build_index() to create the index for your dataset; after that you can use DataSet.search() to search your data.

If loading your dataset is time-consuming, for example because you have to download it from the internet, consider RefreshableDataSet to cache your index and dataset and refresh them when needed.

Parameters:
  • dir_index – the directory to store the index. If it does not exist, it will be created automatically.

  • index_name – the name of the index. An index is like a table in a database. Different indexes under the same index directory will be stored in different files. Files under the same index will have the same prefix.

  • fields – define how your dataset will be indexed and searched.

  • dir_cache – the directory to store the cache. If it does not exist, it will be created automatically. You can either set this and let the program create the diskcache.Cache object for you, or you can explicitly create the diskcache.Cache object and pass it to the cache parameter.

  • cache – a diskcache.Cache object. If you set this, do not set the dir_cache parameter.

  • cache_key – the key used to indicate that the dataset is successfully downloaded and indexed.

  • cache_tag – the tag used to clear the data cache and query cache for this dataset.

  • cache_expire – cache expire time in seconds.

  • downloader – a callable that pulls the dataset and returns an iterable of records, where each record is a dict. It is called when the cache has expired or when you force a data refresh.

  • skip_validation – whether to skip the validation of the dataset. Default is False, which means the dataset will be validated.
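
As an illustration of the downloader parameter, the sketch below shows a callable with the documented shape: no required arguments, returning an iterable of dict records. Everything in it is an assumption made for the example: the inline RAW_CSV text stands in for data fetched over the network, and the id, title, author and year columns are borrowed from the sample search result further down this page.

```python
import csv
import io
from typing import Any, Dict, Iterable

# Hypothetical inline data standing in for a file fetched from the network.
RAW_CSV = """id,title,author,year
b1,Discover police discussion kitchen.,Laura Walters,1994
b2,More parent message heavy police development how simply.,Margaret Ellis,2003
"""


def downloader() -> Iterable[Dict[str, Any]]:
    """Return one dict per record, keyed by field name."""
    reader = csv.DictReader(io.StringIO(RAW_CSV))
    records = []
    for row in reader:
        # CSV values arrive as strings; convert the column meant to be numeric.
        row["year"] = int(row["year"])
        records.append(row)
    return records


records = downloader()
print(len(records), records[0]["author"])
# 2 Laura Walters

# With sayt installed, the callable itself (not its result) would be passed
# alongside the other parameters: DataSet(..., downloader=downloader)
```

Keeping the download logic in a plain function like this makes it easy to test on its own before wiring it into a DataSet.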

property schema: Schema

The whoosh schema built from the configured fields.

remove_index()

Remove the whoosh index for this dataset.

remove_all_index()

Remove all whoosh indexes in the index directory.

is_indexing() -> bool

Return True if this dataset is currently being indexed.

If True, other threads working on the same dataset should not be allowed to index it.

build_index(data: Iterable[Dict[str, Any]], memory_limit: int = 512, multi_thread: bool = True, rebuild: bool = True, raise_lock_error: bool = False) -> bool

A wrapper around DataSet._build_index() that also prevents concurrent indexing.

Parameters:
  • data – an iterable of documents, each one a dictionary.

  • memory_limit – maximum memory, in MB, to use for indexing. Default is 512; use a larger number if you have more memory.

  • multi_thread – whether to use multi-threading to build the index. Default is True.

  • rebuild – if True, remove the existing index and rebuild it.

  • raise_lock_error – if True, raise an error when attempting to index a dataset that another thread is already indexing. If False, return silently without doing anything.

Returns:

a boolean indicating whether the index was actually built.

remove_cache()

Remove the cache for this dataset.

remove_all_cache()

Remove all cache in the cache directory.

search(query: Union[str, Query], limit: int = 20, simple_response: bool = True, refresh_data: bool = False, verbose: bool = False) -> Union[List[dict], T_Result]

Run full-text search. For details about the query language, check this link.

Since 0.3.1, you can set simple_response to False to get a result styled like an Elasticsearch HTTP response. For example:

{
    'index': '3dd28d068ad007367ac7816d7752d382',
    'took': 5, # milliseconds
    'size': 4,
    'cache': False,
    'hits': [
        {
            '_id': 470,
            '_score': -2147485651,
            '_source': {
                'id': 'c7242d2f47cb4aa2a1eebd75c7e81bbf',
                'title': 'More parent message heavy police development how simply.',
                'author': 'Margaret Ellis',
                'year': 2003
            }
        },
        {
            '_id': 456,
            '_score': -2147485642,
            '_source': {
                'id': 'ff91fd8545c64af59637caa043435f50',
                'author': 'Laura Walters',
                'title': 'Discover police discussion kitchen.',
                'year': 1994
            }
        },
        ...
    ]
}
Parameters:
  • query – if a string, it is parsed with MultifieldParser; if a Query object, it is used as-is.

  • limit – the maximum number of results to return.

  • simple_response – if True, return a list of dict objects; otherwise return a Result object styled like an Elasticsearch HTTP response.

  • refresh_data – if True, force a re-download of the data and refresh the index and cache.
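
When simple_response is False, the caller has to unpack the response itself. The sketch below works on a trimmed copy of the sample response shown above and needs no sayt installation. The helper name sources is hypothetical, and treating its output as comparable to the simple list of dicts is an assumption, not something this page states.

```python
from typing import Any, Dict, List

# Trimmed copy of the sample response from this page (two hits kept).
result: Dict[str, Any] = {
    "index": "3dd28d068ad007367ac7816d7752d382",
    "took": 5,
    "size": 4,
    "cache": False,
    "hits": [
        {
            "_id": 470,
            "_score": -2147485651,
            "_source": {
                "id": "c7242d2f47cb4aa2a1eebd75c7e81bbf",
                "title": "More parent message heavy police development how simply.",
                "author": "Margaret Ellis",
                "year": 2003,
            },
        },
        {
            "_id": 456,
            "_score": -2147485642,
            "_source": {
                "id": "ff91fd8545c64af59637caa043435f50",
                "author": "Laura Walters",
                "title": "Discover police discussion kitchen.",
                "year": 1994,
            },
        },
    ],
}


def sources(response: Dict[str, Any]) -> List[dict]:
    """Pull the stored document out of every hit, keeping the hit order."""
    return [hit["_source"] for hit in response["hits"]]


docs = sources(result)
print([doc["author"] for doc in docs])
# ['Margaret Ellis', 'Laura Walters']
```

The envelope fields such as took and cache stay available on the full response for logging or diagnostics, while the extracted documents can be handed straight to display code.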