dataset
The core feature of Sayt.
- class sayt.dataset.BaseField(name: str)
- classmethod from_dict(dct: dict) → Union[StoredField, IdField, IdListField, KeywordField, TextField, NumericField, DatetimeField, BooleanField, NgramField, NgramWordsField]
Deserialize from dict. Smartly choose the right class.
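The "choose the right class" behaviour can be pictured as a registry lookup. The sketch below is only an illustration of that pattern: the `Sketch*` classes, the `to_dict()` helper, and the `"type"` key are assumptions made for this example and are not taken from sayt, whose actual dict layout is not shown on this page.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Type


@dataclass
class SketchField:
    """Hypothetical stand-in for BaseField; not part of sayt."""

    name: str

    def to_dict(self) -> dict:
        # Store the class name under a "type" key (an assumed convention).
        return {"type": type(self).__name__, **asdict(self)}

    @classmethod
    def from_dict(cls, dct: dict) -> "SketchField":
        # Copy first so the caller's dict is left untouched.
        kwargs = dict(dct)
        field_cls = _REGISTRY[kwargs.pop("type")]
        return field_cls(**kwargs)


@dataclass
class SketchIdField(SketchField):
    stored: bool = False
    unique: bool = False


@dataclass
class SketchNgramField(SketchField):
    stored: bool = False
    minsize: int = 2
    maxsize: int = 4


# Map each class name to its class so from_dict() can dispatch on "type".
_REGISTRY: Dict[str, Type[SketchField]] = {
    klass.__name__: klass
    for klass in (SketchField, SketchIdField, SketchNgramField)
}

field = SketchField.from_dict(
    {"type": "SketchNgramField", "name": "title", "maxsize": 6}
)
print(field)
```

Calling `from_dict()` on the base class returns an instance of the subclass named in the dict, so a serialized field list can be restored without the caller knowing each concrete type.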
- class sayt.dataset.StoredField(name: str)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.STORED
- class sayt.dataset.IdField(name: str, stored: bool = False, unique: bool = False, field_boost: Union[int, float] = 1.0, sortable: bool = False, ascending: bool = True, analyzer: Optional[str] = None)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.ID
- class sayt.dataset.IdListField(name: str, stored: bool = False, unique: bool = False, expression: Optional[str] = None, field_boost: Union[int, float] = 1.0)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.IDLIST
- class sayt.dataset.KeywordField(name: str, stored: bool = False, lowercase: bool = False, commas: bool = False, scorable: bool = False, unique: bool = False, field_boost: Union[int, float] = 1.0, sortable: bool = False, ascending: bool = True, vector: Optional = None, analyzer: Optional = None)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.KEYWORD
- class sayt.dataset.TextField(name: str, stored: bool = False, analyzer: Optional = None, phrase: bool = True, chars: bool = False, field_boost: Union[int, float] = 1.0, multitoken_query: str = 'default', spelling: bool = False, sortable: bool = False, ascending: bool = True, lang: Optional = None, vector: Optional = None, spelling_prefix: str = 'spell_')
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.TEXT
- class sayt.dataset.NumericField(name: str, stored: bool = False, numtype: Union[Type[int], Type[float]] = <class 'int'>, bits: int = 32, unique: bool = False, field_boost: Union[int, float] = 1.0, decimal_places: int = 0, shift_step: int = 4, signed: bool = True, sortable: bool = False, ascending: bool = True, default: Optional[Union[int, float]] = None)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.NUMERIC
- class sayt.dataset.DatetimeField(name: str, stored: bool = False, unique: bool = False, sortable: bool = False, ascending: bool = True)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.DATETIME
- class sayt.dataset.BooleanField(name: str, stored: bool = False, field_boost: Union[int, float] = 1.0)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.BOOLEAN
- class sayt.dataset.NgramField(name: str, stored: bool = False, minsize: int = 2, maxsize: int = 4, field_boost: Union[int, float] = 1.0, queryor: bool = False, phrase: bool = False, sortable: bool = False, ascending: bool = True)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.NGRAM
- class sayt.dataset.NgramWordsField(name: str, stored: bool = False, minsize: int = 2, maxsize: int = 4, field_boost: Union[int, float] = 1.0, queryor: bool = False, tokenizer: Optional = None, at: Optional[str] = None, sortable: bool = False, ascending: bool = True)
Ref: https://whoosh.readthedocs.io/en/latest/api/fields.html#whoosh.fields.NGRAMWORDS
- class sayt.dataset.T_Result
Return type of the DataSet.search() method when simple_response = False. Reference:
- class sayt.dataset.DataSet(dir_index: Path = <sayt.dataset._Nothing object>, index_name: str = <sayt.dataset._Nothing object>, fields: List[Union[StoredField, IdField, IdListField, KeywordField, TextField, NumericField, DatetimeField, BooleanField, NgramField, NgramWordsField]] = <factory>, dir_cache: Optional[Path] = None, cache: Optional[Cache] = None, cache_key: str = <sayt.dataset._Nothing object>, cache_tag: Optional[str] = None, cache_expire: Optional[int] = None, downloader: Callable[[...], Iterable[Dict[str, Any]]] = <function DataSet.<lambda>>, skip_validation: bool = False)
An abstraction of a searchable dataset. It defines:
how you want to index and search your dataset.
how to download your dataset.
You should run DataSet.build_index() to create the index for your dataset; after that you can use DataSet.search() to search your data.
If loading your dataset is time-consuming (for example, it has to be downloaded from the internet), consider RefreshableDataSet, which caches your index and dataset and refreshes them when needed.
- Parameters:
dir_index – the directory to store the index. If it does not exist, it will be created automatically.
index_name – the name of the index. An index is like a table in a database. Different indexes under the same index directory will be stored in different files. Files under the same index will have the same prefix.
fields – define how your dataset will be indexed and searched.
dir_cache – the directory to store the cache. If it does not exist, it will be created automatically. You can either set this and let the program create the diskcache.Cache object for you, or create the diskcache.Cache object yourself and pass it to the cache parameter.
cache – a diskcache.Cache object. If you set this, you should not set the dir_cache parameter.
cache_key – the key used to indicate that the dataset has been successfully downloaded and indexed.
cache_tag – the tag used to clear the data cache and query cache for this dataset.
cache_expire – cache expire time in seconds.
downloader – a callable that pulls the dataset and returns a list of records, where each record is a dict. It is called when the cache has expired or when you force a data refresh.
skip_validation – whether to skip the validation of the dataset. Default is False, which means the dataset will be validated.
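As a rough illustration of how these parameters fit together, the sketch below wires a dataset using only the constructor arguments and methods listed on this page. The record contents, the `"books"` names, the one-hour expiry, and the helper functions `download_books()`, `make_dataset()`, and `demo()` are assumptions made for the example; sayt is imported lazily so the file can be loaded even where sayt is not installed.

```python
from pathlib import Path
from typing import Any, Dict, List


def download_books() -> List[Dict[str, Any]]:
    """Hypothetical downloader: returns the records as a list of dicts."""
    return [
        {
            "id": "c7242d2f47cb4aa2a1eebd75c7e81bbf",
            "title": "More parent message heavy police development how simply.",
            "author": "Margaret Ellis",
            "year": 2003,
        },
        {
            "id": "ff91fd8545c64af59637caa043435f50",
            "title": "Discover police discussion kitchen.",
            "author": "Laura Walters",
            "year": 1994,
        },
    ]


def make_dataset(dir_root: Path):
    """Build a DataSet from the parameters documented above."""
    # Lazy import: only needed when the dataset is actually created.
    from sayt.dataset import (
        DataSet,
        IdField,
        NgramWordsField,
        NumericField,
        TextField,
    )

    return DataSet(
        dir_index=dir_root / "index",
        index_name="books",  # illustrative name
        fields=[
            IdField(name="id", stored=True),
            NgramWordsField(name="title", stored=True, minsize=2, maxsize=6),
            TextField(name="author", stored=True),
            NumericField(name="year", stored=True, sortable=True),
        ],
        dir_cache=dir_root / "cache",  # let sayt create the diskcache.Cache
        cache_key="books",  # illustrative key
        cache_tag="books",  # illustrative tag
        cache_expire=3600,  # seconds; an arbitrary choice for this sketch
        downloader=download_books,
    )


def demo(dir_root: Path):
    """Index the sample records, then search them (requires sayt)."""
    ds = make_dataset(dir_root)
    ds.build_index(data=download_books())
    return ds.search("police", limit=5)
```

Passing `dir_cache` rather than `cache` follows the note above that only one of the two should be set.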
- property schema: Schema
Access the whoosh schema derived from the field settings.
- is_indexing() → bool
Return a boolean indicating whether this dataset is currently being indexed.
If True, other threads working on the same dataset should not be allowed to index it.
- build_index(data: Iterable[Dict[str, Any]], memory_limit: int = 512, multi_thread: bool = True, rebuild: bool = True, raise_lock_error: bool = False) → bool
A wrapper around DataSet._build_index() that also prevents concurrent indexing.
- Parameters:
data – the documents to index, as an iterable of dicts.
memory_limit – maximum memory, in MB, to use for indexing. The default is 512 MB; use a larger number if you have more memory.
multi_thread – use multi-threading to build the index. The default is True.
rebuild – if True, remove the existing index and rebuild it.
raise_lock_error – if True, raise an error when attempting to index a dataset that another thread is already indexing. If False, return silently without doing anything.
- Returns:
a boolean indicating whether the index was actually built.
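The guard described above (skip or raise when another thread is already indexing, and report whether a build happened) can be sketched with a non-blocking lock. This is an in-process illustration only: `GuardedIndexer` and `IndexingInProgressError` are hypothetical names, and this page does not show how sayt itself implements the check.

```python
import threading
from typing import Any, Callable, Dict, Iterable


class IndexingInProgressError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


class GuardedIndexer:
    """Wrap a build function so that only one thread can run it at a time."""

    def __init__(self, build: Callable[[Iterable[Dict[str, Any]]], None]):
        self._build = build
        self._lock = threading.Lock()

    def is_indexing(self) -> bool:
        # The lock is held exactly while a build is running.
        return self._lock.locked()

    def build_index(
        self,
        data: Iterable[Dict[str, Any]],
        raise_lock_error: bool = False,
    ) -> bool:
        # Non-blocking acquire: fail fast instead of waiting for the other thread.
        if not self._lock.acquire(blocking=False):
            if raise_lock_error:
                raise IndexingInProgressError("another thread is already indexing")
            return False
        try:
            self._build(data)
            return True
        finally:
            # Always release, even if the build raised.
            self._lock.release()
```

Returning `False` instead of blocking lets a caller carry on with the existing index while another thread finishes the rebuild.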
- search(query: Union[str, Query], limit: int = 20, simple_response: bool = True, refresh_data: bool = False, verbose: bool = False) → Union[List[dict], T_Result]
Run full-text search. For details about the query language, check this link.
Since 0.3.1, you can set simple_response to False to get a result styled like an Elasticsearch HTTP response. For example:

    {
        'index': '3dd28d068ad007367ac7816d7752d382',
        'took': 5,  # milliseconds
        'size': 4,
        'cache': False,
        'hits': [
            {
                '_id': 470,
                '_score': -2147485651,
                '_source': {
                    'id': 'c7242d2f47cb4aa2a1eebd75c7e81bbf',
                    'title': 'More parent message heavy police development how simply.',
                    'author': 'Margaret Ellis',
                    'year': 2003
                }
            },
            {
                '_id': 456,
                '_score': -2147485642,
                '_source': {
                    'id': 'ff91fd8545c64af59637caa043435f50',
                    'author': 'Laura Walters',
                    'title': 'Discover police discussion kitchen.',
                    'year': 1994
                }
            },
            ...
        ]
    }

- Parameters:
query – if it is a string, it is parsed with MultifieldParser; if it is a Query object, it is used directly.
limit – the maximum number of results to return.
simple_response – if True, return a list of dict objects; otherwise return a Result object similar to an Elasticsearch HTTP response.
refresh_data – if True, force a fresh download of the data and refresh the index and cache.
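When simple_response is False, the documents sit under each hit's `_source` key. The following sketch shows one way to unpack such a result; the helper names `hit_sources()` and `ranked_titles()` are made up for this example, and the sample dict is copied from the example in this section, trimmed to the two hits it spells out.

```python
from typing import Any, Dict, List, Tuple


def hit_sources(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the stored documents, dropping the per-hit metadata."""
    return [hit["_source"] for hit in result["hits"]]


def ranked_titles(result: Dict[str, Any]) -> List[Tuple[int, str]]:
    """Return (_id, title) pairs in the order the hits were returned."""
    return [(hit["_id"], hit["_source"]["title"]) for hit in result["hits"]]


# Sample values taken from the documented example; the hit list is trimmed.
sample_result = {
    "index": "3dd28d068ad007367ac7816d7752d382",
    "took": 5,
    "size": 4,
    "cache": False,
    "hits": [
        {
            "_id": 470,
            "_score": -2147485651,
            "_source": {
                "id": "c7242d2f47cb4aa2a1eebd75c7e81bbf",
                "title": "More parent message heavy police development how simply.",
                "author": "Margaret Ellis",
                "year": 2003,
            },
        },
        {
            "_id": 456,
            "_score": -2147485642,
            "_source": {
                "id": "ff91fd8545c64af59637caa043435f50",
                "author": "Laura Walters",
                "title": "Discover police discussion kitchen.",
                "year": 1994,
            },
        },
    ],
}

print(ranked_titles(sample_result))
```

Keeping the full result around is useful when the `took` and `cache` fields matter, for example when logging query latency or cache hits.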