Searching Relational Data with Elasticsearch
Dr. Renaud Delbru
CTO, SIREn Solutions
● CTO, SIREn Solutions
– Search, Big Data, Knowledge Graph
● Lucene / Solr Contributor
– E.g., Cross Data Center Replication
– Lucene Revolution 2013, 2014
– Lucene In Action, 2nd Edition
● Author of the SIREn plugin
Introducing myself
● Open source search systems
– Lucene, Solr, Elasticsearch
● Document-based model
– Flat key-value model
– Originally developed for searching full-text documents
Background
firstname : John
lastname : Smith
title : Mr Dr
Background
● Data is usually more complex
– Nested objects
● XML, JSON
● E.g., US patents
– Relations
● RDBMS, RDF, Graph, Documents with links to entities or other documents
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address" : {
    "street" : "21 2nd Street",
    "city" : "New York",
    "state" : "NY"
  },
  "phoneNumber" : [
    { "type" : "home", "number" : "212 555-1234" },
    { "type" : "fax", "number" : "646 555-4567" }
  ]
}
[Figure: related entities Article, Person, Company]
Crunchbase example
[Figure: Elastic; funding rounds Series A and Series B; investors Data Collective, Benchmark, Index Ventures]
name : Elastic
funding_rounds.round_code : A
funding_rounds.founded_year : 2012
funding_rounds.round_code : B
funding_rounds.founded_year : 2013
funding_rounds.investments.name : Benchmark
funding_rounds.investments.name : Data Collective
funding_rounds.investments.name : Index Ventures
● Pros:
– Relatively easy
– Fast
● Cons:
– Loss of precision, false positives
– Index-time data materialisation
– Data duplication (child)
– Not optimal for updates
Common solutions
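The loss of precision listed above can be reproduced in a few lines. This is a minimal sketch in plain Python, not Elasticsearch code: the `flatten` and `matches` helpers are hypothetical, and the assignment of investors to rounds is an assumption consistent with the Crunchbase example used in this deck.

```python
# Illustrative sketch: flattening a nested record loses the link between
# a funding round and its investors. Plain Python, not Elasticsearch code.

company = {
    "name": "Elastic",
    "funding_rounds": [
        {"round_code": "A", "founded_year": 2012,
         "investments": [{"name": "Benchmark"}, {"name": "Data Collective"}]},
        {"round_code": "B", "founded_year": 2013,
         "investments": [{"name": "Benchmark"}, {"name": "Index Ventures"}]},
    ],
}


def flatten(obj, prefix=""):
    """Collapse a nested object into {dotted.path: set of values}."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, set()).update(sub_values)
            else:
                flat.setdefault(path, set()).add(item)
    return flat


def matches(flat_doc, conditions):
    """True if every (path, value) condition is satisfied by the flat document."""
    return all(value in flat_doc.get(path, set())
               for path, value in conditions.items())


flat = flatten(company)

# Index Ventures only took part in round B, yet the flat document still
# matches a query asking for "round A with investor Index Ventures".
false_positive = matches(flat, {
    "funding_rounds.round_code": "A",
    "funding_rounds.investments.name": "Index Ventures",
})
print(false_positive)  # True
```

Because all values of a path end up in one bag, the query cannot tell which investor belongs to which round.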
name : Elastic
f_r.round_code : A
f_r.founded_year : 2012
f_r.inv.name : Benchmark

name : Elastic
f_r.round_code : A
f_r.founded_year : 2012
f_r.inv.name : Data Collective

name : Elastic
f_r.round_code : B
f_r.founded_year : 2013
f_r.inv.name : Benchmark

name : Elastic
f_r.round_code : B
f_r.founded_year : 2013
f_r.inv.name : Index Ventures
● Pros:
– Relatively easy
– No loss of precision
● Cons:
– Index-time data materialisation
– Combinatorial explosion
– Duplicate results: query-time grouping is necessary
– Data duplication (parent and child)
– Not optimal for updates
Common solutions
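The trade-offs of this second approach can be sketched as follows, again in plain Python rather than Elasticsearch code. The `denormalise` and `search` helpers are hypothetical names introduced for illustration. One flat document is emitted per (round, investor) path, which keeps precision but repeats the parent fields and yields duplicate hits that must be grouped at query time.

```python
# Illustrative sketch: one flat document per root-to-leaf path.

company = {
    "name": "Elastic",
    "funding_rounds": [
        {"round_code": "A", "founded_year": 2012,
         "investments": [{"name": "Benchmark"}, {"name": "Data Collective"}]},
        {"round_code": "B", "founded_year": 2013,
         "investments": [{"name": "Benchmark"}, {"name": "Index Ventures"}]},
    ],
}


def denormalise(record):
    """Emit one flat document per (funding round, investment) combination."""
    docs = []
    for funding_round in record["funding_rounds"]:
        for investment in funding_round["investments"]:
            docs.append({
                "name": record["name"],  # parent data repeated in every document
                "f_r.round_code": funding_round["round_code"],
                "f_r.founded_year": funding_round["founded_year"],
                "f_r.inv.name": investment["name"],
            })
    return docs


def search(docs, conditions):
    """Return the documents whose fields equal every requested value."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in conditions.items())]


docs = denormalise(company)

# The same company is returned twice, so results are grouped by name.
benchmark_hits = search(docs, {"f_r.inv.name": "Benchmark"})
companies = sorted({d["name"] for d in benchmark_hits})

# Precision is preserved: no document pairs round A with Index Ventures.
wrong_pair = search(docs, {"f_r.round_code": "A", "f_r.inv.name": "Index Ventures"})

print(len(docs), len(benchmark_hits), companies, wrong_pair)  # 4 2 ['Elastic'] []
```

The document count is the product of the fan-out at each nesting level, which is where the combinatorial explosion comes from.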
● Lucene's BlockJoin
– Feature to provide relational search
– “Nested” type in Elasticsearch
● Model
– One (flat) document per record
– Joins computed at index time
– Related documents are indexed in the same “block”
{
"company": {
"properties" : {
"funding_rounds" : {
"type" : "nested",
"properties" : {
"investments" : {
"type" : "nested"
} } } } } }
Index-time join
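With a nested mapping like the one above, a search has to say which nested level each condition applies to. The sketch below builds a request body using Elasticsearch's `nested` query (with its `path` and `query` keys). The function name and the choice of `match` clauses are assumptions made for illustration, and the body is only constructed and printed here, not sent to a cluster.

```python
import json


def nested_round_query(round_code, investor_name):
    """Build a search body: companies with a funding round that has the given
    round code AND, within that same round, an investment with the given name."""
    return {
        "query": {
            "nested": {
                "path": "funding_rounds",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"funding_rounds.round_code": round_code}},
                            {
                                # Second nesting level: investments inside the round.
                                "nested": {
                                    "path": "funding_rounds.investments",
                                    "query": {
                                        "match": {
                                            "funding_rounds.investments.name": investor_name
                                        }
                                    },
                                }
                            },
                        ]
                    }
                },
            }
        }
    }


body = nested_round_query("A", "Benchmark")
print(json.dumps(body, indent=2))
```

Both conditions sit inside the same `funding_rounds` scope, so they must hold for one and the same round, which is what avoids the false positives of the flattened model.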
Index-time join
● Pros:
– Fast (join precomputed, data locality)
– No loss of precision
● Cons:
– Index-time data materialisation
– Data duplication (child)
– Not optimal for updates
– High memory usage for complex nested model
Document Block
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
Name : Data Collective
Type : Org
Name : Benchmark
Type : Org
round_code : B
founded_year : 2013
...
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
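The idea behind a document block can be sketched in plain Python. This is a simplified two-level model and not Lucene's implementation: children are stored immediately before their parent, and the parent of a matching child is the first parent position at or after it. The second company (`ExampleCo`) and its round are hypothetical data added so that there is more than one block.

```python
from bisect import bisect_left

index = []       # position in this list plays the role of the document ID
parent_ids = []  # sorted IDs of parent documents (stand-in for a parent bitset)


def add_block(children, parent):
    """Index the children first, then the parent, as one contiguous block."""
    index.extend(children)
    index.append(parent)
    parent_ids.append(len(index) - 1)


add_block(
    [{"round_code": "A", "founded_year": 2012},
     {"round_code": "B", "founded_year": 2013}],
    {"name": "Elastic"},
)
add_block(
    [{"round_code": "A", "founded_year": 2014}],  # hypothetical round
    {"name": "ExampleCo"},                        # hypothetical company
)


def parents_with_child(predicate):
    """Return names of parents that have at least one child matching predicate."""
    parent_set = set(parent_ids)
    found = []
    for doc_id, doc in enumerate(index):
        if doc_id in parent_set or not predicate(doc):
            continue
        # The join is a lookup of the next parent position: no ID comparison needed.
        parent_id = parent_ids[bisect_left(parent_ids, doc_id)]
        if parent_id not in found:
            found.append(parent_id)
    return [index[p]["name"] for p in found]


round_b = parents_with_child(lambda d: d["round_code"] == "B")
round_a = parents_with_child(lambda d: d["round_code"] == "A")
print(round_b, round_a)  # ['Elastic'] ['Elastic', 'ExampleCo']
```

The layout also hints at why updates are costly: changing one child means rewriting the whole block so that the positions stay contiguous.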
Index-time join
● SIREn Plugin
– Plugin to Lucene, Solr, Elasticsearch
– Adds a native index for nested data types
– http://siren.solutions/siren/overview/
● Model
– One document per “tree”
– Joins computed at index time
– Rich data model (JSON)
● Nested objects, nested arrays, multi-valued attributes, datatypes
{
"company": {
"properties" : {
"_siren_source" : {
"analyzer" : "concise",
"postings_format" : "Siren10AFor",
"store" : "no",
"type" : "string"
} } } }
Index-time join
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
round_code : B
founded_year : 2013
...
Name : Data Collective
Type : Org
Name : Benchmark
Type : Org
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
● Pros:
– Fast (join precomputed, data locality)
– No loss of precision
– Low memory usage, even for complex nested model
● Cons:
– Index-time data materialisation
– Data duplication (child)
– Not optimal for updates
[Figure: tree with node labels 1, 1.1, 1.2, 1.1.1, 1.1.2, 1.2.1, 1.2.2]
Index-time join
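Hierarchical node labels such as 1, 1.1 and 1.1.1 encode each node's position in the tree, so structural relations can be checked on the labels alone. The sketch below assumes such a labelling scheme for the Crunchbase example; it is an illustration in plain Python and not SIREn's actual encoding, and the node names "round A" and "round B" are placeholders.

```python
tree = {
    "name": "Elastic",
    "children": [
        {"name": "round A",
         "children": [{"name": "Data Collective"}, {"name": "Benchmark"}]},
        {"name": "round B",
         "children": [{"name": "Index Ventures"}, {"name": "Benchmark"}]},
    ],
}


def label_tree(node, label="1"):
    """Yield (label, name) pairs, depth first; a child label extends its parent's."""
    yield label, node["name"]
    for position, child in enumerate(node.get("children", []), start=1):
        yield from label_tree(child, f"{label}.{position}")


def parent_of(label):
    """Label of the parent node ('' for the root)."""
    return label.rpartition(".")[0]


labels = list(label_tree(tree))

# Find the investors attached to round A using only the labels.
round_a = next(label for label, name in labels if name == "round A")
investors_in_round_a = [name for label, name in labels if parent_of(label) == round_a]

print(sorted(label for label, _ in labels))
print(investors_in_round_a)  # ['Data Collective', 'Benchmark']
```

Since the whole tree lives in one document, no separate document per nested object is needed, which is in line with the lower memory use claimed for complex nested models.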
More information on our blog post
Query-time join
● Elasticsearch's Parent-Child
– Query-time join for nested data
● Model
– One (flat) document per record
– At index time, child documents should specify their parent ID with the _parent field
– Joins computed at query time
{
  "company": {},
  "investment" : {
    "_parent" : {
      "type" : "company"
    }
  },
  "investor" : {
    "_parent" : {
      "type" : "investment"
    }
  }
}
Query-time join
● Pros:
– Update friendly
– No loss of precision
– Data locality: parent and child on same shard
● Cons:
– Slower than index-time solutions
– Larger memory use than nested
– Data duplication (child)
● A child cannot have more than one parent
– Index-time data materialisation
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
Name : Data Collective
Type : Org
Name : Benchmark
Type : Org
round_code : B
founded_year : 2013
...
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
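The parent-child model can be sketched in plain Python as follows. This is a simulation of the idea, not Elasticsearch's implementation: every record is its own flat document, children carry a `_parent` ID, and the join is resolved when the query runs. The document IDs (`c1`, `r1`, `i1`, ...) and the `has_child` helper are hypothetical.

```python
documents = [
    {"_id": "c1", "_type": "company", "name": "Elastic"},
    {"_id": "r1", "_type": "investment", "_parent": "c1", "round_code": "A"},
    {"_id": "r2", "_type": "investment", "_parent": "c1", "round_code": "B"},
    {"_id": "i1", "_type": "investor", "_parent": "r1", "name": "Data Collective"},
    {"_id": "i2", "_type": "investor", "_parent": "r1", "name": "Benchmark"},
    {"_id": "i3", "_type": "investor", "_parent": "r2", "name": "Index Ventures"},
    # Benchmark is stored a second time: a child can have only one parent.
    {"_id": "i4", "_type": "investor", "_parent": "r2", "name": "Benchmark"},
]


def has_child(parent_type, child_type, child_predicate):
    """Return parents of parent_type with at least one matching child."""
    # Step 1 (query time): collect the parent IDs of the matching children.
    parent_ids = {d["_parent"] for d in documents
                  if d["_type"] == child_type and child_predicate(d)}
    # Step 2: keep the parents whose ID was collected.
    return [d for d in documents
            if d["_type"] == parent_type and d["_id"] in parent_ids]


# Rounds in which Index Ventures invested.
rounds = has_child("investment", "investor", lambda d: d["name"] == "Index Ventures")
round_ids = {d["_id"] for d in rounds}

# Companies that have one of those rounds (a second join, one level up).
companies = has_child("company", "investment", lambda d: d["_id"] in round_ids)

print([d["round_code"] for d in rounds], [d["name"] for d in companies])  # ['B'] ['Elastic']
```

Adding an investor only appends one child document and leaves the parents untouched, which is the update-friendly property; the price is the ID lookup on every query.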
Query-time join
● FilterJoin Plugin
– Query-time join for relational data
● Inspired by #3278
● Model
– One (flat) document per record
– At index time, documents should specify the IDs of their related documents in a given field
– At query time, lookup ID values from a given field to filter documents from another index
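The two query-time steps described in this model can be sketched in plain Python. This is an illustration of the idea and not the plugin's API: the `filter_join` helper, the two in-memory "indices" and the ID values are hypothetical.

```python
# Index 1: one document per investor, stored once.
investors = [
    {"id": "data-collective", "name": "Data Collective", "type": "Org"},
    {"id": "benchmark", "name": "Benchmark", "type": "Org"},
    {"id": "index-ventures", "name": "Index Ventures", "type": "Org"},
]

# Index 2: one document per funding round, referencing investors by ID.
funding_rounds = [
    {"company": "Elastic", "round_code": "A",
     "investor_ids": ["data-collective", "benchmark"]},
    {"company": "Elastic", "round_code": "B",
     "investor_ids": ["index-ventures", "benchmark"]},
]


def as_list(value):
    """Treat single values and multi-valued fields uniformly."""
    return value if isinstance(value, list) else [value]


def filter_join(source_index, source_predicate, source_field,
                target_index, target_field):
    """Collect ID values from matching source docs, then filter the target index."""
    # Step 1: look up the ID values in the source index.
    ids = set()
    for doc in source_index:
        if source_predicate(doc):
            ids.update(as_list(doc[source_field]))
    # Step 2: keep target docs whose field holds one of those IDs. In a
    # distributed setting, the collected IDs would travel over the network.
    return [doc for doc in target_index
            if ids.intersection(as_list(doc[target_field]))]


# Investors of round B.
round_b_investors = filter_join(
    funding_rounds, lambda d: d["round_code"] == "B", "investor_ids",
    investors, "id")

# Rounds in which Benchmark invested (join in the other direction).
benchmark_rounds = filter_join(
    investors, lambda d: d["name"] == "Benchmark", "id",
    funding_rounds, "investor_ids")

print([d["name"] for d in round_b_investors])       # ['Benchmark', 'Index Ventures']
print([d["round_code"] for d in benchmark_rounds])  # ['A', 'B']
```

Each investor is stored once and linked by ID, so nothing is duplicated or materialised at index time; the cost moves to query time, where the ID set has to be gathered and shipped to the other index.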
Query-time join
● Pros:
– Update friendly
– No loss of precision
– No data duplication
– No index-time data materialisation
● Cons:
– Slower than parent-child
– No data locality principle: network transfer
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
Name : Data Collective
Type : Org
round_code : B
founded_year : 2013
...
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
● Each solution has its own advantages and disadvantages
– Trade-off between performance, scalability and flexibility
              BlockJoin        SIREn            Parent-Child     FilterJoin
Performance   ++               ++               +                -
Scalability   +                ++               +                +
Flexibility   -                -                +                ++
Best for      Simple nested    Complex nested   Simple nested    Relational
              model,           model,           model,           model,
              fixed data       fixed data       dynamic data     dynamic data
Summary
Pivot Browser
Knowledge Browser
Crunchbase Demo
Contact Info
76 Tudor Lawn, Newcastle
info@siren.solutions
siren.solutions
We're hiring!

 

Searching Relational Data with Elasticsearch

  • 1. Searching Relational Data with Elasticsearch Dr. Renaud Delbru CTO, Siren Solutions
  • 2. Introducing myself
       ● CTO, SIREn Solutions
         – Search, Big Data, Knowledge Graph
       ● Lucene / Solr Contributor
         – E.g., Cross Data Center Replication
         – Lucene Revolution 2013, 2014
         – Lucene In Action, 2nd Edition
       ● Author of the SIREn plugin
  • 3. Background
       ● Open source search systems
         – Lucene, Solr, Elasticsearch
       ● Document-based model
         – Flat key-value model
         – Originally developed for searching full-text documents
       Example flat document (diagram):
         firstname : John
         lastname : Smith
         title : Mr, Dr
  • 4. Background
       ● Data is usually more complex
         – Nested objects
           ● XML, JSON
           ● E.g., US patents
         – Relations
           ● RDBMS, RDF, Graph, Documents with links to entities or other documents
       Nested object example:
         {
           "firstName": "John",
           "lastName": "Smith",
           "age": 25,
           "address" : {
             "street" : "21 2nd Street",
             "city" : "New York",
             "state" : "NY"
           },
           "phoneNumber" : [
             { "type" : "home", "number" : "212 555-1234" },
             { "type" : "fax", "number" : "646 555-4567" }
           ]
         }
       Relations example (diagram): Article, Person, Company
  • 5. Crunchbase example
       Diagram: Elastic; funding rounds Series A and Series B; investors Data Collective, Benchmark, Index Ventures
  • 6. Common solutions
       name : Elastic
       funding_rounds.round_code : A
       funding_rounds.founded_year : 2012
       funding_rounds.round_code : B
       funding_rounds.founded_year : 2013
       funding_rounds.investments.name : Benchmark
       funding_rounds.investments.name : Data Collective
       funding_rounds.investments.name : Index Ventures
       ● Pros:
         – Relatively easy
         – Fast
       ● Cons:
         – Loss of precision, false positive
         – Index-time data materialisation
         – Data duplication (child)
         – Not optimal for updates
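The loss of precision listed on slide 6 can be illustrated with a small in-memory sketch. This is not Elasticsearch code: the `flatten`, `matches_flat` and `matches_nested` helpers are hypothetical, and the assignment of investors to rounds is read off the denormalised records on slide 7.

```python
# Hypothetical in-memory model of the Crunchbase example (no Elasticsearch involved).
company = {
    "name": "Elastic",
    "funding_rounds": [
        {"round_code": "A", "founded_year": 2012,
         "investments": [{"name": "Benchmark"}, {"name": "Data Collective"}]},
        {"round_code": "B", "founded_year": 2013,
         "investments": [{"name": "Benchmark"}, {"name": "Index Ventures"}]},
    ],
}


def flatten(obj, prefix=""):
    """Collapse nested objects into dotted field names, each holding a set of values."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, set()).update(sub_values)
            else:
                fields.setdefault(path, set()).add(item)
    return fields


def matches_flat(fields, conditions):
    """AND of field == value tests against the flattened document."""
    return all(value in fields.get(path, set()) for path, value in conditions.items())


def matches_nested(doc, round_code, investor):
    """Ground truth: both conditions must hold inside the same funding round."""
    return any(
        r["round_code"] == round_code
        and any(i["name"] == investor for i in r["investments"])
        for r in doc["funding_rounds"]
    )


flat = flatten(company)
query = {
    "funding_rounds.round_code": "B",
    "funding_rounds.investments.name": "Data Collective",
}
# The flat document matches although Data Collective did not invest in round B.
false_positive = matches_flat(flat, query) and not matches_nested(company, "B", "Data Collective")
print(false_positive)
```

Once the fields are flattened, the index no longer records which investor belongs to which round, so any combination of a round code and an investor name that occur somewhere in the document will match.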
  • 7. Common solutions
       Record 1:
         name : Elastic
         f_r.round_code : A
         f_r.founded_year : 2012
         f_r.inv.name : Benchmark
       Record 2:
         name : Elastic
         f_r.round_code : A
         f_r.founded_year : 2012
         f_r.inv.name : Data Collective
       Record 3:
         name : Elastic
         f_r.round_code : B
         f_r.founded_year : 2013
         f_r.inv.name : Benchmark
       Record 4:
         name : Elastic
         f_r.round_code : B
         f_r.founded_year : 2013
         f_r.inv.name : Index Ventures
       ● Pros:
         – Relatively easy
         – No loss of precision
       ● Cons:
         – Index-time data materialisation
         – Combinatorial explosion
         – Duplicate results: query-time grouping is necessary
         – Data duplication (parent and child)
         – Not optimal for updates
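The denormalisation on slide 7 can be sketched in the same in-memory style. The `denormalise` and `search` helpers are hypothetical names for illustration; the field names follow the abbreviated ones on the slide.

```python
# Hypothetical sketch: one flat record per (funding round, investor) pair.
rounds = [
    ("A", 2012, ["Benchmark", "Data Collective"]),
    ("B", 2013, ["Benchmark", "Index Ventures"]),
]


def denormalise(company_name, rounds):
    """Materialise every round/investor combination as its own record."""
    return [
        {
            "name": company_name,
            "f_r.round_code": code,
            "f_r.founded_year": year,
            "f_r.inv.name": investor,
        }
        for code, year, investors in rounds
        for investor in investors
    ]


def search(records, conditions):
    """Return the records where every field equals the requested value."""
    return [r for r in records if all(r.get(k) == v for k, v in conditions.items())]


records = denormalise("Elastic", rounds)

# Precision is preserved: round B and Data Collective never share a record.
precise_hits = search(records, {"f_r.round_code": "B", "f_r.inv.name": "Data Collective"})

# The same company is returned once per matching record, so results must be grouped.
duplicate_hits = search(records, {"f_r.inv.name": "Benchmark"})
grouped = sorted({hit["name"] for hit in duplicate_hits})

print(len(records), len(precise_hits), len(duplicate_hits), grouped)
```

The record count is the product of the nesting fan-outs, which is where the combinatorial explosion comes from, and the final `grouped` step stands in for the query-time grouping the slide mentions.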
  • 8. Index-time join
       ● Lucene's BlockJoin
         – Feature to provide relational search
         – “Nested” type in Elasticsearch
       ● Model
         – One (flat) document per record
         – Joins computed at index time
         – Related documents are indexed in the same “block”
       {
         "company": {
           "properties" : {
             "funding_rounds" : {
               "type" : "nested",
               "properties" : {
                 "investments" : { "type" : "nested" }
               }
             }
           }
         }
       }
  • 9. Index-time join
       ● Pros:
         – Fast (join precomputed, data locality)
         – No loss of precision
       ● Cons:
         – Index-time data materialisation
         – Data duplication (child)
         – Not optimal for updates
         – High memory usage for complex nested model
       Document Block (diagram):
         name : Elastic, country_code : A, ...
           round_code : A, founded_year : 2012, ...
             Name : Data Collective, Type : Org
             Name : Benchmark, Type : Org
           round_code : B, founded_year : 2013, ...
             Name : Index Ventures, Type : Org
             Name : Benchmark, Type : Org
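The block model of slides 8 and 9 can be sketched as a single pass over a flat list of documents. This is a simplified illustration, not Lucene code: it assumes the children-first, parent-last block layout used by Lucene's block join, handles one nesting level only, and the `join_to_parents` helper and the "ExampleCo" block are made up for the example.

```python
# Hypothetical flat index laid out in blocks: child documents first, parent last.
# "ExampleCo" and its funding round are invented data that add a second block.
index = [
    {"type": "round", "round_code": "A", "investors": ["Benchmark", "Data Collective"]},
    {"type": "round", "round_code": "B", "investors": ["Benchmark", "Index Ventures"]},
    {"type": "company", "name": "Elastic"},
    {"type": "round", "round_code": "B", "investors": ["Data Collective"]},
    {"type": "company", "name": "ExampleCo"},
]


def join_to_parents(index, child_predicate):
    """Return parents whose block contains at least one matching child."""
    parents = []
    block_has_match = False
    for doc in index:
        if doc["type"] == "company":
            # A parent document closes the current block.
            if block_has_match:
                parents.append(doc["name"])
            block_has_match = False
        elif child_predicate(doc):
            block_has_match = True
    return parents


def round_b_with(investor):
    """Child-level predicate: both conditions are tested on the same round."""
    return lambda doc: doc["round_code"] == "B" and investor in doc["investors"]


print(join_to_parents(index, round_b_with("Data Collective")))
print(join_to_parents(index, round_b_with("Benchmark")))
```

Because the join is implied by document order, no lookup is needed at query time, which matches the "join precomputed, data locality" advantage. The cost is that changing any child means rewriting the whole block.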
  • 10. Index-time join
       ● SIREn Plugin
         – Plugin for Lucene, Solr, Elasticsearch
         – Adds a native index for nested data types
         – http://siren.solutions/siren/overview/
       ● Model
         – One document per “tree”
         – Joins computed at index time
         – Rich data model (JSON)
           ● Nested objects, nested arrays, multi-valued attributes, datatypes
       {
         "company": {
           "properties" : {
             "_siren_source" : {
               "analyzer" : "concise",
               "postings_format" : "Siren10AFor",
               "store" : "no",
               "type" : "string"
             }
           }
         }
       }
  • 11. Index-time join
       ● Pros:
         – Fast (join precomputed, data locality)
         – No loss of precision
         – Low memory usage, even for complex nested model
       ● Cons:
         – Index-time data materialisation
         – Data duplication (child)
         – Not optimal for updates
       Tree with node identifiers (diagram):
         1      name : Elastic, country_code : A, ...
         1.1    round_code : A, founded_year : 2012, ...
         1.2    round_code : B, founded_year : 2013, ...
         1.1.1  Name : Data Collective, Type : Org
         1.1.2  Name : Benchmark, Type : Org
         1.2.1  Name : Index Ventures, Type : Org
         1.2.2  Name : Benchmark, Type : Org
  • 13. Query-time join
       ● Elasticsearch's Parent-Child
         – Query-time join for nested data
       ● Model
         – One (flat) document per record
         – At index time, child documents should specify their parent ID with the _parent field
         – Joins computed at query time
       {
         "company": {},
         "investment" : {
           "_parent" : { "type" : "company" }
         },
         "investor" : {
           "_parent" : { "type" : "investment" }
         }
       }
  • 14. Query-time join
       ● Pros:
         – Update friendly
         – No loss of precision
         – Data locality: parent and child on same shard
       ● Cons:
         – Slower than index-time solutions
         – Larger memory use than nested
         – Data duplication (child)
           ● A child cannot have more than one parent
         – Index-time data materialisation
       Diagram:
         name : Elastic, country_code : A, ...
           round_code : A, founded_year : 2012, ...
             Name : Data Collective, Type : Org
             Name : Benchmark, Type : Org
           round_code : B, founded_year : 2013, ...
             Name : Index Ventures, Type : Org
             Name : Benchmark, Type : Org
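The parent-child model of slides 13 and 14 can be sketched as separate flat documents linked by a `_parent` value and joined when the query runs. The document IDs (`c1`, `r1`, `i1`, ...) and the helpers below are hypothetical; only the three types and the `_parent` field come from the slides.

```python
# Hypothetical store: one flat document per record, children point at their parent ID.
docs = {
    "company": {"c1": {"name": "Elastic"}},
    "investment": {
        "r1": {"_parent": "c1", "round_code": "A"},
        "r2": {"_parent": "c1", "round_code": "B"},
    },
    "investor": {
        # Benchmark is stored twice because a child can have only one parent.
        "i1": {"_parent": "r1", "name": "Benchmark"},
        "i2": {"_parent": "r1", "name": "Data Collective"},
        "i3": {"_parent": "r2", "name": "Benchmark"},
        "i4": {"_parent": "r2", "name": "Index Ventures"},
    },
}


def parent_ids(children, predicate):
    """Collect the parent IDs of all child documents that satisfy the predicate."""
    return {child["_parent"] for child in children.values() if predicate(child)}


def companies_with(round_code, investor_name):
    """Two query-time joins: investor -> investment -> company."""
    round_ids = parent_ids(docs["investor"], lambda d: d["name"] == investor_name)
    candidate_rounds = {
        rid: r for rid, r in docs["investment"].items() if rid in round_ids
    }
    company_ids = parent_ids(candidate_rounds, lambda d: d["round_code"] == round_code)
    return sorted(docs["company"][cid]["name"] for cid in company_ids)


print(companies_with("A", "Data Collective"))
print(companies_with("B", "Data Collective"))
```

Adding or changing an investor only touches that one child document, which is why the approach is update friendly, while the ID lookups performed on every query account for the extra cost compared with the index-time solutions.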
  • 15. Query-time join
       ● FilterJoin plugin
         – Query-time join for relational data
           ● Inspired by #3278
       ● Model
         – One (flat) document per record
         – At index time, documents should specify the IDs of their related documents in a given field
         – At query time, lookup ID values from a given field to filter documents from another index
  • 16. Query-time join
       ● Pros:
         – Update friendly
         – No loss of precision
         – No data duplication
         – No index-time data materialisation
       ● Cons:
         – Slower than parent-child
         – No data locality principle: network transfer
       Diagram:
         name : Elastic, country_code : A, ...
         round_code : A, founded_year : 2012, ...
         Name : Data Collective, Type : Org
         round_code : B, founded_year : 2013, ...
         Name : Index Ventures, Type : Org
         Name : Benchmark, Type : Org
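The lookup-then-filter idea of slides 15 and 16 can be sketched with two independent in-memory "indices". This does not use the FilterJoin plugin's actual API, which the slides do not show: the `filter_join` function, the `investor_ids` field and the document IDs are assumptions made for the illustration.

```python
# Hypothetical "investor" index: each investor is stored exactly once.
investors = {
    "benchmark": {"name": "Benchmark", "type": "Org"},
    "data-collective": {"name": "Data Collective", "type": "Org"},
    "index-ventures": {"name": "Index Ventures", "type": "Org"},
}

# Hypothetical "funding round" index: related investors are referenced by ID.
funding_rounds = [
    {"company": "Elastic", "round_code": "A", "investor_ids": ["benchmark", "data-collective"]},
    {"company": "Elastic", "round_code": "B", "investor_ids": ["benchmark", "index-ventures"]},
]


def filter_join(source_index, source_predicate, target_docs, target_field):
    """Look up IDs in one index, then keep target documents that reference any of them."""
    ids = {doc_id for doc_id, doc in source_index.items() if source_predicate(doc)}
    return [doc for doc in target_docs if ids.intersection(doc[target_field])]


by_index_ventures = filter_join(
    investors, lambda d: d["name"] == "Index Ventures", funding_rounds, "investor_ids"
)
by_benchmark = filter_join(
    investors, lambda d: d["name"] == "Benchmark", funding_rounds, "investor_ids"
)

print([r["round_code"] for r in by_index_ventures])
print([r["round_code"] for r in by_benchmark])
```

Nothing is duplicated or materialised at index time, but the collected ID set has to be shipped to wherever the second index lives, which corresponds to the network transfer cost listed under the cons.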
  • 17. Summary
       ● Each solution has its own advantages and disadvantages
         – Trade-off between performance, scalability and flexibility

                     BlockJoin            SIREn                 Parent-Child         FilterJoin
       Performance   ++                   ++                    +                    -
       Scalability   +                    ++                    +                    +
       Flexibility   -                    -                     +                    ++
       Best for      Simple nested model  Complex nested model  Simple nested model  Relational model
                     Fixed data           Fixed data            Dynamic data         Dynamic data
  • 19. Contact Info 76 Tudor Lawn, Newcastle info@siren.solutions siren.solutions We're hiring!
