[iks-community] Semantic search engine kickoff

Stephane Corlosquet stephane.corlosquet at deri.org
Fri Jul 3 13:25:24 CEST 2009


Hi all,

Below is the architecture that DERI would like to suggest for the IKS
Semantic Search Engine. The figure [1] shows a set of CMS sites that
comply with the best practices of RDF data publishing, which include
RDFa, a local schema export (site vocabulary) and a SPARQL endpoint. We
have worked on a set of Drupal modules, detailed in a technical
report at [2], but their features could be generalized to other CMSs.
Sites can request to be included in the IKS search engine via a form
on the IKS search engine site or programmatically via a ping. Pings are
also used when a specific resource/page has been updated on a given
site, so that the search engine can schedule a recrawl of the
resource as soon as possible.
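
As a rough illustration of the ping mechanism described above, here is a
minimal Python sketch of how the search engine side might queue recrawl
requests. Everything in it is hypothetical and not taken from existing
IKS or SWSE code: the names RecrawlQueue, ping and next_url, the example
URLs, and the assumption that a ping for a single updated resource is
served before a request to include a whole new site.

```python
import heapq
import itertools

# Hypothetical priorities (an assumption): lower number is crawled first.
PRIORITY_RESOURCE_UPDATE = 0  # a single page changed, recrawl soon
PRIORITY_NEW_SITE = 1         # a whole site asks to be included


class RecrawlQueue:
    """Toy priority queue of URLs waiting to be (re)crawled."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order
        self._pending = set()              # URLs already in the queue

    def ping(self, url, priority=PRIORITY_RESOURCE_UPDATE):
        """Register a ping; a duplicate ping for a queued URL is ignored."""
        if url in self._pending:
            return False
        self._pending.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        self._pending.discard(url)
        return url


queue = RecrawlQueue()
queue.ping("http://example.org/", PRIORITY_NEW_SITE)
queue.ping("http://example.org/node/42")
queue.ping("http://example.org/node/42")  # duplicate, ignored
```

A crawler loop would then call next_url() repeatedly; in this sketch the
updated page comes out ahead of the site-level request even though it
was pinged later.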

The semantic search engine stack is composed of several layers: data
gathering, parsing, validation and indexing. The search engine first
gathers the data by crawling the sites. It then parses the RDF data with
the any23 parser [3], a Java library that extracts structured data in
RDF format from a variety of Web documents (it supports microformats,
RDFa and other common RDF serialization formats). If needed, the
NxParser [4] cleans up the data and formats it as N-Quads [5].

Before a site can be included in the IKS search engine, it first goes
through the RDFAlerts validator, which ensures that the RDF data
published by the site complies with the RDF publishing best practices.
RDFAlerts also does some RDF consistency checking. Additionally, other
IKS-specific policies regarding the sites included in the search engine
could be added at this stage.

Finally, the SWSE engine [6] takes care of the indexing and storage of
the data. Powered by YARS2, it provides distributed storage and
retrieval facilities. Its index structures are optimized for retrieval
of RDF statements including context (quads) while minimizing the need
for joins, and Lucene full-text indexing supports efficient keyword
searches. SWSE's SPARQL endpoint allows any RDF visualization tool to be
plugged in, e.g. VisiNav [7]. See the screencast at [8] (1'36) for the
possibilities offered by VisiNav.
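
To make the idea of storing statements with their context (quads)
alongside a keyword index more concrete, here is a small in-memory
Python sketch. It is only a toy stand-in for the indexing layer: it does
not reflect how YARS2, SWSE or Lucene are actually implemented, and the
class ToyQuadStore, its methods and the example URIs are all made up for
illustration.

```python
from collections import defaultdict


class ToyQuadStore:
    """In-memory stand-in for the indexing layer (not YARS2)."""

    def __init__(self):
        self.quads = []                      # (subject, predicate, object, context)
        self.by_subject = defaultdict(list)  # subject -> positions in self.quads
        self.by_keyword = defaultdict(set)   # lowercased word -> positions

    def add(self, subject, predicate, obj, context, literal=False):
        """Store one quad; literal objects are also indexed word by word."""
        pos = len(self.quads)
        self.quads.append((subject, predicate, obj, context))
        self.by_subject[subject].append(pos)
        if literal:
            for word in obj.lower().split():
                self.by_keyword[word].add(pos)

    def about(self, subject):
        """All quads with the given subject, without needing a join."""
        return [self.quads[i] for i in self.by_subject.get(subject, [])]

    def search(self, word):
        """Quads whose literal object contains the keyword (case-insensitive)."""
        hits = self.by_keyword.get(word.lower(), ())
        return [self.quads[i] for i in sorted(hits)]


store = ToyQuadStore()
page = "http://example.org/node/1"  # hypothetical crawled page, used as context
store.add(page, "http://purl.org/dc/terms/title",
          "Semantic search kickoff", page, literal=True)
store.add(page, "http://xmlns.com/foaf/0.1/maker",
          "http://example.org/user/7", page)

hits = store.search("Kickoff")
```

Keeping the context as the fourth element means that every search hit
can be traced back to the page it was crawled from, which is what a
ping-triggered recrawl would need in order to replace stale statements.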


[1] http://srvgal65.deri.ie/files/iks_search_engine_cloud.pdf
[2] http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-30.pdf
[3] http://code.google.com/p/any23/
[4] http://sw.deri.org/2006/08/nxparser/
[5] http://sw.deri.org/2008/07/n-quads/
[6] http://www.swse.org/
[7] http://visinav.deri.org/
[8] http://www.youtube.com/watch?v=r4WgTRIRoa0

Bertrand Delacretaz wrote:
> Hi,
>
> Time has flown and I haven't kicked off the semantic search engine
> discussions yet, following up on our discussions at the Salzburg IKS
> meeting.
>
> I'll be mostly offline next week, but I wanted to at least start the
> discussion here, so that we can go forward.
>
> The idea is to start from the
> http://www.interactive-knowledge.org/content/iks-search-engine-proposal,
> and prototype something that we can play with quickly.
>
> The first use case that I'd like us to implement looks like this:
>
> 0. Select a website that contains interesting data in microformats and/or RDFa
> 1. Add the homepage URL to the search engine crawler config
> 2. Search engine crawls website, indexes full text and structured data
> extracted from microformats and/or RDFa
> 3. Simple UI allows for searching that data, both full-text and structured
> 4. Structured data should be exportable in standard formats for
> further processing with semantic tools
>
> If anyone knows of existing software that would allow us to set this
> up with no or minimal programming work that would be cool (I don't). I
> assume we can host that on IKS servers, though details of that have to
> be finalized.
>
> If there's no existing software that does that, let's see what the
> minimal steps are that allow us to implement this, just as a first
> prototype that can be used as a basis for creating the next one. I'd
> lean towards Lucene, Solr or Jackrabbit as those are the things that I
> know best in this area, but this is all open.
>
> Comments are welcome, of course!
>
> -Bertrand (mostly offline until next Thursday July 2nd)

