[iks-community] Semantic search engine kickoff
Stephane Corlosquet
stephane.corlosquet at deri.org
Fri Jul 3 13:25:24 CEST 2009
Hi all,
Below is the architecture that DERI would like to suggest for the IKS
Semantic Search Engine. The figure [1] shows a set of CMS sites
complying with the best practices of RDF data publishing, which include
RDFa, a local schema export (site vocabulary) and a SPARQL endpoint. We
have worked on a set of modules for Drupal detailed in a technical
report at [2], but their features could be generalized to other CMSs.
The sites can request to be included in the IKS search engine via a form
on the IKS search engine site or programmatically via a ping. Pings are
also used in the case where a specific resource/page has been updated on
a given site in order for the search engine to schedule a recrawl of the
resource as soon as possible.
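To make the ping idea concrete, here is a minimal sketch of how pinged
resources could jump ahead of routine crawls. The class and method names
(RecrawlQueue, ping, pop) are hypothetical and only for illustration; this
is not the scheduler used by any existing component.

```python
import heapq
import itertools


class RecrawlQueue:
    """Hypothetical crawl queue: pinged URLs are served before routine ones."""

    PING, ROUTINE = 0, 1  # lower number = crawled sooner

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority
        self._best = {}  # url -> best (lowest) priority currently queued

    def add(self, url, priority=ROUTINE):
        # Ignore the request if the URL is already queued at least as urgently.
        if url in self._best and self._best[url] <= priority:
            return
        self._best[url] = priority
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def ping(self, url):
        """A site reports that this resource changed: recrawl it first."""
        self.add(url, self.PING)

    def pop(self):
        """Return the next URL to crawl, or None if the queue is empty."""
        while self._heap:
            priority, _, url = heapq.heappop(self._heap)
            # Skip stale entries that were superseded by a later ping.
            if self._best.get(url) == priority:
                del self._best[url]
                return url
        return None
```

A URL that is already waiting for a routine crawl and then gets pinged is
returned only once, at the front of the queue.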
The semantic search engine stack is composed of several layers of data
gathering, parsing, validation and indexing. The search engine first
gathers the data by crawling the sites. It then parses the RDF data with
the any23 parser [3], a Java library that extracts structured data in
RDF format from a variety of Web documents (it supports microformats, RDFa
and other common RDF serialization formats). If needed, the NxParser [4]
cleans up the data and formats it as N-Quads [5]. Before a site can be
included in the IKS search engine, it first goes through the RDFAlerts
validator, which ensures the RDF data contained in the sites complies
with the RDF publishing best practices. RDFAlerts also does some RDF
consistency checking. Additionally, other IKS-specific policies
regarding the sites included in the search engine could be added here.
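As an illustration of the N-Quads format [5] mentioned above, each statement
is written on one line as subject, predicate, object and context, followed by
a dot. The sketch below is not the NxParser API; the function names and the
(kind, value) term representation are assumptions made for the example, and
only the most common literal escapes are handled.

```python
def format_term(term):
    """Serialize one RDF term given as a (kind, value) pair (assumed layout)."""
    kind, value = term
    if kind == "uri":
        return "<" + value + ">"
    if kind == "bnode":
        return "_:" + value
    if kind == "literal":
        escaped = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        return '"' + escaped + '"'
    raise ValueError("unknown term kind: " + kind)


def to_nquad(subject, predicate, obj, context):
    """Return one N-Quads line; the context here is the source document URI."""
    terms = (subject, predicate, obj, context)
    return " ".join(format_term(t) for t in terms) + " ."
```

For example, a page title extracted from http://example.org/node/1 would be
written with that same URL as the fourth element, so the index can later tell
which site a statement came from.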
Finally, the SWSE engine [6] takes care of the indexing and storage of
the data. Powered by YARS2, it provides distributed storage and
retrieval facilities. Indexing structures are optimized for retrieval of
RDF statements including context (quads) while minimizing the need for
joins, plus Lucene fulltext indexing for efficient keyword searches.
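To give a rough feel for why keeping several index structures over the same
quads reduces the need for joins, here is a toy in-memory sketch. It is only
an illustration of the general idea and says nothing about how YARS2 is
actually implemented; QuadIndex and its methods are hypothetical names.

```python
from collections import defaultdict


class QuadIndex:
    """Toy index: one lookup table per quad position (illustration only)."""

    # Position of each element inside a (subject, predicate, object, context) tuple.
    POSITIONS = {"subject": 0, "predicate": 1, "object": 2, "context": 3}

    def __init__(self):
        self._by = {name: defaultdict(set) for name in self.POSITIONS}

    def add(self, s, p, o, c):
        quad = (s, p, o, c)
        # Store the full quad under each of its four terms.
        for name, pos in self.POSITIONS.items():
            self._by[name][quad[pos]].add(quad)

    def lookup(self, name, value):
        """Return all quads whose given position equals value, in sorted order."""
        return sorted(self._by[name].get(value, ()))
```

Because every table stores complete quads, a question such as "all statements
published by this site" is answered from the context table alone, at the cost
of storing each quad several times.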
SWSE's SPARQL endpoint makes it possible to plug in any RDF visualization
tool, e.g. VisiNav [7]. See the screencast at [8] (1'36) for the
possibilities offered by VisiNav.
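For tools that want to talk to the endpoint directly, the SPARQL protocol
lets a client send a query as the "query" parameter of an HTTP GET request.
The sketch below only builds such a request URL; the endpoint address is a
placeholder, not the actual address of the SWSE endpoint.

```python
from urllib.parse import urlencode


def build_sparql_url(endpoint, query):
    """Build a SPARQL protocol GET URL for the given endpoint and query."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder endpoint, for illustration only.
url = build_sparql_url(
    "http://search.example.org/sparql",
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
)
```

The resulting URL can then be fetched with any HTTP client, asking for the
preferred result format through the Accept header.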
[1] http://srvgal65.deri.ie/files/iks_search_engine_cloud.pdf
[2] http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-30.pdf
[3] http://code.google.com/p/any23/
[4] http://sw.deri.org/2006/08/nxparser/
[5] http://sw.deri.org/2008/07/n-quads/
[6] http://www.swse.org/
[7] http://visinav.deri.org/
[8] http://www.youtube.com/watch?v=r4WgTRIRoa0
Bertrand Delacretaz wrote:
> Hi,
>
> Time has flown and I haven't kicked off the semantic search engine
> discussions yet, following up on our discussions at the Salzburg IKS
> meeting.
>
> I'll be mostly offline next week, but I wanted to at least start the
> discussion here, so that we can go forward.
>
> The idea is to start from the
> http://www.interactive-knowledge.org/content/iks-search-engine-proposal,
> and prototype something that we can play with quickly.
>
> The first use case that I'd like us to implement is like:
>
> 0. Select a website that contains interesting data in microformats and/or RDFa
> 1. Add the homepage URL to the search engine crawler config
> 2. Search engine crawls website, indexes full text and structured data
> extracted from microformats and/or RDFa
> 3. Simple UI allows for searching that data, both full-text and structured
> 4. Structured data should be exportable in standard formats for
> further processing with semantic tools
>
> If anyone knows of existing software that would allow us to set this
> up with no or minimal programming work that would be cool (I don't). I
> assume we can host that on IKS servers, though details of that have to
> be finalized.
>
> If there's no existing software that does that, let's see what the
> minimal steps are that allow us to implement this, just as a first
> prototype that can be used as a basis for creating the next one. I'd
> lean towards Lucene, Solr or Jackrabbit as those are the things that I
> know best in this area, but this is all open.
>
> Comments are welcome, of course!
>
> -Bertrand (mostly offline until next Thursday July 2nd)
> _______________________________________________
> iks-community mailing list
> iks-community at iks-project.eu
> http://lists.iks-project.eu/cgi-bin/mailman/listinfo/iks-community
>