Review of the GSOC SOLR work for integration in XWiki Platform

Review of the GSOC SOLR work for integration in XWiki Platform

vmassol
Administrator
Hi Savitha,

I've started a quick review of the SOLR code in preparation for its integration in the platform, and I have some questions which I jotted down below as I was reviewing the code. Sorry for the terse format; I actually wrote the questions to myself and then decided to send them as is :)

General:

* Need an architecture diagram showing the main modules and threads and how they interact with the platform

Search-api:

* Is the Search API supposed to be independent of SOLR?
* The Search interface is strange: it exposes implementation details such as getImplementation() and initialize().
* It also mixes in other concerns such as getStatus(), getStatusAsJson(), getVelocityUtils() and getSearchRequest().
* Why do we need a Search interface at all? Why not instead use the Query module and introduce a new query type? (Note that the List returned by Query.execute() probably needs to be clarified.) SearchRequest would then be replaced by a Query implementation.
* The naming of the interfaces is a bit strange. For example, shouldn't BuildIndex be IndexBuilder instead? And shouldn't DeleteIndex be IndexDeleter?
* I don't think we need deleteDocumentIndex(), deleteWikiIndex(), deleteSpaceIndex(), etc. We need a single deleteEntity(EntityReference reference, EntityType type); same for IndexBuilder (see the sketch after this list).
* Why is there a DocumentIndexer interface? Why is a Document different from other entities? For example I can see DocumentIndexer.deleteIndex(); why not IndexDeleter.deleteEntity(documentRef)?
* Why is there a need for RebuildIndex (which I assume is meant to be IndexRebuilder), and why can't we use IndexBuilder for that?
* Why the need for SearchIndex?
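
As a rough sketch of the kind of unified, entity-based contract I have in mind (the names are illustrative and don't exist in the code; EntityReference and EntityType are the existing model classes):

    // Illustrative sketch only: one entity-based method replaces
    // deleteDocumentIndex(), deleteWikiIndex(), deleteSpaceIndex(), etc.
    import org.xwiki.model.EntityType;
    import org.xwiki.model.reference.EntityReference;

    public interface IndexDeleter
    {
        /** Remove the index entries for the passed entity (wiki, space, document...). */
        void deleteEntity(EntityReference reference, EntityType type);
    }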

Search-solrj:

* The solrj server is used in embedded mode.
* The solrj home shouldn't come from a system property but from the xwiki configuration instead (see below in Misc).
* EmbeddedSolrServer depends on the Servlet API? "Also, if using EmbeddedSolrServer, keep in mind that Solr depends on the Servlet API." from http://wiki.apache.org/solr/Solrj
* EmbeddedSolrServer should be started by listening to the application started event instead of lazily in Initializable IMO (see the sketch after this list).
* Since we use EmbeddedSolrServer, how do we handle clustering? One instance per wiki instance? How do they reconcile their indexes? We need an architecture diagram for our solution for heavy loads.
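
Here is a minimal sketch of what I mean by listening to the application started event. It assumes our observation EventListener API and the ApplicationStartedEvent; the actual Solr bootstrapping is only hinted at in a comment:

    import java.util.Arrays;
    import java.util.List;

    import org.xwiki.observation.EventListener;
    import org.xwiki.observation.event.ApplicationStartedEvent;
    import org.xwiki.observation.event.Event;

    public class SolrStartupListener implements EventListener
    {
        @Override
        public String getName()
        {
            return "solrStartup";
        }

        @Override
        public List<Event> getEvents()
        {
            return Arrays.<Event>asList(new ApplicationStartedEvent());
        }

        @Override
        public void onEvent(Event event, Object source, Object data)
        {
            // Create the EmbeddedSolrServer here, with the solr home taken from
            // the XWiki configuration rather than from a system property.
        }
    }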

Misc:

* all APIs to review and improve/stabilize
* typos to fix
* licenses to fix
* poms to fix
* missing class javadoc (e.g. BuildIndex, DeleteIndex, etc.)
* exception handling to verify (e.g. SolrjSearchEngine)
* remove unneeded javadoc on @Override methods
* Need to use the XWiki Permanent Directory for storing the SOLR configuration data (the solr home). The data currently in solr/ needs to move to a solr-configuration jar module, which would be used as a fallback when the data doesn't exist in the solr home dir.
* Idea: use solr JMX to provide admin features (http://wiki.apache.org/solr/SolrJmx)
* TODO: Think about how to migrate users to use SOLR instead of Lucene or DB Searches. Need a plan.

Thanks!
-Vincent

Re: Review of the GSOC SOLR work for integration in XWiki Platform

sasinda
Hi,

> Since we use EmbeddedSolrServer how do we handle clustering? One instance
> per wiki instance? How do they reconcile their indexes? Need an
> architecture diagram for our solution for heavy loads.

Maybe you can use an embedded Infinispan (http://www.jboss.org/infinispan/) in
replication mode to store the index and do the clustering.

Regards,
Sasinda.



Re: Review of the GSOC SOLR work for integration in XWiki Platform

Paul Libbrecht-2
>> Since we use EmbeddedSolrServer how do we handle clustering? One instance
>> per wiki instance? How do they reconcile their indexes? Need an
>> architecture diagram for our solution for heavy loads.
> Maybe you can use an embedded Infinispan (http://www.jboss.org/infinispan/)
> in replication mode to store the index and do the clustering.

That's a cute solution, but I/O is really a strong point of SOLR, and using distributed storage is likely to be a performance issue.
SOLR has replication; it actually started as a replicator for Lucene.
So my best suggestion (and what we intend to do in Curriki) is that clustering should be complemented by externalizing Solr, using the remote client of SolrJ.
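
In SolrJ terms that would look roughly like this (a sketch only; the class is HttpSolrServer in recent SolrJ versions, CommonsHttpSolrServer in older ones, and the URL is just an example):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    // Point SolrJ at an external Solr instance instead of embedding one;
    // the URL would of course come from the XWiki configuration.
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");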

Savitha, can you confirm this is possible?

Then, if more scalability is desired, you'd want to start more SOLR slaves.
The way SOLR does replication, and updates, is fundamentally different from XWiki's.
SOLR works with a master-and-slave approach and does take some time to update; in exchange it caches extremely eagerly.

Paul
Re: Review of the GSOC SOLR work for integration in XWiki Platform

Paul Libbrecht-2
In reply to this post by vmassol
Hello community,

here are my gut responses; I have not synced with Savitha.

On 28 August 2012 at 14:04, Vincent Massol wrote:

> Search-api:
> * Is the Search API supposed to be independent of SOLR?

I think this can be aimed at but it does not mean it will find a valid implementation for each function.

> * Search interface is strange, it has implementation details such as: getImplementation(), initialize(),

These two methods are not implementation details to me.
Maybe initialize() is a natural thing to happen outside in the component lifecycle, so it should be hidden.
Did you mean getImplementation() should rather be a cast?

> * It also has other concerns such as getStatus(), getStatusAsJson(),

These are useful!
I am not sure a unified UI is possible but if yes, they have to be there.

> getVelocityUtils(), getSearchRequest()
> * Why do we need a Search interface? Why not instead use the Query module and introduce a new query type? (note return List from Query.execute() probably needs to be clarified). Replace SearchRequest with Query impl

List is definitely a problem; one needs at least something that gives the total match count, the availability and parameters of the previous and next result pages, and probably some debug information.
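
Purely to illustrate the shape I mean (all names invented), something richer than a plain List:

    import java.util.List;

    // Invented names: what a richer result type could carry besides the hits themselves.
    public interface SearchResults<T>
    {
        long getTotalHitCount();  // total number of matches, not just the returned page
        int getOffset();          // start offset of the current page
        int getPageSize();        // requested page size
        List<T> getResults();     // the current page of results
        String getDebugInfo();    // optional engine-specific debug information
    }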

I also note that the practice of using strings as single entry point for queries is bad and that Savitha partially changed it.

It is bad because of the possibility of injecting arbitrary query text, which, for example, means there is no way in the current lucene plugin to force an added condition (e.g. to create a UI that queries all tasks by adding a condition on the existence of a task object).
I find the Lucene Query object very workable for this; that of Solr as well. I do not think they are compatible.

Also, the normal Solr way is actually to use a query parser which applies transformations such as the stemming of a query (you query for "arbr" when searching for "arbres", for example). All search engines I've worked on use a customizable query expansion mechanism where part of the user query is turned into a decomposed query, then expanded over the various fields (query the title with a bigger boost, query for precise matches with a bigger boost, allow matches in other languages...).
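
With SolrJ and the (e)dismax parser, for instance, that kind of expansion over several fields with different boosts looks roughly like this (field names and boost values are only examples):

    import org.apache.solr.client.solrj.SolrQuery;

    // Let Solr's edismax parser expand the raw user input over several
    // fields with different boosts (field names and boosts are examples).
    SolrQuery query = new SolrQuery("arbres");
    query.set("defType", "edismax");
    query.set("qf", "title^10 content^2");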

> * Naming of interfaces are a bit strange. For example: BuildIndex; should it be IndexBuilder instead? What about DeleteIndex, should it be IndexDeleter?
> * I don't think we need deleteDocumentIndex(), deleteWikiIndex(), deleteSpaceIndex(), etc. We need a single deleteEntity(EntityReference reference, EntityType type). Same for IndexBuilder.
> * Why is there a DocumentIndexer interface?

I believe that the reason here is that it is of utmost importance to allow the indexing process to be customized.
This can be done, on the one hand, in the Solr schema.
But that is not enough: it is quite common for an application to fetch related data (for example a quality rating from an external source, or related documents) and index it alongside. Thus an interface where, for each page, added data can be injected sounds like a strict requirement to me.
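
Something like this, as a purely hypothetical shape (none of these names exist in the current code):

    import org.apache.solr.common.SolrInputDocument;
    import org.xwiki.model.reference.DocumentReference;

    // Hypothetical extension point, called for every page before it is sent to
    // Solr, so that an application can inject extra fields (a quality rating
    // fetched from an external source, related documents, etc.).
    public interface IndexingParticipant
    {
        void addFields(DocumentReference document, SolrInputDocument solrDocument);
    }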

> Why is a Document different from other entities? For ex I can see DocumentIndexer.deleteIndex() why not IndexDeleter.deleteEntity(documentRef)?
> * Why is there a need for RebuildIndex (which I assume is IndexRebuilder) and why cannot we use the IndexBuilder?

An IndexRebuilder object is necessary because index rebuilding may take several days. It needs to expose a monitoring status (e.g. the queue size) and probably additional actions.

> * Why the need for SearchIndex?
>
> Search-solrj:
>
> * solrj server in embedded mode is used.

This should be flexible, either embedded or not.

> * Shouldn't use system property but the xwiki configuration instead for the solrj home (see below in misc)

A challenge, but hopefully one that can be addressed.
> [...] (responded elsewhere)
> Misc:
> * all API to review and improve/stabilize
> * typos to fix
> * licenses to fix
> * pom to fix

Maybe you can be more precise?

thanks for your comments.

Paul

Re: Review of the GSOC SOLR work for integration in XWiki Platform

vmassol
Administrator
Hi Paul,

Thanks for your comments. I hope Savitha can chime in too at some point :)

See below.

On Aug 29, 2012, at 11:29 AM, Paul Libbrecht <[hidden email]> wrote:

> Hello community,
>
> here are my gut responses; I have not synced with Savitha.
>
> On 28 August 2012 at 14:04, Vincent Massol wrote:
>
>> Search-api:
>> * Is the Search API supposed to be independent of SOLR?
>
> I think this can be aimed at but it does not mean it will find a valid implementation for each function.
>
>> * Search interface is strange, it has implementation details such as: getImplementation(), initialize(),
>
> These two methods are not implementation details to me.
> Maybe initialize() is a natural thing to happen outside in the component lifecycle, so it should be hidden.

It definitely should. The fact that such an implementation needs some initialization, and how that initialization is done, is completely implementation-dependent.

> Did you mean getImplementation() should rather be a cast?

The interface should have absolutely no knowledge of its implementations. When you use a given implementation you know which one you're using.

BTW I've done a search and it's not used anywhere which proves it's not needed ;)

>> * It also has other concerns such as getStatus(), getStatusAsJson(),
>
> These are useful!

I never said they were not.

> I am not sure a unified UI is possible but if yes, they have to be there.

You're now talking about UI, but here we're talking about the API, not the UI. This is definitely the wrong place and the wrong module for getStatusAsJson(). What if I want XML? What you need is something that returns a Status object, plus a StatusSerializer interface with a JSONStatusSerializer implementation, or something like that.
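
Roughly something like this, just to show the separation of concerns (a sketch with invented names):

    // Sketch only: separate the status data from the way it gets serialized.
    public class Status { /* whatever state the indexer/search exposes */ }

    public interface StatusSerializer
    {
        String serialize(Status status);
    }

    public class JSONStatusSerializer implements StatusSerializer
    {
        @Override
        public String serialize(Status status)
        {
            return "{}"; // build the real JSON from the Status fields here
        }
    }

An XMLStatusSerializer (or anything else) then becomes just another implementation.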

Now, re the getStatus() method: it feels wrong because it is stateful, so it shouldn't be part of the Search interface. What should happen is that the Search interface is used to call search*(), and that should return an object such as SearchResponse.

BTW the implementation also shows it's wrong:

    @Override
    public Map<String, AbstractDocumentIndexerStatus> getStatus()
    {
        return indexer.getStatus();
    }

This shows that the status is actually the indexer's status, and if someone needs it they should get it from the indexer, not from the Search interface!

>> getVelocityUtils(), getSearchRequest()

Same as above, they shouldn't be there.

getVelocityUtils() shouldn't exist. If we need some Velocity-specific code, the architecture we have is to create Velocity Tool classes and register them as Velocity tools.

>> * Why do we need a Search interface? Why not instead use the Query module and introduce a new query type? (note return List from Query.execute() probably needs to be clarified). Replace SearchRequest with Query impl
>
> List is definitely a problem; one needs at least something that gives the total match count, the availability and parameters of the previous and next result pages, and probably some debug information.

Yes, but that's needed for the Query module too, so it can be refactored.

> I also note that the practice of using strings as single entry point for queries is bad and that Savitha partially changed it.
>
> It is bad because of the possibility of injecting arbitrary query text, which, for example, means there is no way in the current lucene plugin to force an added condition (e.g. to create a UI that queries all tasks by adding a condition on the existence of a task object).
> I find the Lucene Query object very workable for this; that of Solr as well. I do not think they are compatible.

Cool, we've also identified we need this for the Query module.

> Also, the normal Solr way is actually to use a query parser which applies transformations such as the stemming of a query (you query for "arbr" when searching for "arbres", for example). All search engines I've worked on use a customizable query expansion mechanism where part of the user query is turned into a decomposed query, then expanded over the various fields (query the title with a bigger boost, query for precise matches with a bigger boost, allow matches in other languages...).

OK, all this makes me think we definitely need to implement v2 of the Query module so that it supports HQL, XWQL and SOLR queries.

I really think this is the right way and that we shouldn't invent a new way of doing the same thing.
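
From a caller's point of view it would then look like any other Query module usage, just with a new query language hint (a sketch; the "solr" hint and the exact result type are still to be defined):

    import java.util.List;

    import org.xwiki.query.Query;
    import org.xwiki.query.QueryManager;

    // Sketch: reuse the existing Query module with a new "solr" query type.
    // queryManager is an injected QueryManager component; exception handling elided.
    Query query = queryManager.createQuery("title:arbres", "solr");
    List<Object> results = query.execute();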

>> * Naming of interfaces are a bit strange. For example: BuildIndex; should it be IndexBuilder instead? What about DeleteIndex, should it be IndexDeleter?
>> * I don't think we need deleteDocumentIndex(), deleteWikiIndex(), deleteSpaceIndex(), etc. We need a single deleteEntity(EntityReference reference, EntityType type). Same for IndexBuilder.
>> * Why is there a DocumentIndexer interface?
>
> I believe that the reason here is that it is of utmost importance to allow the indexing process to be customized.
> This can be done, on the one hand, in the Solr schema.
> But that is not enough and it is quite common for an application to fetch related data (for example to fetch the quality from an external source, for example to grab related documents) and index it along. Thus an interface where, for each page, added data can be injected sounds like a strict requirement to me.

That's not my question; I'm talking design here. I was asking why we need the DocumentIndexer interface when we can already do the same thing with the other interfaces such as IndexBuilder, IndexDeleter, etc.

>> Why is a Document different from other entities? For ex I can see DocumentIndexer.deleteIndex() why not IndexDeleter.deleteEntity(documentRef)?
>> * Why is there a need for RebuildIndex (which I assume is IndexRebuilder) and why cannot we use the IndexBuilder?
>
> An IndexRebuilder object is necessary because index rebuilding may take several days. It needs to expose a monitoring status (e.g. the queue size) and probably additional actions.

And why isn't this the case for IndexBuilder, which has exactly the same amount of work to do when you index for the first time?

>> * Why the need for SearchIndex?
>>
>> Search-solrj:
>>
>> * solrj server in embedded mode is used.
>
> This should be flexible, either embedded or not.

Well, that's my question, because using solrj in non-embedded mode is more complicated to integrate and is not covered ATM. I'd like to see it covered in the architecture, and to define how it would integrate with the XWiki Platform.

>> * Shouldn't use system property but the xwiki configuration instead for the solrj home (see below in misc)
>
> A challenge, but hopefully one that can be addressed.

That's easy.

>> [...] (responded elsewhere)
>> Misc:
>> * all API to review and improve/stabilize
>> * typos to fix
>> * licenses to fix
>> * pom to fix
>
> Maybe you can be more precise?

There are lots of issues, which is why I didn't specify them all. Let's say this is just a reminder to review them.

Let me give you just one. In one pom file I can see:

<?xml version="1.0" encoding="UTF-8"?>
<!-- * * See the NOTICE file distributed with this work for additional *
        information regarding copyright ownership. * * This is free software; you
        can redistribute it and/or modify it * under the terms of the GNU Lesser
        General Public License as * published by the Free Software Foundation; either
        version 2.1 of * the License, or (at your option) any later version. * *
        This software is distributed in the hope that it will be useful, * but WITHOUT
        ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS
        FOR A PARTICULAR PURPOSE. See the GNU * Lesser General Public License for
        more details. * * You should have received a copy of the GNU Lesser General
        Public * License along with this software; if not, write to the Free * Software
        Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA * 02110-1301 USA,
        or see the FSF site: http://www.fsf.org. * -->

The good news is that this won't build at all since we have a license checker. If it builds it means the build is not correct.

Another example is that there is a lot of stuff to remove from the pom.xml files, since it is already specified in the xwiki top-level pom which these modules should depend on.

Thanks
-Vincent

> thanks for your comments.
>
> Paul
