[Brainstorm] Notifications filters capabilities and performances.

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[Brainstorm] Notifications filters capabilities and performances.

Guillaume Delhumeau
Hi developers.

I am trying to add a new filter to the notifications to be able to follow
pages
that are marked with a given tag. And it leads me to some questions about
the
technical implementation of the notifications.

To remind the context: notifications are computed on top of the events
recorded
by the event stream (a.k.a activity stream). We take events from the event
stream SQL table, we apply some transformations on them, and we display
them to
the user.

Then we have implemented the ability to filter on these events: for example
"don't show events concerning the document A nor the wiki B". Filters are
implemented with 2 distinct ways:

  1/ SQL injections: each filter can add SQL elements in the query we make
to
     fetch the events from the event stream table. We made this mechanism
so we
     can let the database do a lot of the filtering process. After all,
it's its
     job and it's supposed to perform well. To be precise, Clement has even
     created an Abstract Syntax Tree (AST) so it's easier to inject some
content
     in the query and it creates an abstraction over the SQL language so we
can
     even consider to change the storage of the event stream someday.

     The bad thing is that some complex filtering are difficult to write
with
     the SQL language (event with the AST) or even impossible.

  2/ Post-filtering: after the events have been fetched from the database,
each
     filter can still decide to keep or filter them. This is useful for
     complex filtering that cannot be expressed with the SQL language. It
is
     also needed by the real-time notification email sender, because it
takes
     the events immediately when they occurs without fetching them in the
     database (so SQL filters are bypassed).

     The bad thing is that some events are loaded in the memory to finally
be
     rejected, and these filters can perform costly operations such as
loading
     documents.

Until now, this double mechanism was working quite well, with each
mechanism
filling the lacks of the other.

However, we still have technical limitations in our design:

  1/ Users who have a lot of filter preferences can end up with a giant SQL
     query that is almost impossible to perform by the database. Actually
we had
     a user complaining about an OutOfMemory problem in the HQL to SQL
     translator !

  2/ I cannot implement the tag filter !

The tag filter is supposed to show events that concern pages that hold a
given
tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
don't
want to receive notifications about wiki A except for pages marked with the
tag
T".

And it is not working. First because it is difficult to write a SQL query
for
that. It requires to make a join clause with the document and the object
tables,
which our SQL injection mechanism does not support. Even if it were
possible,
creating a SQL join with the document table will de-facto filter events
that do
not concern any pages or pages that do not have any objects: so many other
filters will be broken. I also don't consider creating a SQL subquery, I
think
the whole query would became too big. So I decided to just not inject any
SQL
code for this filter and only implement the post-filtering mechanism.

But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
"WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched from
the
database. Consequence: the tag filter never see the events that it is
supposed
to keep. It would be actually possible to by-pass the first SQL injections
by
injecting something like "OR 1=1". But doing something like this is like
dropping the all SQL injections mechanism.

I see some solutions for this problem:

  A/ For each tag, create a permanent list of pages that hold it. So I can
     inject "OR document IN (that_list)". I think this is heavy.

  B/ Drop the SQL injection mechanism and only rely on the post-filtering
     mechanism. It would require to load from the database A LOT of events,
     but maybe we could cache this.

  C/ Don't drop the SQL injection mechanism completely but use it as little
as
     possible (for example, do not use it for LOCATION filtering). Seems
hard to
     determine when a filter should use this feature or not.

  D/ Don't implement the "tags" filter, since it is the root of the issue.
But
     it is like sweeping dirt under the carpet!

Since we have the OutOfMemory problem with the SQL injections becoming too
huge,
I am more in favor of solution B or C. But I'm not sure for now, since I do
not
know how much it would impact the performances and the scalability of the
whole
notifications feature.

This is a complex topic, but I hope this message will inspire you some
suggestions or things I have not seen with my own eyes.

Thanks for your help,

Guillaume
Reply | Threaded
Open this post in threaded view
|

Re: [Brainstorm] Notifications filters capabilities and performances.

Guillaume Delhumeau
For the tags filter, I can also:

* perform a query to fetch all documents that correspond to the given tags
(it doesn't need to be permanent, but it would be better to cache it)
* add a HQL filter on these pages (OR document IN :list_of_pages).

It's a little variation of solution A. It's ugly but it could do the job.

2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
[hidden email]>:

> Hi developers.
>
> I am trying to add a new filter to the notifications to be able to follow
> pages
> that are marked with a given tag. And it leads me to some questions about
> the
> technical implementation of the notifications.
>
> To remind the context: notifications are computed on top of the events
> recorded
> by the event stream (a.k.a activity stream). We take events from the event
> stream SQL table, we apply some transformations on them, and we display
> them to
> the user.
>
> Then we have implemented the ability to filter on these events: for
> example
> "don't show events concerning the document A nor the wiki B". Filters are
> implemented with 2 distinct ways:
>
>   1/ SQL injections: each filter can add SQL elements in the query we make
> to
>      fetch the events from the event stream table. We made this mechanism
> so we
>      can let the database do a lot of the filtering process. After all,
> it's its
>      job and it's supposed to perform well. To be precise, Clement has
> even
>      created an Abstract Syntax Tree (AST) so it's easier to inject some
> content
>      in the query and it creates an abstraction over the SQL language so
> we can
>      even consider to change the storage of the event stream someday.
>
>      The bad thing is that some complex filtering are difficult to write
> with
>      the SQL language (event with the AST) or even impossible.
>
>   2/ Post-filtering: after the events have been fetched from the database,
> each
>      filter can still decide to keep or filter them. This is useful for
>      complex filtering that cannot be expressed with the SQL language. It
> is
>      also needed by the real-time notification email sender, because it
> takes
>      the events immediately when they occurs without fetching them in the
>      database (so SQL filters are bypassed).
>
>      The bad thing is that some events are loaded in the memory to finally
> be
>      rejected, and these filters can perform costly operations such as
> loading
>      documents.
>
> Until now, this double mechanism was working quite well, with each
> mechanism
> filling the lacks of the other.
>
> However, we still have technical limitations in our design:
>
>   1/ Users who have a lot of filter preferences can end up with a giant
> SQL
>      query that is almost impossible to perform by the database. Actually
> we had
>      a user complaining about an OutOfMemory problem in the HQL to SQL
>      translator !
>
>   2/ I cannot implement the tag filter !
>
> The tag filter is supposed to show events that concern pages that hold a
> given
> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
> don't
> want to receive notifications about wiki A except for pages marked with
> the tag
> T".
>
> And it is not working. First because it is difficult to write a SQL query
> for
> that. It requires to make a join clause with the document and the object
> tables,
> which our SQL injection mechanism does not support. Even if it were
> possible,
> creating a SQL join with the document table will de-facto filter events
> that do
> not concern any pages or pages that do not have any objects: so many other
> filters will be broken. I also don't consider creating a SQL subquery, I
> think
> the whole query would became too big. So I decided to just not inject any
> SQL
> code for this filter and only implement the post-filtering mechanism.
>
> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
> from the
> database. Consequence: the tag filter never see the events that it is
> supposed
> to keep. It would be actually possible to by-pass the first SQL injections
> by
> injecting something like "OR 1=1". But doing something like this is like
> dropping the all SQL injections mechanism.
>
> I see some solutions for this problem:
>
>   A/ For each tag, create a permanent list of pages that hold it. So I can
>      inject "OR document IN (that_list)". I think this is heavy.
>
>   B/ Drop the SQL injection mechanism and only rely on the post-filtering
>      mechanism. It would require to load from the database A LOT of events,
>      but maybe we could cache this.
>
>   C/ Don't drop the SQL injection mechanism completely but use it as
> little as
>      possible (for example, do not use it for LOCATION filtering). Seems
> hard to
>      determine when a filter should use this feature or not.
>
>   D/ Don't implement the "tags" filter, since it is the root of the issue.
> But
>      it is like sweeping dirt under the carpet!
>
> Since we have the OutOfMemory problem with the SQL injections becoming too
> huge,
> I am more in favor of solution B or C. But I'm not sure for now, since I
> do not
> know how much it would impact the performances and the scalability of the
> whole
> notifications feature.
>
> This is a complex topic, but I hope this message will inspire you some
> suggestions or things I have not seen with my own eyes.
>
> Thanks for your help,
>
> Guillaume
>



--
Guillaume Delhumeau ([hidden email])
Research & Development Engineer at XWiki SAS
Committer on the XWiki.org project
Reply | Threaded
Open this post in threaded view
|

Re: [Brainstorm] Notifications filters capabilities and performances.

Sergiu Dumitriu-3
The two main problems that I see are that you're mixing two separate things:

1. Tags which are stored in one place, and events which are stored in a
different place
2. Again tags, and document locations. They may seem similar, but they
are completely different as they are implemented. That's why saying "not
in this wiki, except for this tag" cannot ever be implemented in a sane
way since the exception is on a different plane.

The logical approach would be to also store the tags in the events, but
this is not a good thing to do. Where do we stop? For future-proofing
filters, we would have to store the entire document in the event, in
case someone wants to add a filter on "documents with a specific type of
object attached", or "documents with a specific value for a specific
property of a specific object", or "documents with attachments ending in
.extension".

On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> For the tags filter, I can also:
>
> * perform a query to fetch all documents that correspond to the given tags
> (it doesn't need to be permanent, but it would be better to cache it)
> * add a HQL filter on these pages (OR document IN :list_of_pages).
>
> It's a little variation of solution A. It's ugly but it could do the job.

It's the sanest proposal so far, but still impossible.

Since "not in LOCATION except WITH TAG" doesn't work because it mixes
two types of information in the same query fragment, how about making
this a one-dimensional query as "only WITH TAG". And since this
information is in a different place, pre-querying it is needed. However,
if there are thousands or tens/hundreds of thousands of documents with a
tag, won't the query crash anyway?

> 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
> [hidden email]>:
>
>> Hi developers.
>>
>> I am trying to add a new filter to the notifications to be able to follow
>> pages
>> that are marked with a given tag. And it leads me to some questions about
>> the
>> technical implementation of the notifications.
>>
>> To remind the context: notifications are computed on top of the events
>> recorded
>> by the event stream (a.k.a activity stream). We take events from the event
>> stream SQL table, we apply some transformations on them, and we display
>> them to
>> the user.
>>
>> Then we have implemented the ability to filter on these events: for
>> example
>> "don't show events concerning the document A nor the wiki B". Filters are
>> implemented with 2 distinct ways:
>>
>>   1/ SQL injections: each filter can add SQL elements in the query we make
>> to
>>      fetch the events from the event stream table. We made this mechanism
>> so we
>>      can let the database do a lot of the filtering process. After all,
>> it's its
>>      job and it's supposed to perform well. To be precise, Clement has
>> even
>>      created an Abstract Syntax Tree (AST) so it's easier to inject some
>> content
>>      in the query and it creates an abstraction over the SQL language so
>> we can
>>      even consider to change the storage of the event stream someday.
>>
>>      The bad thing is that some complex filtering are difficult to write
>> with
>>      the SQL language (event with the AST) or even impossible.
>>
>>   2/ Post-filtering: after the events have been fetched from the database,
>> each
>>      filter can still decide to keep or filter them. This is useful for
>>      complex filtering that cannot be expressed with the SQL language. It
>> is
>>      also needed by the real-time notification email sender, because it
>> takes
>>      the events immediately when they occurs without fetching them in the
>>      database (so SQL filters are bypassed).
>>
>>      The bad thing is that some events are loaded in the memory to finally
>> be
>>      rejected, and these filters can perform costly operations such as
>> loading
>>      documents.
>>
>> Until now, this double mechanism was working quite well, with each
>> mechanism
>> filling the lacks of the other.
>>
>> However, we still have technical limitations in our design:
>>
>>   1/ Users who have a lot of filter preferences can end up with a giant
>> SQL
>>      query that is almost impossible to perform by the database. Actually
>> we had
>>      a user complaining about an OutOfMemory problem in the HQL to SQL
>>      translator !
>>
>>   2/ I cannot implement the tag filter !
>>
>> The tag filter is supposed to show events that concern pages that hold a
>> given
>> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
>> don't
>> want to receive notifications about wiki A except for pages marked with
>> the tag
>> T".
>>
>> And it is not working. First because it is difficult to write a SQL query
>> for
>> that. It requires to make a join clause with the document and the object
>> tables,
>> which our SQL injection mechanism does not support. Even if it were
>> possible,
>> creating a SQL join with the document table will de-facto filter events
>> that do
>> not concern any pages or pages that do not have any objects: so many other
>> filters will be broken. I also don't consider creating a SQL subquery, I
>> think
>> the whole query would became too big. So I decided to just not inject any
>> SQL
>> code for this filter and only implement the post-filtering mechanism.
>>
>> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
>> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
>> from the
>> database. Consequence: the tag filter never see the events that it is
>> supposed
>> to keep. It would be actually possible to by-pass the first SQL injections
>> by
>> injecting something like "OR 1=1". But doing something like this is like
>> dropping the all SQL injections mechanism.
>>
>> I see some solutions for this problem:
>>
>>   A/ For each tag, create a permanent list of pages that hold it. So I can
>>      inject "OR document IN (that_list)". I think this is heavy.

Definitely not. You'll have to maintain this list in parallel to the
canonical storage for tags, the xwikidoc table itself, with all the
headaches that data duplication brings.

>>   B/ Drop the SQL injection mechanism and only rely on the post-filtering
>>      mechanism. It would require to load from the database A LOT of events,
>>      but maybe we could cache this.

Nope, filtering millions of events by inspecting them one by one is a
performance nightmare.

>>   C/ Don't drop the SQL injection mechanism completely but use it as
>> little as
>>      possible (for example, do not use it for LOCATION filtering). Seems
>> hard to
>>      determine when a filter should use this feature or not.

I think location filtering should still happen in the query itself,
since it is a lot faster.

>>   D/ Don't implement the "tags" filter, since it is the root of the issue.
>> But
>>      it is like sweeping dirt under the carpet!

Well, not implementing something is always the fastest, least intrusive
solution to any problem...

Humor aside, I think that in this case it is the best approach.

>> Since we have the OutOfMemory problem with the SQL injections becoming too
>> huge,
>> I am more in favor of solution B or C. But I'm not sure for now, since I
>> do not
>> know how much it would impact the performances and the scalability of the
>> whole
>> notifications feature.
>>
>> This is a complex topic, but I hope this message will inspire you some
>> suggestions or things I have not seen with my own eyes.
>>
>> Thanks for your help,
>>
>> Guillaume
>>
>
>
>
Other options:

E/ Drop SQL altogether, move the event stream into a nosql database, and
do stream filtering; kind of like the B proposal

F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
a subscriber is created for every user and his filters, and matching
events are placed in a queue until they are all sent (consumed) in a
notification. Obviously, this only works for email notifications, not
for browsing past events in a webpage.

--
Sergiu Dumitriu
http://purl.org/net/sergiu/
Reply | Threaded
Open this post in threaded view
|

Re: [Brainstorm] Notifications filters capabilities and performances.

caubin
Hi,

With the complexity of the additive filtering system, it's kind of
normal to see OOM errors. AFAIR we already talked about that a year ago,
when looking at how we could implement this feature.

On 06/29/2018 05:41 AM, Sergiu Dumitriu wrote:

> The two main problems that I see are that you're mixing two separate things:
>
> 1. Tags which are stored in one place, and events which are stored in a
> different place
> 2. Again tags, and document locations. They may seem similar, but they
> are completely different as they are implemented. That's why saying "not
> in this wiki, except for this tag" cannot ever be implemented in a sane
> way since the exception is on a different plane.
>
> The logical approach would be to also store the tags in the events, but
> this is not a good thing to do. Where do we stop? For future-proofing
> filters, we would have to store the entire document in the event, in
> case someone wants to add a filter on "documents with a specific type of
> object attached", or "documents with a specific value for a specific
> property of a specific object", or "documents with attachments ending in
> .extension".
>
> On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
>> For the tags filter, I can also:
>>
>> * perform a query to fetch all documents that correspond to the given tags
>> (it doesn't need to be permanent, but it would be better to cache it)
>> * add a HQL filter on these pages (OR document IN :list_of_pages).
>>
>> It's a little variation of solution A. It's ugly but it could do the job.
>
> It's the sanest proposal so far, but still impossible.
>
> Since "not in LOCATION except WITH TAG" doesn't work because it mixes
> two types of information in the same query fragment, how about making
> this a one-dimensional query as "only WITH TAG". And since this
> information is in a different place, pre-querying it is needed. However,
> if there are thousands or tens/hundreds of thousands of documents with a
> tag, won't the query crash anyway?
>
>> 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
>> [hidden email]>:
>>
>>> Hi developers.
>>>
>>> I am trying to add a new filter to the notifications to be able to follow
>>> pages
>>> that are marked with a given tag. And it leads me to some questions about
>>> the
>>> technical implementation of the notifications.
>>>
>>> To remind the context: notifications are computed on top of the events
>>> recorded
>>> by the event stream (a.k.a activity stream). We take events from the event
>>> stream SQL table, we apply some transformations on them, and we display
>>> them to
>>> the user.
>>>
>>> Then we have implemented the ability to filter on these events: for
>>> example
>>> "don't show events concerning the document A nor the wiki B". Filters are
>>> implemented with 2 distinct ways:
>>>
>>>   1/ SQL injections: each filter can add SQL elements in the query we make
>>> to
>>>      fetch the events from the event stream table. We made this mechanism
>>> so we
>>>      can let the database do a lot of the filtering process. After all,
>>> it's its
>>>      job and it's supposed to perform well. To be precise, Clement has
>>> even
>>>      created an Abstract Syntax Tree (AST) so it's easier to inject some
>>> content
>>>      in the query and it creates an abstraction over the SQL language so
>>> we can
>>>      even consider to change the storage of the event stream someday.
>>>
>>>      The bad thing is that some complex filtering are difficult to write
>>> with
>>>      the SQL language (event with the AST) or even impossible.
>>>
>>>   2/ Post-filtering: after the events have been fetched from the database,
>>> each
>>>      filter can still decide to keep or filter them. This is useful for
>>>      complex filtering that cannot be expressed with the SQL language. It
>>> is
>>>      also needed by the real-time notification email sender, because it
>>> takes
>>>      the events immediately when they occurs without fetching them in the
>>>      database (so SQL filters are bypassed).
>>>
>>>      The bad thing is that some events are loaded in the memory to finally
>>> be
>>>      rejected, and these filters can perform costly operations such as
>>> loading
>>>      documents.
>>>
>>> Until now, this double mechanism was working quite well, with each
>>> mechanism
>>> filling the lacks of the other.
>>>
>>> However, we still have technical limitations in our design:
>>>
>>>   1/ Users who have a lot of filter preferences can end up with a giant
>>> SQL
>>>      query that is almost impossible to perform by the database. Actually
>>> we had
>>>      a user complaining about an OutOfMemory problem in the HQL to SQL
>>>      translator !
>>>
>>>   2/ I cannot implement the tag filter !
>>>
>>> The tag filter is supposed to show events that concern pages that hold a
>>> given
>>> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
>>> don't
>>> want to receive notifications about wiki A except for pages marked with
>>> the tag
>>> T".
>>>
>>> And it is not working. First because it is difficult to write a SQL query
>>> for
>>> that. It requires to make a join clause with the document and the object
>>> tables,
>>> which our SQL injection mechanism does not support. Even if it were
>>> possible,
>>> creating a SQL join with the document table will de-facto filter events
>>> that do
>>> not concern any pages or pages that do not have any objects: so many other
>>> filters will be broken. I also don't consider creating a SQL subquery, I
>>> think
>>> the whole query would became too big. So I decided to just not inject any
>>> SQL
>>> code for this filter and only implement the post-filtering mechanism.
>>>
>>> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
>>> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
>>> from the
>>> database. Consequence: the tag filter never see the events that it is
>>> supposed
>>> to keep. It would be actually possible to by-pass the first SQL injections
>>> by
>>> injecting something like "OR 1=1". But doing something like this is like
>>> dropping the all SQL injections mechanism.
>>>
>>> I see some solutions for this problem:
>>>
>>>   A/ For each tag, create a permanent list of pages that hold it. So I can
>>>      inject "OR document IN (that_list)". I think this is heavy.
>
> Definitely not. You'll have to maintain this list in parallel to the
> canonical storage for tags, the xwikidoc table itself, with all the
> headaches that data duplication brings.
>
>>>   B/ Drop the SQL injection mechanism and only rely on the post-filtering
>>>      mechanism. It would require to load from the database A LOT of events,
>>>      but maybe we could cache this.
>
> Nope, filtering millions of events by inspecting them one by one is a
> performance nightmare.
>
>>>   C/ Don't drop the SQL injection mechanism completely but use it as
>>> little as
>>>      possible (for example, do not use it for LOCATION filtering). Seems
>>> hard to
>>>      determine when a filter should use this feature or not.
>
> I think location filtering should still happen in the query itself,
> since it is a lot faster.
>
>>>   D/ Don't implement the "tags" filter, since it is the root of the issue.
>>> But
>>>      it is like sweeping dirt under the carpet!
>
> Well, not implementing something is always the fastest, least intrusive
> solution to any problem...
>
> Humor aside, I think that in this case it is the best approach.
>
>>> Since we have the OutOfMemory problem with the SQL injections becoming too
>>> huge,
>>> I am more in favor of solution B or C. But I'm not sure for now, since I
>>> do not
>>> know how much it would impact the performances and the scalability of the
>>> whole
>>> notifications feature.
>>>
>>> This is a complex topic, but I hope this message will inspire you some
>>> suggestions or things I have not seen with my own eyes.
>>>
>>> Thanks for your help,
>>>
>>> Guillaume
>>>
>>
>>
>>
> Other options:
>
> E/ Drop SQL altogether, move the event stream into a nosql database, and
> do stream filtering; kind of like the B proposal
>
> F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
> a subscriber is created for every user and his filters, and matching
> events are placed in a queue until they are all sent (consumed) in a
> notification. Obviously, this only works for email notifications, not
> for browsing past events in a webpage.
>

Note that the notification filtering system isn't dependent from SQL:
filters implement their conditions through an AST that gets translated
to SQL by the default manager in charge of retrieving the notifications.
This means that switching from SQL to something else shouldn't break
backward compat of the notification filters already developed.

@Sergiu: (also AFAIR) option E was considered but not implemented as it
would have taken a lot of time to do ; still a nice idea IMO.

A mix of the options E and F could be the nicest IMO : adding a NoSQL DB
between the event stream and the UI so that notifications are picked up
from the ES, filtered and put into the NoSQL DB in a separate thread.
This is AFAIK the most widely used solution for important platforms.

However, this solution needs to be wisely considered it could increase
the complexity of an XWiki install and / or significantly increase the
size of the distribution artifacts that we produce. (Also, it takes time
to implement OFC).

Thanks,
Clément
Reply | Threaded
Open this post in threaded view
|

Re: [Brainstorm] Notifications filters capabilities and performances.

Guillaume Delhumeau
Hi Clément,

2018-06-29 13:34 GMT+02:00 Clément Aubin <[hidden email]>:

> Hi,
>
> With the complexity of the additive filtering system, it's kind of
> normal to see OOM errors. AFAIR we already talked about that a year ago,
> when looking at how we could implement this feature.
>

We could try to improve a bit the situation by "optimizing" the filters.
For example, we can currently end up with filters like these:
- filter events about SpaceA.PageB
- filter events about SpaceA.PageC
- filter events about SpaceA.

Obviously the 2 first filters can be removed since the third one is already
doing their job. I've tried to minimize the number of unnecessary filters
by not creating a new one if the page is already concerned by a wider
filter, but it won't fix existing filters.

An other example is the watch list bridge that simply convert the list of
watched pages as they were in the previous version of XWiki and translate
them to the same number of filters. This may explain many of the OOM
problems. It is even enforced by the fact we had (and we still have) the
"autowatch" mode enabled by default.

So a filter optimizer could be a great help.


>
> On 06/29/2018 05:41 AM, Sergiu Dumitriu wrote:
> > The two main problems that I see are that you're mixing two separate
> things:
> >
> > 1. Tags which are stored in one place, and events which are stored in a
> > different place
> > 2. Again tags, and document locations. They may seem similar, but they
> > are completely different as they are implemented. That's why saying "not
> > in this wiki, except for this tag" cannot ever be implemented in a sane
> > way since the exception is on a different plane.
> >
> > The logical approach would be to also store the tags in the events, but
> > this is not a good thing to do. Where do we stop? For future-proofing
> > filters, we would have to store the entire document in the event, in
> > case someone wants to add a filter on "documents with a specific type of
> > object attached", or "documents with a specific value for a specific
> > property of a specific object", or "documents with attachments ending in
> > .extension".
> >
> > On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> >> For the tags filter, I can also:
> >>
> >> * perform a query to fetch all documents that correspond to the given
> tags
> >> (it doesn't need to be permanent, but it would be better to cache it)
> >> * add a HQL filter on these pages (OR document IN :list_of_pages).
> >>
> >> It's a little variation of solution A. It's ugly but it could do the
> job.
> >
> > It's the sanest proposal so far, but still impossible.
> >
> > Since "not in LOCATION except WITH TAG" doesn't work because it mixes
> > two types of information in the same query fragment, how about making
> > this a one-dimensional query as "only WITH TAG". And since this
> > information is in a different place, pre-querying it is needed. However,
> > if there are thousands or tens/hundreds of thousands of documents with a
> > tag, won't the query crash anyway?
> >
> >> 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
> >> [hidden email]>:
> >>
> >>> Hi developers.
> >>>
> >>> I am trying to add a new filter to the notifications to be able to
> follow
> >>> pages
> >>> that are marked with a given tag. And it leads me to some questions
> about
> >>> the
> >>> technical implementation of the notifications.
> >>>
> >>> To remind the context: notifications are computed on top of the events
> >>> recorded
> >>> by the event stream (a.k.a activity stream). We take events from the
> event
> >>> stream SQL table, we apply some transformations on them, and we display
> >>> them to
> >>> the user.
> >>>
> >>> Then we have implemented the ability to filter on these events: for
> >>> example
> >>> "don't show events concerning the document A nor the wiki B". Filters
> are
> >>> implemented with 2 distinct ways:
> >>>
> >>>   1/ SQL injections: each filter can add SQL elements in the query we
> make
> >>> to
> >>>      fetch the events from the event stream table. We made this
> mechanism
> >>> so we
> >>>      can let the database do a lot of the filtering process. After all,
> >>> it's its
> >>>      job and it's supposed to perform well. To be precise, Clement has
> >>> even
> >>>      created an Abstract Syntax Tree (AST) so it's easier to inject
> some
> >>> content
> >>>      in the query and it creates an abstraction over the SQL language
> so
> >>> we can
> >>>      even consider to change the storage of the event stream someday.
> >>>
> >>>      The bad thing is that some complex filtering are difficult to
> write
> >>> with
> >>>      the SQL language (event with the AST) or even impossible.
> >>>
> >>>   2/ Post-filtering: after the events have been fetched from the
> database,
> >>> each
> >>>      filter can still decide to keep or filter them. This is useful for
> >>>      complex filtering that cannot be expressed with the SQL language.
> It
> >>> is
> >>>      also needed by the real-time notification email sender, because it
> >>> takes
> >>>      the events immediately when they occurs without fetching them in
> the
> >>>      database (so SQL filters are bypassed).
> >>>
> >>>      The bad thing is that some events are loaded in the memory to
> finally
> >>> be
> >>>      rejected, and these filters can perform costly operations such as
> >>> loading
> >>>      documents.
> >>>
> >>> Until now, this double mechanism was working quite well, with each
> >>> mechanism
> >>> filling the lacks of the other.
> >>>
> >>> However, we still have technical limitations in our design:
> >>>
> >>>   1/ Users who have a lot of filter preferences can end up with a giant
> >>> SQL
> >>>      query that is almost impossible to perform by the database.
> Actually
> >>> we had
> >>>      a user complaining about an OutOfMemory problem in the HQL to SQL
> >>>      translator !
> >>>
> >>>   2/ I cannot implement the tag filter !
> >>>
> >>> The tag filter is supposed to show events that concern pages that hold
> a
> >>> given
> >>> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
> >>> don't
> >>> want to receive notifications about wiki A except for pages marked with
> >>> the tag
> >>> T".
> >>>
> >>> And it is not working. First because it is difficult to write a SQL
> query
> >>> for
> >>> that. It requires to make a join clause with the document and the
> object
> >>> tables,
> >>> which our SQL injection mechanism does not support. Even if it were
> >>> possible,
> >>> creating a SQL join with the document table will de-facto filter events
> >>> that do
> >>> not concern any pages or pages that do not have any objects: so many
> other
> >>> filters will be broken. I also don't consider creating a SQL subquery,
> I
> >>> think
> >>> the whole query would became too big. So I decided to just not inject
> any
> >>> SQL
> >>> code for this filter and only implement the post-filtering mechanism.
> >>>
> >>> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
> >>> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
> >>> from the
> >>> database. Consequence: the tag filter never see the events that it is
> >>> supposed
> >>> to keep. It would be actually possible to by-pass the first SQL
> injections
> >>> by
> >>> injecting something like "OR 1=1". But doing something like this is
> like
> >>> dropping the all SQL injections mechanism.
> >>>
> >>> I see some solutions for this problem:
> >>>
> >>>   A/ For each tag, create a permanent list of pages that hold it. So I
> can
> >>>      inject "OR document IN (that_list)". I think this is heavy.
> >
> > Definitely not. You'll have to maintain this list in parallel to the
> > canonical storage for tags, the xwikidoc table itself, with all the
> > headaches that data duplication brings.
> >
> >>>   B/ Drop the SQL injection mechanism and only rely on the
> post-filtering
> >>>      mechanism. It would require to load from the database A LOT of
> events,
> >>>      but maybe we could cache this.
> >
> > Nope, filtering millions of events by inspecting them one by one is a
> > performance nightmare.
> >
> >>>   C/ Don't drop the SQL injection mechanism completely but use it as
> >>> little as
> >>>      possible (for example, do not use it for LOCATION filtering).
> Seems
> >>> hard to
> >>>      determine when a filter should use this feature or not.
> >
> > I think location filtering should still happen in the query itself,
> > since it is a lot faster.
> >
> >>>   D/ Don't implement the "tags" filter, since it is the root of the
> issue.
> >>> But
> >>>      it is like sweeping dirt under the carpet!
> >
> > Well, not implementing something is always the fastest, least intrusive
> > solution to any problem...
> >
> > Humor aside, I think that in this case it is the best approach.
> >
> >>> Since we have the OutOfMemory problem with the SQL injections becoming
> too
> >>> huge,
> >>> I am more in favor of solution B or C. But I'm not sure for now, since
> I
> >>> do not
> >>> know how much it would impact the performances and the scalability of
> the
> >>> whole
> >>> notifications feature.
> >>>
> >>> This is a complex topic, but I hope this message will inspire you some
> >>> suggestions or things I have not seen with my own eyes.
> >>>
> >>> Thanks for your help,
> >>>
> >>> Guillaume
> >>>
> >>
> >>
> >>
> > Other options:
> >
> > E/ Drop SQL altogether, move the event stream into a nosql database, and
> > do stream filtering; kind of like the B proposal
> >
> > F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
> > a subscriber is created for every user and his filters, and matching
> > events are placed in a queue until they are all sent (consumed) in a
> > notification. Obviously, this only works for email notifications, not
> > for browsing past events in a webpage.
> >
>
> Note that the notification filtering system isn't dependent from SQL:
> filters implement their conditions through an AST that gets translated
> to SQL by the default manager in charge of retrieving the notifications.
> This means that switching from SQL to something else shouldn't break
> backward compat of the notification filters already developed.
>

Might be harder than expected in practice, but it's indeed a great thing
that we have this AST.


>
> @Sergiu: (also AFAIR) option E was considered but not implemented as it
> would have taken a lot of time to do ; still a nice idea IMO.
>
> A mix of the options E and F could be the nicest IMO : adding a NoSQL DB
> between the event stream and the UI so that notifications are picked up
> from the ES, filtered and put into the NoSQL DB in a separate thread.
> This is AFAIK the most widely used solution for important platforms.
>
> However, this solution needs to be wisely considered it could increase
> the complexity of an XWiki install and / or significantly increase the
> size of the distribution artifacts that we produce. (Also, it takes time
> to implement OFC).
>

Notifications is a complex "science" that would require almost the same
consideration than the search engine :)


>
> Thanks,
> Clément
>

Thanks for your help.

Note that we still have no answer for this problem.

The ugly solution "getting all pages that are marked with a given tag
before generating the query for notifications" is already in place in AS,
so at least we won't do worse !

Guillaume
Reply | Threaded
Open this post in threaded view
|

Fwd: [Brainstorm] Notifications filters capabilities and performances.

Guillaume Delhumeau
In reply to this post by Sergiu Dumitriu-3
I resend this message that was accidentally sent to Sergiu only.

---------- Forwarded message ----------
From: Guillaume Delhumeau <[hidden email]>
Date: 2018-07-02 16:06 GMT+02:00
Subject: Re: [xwiki-devs] [Brainstorm] Notifications filters capabilities
and performances.
To: Sergiu Dumitriu <[hidden email]>


Hi and thanks you for your answers.

2018-06-29 5:41 GMT+02:00 Sergiu Dumitriu <[hidden email]>:

> The two main problems that I see are that you're mixing two separate
> things:
>
> 1. Tags which are stored in one place, and events which are stored in a
> different place
> 2. Again tags, and document locations. They may seem similar, but they
> are completely different as they are implemented. That's why saying "not
> in this wiki, except for this tag" cannot ever be implemented in a sane
> way since the exception is on a different plane.
>
> The logical approach would be to also store the tags in the events, but
> this is not a good thing to do.


Also, a tag could be added to a page AFTER the event has been triggered. In
this case, if you still want to fetch these pages, it won't work.



> Where do we stop? For future-proofing
> filters, we would have to store the entire document in the event, in
> case someone wants to add a filter on "documents with a specific type of
> object attached", or "documents with a specific value for a specific
> property of a specific object", or "documents with attachments ending in
> .extension".
>
> On 06/28/2018 04:44 PM, Guillaume Delhumeau wrote:
> > For the tags filter, I can also:
> >
> > * perform a query to fetch all documents that correspond to the given
> tags
> > (it doesn't need to be permanent, but it would be better to cache it)
> > * add a HQL filter on these pages (OR document IN :list_of_pages).
> >
> > It's a little variation of solution A. It's ugly but it could do the job.
>
> It's the sanest proposal so far, but still impossible.
>

Apparently it's the way it's done on Activity Stream:
https://github.com/xwiki/xwiki-platform/blob/553afe7287dcc4ce8b588276e639bf
283352e805/xwiki-platform-core/xwiki-platform-activitystream/xwiki-platform-
activitystream-ui/src/main/resources/Main/Activity.xml#L1008


>
> Since "not in LOCATION except WITH TAG" doesn't work because it mixes
> two types of information in the same query fragment, how about making
> this a one-dimensional query as "only WITH TAG". And since this
> information is in a different place, pre-querying it is needed. However,
> if there are thousands or tens/hundreds of thousands of documents with a
> tag, won't the query crash anyway?
>

Sure, it won't scale.


>
> > 2018-06-27 12:58 GMT+02:00 Guillaume Delhumeau <
> > [hidden email]>:
> >
> >> Hi developers.
> >>
> >> I am trying to add a new filter to the notifications to be able to
> follow
> >> pages
> >> that are marked with a given tag. And it leads me to some questions
> about
> >> the
> >> technical implementation of the notifications.
> >>
> >> To remind the context: notifications are computed on top of the events
> >> recorded
> >> by the event stream (a.k.a activity stream). We take events from the
> event
> >> stream SQL table, we apply some transformations on them, and we display
> >> them to
> >> the user.
> >>
> >> Then we have implemented the ability to filter on these events: for
> >> example
> >> "don't show events concerning the document A nor the wiki B". Filters
> are
> >> implemented with 2 distinct ways:
> >>
> >>   1/ SQL injections: each filter can add SQL elements in the query we
> make
> >> to
> >>      fetch the events from the event stream table. We made this
> mechanism
> >> so we
> >>      can let the database do a lot of the filtering process. After all,
> >> it's its
> >>      job and it's supposed to perform well. To be precise, Clement has
> >> even
> >>      created an Abstract Syntax Tree (AST) so it's easier to inject some
> >> content
> >>      in the query and it creates an abstraction over the SQL language so
> >> we can
> >>      even consider to change the storage of the event stream someday.
> >>
> >>      The bad thing is that some complex filtering are difficult to write
> >> with
> >>      the SQL language (event with the AST) or even impossible.
> >>
> >>   2/ Post-filtering: after the events have been fetched from the
> database,
> >> each
> >>      filter can still decide to keep or filter them. This is useful for
> >>      complex filtering that cannot be expressed with the SQL language.
> It
> >> is
> >>      also needed by the real-time notification email sender, because it
> >> takes
> >>      the events immediately when they occurs without fetching them in
> the
> >>      database (so SQL filters are bypassed).
> >>
> >>      The bad thing is that some events are loaded in the memory to
> finally
> >> be
> >>      rejected, and these filters can perform costly operations such as
> >> loading
> >>      documents.
> >>
> >> Until now, this double mechanism was working quite well, with each
> >> mechanism
> >> filling the lacks of the other.
> >>
> >> However, we still have technical limitations in our design:
> >>
> >>   1/ Users who have a lot of filter preferences can end up with a giant
> >> SQL
> >>      query that is almost impossible to perform by the database.
> Actually
> >> we had
> >>      a user complaining about an OutOfMemory problem in the HQL to SQL
> >>      translator !
> >>
> >>   2/ I cannot implement the tag filter !
> >>
> >> The tag filter is supposed to show events that concern pages that hold a
> >> given
> >> tag, EVEN IF THE PAGE WAS EXCLUDED BY THE USER. Example of use-case: "I
> >> don't
> >> want to receive notifications about wiki A except for pages marked with
> >> the tag
> >> T".
> >>
> >> And it is not working. First because it is difficult to write a SQL
> query
> >> for
> >> that. It requires to make a join clause with the document and the object
> >> tables,
> >> which our SQL injection mechanism does not support. Even if it were
> >> possible,
> >> creating a SQL join with the document table will de-facto filter events
> >> that do
> >> not concern any pages or pages that do not have any objects: so many
> other
> >> filters will be broken. I also don't consider creating a SQL subquery, I
> >> think
> >> the whole query would became too big. So I decided to just not inject
> any
> >> SQL
> >> code for this filter and only implement the post-filtering mechanism.
> >>
> >> But the other filter "EXCLUDE WIKI A" generates a SQL injection such as
> >> "WIKI <> 'WIKI A'" so the events concerning the wiki A are not fetched
> >> from the
> >> database. Consequence: the tag filter never see the events that it is
> >> supposed
> >> to keep. It would be actually possible to by-pass the first SQL
> injections
> >> by
> >> injecting something like "OR 1=1". But doing something like this is like
> >> dropping the all SQL injections mechanism.
> >>
> >> I see some solutions for this problem:
> >>
> >>   A/ For each tag, create a permanent list of pages that hold it. So I
> can
> >>      inject "OR document IN (that_list)". I think this is heavy.
>
> Definitely not. You'll have to maintain this list in parallel to the
> canonical storage for tags, the xwikidoc table itself, with all the
> headaches that data duplication brings.
>

I agree.


>
> >>   B/ Drop the SQL injection mechanism and only rely on the
> post-filtering
> >>      mechanism. It would require to load from the database A LOT of
> events,
> >>      but maybe we could cache this.
>
> Nope, filtering millions of events by inspecting them one by one is a
> performance nightmare.
>
> >>   C/ Don't drop the SQL injection mechanism completely but use it as
> >> little as
> >>      possible (for example, do not use it for LOCATION filtering). Seems
> >> hard to
> >>      determine when a filter should use this feature or not.
>
> I think location filtering should still happen in the query itself,
> since it is a lot faster.
>
> >>   D/ Don't implement the "tags" filter, since it is the root of the
> issue.
> >> But
> >>      it is like sweeping dirt under the carpet!
>
> Well, not implementing something is always the fastest, least intrusive
> solution to any problem...
>
> Humor aside, I think that in this case it is the best approach.
>

Thanks for saying it :)


>
> >> Since we have the OutOfMemory problem with the SQL injections becoming
> too
> >> huge,
> >> I am more in favor of solution B or C. But I'm not sure for now, since I
> >> do not
> >> know how much it would impact the performances and the scalability of
> the
> >> whole
> >> notifications feature.
> >>
> >> This is a complex topic, but I hope this message will inspire you some
> >> suggestions or things I have not seen with my own eyes.
> >>
> >> Thanks for your help,
> >>
> >> Guillaume
> >>
> >
> >
> >
> Other options:
>
> E/ Drop SQL altogether, move the event stream into a nosql database, and
> do stream filtering; kind of like the B proposal
>

That's unfortunately out of the scope right now.


>
> F/ Drop SQL and querying altogether, switch to a pub-sub mechanism where
> a subscriber is created for every user and his filters, and matching
> events are placed in a queue until they are all sent (consumed) in a
> notification. Obviously, this only works for email notifications, not
> for browsing past events in a webpage.
>

Indeed it cannot replace Activity Stream. However, it was my first design
attempt:
http://design.xwiki.org/xwiki/bin/view/Proposal/
NotificationCenterforAppsImplementation#HIteration1

But then we decided to rely on what we had (the event stream) and to kill 2
birds with the same stones (having notifications + replacing activity
stream).

Thanks,

Guillaume




>
> --
> Sergiu Dumitriu
> http://purl.org/net/sergiu/
>



--
Guillaume Delhumeau ([hidden email])
Research & Development Engineer at XWiki SAS
Committer on the XWiki.org project



--
Guillaume Delhumeau ([hidden email])
Research & Development Engineer at XWiki SAS
Committer on the XWiki.org project