[STAMP/Test] Metrics we need to improve + strategy

vmassol

Hi devs (and anyone else interested in improving the tests of XWiki),

History
======

It all started when I analyzed our global TPC and found that it was going down, even though we have the fail-build-on-JaCoCo-threshold strategy.

I sent several email threads:

- Loss of TPC: http://markmail.org/message/hqumkdiz7jm76ya6
- TPC evolution: http://markmail.org/message/up2gc2zzbbe4uqgn
- Improve our TPC strategy: http://markmail.org/message/grphwta63pp5p4l7

Note: As a consequence of this last thread, I implemented a Jenkins Pipeline that sends us a mail when the global TPC of an XWiki module goes down, so that we fix it ASAP. This is still a work in progress. A first version is done and running at https://ci.xwiki.org/view/Tools/job/Clover/ but I still need to debug and fix it (it’s not working ATM).
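
For those curious, here’s a minimal sketch of the comparison at the heart of that job (in Java, with hypothetical report file names; the real job is a Jenkins pipeline), assuming Clover’s XML report format where a <metrics> element carries the covered/total counts:

import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class TpcDiff
{
    // TPC as Clover computes it: covered elements / total elements, where
    // elements = conditionals + statements + methods.
    private static double tpc(String report) throws Exception
    {
        Element metrics = (Element) DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(Files.newInputStream(Paths.get(report)))
            .getElementsByTagName("metrics").item(0);
        double covered = attr(metrics, "coveredconditionals") + attr(metrics, "coveredstatements")
            + attr(metrics, "coveredmethods");
        double total = attr(metrics, "conditionals") + attr(metrics, "statements")
            + attr(metrics, "methods");
        return 100.0 * covered / total;
    }

    private static double attr(Element metrics, String name)
    {
        return Double.parseDouble(metrics.getAttribute(name));
    }

    public static void main(String[] args) throws Exception
    {
        double previous = tpc("clover-previous.xml");
        double current = tpc("clover-current.xml");
        if (current < previous) {
            // This is where the pipeline sends the alert mail.
            System.out.printf("Global TPC went down: %.2f%% -> %.2f%%%n", previous, current);
        }
    }
}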

As a result of the global TPC going down/stagnating, I proposed to focus 10.7 on Tests + BFD.
- Initially I proposed to focus on increasing the global TPC by looking at the reports from the threads above (http://markmail.org/message/qjemnip7hjva2rjd). See the last report at https://up1.xwikisas.com/#mJ0loeB6nBrAgYeKA7MGGw (we need to fix the red parts).
- Then, with the STAMP mid-term review, a bigger urgency surfaced and I asked if we could instead focus on fixing tests as reported by Descartes, to increase both coverage and mutation score (i.e. test quality), since those are 2 metrics/KPIs measured by STAMP, and since XWiki participates in STAMP we need to work on them and increase them substantially. See http://markmail.org/message/ejmdkf3hx7drkj52

The results of XWiki 10.7 have been quite poor regarding test improvements (more focus on BFD than on tests, lots of devs on holidays, etc.). This forces us to adopt a different strategy.

Full Strategy proposal
=================

1) As many XWiki SAS devs as possible (and anyone else from the community who’s interested ofc! :)) should spend 1 day per week working on improving STAMP metrics
* Currently the agreement is that Thomas and I will do this for the foreseeable future, until we get some good-enough progress on the metrics
* Some other devs from XWiki SAS will help out for XWiki 10.8 only FTM (Marius, Adel if he can, Simon in the future). The idea is to see where that could get us by using substantial manpower.

2) All committers: More generally, the global TPC alert mentioned above is also already active, and devs need to fix the modules whose global TPC goes down.

3) All committers: Of course, the JaCoCo threshold strategy also remains active at each module level.

STAMP tools
==========

There are 4 tools developed by STAMP:
* Descartes: Improves the quality of tests by measuring and helping increase their mutation score (a made-up example of what it flags is sketched after this list). See http://markmail.org/message/bonb5f7f37omnnog and also https://massol.myxwiki.org/xwiki/bin/view/Blog/MutationTestingDescartes
* DSpot: Automatically generates new tests, based on existing ones. See https://massol.myxwiki.org/xwiki/bin/view/Blog/TestGenerationDspot
* CAMP: Takes a Dockerfile and generates mutations of it, then deploys the result and executes tests on the software to see whether it still works under the mutated configuration. Note that this currently doesn’t fit XWiki’s needs, and thus I’ve been developing another tool as an experiment (which may be merged back into CAMP one day), based on TestContainers; see https://massol.myxwiki.org/xwiki/bin/view/Blog/EnvironmentTestingExperimentations
* EvoCrash: Takes a stack trace from production logs and generates a test that, when executed, reproduces the crash. See https://markmail.org/message/v74g3tsmflquqwra and also https://github.com/SERG-Delft/EvoCrash
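
To give an idea of what Descartes reports: it replaces method bodies with extreme values (void, null, 0, true, etc.) and checks whether any test fails; methods where no test fails are "pseudo-tested". A made-up JUnit 4 before/after (hypothetical code, just to illustrate):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class WordCounterTest
{
    // Hypothetical code under test.
    private int count(String text)
    {
        return text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;
    }

    // Pseudo-tested: Descartes can replace count()'s body with "return 0" or
    // "return 1" and this test still passes, since it asserts nothing.
    @Test
    public void countDoesNotThrow()
    {
        count("one two three");
    }

    // Fixed: the same extreme mutations now make a test fail, so they are
    // reported as killed and the mutation score goes up.
    @Test
    public void countReturnsTheNumberOfWords()
    {
        assertEquals(3, count("one two three"));
    }
}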

Since XWiki is part of the STAMP research project, we need to use these 4 tools to increase the KPIs associated with them. See below.

Objectives/KPIs/Metrics for STAMP
===========================

The STAMP project has defined 9 KPIs that all partners (and thus XWiki) need to work on:

1) K01: Increase test coverage
* Global increase, by reducing the non-covered code by 40%. For XWiki, since we’re at about 70% coverage (i.e. 30% uncovered, and 30% x 0.6 = 18% uncovered), this means reaching about 80% before the end of STAMP (i.e. before the end of 2019).
* Increase the coverage contributions of each tool developed by STAMP.

Strategy:
* Primary goal:
** Increase coverage by executing Descartes and improving our tests. This is http://markmail.org/message/ejmdkf3hx7drkj52
** Don’t do anything with DSpot. I’ll do that part. Note that the goal is to write a Jenkins pipeline that automatically executes DSpot from time to time, commits the generated tests in a separate test source directory, and has our build execute both src/test/java and this new test source.
** Don’t do anything with TestContainers FTM, since I need to finish a first working version (a rough sketch of such a test follows this list). I may need help in the future to implement Docker images for more configurations (on Oracle, in a cluster, with LibreOffice, with an external SOLR server, etc.).
** For EvoCrash: We’ll count contributions of EvoCrash to coverage in K08.
* Secondary goal:
** Increase our global TPC as mentioned above by fixing the modules in red.
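
For reference, a minimal sketch of what a TestContainers-based configuration test could look like (hedged: class and glue-code names are hypothetical, this is not the final design), using the JUnit 4 rule integration:

import org.junit.ClassRule;
import org.junit.Test;
import org.testcontainers.containers.MySQLContainer;

public class MySQLConfigurationTest
{
    // Starts a throwaway MySQL container before the tests and disposes of it
    // afterwards; switching the image tag is all it takes to cover another
    // DB configuration.
    @ClassRule
    public static MySQLContainer mysql = new MySQLContainer("mysql:5.7");

    @Test
    public void storeAndLoadDocument()
    {
        // Hypothetical glue code: point the XWiki store at the container,
        // then run the usual functional assertions.
        // configureStore(mysql.getJdbcUrl(), mysql.getUsername(), mysql.getPassword());
    }
}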

2) K02: Reduce flaky tests.
* Objective: reduce the number of flaky tests by 20%

Strategy:
* Record flaky tests in JIRA
* Fix as many of them as possible (see the sketch below)
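
As an illustration of the most common fix (a sketch using the Awaitility library; whether we standardize on it is open): many flickers come from fixed sleeps racing against asynchronous code, and the cure is to poll for the condition with a generous timeout:

import static org.awaitility.Awaitility.await;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class FlakyFixSketch
{
    public static void main(String[] args)
    {
        AtomicBoolean finished = new AtomicBoolean();
        new Thread(() -> {
            // Simulates asynchronous work of variable duration.
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            finished.set(true);
        }).start();

        // Flaky version: Thread.sleep(2000) then assert, which fails whenever
        // the CI agent is loaded and the work takes longer than the sleep.

        // Robust version: poll until the condition holds; the timeout only
        // bites when something is genuinely broken.
        await().atMost(30, TimeUnit.SECONDS).until(finished::get);
    }
}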

3) K03: Better test quality
* Objective: increase mutation score by 20%

Strategy:
* Same strategy as K01.

4) K04: More configuration-related paths tested
* Objective: increase the code coverage of configuration-related paths in our code by 20% (e.g. DB schema creation code, cluster-related code, SOLR-related code, LibreOffice-related code, etc.).

Strategy:
* Leave this one to me FTM. The idea is to measure the Clover TPC with the base configuration, then execute all the other configurations (with TestContainers) and regenerate the Clover report to see how much the TPC increases.

5) K05: Reduce system-specific bugs
* Objective: 30% improvement

Strategy:
* Run TestContainers to execute the existing tests under more configurations, and find new configuration-related bugs. Record them.

6) K06: More configurations/Faster tests
* Objective: increase the number of automatically tested configurations by 50%

Strategy:
* Increase the # of configurations we test with TestContainers. I’ll do that part initially.
* Reduce the time it takes to deploy the software under a given configuration vs the time it used to take when done manually, before STAMP. I’ll do this one; I’ve already worked on it over the past year with the dockerization of XWiki.

7) K07: Pending, nothing to do FTM.

8) K08: More crash replicating test cases
* Objective: increase the number of crash replicating test cases by at least 70%

Strategy:
* For all issues that are still open and have stack traces, and for all issues closed but without tests, run EvoCrash on them to try to generate a crash-reproducing test.
* Record and count the number of successful EvoCrash-generated test cases.
* Derive a regression test from each of them (which can be very different from the negative of the test generated by EvoCrash! See the sketch after this list).
* Measure the resulting coverage increase.
* Note that I haven’t experimented much with this yet myself.
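
To make the last points concrete, here’s a made-up example (hypothetical code and issue): EvoCrash gives us a test whose only goal is to reproduce the stack trace; from it we hand-derive a regression test that asserts the intended behavior, which is what we keep once the bug is fixed:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CrashReproductionSketch
{
    // Hypothetical code under test, throwing the NPE reported in some
    // (made-up) JIRA issue when the title is null.
    private static String displayTitle(String title)
    {
        return title.trim().isEmpty() ? "Untitled" : title;
    }

    // Roughly what an EvoCrash-generated test looks like: it only checks
    // that the production stack trace is reproduced.
    @Test(expected = NullPointerException.class)
    public void reproducesCrash()
    {
        displayTitle(null);
    }

    // Hand-derived regression test: asserts the intended behavior and goes
    // green once displayTitle() is fixed to handle null (note how different
    // it is from just negating the generated test).
    @Test
    public void displayTitleWithNullTitle()
    {
        assertEquals("Untitled", displayTitle(null));
    }
}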

9) K09: Pending, nothing to do FTM.

Conclusion
=========

Right now, I need your help for the following KPIs: K01, K02, K03, K08.

Since there’s a lot to understand in this email, I’m open to:
* Organizing a meeting on YouTube Live to discuss all this
* Answering any questions on this thread ofc
* Also feel free to ask on IRC/Matrix.

Here’s an extract from STAMP which has more details about the KPIs/metrics:
https://up1.xwikisas.com/#QJyxqspKXSzuWNOHUuAaEA

Thanks
-Vincent

Re: [STAMP/Test] Metrics we need to improve + strategy

Adel Atallah
Hello,

Maybe we should agree on having a whole day dedicated to using these
tools with as many developers as possible.
That way we will be able to help each other, and maybe it will make
the process easier to carry out in the future.

WDYT?

Thanks,
Adel



Re: [STAMP/Test] Metrics we need to improve + strategy

Thomas Mortagne
Indeed we discussed this, but I don't see it in your mail, Vincent.


Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Hi,

I don’t remember discussing this with you, Thomas. Actually I’m not convinced we should have a fixed day:
* we already have a fixed BFD, and having a second one doesn’t leave much flexibility for working on roadmap items when that’s best
* test sessions can be short (0.5-1 hour) and it’s easy to do them between other tasks
* it can be boring to spend a full day on them

Now, I agree that not having a fixed day will make it hard to make sure that we work 20% on that topic.

So if you prefer, we can define a day, knowing that some won’t always be able to attend on that day, in which case they should do it on another day. What’s important is to get the 20% done each week (i.e. enough work done on it).

In terms of day, if we have to choose one, I’d say Tuesday. That’s the most logical to me.

WDYT? What do you prefer?

Thanks
-Vincent


Re: [STAMP/Test] Metrics we need to improve + strategy

Adel Atallah
Just to be clear, when I proposed "having a whole day dedicated to
using these tools", I didn't mean having it every week but only once,
so we can properly start improving the tests. It would be some kind
of training.
On my side, I don't think I'll be able to have, in a single week, one
day dedicated to tests and one for bug fixing; I won't have time left
for the roadmap, as I will only work on the product 50% of the time.



Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
I propose to do this tomorrow Tuesday, starting with an intro from me, using YouTube Live.

WDYT?

Thanks
-Vincent


Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol

> On 3 Sep 2018, at 09:55, Vincent Massol <[hidden email]> wrote:
>
> I propose to do this tomorrow Tuesday, starting with an intro from me, using YouTube Live.

Say, 10AM Paris time.

Thanks
-Vincent

> WDYT?
>
> Thanks
> -Vincent
>
>> On 30 Aug 2018, at 12:27, Adel Atallah <[hidden email]> wrote:
>>
>> Just to be clear, when I proposed "having a whole day dedicated on
>> using these tools", I didn't meant having to have it every week but
>> only once, so we can properly start improving the tests. It would be
>> some kind of training.
>> On my side I don't think I'll be able to have on a week one day
>> dedicated to tests and one for bug fixing, I won't have time left for
>> the roadmap as I will only work on the product 50% of the time.
>>
>>
>> On Thu, Aug 30, 2018 at 12:18 PM, Vincent Massol <[hidden email]> wrote:
>>> Hi,
>>>
>>> I don’t remember discussing this with you Thomas. Actually I’m not convinced to have a fixed day:
>>> * we already have a fixed BFD and having a second one doesn’t leave much flexibility for working on roadmap items when it’s the best
>>> * test sessions can be short (0.5-1 hours) and it’s easy to do them between other tasks
>>> * it can be boring to spend a full day on them
>>>
>>> Now, I agree that not having a fixed day will make it hard to make sure that we work 20% on that topic.
>>>
>>> So if you prefer we can define a day, knowing that some won’t be able to always attend during that day and in this case they should do it on another day. What’s important is to have 20% done each week (i.e. enough work done on it).
>>>
>>> In term of day, if we have to choose one, I’d say Tuesday. That’s the most logical to me.
>>>
>>> WDYT? What do you prefer?
>>>
>>> Thanks
>>> -Vincent
>>>
>>>> On 30 Aug 2018, at 10:38, Thomas Mortagne <[hidden email]> wrote:
>>>>
>>>> Indeed we discussed this but I don't see it in your mail Vincent.
>>>>
>>>> On Thu, Aug 30, 2018 at 10:33 AM, Adel Atallah <[hidden email]> wrote:
>>>>> Hello,
>>>>>
>>>>> Maybe we should agree on having a whole day dedicated on using these
>>>>> tools with a maximum number of developers.
>>>>> That way we will be able to help each other and maybe it will make the
>>>>> process easier to carry out in the future.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> Thanks,
>>>>> Adel
>>>>>
>>>>>
>>>>> On Wed, Aug 29, 2018 at 11:20 AM, Vincent Massol <[hidden email]> wrote:
>>>>>> [snip]
>>>>
>>>>
>>>>
>>>> --
>>>> Thomas Mortagne
>>>
>


Re: [STAMP/Test] Metrics we need to improve + strategy

Adel Atallah
+1


On Mon, Sep 3, 2018 at 9:55 AM, Vincent Massol <[hidden email]> wrote:

>
>> On 3 Sep 2018, at 09:55, Vincent Massol <[hidden email]> wrote:
>>
>> I propose to do this tomorrow Tuesday, starting with an intro from me, using youtube live.
>
> Say, 10AM Paris time.
>
> Thanks
> -Vincent
>
> [snip]

Re: [STAMP/Test] Metrics we need to improve + strategy

Thomas Mortagne
Administrator
In reply to this post by vmassol
Sounds good.

On Mon, Sep 3, 2018 at 9:55 AM, Vincent Massol <[hidden email]> wrote:

>
>> On 3 Sep 2018, at 09:55, Vincent Massol <[hidden email]> wrote:
>>
>> I propose to do this tomorrow Tuesday, starting with an intro from me, using youtube live.
>
> Say, 10AM Paris time.
>
> Thanks
> -Vincent
>
> [snip]



--
Thomas Mortagne

Re: [STAMP/Test] Metrics we need to improve + strategy

Simon Urli
OK for me too.

Simon

On 9/3/18 10:31 AM, Thomas Mortagne wrote:

> Sounds good.
>
> On Mon, Sep 3, 2018 at 9:55 AM, Vincent Massol <[hidden email]> wrote:
>>
>>> On 3 Sep 2018, at 09:55, Vincent Massol <[hidden email]> wrote:
>>>
>>> I propose to do this tomorrow Tuesday, starting with an intro from me, using youtube live.
>>
>> Say, 10AM Paris time.
>>
>> Thanks
>> -Vincent
>>
>> [snip]

--
Simon Urli
Software Engineer at XWiki SAS
[hidden email]
More about us at http://www.xwiki.com

Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Administrator
In reply to this post by Adel Atallah
So we had a conf call this morning and we agreed to have a TFD (Test Fixing Day) on Tuesdays for the XWiki 10.8 timeframe. Those who cannot attend on Tuesday will work on the tests during the other days to catch up.

This means starting today! :)

Thanks
-Vincent

> On 30 Aug 2018, at 12:27, Adel Atallah <[hidden email]> wrote:
>
> Just to be clear, when I proposed "having a whole day dedicated to
> using these tools", I didn't mean having it every week but only once,
> so we can properly start improving the tests. It would be some kind of
> training.
> On my side, I don't think I'll be able to have, in a single week, one
> day dedicated to tests and one for bug fixing; I wouldn't have time
> left for the roadmap since I only work on the product 50% of the time.
>
> [snip]


Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Administrator
In reply to this post by vmassol


> On 29 Aug 2018, at 11:20, Vincent Massol <[hidden email]> wrote:

[snip]

> Objectives/KPIs/Metrics for STAMP
> ===========================
>
> The STAMP project has defined 9 KPIs that all partners (and thus XWiki) need to work on:
>
> 1) K01: Increase test coverage
> * Global increase by reducing by 40% the non-covered code. For XWiki since we’re at about 70%, this means reaching about 80% before the end of STAMP (ie. before end of 2019)
> * Increase the coverage contributions of each tool developed by STAMP.
>
> Strategy:
> * Primary goal:
> ** Increase coverage by executing Descartes and improving our tests. This is http://markmail.org/message/ejmdkf3hx7drkj52
> ** Don’t do anything with DSpot. I’ll do that part. Note that the goal is to write a Jenkins pipeline to automatically execute DSpot from time to time and commit the generated tests in a separate test source and have our build execute both src/test/java and this new test source.

Contrary to what was proposed initially (i.e. that I’d handle DSpot alone), it would be nice if you could run DSpot too.

FTR a good command line to use for DSpot is:
java -jar <path>/dspot-1.1.1-SNAPSHOT-jar-with-dependencies.jar --path-to-properties dspot.properties --verbose --generate-new-test-class --with-comment

The --generate-new-test-class flag tells DSpot to generate only the newly added tests in its output dir, without including the existing tests.
The --with-comment flag tells DSpot to keep comments, and thus the license header too.
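In case it helps, here’s a minimal dspot.properties sketch for amplifying a single module. The key names follow the DSpot README, so double-check them against the DSpot version you run; the paths and filter below are illustrative examples, not the exact values I used:

# Root of the Maven project (illustrative path)
project=/path/to/xwiki-commons
# Module to amplify, relative to the project root (illustrative module)
targetModule=xwiki-commons-core/xwiki-commons-text
# Source and test folders, relative to the module
src=src/main/java/
testSrc=src/test/java/
# Where DSpot writes the amplified tests
outputDirectory=dspot-out/
# Only amplify tests for classes matching this filter
filter=org.xwiki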

I did a session today and committed the results in https://github.com/STAMP-project/dspot-usecases-output/commit/113726c0aac3af3df30334d14115d89227eaebdc

What I did:
* For each module tested with DSpot, created a folder in https://github.com/STAMP-project/dspot-usecases-output/tree/master/xwiki
* For cases where DSpot could generate some tests, committed them and modified the pom.xml so that they are executed
* Note: the tests need to have their license headers adjusted so that they don’t fail the build
* Computed coverage + mutation scores before and after, and reported them in the README.md of each folder (see the measurement sketch below)
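For reference, one way to take those before/after measurements on a module (a sketch using the standard jacoco-maven-plugin and pitest-maven goals, and assuming PIT/Descartes is configured in the pom as in the Descartes blog post linked above):

# Coverage, before and after adding the DSpot-generated tests
mvn clean org.jacoco:jacoco-maven-plugin:prepare-agent test org.jacoco:jacoco-maven-plugin:report

# Mutation score, before and after (PIT runs the Descartes engine when Descartes
# is declared as a dependency of pitest-maven, with <mutationEngine>descartes</mutationEngine>
# in its configuration, as described in the blog post)
mvn org.pitest:pitest-maven:mutationCoverage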

Thanks
-Vincent

> ** Don’t do anything with TestContainers FTM since I need to finish a first working version. I may need help in the future to implement docker images for more configurations (on Oracle, in a cluster, with LibreOffice, with an external SOLR server, etc).
> ** For EvoCrash: We’ll count contributions of EvoCrash to coverage in K08.
> * Secondary goal:
> ** Increase our global TPC as mentioned above by fixing the modules in red.
>
> 2) K02: Reduce flaky tests.
> * Objective: reduce the number of flaky tests by 20%
>
> Strategy:
> * Record flaky tests in jira
> * Fix the max number of them
>
> 3) K03: Better test quality
> * Objective: increase mutation score by 20%
>
> Strategy:
> * Same strategy as K01.

[snip]

Thanks
-Vincent


Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Administrator
In reply to this post by vmassol
Hi there,

We need some more DSpot results. Would be great if you could help out.

See below for instructions.

> On 29 Aug 2018, at 11:20, Vincent Massol <[hidden email]> wrote:
>
> Hi devs (and anyone else interested to improve the tests of XWiki),
>
> [snip]
>
> * DSpot: Automatically generate new tests, based on existing tests. See https://massol.myxwiki.org/xwiki/bin/view/Blog/TestGenerationDspot

Process to run DSpot:
1) Pick a module. Measure its coverage and mutation score (or take the values already recorded in the pom.xml if present). Same as for Descartes testing; see the PIT/Descartes pom sketch after the notes below for one way to compute the mutation score.
2) Run DSpot on the module, see https://massol.myxwiki.org/xwiki/bin/view/Blog/TestGenerationDspot for explanations
3) If DSpot has generated tests, add them to XWiki’s source code in src/test/dspot and add the following to the pom of that module:

<build>
  <plugins>
    <!-- Add test source root for executing DSpot-generated tests -->
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>build-helper-maven-plugin</artifactId>
      <executions>
        <execution>
          <goals><goal>add-test-source</goal></goals>
          <configuration>
            <sources><source>src/test/dspot</source></sources>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
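
With the add-test-source execution in place, the build should then compile and execute the tests from both src/test/java and src/test/dspot.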

Example: https://github.com/xwiki/xwiki-commons/tree/244ee07976c691c335b7f54c48e6308004ba3d82/xwiki-commons-core/xwiki-commons-crypto/xwiki-commons-crypto-cipher

Note: The generated tests sometimes need to be modified a bit to pass. Personally I’ve only committed tests that were passing and I reported issues for those that were not passing.

4) File the various reports:
a) https://github.com/STAMP-project/dspot-usecases-output/tree/master/xwiki both for success and failures
b) https://docs.google.com/spreadsheets/d/1LULpGpsJirmFyvHNstLGv-Gv5DVBdpLTM2hm0jgCKUw/edit#gid=2061481816
c) for failures, file a github issue at https://github.com/STAMP-project/dspot/issues and link to the place on https://github.com/STAMP-project/dspot-usecases-output/tree/master/xwiki where we put the failing result.

Note: The reason we need to report failures too is that DSpot fails a lot, so we need to show what we have tested.
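
FTR, for step 1), here's a sketch of how the Descartes mutation score can be computed, by running PIT with Descartes plugged in as its mutation engine (the coordinates below match the Descartes documentation as I remember it; the version numbers are assumptions to verify):

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.4.0</version>
  <configuration>
    <!-- Use Descartes instead of PIT's default (Gregor) mutation operators -->
    <mutationEngine>descartes</mutationEngine>
  </configuration>
  <dependencies>
    <!-- Descartes plugs into PIT as an alternative mutation engine -->
    <dependency>
      <groupId>eu.stamp-project</groupId>
      <artifactId>descartes</artifactId>
      <version>1.2.4</version>
    </dependency>
  </dependencies>
</plugin>

Then run "mvn org.pitest:pitest-maven:mutationCoverage" on the module and read the score from the generated PIT report.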

Thanks
-Vincent

> * CAMP: Takes a Dockerfile and generates mutations of it, then deploys and execute tests on the software to see if the mutation works or not. Note this is currently not fitting the need of XWiki and thus I’ve been developing another tool as an experiment (which may go back in CAMP one day), based on TestContainers, see https://massol.myxwiki.org/xwiki/bin/view/Blog/EnvironmentTestingExperimentations
> * EvoCrash: Takes a stack trace from production logs and generates a test that, when executed, reproduces the crash. See https://markmail.org/message/v74g3tsmflquqwra. See also https://github.com/SERG-Delft/EvoCrash
>
> Since XWiki is part of the STAMP research project, we need to use those 4 tools to increase the KPIs associated with the tools. See below.
>
> Objectives/KPIs/Metrics for STAMP
> ===========================
>
> The STAMP project has defined 9 KPIs that all partners (and thus XWiki) need to work on:
>
> [snip]
>
> 4) K04: More configuration-related paths tested
> * Objective: increase the code coverage of configuration-related paths in our code by 20% (e.g. DB schema creation, cluster-related code, SOLR-related code, LibreOffice-related code, etc.).
>
> Strategy:
> * Leave it to me FTM. The idea is to measure Clover TPC with the base configuration, then execute all other configurations (with TestContainers) and regenerate the Clover report to see how much the TPC has increased.
>
> 5) K05: Reduce system-specific bugs
> * Objective: 30% improvement
>
> Strategy:
> * Run TestContainers, execute existing tests and find new bugs related to configurations. Record them.
>
> 6) K06: More configurations/Faster tests
> * Objective: increase the number of automatically tested configurations by 50%
>
> Strategy:
> * Increase the # of configurations we test with TestContainers. I’ll do that part initially.
> * Reduce time it takes to deploy the software under a given configuration vs time it used to take when done manually before STAMP. I’ll do this one. I’ve already worked on it in the past year with the dockerization of XWiki.
>
> 7) K07: Pending, nothing to do FTM
>
> 8) K08: More crash replicating test cases
> * Objective: increase the number of crash replicating test cases by at least 70%
>
> Strategy:
> * For all issues that are still open and that have stack traces and for all issues closed but without tests, run EvoCrash on them to try to generate a test.
> * Record and count the number of successful EvoCrash-generated test cases.
> * Derive a regression test (which can be very different from the negative of the test generated by EvoCrash!).
> * Measure the new coverage increase
> * Note that I haven’t experimented much with this yet myself.
>
> 9) K09: Pending, nothing to do FTM.
>
> Conclusion
> =========
>
> Right now, I need your help for the following KPIs: K01, K02, K03, K08.
>
> Since there’s a lot to understand in this email, I’m open to:
> * Organizing a meeting on YouTube Live to discuss all this
> * Answering any questions on this thread ofc
> * Also feel free to ask on IRC/Matrix.
>
> Here’s an extract from STAMP which has more details about the KPIs/metrics:
> https://up1.xwikisas.com/#QJyxqspKXSzuWNOHUuAaEA
>
> Thanks
> -Vincent


Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Administrator
Hi,

[snip]

> Process to run DSpot:
> 1) Pick a module. Measure coverage and mutation score (or take the value there already if they’re in the pom.xml). Same as for Descartes testing.
> 2) Run DSpot on the module, see https://massol.myxwiki.org/xwiki/bin/view/Blog/TestGenerationDspot for explanations

One important detail that I had missed: we need to run DSpot with --descartes on the command line, so that it uses Descartes to compute the mutation score and only keeps the tests that increase the mutation score as reported by Descartes.

> [snip]

Thanks
-Vincent



Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Administrator
Hi,

> On 17 Oct 2018, at 11:20, Vincent Massol <[hidden email]> wrote:
>
>
> [snip]
>
> One important detail that I had missed: we need to run DSpot with --descartes on the command line, so that it uses Descartes to compute the mutation score and only keeps the tests that increase the mutation score as reported by Descartes.

So actually, after speaking with Benjamin, I’ve realized a few things:

* By default DSpot runs with the PIT selector (PitMutantScoreSelector), which is configured to use the default PIT mutations. This is why we need to run with the PIT selector configured to use the Descartes mutation engine instead, which is done by specifying --descartes.
* This optimizes the generation of new tests for an increased mutation score. So far we have gotten 0% every time for our tests (see https://docs.google.com/spreadsheets/d/1LULpGpsJirmFyvHNstLGv-Gv5DVBdpLTM2hm0jgCKUw/edit#gid=2061481816), and that's because we didn't use --descartes. We need to try again, or run on new modules with --descartes, and see what it gives us. It's possible it'll generate even fewer tests…
* For the coverage part, there are 2 other selectors that can be used with DSpot to generate tests that all increase the coverage:
** "--test-criterion JacocoCoverageSelector": uses JaCoCo and keeps tests that increase the instruction coverage
** "--test-criterion CloverCoverageSelector": uses OpenClover and keeps tests that increase the branch coverage

So we need to test with the various selectors and see what we get.

If we want to get the best values, we should use --descartes for K03 and either jacoco or clover selector for K01. Now we need to see what tests we get.

Thanks
-Vincent

> [snip]


Re: [STAMP/Test] Metrics we need to improve + strategy

vmassol
Administrator


> On 17 Oct 2018, at 15:54, Vincent Massol <[hidden email]> wrote:
>
> [snip]
>
> So we need to test with the various selectors and see what we get.

I’ve retested on xwiki-commons-component-default:
1) With --descartes: failure, see https://github.com/STAMP-project/dspot/issues/584
2) With the JaCoCo selector: failure, see https://github.com/STAMP-project/dspot/issues/586. I've manually fixed the tests and removed those that didn't pass. I got only a +0.18% JaCoCo coverage increase and a -2% Descartes mutation score… That's the problem: we would need a selector that optimizes for both. I've created https://github.com/STAMP-project/dspot/issues/587
3) With the Clover selector: no tests generated! Opened https://github.com/STAMP-project/dspot/issues/588

So my recommendation is to wait for https://github.com/STAMP-project/dspot/issues/584 to be fixed and then to use --descartes for our measures FTM.

Thanks
-Vincent

PS: Command lines used for reference:

- java -jar /Users/vmassol/dev/dspot/dspot/target/dspot-1.1.1-SNAPSHOT-jar-with-dependencies.jar --path-to-properties dspot.properties --descartes --verbose --generate-new-test-class --with-comment
- java -jar /Users/vmassol/dev/dspot/dspot/target/dspot-1.1.1-SNAPSHOT-jar-with-dependencies.jar --path-to-properties dspot.properties --test-criterion JacocoCoverageSelector --verbose --generate-new-test-class --with-comment
- java -jar /Users/vmassol/dev/dspot/dspot/target/dspot-1.1.1-SNAPSHOT-jar-with-dependencies.jar --path-to-properties dspot.properties --test-criterion CloverCoverageSelector --verbose --generate-new-test-class --with-comment


>
> [snip]