Wikipedia import issues.


Wikipedia import issues.

kgardas

Hello,

I'm trying to import the Wikipedia XML (English dumps without history) into
XWiki 6.0.1 running with PostgreSQL as the database. I'm using mediawiki/1.0
syntax to ease the job on my side, especially since the task is just to test
whether XWiki can hold this amount of data and nothing more.

So far the most critical issues I've found are:

1) Wikipedia's links are a bit longer than expected. I suspect whole
citations often end up in the link, so after installing XWiki and letting
Hibernate initialize the schema, I had to switch schema updates off and
alter the PostgreSQL table:

alter table xwikilinks alter column xwl_link type varchar(4096);

which allows many more pages to be imported.

2) While importing I hit a duplicate-key violation on xwikircs_pkey. It
shows up as:

STATEMENT:  insert into xwikircs (XWR_DATE, XWR_COMMENT, XWR_AUTHOR,
XWR_DOCID, XWR_VERSION1, XWR_VERSION2) values ($1, $2, $3, $4, $5, $6)
ERROR:  duplicate key value violates unique constraint "xwikircs_pkey"
DETAIL:  Key (xwr_docid, xwr_version1,
xwr_version2)=(3170339397610733377, 1, 1) already exists.

in PostgreSQL console and as:

2014-09-22 00:53:51,601
[http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
WARN  o.h.u.JDBCExceptionReporter    - SQL Error: 0, SQLState: 23505
2014-09-22 00:53:51,601
[http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
ERROR o.h.u.JDBCExceptionReporter    - Batch entry 0 insert into
xwikircs (XWR_DATE, XWR_COMMENT, XWR_AUTHOR, XWR_DOCID, XWR_VERSION1,
XWR_VERSION2) values ('2014-09-22 00:53:51.000000 +02:00:00', '',
'XWiki.Admin', 3170339397610733377, 1, 1) was aborted.  Call
getNextException to see the cause.
2014-09-22 00:53:51,601
[http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
WARN  o.h.u.JDBCExceptionReporter    - SQL Error: 0, SQLState: 23505
2014-09-22 00:53:51,601
[http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
ERROR o.h.u.JDBCExceptionReporter    - ERROR: duplicate key value
violates unique constraint "xwikircs_pkey"
   Detail: Key (xwr_docid, xwr_version1,
xwr_version2)=(3170339397610733377, 1, 1) already exists.

in xwiki/tomcat console.

I haven't been able to solve this one so far: the key value appears to be
generated by XWiki, probably from some other data, and I haven't found the
related code yet.
If it is some kind of hash function, I also wonder whether I broke it by
making the links longer with the hack in (1).

Any comment on the correctness of (1), and any idea how to fix (2), would
be highly appreciated.

Thanks!
Karel
_______________________________________________
devs mailing list
[hidden email]
http://lists.xwiki.org/mailman/listinfo/devs

Re: Wikipedia import issues.

vmassol
Administrator
Hi Karel, 


On 24 Sep 2014 at 11:49:02, Karel Gardas ([hidden email]) wrote:

>  
> Hello,
>  
> I'm trying to import the Wikipedia XML (English dumps without history) into
> XWiki 6.0.1 running with PostgreSQL as the database. I'm using mediawiki/1.0
> syntax to ease the job on my side, especially since the task is just to test
> whether XWiki can hold this amount of data and nothing more.

Interesting experiment :)

> So far the most critical issues I've found are:
>  
> 1) Wikipedia's links are a bit longer than expected. I suspect whole
> citations often end up in the link, so after installing XWiki and letting
> Hibernate initialize the schema, I had to switch schema updates off and
> alter the PostgreSQL table:
>  
> alter table xwikilinks alter column xwl_link type varchar(4096);
>  
> which allows many more pages to be imported.

The xwikilinks table contains all the backlinks for a given document.

Indeed, the default is 255 chars for the “link” field, which contains a serialized reference to the linked page (but without the wiki part if the wiki is the same as the wiki of the document containing the link).

And “fullName” is also 255 chars by default and contains a serialized reference of the document containing the link (without the wiki part).

So it can indeed quickly become insufficient when space and page names are a bit long.
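If both columns need widening, a small helper can generate the ALTER statements consistently. Note that xwl_fullname is only my guess at the column behind the “fullName” field Vincent mentions, so verify it against your actual schema first:

```java
public final class WidenColumns {
    // Builds an ALTER statement widening a varchar column; the xwl_fullname
    // column name is an assumption inferred from the "fullName" field, not
    // taken from the actual Hibernate mapping.
    static String widen(String table, String column, int size) {
        return "alter table " + table + " alter column " + column
                + " type varchar(" + size + ")";
    }

    public static void main(String[] args) {
        System.out.println(widen("xwikilinks", "xwl_link", 4096));
        System.out.println(widen("xwikilinks", "xwl_fullname", 4096));
    }
}
```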

> 2) While importing I hit a duplicate-key violation on xwikircs_pkey. It
> shows up as:
>  
> STATEMENT: insert into xwikircs (XWR_DATE, XWR_COMMENT, XWR_AUTHOR,
> XWR_DOCID, XWR_VERSION1, XWR_VERSION2) values ($1, $2, $3, $4, $5, $6)
> ERROR: duplicate key value violates unique constraint "xwikircs_pkey"
> DETAIL: Key (xwr_docid, xwr_version1,
> xwr_version2)=(3170339397610733377, 1, 1) already exists.
>  
> in PostgreSQL console and as:
>  
> 2014-09-22 00:53:51,601
> [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
> WARN o.h.u.JDBCExceptionReporter - SQL Error: 0, SQLState: 23505
> 2014-09-22 00:53:51,601
> [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
> ERROR o.h.u.JDBCExceptionReporter - Batch entry 0 insert into
> xwikircs (XWR_DATE, XWR_COMMENT, XWR_AUTHOR, XWR_DOCID, XWR_VERSION1,
> XWR_VERSION2) values ('2014-09-22 00:53:51.000000 +02:00:00', '',
> 'XWiki.Admin', 3170339397610733377, 1, 1) was aborted. Call
> getNextException to see the cause.
> 2014-09-22 00:53:51,601
> [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
> WARN o.h.u.JDBCExceptionReporter - SQL Error: 0, SQLState: 23505
> 2014-09-22 00:53:51,601
> [http://localhost:8080/xwiki/rest/wikis/xwiki/spaces/Wikipedia/pages/Brecon_&_Radnorshire]
> ERROR o.h.u.JDBCExceptionReporter - ERROR: duplicate key value
> violates unique constraint "xwikircs_pkey"
> Detail: Key (xwr_docid, xwr_version1,
> xwr_version2)=(3170339397610733377, 1, 1) already exists.
>  
> in xwiki/tomcat console.
>  
> I haven't been able to solve this one so far: the key value appears to be
> generated by XWiki, probably from some other data, and I haven't found the
> related code yet.

The code is in XWikiDocument.getId().

There’s this caveat in the code:

        // TODO: Ensure uniqueness of the generated id
        // The implementation doesn't guarantee a unique id since it uses a hashing method which never guarantee
        // uniqueness. However, the hash algorithm is really unlikely to collide in a given wiki. This needs to be
        // fixed to produce a real unique id since otherwise we can have clashes in the database.

I don’t have many ideas except renaming the pages causing the problems, since the unique id is computed from the page reference.

Here’s the algorithm FYI:

/**
 * <p>
 * Serialize a reference into a unique identifier string within a wiki. It's similar to the
 * {@link UidStringEntityReferenceSerializer}, but is made appropriate for a wiki independent storage.
 * </p>
 * <p>
 * The string created looks like {@code 5:space3:doc} for the {@code wiki:space.doc} document reference.
 * and {@code 5:space3:doc15:xspace.class[0]} for the wiki:space.doc^wiki:xspace.class[0] object.
 * (with {@code 5} being the length of the space name, i.e the length of {@code space} and {@code 3} being the length of
 * the page name, i.e. the length of {@code doc}).
 * </p>
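For illustration, the length-prefixed scheme described above can be sketched like this (a hypothetical helper, not the actual UidStringEntityReferenceSerializer):

```java
public final class UidSketch {
    // Concatenates each reference part as "<length>:<part>", so the parts
    // "space" and "doc" become "5:space3:doc", matching the format above.
    static String localUid(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(localUid("space", "doc")); // prints 5:space3:doc
    }
}
```

The length prefixes are what keep the string unambiguous: without them, `space` + `doc` and `spa` + `cedoc` would serialize identically.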

Denis might know better since he improved the uniqueness some time back.
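To make the collision risk concrete, here is a rough sketch of how a 64-bit id can be derived by hashing such a uid string. The MD5-and-truncate details are an assumption for illustration, not necessarily what XWikiDocument.getId() actually does:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class IdSketch {
    // Derives a 64-bit id from a serialized reference by hashing it and
    // keeping the first 8 of the digest's 16 bytes. Because the output space
    // is finite, two different page names can in principle map to the same
    // long, which is exactly the kind of xwikircs_pkey clash reported above.
    static long hashId(String uid) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(uid.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.wrap(digest).getLong();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical uid for the page that collided in the logs above.
        System.out.println(hashId("9:Wikipedia20:Brecon_&_Radnorshire"));
    }
}
```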

> If it is some kind of hash function, I also wonder whether I broke it by
> making the links longer with the hack in (1).

No, it’s unrelated.

Thanks
-Vincent

> Any comment on the correctness of (1), and any idea how to fix (2), would
> be highly appreciated.
>  
> Thanks!
> Karel