Wikidata talk:WikiProject Taxonomy/Archive/2014/06

Duplicates

Following the recent discussion on the project chat, I saw that some duplicate taxa were created by accident. See: https://www.wikidata.org/w/index.php?title=Special:Contributions&offset=20150101000000&limit=500&tagfilter=&contribs=user&target=Reinheitsgebot&namespace=&newOnly=1 - I already merged a few, but it will probably take a few days until all the duplicates are merged. We should probably team up on this. Or is there a bot that could help us? -Tobias1984 (talk) 10:26, 22 May 2014 (UTC)

So we have only 8108 potential duplicates to check. Wonderful. :( --Succu (talk) 18:58, 22 May 2014 (UTC)
I think they are almost all duplicates. I think the algorithm only checked the English label for duplicates and not all languages. That's why it didn't find any matches. -Tobias1984 (talk) 21:07, 22 May 2014 (UTC)
@Tobias1984: I made a rough check: about 80 percent are duplicates. --Succu (talk) 17:39, 23 May 2014 (UTC)

I wrote a script to merge items automatically, so if you can provide a reliably checked list of items that need to be merged, or a criterion my bot can check (for example: merging items which have the same taxon name (P225)), I would be happy to do it. Amir (talk) 22:06, 22 May 2014 (UTC)

@Ladsgroup: I'd like to review your code. Where can I find it? --Succu (talk) 22:11, 22 May 2014 (UTC)
I can send it to you (if you don't mind e-mailing me your e-mail address :D), and of course I will change it based on whatever check you suggest (like P225). Amir (talk) 22:20, 22 May 2014 (UTC)
Hi Amir, I think a review can wait, because there are two problems. None of these items has statements. And the same taxon name (P225) does not mean the items have to be merged. This is particularly true if they represent genera (see this list). Magnus started to merge some items, unfortunately without RfDs. --Succu (talk) 17:39, 23 May 2014 (UTC)
We can merge items that have the same taxon rank (both are species, or both genera, etc.), or we can merge only items that are not genera. Tons of crazy checks can be added (in order to prevent mistakes) to bring the number of duplicate items down to a reasonable value, so that the remainder can be checked by hand. Sorry for being very stupid about this; my field of study at university is physics, so I'm not very familiar with biology. Amir (talk) 20:59, 23 May 2014 (UTC)
It should be safe to merge species (with the same two-part name). For genera it is not safe, and anyway for these cases there probably will not even be a "taxon rank=genus", so matching can only be done by checking the Wikipedia page and finding the same "parent taxon". Likely too difficult for most bots? - Brya (talk) 07:16, 24 May 2014 (UTC)

As a very simple way to help the humans helping the bot that is helping the humans (:-)), could the list of probable duplicates be split by rank, i.e. list all family duplicates together, then genus, etc., and finally species & subspecies? Humans can then review these lists, manually merge the complicated cases, and say 'merge the rest'. If detecting the rank is too difficult, a rough approximation is to separate the taxa whose names are only one word (i.e. zero spaces) from the taxa whose names have multiple words. Doing them as two different batches will, I assume based on my own limited experience in this area, simplify the process of reviewing the bot's edits, and probably also simplify the bot code if humans can eliminate the difficult cases before the bot does its work. John Vandenberg (talk) 08:00, 24 May 2014 (UTC)

It can be simpler: it should be okay to merge everything with a name of two parts or more (this covers lots of ranks, but that does not matter), while everything with a name of a single part ('word') is best handled with caution. - Brya (talk) 12:03, 24 May 2014 (UTC)
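
The split suggested above (names of two or more parts in one batch, single-word names in another) could be sketched roughly as follows. This is only an illustration, not the code of any bot mentioned in this thread; the function name `split_by_word_count` and the sample names are assumptions made for the example.

```python
def split_by_word_count(names):
    """Split taxon names into multi-part names and single-word names.

    Multi-part names (species, subspecies, ...) are the batch proposed for
    automatic merging; single-word names (genera, families, ...) are the
    batch that should be reviewed by hand.
    """
    multi_part, single_word = [], []
    for name in names:
        parts = name.split()
        if len(parts) >= 2:
            multi_part.append(name)
        elif parts:  # skip empty or whitespace-only labels
            single_word.append(name)
    return multi_part, single_word


# Illustrative sample only, not taken from the actual duplicate list.
candidates = ["Dugong dugon", "Dugong", "Ptereleotridae", "Panthera tigris altaica"]
auto_merge, manual_review = split_by_word_count(candidates)
```

Here `auto_merge` holds the two- and three-part names, and `manual_review` holds the one-word names that need a human decision.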

I'm not sure, but apparently this is the last batch deletion Magnus filed. I hope the problem is gone, but I will check it next week. Most of the remaining items should have statements now. --Succu (talk) 19:23, 25 May 2014 (UTC)

There are around 1,300 more potential duplicates. --Succu (talk)


 Info: All these items are merged or enhanced with some properties. --Succu (talk) 17:36, 1 June 2014 (UTC)

Usage of Wikidata taxoboxes in English Wikipedia

As far as I understand, it's not possible to use Wikidata taxoboxes in Wikipedias now. Introducing Wikidata taxoboxes to Wikipedias could save editors a lot of time and motivate them to invest their time in filling Wikidata with accurate information. What are the current estimates? Where can I find information on progress and obstacles? Who is in charge of this? -- Alexander Vasenin (talk) 19:54, 29 May 2014 (UTC)

Try @FelixReimann:. - Brya (talk) 05:05, 30 May 2014 (UTC)
Thanks Brya for pinging :)
Dear Alex, currently bug 47930 still prohibits anything other than the simplest Wikidata-based templates. This is technically the only obstacle we still have. Lydia told me that this feature already has quite a high priority for the developers (see also development plan) but still requires quite an amount of work. Nonetheless, it is a core feature for Wikidata and will definitely come, and I hope it comes soon. A second obstacle is to find a Wikipedia chapter willing to serve as an early adopter. I'm unsure if it is a good idea to start in one of the bigger Wikipedias, as all articles there already have a taxobox and they have a more or less satisfying quality. Perhaps we could find a chapter which has only a few species articles with taxoboxes to start with, as it might gain the biggest improvements. I would then try to adapt the prototype taxobox to the specific needs of the corresponding chapter. -- Felix Reimann (talk) 13:19, 30 May 2014 (UTC)
hewiki would be an ideal candidate. There are fewer than 4,000 taxa (see this old discussion). --Succu (talk) 13:30, 30 May 2014 (UTC)
@FelixReimann: Thank you very much for the info! I've looked into bug 47930, and what I couldn't understand is why it depends on bug 47288. There are many good reasons for making blocks of code with different functionality as independent as possible. Making the code that handles single client queries depend on the code that tracks query statistics looks like madness to me (I've probably worked with Objective-C too much lately ;-) ). Any thoughts about this? As for the second question, I understand your concerns about introducing new unrefined code to a live environment with high usage: it could screw up a lot. But Wikidata needs editors, many of them. And right now they are scattered across the branches and they are doing the same job (filling the same taxoboxes) over and over again. Wikidata has 13400 active users, hewiki only 2400. Even if 10% of them became Wikidata contributors, it would be less than a 2% addition to the Wikidata active user base. Yes, small branches need infoboxes more than the big ones, but the main reason for this is that they don't have editors. To give information to the small branches we should acquire it from the big branches first. In contrast, enwiki has 134000 active users. The same 10% would double the Wikidata active user base. If you had a choice, what would you use: {{taxobox | name = Dugong | status = VU | status_system = iucn3.1 | image = Dugong Marsa Alam.jpg | A dugong in [[Marsa Alam]] | image_width = 250px | regnum = [[Animal]]ia | phylum = [[Chordata]] | insert-43-other-things-there}} or just {{wd:taxobox}}? -- Alexander Vasenin (talk) 20:54, 30 May 2014 (UTC)
Dear Alex. I'm not a Wikidata developer, I'm just developing the Wikidata-backed taxobox together with the other participants of the Wikiproject Taxonomy. Thus, I cannot give you a definitive answer regarding your first question. I think bug 47288 is required for arbitrary access in order to know when to rerender a cached Wikipedia article. Currently, whenever the Wikidata item is changed, the corresponding Wikipedia articles are rerendered. However, if the information shown on a Wikipedia article depends on several Wikidata items, you need to track which ones and rerender the article (or mark it as outdated) whenever one of those Wikidata items is changed. This is more or less what I feel the problem is. If you are interested in details, please ask Lydia (Wikidata:Contact the development team).
You do not need to convince me! :-) I'm working here because I think Wikidata is advantageous especially in this field. However, I'm not sure that all the Wikipedia-local biology teams are waiting impatiently for Wikidata to come; they may need to be convinced first. And for this, a smaller chapter showing the benefits of central taxobox data live is perhaps beneficial. If you're speaking for a major chapter's taxonomy team: I would love to help you roll this thing out! :) -- Felix Reimann (talk) 10:30, 2 June 2014 (UTC)
With that kind of caching it makes sense. Hope they'll fix it soon. I'm not associated with any taxonomy teams. I just like to do everything in the most elegant and efficient way ;-) --Alexander Vasenin (talk) 20:06, 2 June 2014 (UTC)
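
The rerendering problem described in this thread (an article that draws on several Wikidata items must be refreshed whenever any one of them changes) amounts to keeping a reverse index from items to the pages that use them. The sketch below only illustrates that idea; it is not how Wikibase implements usage tracking, and the class name `UsageTracker`, its methods, and the placeholder item identifiers are all hypothetical.

```python
from collections import defaultdict


class UsageTracker:
    """Hypothetical reverse index: item id -> pages that display its data."""

    def __init__(self):
        self._pages_by_item = defaultdict(set)

    def record_usage(self, page, item_id):
        # Called whenever rendering `page` reads data from `item_id`.
        self._pages_by_item[item_id].add(page)

    def pages_to_rerender(self, changed_item_id):
        # Every page that used the changed item has a stale cached rendering.
        return sorted(self._pages_by_item.get(changed_item_id, set()))


tracker = UsageTracker()
# Placeholder identifiers, not real Wikidata item ids.
tracker.record_usage("Dugong", "item-species")
tracker.record_usage("Dugong", "item-genus")
tracker.record_usage("Dugongidae", "item-genus")

stale_after_genus_edit = tracker.pages_to_rerender("item-genus")
stale_after_unused_edit = tracker.pages_to_rerender("item-unused")
```

With only one item per article this index is trivial, which matches the remark above that the current one-item case already works; the tracking becomes necessary once a page can read from arbitrary items.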

AlgaeBase property proposal

I've added a proposal for an AlgaeBase property. Please comment. -- Alexander Vasenin (talk) 21:28, 10 June 2014 (UTC)

also known as

I see that the designers of the Wikidata game have chosen to treat an "also known as" as if it were an "is the same as". Usually this will be a wrong assumption. I guess this means that we have to eliminate just about all the "also known as"-statements so as to prevent stupid merges. And all the information will have to go to the Talk page. - Brya (talk) 17:13, 11 June 2014 (UTC)

@Brya: why don't you contact Magnus Manske and explain the problem to him? That seems more efficient, if a solution can be found at the root of the problem. TomT0m (talk) 17:19, 11 June 2014 (UTC)
I am not treating any of these properties in any way. However, if one of the items links to the other (irrespective of the property), they should not be shown in the game. In any case, bad merges are the fault of the user ordering the merge, not the game. --Magnus Manske (talk) 10:39, 12 June 2014 (UTC)
Well, this page says "Some topics have duplicate items on Wikidata. Two items with the same title or alias will be suggested to you." Usually, there won't be a link to the other item, as Wikidata is quite short on fundamental properties, and "also known as" is the only field available.
But, yes, I agree that bad merges are the fault of the user ordering the merge, not the game, and given how popular the game is, the number of bad merges is not all that high. Still, if that number gets higher it will become necessary to eliminate "also known as". - Brya (talk) 10:54, 12 June 2014 (UTC)
Hi Magnus Manske, I'm not here to distribute points or to say whose fault is what :) But if there is a way to solve this problem, either by reducing the set of merge candidates or by warning that things are sometimes a bit subtle in taxonomy items, maybe it's a good idea to do so. "The game" is addressed to anyone, and a lot of its users are not familiar with taxonomy. There are subtleties in that field, such as species and other kinds of living organisms with almost the same name, that are not really easy to catch with the naked eye. TomT0m (talk) 10:59, 12 June 2014 (UTC)
And I'm saying that it's a game because merge decisions usually need a human to decide; otherwise, it could be a bot :-) That aside, I believe I now understand what you mean; you are calling aliases "also known as", right? Well, at no point do I suggest that the two items the game shows are the same (again, if I knew that, I could just make a bot instead). As quoted above, they are items "with the same title or alias", nothing more. And yes, many of these pairs are the same topic (the number of "same" and "different" decisions is about 50:50), which is why they should be merged. --Magnus Manske (talk) 11:33, 12 June 2014 (UTC)

Side note: @Magnus Manske, when you have an afterthought that you might have made a mistake in "The game", there are not a lot of ways to review your work, other than looking at your edit history on Wikidata. Any idea on how to improve this? TomT0m (talk) 11:19, 12 June 2014 (UTC)

If you click on Settings (or your user name in the title), you get a list of your last actions in the game. More views are certainly possible, but not a priority for me right now. Feel free to submit code. --Magnus Manske (talk) 11:33, 12 June 2014 (UTC)
Yes, but many are not the same (see here for some of the known exceptions) and should not be merged. This is all the more so in the case of the "also known as". - Brya (talk) 18:45, 12 June 2014 (UTC)
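
The candidate selection discussed in this thread (suggest two items that share a title or alias, but skip pairs where one item already links to the other) can be sketched as below. This is not the game's actual code, which is not shown in this discussion; the data layout, the function name `candidate_pairs`, and the sample items are assumptions made for illustration.

```python
from itertools import combinations


def candidate_pairs(items):
    """Return pairs of item ids that share a label or alias.

    `items` maps an item id to a dict with:
      "names": set of labels and aliases of the item
      "links": set of item ids this item points to via any statement
    Pairs where either item links to the other are skipped.
    """
    pairs = []
    for id_a, id_b in combinations(sorted(items), 2):
        a, b = items[id_a], items[id_b]
        shares_name = bool(a["names"] & b["names"])
        linked = id_b in a["links"] or id_a in b["links"]
        if shares_name and not linked:
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical sample data; the ids and aliases are placeholders.
items = {
    "item1": {"names": {"Ptereleotridae", "dartfish"}, "links": set()},
    "item2": {"names": {"Ptereleotrinae", "dartfish"}, "links": set()},
    "item3": {"names": {"Dugong"}, "links": {"item4"}},
    "item4": {"names": {"Dugong"}, "links": set()},
}
suggested = candidate_pairs(items)
```

In this sample the only suggestion is the pair sharing the alias "dartfish". A shared alias only makes two items candidates, so the decision whether they are really the same still rests with the person ordering the merge.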

User script for taxonomy statements sorting and highlighting

I've written a small script which can be useful for our project. It reorders statements so that instance of (P31) is always at the top of the list, then come taxon name (P225), taxon rank (P105), parent taxon (P171), and then other properties. It also highlights statements with colors (by type) for easier editing. To use it, add the following line: importScript('User:Alex.vasenin/taxohelper.js'); to your common.js file. Hope you find it useful ;-) -- Alexander Vasenin (talk) 20:40, 12 June 2014 (UTC)

Why do you think instance of (P31) should be the topmost statement? --Succu (talk) 20:55, 12 June 2014 (UTC)
Because it's a fundamental property of any entity. It answers the question: what is it? -- Alexander Vasenin (talk) 21:03, 12 June 2014 (UTC)
P31=taxon is the most superfluous of statements, and is often not added. Somebody should still present a case of why it is there at all. - Brya (talk) 05:15, 13 June 2014 (UTC)
The purpose of the script is to present statements in the most digestible and user-friendly way. If you don't like instance of (P31), feel free to copy the script to your userspace and remove the first elements from both JavaScript arrays. -- Alexander Vasenin (talk) 07:58, 13 June 2014 (UTC)
"Somebody should still present a case of why it is there at all": we did, multiple times; you just do not want to hear it. It's not useful to you because you have your own classification system here. It's totally redundant with the one used in the rest of the project, so actually, YOU have to make a case for why you don't use it. Here "taxonomy" is treated as a synonym of biological taxonomy, which ignores that the meaning of this term is actually broader. See taxonomy (Q7211). TomT0m (talk) 10:29, 13 June 2014 (UTC)
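
The ordering described at the top of this thread (P31 first, then P225, P105, P171, then everything else) comes down to a stable sort against a priority list. The sketch below shows that rule in Python for illustration only; the user script itself is JavaScript and its source is not quoted here, so the names `PRIORITY` and `sort_statements` are assumptions.

```python
# Order from the thread: instance of, taxon name, taxon rank, parent taxon.
PRIORITY = ["P31", "P225", "P105", "P171"]


def sort_statements(property_ids):
    """Put prioritized properties first; keep the others in their original order."""
    rank = {pid: index for index, pid in enumerate(PRIORITY)}
    # sorted() is stable, so properties with equal rank keep their relative order.
    return sorted(property_ids, key=lambda pid: rank.get(pid, len(PRIORITY)))


ordered = sort_statements(["P171", "P18", "P225", "P31", "P105", "P524"])
```

Dropping "P31" from `PRIORITY` corresponds to the change suggested above for anyone who does not want instance of (P31) on top.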

Ptereleotridae vs Ptereleotrinae

Does anyone have a suggestion for how to fix this mess: Ptereleotridae (Q1423033)? Family Ptereleotridae and subfamily Ptereleotrinae are not the same thing, but it looks like Wikipedias treat them as synonyms (for example, enwp redirects both terms to dartfish). -- Alexander Vasenin (talk) 16:51, 15 June 2014 (UTC)

It is best to just have one rank per item / each rank its own item. In this case this is obligatory as the French Wikipedia has a page on each. - Brya (talk) 17:41, 15 June 2014 (UTC)
I prefer clarity too, but that way we lose many interwiki links. Well, at least someone might notice they aren't synonyms. Thanks. -- Alexander Vasenin (talk) 18:37, 15 June 2014 (UTC)
Interwiki links are important, and it is always nice to have as many as possible linking together. In this case the interwiki links must be divided over two items, so there is no choice. It is good anyway to have a separate item for each rank and for each name (with exceptions for homotypic names and for names used at several ranks), so as to organize the data. As Wikipedias grow, the number of interwiki links will grow as well, so it is best to plan for it and have sufficient items here that will offer a place for them. - Brya (talk) 05:15, 16 June 2014 (UTC)

Fossils

There are some infoboxes describing fossil taxa, which are sometimes marked with † (dagger). Is there an established practice for marking a taxon as extinct in Wikidata? -- Alexander Vasenin (talk) 23:07, 14 June 2014 (UTC)

Not that I know of, but I do agree there should be a way to include this information. BTW: to me "fossil" and "extinct" are different things. - Brya (talk) 05:22, 15 June 2014 (UTC)
@Alex.vasenin: For taxa that went extinct you can set temporal range end (P524) (together with temporal range start (P523)). Helpful link to all time slices: User:Tobias1984/Geologic Time Scale. For species that went extinct recently you can set temporal range end (P524) = holocene. But we need an additional property for "calendar date of extinction" or "last sighting" for the year the last specimen was sighted. -Tobias1984 (talk) 10:48, 18 June 2014 (UTC)

RIP

Paul Silva (1922-2014). - Brya (talk) 05:27, 15 June 2014 (UTC)

Paul Silva (Q10346275) for reference. -Tobias1984 (talk) 10:43, 18 June 2014 (UTC)
For those who needed that reference, Paul Silva was the founder (and builder) of the AlgaeBase which just got its own property and may be regarded as the "Brummitt for Algae". - Brya (talk) 16:46, 18 June 2014 (UTC)

Separate project for biology?

Is general biology in the scope of this project? I think we should have a quick vote if we should include properties like ploidy (P1349) and spore print color (P787). My only concern is that there will be many more such properties in the future and that this project will be too overloaded for new contributors. It might be better to focus on identifiers and taxonomy here and outsource the rest of biology. Please also think about biometrics in your decision. I am sure that we will have a barrage of properties like "average heart beat", "number of teeth", "liver volume" and "average hibernation duration" as soon as those data-types become available. -Tobias1984 (talk) 11:09, 13 June 2014 (UTC)

Tobias, yes, I think a separate project for biology would make sense. Properties like ploidy (P1349) are in scope for the Wikiproject Molecular biology, but spore print color (P787), "average heart beat", "number of teeth", "liver volume", "average hibernation duration" and other more macroscopic biological properties are not. They are also outside the scope of taxonomy, although they might be of interest to taxonomy (as they are to molecular biology). Emw (talk) 11:52, 13 June 2014 (UTC)
Comment: This is not real opposition to making a separate project, just pulling a thread I opened in the previous discussion. Before genetic taxonomy and modern phylogenetic tools, taxonomy was closely tied to the study of the common characteristics of organisms, as I understand it. Vertebrata, for example, was the class of all animals sharing one such characteristic. This makes a link between OWL class expressions and biological taxonomy on Wikidata: these two WikiProjects (biology and taxonomy) are (or will be) in fact closely related, as we can define classes of organisms as a function of properties (as in Wikidata properties) :) Metaclasses like Clade are actually modern developments of taxonomy. Modern languages like OWL2 are perfectly fine with classifying the classes and reasoning about taxonomy itself. Actually, metaclasses are interesting for studying the history of science (this one is for Emw :) ). TomT0m (talk) 15:27, 13 June 2014 (UTC)
To answer your initial question Tobias, it's clearly beyond the scope of the project. --Succu (talk) 21:37, 13 June 2014 (UTC)
I agree with Succu, although I would not know what "general biology" is. - Brya (talk) 05:07, 14 June 2014 (UTC)
@Brya: Now that I read the sentence again, it does sound like I am referring to an introductory class at a US university. What I meant is "biology with the exception of taxonomy", and it would only be a Wikidata-organizational thing.
Another thought: It is important that the WikiProjects have a healthy size and a healthy volume of activity. People only watch pages where they get a reasonable amount of relevant information in digestible portions. -Tobias1984 (talk) 21:18, 14 June 2014 (UTC)
Tobias1984: Dog breeds task force and Cat breeds task force are "dead". What are we doing with these? --Succu (talk) 21:31, 14 June 2014 (UTC)
@Tobias1984: I'm new here, but I have some experience in project planning. The best way to refine project scope is to get back to the project goal. The goal of phase 2 is to deliver infoboxes to Wikipedias. There are quite a lot of taxoboxes in Wikipedias, hence our project. General biology infoboxes are different (even though taxonomy is still a part of biology). -- Alexander Vasenin (talk) 21:45, 14 June 2014 (UTC)
✓ Done Wikidata:WikiProject Biology -Tobias1984 (talk) 11:07, 19 June 2014 (UTC)