October 7, 2013

Semantic MediaWiki: A promising platform for the development of web geospatial crowdsourcing applications


Are you looking for a simple web platform to develop a crowdsourcing data application? Are you planning to develop your own application with a web development language like PHP, Java, Python or Ruby? Do you want to do it quickly and at minimal cost? Do you want your users to be able to capture geographical entities as part of the data? Read on...

Semantic MediaWiki is a semantic extension to MediaWiki, the wiki platform on which Wikipedia is built. MediaWiki itself is serious business: it is used by thousands of organizations as the base for their websites, and Wikipedia is the sixth most active website in the world. MediaWiki can be scaled to handle thousands of edits and millions of visits per hour. There is no doubt MediaWiki is an incredibly robust website development platform.

Semantic MediaWiki (SMW) makes it possible to add structured data to MediaWiki pages. In other words: "to create objects (pages) of a certain class (category) having certain properties". For those with a relational database background, this translates to: "to create records in tables having certain attributes". With the SMW extension (and a bunch of other extensions), you can build a complete relational database on the web, present the data in pages formatted with wikitext templates (instead of complex CSS or HTML) and list and edit them with fully customizable forms. You can query the data to create views and format them in a multitude of ways. All that online, without having to write one line of low level Python, Java, JavaScript or PHP code! Only wiki markup! The same simple markup language that has allowed hundreds of thousands of people with no development skills to contribute to Wikipedia!

I challenge anybody to find a CMS offering the same features and flexibility SMW does. Seriously! All the big players in data management, Microsoft, Google, Oracle (certainly not Oracle) still don't offer any system that competes with Semantic MediaWiki in terms of simplicity, from both an administrator's and a user's point of view, for developing a database on the web, styled with templates and editable with forms.

In this article, I want to emphasize the characteristics of a good crowdsourcing web development platform and try to demonstrate that MediaWiki, when it is installed with the Semantic MediaWiki extension, is about the best platform for developing crowdsourcing websites and that it could also become, with little effort, a very good platform for crowdsourcing geographic information. I also try to show how fundamental geographic information system concepts could be transposed to the world of a web semantic database like SMW. I presented most of these ideas during the last SMW conference in New York City. I’m addressing two kinds of readers: geospatial web application developers familiar with geospatial and web technology but unfamiliar with wikis, and especially with MediaWiki, and Semantic MediaWiki gurus who might be interested in adding complex geospatial information to their wiki.



MediaWiki for Crowdsourcing Content!


Besides being open source, MediaWiki is the perfect platform for content developed by users for users (you could say bottom-bottom content creation). In contrast to top-down approaches, in which reputed organizations (like the Britannica encyclopedia or a government agency) create and publish content for the public, bottom-bottom content is created by the crowd for the crowd. Wikipedia, OpenStreetMap and YouTube are among the most well-known examples of databases exploiting this paradigm with great success.

With government budgets constantly decreasing, the growing demand for open data of any kind and the enormous task involved in assembling huge datasets, organizations are turning more and more towards internet citizens to collect and maintain data about everything. This tendency should increase, providing a nice future for crowdsourcing technologies. Geospatial data is no exception. With the advent of Google Maps, hundreds of websites have popped up allowing neogeographers to tag geographic features on a map, to inventory any kind of observations still invisible in satellite images (like potholes) and to mash up various sources of data to create unexpected original datasets.

SMW is a very good player at that. The Wiki of the Month page on the Semantic MediaWiki website lists featured sites which are using the platform as their web content management application. Notable sites include OpenEI, a site enabling the sharing of information on energy and other digital resources, and WikiApiary, a site collecting information about the numerous sites based on MediaWiki. Both sites are based on crowdsourced data.

Not every website development platform is good for building successful crowdsourcing data websites. There are really only three fundamental functionalities that make wikis very good at this task and make them different from other Content Management Systems (CMS):

Change history - Every change made in a wiki is recorded and can be reverted. This is very important to identify deviant (or misinformed) editors and undo their changes. This is the key to content stability and confidence. This is why Wikipedia was based on a wiki platform, not a CMS. We can easily say that a web development system is not a wiki system if it doesn't provide change history.

Page-based content and editing - In most CMS you get a published mode (what normal visitors see) and an administrative mode (where a privileged user named administrator can change the website structure and content). The pages in the administrative mode generally involve management concepts which are very different from CMS to CMS. That makes the learning curve, when passing from one system to another, steeper than it should be. The content you want to edit (e.g. the text of a menu item or the header of the website) may often be several clicks away from the administrative home page and refer to an abstract concept specific to this CMS. In a wiki platform, by contrast, everything can be changed directly in a content page. Just navigate to the page containing the content you want to change, edit and save. You don’t have to understand a complex administrative interface and other management concepts. All content, even the menus, is in wiki pages, all editable the same way. This is a unique, simple and easy to learn concept that is shared by all wiki platforms and makes them quite different from most CMS. This is why Wikipedia was based on a wiki platform, not a CMS.

Easy content editing - Content contributors do not want to know about Content Management System concepts and do not want to learn how to edit complex data structures. Not to mention, they certainly do not want to mess with HTML and CSS. In a wiki, content is formatted using wikitext: a limited number of tags ensuring minimal consistency of content appearance over all the pages created by different editors. No HTML, no CSS, no choice between 2000 colors, no hesitation on how to format a page header, just use the minimal set of tags and the contributed content will look consistent with the rest of the site. Easy content editing is the key to attracting and retaining a good number of collaborators, and wikitext ensures minimal formatting consistency. This is why Wikipedia was based on a wiki platform, not a CMS!

Besides those three fundamental wiki must-haves and the fact that it is open source, MediaWiki also shares, with most popular CMS, other interesting features for developing open data portals. All features you certainly don't want to develop from scratch...

Support for many users - CMS and wikis already allow many users to contribute to content. They are ready to register users, provide them with a profile and associate their changes with their profiles. All that can generally be done in a quite secure way. Most will let you create user groups and manage permissions to read, write, upload files and change those permissions for a page or a group of pages. You don’t have to reinvent the wheel by coding your own support for multiple users in PHP or Java and venture into the development of bullet-proof security.

Easy global change of look and feel - CMS and wikis generally offer a variety of templates (or skins) to change the global look and feel of a site. A new skin can generally be applied with a minor intervention at the file system level. The fact that many websites based on MediaWiki share the same look as Wikipedia discourages many web developers from selecting MediaWiki. But counterexamples of nice-looking websites based on MediaWiki abound.

A myriad of extensions (notably the integration of Google Maps or OpenLayers) - A mature website development platform provides a great number of extensions adding functionalities to the base system. Extensions for adding a calendar, a picture gallery, a discussion forum, an RSS feed, etc. are common in most platforms. The fact that an extension exists for a platform or that an extension is customizable in some specific ways is often critical in the choice of the overall platform. This is particularly true for small organizations which can’t afford to pay a programmer to develop specific functionalities.

A set of extensions of particular interest to the geographic community are the ones allowing the integration of maps in the website. Most mature website development platforms have an extension for embedding Google Maps or OpenLayers in a page. MediaWiki has its own, called MediaWiki Maps. Again, no need to write code: these extensions support a set of parameters matching the ones available in the Google Maps or OpenLayers API. They let you configure the navigation and zooming controls and which base maps should be available. They provide means to display markers with info bubbles or to overlay KML files. Some KML files might be composed of markers only and some might contain lines and polygons. We could say that this is actually the main difference between so-called neogeographers' sites and professional geographic sites. The former tend to be happy with markers and points while the latter require more complex representations of geographic features like lines and polygons. The lack of easy means to create, store and edit complex geographical entities like lines and polygons is certainly a limit encountered in most CMS and wiki mapping extensions. We'll speak more about this below.

Online modification of all content - You don’t need to know PHP, Java, Ruby or Python to modify the two-level menu, add a new section to the website, edit a page or change permissions. This might sound strange to people working in big organizations where the website is handled by a team of designers and programmers, but the fact that 95% of a site is modifiable through a web interface without having to play with the underlying code is a fundamental advantage for small organizations that can’t afford the luxury of even one skilled web developer. The fact that a web designer with no programming skills can create and modify a website is a key feature behind the success of web development platforms like CMS and wikis. Even big organizations benefit from this ease: the content of the website can be updated faster by a more diverse, not necessarily very skilled, group of employees, who need neither access to the server filesystem nor the knowledge to deal with it.

What’s Missing? Easy Development of Web Data Management Applications...


Even though CMS and wikis are very flexible and let the most basic user modify almost every part of a website online, there is still one thing they hardly let you do simply by dealing with a couple of forms or wikitext style tags in a browser: developing a custom data management application, i.e. an application that lets you display and edit structured data on the web. Examples of this kind of application abound. Any organization of reasonable size is at some point confronted with the challenge of developing a web application to manage data specific to its business model: employees, members, products, locations, surveys, publications, all kinds of structured data whose management has to be distributed over many people on the web and for which CMS do not offer readily implemented solutions. Small and medium size organizations will generally rely on a consulting company to develop such applications, and specific solutions will be implemented by skilled developers in low level languages such as PHP, Ruby, Python or Java.

To let a web developer with no programming skills develop such an application, CMS and wikis should provide a web interface that makes it possible to:
  1. Define a custom data model (a simple table or a relational model). The result of this definition should be a set of empty tables in a server side storage backend (a set of flat files in the file system or a set of tables in a DBMS). No direct interaction with the database and knowledge of SQL or any other exotic query language should be necessary.
  2. Construct forms to feed this data model. The structure and look of these forms should be customizable with a minimal set of dialog boxes or tags without having to know about HTML and CSS.
  3. Embed different representations of the collected data in a web page as lists or individual pages, using some kind of templates based on a simple metalanguage. Users should be able to sort and filter the content of those lists.
Besides design, collection and representation of data, you may also want to import precollected data in the system and let your users export them in their favorite file format (CSV, Excel, XML or shapefile if it includes some geographic information).

All those things should be doable through a web interface by a web designer without having to write code in a low level language like Python, Java, Ruby or PHP.
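As a rough illustration of what such a platform has to do behind the scenes, the three requirements above can be sketched in a few lines of Python. This is only a sketch: the BUILDING_MODEL fields, the sample records and the helper names are all made up for the example, and no CMS or wiki is claimed to work this way internally.

```python
from string import Template

# Requirement 1: a custom data model (hypothetical fields: name -> type).
BUILDING_MODEL = {"name": str, "year_built": int, "latitude": float, "longitude": float}


def validate_submission(model, raw):
    """Requirement 2: turn raw form strings into a typed record."""
    record = {}
    for field_name, field_type in model.items():
        if field_name not in raw:
            raise ValueError(f"missing field: {field_name}")
        record[field_name] = field_type(raw[field_name])
    return record


# Requirement 3: a simple template metalanguage to display each record.
ROW_TEMPLATE = Template("$name (built $year_built) at $latitude, $longitude")


def render_list(records, sort_key):
    """Render the records as text lines, sorted on one of the model fields."""
    ordered = sorted(records, key=lambda record: record[sort_key])
    return [ROW_TEMPLATE.substitute(record) for record in ordered]


# Made-up form submissions, as they would arrive from a browser (all strings).
records = [
    validate_submission(BUILDING_MODEL, {
        "name": "Library", "year_built": "1968",
        "latitude": "46.78", "longitude": "-71.27"}),
    validate_submission(BUILDING_MODEL, {
        "name": "Pavilion A", "year_built": "1950",
        "latitude": "46.779", "longitude": "-71.275"}),
]
lines = render_list(records, "year_built")
```

The point of the article is precisely that a web designer should get the equivalent of these three pieces through a web interface, without writing any of this.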

Even though most data management applications are not very complicated - the data structure often boils down to a simple table with a few columns - most CMS and wikis will NOT let you define such a simple data model online! Most CMS will let you construct a form but the data collected will be sent to an email address. You normally prefer the data to be stored in your favorite database for further processing. Some systems, like Google Spreadsheet or SurveyMonkey, will let you define a simple data structure and build a form, but they won’t let you customize the look of this form with HTML and CSS. Other systems (like Google Fusion Tables or CartoDB) will also let you create a nice representation of the data but they won’t let you change the spreadsheet-like input grid they use in place of a form. Generally speaking, you have to use a high level web application framework like Django (Python) or Drupal (PHP) to implement even the simplest application, and this always implies some sort of programming in Python, Java, Ruby, JavaScript or whatever presumably "simple language" the framework is based on. Not to mention implementing everything from scratch with one of those low level languages, a solution still too often chosen by developers who like to reinvent the wheel...

To summarize, some systems let you implement some of the features necessary for a web data management application, but none of them lets you implement all of them without having to write low level code. As web developers and experts in web usability, we want to be able to define data models, to construct the required forms in a usable way and to display the information collected as recommended by our best practices without always relying on a developer. Few systems let you do this easily now. Semantic MediaWiki is one of those.

Semantic MediaWiki for Structured Crowdsourced Data!


So here comes the Semantic MediaWiki extension. Along with a couple of other extensions, SMW extends MediaWiki to provide all the features defined above and to develop a complete data management web application. All this online, only with wikitext, without having to write one line of PHP! You can define a set of generic properties, build a form (as a page written with normal wikitext in the Form namespace) so users can enter values for the properties and display the collected values formatted with wikitext based templates. You can create SQL-like views (or queries) on the data and display the results using the same type of templates. Among a number of simple and very useful features, forms created with wikitext support autocompletion based on already entered values, which helps avoid duplicate or misspelled values.
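For readers more at ease with code than with wikis, here is a small Python sketch of the idea behind these SQL-like queries: select the pages of a category, keep the ones matching a condition, pick some properties and format each result with a template. It is only a mental model under my own assumptions; the PAGES content, the ask() and render() helpers and the template string are invented for the illustration and are not SMW's syntax or API.

```python
# Hypothetical wiki content: page title -> categories and property values.
PAGES = {
    "Tree 1": {"categories": {"Tree"}, "properties": {"Species": "Maple", "Height": 12}},
    "Tree 2": {"categories": {"Tree"}, "properties": {"Species": "Oak", "Height": 15}},
    "Bench 1": {"categories": {"Bench"}, "properties": {"Material": "Wood"}},
}


def ask(pages, category, printouts, condition=None):
    """Return one row per page of `category`, limited to the requested properties."""
    rows = []
    for title in sorted(pages):
        page = pages[title]
        if category not in page["categories"]:
            continue
        props = page["properties"]
        if condition is not None and not condition(props):
            continue
        row = {"Page": title}
        for name in printouts:
            row[name] = props.get(name)  # None when the property is not set
        rows.append(row)
    return rows


def render(rows, template):
    """Format each row with a template, the way a wiki template formats results."""
    return [template.format(**row) for row in rows]


tall_trees = ask(PAGES, "Tree", ["Species", "Height"],
                 condition=lambda props: props["Height"] > 13)
listing = render(tall_trees, "{Page}: {Species}, {Height} m")
```

In SMW the category plays the role of the table, the printouts the role of the selected columns and the template the role of the report layout, all written in wikitext rather than Python.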

Another extension called Semantic Result Formats lets any editor display query results in a very impressive number of sophisticated formats like calendars, graphs, timelines, trees, slideshows, image galleries, tagclouds, BibTeX, and many more! Another extension, Semantic Drilldown, offers an interface for progressive filtering and listing of semantic data. An extension named Data Transfer lets contributors import CSV or XML files as page properties and lets users export page properties to XML.

In brief, Semantic MediaWiki lets you do with structured data what MediaWiki lets you do with textual content: quickly putting online a quite complex and generic data management system.

What about geospatial data?

The MediaWiki Maps extension, which is used to display markers over basemaps, already has its semantic little sister called Semantic Maps. This extension allows capturing the location of a geographic feature in a form and aggregating all the features resulting from a SQL-like query into a single map. Everything you have ever dreamed of to build a neogeography application! Again, and I like to highlight the point, without having to write a single line of low level code! A non-programmer can develop forms to collect information about point features, query them and display the results on a map in a couple of hours. They have full control over the appearance of the form, the display format of the data collected, the parameters of the query and the characteristics of the map. This is sufficient to develop a whole lot of different geoweb applications, but still Semantic MediaWiki has its niche...
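To give geospatial developers a feel for what "aggregating all the features resulting from a query into a single map" amounts to, here is a minimal Python sketch that turns query rows holding point coordinates into one GeoJSON FeatureCollection. The rows, their field names and the to_feature_collection() helper are assumptions made for the example; I am not claiming that Semantic Maps uses GeoJSON or works this way internally.

```python
import json


def to_feature_collection(rows):
    """Aggregate rows with 'lat'/'lon' values into one GeoJSON FeatureCollection.

    Rows without coordinates are skipped. GeoJSON positions are [longitude, latitude].
    """
    features = []
    for row in rows:
        if row.get("lat") is None or row.get("lon") is None:
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
            "properties": {"title": row["title"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up query results: one row per wiki page describing a point feature.
rows = [
    {"title": "Pothole 1", "lat": 46.78, "lon": -71.27},
    {"title": "Pothole 2", "lat": 46.81, "lon": -71.22},
    {"title": "Pothole 3", "lat": None, "lon": None},  # location not captured yet
]
collection = to_feature_collection(rows)
geojson_text = json.dumps(collection)
```

Note that a Point geometry is all this sketch handles, which is exactly the marker-only limitation discussed earlier for most mapping extensions.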

The Semantic MediaWiki geospatial niche

If we classify geoweb applications on a continuum going from applications showing no information about each geographical feature to applications showing a lot of information about each feature, we can group them into three fundamental categories according to the relative importance given to the content vs the map:
  1. Map only applications - in which only the geographical representation of the features is presented to the users. These applications are mainly simple dynamic maps. Google Maps, Bing Maps and OpenStreetMap, when no query is performed in their search boxes, easily fall into this category.
  2. Map with info window applications - in which the map is prominent but information about each geographic feature is presented in a popup info balloon when users click on the feature. They look more like traditional geographic information systems. Emphasis is put on the functionalities associated with the map, not on the content associated with the feature. Some examples: WikiMapia, GeoIndex+, the global Drowning Tracker, the Greek PAE-RAE Geospatial Map for Energy Units and Requests. There are literally thousands of them...
  3. Content based applications - in which textual and structured information about each geographical feature is important and presented page per page (one geographic feature = one page). Maps serve only as a side representation of the data. Some examples: Wikipedia, the list of buildings on the University Laval campus.
Some commentators of the geoweb claim that there are too many applications of the first and second types and that many applications, mostly search and discovery ones, would be more efficient if they were constructed like the third type. It’s often hard to balance the importance to give to content vs mapping in a geoweb application. SMW is an excellent platform to develop the third type of application with rich content for each geographic feature.

Here are some examples of applications with emphasis on rich textual, structured and geospatial content:
  • A gazetteer web service in which the collection of information related to an entry (name, synonyms, description, classification, parent and children entries, statistics) is important and in which not only the latitude and the longitude of the feature are stored and delivered (e.g. as a WFS service) but also detailed polygonal and linear geometries when appropriate.
  • An online atlas offering a lot of content and data about each geographical feature (like Wikipedia).
  • A cadastral survey containing all the historical information related to each property of a specific area.
  • A catalog of geospatial data in which the description of each dataset is as important as, or more important than, its geographical coverage.
Funnily, the wiki of the OSGeo Foundation, which "supports the collaborative development of open source geospatial software", and most notably the development of PostGIS, the most used open source geospatial database, is made with MediaWiki. Even though SMW and the Semantic Forms extension are installed on the base platform, nobody in this organization apparently thought about its potential as a spatial database. Hey guys! This is a web database! Would it not be nice if it were a web geospatial database? Let's see if we get some reactions here...

Exploring Microsoft SharePoint, Drupal and others

Clearly, SMW is a web database coupled with a very flexible web development platform. If you know any other system bringing this data collection power and this flexibility to a non-programmer, let me know! I have been searching for such a system for a while now and I always come back to SMW. Some have compared Microsoft SharePoint to SMW, but the comparison does not hold up for long.

Another challenger is the Drupal CMS, which allows defining content based on a custom data structure and collecting the corresponding data in the form of "nodes". You can then display views over those nodes in a very flexible way to build blocks of content to be embedded in some pages of the website. We are very close to the goal!

Still, Drupal offers two very different interfaces: a hard (or impossible) to customize CMS-like one for administrators and a very customizable one for end users. All Drupal configurations are done using forms which can get surprisingly, even overwhelmingly, complex. It becomes quite hard for end users to understand the new configuration forms available to them when they gain editing privileges and to assimilate the concepts associated with those forms. It’s funny to see experienced Drupal users presenting new Drupal features on YouTube getting themselves lost in the maze of Drupal forms (at 22:00). That's in contrast with wikis, which in general offer a "do it in a page using wikitext" approach to everything, whether the editor is an experienced administrator or a simple end user.

Furthermore, Drupal makes a distinction between end user (or survey) forms and administrative forms for creating website content. The first ones generate data that can only be viewed as a list or sent to you as emails, and the second ones generate content nodes. Data generated with end user forms cannot be reused as website content. End user forms are quite customizable online but content creation forms can only be modified by a general change of skin or by altering some .php file.

These observations also hold for Cartaro, a Drupal based geospatial CMS which represents one of the best attempts at offering a simple web development platform for geospatial data management. All content addition is done through administrative forms and Drupal specific concepts which can get quite complex for a newcomer. Easy SDI, a similar product based on the Joomla CMS, suffers from the same drawbacks.

At the lower end of the continuum, Django and its geospatial extension GeoDjango are targeted at Python developers. Developing a web data management application with this framework is impossible if you don't have programming skills. GeoNode, an open source product maintained by the ubiquitous OpenGeo company and built on top of Django, is an out of the box user-to-user Spatial Data Infrastructure. Good knowledge of Python and CSS is necessary to make any significant change to the GeoNode interface or add a metadata field to the catalog. "Open source" might mean open to skilled developers but we are looking for frameworks "open" to a larger user community whose skills are in other domains like UI design, data modeling, content creation, community development or merely web site construction. Not programming...

So Drupal, like most CMS, and Django, like most web development frameworks, are not designed as end user content creation platforms but as administrative centric ones, and, at worst, as developer centric ones. In contrast, SMW forms always create new reusable content and are easily customizable by non-developers. It is clearly an end user content creation platform making it a better choice for evolving crowdsourcing open data creation and modification.

Here is a summary of wikis, SMW and CMS key features for crowdsourcing applications:

Wikis (and MediaWiki) features shared with most CMS
  • Online customization of content and formatting
  • User and group management
  • Easy change of global look and feel (even though MediaWiki is not necessarily very good at this right now)
  • Many extensions for various extra functionalities (like Google Maps or OpenLayers embedding)
  • Form creation and formatting
Wikis (and MediaWiki) specific features
  • Change history (for everything)
  • Page-based content and editing of everything
  • Easy editing and formatting of page content
Semantic MediaWiki specific features
  • Client side data structure definitions for storing form data on the server
  • Customizable rendition of data collected with forms (with templates)
  • Customizable rendition of queries on data collected with forms (with templates)
Features provided by Semantic MediaWiki related extensions
  • Customizable rendition of geographic queries as maps (Semantic Maps extension)
  • Customizable rendition of queries as graphs (Semantic Result Formats extension)
  • Bulk import of data (Data Transfer extension)
  • Export of collected data (Data Transfer extension)
  • Transparent use of external data (External Data extension)
  • Dynamic filtering of data (Semantic Drilldown extension)


The World is Never Perfect: MediaWiki Weaknesses


No content management system is perfect. MediaWiki (by itself without the Semantic extension) also suffers from some weaknesses. Yaron Koren, a major developer of SMW extensions and author of the book "Working with MediaWiki", already identified some of them in his former blog. Here is my own list:
  • WYSIWYG - Even if this is presently identified as the number one weakness of MediaWiki, I share Yaron's thought that wikitext is simple enough for most applications. I run a wiki (based on PmWiki) open to about 200 researchers and all of them have been able to create their own page without any guidance other than a couple of very simple documentation items I wrote. Anyway, this weakness is now a past issue with the release of the VisualEditor, which seems to have overcome all the difficulties involved in developing a robust WYSIWYG editor for wikitext.
  • Flexible access permission control for pages, sections and documents - To me, this is the first real weakness of MediaWiki. Opening everything and securing page editing only with some sort of CAPTCHA might be sufficient for an organization's intranet, in which users can be trusted, but if your wiki is also the website of the organization, you want different degrees of security on some sections (e.g. on the home page or on key administrative pages). MediaWiki does not have good support for page security since it was conceived as an open-to-all wiki. Most security configurations are done by modifying the main .php configuration file, which means only administrators with low level file access can change which actions groups are allowed to perform and on which pages.

    A typical Linux-like security scheme would normally let an administrator:

    • Define users and groups of users.
    • Assign users or groups of users read permissions to pages or groups of pages (and their associated attached files).
    • Assign users or groups of users write permissions to pages or groups of pages (and their associated attached files).
    • Assign users or groups of users permissions to change permissions.

    All that through a web interface without having to directly modify files on the server. Except for attached file permission, this security scheme is the one adopted by PmWiki.

    As for MediaWiki, it offers eight user groups by default: unauthenticated users (anonymous), authenticated users, autoconfirmed authenticated users, emailconfirmed authenticated users, bots, sysops, bureaucrats and filesystem access administrators. Only a filesystem access administrator can create new groups (by editing the main .php MediaWiki configuration file). Users can be created and assigned to groups online. Permissions are generally applied globally in the .php file. Some permissions, like ‘edit’ and ‘move’ (curiously not the ‘read’ one), can be applied on a page-by-page basis but hardly to a group of pages.

    Many MediaWiki extensions try to address this lack of flexibility but there are so many (107) of them that it gets quite tricky to choose the right one for your particular needs. Furthermore, none of them guarantees 100% security on the additional feature it provides. Clearly, setting security levels page by page or per group of pages in MediaWiki is an adventurous endeavour and this is probably why most MediaWiki gurus tend to promote full access to almost everything in the wiki (maybe not for the right reason).
  • Too many configurations still done through configuration files - Even though all the content of a MediaWiki installation is modifiable online, it is quite different for many administrative tasks. Many of them can be done via Special pages but still too many are doable only by editing the main .php configuration file.
  • No extension manager - In many platforms, installing and configuring extensions is an administrative task that can be done online. However, despite a long history of requests by users and many attempts by skilled developers, no extension manager worthy of the name exists yet for MediaWiki.
  • Difficult skin development - Developing a MediaWiki skin is not reputed to be an easy task. Daniel J. Barrett, in his O’Reilly book about MediaWiki, explains why: "Of all the ways to configure MediaWiki, skinning is one of the most complicated, mainly because the code is not well-factored. From the supplied skin files, it’s nontrivial to understand which parts are required to write a skin and what are the best practices." This might be the reason why there are not many beautiful skins for MediaWiki, most of them being close variations of the Wikipedia one.
  • Too much geared towards the needs of the Wikimedia Foundation - From the MediaWiki manual: "The software is primarily developed to run on a large server farm for Wikipedia and its sister projects. Features, performance, configurability, ease-of-use, etc. are designed in this light; if your needs are radically different the software might not be appropriate for you." This might explain why there is still not a good extension manager, why advanced document management and access right features are neglected and why skin development is still hard... Maybe MediaWiki would gain features allowing the development of a broader spectrum of applications if it distanced itself a bit from Wikipedia? Maybe a fork could reach a more diversified group of potential users?
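Coming back to the access permission item of this list, the Linux-like security scheme I described there can be sketched in Python as follows. This is a minimal model under my own assumptions: the AccessControl class, the "read", "write" and "admin" permission names and the idea of designating a group of pages by a title prefix are all hypothetical, and this is neither PmWiki's nor MediaWiki's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AccessControl:
    # group name -> set of user names
    groups: dict = field(default_factory=dict)
    # (page title prefix, permission) -> set of group names
    grants: dict = field(default_factory=dict)

    def add_user(self, group, user):
        """Define users and groups of users."""
        self.groups.setdefault(group, set()).add(user)

    def grant(self, page_prefix, permission, group):
        """Give a group a permission on every page whose title starts with the prefix."""
        self.grants.setdefault((page_prefix, permission), set()).add(group)

    def is_allowed(self, user, page, permission):
        """Deny by default; allow when any of the user's groups holds a matching grant."""
        for (prefix, granted), allowed_groups in self.grants.items():
            if granted != permission or not page.startswith(prefix):
                continue
            if any(user in self.groups.get(group, set()) for group in allowed_groups):
                return True
        return False


acl = AccessControl()
acl.add_user("staff", "alice")
acl.add_user("admins", "bob")
acl.grant("", "read", "staff")            # the empty prefix matches every page
acl.grant("Admin/", "write", "admins")    # a group of pages
acl.grant("Admin/", "admin", "admins")    # permission to change permissions
```

With this setup, alice can read any page but cannot write to "Admin/Settings", while bob can. Exposing exactly these operations through a web interface, instead of a .php file, is what I am asking for.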

Semantic MediaWiki Weaknesses

MediaWiki, along with its semantic set of extensions, is an amazing website development platform; don’t get me wrong on that. But still, as I tried to express in a short five-minute presentation during the last North American SMW conference, it suffers from some weaknesses:

  • Semantic! - To me, the first thing that keeps people away from Semantic MediaWiki is its name. How many people in your website development team know the meaning of the word "semantic"? Most people in the website development business know about "relational databases", about "tables", about "attributes", but nothing about "semantic". Speak to them in a language they understand. Speak about "meaning", "data", "database", "properties", but not about "semantic". I think SMW needs a serious rebranding to better connect with the community of web developers, which constitutes its main target audience.
  • Triples VS relational schema - Who knows about triplestore databases? Not many people. Most people know about relational databases. Having to deal with a new website platform is one thing; dealing with an unknown data storage paradigm is another. Does this unknown triplestore paradigm scare web developers when they consider SMW as an option for the data management application they have to put on the web? I tend to think "yes". I think SMW would gain not only from getting rid of the "semantic" terminology but also from adopting a storage paradigm closer to the relational database one. That would make it significantly easier to sell this platform to the huge relational-database-savvy community that is looking for simple web solutions.

    "Impossible", the Semantic Web 2.0 gurus will say, "as semantic is tied to RDF and the way to store RDF relationships is with triplestores." Funnily enough, SMW's main database back end is MySQL, a relational database, not a triplestore. Another oddity is that most people use SMW with the Semantic Forms extension, which forces users to always define the same set of properties on entities of the same category. This is a direct mapping to a relational schema, not to a triplestore schema, which allows properties to be set on any entity. When using the Semantic Forms extension, categories act like tables and properties act like table columns. An SMW installation based on Semantic Forms acts very much like a relational database, not like a triplestore. Why continue to refer to RDF and triplestores? No matter how subject-predicate-object statements are stored in the database (always storing them as triples is handy as it avoids continuously modifying the database schema), the important thing is how they are structured from the SMW user's point of view. When using Semantic Forms they look like a relational schema: there is no need to speak about "semantic", "triplestore" or "RDF". When every property can be set in any page with annotations, they ARE triples and cannot appear as relations in a relational schema: you can now speak of them with "semantic", "triplestore" and "RDF".

    This incongruity could easily be fixed by forcing properties to be defined only in the context of categories (directly in the category pages) instead of in independent pages. With this slight modification, the SMW data schema would map directly to a relational schema and the better-known relational database terminology could be adopted. This could get relational database dummies to stop by and consider SMW for their web applications. This represents a lot of people and potential contributors to the SMW code base and related extensions. Each property would be tied to a category by definition (like columns are tied to tables), describing the equivalent of a relational database table schema for this category. Right now, properties can be defined using annotations, in a triplestore way, outside of any context and without using Semantic Forms, on any object of any category. It is indeed very strange for someone with a relational database background to be able to define properties outside any context and then assign values for those properties on any entity.

    SMW could leave the possibility to define properties outside the context of a category only when it is explicitly installed over a triplestore back end and values for properties are not set using Semantic Forms. This would result in two kinds of properties: "category properties" describing a relational database-like schema and "global properties" describing a triplestore, "semantic" or "RDF"-like schema. Both kinds of properties would be different ways to attach semantics to entities. "Category properties" would be queried using SQL-like "ask" queries and "global properties" would be queried using SPARQL-like queries. The default would be to accept only "category properties" when SMW is installed over a relational database. It would be possible to define "global properties" in addition to "category properties" if a triplestore is installed on the back end.

    Forcing the definition of properties in the context of a category would not have much impact on the rest of SMW's functionality. {{#ask}} and {{#show}} queries would work transparently. {{{for template}}} Semantic Forms tags would have to refer to a specific category and subsequent {{{field}}} tags would have to refer only to properties defined for this category.

    Semantics is not the prerogative of RDF triples. As long as we don’t get into OWL and we stay in the "Semantic MediaWiki with the Semantic Forms extension" domain, a relational schema is sufficient to define semantics on objects. There is a direct mapping between relational schemas and RDF schemas. RDF can also be seen as a generalization of the relational paradigm in which properties are not constrained to be defined in the context of a table. "Category properties" are just another way, closer to the relational database way, to define semantics. Allowing a simpler relational perspective would certainly make SMW more accessible to most data model developers. It could be marketed as a true "web relational database solution" and certainly attract more users, developers and funds.
  • Limited number of subobject levels - In MediaWiki everything is stored as a page. It makes sense since everything is in the form of an article. One article per page, fair enough. When using SMW, things get a little bit more complicated. It was designed to add properties to articles. All properties of an article are stored within the page for this article using special tags. When using the Semantic Forms extension, property values are stored in a template call, still within the article. When the article is saved, the template "tags" the values so they are properly defined as properties.

    Up to now the structure is always very basic: one article per page, many properties per article. Now let's say you want to "group" some properties together and have many of those groups for one article. Database users would call this a one-to-many relationship. A classic example is the many actors playing in a movie. This is simple since it is natural to create articles (or pages) about actors and hence to store properties about actors in those pages.

    But what if groups of properties represent something that doesn't make much sense to store as a page? For example, consider the list of references in an article. Each reference has an author, a title, a year of publication, etc... We don't want to create one page for each reference. We would end up with thousands or millions of almost empty pages. SMW's answer to this is "subobjects". Subobjects are like articles (objects) with properties but they don't have to "live" (or be stored) in their own pages. They can "live", many at the same time, in the page of an existing article. Very clever, very useful. This is a two-level, one-to-many relationship and, thank god, not everything has to become a page.

    Now let's complicate the problem a little bit by wanting to store all reference authors themselves as objects with properties, not just as a mere, unparsed list of authors for each reference. Each author could have a name, a surname, an email address and an affiliated institution (to make things even more complicated they could have many addresses and many affiliated institutions, but we don't need this level of complexity to make our point). Remember that we don't want to store references, nor authors, as pages. We want to store them as subobjects of articles in the article pages. So we need to be able to define a third-level, one-to-many relationship, as subobjects (authors) of existing subobjects (references), all to be stored in the same page.

    Unfortunately this is not possible now in SMW because subobjects cannot be embedded in other subobjects... You then have to make compromises and always define some levels as pages (articles). This is possible most of the time but generally not desirable as, very often, these objects can be defined with a very limited set of properties and we don't want to create pages for objects having only two or three properties. That chills our willingness to implement a true database schema in SMW and use it as a data gathering application. Don't get me wrong, you CAN implement any relational schema, but at some point you have to turn some tables into pages, even when the objects represented in those tables are not well represented as pages.
  • Messy extension support - A very interesting fork of SMW called SMW+ has been developed over the years by startup companies like Ontoprise (now Semafora) and DIQA. For some unclear reasons, the development of those extensions has been discontinued and they became incompatible with the latest versions of MediaWiki. Some functionalities provided by new, well-supported extensions of SMW seem to reimplement features that were provided by the old SMW+. It is also unclear which ones are provided by new products from those companies. SMW+ provided some very interesting and unique features and it is hard to tell what is still available and what is lost. There is now a project aiming to revive these features but it doesn’t look very active. A similar pattern seems to have happened to another major extension called Halo. Everything should have been done by the MediaWiki and the SMW+ teams to avoid such a loss of precious and unique functionalities. Better backward compatibility and inter-team communication would probably have avoided the loss.
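
As a rough illustration of the "Triples VS relational schema" point above, here is a minimal Python sketch showing that when every page of a category carries the same set of properties, the stored triples pivot cleanly into table-like rows. The page names, property names and values are all made up for the example, and `pivot` is a hypothetical helper, not part of SMW.

```python
from collections import defaultdict

# Made-up annotations stored the triplestore way: (page, property, value).
triples = [
    ("Lake A", "Category", "Lake"),
    ("Lake A", "Area km2", 12.5),
    ("Lake A", "Country", "Canada"),
    ("Lake B", "Category", "Lake"),
    ("Lake B", "Area km2", 3.0),
    ("Lake B", "Country", "Canada"),
]


def pivot(triples, category):
    """Return one row (dict) per page of `category`, one key per property."""
    pages = defaultdict(dict)
    for page, prop, value in triples:
        pages[page][prop] = value
    rows = []
    for page, props in pages.items():
        if props.get("Category") == category:
            row = {"page": page}
            # Every property except the category itself becomes a "column".
            row.update({k: v for k, v in props.items() if k != "Category"})
            rows.append(row)
    return rows


rows = pivot(triples, "Lake")
for row in rows:
    print(row)
```

The category plays the role of the table and each property the role of a column, which is the mapping the "category properties" idea relies on. Triples whose properties vary freely from page to page would not pivot into a regular table like this.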

What is Missing for Better Geospatial Support in Semantic MediaWiki?


From the point of view of a developer of geospatial applications, the main weakness of SMW is its limited support for geospatial data. The Semantic Maps extension already works very well with points. What is missing to make SMW go from a web database and web application development platform to a full-featured web spatial database and web geographic information system? Most notably, support for complex geometries. Here is a rough picture of the critical functionalities that are missing:
  • Tools for importing, creating and editing complex geometries - As in any GIS application, creating and editing features are key building blocks. The Semantic Maps extension allows creating new geographic coordinates (or points) by entering them in a form or by clicking on a map. Direct creation of complex geometries is however harder. We generally want to import them from existing KML files or shapefiles. We might also want to import a single geometry by specifying its unique ID in the shapefile, or to import many features at the same time using the SMW Data Transfer extension. In a typical SMW project, each row from a shapefile would create, or add a property to, a page describing an entity having a geographical footprint: a town, a country, a road, a lake, etc... The other shapefile attributes could also be imported and become additional semantic properties. Another way of creating a geometry property might be to simply drop a WKT string in the field of a form.

    Direct creation and editing of complex geometries is already possible with the Semantic Maps MapEditor. This editor should be enhanced in a number of ways: 1) It should be integrated as a special input type widget in the Semantic Forms extension so that complex geographic entities can be edited at the same time as other types of information are created or edited. This implies that resulting geometries should automatically be stored as properties when editing is finished. Right now you have to copy the geometry string generated by the tool into the target object page. 2) Creation of multipolygons and holes should be allowed. This is essential to be able to create realistic geographical entities like lakes with islands or countries sprinkled with lakes. 3) Display of previously entered features should be allowed in the editor, and snapping of vertexes to these entities while creating or modifying the current entity should be enabled. This is essential to be able to create and edit topological, non-overlapping coverages of entities.
  • Storage for complex geometries in the database - Once geometries have been captured or imported, SMW has to store them in the database as it does for any other page property. All databases supported by SMW have support for storing complex geometries following the OGC (Open Geospatial Consortium) Simple Features Specification standard. MySQL supports the geometry class, PostgreSQL has a very popular extension called PostGIS, which is the most used and advanced spatial database in the world, and SQLite has an extension called SpatiaLite. All of them store geometries in a GEOMETRY complex column type.

    Simple points (or geographic coordinates), as supported by the Semantic Maps extension, can easily be manipulated by modifying their latitude or longitude component. It is hence useful to store them as two decimal numbers in the wiki page of the geographical entity. Complex geometries however, sometimes composed of thousands of vertexes, cannot really be modified this way (think about the complex multilinestring composing the limits of the United States). One needs a graphical interface to move, add or delete the numerous vertexes. It might be misleading and useless to actually store and show the values of all those vertexes in a wiki page (in the WKT format for example). It would be better to simply show the unique identifier of each geometry and allow their modification only through a proper graphical interface.

    It would also be a mistake to try to reinvent a way to store complex geometries specifically for SMW. It is true that one might be tempted to store complex geometries as WKT strings in a column of type STRING or TEXT, but this would not benefit from a very important feature of the GEOMETRY type: spatial indexing, which dramatically speeds up searches for spatial entities. Complex geometries should be stored using the database's own spatial extension. Period. This implies that support for complex geometries would be available over a PostgreSQL or a SQLite back end ONLY if the corresponding spatial extension is installed with the database. MySQL comes with the geometry type installed by default so there would be no need for an extra installation step for this back end.

    Geospatial triplestores

    There are many reasons why some users prefer to store SMW data in a triplestore database. The first is to be able to use the SPARQL language, which is better suited to querying RDF stores and offers better performance than equivalent relational, SQL-based implementations. A second reason is to benefit from the ontology capabilities of some triplestore implementations. A third reason is to be able to interact with data stored in SMW through other SPARQL-capable interfaces. These users will eventually also want equivalent support for geographical objects. Support for geometry storage and querying is, however, far less common in triplestores than in relational databases. There is an OGC standard called GeoSPARQL specifying how to represent and query geographical information in a triplestore, but few systems support it so far.

    Of the two triplestores most well-supported by SMW (4Store and Virtuoso), only the commercial version of Virtuoso has support for geometry and this is for points only. Support for complex geometries is planned for future versions. A number of other triplestores that support GeoSPARQL also provide support for complex geometries. The following table lists popular triplestores and their support for complex geometries and for GeoSPARQL.

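    | TRIPLESTORE                  | LICENSE                               | GEOMETRY SUPPORT   | GEOSPARQL SUPPORT |
    |------------------------------|---------------------------------------|--------------------|-------------------|
    | SQLStore3                    | Open Source                           | No                 | No                |
    | 4Store                       | Open Source (GPL)                     | No                 |                   |
    | Virtuoso                     | Open Source (GPL) and commercial      | No                 |                   |
    | Parliament                   | Open Source (BSD)                     |                    |                   |
    | Strabon                      | Open Source (MPL)                     | Yes                | Yes               |
    | Sesame                       | Open Source (BSD)                     | Not without uSeekM |                   |
    | Bigdata                      | Open Source (GPL)                     | Not without uSeekM |                   |
    | AllegroGraph                 | Commercial (free and paying editions) | No                 |                   |
    | OWLIM-SE (formerly BigOWLIM) | Commercial                            | No                 |                   |
    | Jena                         | Open Source (Apache)                  | No                 |                   |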

  • Display and symbology - Display of individual geometries does not really pose a problem. Technology for displaying georeferenced information like points, lines and polygons is mature and already put to very good use by the Semantic Maps extension. Complex geometries can be displayed the same way points are displayed, over a base map like Google Maps or OpenStreetMap, using the mapping widget API. A challenge arises when geometries have to be displayed as the result of a spatial query (like "what are the geographical entities within 5 kilometers of another entity?"). Spatially querying for points in a database is one thing (it can be reduced to a search based on two indexed columns like longitude and latitude), but querying for complex geometries is another. Queries have to be efficient and select only the geometries that must be displayed at a given zoom level in the map displaying the query results. And this is where it becomes important to store the geometries using the spatial facilities provided by each database, indexes included.

    Another consideration is map styling (or symbolization). The same way semantic data is formatted using templates in SMW, geographic information has to be drawn using some reusable styles. As in other information systems, it is a very good practice to separate data (here the geometries themselves and their attributes) from their representation (some styling protocol).

    The KML format offers a number of styling primitives. They are normally embedded with the data in the KML file, but the same primitives could be reused to define styles that maps would refer to by identifier. SLD (Styled Layer Descriptor) is another standard for styling geographic entities which is increasing in popularity and is clearly inspired by CSS.

    The same way templates are stored in pages in the "Template" namespace, geographic styles could be stored as pages in a "Geographic Style" namespace. The page names could be the way to identify styles used in a map. These styles would be created and modified using a Semantic Forms form. A number of predefined styles could be provided for frequently used point, line and polygon geographical entities as part of a spatially enabled Semantic MediaWiki installation. Sounds like a Geographic Information System, no?
  • Functions and Queries - If SMW is to provide the main functionalities of a web spatial database, it must provide, as far as possible, the main functions and operators that are expected to be found in a spatial database. In MediaWiki and SMW terminology we speak about Parser functions and Comparators. Parser functions, like functions in any other language, take input arguments and return a value computed from those arguments. They are used in the wikitext of any page to display some values derived from a property. Comparators are special functions that return boolean values and are used in semantic queries to discriminate entities based on their properties.

    There is a huge number of functions possible on complex georeferenced geometries. A spatially enabled SMW should first provide basic functions that could be useful in articles about geographic entities: area of the geometry, length of the geometry, type of the geometry, a geometry representing the envelope (or the extent) of the geometry, a geometry representing a buffer around the geometry or the centroid of the geometry.

    The most used operators are "intersects", which returns TRUE if a geometry touches or overlaps another one, and "within", which returns TRUE if one geometry is completely inside another one.
  • Support for various coordinate systems - Old geographic information systems did not have any support for spatial data georeferenced in different coordinate systems. Everything had to be projected in the same coordinate system BEFORE being integrated into the system. Then came more sophisticated systems reprojecting geographic features on the fly in different coordinate systems. Those systems require a way to describe coordinate systems and associate them with a unique identifier so that geometries in one system can be projected to another one. This unique identifier is stored in the geometry to refer to the coordinate system in which it is stored.

    There are many ways to describe coordinate systems (OGC WKT, Proj4, GML, ESRI) and many coordinate systems have a unique identifier (SRID) given by organizations like ESRI or the EPSG.

    Following the same principle described above for map styles, coordinate systems could be stored in SMW, each in a separate page of a special namespace, as a set of properties. One of those properties would be the SRID and this number could then be referenced in each geometry.

    SMW does not have to support multiple coordinate systems now. As in older GIS, it could at first only support representation of entities using the same coordinate system. The KML format, for instance, and hence Google Maps and Google Earth, only support geometries defined in the WGS 84 coordinate system. Sticking to one coordinate system avoids the problem of storing many of them in a new namespace. On-the-fly reprojection could be added later and a namespace with many predefined coordinate systems could be provided. An easy way to add coordinate systems should be described so that only the ones used by a specific application could be loaded in the namespace.
  • Support for raster data - A geographic information system or geodatabase is not complete until it provides support for raster datasets. Geometries and rasters are the two most fundamental ways to represent geographic phenomena: geometries (or vectors) for discontinuous phenomena like administrative limits, roads, hydrographic networks or populated places, and rasters for continuous phenomena like elevation or temperature. Most spatial databases provide support for raster in addition to vector and a plethora of functions to manipulate rasters and extract information from them based on geometries (e.g. what is the mean elevation for a set of villages?).

    It is however still hard to figure out how images could be used in an SMW installation. A typical SMW installation acts like a database storing information about entities having properties. A geometry is just one property of an entity among others (its geographical footprint). We can think of two cases where a raster acts as the property of an entity: 1) the raster is the entity itself (e.g. in a catalog of satellite or aerial images); 2) the entity is better represented by a raster than by a geometry (e.g. fuzzy objects like animal territories or soil deposits). In any case, a raster would be stored very much like a geometry, in a special format, held as a property. We are still far from needing raster support in a geospatially enabled SMW. After all, some geodatabase systems, like PostGIS, were used for almost a decade before providing support for raster and that did not prevent users from using them in thousands of applications.
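
To make the "Functions and Queries" item above a bit more concrete, here is a minimal Python sketch of the kind of values such parser functions and comparators would return: the area and the envelope of a polygon given as a WKT string, plus a cheap envelope-overlap test of the sort a spatial index uses as a first filter. It only handles a single-ring polygon in planar coordinates, and the function names are invented for the illustration. In a real installation these operations would be delegated to the database spatial extension (PostGIS, for instance, provides ST_Area, ST_Envelope, ST_Intersects and ST_Within).

```python
import re


def parse_polygon_wkt(wkt):
    """Parse a single-ring 'POLYGON((x y, x y, ...))' string into (x, y) tuples."""
    match = re.fullmatch(r"\s*POLYGON\s*\(\(([^()]*)\)\)\s*", wkt, re.IGNORECASE)
    if match is None:
        raise ValueError("this sketch only handles single-ring POLYGON WKT")
    return [tuple(float(n) for n in pair.split()) for pair in match.group(1).split(",")]


def area(ring):
    """Planar area of a closed ring (shoelace formula), in squared coordinate units."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def envelope(ring):
    """Bounding box of the ring as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def envelopes_intersect(a, b):
    """True if two bounding boxes overlap or touch (not an exact geometry test)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


ring = parse_polygon_wkt("POLYGON((0 0, 4 0, 4 3, 0 3, 0 0))")
print(area(ring))      # 12.0
print(envelope(ring))  # (0.0, 0.0, 4.0, 3.0)
```

The envelope test is only a coarse filter: two geometries whose envelopes overlap may still not intersect, which is one more reason to leave the exact test to the database.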

A Development Roadmap


So far we have defined what SMW needs to be turned into a full web geospatial database. Such a project has to be well planned and implemented gradually, in small, useful steps. What needs to be done? What are the priorities? Here is a sketch of what could become a development roadmap:
  1. Start from the Semantic Maps extension. Either extend it or fork it. The first task would be to make the "Geographic Coordinate" type be stored as a WKT string.
  2. Make the back-end code store "Geographic Coordinates" as a "geometry" type with the spatial extension when such an extension is available in the database.
  3. Rename "Geographic Coordinates" to "geometry" to conform with other spatial databases. Geographic coordinates would become geometries of type "point".
  4. Add the capacity to create lines and polygons by adding a "Special:ImportSHP" page to the Data Transfer extension to be able to import shapefiles as new SMW pages. Add the possibility to import only some rows and assign them to existing pages. Add the option to import other attributes as well. Store those geometries with the database spatial extension "geometry" type and make sure to be able to display them in the wiki pages. You can display all the vertex coordinates in the wikitext for now.
  5. For lines and polygons, stop showing the vertex coordinates in the wikitext. Show only an identifier representing the geometry stored in the DB. You could also hide the coordinates only when there are more than 10 or 20 of them. Show them for "point" geometries as they are easy to edit.
  6. Add an "intersects" operator (or comparator) so we can query for entities inside some geographical limits or not farther than some kilometers from another one. Such an operation should be delegated to the database spatial extension. Don’t reinvent the wheel...
  7. Reuse the Special:ImportSHP code to create a "Load Shape" form input in Semantic Forms so geometries can be imported, one at a time, when creating new pages with the Semantic Forms extension.
  8. Add a "Special:ImportKML" page and a matching form input. Importing from shapefiles might be sufficient for a while but people might want to import from KML files very soon.
  9. Make sure to be able to edit points, lines and polygons with the Semantic Maps MapEditor directly from a Semantic Forms form. Make sure it supports creating and editing multipolylines, multipolygons and holes. Allow it to display other surrounding geometries so one can "snap" to their vertexes or edges while editing.
  10. Define and create predefined MapStyle pages to be used as symbology aliases when mapping features.
  11. Provide standardized coordinate systems and a way to import some of them in an active wiki. Implement on-the-fly reprojection of entities defined in different coordinate systems when mapping them.
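
Steps 1 and 3 of the roadmap above boil down to treating a geographic coordinate as a WKT geometry of type "point". A minimal Python sketch of that conversion could look like the following; the function names are hypothetical and do not come from the Semantic Maps code base. Note that WKT puts x (longitude) before y (latitude), the reverse of the usual "latitude, longitude" order.

```python
import re


def point_to_wkt(latitude, longitude):
    """Serialize a geographic coordinate as a WKT point (x = longitude, y = latitude)."""
    return f"POINT({longitude} {latitude})"


def wkt_to_point(wkt):
    """Parse a 'POINT(x y)' string back into a (latitude, longitude) pair."""
    match = re.fullmatch(r"\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*", wkt, re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    longitude, latitude = float(match.group(1)), float(match.group(2))
    return latitude, longitude


wkt = point_to_wkt(46.8, -71.2)
print(wkt)                # POINT(-71.2 46.8)
print(wkt_to_point(wkt))  # (46.8, -71.2)
```

Once points are expressed this way, lines and polygons can reuse the same property type and the same storage path, which is what makes the later steps incremental.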

This is it! This should turn Semantic MediaWiki into one of the most flexible and accessible web geospatial application development platforms. That would take the development of rich geoweb applications out of the hands of web developers and put it in the hands of web and content designers. I hope some organizations will find this project attractive and judge it worth investing some developer time or money in. There are a number of companies specialized in MediaWiki consulting out there just ready to hear from you!

And, please, your comments are welcome!


4 comments:

  1. __Very__ interesting, and you've nailed the pros and cons.

    The way I have addressed the security issue is by running more than one wiki and skinning them so you can tell which one you are editing. One for intranet and one for public consumption. It takes 10 minutes to set up a new instance, and there is no license fee, so there is little penalty in having more than one wiki.

    Skins don't survive upgrades. Every time I upgrade mediawiki I have to edit my custom skin to get it functioning again. Just upgraded last week so for now wildsong.biz looks generic again.

  2. There are a few things I would like to comment on here:

    I don't think developing a skin is so difficult, I have made a few myself.

    There was a working patch by me to allow for editing geometric objects using Special:MapEditor on a form itself, where it also displayed the previous values. I am not sure if it ever made it in though.

    You are right about the possibilities of SMW in geo-storage. We started storing everything in the form of a string but it is easily possible to add some geospatial storage backend.

    In all this is a very nice view for the future of SMW. Is your presentation at SMWCon available online?

  3. @Nischay Yes it's entitled "Beyond points - How to turn SMW into a complete Geographic Information System".

    http://semantic-mediawiki.org/wiki/SMWCon_Spring_2013/Beyond_Semantic

  4. Very nice article. And I have the same opinion on the topic of "semantic". It seems that Semantic Web has become a rather negative buzzword, something about throwing money to the wind.
