The oldest Swedish university – a fact check with SPARQL and DBpedia

This morning I started to read a newspaper article about Gotland University College being merged into Uppsala University, but I stopped after the second sentence, which basically stated that Uppsala University was the oldest university in Sweden. I remembered a discussion between some friends of mine years ago in which they disagreed on which Swedish university was established first. I don’t quite remember which universities they talked about, but I think it was Lund and Uppsala.

Not remembering which conclusion they reached, I just wanted to do a quick fact check. The easiest and probably least time-consuming approach would have been to click through a couple of Wikipedia pages about Swedish universities, or to look at the Wikipedia list of universities in Sweden, which also happens to include the dates of establishment.

Well, googling and clicking on a few links is neither the most exciting thing to do on an average Thursday morning nor does it solve a general problem, so I decided to get an answer with a SPARQL query against DBpedia Live. As an absolute minimum I needed the names of all Swedish universities and their dates of establishment. A quick look at the DBpedia page about Uppsala University revealed which properties I could use, so I drafted a first query:

SELECT DISTINCT ?name ?established
WHERE {
  # Any resource located in Sweden that is typed as a university.
  [] dbpprop:country dbpedia:Sweden;
     rdf:type dbpedia-owl:University;
     dbpprop:nativeName ?name;
     dbpprop:established ?established.
}
ORDER BY ?established

(You can use prefix.cc to expand the namespaces in the query above.)
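If you prefer to expand the prefixes in code rather than by hand, the following is a minimal Python sketch. The namespace IRIs in the table are my assumption of what prefix.cc returns for the prefixes used in this post, and the helper name with_prefixes is made up for illustration.

```python
import re

# Assumed namespace IRIs for the prefixes used in this post.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpprop": "http://dbpedia.org/property/",
}


def with_prefixes(query: str) -> str:
    """Prepend a PREFIX line for every known prefix that occurs in the query."""
    used = [
        prefix
        for prefix in PREFIXES
        # The lookbehind keeps "owl:" or "pedia:" inside longer names from matching.
        if re.search(r"(?<![\w-])" + re.escape(prefix) + ":", query)
    ]
    if not used:
        return query
    header = "\n".join(f"PREFIX {p}: <{PREFIXES[p]}>" for p in used)
    return header + "\n" + query
```

Calling with_prefixes() on the query above should yield a version that also runs on endpoints that do not predefine these prefixes.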

Nice, this worked! But was Umeå University really established in the year 17? Let’s check what went wrong by taking a look at the DBpedia page of Umeå University. There are two values for “established”: one typed as xsd:integer (the wrong one) and one typed as xsd:date (the correct one). All other universities use integers. Well, there is not much I can do about this, except perhaps falling back to the dbpedia-owl:foundingDate property (which is given for Umeå but for no other university).
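The same clean-up could also be done on the client side after fetching the raw values. Here is a minimal Python sketch under a few assumptions: each value arrives as a (lexical value, datatype IRI) pair, dates are trusted more than integers, and integers up to a cut-off of 500 are treated as noise. The function name and the cut-off parameter are hypothetical, not part of DBpedia.

```python
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def established_year(values, min_year=500):
    """Pick a plausible founding year from (lexical value, datatype IRI) pairs.

    Dates win over integers; integers not greater than min_year are dropped.
    Returns None when nothing plausible is left.
    """
    years_from_dates = []
    years_from_ints = []
    for lexical, datatype in values:
        if datatype == XSD + "date":
            # Keep only YYYY-MM-DD so that a trailing timezone does not break parsing.
            years_from_dates.append(date.fromisoformat(lexical[:10]).year)
        elif datatype == XSD + "integer" and int(lexical) > min_year:
            years_from_ints.append(int(lexical))
    candidates = years_from_dates or years_from_ints
    return min(candidates) if candidates else None


# Illustrative input shaped like the Umeå case (not copied from DBpedia).
print(established_year([("17", XSD + "integer"), ("1965-09-17", XSD + "date")]))
```

The design choice here is to prefer the typed date whenever one exists and to use the integer only as a fallback, which mirrors the idea of falling back to another property when one value is clearly off.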

What about university colleges? They are not included in the query result because they are not of type dbpedia-owl:University. Let’s see how we can include them: perhaps rdf:type yago:UniversitiesAndCollegesInSweden? I dislike several things about this. Universities and colleges are lumped together in one class, which would be better handled by a UNION in the query itself. The class also encodes the country, but I already have a property stating the country in the query. The first objection also applies to rdf:type http://schema.org/CollegeOrUniversity (even though it is quite interesting to see that schema.org types are being used).

So let’s give rdf:type dbpedia-owl:EducationalInstitution a try; this should cover all organizations we want (given there is sufficient information in Wikipedia and the transfer into DBpedia went well). Let’s also add some more information, such as the website and the number of students, and fall back to the literal "Sweden" for the country (in addition to the resource dbpedia:Sweden). It seems that we sometimes get multiple values for the number of students; as a dirty hack we just average them with the avg() function in the first line of the query. We then end up with the following query:

SELECT DISTINCT str(?name) as ?name, ?url, xsd:integer(?established) as ?established, avg(?students) as ?students
WHERE {
  # Match the country either as a resource or as an English literal.
  { ?s dbpprop:country dbpedia:Sweden } UNION { ?s dbpprop:country "Sweden"@en }
  ?s dbpprop:nativeName ?name;
     dbpprop:established ?established;
     rdf:type dbpedia-owl:EducationalInstitution.
  # Student numbers and homepage are optional extras.
  OPTIONAL { ?s dbpedia-owl:numberOfStudents ?students }
  OPTIONAL { ?s foaf:homepage ?url }
  # Skip values typed as xsd:integer and anything not greater than 500.
  FILTER (datatype(?established) != xsd:integer && ?established > 500)
}
ORDER BY ?established

Done! You can take a look at the result yourself. (Side note: I used the datatype functions to make the output more pleasant for human eyes.)
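To fetch the result from a script instead of the browser, something like the following Python sketch should do. It assumes the public DBpedia endpoint at https://dbpedia.org/sparql (the DBpedia Live endpoint I used is not spelled out in this post) and relies on the standard SPARQL JSON results layout; the function names run_query and rows are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia endpoint, not necessarily the one used for this post.
ENDPOINT = "https://dbpedia.org/sparql"


def run_query(query, endpoint=ENDPOINT):
    """Send a SELECT query via HTTP GET and return the decoded JSON document."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rows(result):
    """Flatten SPARQL JSON results into dicts of variable name -> string value.

    Variables left unbound by an OPTIONAL clause are reported as None.
    """
    variables = result["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None for var in variables}
        for binding in result["results"]["bindings"]
    ]
```

With these two helpers, rows(run_query(query)) gives a list of plain dictionaries, one per institution, which is easy to print or sort further. The query string has to carry its PREFIX declarations unless the endpoint predefines them.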

Let’s double-check with the list of Swedish universities on Wikipedia to see whether we got the same result… there seems to be a difference between “established as a university” and “first establishment” which is not reflected by the data in DBpedia. Checking the information about Lund University tells us:

“The university […] traces its roots back to 1425 […] making it the oldest institution of higher education in Scandinavia followed by studium generales in Uppsala in 1477 and Copenhagen in 1479. The current university was founded in 1666.”

So, depending on your definition of “establishment” the winner is either Uppsala or Lund. You decide…

We saw that SPARQL queries on DBpedia are powerful and can be used for checking some simple facts, but it takes a few attempts to build a query that returns the results we need and want. This is mostly caused by the heterogeneity of the queried data (e.g. deviating properties such as the date of establishment vs. the founding date), even though it all originates from the same repository.

One could now of course continue, grab sgvizler and try to visualize a possible correlation between date of establishment and number of students on a scatter chart…