Methodology
Organisation Data
The organisations included in the UK WAIfinder tool come from two main sources: organisations researching AI come from Gateway to Research, and AI companies and incubators come from a proprietary dataset from Glass.ai. We also find organisations that fund AI companies using Crunchbase. For legal reasons we don’t expose information about these funder organisations directly from the Crunchbase dataset; instead we find this information using the Bing search API.
Gateway to Research
The Gateway to Research (GtR) data comes via Nesta’s SQL database.
The first step is searching for projects with certain topic tags that we felt were relevant to AI, e.g. “Image & Vision Computing”, “Robotics & Autonomy” and “Artificial Intelligence”. The organisations these projects took place at were collated together and filtered with the following criteria:
- The organisation is in a predefined list of organisations, which is a combination of universities listed by HESA, the list of research institutes in the UKRI eligibility list and a list of research and technology organisations (RTOs) given to us by UKRI.
- The organisation received any amount of funding in the last 5 years
- The organisation has at least 400 projects OR it has had a total of at least £50 million in funding
- The organisation is in the UK
- The organisation has longitude/latitude data
This leaves us with research organisations that are large, relevant, and recent.
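The criteria above can be sketched as a single predicate applied to each organisation record. This is a minimal illustration rather than the code used for the tool: the field names (`last_funding_year`, `n_projects`, `total_funding_gbp`, `country`, `lat`, `lon`), the example records and the year-based reading of "the last 5 years" are all assumptions.

```python
from datetime import date

# Thresholds taken from the criteria listed above.
MIN_PROJECTS = 400
MIN_TOTAL_FUNDING_GBP = 50_000_000
RECENT_YEARS = 5


def is_key_research_org(org, eligible_names, today):
    """Return True if an organisation record passes all five filters.

    `org` is a dict with hypothetical field names; the real GtR schema
    may differ.
    """
    cutoff_year = today.year - RECENT_YEARS
    checks = [
        # In the predefined list (HESA universities, UKRI institutes, RTOs).
        org["name"] in eligible_names,
        # Received funding recently (approximated here by calendar year).
        org["last_funding_year"] is not None
        and org["last_funding_year"] >= cutoff_year,
        # Large: at least 400 projects OR at least £50 million in funding.
        org["n_projects"] >= MIN_PROJECTS
        or org["total_funding_gbp"] >= MIN_TOTAL_FUNDING_GBP,
        # Located in the UK.
        org["country"] == "United Kingdom",
        # Has coordinates so it can be placed on the map.
        org["lat"] is not None and org["lon"] is not None,
    ]
    return all(checks)


# Made-up records, for illustration only.
orgs = [
    {"name": "Example University", "last_funding_year": 2019, "n_projects": 650,
     "total_funding_gbp": 20_000_000, "country": "United Kingdom",
     "lat": 51.75, "lon": -1.25},
    {"name": "Small Example Institute", "last_funding_year": 2019, "n_projects": 12,
     "total_funding_gbp": 1_000_000, "country": "United Kingdom",
     "lat": 52.2, "lon": 0.12},
    {"name": "Example RTO", "last_funding_year": 2018, "n_projects": 90,
     "total_funding_gbp": 75_000_000, "country": "United Kingdom",
     "lat": 53.4, "lon": -1.5},
]
eligible_names = {o["name"] for o in orgs}

kept = [o["name"] for o in orgs
        if is_key_research_org(o, eligible_names, date(2020, 6, 1))]
```

In this example the small institute is dropped because it meets neither size threshold, while the RTO is kept on funding alone despite having fewer than 400 projects.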
To supplement this data with URLs and organisation descriptions, we use the Bing search API.
Crunchbase
We query the Crunchbase database via Nesta’s SQL server. This data is used to find the investors of AI organisations.
We first find the organisations that are tagged with topics we felt were relevant to AI (e.g. “artificial intelligence”, “augmented reality”, “autonomous vehicles”). We then find all investors of these organisations, where each investor may have funded multiple AI organisations, and each AI organisation may have been funded by multiple investors. Thus, for each investor we have:
- The number of AI organisations they have funded
- The number of total organisations they have funded
We get the longitude/latitude data (which Crunchbase doesn’t have) for these investors using the NSPL postcode lookup.
We filter this data to only include key AI investors with the following criteria:
- At least 10% of the organisations they fund are AI organisations
- They have funded at least 10 organisations
- The investor’s address is in the UK
- The “type” field for this investor is “organisation” (not “person”)
- The investor has longitude/latitude data
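As a rough sketch of how the two counts and the filter above fit together, the example below builds the per-investor counts from (investor, organisation) pairs and then applies the five criteria. The identifiers, field names and the `"GBR"` country code are hypothetical and chosen for illustration; they are not taken from the Crunchbase schema.

```python
from collections import defaultdict

# Thresholds taken from the criteria listed above.
MIN_AI_SHARE = 0.10
MIN_FUNDED_ORGS = 10


def count_funded(investments, ai_org_ids):
    """Count total and AI organisations funded by each investor.

    `investments` is an iterable of (investor_id, org_id) pairs; an
    investor can appear with many organisations and vice versa.
    """
    funded = defaultdict(set)
    for investor_id, org_id in investments:
        funded[investor_id].add(org_id)
    return {
        investor_id: {"n_total": len(orgs), "n_ai": len(orgs & ai_org_ids)}
        for investor_id, orgs in funded.items()
    }


def is_key_ai_investor(investor, counts):
    """Apply the five filter criteria to one investor record."""
    c = counts.get(investor["id"])
    if c is None:
        return False
    return (
        c["n_total"] >= MIN_FUNDED_ORGS
        and c["n_ai"] / c["n_total"] >= MIN_AI_SHARE
        and investor["country_code"] == "GBR"   # assumed encoding of "in the UK"
        and investor["type"] == "organisation"  # excludes "person" investors
        and investor["lat"] is not None
        and investor["lon"] is not None
    )


# Made-up data: inv_a and inv_p each fund ten organisations, inv_c funds three.
investments = (
    [("inv_a", f"org_{i}") for i in range(10)]
    + [("inv_p", f"org_{i}") for i in range(10)]
    + [("inv_c", f"org_{i}") for i in range(3)]
)
ai_org_ids = {"org_0", "org_1"}
investors = [
    {"id": "inv_a", "type": "organisation", "country_code": "GBR", "lat": 55.95, "lon": -3.19},
    {"id": "inv_p", "type": "person", "country_code": "GBR", "lat": 51.5, "lon": -0.12},
    {"id": "inv_c", "type": "organisation", "country_code": "GBR", "lat": 53.48, "lon": -2.24},
]

counts = count_funded(investments, ai_org_ids)
key_investors = [inv["id"] for inv in investors if is_key_ai_investor(inv, counts)]
```

Here only `inv_a` survives: `inv_p` is a person, and `inv_c` has funded too few organisations even though most of them are AI organisations.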
We then create our funders dataset by using the remaining funder organisation names to query the Bing search API to find their URLs and company descriptions.
Glass.ai
Our data for companies and incubators/accelerators comes from Glass.ai. Through a process of scraping companies’ websites and searching for AI-related keywords in the company descriptions, Glass.ai provided us with a list of organisations.
If a company is also an incubator/accelerator then this is tagged as such in an ‘is_incubator’ field.
We get the longitude/latitude data (which Glass.ai didn’t provide us with) for these companies using the NSPL postcode lookup.
The only filtering needed for this dataset was:
- The company has longitude/latitude data
Merging datasets
The three filtered datasets are concatenated together, then organisation names are cleaned in order to merge organisations that appear in more than one of the original datasets. For example, the company CodeBase is in both the Company and the Funder categories.
If an organisation is duplicated, we decide which row to keep based on the following order of trust (useful if there are conflicting Links or latitude/longitude values):
- Trust Glass.ai first - since several sources were considered to find Links and Lat/Long,
- then trust GtR - since Lat/Long was given in this data,
- lastly trust Crunchbase
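The priority rule above can be illustrated with a short sketch. The name-cleaning rules, the field names and the example records are assumptions made for this illustration, and keeping every category an organisation appeared under is one plausible way of handling organisations such as CodeBase that belong to two categories; the actual cleaning used for the tool may differ.

```python
import re

# Lower number = more trusted, following the order given above.
SOURCE_PRIORITY = {"glass_ai": 0, "gtr": 1, "crunchbase": 2}


def clean_name(name):
    """Normalise a name so small spelling differences still match.

    The rules here (lowercasing, dropping punctuation and a few legal
    suffixes) are illustrative assumptions.
    """
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    name = re.sub(r"\b(ltd|limited|plc|llp)\b", " ", name)
    return " ".join(name.split())


def merge_records(records):
    """Keep one row per cleaned name, taken from the most trusted source."""
    groups = {}
    for rec in records:
        groups.setdefault(clean_name(rec["name"]), []).append(rec)

    merged = []
    for recs in groups.values():
        best = min(recs, key=lambda r: SOURCE_PRIORITY[r["source"]])
        merged.append({
            "name": best["name"],
            "link": best["link"],
            "lat": best["lat"],
            "lon": best["lon"],
            "source": best["source"],
            # Keep every category the organisation appeared under.
            "categories": sorted({r["category"] for r in recs}),
        })
    return merged


# Made-up records with placeholder links and coordinates.
records = [
    {"name": "Example Labs", "source": "crunchbase", "category": "Funder",
     "link": "https://example.org/from-crunchbase", "lat": 1.0, "lon": 1.0},
    {"name": "Example Labs Ltd.", "source": "glass_ai", "category": "Company",
     "link": "https://example.org/from-glass", "lat": 2.0, "lon": 2.0},
    {"name": "Example University", "source": "gtr", "category": "University / RTO",
     "link": "https://example.org/university", "lat": 3.0, "lon": 3.0},
]
merged = merge_records(records)
```

The two "Example Labs" rows collapse into one, with the link and coordinates taken from the Glass.ai row and both categories retained.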
Merged dataset outputs:
Category | Number of organisations |
---|---|
Company category | 2785 |
Funder category | 290 |
Incubator / accelerator category | 74 |
University / RTO category | 152 |
Total deduplicated | 3319 |
Adding place information
We add the ‘Place’ field to any data points that don’t have it by using the postcode or longitude/latitude data. We do this using two methods:
- Query the postcode to get the city using the pgeocode Python package. We found this data source to be quite unreliable (e.g. Dulwich came up as the city) and there can be multiple city names given for the same postcode beginning. Thus, we only used it if the city given was in a list of major cities (London, Manchester etc). We keep this step since it is quite fast, so it can be used to quickly get the low-hanging fruit.
- Query the longitude/latitude coordinates to get the city/town using the geopy Python package. This takes longer and provides us with city, town and village names.
Then we finalise this ‘Place’ field for an organisation using the following method:
- Use the city name found from the original data or the pgeocode package if it’s in a predefined list of 4276 cities from the UK (from Nesta’s “geographic_data” SQL table).
- If this isn’t in the list, use the city from the geopy package (as long as it’s in the list). If this isn’t possible, use the other geopy outputs: see if the town name is in the predefined list of UK cities, then the suburb name, village name, county name, and finally neighbourhood name. For example, one data point had the city given as ‘Vale of White Horse’, which wasn’t in the predefined list of cities, but the suburb field “Botley” was.
- If no place names from any data sources are found in the predefined city list, then repeat steps 1 and 2 without requiring the place name to be in the predefined list.
Some cleaning of the place name fields is also included (e.g. convert “London Borough of Camden” to “London”).
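The fallback order described above can be sketched as follows. The function and field names are hypothetical, the geocoder output is represented as a plain dictionary with `city`, `town`, `suburb`, `village`, `county` and `neighbourhood` keys, and applying the cleaning step before the list lookup is an assumption about ordering; no geocoding service is called here.

```python
# Order in which geocoder address fields are tried, as described above.
GEOPY_FALLBACK_ORDER = ["city", "town", "suburb", "village", "county", "neighbourhood"]


def clean_place(name):
    """Example cleaning rule: collapse London boroughs into 'London'."""
    if name.startswith("London Borough of"):
        return "London"
    return name


def choose_place(original_city, pgeocode_city, geopy_address, known_places):
    """Pick the 'Place' value for one organisation.

    Candidates are tried in priority order. The first pass only accepts
    names in the predefined list; the second pass accepts anything.
    """
    candidates = [original_city, pgeocode_city]
    candidates += [geopy_address.get(key) for key in GEOPY_FALLBACK_ORDER]
    candidates = [clean_place(c) for c in candidates if c]

    for candidate in candidates:
        if candidate in known_places:
            return candidate
    # Nothing matched the predefined list: fall back to the first name found.
    return candidates[0] if candidates else None


# A tiny stand-in for the predefined list of UK cities.
known_places = {"London", "Oxford", "Botley"}

botley = choose_place(
    None, None,
    {"city": "Vale of White Horse", "suburb": "Botley"},
    known_places,
)
camden = choose_place(None, None, {"city": "London Borough of Camden"}, known_places)
unlisted = choose_place(None, None, {"village": "Unlisted Village"}, known_places)
```

With these inputs `botley` resolves to "Botley" via the suburb field, `camden` resolves to "London" after cleaning, and `unlisted` falls through to the second pass and keeps the village name.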
For each unique place name we find, we add NUTS data using the nuts-finder Python package and calculate the average lat/long coordinate from all the organisations in this place.
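The averaging step is a simple mean of coordinates per place, which might look like the sketch below. The record fields (`place`, `lat`, `lon`) and the example values are assumptions, and the NUTS lookup is left out because it depends on external boundary data.

```python
from collections import defaultdict


def place_centroids(orgs):
    """Average the lat/long of all organisations sharing a place name."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # lat sum, lon sum, count
    for org in orgs:
        acc = sums[org["place"]]
        acc[0] += org["lat"]
        acc[1] += org["lon"]
        acc[2] += 1
    return {
        place: (lat_sum / n, lon_sum / n)
        for place, (lat_sum, lon_sum, n) in sums.items()
    }


# Made-up organisations with placeholder coordinates.
example_orgs = [
    {"name": "Org A", "place": "Place X", "lat": 51.0, "lon": -1.0},
    {"name": "Org B", "place": "Place X", "lat": 52.0, "lon": 0.0},
    {"name": "Org C", "place": "Place Y", "lat": 55.0, "lon": -3.0},
]
centroids = place_centroids(example_orgs)
```

A plain arithmetic mean is a reasonable approximation at the scale of a single town or city, where the curvature of the Earth has little effect.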
The 3319 unique organisations in the map are located in 422 unique places, with the most common location being London.