Because the database on the demo site is not up to date, a number of apps are not recorded in our system. You can still check their info at http://97.64.46.11/AppStore_Crawler_Demo/appInfo.html?{appname}; just remember to substitute {appname} with the FULL name of your target app.
The goal is to crawl the information of ALL the apps in the App Store. For the current requirement, I crawled iPhone apps in the App Store of China.
To do this, I wrote a Python crawler. The crawl proceeds in the following steps:
Get all categories and genres of a certain App Store (the iPhone App Store of China, in the current case). This information is provided at this link: https://affiliate.itunes.apple.com/resources/documentation/genre-mapping/, but note that the API provided at that link returns information for the USA's App Store by default. I have asked Apple about this but have not received an answer so far. Requesting the API at that link returns JSON data, which looks like the following:
{
category_id:{
name:
id:
url:
rssUrls: {'': ''}
chartUrls: {'': ''}
subgenres: {'': {
name:
id:
url:
rssUrls: {'': ''}
chartUrls: {'': ''}
# and for some cases, Games for example, there is still a subgenres
}}
}
}
For example:
{
32:{
name: TV Shows,
id: 32
url: https://itunes.apple.com/us/genre/tv-shows/id32
rssUrls: {
'topTvEpisodes':
'topTvEpisodeRentals':
'topTvSeasons':
}
chartUrls: {
'tvEpisodeRentals':
'tvSeasons':
'tvEpisodes':
}
subgenres: {
'4003':{}
'4004':{}
...
}
}
...
}
You can also get a real example from here. It describes all categories in the iTunes Store; the App Store and the Mac App Store are parts of the iTunes Store. Our target category here is "App Store", and its id is 36.
Using the cgInfo obtained in the first step, we can get the genres of a certain category. Note that some genres may have subgenres (these are not handled yet).
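As a rough illustration of this step, the sketch below walks a dictionary shaped like the genre mapping shown earlier and lists every genre under one category, recursing into nested subgenres. The helper name `iter_genres` and the trimmed-down `cg_info` sample are assumptions made for this example; they are not taken from the crawler's source.

```python
from typing import Iterator, Tuple


def iter_genres(node: dict, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, genre_id, name) for every entry under node['subgenres']."""
    for genre_id, genre in node.get("subgenres", {}).items():
        yield depth, str(genre.get("id", genre_id)), genre.get("name", "")
        # Some genres (Games, for example) carry their own subgenres.
        yield from iter_genres(genre, depth + 1)


# Hypothetical, trimmed-down sample in the shape of the genre mapping.
cg_info = {
    "36": {
        "name": "App Store",
        "id": "36",
        "subgenres": {
            "6005": {"name": "Social Networking", "id": "6005"},
            "6014": {
                "name": "Games",
                "id": "6014",
                "subgenres": {"7001": {"name": "Action", "id": "7001"}},
            },
        },
    }
}

# "36" is the App Store category mentioned in the text.
genres = list(iter_genres(cg_info["36"]))
for depth, genre_id, name in genres:
    print("  " * depth + f"{genre_id}: {name}")
```

Because the walk is recursive, it would also cover the nested subgenres that the crawler does not handle yet.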
We can get apps sorted alphabetically at the following link: https://itunes.apple.com/cn/genre/id6005?mt=8&letter=A&page=1 . There are four parts of this link we need to pay attention to.
- cn: This is a two-letter country code. You can find more details about that here
- id6005: The id of a genre in the App Store category; 6005 means Social Networking.
- letter=A: This parameter sets the initial letter of the app name. The possible values are: [A-Z, *]
- page=1: The page number of the listing. The number of pages is not known in advance while crawling.
All four parts can be considered essential parameters of our crawler. I crawl the apps from the outer categories into the inner pages ( [1] > [2] > [3] > [4] ).
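The four parameters can be combined into a listing URL with a small helper. This is only a sketch: the names `build_listing_url`, `first_pages`, and `LETTERS` are invented for illustration, and the URL template is copied from the example link above.

```python
import string

# Template copied from the example link; only the four parts vary.
LISTING_URL = (
    "https://itunes.apple.com/{country}/genre/id{genre_id}"
    "?mt=8&letter={letter}&page={page}"
)

# Possible initial letters: A-Z plus "*".
LETTERS = list(string.ascii_uppercase) + ["*"]


def build_listing_url(country: str, genre_id: int, letter: str, page: int) -> str:
    """Build the alphabetical listing URL for one (country, genre, letter, page)."""
    if letter not in LETTERS:
        raise ValueError(f"unsupported letter: {letter!r}")
    if page < 1:
        raise ValueError("page numbers start at 1")
    return LISTING_URL.format(
        country=country, genre_id=genre_id, letter=letter, page=page
    )


def first_pages(country: str, genre_id: int) -> list:
    """Return the first-page URL for every initial letter of one genre."""
    return [build_listing_url(country, genre_id, letter, 1) for letter in LETTERS]


print(build_listing_url("cn", 6005, "A", 1))
```

Since the last page is not known in advance, a crawler built on this would keep incrementing `page` until a listing comes back with no apps.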
At this point we have a series of links that contain plenty of app information, and the next task is to parse those links. This is a simple task, so it is not described in detail here. Once you have the information you need, you can use it as you wish.
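For readers who want a starting point for the parsing step, here is a minimal sketch using only the standard library. It assumes that each app on a listing page appears as an `<a>` element whose `href` contains `/app/`; that page structure is an assumption and should be checked against the real HTML. The class `AppLinkParser`, the function `extract_apps`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class AppLinkParser(HTMLParser):
    """Collect (app name, url) pairs from anchors that look like app links."""

    def __init__(self):
        super().__init__()
        self.apps = []
        self._href = None   # href of the app anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/app/" in href:  # assumed marker of an app link
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.apps.append((name, self._href))
            self._href = None


def extract_apps(html: str) -> list:
    parser = AppLinkParser()
    parser.feed(html)
    parser.close()
    return parser.apps


# Made-up markup standing in for a downloaded listing page.
sample = """
<ul>
  <li><a href="https://itunes.apple.com/cn/app/example-one/id111?mt=8">Example One</a></li>
  <li><a href="https://itunes.apple.com/cn/genre/id6005">Social Networking</a></li>
</ul>
"""
print(extract_apps(sample))
```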
Single thread version: here
Multi-thread version: here
A special JSON-format file is used to implement synchronization between threads, like the following:
crawl_dict = {
genre: {
'A': {
current_index: num
done: False # done, when crawl finished
}
'B': {
}
...
}
}
This crawl_dict can also serve as a progress recorder. If your program terminates for some reason, you can record the current state of the crawl by saving this crawl_dict. Note that this functionality has not been tested yet.
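One possible way to share such a crawl_dict between threads and persist it is sketched below. It is not the multi-thread version linked above: the `CrawlState` class, its method names, and the choice to treat `current_index` as "the next page to fetch" are assumptions made for this example.

```python
import json
import string
import threading

LETTERS = list(string.ascii_uppercase) + ["*"]


def new_crawl_dict(genre_ids):
    """Create a fresh crawl_dict: one slot per (genre, letter)."""
    return {
        str(genre_id): {
            letter: {"current_index": 1, "done": False} for letter in LETTERS
        }
        for genre_id in genre_ids
    }


class CrawlState:
    """Thread-safe wrapper around crawl_dict (assumed design, not the original)."""

    def __init__(self, crawl_dict):
        self.crawl_dict = crawl_dict
        self._lock = threading.Lock()

    def claim_page(self):
        """Return (genre, letter, page) for the next unfinished slot, or None."""
        with self._lock:
            for genre, letters in self.crawl_dict.items():
                for letter, slot in letters.items():
                    if not slot["done"]:
                        page = slot["current_index"]
                        slot["current_index"] += 1
                        return genre, letter, page
            return None

    def mark_done(self, genre, letter):
        """Call when a page for this (genre, letter) comes back empty."""
        with self._lock:
            self.crawl_dict[genre][letter]["done"] = True

    def save(self, path):
        """Write the current progress to a JSON file."""
        with self._lock, open(path, "w", encoding="utf-8") as fh:
            json.dump(self.crawl_dict, fh)

    @classmethod
    def load(cls, path):
        """Resume from a previously saved JSON file."""
        with open(path, encoding="utf-8") as fh:
            return cls(json.load(fh))
```

Each worker thread would loop on `claim_page()`, fetch that page, and call `mark_done()` when the page has no apps; calling `save()` periodically keeps the file usable for resuming after a crash.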
Now we have the names of all apps, so the next step is to collect the information of each app and save it.
How to Search?
Apple offers an API for searching items in the iTunes Store, called the iTunes Search API. According to the documentation of this API, our request URL should be something like http://itunes.apple.com/search?term=Google&country=cn&entity=software&limit=1 if we want to search for an app named Google in the App Store of China and we only want 1 result. The result is returned in the following format:
{
"resultCount": number
"results": [
{
screenshotUrls: []
ipadScreenshotUrls: []
appletvScreenshotUrls: []
artworkUrl60: str
artworkUrl512: str
artworkUrl100: str
artistViewUrl: str
supportedDevices: []
isGameCenterEnabled: bool
advisories: []
kind: str
features: []
averageUserRatingForCurrentVersion: float
trackCensoredName: str
languageCodesISO2A: []
fileSizeBytes: str
sellerUrl: str
contentAdvisoryRating: str
userRatingCountForCurrentVersion: int
trackViewUrl: str
trackContentRating: str
minimumOsVersion: str
currentVersionReleaseDate: str
releaseNotes: str
sellerName: str
trackId: int
trackName: str
formattedPrice: str
releaseDate: str
primaryGenreName: str
primaryGenreId: int
isVppDeviceBasedLicensingEnabled: bool
currency: str
wrapperType: str
version: str
description: str
artistId: int
artistName: str
genres: []
price: float
bundleId: str
genreIds: []
averageUserRating: float
userRatingCount: int
}
...
]
}
And here comes the real example.
Just pick the fields you are interested in and run the searches in batch.
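The search-and-pick step might look roughly like the sketch below. The query parameters (`term`, `country`, `entity`, `limit`) and the field names come from the request URL and result format shown above; the helper names, the selection in `FIELDS`, and the use of `https` are choices made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://itunes.apple.com/search"

# Example selection; any keys from the result format above can be used.
FIELDS = ("trackId", "trackName", "sellerName", "price", "averageUserRating")


def build_search_url(term: str, country: str = "cn", limit: int = 1) -> str:
    """Build a Search API URL for software matching `term`."""
    query = urlencode(
        {"term": term, "country": country, "entity": "software", "limit": limit}
    )
    return f"{SEARCH_URL}?{query}"


def pick_fields(response: dict, fields=FIELDS) -> list:
    """Keep only the selected keys of each result; missing keys become None."""
    return [
        {key: item.get(key) for key in fields}
        for item in response.get("results", [])
    ]


def search(term: str, **kwargs) -> list:
    """Query the API and return the trimmed results (performs a network call)."""
    with urlopen(build_search_url(term, **kwargs), timeout=10) as resp:
        return pick_fields(json.load(resp))


print(build_search_url("Google"))
```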
Note that there is a request frequency limit of 20 calls per minute. To get around this limit, consider the Enterprise Partner Feed (EPF).
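If you stay with the Search API, a client-side throttle is one way to remain under the limit. The sketch below is a simple sliding-window limiter; the `RateLimiter` class is hypothetical, and the injectable `clock` and `sleep` arguments exist only to make it easy to test. It is not thread-safe as written.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls=20, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)


limiter = RateLimiter(max_calls=20, period=60.0)
# Call limiter.wait() immediately before each Search API request.
```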