Analyzing oDesk Tests

Since May 2012, when www.odesk.com/oconomy went offline leaving only a web.archive.org snapshot, there has been no public source of information about the current state of the oDesk marketplace.

Indirect information about the number of contractors, average hourly rates across skills, and demand for those skills can now be drawn only from oDesk tests. It is spread across the pages that describe each test, so it has to be gathered in one place before it can be analyzed. Scraping by hand is tedious, even for a Pythonista armed with requests and html5lib.

Extracting test data with Scrapy

Scrapy is a framework intended to ease the implementation of web spiders. Extraction of oDesk tests fits into 3 modules:

  • items.py contains a declarative description of the data being extracted; it closely resembles models in Django
  • tests_spider.py extends Scrapy's spider implementation; it:
      • defines regex rules for extracting the URLs of pages to fetch and maps them to handlers
      • implements handlers, which extract data using XPath and assign it to objects from items.py
  • settings.py contains project settings, e.g. the average DOWNLOAD_DELAY between pages, the number of CONCURRENT_REQUESTS_PER_DOMAIN, etc.
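
A settings.py along these lines would throttle the crawl. This is only a minimal sketch: the bot name matches the one reported in the crawl log in this post, while the module path and all numeric values are assumptions chosen for illustration, not the values used for the actual run.

```python
# settings.py: illustrative values only, not the ones used for the actual crawl
BOT_NAME = 'otests'                  # bot name as reported in the crawl log

SPIDER_MODULES = ['otests.spiders']  # assumed project layout

DOWNLOAD_DELAY = 2                   # average delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True      # each wait varies between 0.5x and 1.5x of the delay
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time to the site
```

With RANDOMIZE_DOWNLOAD_DELAY enabled, DOWNLOAD_DELAY acts as an average rather than a fixed pause, which is gentler on the site being crawled.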

An example of running the tests spider, extracting the data and saving it to a CSV file is shown below:

$ scrapy crawl -o tests_apr7.csv -t csv tests
2013-04-07 23:48:13+0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: otests)
2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled item pipelines:
2013-04-07 23:48:13+0400 [tests] INFO: Spider opened
2013-04-07 23:48:13+0400 [tests] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-04-07 23:48:13+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-04-07 23:48:13+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
...
2013-04-08 00:20:47+0400 [tests] INFO: Closing spider (finished)
2013-04-08 00:20:47+0400 [tests] INFO: Stored csv feed (440 items) in: tests_apr7.csv
2013-04-08 00:20:47+0400 [tests] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 259496,
         'downloader/request_count': 888,
         'downloader/request_method_count/GET': 888,
         'downloader/response_bytes': 1672291,
         'downloader/response_count': 888,
         'downloader/response_status_count/200': 446,
         'downloader/response_status_count/301': 1,
         'downloader/response_status_count/302': 441,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 4, 7, 20, 20, 47, 939438),
         'item_scraped_count': 440,
         'log_count/DEBUG': 1334,
         'log_count/INFO': 37,
         'request_depth_max': 3,
         'response_received_count': 446,
         'scheduler/dequeued': 888,
         'scheduler/dequeued/memory': 888,
         'scheduler/enqueued': 888,
         'scheduler/enqueued/memory': 888,
         'start_time': datetime.datetime(2013, 4, 7, 19, 48, 13, 512624)}
2013-04-08 00:20:47+0400 [tests] INFO: Spider closed (finished)
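
The stats dump above also tells how long the crawl took and how polite it was. The short sketch below only redoes that arithmetic with the numbers copied from the log; the variable names are mine.

```python
from datetime import datetime

# Values copied from the stats dump above.
start_time = datetime(2013, 4, 7, 19, 48, 13, 512624)
finish_time = datetime(2013, 4, 7, 20, 20, 47, 939438)
request_count = 888
status_counts = {200: 446, 301: 1, 302: 441}

# Every request is accounted for by one of the three response statuses.
assert sum(status_counts.values()) == request_count

elapsed = finish_time - start_time
seconds_per_request = elapsed.total_seconds() / request_count

print(elapsed)                        # wall-clock duration of the crawl
print(round(seconds_per_request, 2))  # average seconds spent per request
```

The crawl ran for a little over 32 minutes, i.e. roughly 2.2 seconds per request. About half of the requests were 302 redirects, which is why 888 requests produced only 446 pages.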

The excerpt shows that there are just 440 tests on oDesk. Let's analyze them further.

Analyzing data with Pandas

Pandas is a large data-analysis library built on NumPy. If NumPy is called "Matlab in Python", then Pandas is "the R language in Python".

Let's run an interactive Python interpreter and load the data from the CSV file:

>>> import pandas as pd
>>> import numpy as np

>>> def default_value(typ, default, val):
...     try:
...         return typ(val)
...     except ValueError:
...         return default

>>> def maybe_int(val):
...     return default_value(np.int64, None, val.replace(',', ''))

>>> def maybe_float(val):
...     return default_value(np.float64, None, val)

>>> tests = pd.read_csv('tests_apr7.csv', thousands=',', converters={
...     'hourly_rate_max': maybe_float,
...     'hourly_rate_avg': maybe_float,
...     'percent_independent': maybe_float,
...     'average_qualifications': maybe_float,
...     'taken_test': maybe_int,
...     'passed_test': maybe_int,
...     'tests_taken': maybe_int,
... })
>>> tests
<class 'pandas.core.frame.DataFrame'>
Int64Index: 440 entries, 0 to 439
Data columns:
hourly_rate_max           432  non-null values
hourly_rate_avg           432  non-null values
percent_independent       440  non-null values
title                     440  non-null values
average_qualifications    440  non-null values
taken_test                440  non-null values
average_hours             435  non-null values
passed_test               440  non-null values
test_id                   440  non-null values
tests_taken               440  non-null values
dtypes: float64(5), int64(4), object(1)
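
The hand-written converters above turn unparseable cells into missing values. For comparison, here is a minimal sketch of the same idea using pd.to_numeric with errors='coerce', which is available in later pandas releases. The input strings are made up for illustration and do not come from the scraped file.

```python
import pandas as pd

# Made-up raw strings, standing in for values scraped from a test page.
raw = pd.Series(['1,234', '56', 'n/a', ''])

# Strip thousands separators, then coerce; unparseable entries become NaN.
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')

print(cleaned)
print(cleaned.notna().sum())  # number of values that parsed successfully
```

This is an alternative to the converters, not what the session above used; it cleans a column after loading instead of while parsing.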

Now some interesting statistics can be derived.

10 tests with most contractors

Guess which test is the most popular.

>>> tests.sort_index(
...     by=['passed_test'], ascending=False
... ).ix[
...     :, ['test_id', 'title', 'passed_test']
... ][:10]


     test_id                                              title  passed_test
0        752  oDesk Readiness Test for Independent Contracto...       743081
439      511                     U.S. English Basic Skills Test       345213
438      688               English Spelling Test (U.S. Version)       269360
435      545                                 Office Skills Test       114577
436      584                                    Windows XP Test       104314
434      693             English Vocabulary Test (U.S. Version)        88943
437      753        oDesk Readiness Test for Agency Contractors        84282
433      506                      Email Etiquette Certification        60019
429      571                  Telephone Etiquette Certification        48861
428      484                            Call Center Skills Test        44063
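
A note for readers on a current pandas: sort_index(by=...) and the .ix indexer used above were later removed from the library. A rough modern equivalent uses sort_values and .loc. The sketch below runs on three rows copied from the output above rather than on the full CSV, so it is an illustration of the calls, not a reproduction of the session.

```python
import pandas as pd

# Three rows copied from the table above, deliberately out of order.
tests = pd.DataFrame({
    'test_id': [511, 752, 688],
    'title': [
        'U.S. English Basic Skills Test',
        'oDesk Readiness Test for Independent Contracto...',
        'English Spelling Test (U.S. Version)',
    ],
    'passed_test': [345213, 743081, 269360],
})

# Sort by number of contractors who passed, keep three columns, take the top 10.
top = tests.sort_values(by='passed_test', ascending=False).loc[
    :, ['test_id', 'title', 'passed_test']
].head(10)
print(top)
```

The same substitution applies to the other rankings in this post.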

See also the oDesk Knowledgebase article "What is the oDesk Readiness Test?"

10 tests with highest average hourly rates

This shows a very interesting correlation between the number of contractors and the average hourly rate.

>>> tests.sort_index(
...     by=['hourly_rate_avg'], ascending=False
... ).ix[
...     :, ['title', 'hourly_rate_avg', 'passed_test']
... ][:10]

                                                 title  hourly_rate_avg passed_test
14   VB.NET Programming Skills Test (Hands-on progr...            49.50           5
253                            Adobe FrameMaker 8 Test            47.75          36
131  Design Considerations for Mobile Web Applicati...            36.19          58
166                                          VLSI Test            34.00          48
143                           Checkpoint Security Test            29.00          68
248                                           RDF Test            28.50          26
240              Knowledge of ColdFusion 9 Skills Test            28.49          50
29                                     PostgreSQL Test            28.10         199
266                                  Web Services Test            27.95         301
53            Cocoa programming for Mac OS X 10.5 Test            27.55         567
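
To put a number on that relationship, one can compute a correlation coefficient. The sketch below does it with Series.corr on just the ten rows shown above. That is a small, hand-picked top-of-the-ranking sample, so the result is only indicative; on the full data the same call would be made on the tests frame.

```python
import pandas as pd

# The ten rows shown above (titles omitted).
top_rates = pd.DataFrame({
    'hourly_rate_avg': [49.50, 47.75, 36.19, 34.00, 29.00,
                        28.50, 28.49, 28.10, 27.95, 27.55],
    'passed_test': [5, 36, 58, 48, 68, 26, 50, 199, 301, 567],
})

# Pearson correlation between the average rate and the number of passers.
r = top_rates['hourly_rate_avg'].corr(top_rates['passed_test'])
print(round(r, 2))
```

A negative value means that, within this sample, tests with fewer certified contractors tend to go with higher average rates.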

10 tests with most worked hours

This can be used to estimate a lower bound on the total hours worked and the money earned up to Apr 8, 2013.

>>> tests['total_hours'] = tests['passed_test'] * tests['average_hours']
>>> tests['total_earnings'] = tests['total_hours'] * tests['hourly_rate_avg']
>>> tests[tests['total_hours'] > 0].sort_index(
...     by=['total_hours'], ascending=False
... ).ix[
...     :, ['title', 'total_hours', 'total_earnings']
... ][:10]
                                                 title  total_hours  total_earnings
0    oDesk Readiness Test for Independent Contracto...    308378615    2.692145e+09
439                     U.S. English Basic Skills Test    196080984    1.815710e+09
438               English Spelling Test (U.S. Version)    144646320    9.922738e+08
435                                 Office Skills Test     70808586    4.935358e+08
436                                    Windows XP Test     61023690    5.339573e+08
434             English Vocabulary Test (U.S. Version)     52120598    3.841288e+08
437        oDesk Readiness Test for Agency Contractors     49642098    2.765065e+08
433                      Email Etiquette Certification     44474079    3.740270e+08
429                  Telephone Etiquette Certification     36499167    2.901684e+08
428                            Call Center Skills Test     33884447    2.232985e+08
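
As a sanity check, the first two rows can be reproduced by hand. In the sketch below passed_test is copied from the tables in this post, while average_hours and hourly_rate_avg are not printed anywhere above: I back-calculated them from total_hours and total_earnings, so treat them as assumptions (the rate in particular is rounded).

```python
import pandas as pd

# passed_test is copied from the post; average_hours and hourly_rate_avg
# are NOT shown in the post and are back-calculated from the totals above
# (the hourly rate is approximate).
sample = pd.DataFrame({
    'title': ['U.S. English Basic Skills Test',
              'oDesk Readiness Test for Independent Contracto...'],
    'passed_test': [345213, 743081],
    'average_hours': [568, 415],
    'hourly_rate_avg': [9.26, 8.73],
})

# Same two derived columns as in the session above.
sample['total_hours'] = sample['passed_test'] * sample['average_hours']
sample['total_earnings'] = sample['total_hours'] * sample['hourly_rate_avg']

ranked = sample.sort_values(by='total_hours', ascending=False).loc[
    :, ['title', 'total_hours', 'total_earnings']
]
print(ranked)
```

The total_hours values match the table exactly, and total_earnings agrees to the precision printed there.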
